Security and confidentiality approach for the Clinical E-Science Framework (CLEF)

D Kalra (1), P Singleton (1,4), D Ingram (1), J Milan (2), J MacKay (3), D Detmer (4), A Rector (5)

(1) Centre for Health Informatics and Multiprofessional Education (CHIME),
    University College London, Holborn Union Building, Highgate Hill, London N19 5LW
(2) Royal Marsden NHS Trust
(3) The Genetics Unit, Institute of Child Health, University College London
(4) Judge Institute, University of Cambridge
(5) Department of Computer Science, University of Manchester


          CLEF is an MRC sponsored project in the E-Science programme that aims to
          establish policies and infrastructure for the next generation of integrated clinical and
          bioscience research. One of the major goals of the project is to provide a
          pseudonymised repository of histories of cancer patients that can be accessed by
          researchers. Robust mechanisms and policies are needed to ensure that patient
          privacy and confidentiality are preserved while delivering a repository of such
          medically rich information for the purposes of scientific research. This paper
          summarises the overall approach adopted by CLEF to meet data protection
          requirements, including the data flows and pseudonymisation mechanisms that are
          currently being developed. Intended constraints and monitoring policies that will
          apply to research interrogation of the repository are also outlined. Once evaluated, it
          is hoped that the CLEF approach can serve as a model for other distributed
          electronic health record repositories to be accessed for research.

Background: The CLEF Project

CLEF is an MRC sponsored project in the E-Science programme. It aims to establish
policies and infrastructure for the next generation of integrated clinical and bioscience
research. The project’s core aims are:
1.  to develop novel technology and software tools to analyse patient records. Language
    tools have been identified as a key technology in two areas:
    a)  to enable information to be extracted from the text in clinical narratives; and
    b)  to assist in removing residual potentially identifying information from clinical
        narratives;
2.  to establish best practice in the pseudonymisation of clinical records, and the
    development of systematic methods and tools to do this on a scalable basis.
CLEF seeks to provide an end-to-end solution for collecting and managing longitudinal data
about cancer patients for both healthcare and biomedical research. It is designed to address
the key problem of linking genomic information to the clinical course of patients’ illnesses.

Objectives of the security and confidentiality policy

The key ethico-legal goal of CLEF is to provide mechanisms and policies to ensure that
patient privacy and confidentiality are preserved while delivering a repository of medically
rich information for the purposes of scientific research. This requires policy/organisational
safeguards and a multilevel technical framework.
There is a well-recognised need to establish a scalable methodology for deriving large
numbers of longitudinal pseudonymised health records (de-identified, identifiable only by the
originating health authority), in order to conduct the next generation of clinical and
bio-scientific research and to recruit for national clinical trials in ways not possible using
current resources, e.g. cancer registries. To do so requires a managed and monitored
framework for maintaining privacy and confidentiality. It must
both conceal patients’ identities and manage and monitor authentication and access so that
risk to privacy is minimised.
One key strand of the CLEF project, therefore, focuses on the development of rigorous generic
methods to solve this problem using cancer care as an exemplar domain.

Requirements

There are strong legal protections on personal patient information, from the Data Protection
Act 1998 (following on from the European Directive 95/46/EC), the Human Rights Act 1998, as
well as the common law of confidentiality. These generally require either the consent of the
data subject or the pseudonymisation of the information.
Most research requirements do not need identifiable information. What they require are
longitudinal records that reliably link the various episodes for a single patient into a
coherent “chronicle”. A dynamically pseudonymised record, which offers the ability to use
real Electronic Health Records and to observe patients’ histories as they evolve, is highly
attractive.
However, there is also a requirement to be able to re-identify specific patients in special
circumstances, e.g. to warn patients of risks uncovered by research or in order to recruit
patients for clinical trials. Some Research Ethics Committees (RECs) may even place such a
requirement on research projects so that patients can directly benefit from the research
where appropriate. Such re-identification must only be possible via the originating health
care organisations.

Technical Approach

The Electronic Health Record (EHR) at the Royal Marsden Hospital (RMH) is one of the main
providers of pseudonymised patient records to the project. An approach has been developed by
which real patient records (comprising structured data sets and narrative letters and
reports) can be suitably pseudonymised for removal from the Royal Marsden Hospital and
included within the CLEF Electronic Health Record Repository. The process provides multiple
layers for the protection of patient confidentiality and privacy:
1.  pseudonymisation – the removal of patient, geographical and organisational identifiers
    at source;
2.  depersonalisation – methods of access via language extraction and generation that
    conceal or remove potentially identifying information;
3.  security – policies and technical measures for the supervision and maintenance of the
    pseudonymous Electronic Health Record repository as if it contained identified patient
    records, in conformance with NHS and international standards, including privacy
    enhancing technologies to reduce the risk of re-identification through queries;
4.  oversight – specific policies for controlling access to the CLEF repository and handling
    requests to link researchers back to real patients;
5.  monitoring – organisational and technical measures to identify potential threats and
    intrusions.
The first four aspects of the approach are discussed in more detail below. The fifth is a
current area of exploration within the project.
The high level view of the flow of information, showing the points of control for privacy
and confidentiality, is given in Figure 1. The specific implementation of this scheme within
the current state of the CLEF project is shown in Figure 2.

[Figure 1 elements: Patient care & dictated text; Pseudonymise; Depersonalise; Construct
‘Chronicle’; Integrate & Aggregate; Repository; Knowledge Enrichment; Formulate Queries;
Ethical Oversight Committee; Monitoring; Privacy Technologies; Researchers; CLEF Workbench]

Figure 1: High level view of CLEF information
flow cycle with points of control for privacy
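
The first of the layers listed above, the removal of identifiers at source, can be pictured as a simple field filter applied before any data leaves the hospital. The sketch below is only illustrative: the field names and the `strip_identifiers` helper are assumptions made for this example, not the project's actual schema or code.

```python
# Hypothetical field names; the real hospital record schema is not given in the text.
DIRECT_IDENTIFIERS = {"name", "address", "next_of_kin", "gp", "nhs_number"}
SENSITIVE_FIELDS = {"age", "postal_district", "occupation"}


def strip_identifiers(record: dict) -> dict:
    """Drop direct identifiers and flag the remaining sensitive fields."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # never leaves the source system
        cleaned[field] = {"value": value, "sensitive": field in SENSITIVE_FIELDS}
    return cleaned


# Made-up example record, for illustration only.
example = {"name": "Jane Doe", "nhs_number": "000 000 0000", "age": 63, "diagnosis": "C50.9"}
cleaned = strip_identifiers(example)
```

Filtering by an explicit list of fields is the simplest option; a real extraction would more likely start from a list of permitted fields, so that anything unanticipated is excluded by default.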
[Figure 2 labels, in flow order:

  Royal Marsden Hospital (RMH EPR system -> Create export (XML)):
    - Define classes of EPR data to be excluded, or to be marked as sensitive
    - Identify patients for CLEF
    - Extract relevant EPR data
    - Remove any patient identifiers
    - Create CLEF entry ID
    - Transfer securely to UCL

  Cambridge University Health: Advice and evaluation

  UCL (Import into EHR -> Narratives (restricted) -> CLEF Pseudonymised EHR):
    - Remove any hospital internal identifiers
    - Create CLEF internal ID
    - Label sensitive data items as restricted
    - Label narratives as restricted

  Brighton, Manchester, Sheffield:
    - Depersonalisation of narratives
    - Clinical coding of narratives
    - Text generation from codes
    (Mark-up codes and generated text are not restricted, but original narratives
    remain restricted)

  CLEF Pseudonymised EHR services: Access control filter; Privacy Enhancement;
  Monitoring & logging

  Research community, by type of access:
    - Record extracts: Ethics committee approved. Includes restricted data and may drill
      down to individuals. Inclusion of narratives requires special [...]
    - Detailed queries: Ethics committee approved. Includes restricted data. Aggregated
      data only. May not drill down to individuals.
    - Ad hoc queries: Registered CLEF users approved for research. Excludes restricted
      data. Aggregated data only. May not drill down to individuals.
    - [...] queries & Metadata: Registered CLEF users. Metadata for resource discovery and
      predetermined aggregate datasets only.]

          Figure 2: Data flow within the current phase of CLEF project to generate the
          pseudonymised repository of EHRs.
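
The two identifier steps in Figure 2 ("Create CLEF entry ID" at the hospital, "Create CLEF internal ID" at UCL) can be sketched as two successive keyed one-way hashes, each using a key that never leaves its own site. This is a minimal sketch under stated assumptions: the choice of HMAC-SHA256, the key handling, and the names `keyed_pseudonym`, `hospital_export` and `repository_import` are all illustrative and are not taken from the CLEF implementation.

```python
import hashlib
import hmac
import secrets


def keyed_pseudonym(identifier: str, key: bytes) -> str:
    """One-way keyed hash (HMAC-SHA256) of an identifier, hex-encoded."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys: each is generated and kept at one site only.
HOSPITAL_KEY = secrets.token_bytes(32)    # held only by the originating hospital
REPOSITORY_KEY = secrets.token_bytes(32)  # held only by the repository site


def hospital_export(record: dict) -> dict:
    """Stage 1 (hospital): replace the hospital patient ID with an entry ID."""
    exported = {k: v for k, v in record.items() if k != "hospital_id"}
    exported["clef_entry_id"] = keyed_pseudonym(record["hospital_id"], HOSPITAL_KEY)
    return exported


def repository_import(exported: dict) -> dict:
    """Stage 2 (repository): replace the entry ID with an internal ID."""
    stored = {k: v for k, v in exported.items() if k != "clef_entry_id"}
    stored["clef_internal_id"] = keyed_pseudonym(exported["clef_entry_id"], REPOSITORY_KEY)
    return stored


# Made-up record, for illustration only.
record = {"hospital_id": "RMH-0001", "diagnosis": "C50.9"}
stored = repository_import(hospital_export(record))
```

Because the same input always yields the same pseudonym, later episodes for the same patient still link to the same internal ID. Under this assumption, tracing an identifier backwards would require each site to keep its own private lookup table, since the hashes themselves cannot be reversed.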

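The access types at the bottom of Figure 2 can also be read as a small decision table for the access control filter. The sketch below encodes that reading; the `AccessTier` class, the tier keys and the `is_permitted` function are hypothetical names introduced for this example, and the flags reflect one interpretation of the figure labels rather than a specified CLEF policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessTier:
    ethics_approval_required: bool  # needs ethics committee approval
    restricted_data: bool           # may include data labelled as restricted
    record_level: bool              # may drill down to individual records


# Flags read off the Figure 2 labels; the tier keys are illustrative.
TIERS = {
    "record_extracts": AccessTier(True, True, True),
    "detailed_queries": AccessTier(True, True, False),
    "ad_hoc_queries": AccessTier(False, False, False),
    "metadata": AccessTier(False, False, False),
}


def is_permitted(tier_name: str, *, has_ethics_approval: bool,
                 wants_restricted: bool, wants_record_level: bool) -> bool:
    """Return True only if the request stays within what the tier allows."""
    tier = TIERS[tier_name]
    if tier.ethics_approval_required and not has_ethics_approval:
        return False
    if wants_restricted and not tier.restricted_data:
        return False
    if wants_record_level and not tier.record_level:
        return False
    return True
```

For example, a request in the `detailed_queries` tier that asks for record-level results is refused even when ethics approval is held, which matches the "may not drill down to individuals" label in the figure.
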
Pseudonymisation (1)

The CLEF pseudonymised repository of electronic health records (EHRs) will be established at
University College London (UCL) using the results of European research into the design of
Electronic Health Record systems, meeting established clinical and ethico-legal requirements.
UCL has been active in several EU projects over the past decade to investigate and specify
the requirements, information models and middleware services that are needed to underpin
comprehensive multi-professional electronic health records [e.g. Ingram 1995, Grimson et al
1998]. UCL has designed and built a federated health record server based on these models,
which has been evaluated in the Department of Cardiovascular Medicine at the Whittington
Hospital in north London [Kalra et al 2001, Kalra 2002] and in the South West Devon ERDIP
project (1).
(1) Please see
During the initial stages of the project, until the methodology is proven, records will be
restricted to those of deceased patients to minimise risk of harm to existing patients.
The overall process is implemented and split amongst the different partners as shown in
Figure 2.
Records of patients at the Royal Marsden Hospital will be extracted from the main computer
system and subjected to a combination of computerised and manual de-identification on site
before being sent via a secure communication to UCL. These steps will include:
1.  limiting extraction to the particular structural data elements of the Royal Marsden
    Hospital Electronic Health Record that are needed to support the anticipated research
    queries for the CLEF Electronic Health Record;
2.  the exclusion of principal patient identifiers such as name, address, next of kin and GP
    information;
3.  marking as ‘sensitive’ any demographic and “social history” information that may be
    needed to support realistic research queries (such as age, postal district, occupation).
The various confidentiality-enhancing measures are:
1.  At the Royal Marsden NHS Trust:
    a)  any patient records flagged as not to be included in research (at the request of the
        patient and/or consultant) will be excluded from data extraction;
    b)  key identifying fields, such as name, address, full postcode, NHS Number, will not be
        extracted;
    c)  a secure “clef entry identifier” will replace the Royal Marsden Hospital patient ID
        field, so that there is no reference whereby a researcher could link back to the
        primary medical record and the patient’s identity;
    d)  all occurrences of the patient’s name in text fields will be removed.
2.  Prior to transfer to the CLEF Electronic Health Record system at UCL, the extracted
    data-set will be further depersonalised by running a number of procedures to remove other
    potential identifiers. These procedures will be developed through the various phases of
    the project (see below) and applied particularly to narratives, which are considered to
    have the highest risk of containing identifiable information.
3.  At UCL, the incoming data will be re-mapped into the CLEF EHR data-schema and the
    “clef-entry identifier” replaced by the internal clef-identifier, providing a second
    barrier between the identifiers in the repository and the original identifiers at the
    originating hospital.
Identifiable patient information will not be released by the Royal Marsden Hospital under
any circumstances. The Royal Marsden Hospital paper record systems are not accessible by this
project.
Additional policies and procedures, which are still being defined, will be put in place:
1.  at the time of querying: for monitoring and controlling queries;
2.  for returning information to the Royal Marsden, ensuring that only the Royal Marsden can
    re-identify patients and only in appropriate circumstances;
3.  an overall supervisory and regulatory framework, through responsibility to an oversight
    CLEF Ethics Board that will be established towards the end of the CLEF project, before
    any data is made available to external research groups.

Depersonalisation - Extraction of data elements from narratives (2)

The text fields, particularly narratives, will be parsed by routines developed at the
University of Sheffield to extract only clinically structured data – in doing so any
extraneous socially significant information would be removed.
In the real world, much medical information is transferred through exchange of letters
between clinicians, in the absence of proper work-flow systems to support clinical care.
Hence much of the data that is available is perforce in free-text (or quasi-free-text)
format. On the plus side, such correspondence usually references only key relevant
information, filtering out much of the chaff generated by individual laboratory
reports and tests, which may not actually be pertinent to the condition.
The processing of the free text data to identify clinically relevant information and to
extract this into a structured and codified format will greatly increase the value of such
data to researchers, even if some recall and precision is lost (following the principle that
half a loaf is better than no bread, and that inaccessible data trapped in free text format
is virtually useless). This will be done through semantic analysis and extensive use of
clinical vocabularies and ontologies.
One positive side-effect of this data extraction is that by focusing solely on medical facts
much of the social context is omitted. While social context may be critical in certain areas,
e.g. mental health, removing it reduces the likelihood that extraneous information might
identify an individual, e.g. ‘ … <the patient> attended the <clinic> accompanied by her
partner, a well known politician. …’. The text extraction process will aim only to record
that the patient attended a clinic on a certain day (and even the exact date might be blurred
to limit the risk of identification still further).

Security policy and technical measures (3)

The information to be held in the CLEF repository might still be considered ‘sensitive
personal data’ under the definition of the Data Protection Act 1998, so the general approach
taken by CLEF will be to treat these records as if they still retain some (albeit
hypothetical) risk of re-identification. Whatever the precautions, there is always some
chance that some unusual or unique characteristics of an individual clinical journey in an
EHR might make the patient recognisable to someone with sufficient knowledge from other
sources.
A draft security policy has been proposed for the CLEF implementation that would meet many of
the requirements of data protection, Caldicott-Guardian responsibilities and other published
requirements that would pertain to the control of access to real and identifiable patient
records. This includes local security policies for each CLEF partner site that needs to
access or process data from the repository. The approach for research query access to the
final CLEF repository includes:
1.  limiting the majority of research queries to the return of aggregate data (e.g. frequency
    tables) and not the findings in individual patients;
2.  limiting access to the individual pseudonymised records to clinical research projects
    that have themselves obtained ethical approval for the queries they intend to run.
The main risk to patients would be through a mechanism of inferential data-mining (whereby
known information about a person’s medical history is used to identify a unique set of
records which might then reveal more about that individual). In order to limit such risks the
following restrictions are placed on access:
1.  only individuals registered with REC approved projects may have access to the system and
    this will be time limited to the project;
2.  projects and researchers will only be allowed access to specific fields or ranges of
    records relevant to their project;
3.  generally, only aggregate data will be provided unless ethical approval permits access to
    individual record-level data;
4.  there will be checks on query criteria to identify possible inferential attacks, either
    through overlapping queries or highly specific queries;
5.  where individual record data is to be provided with a facility for longitudinal linking,
    a project-specific re-mapping of the unique identifier will be used so that data-sets
    provided to different projects cannot be re-linked.
There is a growing body of literature investigating the risks of person re-identification
through data mining and probabilistic techniques [e.g. Sweeny L 2002], and a similarly
expanding set of algorithmic techniques proposed for profiling and monitoring serial queries
and result sets to detect attempts to triangulate towards unique person characteristics [e.g.
Ferris et al 2002, Murphy and Chueh 2002]. This and other work in the field is being reviewed
within the project to determine the kinds of audit trails that need to be built in and
constraints that ought to apply to the specification of queries by the CLEF workbench tools.

Oversight – policies for access (4)

A series of requirements have been drafted that will apply to research communities accessing
the final live CLEF services, via GRID networks.
1.   Reliable identification and traceability of any GRID users accessing CLEF
2.   Assignment of GRID access security levels, as access to medical data sets may impose
     restrictions (e.g. not undergraduate students)
3.   Authentication of users during sessions to ensure that sessions cannot be hijacked
4.   Security of data transmissions
5.   Non-repudiation of query requests
6.   Local decryption of data packages
7.   Local screen security, both for user entry of passwords, and to ensure that potentially
     sensitive data is not displayed without user presence and knowledge
Four ‘Use Cases’ are envisaged for research access to repository data, which will be managed
via the CLEF workbench and by attribute certificate services within the EHR repository
middleware.
1.  Open, Aggregated data – all users
    CLEF may make available to GRID users generally aggregated data-sets which are fully
    pseudonymised and approved for release to bona fide researchers.
    CLEF would expect reliable identification of users making requests for such data, both
    to develop performance statistics to justify ongoing funding, and to be able to vet for
    any unusual activity that may indicate security or confidentiality breaches.
    CLEF may require some measure of secure links to such users to be able to meet
    assurances to Ethics Committees that data is only being provided to bona fide
    researchers, e.g. SSL links as standard. This may also have implications for general
    access controls within GRID, and the passing of a general GRID access level to CLEF to
    permit access for even pseudonymised medical data.
2.  Aggregated queries on individual records – CLEF registered users only
    CLEF will allow most CLEF registered users to run queries on the pseudonymised data-sets
    to extract aggregated statistics (possibly with cell-size restrictions to limit
    identifiability). Privacy enhancement and monitoring techniques will be used to blur
    results so as to minimise the risk that queries, singly or together, might allow
    individuals to be identified.
    Access to such aggregated queries would require a high level of identification and
    authentication of the individual making access, including session control. This would
    include non-repudiation of query requests and acknowledgement of data-set delivery (to
    manage any re-requests of data which might be spoofed).
    Secure links would definitely be required, probably at least SSL 128-bit.
3.  Access to disaggregated pseudonymised record sets – special approval only
    Specific projects (and hence specific users within that project group, but possibly only
    the Principal Investigator) would be permitted access to the individual pseudonymised
    data-sets (though nearly always with restrictions on the table columns that could be
    accessed; almost never access to the entire record).
    This would require an even greater emphasis on identification and authentication, as
    well as security of the data provided through a session.
    The data provided may need to be encrypted to a higher level than SSL 128-bit, and hence
    may require some form of local processing to decrypt the data, and secure its delivery
    at the user workbench.
4.  Downloading of subsets of the repository – special approval only
    Some projects may be allowed to download sub-sets of the CLEF database (subject to
    approval by a Research Oversight Committee) which will have a specific encryption of
    identifying fields for each specific project (to prevent linking of separate approved
    extracts to re-create the CLEF database in whole or in part).
    The required mechanisms will need to be explored more fully, and may be covered within
    the requirements 1-3 above.
A special requirement is that it should be possible to link back to the original patient id
for ethically approved reasons, for example to contact patients at high risk, as part of a
subsidiary research project, or to identify patients for recruitment to trials. This process
for direct access to patient records is proposed to be as follows:
1.  researchers will be required to submit a request to the CLEF Ethics Board
    administration. The request will be assessed for ethical appropriateness, including a
    check against the original Research Ethics Committee approval;
2.  the CLEF administration will then submit the request to the repository holders, in this
    case UCL, to identify which CLEF repository records are required. Such
    access requires unusual authorisation and will be strictly logged and monitored;
3. the repository holders (UCL) will then have to make a similar request to the originating hospital, the Royal Marsden, subject to the same vetting processes and accompanied by the original entry IDs for the patients involved. If the originating hospital, the Royal Marsden, agrees, only then can it trace the entry IDs back to the original Royal Marsden Hospital ID that allows the hospital to reidentify the patient. The originating hospital can then contact the patient, either directly or via their general practitioner, in order to gain consent to further research access to their full medical records, to participate in a trial, or to be recalled if at risk.
This three-stage process across three separate organisations and identifiers should ensure that identification is only possible when appropriate and duly authorised.

Progress to date

One-way key encryption is now being used to create a CLEF “patient” identifier that is distinct from any health service issued numbers, permitting longitudinal growth of the CLEF record through a non-reciprocal link from the RMH to CLEF. The mapping between these sets of identifiers is held securely at the Royal Marsden Hospital.

Parts of the clinical records are now being extracted from the Royal Marsden Hospital Electronic Health Record system, initially only for deceased patients, for transfer to the CLEF Electronic Health Record server at UCL, beginning with the narrative case notes and correspondence parts of the records. In parallel, the research teams at the Universities of Sheffield, Brighton and Manchester have begun analysing a number of manually de-identified sample narratives to design the target templates and data structures that are anticipated to be derived from the complete corpus. A clinical advisory group has been active throughout the project in proposing the kinds of clinical queries and data elements that are likely to be of greatest value to the research community, as well as contributing to the ethical and security approaches described in this paper.

Conclusions

CLEF explores options and policies concerning a pseudonymisation solution to parallel the ‘consent’ approach underpinning the BioBank initiative. CLEF will identify processes and procedures that are both technically feasible and politically and socially acceptable to permit continuing and more efficient access to medical records to further medical research. CLEF may give rise to an ongoing research database if there is continuing funding and sufficient subscribing organisations are prepared to provide data under this approach. Equally, the policies and methods developed may serve to inform other projects within the UK (e.g. the current NHS National Programme for IT) or around the world.

An important objective of the project methodology is to establish best practice in pseudonymisation and in the security policies that should pertain to such a repository. A formal evaluation of the proposed approach will be carried out and published.

Acknowledgements

CLEF is supported in part by the grant G0100852 from the MRC under the E-Science Initiative. Special thanks are due to its clinical collaborators at the Royal Marsden and Royal Free hospitals, to colleagues at the National Cancer Research Institute (NCRI) and NTRAC and to its industrial collaborators – see for further details.

References & Extended Bibliography

Legal aspects

Data Protection Act 1998, The Stationery Office Limited London 1998,

Human Rights Act 1998, The Stationery Office Limited London 1998,

Regina v Department of Health, Ex parte Source Informatics Ltd (CA) [2001] QB 424, [2000] EuLR 397, [2000] 1 AllER 786, The Times 18 January 2000 (use of anonymised patient information),

EU Directive 95/46, w_en.htm

Confidentiality Guidance

British Medical Association, Confidentiality and disclosure of health information, BMJ Publishing Group, London 1999,

General Medical Council, Seeking patients’ consent: the ethical considerations, GMC London 1999,
Medical Research Council, Personal Information in Medical Research, MRC London 2000,

Privacy enhancing techniques

L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570

Ferris TA, Garrison GM, Lowe HJ. Proposed Key Escrow System for Secure Patient Information Disclosure in Biomedical Research Databases. Procs AMIA 2002 Annual Symposium 245-249

Murphy S, Chueh H. A Security Architecture for Query Tools used to Access Large Biomedical Databases. Procs AMIA 2002 Annual Symposium 552-556

Electronic Health Records

Ingram D. The Good European Health Record Project. In: Laires, Ladeira, Christensen, Eds. Health in the New Communications Age. IOS Press: Amsterdam; 1995; pp. 66-74

Grimson J., Grimson W., Berry D., Stephens G., Felton E., Kalra D., Toussaint P., and Weier O.W. A CORBA-based integration of distributed electronic healthcare records using the Synapses approach. IEEE Trans Inf Technol Biomed. Sep 1998; 2(3):124-38

Kalra D, Austin A, O’Connor A, Patterson D, Lloyd D, Ingram D. Design and Implementation of a Federated Health Record Server. Toward an Electronic Health Record Europe 2001, Paper 001: 1-13. Medical Records Institute for the Centre for Advancement of Electronic Records Ltd.

Kalra D. Clinical foundations and information architecture for the implementation of a federated health record service. PhD Thesis. Univ. London.

NHS National Programme for IT, (accessed 29/07/03)
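
Illustrative sketch: cell-size restriction. The cell-size restrictions mentioned for aggregated queries are not specified in detail in this paper. As a minimal sketch only, one common form of such a restriction is small-cell suppression, in which any count below a threshold is withheld. The threshold of 5, the function name and the sample data below are assumptions made for illustration and are not taken from the CLEF design.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Assumed threshold for illustration; the paper does not state a value.
MIN_CELL_SIZE = 5


def aggregate_with_suppression(
    values: Iterable[str], min_cell_size: int = MIN_CELL_SIZE
) -> Dict[str, Optional[int]]:
    """Count occurrences per category, withholding counts below the threshold.

    A suppressed cell is reported as None, so a researcher can see that the
    category exists but not how few records fall into it.
    """
    counts = Counter(values)
    return {
        category: (n if n >= min_cell_size else None)
        for category, n in counts.items()
    }


# Hypothetical pseudonymised data: one diagnosis label per record.
diagnoses = ["breast"] * 12 + ["lung"] * 7 + ["rare sarcoma"] * 2
result = aggregate_with_suppression(diagnoses)
print(result)
```

Suppression of this kind addresses single queries only. Guarding against identification by combining several queries, as the paper notes, would additionally need the monitoring of query histories.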

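
Illustrative sketch: one-way pseudonymous identifier. The paper states that one-way key encryption is used to create a CLEF patient identifier distinct from any health service number, but it does not name the algorithm. The sketch below assumes a keyed hash (HMAC-SHA256) as one possible way of achieving this. The function name, the key, the "CLEF-" prefix and the sample hospital identifiers are all hypothetical.

```python
import hashlib
import hmac


def clef_pseudonym(hospital_id: str, secret_key: bytes) -> str:
    """Derive a stable, one-way pseudonym from a hospital identifier.

    The same hospital_id and key always give the same pseudonym, which
    supports longitudinal growth of a record. Without the key, and without
    a mapping table kept by the hospital, the pseudonym cannot practically
    be traced back to the hospital identifier.
    """
    digest = hmac.new(secret_key, hospital_id.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a real scheme would choose a length that
    # keeps the collision risk acceptable for the population size.
    return "CLEF-" + digest.hexdigest()[:16]


# Hypothetical key, assumed here to be held only by the originating hospital.
key = b"example-key-held-by-hospital"

first_extract = clef_pseudonym("RMH0012345", key)
later_extract = clef_pseudonym("RMH0012345", key)
other_patient = clef_pseudonym("RMH0067890", key)

print(first_extract == later_extract)  # same patient links to the same record
print(first_extract == other_patient)  # different patients get different pseudonyms
```

Because the derivation is keyed and one-way, the link runs only from the hospital to the repository, which matches the non-reciprocal link described under Progress to date. Re-identification then depends on the mapping held at the originating hospital.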