Slide 1 - CTSA Wiki

					Science. Service. Technology.

                          Quantitative Imaging &
                          Reader Variability
                           David A. Clunie, CTO

 • Kristin Borradaile
 • Robert Ford
 • Michael O’Neal
 • Kevin Byrne
 • Jeff Toyes
 • Nick Petrick
 • Mike McNitt-Gray
 • Grace Kim
 • Chuck Fenimore

 • Dictionary:
    – “having reality independent of the mind”
    – “reducing subjective factors to a minimum”
    – “uninfluenced by … personal prejudices”
 • “objective” != “quantifiable”
 • Why obsess about “quantification”,
   especially if the subjective human (expert
   mind) is involved in making or making use of
   the measurement?
 • Beware of numbers masquerading as truth

 • Individual (lesion) measurement
    – RECIST – one linear distance (LD or SD)
    – WHO – orthogonal linear distances
    – computed volumes from distances
    – 3D volume from edges (or similar)
 • Summed lesion measurements
    – finite # of “target” lesions (vs. total load)
 • Manual vs. semi-automated vs. automated
 • Fully elucidated response criteria
    – quantitative + qualitative + detection (new)
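The measurement conventions above can be sketched in code. This is a minimal sketch, not from the slides: function names are mine, diameters are assumed to be in mm, and the ellipsoid formula is one common way to compute a volume from linear distances.

```python
import math

def recist_measure(long_axis_mm: float) -> float:
    """RECIST: a single linear distance, the longest diameter (LD)."""
    return long_axis_mm

def who_measure(long_axis_mm: float, short_axis_mm: float) -> float:
    """WHO: product of two orthogonal linear distances (mm^2)."""
    return long_axis_mm * short_axis_mm

def ellipsoid_volume(long_mm: float, short_mm: float, height_mm: float) -> float:
    """A volume computed from distances, assuming an ellipsoid (mm^3)."""
    return (math.pi / 6.0) * long_mm * short_mm * height_mm

def summed_longest_diameters(target_lesions_mm: list[float]) -> float:
    """Summed measurement over a finite set of target lesions (SLD, mm)."""
    return sum(target_lesions_mm)
```

Edge-based 3D volumes (contouring) are a segmentation problem and are not reducible to a formula like these.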
Context - QIBA

 • QIBA defining “profiles” for achieving targets
   for “qualification” as “biomarker” with
   requirements for acquisition
 • E.g., if x mm slices with y kVp and z mAs then
   p % variability can be obtained for tumor
   volume measurement (trial statistical power)
 • Goal is to have evidence-based requirements
 • Evidence == experimental results
 • Experiments require measurements
 • People & algorithms vary in performance
 • Variability may overwhelm acquisition effects
Single measurement variability

 • Large body of literature for various tumor
   types, body regions, modalities, acquisition
   parameters, linear or volume measurements
 • Strong emphasis on small (round) lung
   nodule volume change (criteria for
   malignancy in screening) – finite variance
 • Are lessons learned applicable to advanced
   metastatic disease refractory to conventional
   therapy even in lung – big, heterogeneous
   shape & density, invasive – greater variance,
   fewer measurable lesions?
 • Whole body (including liver) ?#$!?
QIBA 1A – Lung Phantom - Accuracy
QIBA 1B – Real lung lesions
 [Per-reader plots of Relative Difference vs. Average Measurement
  (0–150000), one panel per reader, titled “3D from contour” 1–5;
  graphs by reader]
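The statistic plotted per reader can be sketched as follows; a minimal, hypothetical helper (names mine) computing the Bland-Altman-style relative difference of two repeat measurements against their average.

```python
def relative_difference(m1: float, m2: float) -> float:
    """Relative difference of two repeat measurements of the same
    lesion, expressed against their average (the x-axis of the plot)."""
    avg = (m1 + m2) / 2.0
    return (m1 - m2) / avg

# e.g., two volume measurements of the same lesion, in mm^3
rd = relative_difference(52000.0, 48000.0)  # ~0.08, i.e. 8% apart
```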
QIBA 1B – Real lung lesions
Reducing Single Measure Variability

 • Reducing the “human factor”
 • Consistency through more precise rules?
 • Better understanding of what type of lesions
   can be measured reliably
 • Better rules for choosing (or rejecting)
   lesions to measure
 • Better rules for where the edge is
    – invasion, density, spiculation, contrast
      phase, necrosis, cavitation, collapse
 • Judgment (experience) vs. guesswork
 • Consistency within vs. across readers
Context - UPICT

 • Protocols for “real” clinical trials
 • Requires fully elucidated response criteria
 • Phase 2 (SPA)/3 – “surrogate” for survival +/-
   quality of end of life
 • Early phase “go/no-go” – detection of effect
 • Response criteria are categorical
    – progression (PD), stable (SD), response
      partial (PR) or complete (CR)
 • No (official) response criteria for volume yet
 • Substitute sum of volumes for SLD in RECIST
   (and change response thresholds)
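The categorical mapping can be sketched as below, using the RECIST 1.1 linear thresholds on SLD (≥30% decrease from baseline for PR; ≥20% increase over nadir, with a ≥5 mm absolute increase, for PD). Confirmation rules, non-target disease, and new-lesion assessment are deliberately omitted, and the volume substitution mentioned above would need different thresholds.

```python
def recist_category(baseline_sld: float, nadir_sld: float,
                    current_sld: float) -> str:
    """Map a sum of longest diameters (SLD, mm) to a RECIST 1.1-style
    category for target disease only (simplified sketch)."""
    if current_sld == 0:
        return "CR"  # complete response: all target lesions gone
    if current_sld >= 1.2 * nadir_sld and (current_sld - nadir_sld) >= 5:
        return "PD"  # >=20% increase over nadir and >=5 mm absolute
    if current_sld <= 0.7 * baseline_sld:
        return "PR"  # >=30% decrease from baseline
    return "SD"      # stable disease
```

Note how a slight numeric difference near a threshold flips the category, which is one source of reader discordance discussed later.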
Response Criteria “rules”

 • Achieving reasonable quantitative
   reproducibility is only one (small?) problem
 • Choosing target lesions (measurable disease)
 • Rejecting bad choices of target lesions
 • Split/merge target lesions (or unmeasured)
 • Bone, lymph nodes (versus node masses), …
 • Progression of non-target disease
   (“unequivocal” is open to interpretation)
 • Detect new lesions (sensitivity/specificity)
 • Unevaluable images – poor quality, partial
 • Rules vs. experience, justifiable vs. not
Multiple Readers, Adjudication

 • Typical Phase 3 independent review
 • 2 radiology readers + adjudicator
 • +/- oncologist review of radiology + clinical
 • Adjudication process
    – readers agree … OK
    – disagree, adjudicator chooses or repeats
 • Choice of one or more adjudication variables
    – date of 1st response
    – date of progression (DOP)
    – more variables … more discordance
    – just the primary end-point (for PFS, use DOP)
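The adjudication flow above can be sketched as a simple rule: where the two readers agree on an adjudication variable their result stands, otherwise the adjudicator's choice decides. This is a hypothetical sketch (variable names mine); real charters also allow re-reads.

```python
def adjudicate(reader1: dict, reader2: dict, adjudicator: dict,
               variables=("date_of_first_response",
                          "date_of_progression")) -> dict:
    """Resolve two independent reads per adjudication variable.
    Each extra variable is an extra chance to disagree."""
    result = {}
    for var in variables:
        if reader1.get(var) == reader2.get(var):
            result[var] = reader1.get(var)   # readers agree ... OK
        else:
            result[var] = adjudicator.get(var)  # adjudicator chooses
    return result
```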
Many ways to measure agreement

 • Simple adjudication rate is common
    – does not account for agreement by chance
    – from 10 to 50% depending on choice of
      variable, tumor type and other factors
 • Kappa – e.g. for single reader studies
    – 10% inter-reader variability retest typical
    – around 0.58-0.71 (moderate-substantial)
 • Important to distinguish between
    – a difficult task (all readers perform poorly)
    – a poorly performing individual reader
 • Better defined task vs. QC procedures
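Unlike the raw adjudication rate, kappa corrects observed agreement for the agreement expected by chance from each reader's marginal category frequencies. A minimal Cohen's kappa for two readers over the same cases (a sketch; production work would use a vetted statistics library):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: (observed - expected) / (1 - expected),
    where expected agreement comes from the two readers'
    marginal category frequencies."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```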
Reasons for Discordance
 Reason                              Justifiable       Error   Total
 Lesion selection                    30%               7%      37%
 New lesion detection                19%               11%     30%
 Non-target progression              8%                4%      13%
 Image quality/missing               11%               0%      11%
 Measurement                         8%                1%      9%
 TOTAL                               76%               23%     100%

 Borradaile et al, Discordance Between BICR Readers,
 Applied Clinical Trials, Nov 2010
New Lesion Discordance
 Reason                                                         Total
 Inconsistent technique, poor quality or partial missing data   30%
 Distinguishing benign vs. malignant                            23%
 Detection threshold (operating point)                          14%
 Extensive disease in same organ                                9%
 Unequivocal CT confirmation of bone scan                       5%
 Healing existing versus new bone lesion                        2%
 Exam with slightly different date (same TP)                    2%

 Borradaile et al, unpublished

 • Does it matter?
   – theoretically not, if the “right” answer is
     obtained by the adjudicator (and both
     readers are not “wrong”), but …
   – undermines confidence in the process
   – high proportion of “justifiable” differences
   – is task inadequately controlled or defined?
 • A “justifiable difference in measurement”
   – where the edges are, which contrast phase
     or slice to measure on?
   – versus a slight numeric difference that
     crosses categorical threshold?
Reducing discordance - I

 • Choosing the same lesions
 • Choosing the right lesions
 • Not measuring lesions that are unmeasurable
 • Measuring them the same way if measurable
 • Choosing measurements that are robust
 • More automation of measurements?
 • Rigorously defining non-target progression
 • Non-target progression automation (texture?)
 • Improving detection of new lesions (CAD?)
 • Better response criteria (continuous?)
Reducing discordance - II

 • Small variations in exam date (DOP, PFS)
   cause spurious discordance – e.g., brain
   progresses two days later than C/A/P – use a
   single nominal date for the time-point window
 • Record rules and re-train readers for every
   encountered “exceptional” scenario (else
   they make it up as they go along)
 • Consistent handling of partially missing data
   (incomplete anatomical coverage); target vs.
   non-target and difference in lesion selection
 • Automation of response derivation algorithm
   (humans make procedural mistakes)
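The time-point windowing fix can be sketched as below: map each exam date to the nearest nominal time-point date so that, e.g., a brain scan performed two days after the C/A/P scan lands on the same nominal date for both readers. The window width here is an assumption; trials define their own.

```python
from datetime import date, timedelta

def assign_time_point(exam_date, nominal_dates, window=timedelta(days=7)):
    """Map an exam to a single nominal time-point date so small
    scheduling differences do not create spurious DOP/PFS discordance."""
    for nominal in nominal_dates:
        if abs(exam_date - nominal) <= window:
            return nominal
    return None  # outside any window; handle per protocol

nominal = [date(2011, 1, 3), date(2011, 3, 1)]
tp = assign_time_point(date(2011, 3, 3), nominal)  # -> date(2011, 3, 1)
```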
QC - Individual Reader Performance
Who needs measurements anyway?

 • Gottlieb RH et al SIIM 2011: “Quantitative
   Visual Based Scoring: Improving Early CT
   Prediction of Outcome of Treated Metastatic
   Melanoma Patients Compared with the
   Current RECIST Standard”
 • Forget numbers - use the radiologist’s brain
 • “demonstrated superior discrimination
   compared with RECIST 1.1 in predicting
   patients likely to improve or deteriorate on
   treatment for metastatic melanoma early in
   the course of treatment”
