Finding Genetic Variation with PERL

					Exploring Human Genetic Variation
         A Bioinformatics Case Study

  Steve Pittard / BioMolecular Computing Resource at Emory
                                Human Genome Project

Human Genetic Variation - A Bioinformatics Case Study
                           Human Genome Information

    # The human genome is the complete list of coded instructions needed to make a human.

    # The human genome is composed of more than 3 billion nucleotide bases (> 3 Gb).

    # The total number of human genes is estimated to be between 20,000 and 25,000.

    # Worms have 19,098 genes, fruit flies have 13,602 and yeast has 6,034.

    # Almost all nucleotide bases (99.9%) are exactly the same in all people.

    # Our DNA is 98% identical to that of chimpanzees. The average genetic difference between any
    2 chimpanzees is 4 to 5 times greater than the average difference between any 2 humans.

    # The vast majority of the DNA in the genome (>97%) has no known function.

    # The functions remain unknown for over 50% of discovered genes.

    # The entire human genome requires more than 3 gigabytes of computer storage space.


                           Human Genome Sequencing

   Shotgun sequencing was one of the precursor technologies that made full genome sequencing
   possible. In shotgun sequencing, DNA is broken up randomly into numerous small segments,
   which are sequenced to obtain “reads” (e.g. “ACGTTGCGGTTT…”). Multiple overlapping reads for
   the target DNA are obtained by performing several rounds of this fragmentation and sequencing.
   Computer programs then use the overlapping ends of different reads to assemble them into a
   continuous sequence.

                     A very, very basic sequence assembly example:
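
   The overlap-and-merge idea described above can be sketched in a few lines. This is a toy
   illustration only, not the method used by real assemblers or by the sequencing projects.
   Python is used for illustration (the pipeline in this case study was written in Perl), and
   the function names, the minimum overlap of 3 bases, and the three example reads are all
   assumptions made up for this sketch.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads that overlap end to end.
reads = ["ACGTTGCG", "TGCGGTTT", "GTTTACCA"]
print(greedy_assemble(reads))  # prints ['ACGTTGCGGTTTACCA']
```

   Real reads contain sequencing errors and repeats, so production assemblers use far more
   careful overlap scoring than the exact string match shown here.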

                          A “Real” Sequence Assembly

                                Sequencing Trace Files

   The files representing the “reads” (e.g. “ACGGTTGGCCC…”) are derived from “trace files” that are
   the output of the sequencing technology used to determine the correct nucleotide bases. These are
   then “assembled” into the previously mentioned longer sequences. Our research to find genetic
   variation begins with these trace files obtained from the public human genome sequencing projects.

      Natural Genetic Variation Caused by Transposable Elements in Humans. E. Andrew Bennett,
      Laura E. Coleman, Circe Tsui, W. Stephen Pittard, and Scott E. Devine. Department of
      Biochemistry, Center for Bioinformatics, Genetics and Molecular Biology Graduate Program,
      and Bimcore, Emory University School of Medicine, Atlanta, Georgia.

                        Examining Transposable Elements

   Transposons are sequences of DNA that can move or transpose themselves to
   new positions within the genome of a single cell. The mechanism of transposition
   can be either "copy and paste" or "cut and paste". Transposition can create
   significant mutations and alter the cell's genome size.

                                            The first transposons were discovered in maize by
                                            Barbara McClintock in 1948, for which she was
                                            awarded a Nobel Prize in 1983. She noticed
                                            insertions, deletions, and translocations, caused by
                                            these transposons. These changes in the genome
                                            could, for example, lead to a change in the color of
                                            corn kernels. About 50% of the total genome of maize
                                            consists of transposons.

    Transposons and transposon-like repetitive elements collectively occupy 44% of the
    human genome sequence. The most common form of transposon in humans is the
    Alu sequence. The Alu sequence is approximately 300 bases long and can be found
    between 300,000 and a million times in the human genome.

                    Insertions from the Chromosome View

                    Deletions from the Chromosome View

          Insertions/Deletions from the Sequence View

                                                    Insertions (primarily those within genes) have been
                                                    found to cause altered human phenotypes, including
                                                    diseases. For example, disease-causing Alu insertions
                                                    have been observed in the BRCA2 gene (Miki et al.
                                                    1996), the glycerol kinase gene (Zhang et al. 2000), and
                                                    others (Deininger and Batzer 1999).
                                                    Disease-causing L1 insertions likewise have been
                                                    observed in at least 14 different genes, causing cancers
                                                    (Morse et al. 1988; Miki et al. 1992; Liu et al. 1997),
                                                    hemophilia (Kazazian et al. 1988), muscular dystrophy
                                                    (Narita et al. 1993), and other diseases. It is likely that
                                                    additional transposon insertions will be found to affect
                                                    human phenotypes as well.

                      Pipeline for Discovering Variation

                          Variation Pipeline Continued

 Obtain DNA sequencing files - 16,378,975 DNA sequencing traces and accompanying quality files
 were obtained from Cold Spring Harbor Laboratory [traces generated by the SNP Consortium (TSC)]
 or from the Trace DB archive at the National Center for Biotechnology Information (NCBI).
 Data management issues included a shortage of disk space caused by the sheer number of files,
 as well as poor system performance during intensive I/O operations. To address this, we employed
 a file-naming algorithm based on MD5 to “compute” the name and location of each trace file (and
 its associated quality file), which avoided using operating-system-level commands to locate files.
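
 The slides do not give the details of the MD5-based naming scheme, so the following is only a
 minimal sketch of the general idea: hash the trace identifier and use parts of the digest as
 directory names, so a file's location can be computed directly instead of searched for. Python
 is used for illustration (the pipeline itself used Perl). The function name, the root
 directory, the file suffixes, and the two-level layout are all assumptions.

```python
import hashlib
from pathlib import Path


def trace_path(root, trace_name, suffix=".scf", depth=2):
    """Compute a storage path for a trace from the MD5 digest of its name.

    Successive pairs of hex characters become nested directory names
    (256 possible directories per level), which keeps any single
    directory from holding millions of files.
    """
    digest = hashlib.md5(trace_name.encode("utf-8")).hexdigest()
    buckets = [digest[2 * i: 2 * i + 2] for i in range(depth)]
    return Path(root).joinpath(*buckets, digest + suffix)


# Hypothetical trace identifier and root directory.
p = trace_path("/data/traces", "example_trace_001")
q = trace_path("/data/traces", "example_trace_001", suffix=".qual")
print(p)
print(q)  # same directory as p, so the trace and its quality file stay together
```

 Because the path is a pure function of the trace name, a program can open a file directly
 without listing directories or running a search over the file system.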

 Trim out vector and low-quality sequences - Our method identified the longest high-quality region
 of each trace and trimmed the flanking data upon encountering 5 bases in a row with Phred
 scores < 25. The longest high-quality interval from each trace was chosen for further analysis and
 the remaining data were set aside. Trimmed traces were also required to have an average Phred
 score of at least 25 and a minimum length of 100 bases.
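
 One way to read the trimming rule above is sketched below. It is an interpretation, not the
 authors' code: a run of 5 or more consecutive bases with Phred < 25 ends the current interval,
 the longest surviving interval is kept, and the result is rejected unless it has a mean score
 of at least 25 and at least 100 bases. Python is used for illustration (the pipeline used
 Perl); the function names and the example scores are hypothetical.

```python
def longest_high_quality_interval(quals, min_q=25, bad_run=5):
    """Return (start, end) of the longest stretch that contains no run of
    bad_run consecutive scores below min_q. The end index is exclusive."""
    best = (0, 0)
    start = 0  # start of the current candidate interval
    low = 0    # length of the current run of low-quality bases
    for i, q in enumerate(quals):
        if q < min_q:
            low += 1
            if low == bad_run:
                # The interval ends just before this low-quality run began.
                end = i - bad_run + 1
                if end - start > best[1] - best[0]:
                    best = (start, end)
        else:
            if low >= bad_run:
                start = i  # resume at the first good base after the bad run
            low = 0
    # Close the final interval if the trace did not end inside a bad run.
    if low < bad_run and len(quals) - start > best[1] - best[0]:
        best = (start, len(quals))
    return best


def trim_trace(bases, quals, min_q=25, bad_run=5, min_len=100):
    """Return the trimmed (bases, quals), or None if the trace fails the filters."""
    start, end = longest_high_quality_interval(quals, min_q, bad_run)
    kept = quals[start:end]
    if not kept or len(kept) < min_len or sum(kept) / len(kept) < min_q:
        return None
    return bases[start:end], kept


# Made-up scores: 6 poor bases, 120 good, 5 poor, 30 good.
quals = [10] * 6 + [40] * 120 + [10] * 5 + [40] * 30
bases = "A" * len(quals)
trimmed = trim_trace(bases, quals)
print(len(trimmed[0]))  # prints 120
```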

 Mask repeats and map each trace to a unique location - After trimming, each trace was filtered
 and masked for common repetitive sequence elements and was then mapped to a unique location in
 the human genome sequence. The single longest unmasked “anchor” sequence of the trace was
 used to assign each trace to a unique genomic location using Mega-BLAST. After the traces
 were successfully mapped, they were completely unmasked and aligned to their assigned genomic
 locations using the Bl2Seq program (NCBI).
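
 Selecting the longest unmasked “anchor” can be sketched as follows. This assumes the masking
 step replaced repeat bases with the letter N (hard masking); if masking were done with another
 character or with lowercase letters, the pattern would need to change. Python is used for
 illustration (the pipeline used Perl), and the function name and example sequence are
 hypothetical.

```python
import re


def longest_unmasked_anchor(masked_seq):
    """Return (start, end, sequence) for the longest run of non-N bases,
    or None if the whole sequence is masked. The end index is exclusive."""
    best = None
    for match in re.finditer(r"[^N]+", masked_seq):
        if best is None or len(match.group()) > len(best.group()):
            best = match
    if best is None:
        return None
    return best.start(), best.end(), best.group()


# Made-up masked trace: three unmasked runs separated by masked repeats.
print(longest_unmasked_anchor("ACGTNNNNNNACGTACGTACNNNGG"))
# prints (10, 20, 'ACGTACGTAC')
```

 The anchor sequence would then be passed to the mapping tool; the coordinates make it possible
 to place the full, unmasked trace once the anchor's genomic location is known.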

                          Variation Pipeline Continued
 Identify transposon insertion polymorphisms by screening human indels -
 • Indels were identified for which at least 80% of the indel sequence was occupied by a known
 transposon. This step was accomplished by querying a relational database that stored RepeatMasker
 output data for each indel.
 • Next, selected candidates were examined with a custom Perl program to determine whether
 potential target site duplications were present. Such duplications generally flank transposon insertions
 and are hallmarks of most transposons.
 • Candidate transposon insertions also were screened with a custom Perl program to identify potential
 poly(A) tails, which are associated with certain retrotransposons. Finally, the genomic contexts of all
 transposon indel candidates were examined to identify true de novo insertions vs. indels that were
 caused by deletions or duplications within existing transposon copies.
 • All indels that met at least the first test were inspected and curated manually (see supplemental
 Table 1 at http:/ for the final curated set).
 • Validation of the computational pipeline by PCR: Sixty-one transposon insertions were chosen
 arbitrarily from the TSC data set and examined by PCR to evaluate the accuracy of our computational
 pipeline.
 • A total of 68 PCR assays were designed initially. Seven (10%) of these assays failed due to
 technical reasons and these assays were abandoned. The remaining 61 assays (90%) yielded band(s)
 of the expected size(s) and were used to assay DNA samples from the Coriell diversity panel.
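
 The three computational screens in the list above (transposon coverage of at least 80%,
 target site duplications, and poly(A) tails) can be sketched as below. The custom Perl programs
 themselves are not shown in the slides, so this is a simplified illustration and not the
 authors' logic. Python is used for illustration; the function names, the minimum and maximum
 duplication lengths, the minimum poly(A) length, and the example sequences are all assumptions.

```python
import re


def is_mostly_transposon(indel_length, transposon_bases, min_fraction=0.8):
    """True if at least min_fraction of the indel is annotated as transposon."""
    return indel_length > 0 and transposon_bases / indel_length >= min_fraction


def target_site_duplication(left_flank, right_flank, min_len=4, max_len=30):
    """Return the longest sequence that both ends the left flank and begins
    the right flank, or '' if no such repeat of at least min_len exists."""
    limit = min(max_len, len(left_flank), len(right_flank))
    for k in range(limit, min_len - 1, -1):
        if left_flank[-k:] == right_flank[:k]:
            return left_flank[-k:]
    return ""


def poly_a_tail_length(insert_seq, min_len=8):
    """Length of the run of A's at the 3' end of the insert,
    or 0 if the run is shorter than min_len."""
    match = re.search(r"A+$", insert_seq.upper())
    length = len(match.group()) if match else 0
    return length if length >= min_len else 0


# Made-up candidate: 280 of 300 indel bases annotated as transposon.
print(is_mostly_transposon(300, 280))                         # prints True
print(target_site_duplication("CCGGAAGAATG", "AAGAATGTTCC"))  # prints AAGAATG
print(poly_a_tail_length("GGCCGGGCGC" + "A" * 12))            # prints 12
```

 A real screen would also have to handle insertions in the reverse orientation (a poly(T)
 stretch at the opposite end) and imperfect duplications, which this exact-match sketch ignores.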

                                      Key Technologies
                                            Perl: Used extensively to automate data downloads, unpack
                                            files, and parse and filter trace files. Also used as a
                                            meta-tool to run standalone bioinformatics tools (e.g. BLAST,
                                            RepeatMasker, VecScreen) from within a program. Used to
                                            generate reports, figures, and websites.
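
 Running a standalone tool from within a program usually comes down to launching the command,
 capturing its output, and checking its exit status. The sketch below shows that pattern in
 Python for illustration (the project itself did this in Perl). The function name is
 hypothetical, and the command being run is a stand-in so the example works on any machine; no
 actual BLAST or RepeatMasker command line is shown or implied.

```python
import subprocess
import sys


def run_tool(cmd):
    """Run an external command and return its standard output.

    Raises RuntimeError if the command exits with a non-zero status,
    so a failed tool never passes silently into the next pipeline step.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


# Stand-in command: in practice cmd would be the argument list of an installed tool.
output = run_tool([sys.executable, "-c", "print('tool output')"])
print(output.strip())  # prints tool output
```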

                                            MySQL: Used to catalogue RepeatMasker results, indels, and
                                            transposon associations. Also used as the back-end relational
                                            database for the website promoting research results.

                                            Linux provided a flexible command-line environment with file
                                            systems that performed well even under heavy I/O loads, as
                                            well as optimized memory management (> 16 GB RAM).

                                             Popular load management tool used to launch, manage, and
                                             organize the numerous batch jobs for masking, BLASTing, and
                                             aligning candidate indels to the human genome. Also used on
                                             the RSPH cluster and the Ellipse enterprise cluster.

                    Lessons Learned (and still learning)

  • Avoid writing large, monolithic programs - break them up into modular code that can later be
  called from or integrated into other code and pipelines.
  • Write test cases. Write test cases that fail. Never assume that a warning-free execution implies
  correctness of output. Never assume input data is “clean”. Filtering input is a very common activity.
  Try to learn the basics of the Perl debugger. Use a revision control system.
  • Put user documentation in source files. Use full-line comments to explain an algorithm. Comment
  anything that has puzzled you or tricked you, as someone else may inherit your code or try to use it.
  • Clean up your mess. Have your program remove temporary/scratch files once they are no longer
  needed. Disk space is not infinite. Check return codes from I/O operations (e.g. open, close) and
  array indices to avoid unexpected problems with data.
  • Associative arrays / hashes in Perl are great. Learn them! Use them! You can also create on-disk
  hashes to store large datasets not yet ready for a relational database. Get a copy of the “Perl
  • Learn basic Linux system commands to check performance and check output progress (e.g. top,
  ps, du, df, find). BACK UP YOUR CODE AND DATA! (Flash drives are cheap.)
  • Don’t use MySQL as a large spreadsheet. If the data is not relational then don’t bother with a
  database technology. That said, learning to model relationships within data is invaluable. Always
  think about how others would like to mine your output.
  • Keep your working directory organized. Files accumulate rapidly. Adopt a naming convention
  ASAP (one for source code, another for data files, another for sub-directories).
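
  Two of the lessons above, on-disk hashes and cleaning up scratch files, can be shown together
  in a short sketch. Python is used for illustration (the lessons refer to Perl hashes); here
  the standard-library shelve module stands in for a disk-backed hash, and a temporary
  directory that is removed automatically stands in for scratch space. The function name and
  the example reads are hypothetical.

```python
import os
import shelve
import tempfile


def count_bases_on_disk(reads):
    """Tally base counts in a disk-backed dictionary, then clean up.

    The tallies live in a file rather than in memory, so the same
    pattern works when the table is too large to hold in RAM.
    """
    with tempfile.TemporaryDirectory(prefix="scratch_") as scratch_dir:
        db_path = os.path.join(scratch_dir, "counts")
        with shelve.open(db_path) as counts:
            for read in reads:
                for base in read:
                    counts[base] = counts.get(base, 0) + 1
            result = dict(counts)  # copy out before the file is closed
    # The scratch directory and its files have been removed at this point.
    return result


totals = count_bases_on_disk(["ACGT", "AACC"])
print(sorted(totals.items()))  # prints [('A', 3), ('C', 3), ('G', 1), ('T', 1)]
```

  Using context managers for both the database file and the scratch directory means the cleanup
  happens even if the loop raises an error partway through.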
