Next Generation Sequencing Tools and Algorithms

Instruction Mode: 1 Lecture + 2 Tutorials + 1 Practice

The objective of the course: To get acquainted with high-throughput sequencing data and its processing. Because such data often pose big-data problems, the existing algorithms for tackling them will be discussed, along with the limitations and gaps of each technique. This will enable students to think more deeply about string-processing techniques and to come up with novel approaches to processing genomic strings.

The outcome of the course: Trained individuals with basic know-how of string-processing techniques and a good understanding of the tools used for analyzing such data.

 

Component | Unit | Topics for Coverage
Component 1 | Unit 1 | DNA sequencing, strings, and matching: DNA sequencers and their working principle, DNA as a string. Parsing and manipulating real genome sequences and real DNA sequencing data. Naive exact matching, homology detection, optimal pairwise sequence alignment, alignment score statistics, efficient database searches (BLAST), data science of metabolomics, pathway models.
Component 1 | Unit 2 | Preprocessing, indexing and approximate matching: Improving on naive exact matching with Boyer-Moore. Preprocessing and indexing. Indexing through grouping and ordering, k-mers and k-mer indexes. Approximate matching and the pigeonhole principle. Edit distance, assembly, overlaps: Hamming and edit distance. Algorithms for computing edit distance. Dynamic programming. Global and local alignment. De novo assembly. Overlaps and overlap graphs.
Component 2 | Unit 3 | Algorithms for assembly: Shortest common superstring and the greedy version. How repetitive DNA makes assembly difficult. De Bruijn graphs and Eulerian walks. How real assemblers work. The future of assembly.
Component 2 | Unit 4 | Data variability and replication, data transforms, clustering, dimension reduction, pre-processing and normalization, linear models with categorical covariates, logistic regression, null and alternative hypothesis analysis, false discovery rate, permutation and bootstrapping, the Gene Expression Omnibus (GEO) repository.
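
As an illustration of the Unit 2 topics on Hamming distance, edit distance and dynamic programming, the following is a minimal Python sketch. It is not taken from the course material; the function names `hamming_distance` and `edit_distance` are hypothetical, and unit costs for substitution, insertion and deletion are an assumption.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count mismatching positions between two strings of equal length."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Edit (Levenshtein) distance by dynamic programming.

    Assumes unit cost for each substitution, insertion and deletion.
    Only two rows of the DP table are kept, so memory is O(len(y)).
    """
    # prev[j] holds the distance between the current prefix of x and y[:j];
    # initially the prefix of x is empty, so the distance is j insertions.
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # distance between x[:i] and the empty prefix of y
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming_distance("GATTACA", "GACTATA"))
    print(edit_distance("GATTACA", "GCATGCT"))
```

Hamming distance only compares strings of equal length position by position, whereas edit distance also allows insertions and deletions, which is why it needs the dynamic-programming table.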

 

References/Books:

  1. DNA Sequencing: From Experimental Methods to Bioinformatics by Luke Alphey
  2. Analytical Techniques in DNA Sequencing by Veena Kumari
  3. Next-Generation Sequencing Data Analysis by Xinkun Wang
  4. Primer to Analysis of Genomic Data Using R (Use R!) by Cedric Gondro