Analysis of Biological Sequences (140.638.01)

Synopsis

In genomic data analysis and computational biology, a good understanding of the probabilistic nature and statistical modeling of biological sequences, such as nucleotide and protein sequences, is key both to understanding existing algorithms and tools and to developing new ones. This course provides statistical foundations and an in-depth overview of the core algorithms of sequence analysis. The algorithms covered include alignment and motif finding (optimal pairwise local alignment, i.e. Smith-Waterman; heuristic local alignment such as BLAST; pairwise global alignment; and multiple alignment), gene finding (Glimmer), protein structure prediction, and phylogenetic trees. Background topics include probability (including conditional probabilities and Bayes' rule), Markov models, and hidden Markov models.
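As a rough illustration of the Markov model material named in the synopsis, the sketch below computes the log-likelihood of a DNA sequence under a first-order Markov chain. It is written in Python for brevity; the function name `markov_log_likelihood` and the uniform parameters are assumptions made for this example and are not taken from the course materials.

```python
import math

BASES = "ACGT"

# Hypothetical parameters for illustration only: a uniform start
# distribution and uniform transition probabilities between bases.
UNIFORM_INITIAL = {b: 0.25 for b in BASES}
UNIFORM_TRANSITION = {a: {b: 0.25 for b in BASES} for a in BASES}


def markov_log_likelihood(seq, initial, transition):
    """Log-probability of seq under a first-order Markov chain.

    initial[x] is P(first symbol = x);
    transition[x][y] is P(next symbol = y | current symbol = x).
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    # Start with the probability of the first symbol.
    log_p = math.log(initial[seq[0]])
    # Add one transition term per adjacent pair of symbols.
    for prev, curr in zip(seq, seq[1:]):
        log_p += math.log(transition[prev][curr])
    return log_p
```

Under the uniform parameters every sequence of length n has log-likelihood n * log(0.25); comparing log-likelihoods under two different sets of transition probabilities gives a simple log-odds score for discriminating between sequence classes.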

 


Second Term, 2005-2006 (some details below will be updated)

Official course catalog entry

MPH degree program of Bioinformatics

Instructors:

Sining Chen (sichen@jhsph.edu) and Guests

Course time and location:

Monday 3:30 – 4:50pm at Wolfe 4013;

Thursday 4:25 – 5:30pm at Wolfe 4013 (new time & location!)

Office hours:

Monday 2:30 – 3:30pm (before class) in W7033A, and by appointment

Required text:

Biological Sequence Analysis, by R. Durbin, S. Eddy, et al. 

Supplementary texts:

Bioinformatics, by David Mount (more biology oriented); Statistical Methods in Bioinformatics, by W. Ewens and G. Grant (more quantitative)

Prerequisites:

basic probability; some programming 

Grades:

Student grades will be based 60% on homework, 30% on a presentation and written critique of a paper, and 10% on attendance.

Reading assignment

Homework assigned by each Monday is due the following Monday.

For information on computing in R, please click here.                               


Syllabus

N = lecture Notes; R = References; P = Problems;

Date     N          R    P    Topic
Oct 28   (2.9MB)              Overview of the course; basic molecular biology terminology. Review of useful probability concepts: random variables, conditional probability, expectation & variance.
Oct 31   (0.7MB)              Pairwise alignment: dot matrix; Needleman-Wunsch (global alignment); Smith-Waterman (local alignment)
Nov 3    (1.4MB)              Significance of alignment scores; development of scoring matrices
Nov 7    (1MB)                Multiple sequence alignment
Nov 10   (1MB)                Phylogenetic trees: UPGMA, neighbor-joining
Nov 14   (0.3MB)              Phylogenetic trees (continued): parsimony, likelihood approach, comparison to other methods
         (0.2MB)
Nov 17                        Guest lecture: Ingo Ruczinski on protein structure prediction from amino acid sequences
Nov 21                        Continued from last week: Ingo Ruczinski on protein structure prediction from amino acid sequences
Nov 24                        Thanksgiving Break
Nov 28   (629k)               Database search: BLAST. Jon: Bayesian bootstrap in evaluating alignment matrix performance
Dec 1                         Alex: likelihood approach for morphological phylogenetics. Deepti: selection of oligos
Dec 5    (1.1M)               HMM. Tao: protein function prediction using phylogenomics
Dec 8    (143K)               HMM, continued. Euiju: selection strength
Dec 12   (1.3M)               HMM in gene finding; GLIMMER. Matt: first generation HMM
Dec 15                        Gene finding, general. No student presentation.
Dec 19                        No class. Homework and paper summary due.
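
To give a flavor of the pairwise alignment topic (Oct 31), here is a minimal Python sketch of the dynamic-programming recurrences behind Needleman-Wunsch (global) and Smith-Waterman (local) alignment scoring. The function names and the scoring values (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for illustration, not the course's own settings; real applications would typically use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b (linear gap cost)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets a local alignment restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best
```

Both functions keep only two rows of the score matrix, so they return the optimal score but not the alignment itself; recovering the alignment would require storing the full matrix and tracing back from the final (global) or maximal (local) cell.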