The Big ORF theory: The Algorithm Implementation and Triplet Approximation

Friday, Nov. 28, 2014, 4-5 p.m.
SN-2067

BGSA/Biology Departmental Seminar

Steve Carr, Departments of Biology and Computer Science (cross-appointment) (with Todd Wareham, Dept of Computer Science & Donald Craig, eHealth Research Unit, Faculty of Medicine), Memorial University of Newfoundland, St John’s NL.

In the genomic era, a frequent data-mining task is the exploration of double-stranded DNA sequences for the occurrence of protein-coding regions. The expectation is that five of the six possible three-letter reading frames will be “closed” by one or more “stop” triplets in the Genetic Code: the sixth will be an “Open Reading Frame” (ORF) without stops that specifies a polypeptide sequence. The same constraint can be built into short (L < 25 bp) exemplars used to teach “ORF finding” in genetics and bioinformatics education. Because there are 4^L possible sequences of length L, however, the search for such examples is tedious.
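The “5 & 1” condition described above can be sketched in a few lines of Python. This is only an illustrative check, not the authors' implementation: it assumes the standard stop triplets TAA, TAG, and TGA, counts only complete triplets in each frame, and does not require a start codon. The function names and the 15-bp test sequence are hypothetical and were constructed by hand for this example.

```python
# Stop triplets of the standard Genetic Code (assumed here).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def frames(seq):
    """Yield the six reading frames as lists of complete triplets.

    Frames 0-2 are offsets 0, 1, 2 on the given strand;
    frames 3-5 are offsets 0, 1, 2 on the reverse complement.
    """
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            yield [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]


def open_frames(seq):
    """Return the indices (0-5) of frames that contain no stop triplet."""
    seq = seq.upper()
    return [idx for idx, codons in enumerate(frames(seq))
            if not STOPS.intersection(codons)]


def is_five_and_one(seq):
    """True if exactly one of the six frames is open."""
    return len(open_frames(seq)) == 1


if __name__ == "__main__":
    example = "TTAACTAGTCAGCGC"  # hypothetical hand-built exemplar
    print(open_frames(example), is_five_and_one(example))
```

Under these assumptions the example sequence has a stop in every frame except the first forward frame. A stricter definition (for instance, one that requires a start codon) would change which sequences qualify.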

Over beer, I propounded “Carr’s Conjecture,” that there is an upper bound on L for which no sequences satisfy the “5 & 1” condition, and asked for an algorithm to construct exemplars of any desired length.

Todd Wareham and Don Craig implemented a two-stage recursive search algorithm that identifies such sequences, and an exhaustive algorithm that enumerates them. There are no solutions for L ≤ 10, and 96 for L = 11; enumeration is limited by exponential CPU requirements for L > 25. We developed a “triplet approximation” that models DNA sequences as sets of triplets of length T = L/3, with “5 & 1” constraints on “Go” and “Stop” triplets. The results for higher values of L are unexpected. The models have implications for the optimum size of random DNA sequences recruited as functional domains, and for the evolution of alternative Genetic Codes with 1, 2, 3, and 4 “stop” codons.
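To give a feel for the exponential cost of exhaustive enumeration, the short sketch below counts the raw number of candidate sequences, 4^L, for a few lengths. It only illustrates the size of the search space; it says nothing about how the actual algorithm prunes or orders its search, and the function name is hypothetical.

```python
def search_space(length):
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** length


if __name__ == "__main__":
    for length in (11, 15, 20, 25):
        print(f"L = {length:2d}: {search_space(length):,} candidate sequences")
```

Each additional base multiplies the number of candidates by four, so L = 25 already involves 2^50 sequences, compared with about 4.2 million at L = 11.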

The recursive algorithm has been implemented as an educational web application (“RandomORF”) available at [http://www.ucs.mun.ca/~donald/orf/bioscience/]. The two-stage display presents a sequence with a “5 & 1” solution, and allows it to be worked “cold” by students, with the correct ORF identified afterward (Carr, Craig, & Wareham. 2014. CBE Life Sci Educ, 13, 56).

