The Big ORF theory: The Algorithm Implementation and Triplet Approximation
BGSA/Biology Departmental Seminar
Steve Carr, Departments of Biology and Computer Science (cross-appointment) (with Todd Wareham, Dept of Computer Science & Donald Craig, eHealth Research Unit, Faculty of Medicine), Memorial University of Newfoundland, St. John's NL.
In the genomic era, a frequent data-mining task is the exploration of double-stranded DNA sequences for the occurrence of protein-coding regions. The expectation is that five of the six possible three-letter reading frames will be closed by one or more stop triplets in the Genetic Code: the sixth will be an Open Reading Frame (ORF) without stops that specifies a polypeptide sequence. The same constraint can be built into short (L < 25 bp) exemplars used to teach ORF finding in genetics and bioinformatics education. Where there are 4^L possible sequences, however, the search for such examples is tedious.
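As an illustration of the constraint described above, the following is a minimal sketch in Python of a check for five closed frames and one open frame. It rests on assumptions that are not stated in the abstract: the standard Genetic Code stops (TAA, TAG, TGA), frames read across the whole sequence with any incomplete trailing triplet ignored, and no requirement for a start codon. The function names are hypothetical, and the test sequence was constructed by hand for this sketch; it is not taken from the authors' software.

```python
# Stop triplets of the standard Genetic Code (an assumption of this sketch).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def frame_is_closed(strand, offset):
    """True if the frame starting at `offset` contains at least one stop triplet."""
    return any(strand[i:i + 3] in STOPS for i in range(offset, len(strand) - 2, 3))


def closed_frames(seq):
    """Closed/open status of the six frames: three forward, then three reverse."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [frame_is_closed(s, o) for s in strands for o in range(3)]


def is_five_and_one(seq):
    """True if exactly five frames are closed and one is open."""
    return sum(closed_frames(seq)) == 5


# Hand-built 11-bp example: only the first forward frame (TTA ACT AGT) is open.
print(closed_frames("TTAACTAGTCA"))  # [False, True, True, True, True, True]
```

Under these assumptions a frame with no complete triplet counts as open, so very short sequences cannot satisfy the condition.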
Over beer, I propounded Carr's Conjecture, that there is an upper bound on L for which no sequences satisfy the 5 & 1 condition, and asked for an algorithm to construct exemplars of any desired length.
Todd Wareham and Don Craig implemented a two-stage recursive search algorithm that identifies such sequences, and an exhaustive algorithm that enumerates them. There are no solutions for L ≤ 10, and 96 for L = 11: enumeration is limited by exponential CPU requirements for L > 25. We developed a triplet approximation that models DNA sequences as sets of triplets of length T = L/3, with 5 & 1 constraints on Go and Stop triplets. The results for higher values of L are unexpected. The models have implications for the optimum size of random DNA sequences recruited as functional domains, and for the evolution of alternative Genetic Codes with 1, 2, 3, and 4 "stop" codons.
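The idea of exhaustive enumeration can be sketched as a brute-force loop over all 4^L sequences of a given length. This is not the authors' implementation, whose details are not given here; it is a minimal Python sketch under assumed conventions (standard stops TAA, TAG, TGA; whole-sequence frames; no start codon required), and the names count_closed and enumerate_five_and_one are hypothetical.

```python
from itertools import product

# Assumed conventions: standard stop triplets, frames span the whole sequence.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_closed(seq):
    """Number of the six reading frames that contain at least one stop triplet."""
    strands = (seq, seq.translate(COMPLEMENT)[::-1])
    return sum(
        any(s[i:i + 3] in STOPS for i in range(offset, len(s) - 2, 3))
        for s in strands
        for offset in range(3)
    )


def enumerate_five_and_one(length):
    """Yield every sequence of the given length with exactly five closed frames."""
    for letters in product("ACGT", repeat=length):
        seq = "".join(letters)
        if count_closed(seq) == 5:
            yield seq


# Very short lengths cannot close five frames, so nothing is yielded.
print(len(list(enumerate_five_and_one(6))))  # 0
```

The loop visits 4^L candidates (about 4.2 million at L = 11), which makes the exponential cost plain. Whether this sketch reproduces the reported counts depends on the exact definition of the 5 & 1 condition used by the authors.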
The recursive algorithm has been implemented as an educational web application (RandomORF) available at [http://www.ucs.mun.ca/~donald/orf/bioscience/]. The two-stage display presents a sequence with a 5 & 1 solution, and allows it to be worked cold by students, with the correct ORF identified afterward (Carr, Craig, & Wareham. 2014. CBE Life Sci Educ, 13, 56).