Life Sciences Division
Advances in computational sciences have facilitated approaches based on modeling and simulation for the prediction of behavior in many biological systems. The Computational Biosciences Section is an interdisciplinary, multi-investigator program designed to integrate key elements of the biosciences with expertise in computing, Internet tools, intelligent systems, and bioinformation systems. The section's research program covers several high-impact scientific areas, including methods for gene discovery and the analysis of genomes; simulation and modeling of protein folding; functional and structural classification using artificial intelligence-based techniques; systems for computational forensics and video processing; medical applications of intelligent systems; and the development of bioinformatics tools and systems for use by the research community.
The interpretation of sequenced genomes represents the next great challenge at the interface of computing and biology. It provides knowledge of immeasurable value to medical research, biotechnology, the pharmaceutical industry, and researchers in fields ranging from microorganism metabolism to structural biology to bioremediation. The section is developing new ways of recognizing and understanding the many important features of genomes and is defining the infrastructure necessary for high-quality, comprehensive analysis and sequence annotation. A comprehensive framework for analysis and annotation is being constructed from a distributed, interoperable system of analysis servers and biodatabases in a project called the "Genome Annotation Consortium." The section has designed and continues to develop major online analytical systems for genomics, such as the GRAIL® and genQuest Internet servers.
The generation of new genome sequence and functional data is also a focus of the section, which has an emerging effort to develop high-throughput sequencing capabilities and advanced sequencing automation. Given the flood of genome data, improving the efficiency of protein characterization based on protein sequence is a high priority. The section is developing new algorithms and tools for protein structure prediction and functional classification using a variety of techniques such as neural networks, protein threading, global optimization, genetic algorithms, and molecular dynamics.
With the increased amount and complexity of biodata, an important emphasis is the architectural design and construction of integrated analysis systems, databases, and knowledge bases for such data. The section addresses fundamental research issues related to the design, development, and maintenance of bioinformation databases, Internet tools, query systems, and other types of bioresources. Research issues include biodata representation, database and analysis system interoperation, data integration, automated agent-based data retrieval and update technologies, semantics, data mining, and knowledge discovery methods.
Selected Accomplishments
The GRAIL® EXP High Performance Gene Modeling System. GRAIL® EXP is the world's most advanced and accurate sequence-based gene modeling code. It provides a unique capability to determine the structure of multiple genes in sequences using pattern recognition and ESTs. Significant changes in the structure of the GRAIL® EXP codes and advancements in the algorithms have made it possible to process the current 150 million base pairs of human genomic DNA for gene content. Most recently, the codes were restructured to improve accuracy and performance, and modularization into smaller logical units has enabled the system to be incorporated into a streamlined client/server architecture.
Tests of the code on the Paragon 150 and on workstation clusters have been completed, with the database search components of GRAIL® EXP parallelized using MPI (Message Passing Interface) and PVM (Parallel Virtual Machine) to allow execution on multiple processors. These scaling tests showed near-linear speedup and are a prerequisite to providing community-wide Internet access to Paragon-based GRAIL® EXP services by January 1, 1999.
Globally Optimal Protein Threading. Predicting the folded state of a protein sequence is essential to understanding the function of the protein. Protein threading, a technique for finding a globally optimal sequence-structure alignment, is considered one of the most promising computational techniques for protein tertiary structure prediction. Until now, no computer algorithm has promised to solve the protein threading problem in its full complexity, owing to its overwhelming computational complexity and the lack of understanding of the problem. Researchers have recently, for the first time, developed a rigorous polynomial-time algorithm that can realistically solve the protein threading problem under a condition that is widely accepted by the protein threading community. This research has also led to a number of discoveries about the computational properties of the protein threading problem, which potentially have significant theoretical implications. These algorithms, as leading computational biologist Dr. Richard Lathrop commented, "are innovative advances on the state of the art and represent much needed progress in the field." A summary of the initial research results was published in the Journal of Computational Biology, Fall 1998, as an invited paper.
The PROSPECT Protein Structure Prediction Toolkit. A suite of computational tools for protein structure prediction was recently developed by the Computational Protein Structure Group.
These computational tools include a new polynomial-time protein threading program, a fast probabilistic protein threading program, an innovative protein secondary structure prediction program, and a number of new and effective methods for modeling and computing protein potential energies. A computer package called Protein Structure Prediction and Evaluation Computer Toolkit (PROSPECT), consisting of these tools, allows a user to predict a protein tertiary structure based on recognition of structures or partial structures in a large database of known protein structures that are potentially similar to the unknown. Using this package, the Computational Protein Structure Group has made predictions for all 43 target proteins in the recent community-wide protein structure prediction contest CASP-3. A number of research groups from organizations such as NIH, Lawrence Berkeley Laboratory, Amgen, and Boston University have expressed interest in utilizing the system in their research and in collaborative development related to the project.
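The idea of threading as a sequence-structure alignment can be illustrated with a deliberately simplified sketch. It is not the section's polynomial-time algorithm and not part of PROSPECT: it scores only per-position ("singleton") fitness and ignores pairwise contact energies between residues, which are what make the full problem hard. The two-letter environment alphabet ("H" for buried, "P" for exposed), the hydrophobic residue set, the energy values, and the gap penalty are all assumptions chosen for illustration.

```python
# Simplified threading: align a sequence to template positions by dynamic
# programming, minimizing a singleton (per-position) energy.
# All scoring values below are hypothetical, not taken from PROSPECT.

HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue classes


def singleton_energy(residue, environment):
    """Return -1.0 when the residue type suits the environment, else +1.0."""
    buried = environment == "H"
    return -1.0 if (residue in HYDROPHOBIC) == buried else 1.0


def thread(sequence, environments, gap_penalty=1.0):
    """Return (minimum energy, list of (residue index, template index) pairs)."""
    n, m = len(sequence), len(environments)
    energy = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        energy[i][0] = i * gap_penalty
        move[i][0] = "skip_residue"
    for j in range(1, m + 1):
        energy[0][j] = j * gap_penalty
        move[0][j] = "skip_position"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            fit = singleton_energy(sequence[i - 1], environments[j - 1])
            options = [
                (energy[i - 1][j - 1] + fit, "align"),
                (energy[i - 1][j] + gap_penalty, "skip_residue"),
                (energy[i][j - 1] + gap_penalty, "skip_position"),
            ]
            # min() keeps the first minimal option, so ties prefer "align".
            energy[i][j], move[i][j] = min(options, key=lambda o: o[0])
    # Trace back from the bottom-right cell to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "align":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_residue":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return energy[n][m], pairs


if __name__ == "__main__":
    print(thread("LK", "PHP"))
```

Because every template position is scored independently here, the table can be filled in time proportional to the sequence length times the template length. Adding pairwise interaction terms breaks that independence, which is why a rigorous algorithm for the full problem is a significant result.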
Improvements in GRAIL® Spliced Message Prediction: A New Approach to Splice Site Recognition. Once genes have been located in sequences, accurately predicting the detailed structure of each gene by computation is still a considerable challenge. A major problem is the false prediction of donor and acceptor splice sites because of the large number of false signals in DNA sequences. A new clustering technique was used to better separate real signals from noise. It subpartitions splice sites (for example, donors) into a number of clusters with similar properties, where the set of clusters reflects the diversity of donor classes. Two different strategies for cluster initialization were tried: a greedy algorithm and random selection. The greedy algorithm gave better results in terms of inter- and intra-cluster distances. The scoring function for evaluating a candidate donor site is based on a linear combination of weighted distances between the candidate and each cluster. This approach outperforms the standard GRAIL® method, and combining this clustering technique with the original GRAIL® donor prediction system provides even greater improvement in splice site accuracy.
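The clustering and scoring scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the GRAIL® implementation: the text does not specify the distance measure, the greedy rule, or the weights, so the sketch uses Hamming distance between short site strings, farthest-point selection as the greedy initialization, mean distance to cluster members as the candidate-to-cluster distance, and equal placeholder weights. The example sites are toy data.

```python
# Sketch of cluster-based donor site scoring. Distance measure, greedy rule,
# and weights are assumptions; the source does not specify them.

def hamming(a, b):
    """Number of positions at which two equal-length site strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_seeds(sites, k):
    """Greedy initialization: repeatedly pick the site farthest from all seeds."""
    seeds = [sites[0]]
    while len(seeds) < k:
        seeds.append(max(sites, key=lambda s: min(hamming(s, c) for c in seeds)))
    return seeds


def cluster_sites(sites, k):
    """Assign every site to its nearest seed and drop any empty cluster."""
    seeds = greedy_seeds(sites, k)
    clusters = [[] for _ in seeds]
    for site in sites:
        nearest = min(range(len(seeds)), key=lambda idx: hamming(site, seeds[idx]))
        clusters[nearest].append(site)
    return [cluster for cluster in clusters if cluster]


def cluster_distance(candidate, cluster):
    """Mean Hamming distance from the candidate to the members of one cluster."""
    return sum(hamming(candidate, member) for member in cluster) / len(cluster)


def donor_score(candidate, clusters, weights):
    """Linear combination of weighted candidate-to-cluster distances.

    Lower scores mean the candidate looks more like the known donor sites.
    """
    return sum(w * cluster_distance(candidate, c) for w, c in zip(weights, clusters))


if __name__ == "__main__":
    known_donors = ["GTAAGT", "GTAAGA", "GTGCGT", "GTGCGA"]  # toy data
    clusters = cluster_sites(known_donors, k=2)
    weights = [1.0 / len(clusters)] * len(clusters)  # placeholder weights
    for candidate in ("GTAAGT", "CCCCCC"):
        print(candidate, donor_score(candidate, clusters, weights))
```

With equal weights the score reduces to an average distance over clusters. In a real system the weights would presumably be fitted so that different donor classes contribute differently.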
Progress in Genome Annotation. ORNL and the Genome Annotation Consortium have made progress in building a computational resource that will help us understand the basic biology of humans, microbes, and other biological organisms. This resource will include a comprehensive genome-wide analysis of genome sequence data from different organisms and will facilitate the integration of biological data around a genome-sequence framework. The need for a comprehensive genome analysis process is pressing. Several organisms already have complete sequence data available; the sequences of others will be completed soon. The human genome and a considerable portion of the mouse genome will likely be completed in a few years. However, molecular biologists, medical researchers, environmental biologists, biotechnologists, and others will be unable to make the best use of these sequences without a more organized and comprehensive computational analysis. ORNL and the Genome Annotation Consortium are building this needed genome analysis framework. The steps in this computational process are: (1) retrieving biological data and assembling genomes; (2) computing genes, proteins, and genome features from sequences and experimental data; (3) computing homology and function among genomes, genes, and gene products; (4) modeling the three-dimensional structure of gene products; and (5) linking genes and gene products to biological pathways and systems. We have built infrastructure to keep up with the data flow in steps 1 and 2, along with processes that add value to the sequence data, but scaling up to the data flow of the next phase will require considerable enhancements. These computational analysis results, together with the relevant experimental results, need to be stored and made accessible to researchers in several ways. We have made considerable progress on some data management, data storage, and data access issues. We have constructed one method of data access (Genome Channel) that is currently being used by the community, and other data access methods will be released over the next fiscal year.
The Genome Channel Interface and Analysis Framework. The Genome Channel is a high-throughput distributed computational environment providing the genome community with various services, tools, and infrastructure for high-quality analysis and annotation of large-scale genome sequence data. The Genome Channel provides the only current and comprehensive assembled view of the human genome and attaches high-quality computational annotation to this data on a consistent basis. Users can access graphical and text-based interfaces for a comprehensive view of the genome information, including known genes and predicted gene structures, gene relatives in sequence databases, and links to function information about new genes. The Genome Channel browser provides a highly intuitive view of the data at various levels of detail. An automated system has been constructed that provides and updates various kinds of computational analysis on genomes in the Genome Channel repository. It schedules the analysis of sequence contigs by the supported analysis tools in a concurrent, pipelined fashion. Software agents (search and update agents) perform automated data retrieval, collation, assembly, and fusion to link relevant function information from many databases to newly discovered genes. Compute-intensive analysis tools use distributed processing systems (namely, PVM or MPI) to achieve speedup by distributing subtasks among a cluster of workstations or MPP machines such as the Intel Paragons.
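The concurrent scheduling of contigs across analysis tools can be sketched as below. This is only an analogy to the system described: the Genome Channel distributes work with PVM or MPI across workstations or MPP machines, whereas this sketch substitutes a local thread pool from the Python standard library. The two analysis functions, the contig names, and the result layout are hypothetical stand-ins for real analysis tools.

```python
# Sketch: run every analysis tool on every contig concurrently and collect
# the results per contig. A local thread pool stands in for PVM/MPI.
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Fraction of bases that are G or C (toy analysis tool)."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def count_start_codons(sequence):
    """Number of ATG triplets in any frame (toy analysis tool)."""
    return sum(sequence[i:i + 3] == "ATG" for i in range(len(sequence) - 2))


def annotate_contigs(contigs, tools, max_workers=4):
    """Return {contig name: {tool name: result}} for all contig/tool pairs."""
    results = {name: {} for name in contigs}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one independent subtask per (contig, tool) pair.
        futures = {
            pool.submit(tool, sequence): (name, tool_name)
            for name, sequence in contigs.items()
            for tool_name, tool in tools.items()
        }
        # Gather results; each subtask is independent of the others.
        for future, (name, tool_name) in futures.items():
            results[name][tool_name] = future.result()
    return results


if __name__ == "__main__":
    contigs = {"contig_1": "ATGCATGC", "contig_2": "ATATAT"}
    tools = {"gc_content": gc_content, "start_codons": count_start_codons}
    print(annotate_contigs(contigs, tools))
```

Because each contig/tool pair is independent, the same decomposition maps naturally onto message-passing workers, which is consistent with the near-linear speedup reported for the parallelized database search.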