CURRENT PROTOCOLS IN BIOINFORMATICS
March 2002

RECENTLY PUBLISHED:

UNIT 1.2 Searching Online Mendelian Inheritance in Man (OMIM) for Genetic Loci Involved in Human Disease (Andreas D. Baxevanis, National Human Genome Research Institute, Bethesda, Maryland). Online Mendelian Inheritance in Man (OMIM) is a non-sequence-based information resource that can be of tremendous use to genomics researchers, physicians, and patients. OMIM is the electronic version of the catalog of human genes and genetic disorders. It provides concise textual information from the literature on most human conditions having a genetic basis, as well as pictures illustrating the condition or disorder (where appropriate) and full citation information. This unit gives an overview of the OMIM database, the layout of the records, and the information that is available within each entry.

UNIT 2.2 Using the Blocks Database to Recognize Functional Domains (Jorja G. Henikoff, Elizabeth A. Greene, Nick Taylor, and Steven Henikoff, Fred Hutchinson Cancer Research Center, Seattle, Washington; and Shmuel Pietrokovski, Weizmann Institute of Science, Rehovot, Israel). Blocks are ungapped multiple alignments of related protein sequence segments that correspond to the most conserved regions of the proteins. The Blocks Database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins. Protocols in this unit describe the analysis of proteins and families using Blocks-based tools, including searching, exploring relationships with trees, making new blocks, and designing PCR primers from blocks for isolating homologous sequences.

UNIT 2.3 Multiple Sequence Alignment Using ClustalW and ClustalX (Julie D. Thompson, Institut de Genetique et de Biologie Moleculaire et Cellulaire, Illkirch Cedex, France; Toby J. Gibson, European Molecular Biology Laboratory, Heidelberg, Germany; and Des G. Higgins, University College, Cork, Ireland).
The Clustal programs are widely used for carrying out automatic multiple alignment of nucleotide or amino acid sequences. The most familiar version is ClustalW, which uses a simple text menu system that is portable to nearly all computer systems. ClustalX features a graphical user interface and some powerful graphical utilities for aiding the interpretation of alignments and is the preferred version for interactive usage. Users may run Clustal remotely from several sites using the Web, or the programs may be downloaded and run locally on PCs, Macintosh, or Unix computers. The protocols in this unit discuss how to use ClustalX and ClustalW to construct an alignment and to create profile alignments by merging existing alignments.

UNIT 3.1 An Overview of Sequence Similarity ("Homology") Searching (Daniel B. Davison, Bristol-Myers Squibb Pharmaceutical Research Institute, Hopewell, New Jersey). Sequence similarity searching is an essential tool for molecular biologists. It is used to support inference of protein function and for phylogenetic analysis. Every search procedure requires some understanding of the underlying principles so that, at the very least, the investigator's selection of parameters is correct. The principles underlying the most commonly used procedures are presented in this unit. The discussion is intentionally non-mathematical, but does contain references for those who desire a mathematical and statistical discussion of these procedures.

UNIT 3.2 Finding Homologs to Nucleic Acid or Protein Sequences Using the Framesearch Program (Matthew Healy, Bristol-Myers Squibb, Wallingford, Connecticut). Framesearch is an extension of the classic Smith-Waterman pairwise sequence comparison algorithm.
When the classic Smith-Waterman search algorithm is used to compare a nucleotide query sequence against a database of peptide sequences, single-nucleotide indels (INsertion or DELetion errors) are not taken into account because the alignments are between whole codons translated into amino acids. The Framesearch algorithm, by contrast, includes the possibility of a frameshift error in its alignment algorithm, and therefore can find alignments that span different reading frames. Basic Protocol 1 in this unit describes the use of Framesearch to search a protein sequence database for sequences that are similar to a query nucleotide sequence. Basic Protocol 2 describes the use of Framesearch to search a nucleotide sequence database for sequences that are similar to a query protein sequence. Three Alternate Protocols describe ways to improve the speed of Framesearch and thus make it practical for routine use. Framesearch is especially appropriate for low-quality single-read nucleotide sequence data, such as ESTs (expressed sequence tags) or early drafts of genomic sequences; it does not offer any significant advantage over less CPU-intensive algorithms for relatively high-quality nucleotide sequences without many single-nucleotide insertion or deletion errors.

UNIT 3.3 Finding Homologs to Nucleotide Sequences Using Network BLAST Searches (Istvan Ladunga, Celera Genomics Corporation, Foster City, California). Basic Local Alignment Search Tool (BLAST) can identify possible homologs in nucleotide and protein databases with high sensitivity and selectivity, and at high speed. Those homologs may provide inference for the biochemical function, the exon boundaries, the domain architecture, the secondary and tertiary structure of the protein, and many other features. The purpose of this unit is not restricted to providing optimal ways of applying the BLAST tool. Just running a BLAST search is an easy task and still can be of great service to researchers.
The unit details how to fine-tune the arguments of the programs, allowing the user to take advantage of frequently overlooked capabilities of the tool. It also addresses the pitfalls in over- or under-interpretation of results.

UNIT 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations (Andreas D. Baxevanis, National Human Genome Research Institute, Bethesda, Maryland). Modern biology is on the verge of officially ushering in a new era in science with the completion of the sequencing of the human genome in April 2003. Although often erroneously called the "post-genome era," this will in fact mark the beginning of the "genome era," a time in which the availability of sequence data for many genomes will have a significant effect on how science is performed in the 21st century. This unit offers an overview of many of the gene prediction methods that are currently available and offers a general assessment of how well the methods work for various problems.

UNIT 4.5 Using MZEF to Find Internal Coding Exons (Michael Q. Zhang, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York). MZEF (Michael Zhang's Exon Finder) was designed to help identify one of the most important classes of exons, i.e., the internal coding exons, in human genomic DNA sequences. It is intended neither for predicting intronless genes nor for assembling predicted exons into complete gene models. There is also a mouse version (mMZEF) and an Arabidopsis version (aMZEF). This unit presents the Unix and Web versions of MZEF and reviews how to interpret the MZEF results.

UNIT 4.3 Using geneid to Identify Genes (Roderic Guigo, Universitat Pompeu Fabra, Barcelona, Spain). This unit describes the use of geneid, a very efficient gene-finding program. It allows for the analysis of large genomic sequences, including whole mammalian chromosomes. These sequences can be partially annotated, and geneid can be used to refine this initial annotation.
Parameter configurations exist for a number of eukaryotic species. Geneid produces output in a variety of standard formats. The results can thus be processed by a variety of software tools, including visualization programs. Geneid software is in the public domain, and it is undergoing constant development. It is easy to install and use. Exhaustive benchmark evaluations show that geneid compares favorably with other existing gene-finding tools.

UNIT 6.2 Visualizing Phylogenetic Trees Using TreeView (Roderic D.M. Page, University of Glasgow, Glasgow, UK). TreeView provides a simple way to view the phylogenetic trees produced by a range of programs, such as PAUP*, PHYLIP, TREE-PUZZLE, and ClustalX. While some phylogenetic programs (such as the Macintosh version of PAUP*) have excellent tree printing facilities, many programs do not have the ability to generate publication-quality trees. TreeView addresses this need. The program can read and write a range of tree file formats, display trees in a variety of styles, print trees, and save the tree as a graphic file. Basic Protocols 1 and 2 cover displaying and printing a tree, respectively. The Support Protocols describe how to download and install TreeView, and how to display bootstrap values in trees generated by ClustalX and PAUP*.

UNIT 9.1 Creating Databases for Biological Information: An Introduction (Lincoln D. Stein, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York). The essence of bioinformatics is dealing with large quantities of information. Whether it be sequencing data, microarray data files, mass spectrometric data (e.g., fingerprints), the catalog of strains arising from an insertional mutagenesis project, or even large numbers of PDF files, there inevitably comes a time when the information can simply no longer be managed with files and directories. This is where databases come into play.
This unit briefly reviews the characteristics of several database management systems, including flat file, indexed file, and relational databases, as well as ACeDB. It compares their strengths and weaknesses and offers some general guidelines for selecting an appropriate database management system.

APPENDIX 1C Unix Survival Guide (Lincoln D. Stein, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York). For a mixture of historical and practical reasons, much of the bioinformatics software discussed in this series runs on Linux, Mac OS X, Solaris, or one of the many other Unix variants. This appendix provides the novice with easy-to-understand information needed to survive in the Unix environment.

FORTHCOMING:

UNIT 1.3 Searching the NCBI Databases using Entrez (Juliane Murphy and Andreas D. Baxevanis, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland). One of the most widely used interfaces for the retrieval of information from biological databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are pre-existing, logical relationships between the individual entries found in numerous public databases. The existence of such natural connections, mostly biological in nature, argued for the development of a method through which all the information about a particular biological entity could be found without having to sequentially visit and query disparate databases. Two Basic Protocols describe simple, text-based searches, illustrating the types of information that can be retrieved through the Entrez system. An Alternate Protocol builds upon the first Basic Protocol, using additional, built-in features of the Entrez system, as well as providing alternative ways to issue the initial query. The Support Protocol reviews how to save frequently issued queries.

UNIT 2.4 Discovering Novel Sequence Motifs with MEME (Timothy L.
Bailey, ACMC/Mathematics Department, The University of Queensland, Brisbane, Australia). This unit illustrates how to use MEME to discover motifs in a group of related nucleotide or peptide sequences. A MEME motif is a sequence pattern that occurs repeatedly in one or more sequences in the input group. MEME can be used to discover novel patterns because it bases its discoveries only on the input sequences, not on any prior knowledge (such as databases of known motifs). The input to MEME is a set of unaligned sequences of the same type (peptide or nucleotide). For each motif it discovers, MEME reports the occurrences (sites), consensus sequence, and the level of conservation (information content) at each position in the pattern. MEME also produces block diagrams showing where all of the discovered motifs occur in the training set sequences. This illuminates the spatial arrangement of protein domains or DNA features (e.g., protein binding sites) within the input sequences. MEME's hypertext (HTML) output also contains buttons that allow for the convenient use of the motifs in other searches (e.g., searching sequence databases for sequences containing the motifs, searching motif databases for similar motifs, constructing a phylogenetic tree from the motif occurrences, and creating a sequence model that accounts for the ordering and spacing of the motifs in the input sequences).

UNIT 3.4 Finding Homologs in Amino Acid Sequences Using Network BLAST Searches (Istvan Ladunga, Celera Genomics Corporation, Foster City, California). BLAST (Basic Local Alignment Search Tool) is used more frequently than any other biosequence database search program. The purpose of this unit is not only to show how to run searches on the Web, but also to demonstrate how to fine-tune arguments for a specific research project. It also offers guidance for interpreting results, handling statistical significance and biological relevance issues, and selecting complementary analyses.
This unit covers three classes of the BLAST program: standard protein-to-protein searches; translated searches, in which either the query or the database consists of nucleotide sequences translated into proteins; and programs for comparing two sequences (as opposed to searching one sequence against a database of sequences).

UNIT 3.5 Correctly Choosing a Scoring Matrix (David Wheeler, Human Genome Center, Baylor College of Medicine, Houston, Texas). Every program for searching protein sequences against a database includes a choice of a protein weight matrix, also called a scoring matrix. Weight matrices add sensitivity to the search, while statistical significance adds selectivity. Virtually every user chooses the default, typically PAM 250 or BLOSUM62. Despite the fact that the choice of matrix can strongly influence the outcome of the analysis, most users do not know why a particular matrix should be used. In general, scoring matrices implicitly represent a particular theory of protein sequence evolution. Understanding the assumptions underlying the PAM and BLOSUM scoring matrices can aid in making the proper choice. The purpose of this unit is to guide the choice of a scoring matrix. It covers the selection of PAM and BLOSUM matrices and provides a brief overview of the wide variety of specialized scoring matrices.

UNIT 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes (Mihaela Pertea and Steven L. Salzberg, The Institute for Genomic Research, Rockville, Maryland). GlimmerM is a eukaryotic gene finder that has been used in the annotation of the genomes of Plasmodium falciparum (the malaria parasite), the model plant Arabidopsis thaliana, Oryza sativa (rice), the parasite Theileria parva, and the fungus Aspergillus fumigatus. A unique feature of the system compared to other eukaryotic gene finders is a module that allows users to provide their own data and train GlimmerM for any organism.

UNIT 6.3 Getting a Tree Fast: Neighbor Joining and Distance-Based Methods (Olivier Gascuel, Département Informatique Fondamentale et Applications, L.I.R.M.M., Montpellier Cedex, France). This unit provides instructions on how to construct a phylogenetic tree with several distance-based methods. Neighbor Joining (NJ) and other distance-based approaches (e.g., BIONJ, WEIGHBOR, and FITCH) are fast methods for building phylogenetic trees. This makes them particularly effective for large-scale studies or for bootstrap applications, which require runs on multiple data sets. Like maximum-likelihood methods, distance methods are based on a sequence evolution model that is used to estimate the distance matrix. Computer simulations indicate that the topological accuracy of BIONJ, WEIGHBOR, and FITCH is significantly better than that of NJ, and that the best distance-based methods are equivalent to parsimony in most cases.

UNIT 7.2 The Gene Ontology Project: Structured Vocabularies for Molecular Biology and Their Application to Genome and Expression Analysis (Judith A. Blake, The Jackson Laboratory, Bar Harbor, Maine; and Midori A. Harris, The European Bioinformatics Institute, Hinxton, Cambridge, England). While it is difficult to persuade laboratory scientists to employ standardized descriptions of experimental procedures and results in their publications, those wishing to utilize genomic data have quickly come to realize the significance and utility of such standards to computer-driven information retrieval systems. The focus of the Gene Ontology project is threefold. First, the project goal is to compile and provide the Gene Ontologies: structured vocabularies describing domains of molecular biology. Second, the project supports the use of these structured vocabularies in the annotation of gene products. Third, the gene product-to-GO annotation sets are provided by participating groups to the public through open access to the GO database and Web resource.
This unit describes the current ontologies and what is beyond the scope of the Gene Ontology project. It addresses the issue of how GO vocabularies are constructed and subsequently related to genes and gene products. It concludes with a brief discussion of how researchers can access, browse, and utilize the GO project in the course of their own research.

UNIT 7.3 Analysis of Gene Expression Data Using J-Express (Inge Jonassen and Bjarte Dysvik, Department of Informatics, University of Bergen, Bergen, Norway). The J-Express package has been designed to facilitate the analysis of microarray data with an emphasis on efficiency, usability, and comprehensibility. The J-Express system provides a powerful and integrated platform for the analysis of microarray gene expression data. It is platform independent in that it requires only the availability of a Java virtual machine on the system. The system includes a range of analysis tools and a project management system supporting the organization and documentation of an analysis project. This unit describes the J-Express tool, emphasizing central concepts and principles, and shows through examples how it can be used to explore gene expression data sets.

UNIT 9.2 SQL: Structured Query Language (Curtis Jamison, School of Computational Sciences, George Mason University, Manassas, Virginia). Relational databases provide the most common platform for storing data. The Structured Query Language (SQL) is a powerful tool for interacting with relational database systems. SQL enables the user to compose complex and powerful queries in a straightforward manner, allowing sophisticated data analysis using simple syntax and structure. This unit demonstrates how to use the MySQL package to build and interact with a relational database.

APPENDIX 1D X Window Survival Guide (Lincoln D. Stein, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York).
Logging in to a Unix system from a console typically initiates a graphical desktop environment that is similar to the Microsoft Windows and Apple Macintosh desktops. Logging in remotely to a Unix system, however, typically limits the user to a small text-only window, which is unable to launch graphical applications. This appendix describes the two main options for overcoming this obstacle: Virtual Network Computing (VNC) and the X Window System.

This Web site Copyright © 1990-2002 by John Wiley & Sons, Inc. All rights reserved.