Advanced Micro Devices
Large Scale Pedigree Haplotyper (LSPH)
Eyal Seroussi, Dept. of Quantitative and Molecular Genetics, Institute of Animal Science, The Agricultural Research Organization (ARO), Volcani Center, Israel
The genotyping data of diploid organisms obtained by current laboratory techniques provide unordered allele pairs for each marker. Reconstruction of haplotypes from these data is a crucial step in many applications. Haplotypes of tightly linked markers provide insight into old and rare recombination events, and thus are more informative than single markers. In particular, haplotypes are essential for linkage disequilibrium (LD) based gene mapping. Approaches to reconstruction of haplotypes by genotype inference can be divided into statistical and non-statistical (rule-based) methods. The statistical approach applies computational or statistical inference to find the most likely haplotype configuration consistent with the observed genotypic data. In the non-statistical approach, every possible trio of two parents and an offspring is examined. The haplotypes are resolved by forward inference of Mendelian rules from the parental genotypes to the offspring genotype and by backward inference from the offspring genotype to its parents. As the pedigree size increases, the number of computations for most rule-based algorithms increases exponentially. Thus, analysis of pedigrees consisting of hundreds of individuals would require several hours or even days of computation.
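The forward-inference step for a single trio can be sketched as follows. This is only an illustration of applying Mendelian rules at one marker, not the LSPH implementation; the function name `phase_trio_marker` and the representation of a genotype as a pair of allele labels are assumptions made for this example.

```python
def phase_trio_marker(father, mother, child):
    """Try to decide which child allele is paternal and which is maternal.

    Each genotype is an unordered pair of allele labels, e.g. (1, 2).
    Returns (paternal_allele, maternal_allele) when exactly one assignment
    is consistent with Mendelian inheritance, None when the marker stays
    ambiguous, and raises ValueError on a Mendelian conflict.
    """
    a, b = child
    options = set()
    # Assignment 1: allele a from the father, allele b from the mother.
    if a in father and b in mother:
        options.add((a, b))
    # Assignment 2: allele b from the father, allele a from the mother.
    if b in father and a in mother:
        options.add((b, a))
    if not options:
        raise ValueError("genotypes are incompatible with Mendelian inheritance")
    if len(options) == 1:
        return next(iter(options))
    return None  # e.g. all three individuals are heterozygous for the same alleles
```

A marker that returns None here is exactly the case where a rule-based method needs further information, for example from other relatives in the pedigree.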
LSPH (Baruch et al., 2006, Genetics 172:1757-1765) is a rule-based program for Windows OS that infers haplotypes from the genotypes of a large animal pedigree or population. It assumes nothing about the haplotype frequencies in the population, but assumes zero recombination events. The application reads the raw genotype file, which consists of rows. Each row contains the gender and identities of an individual and its parents, followed by the available genotypes of this individual. It then verifies that the structure of the families is free from errors. The result of an LSPH run is a file that describes the resolved haplotypes for each of the pedigree members. It also records summary statistics on the number of resolved haplotypes in a log file. Output is presented in a user-friendly XML format, allowing the use of Excel as a tool for viewing the resolved haplotypes. The paternal and maternal haplotypes are colored in blue and red, respectively, and genotype errors can be traced by viewing the type and description of conflicts, which are displayed as Excel comments indicated with red triangles at the right corners of the allele cells.
Grid Computing for Bioinformatics
David Horn, School of Physics and Astronomy, Tel Aviv University, Israel
Edward Aronovich, School of Computer Science, Tel Aviv University, Israel
Assaf Gottlieb, School of Physics and Astronomy, Tel Aviv University, Israel
We shall demonstrate the use of a simple web interface to perform common bioinformatics tasks and show the power of parallel computing on the grid to reduce the running time of these tasks.
We offer researchers a tool through which they can transform their proprietary software from a batch-mode task into an interactive task. This enables researchers to expose their software capabilities through the web without distributing their source code or binaries.
The system is based on the Gilda and Genius portal, which are part of the Enabling Grids for E-science (EGEE II) project. The system benefits from the idle time of a large number of computers.
The Israel Academic Grid (IAG) team will demonstrate the system. Further details may be found at http://iag.iucc.ac.il/
Selecton: A Web Server for the Detection of Site-Specific Positive Darwinian Selection and Purifying Selection
Adi Stern, Adi Doron-Faigenboim, Eran Bacharach and Tal Pupko, Department of Cell Research and Immunology, Tel Aviv University, Israel
The ratio of non-synonymous to synonymous substitutions, known as the Ka/Ks ratio, is used to estimate both purifying and positive Darwinian selection. A Ka/Ks ratio significantly greater than 1 is indicative of positive selection, whereas values significantly smaller than 1 are indicative of purifying selection. We present an algorithmic web-based tool which calculates the Ka/Ks ratio for each codon site in a codon-based multiple sequence alignment (Doron-Faigenboim et al. 2005; Stern et al. 2006). Selecton implements both an empirical Bayesian algorithm (Yang et al. 2000) and a maximum-likelihood algorithm (Goldman and Yang 1994), between which the user may choose (the default algorithm is the Bayesian one). On the one hand, through its user-friendly interface Selecton enables simplicity of use for non-expert users. The minimal input to the server consists of merely a file of homologous DNA coding sequences. On the other hand, Selecton further implements a wide variety of user options which enable maximal fine-tuning of the server to the user's needs. These include control over the different parameters implemented in the Bayesian algorithm, computation of the Ka/Ks scores under different genetic codes, and input of a user-built phylogenetic tree. Following calculation of Ka/Ks ratios at each site of the protein, these ratios are converted into a discrete color scale and projected onto one of the homologous sequences specified by the user. If a 3-dimensional structure of the protein is available, the scores will also be projected onto the van der Waals surface of the protein.
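The conversion of per-site Ka/Ks ratios into a discrete scale can be illustrated with a simple binning function. This is a sketch under stated assumptions: the seven levels and the bin edges in `BIN_EDGES` are hypothetical values chosen for the example and are not Selecton's actual thresholds, which the abstract does not give.

```python
from bisect import bisect_right

# Hypothetical bin edges (assumption); the neutral bin is centred on Ka/Ks = 1.
BIN_EDGES = [0.25, 0.5, 0.9, 1.1, 2.0, 4.0]


def color_level(ka_ks):
    """Map a per-site Ka/Ks ratio to a level from 1 to 7.

    Level 1 is the strongest purifying selection, level 4 is roughly
    neutral, and level 7 is the strongest positive selection.
    """
    if ka_ks < 0:
        raise ValueError("Ka/Ks cannot be negative")
    return bisect_right(BIN_EDGES, ka_ks) + 1


def color_levels(ratios):
    """Apply color_level to every site of an alignment, in order."""
    return [color_level(r) for r in ratios]
```

Note that binning alone says nothing about significance; deciding whether a ratio is significantly different from 1 requires the statistical models cited above.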
To exemplify its use, we used Selecton to analyze the HIV-1 protease for mutations that confer resistance to the Ritonavir drug (Doron-Faigenboim et al., 2005). The analysis was done on 70 sequences obtained from patients who were treated with the drug. Selecton projected the Ka/Ks scores onto the tertiary structure of the protease, enabling us to detect several patches of positive selection and purifying selection.
Doron-Faigenboim A, Stern A, Mayrose I, Bacharach E, Pupko T (2005) Selecton: a server for detecting evolutionary forces at a single amino-acid site. Bioinformatics 21:2101-3
Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725-36
Stern A, Doron-Faigenboim A, Bacharach E, Pupko T (2006) Selecton 2006: detecting positive Darwinian selection and purifying selection in proteins using a Bayesian approach. Submitted to BMC Evolutionary Biology
Yang Z, Nielsen R, Goldman N, Pedersen AM (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-49
Text Mining as Web Services Provided from the EBI
Dietrich Rebholz-Schuhmann, EBI, Wellcome Trust Genome Campus, UK
Whatizit is a modular text processing system that is openly available through EBI's Web pages. It allows processing of any type of text to identify named entities and to annotate the named entities with links to bioinformatics databases. In addition, Whatizit enables retrieval of Medline abstracts by keyword query, by UniProtKB/Swiss-Prot accession key or by PubMed ID. Retrieved abstracts can be processed automatically via single modules or complex text processing pipelines. Whatizit is also available as 1) a Web service and 2) a streamed servlet. The Web service allows enriching the content of individual Web sites in a similar way as in Wikipedia. The streamed servlet allows processing large amounts of text.
EBIMed is a web application that combines Information Retrieval and Extraction from Medline. EBIMed retrieves Medline abstracts in a way similar to PubMed. In addition, it analyses them to offer a complete overview of associations between UniProtKB/Swiss-Prot protein/gene names, GO annotations, Drugs and Species. The results are represented in a table that displays all the associations and links to the sentences that support them and to the original abstracts. By selecting relevant sentences and highlighting the biomedical terminology, EBIMed enhances your ability to acquire knowledge, relate facts, discover implications and, overall, gain a good overview while economizing on reading effort. Design principles and implementation of EBIMed will be presented in the text mining session of ECCB 2006.
EVEREST: A Collection of Evolutionary Conserved Protein Domains
Elon Portugaly, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Nathan Linial, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Michal Linial, Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
Proteins are composed of one or several domains that are evolutionarily recurrent. We developed a computational tool to identify protein domains and to classify them into their families. Our tool and database are combined in the website called EVEREST. We have applied EVEREST (Ver 1.1) to the SwissProt database (Ver 1.1; 133,000 proteins). EVEREST is a collection of 13,569 domain families. These families include over one million domains, and jointly cover 83% of the amino acids in the Swiss-Prot database. The average (median) size of a domain family in this collection is 81 (41) amino acids and the average length of the domains is 117 (76) amino acids.
The process underlying EVEREST begins by constructing a database of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments, choosing the best clusters using machine learning techniques, and creating a statistical model for each of these clusters. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a segment database and to cluster the segments again. Performance tests show that EVEREST recovers 63% of Pfam families and 40% of SCOP families with high accuracy, and suggests new families with about 40% fidelity. EVEREST domains are frequently a combination of domains as defined by Pfam or SCOP and are also frequently subdomains of such domains.
EVEREST is the only web-based tool that provides in-depth analysis of major domain-based classifications including SCOP and CATH (structurally based) and Pfam (sequence based). It is designed to allow users to test their own protein sequence against all HMM models. It is a very rich resource for domains that will be essential for structural genomics initiatives (for domain boundary definition), for an evolutionary view of proteins, and as a source for new discoveries.
EVEREST presents one statistical model per domain family. The website provides rich navigation tools at the level of the proteins, the domain families and the entire system. EVEREST is available at http://www.everest.cs.huji.ac.il
In the presentation, the EVEREST navigation tools will be demonstrated. Specifically, we will demonstrate the usefulness of EVEREST for navigating the domain space through phyletic groups. In addition, the use of EVEREST as a discovery tool and as an assessment tool will be demonstrated by testing several biological questions.
Sun Microsystems, Inc.
Ana Conesa, Centro de Genomica, Instituto Valenciano de Investigaciones Agrarias, Spain
Stefan Goetz, Grupo de Informatica Biomedica, Instituto de Aplicaciones de las Tecnologias de la Informacion y de las Comunicaciones Avanzadas, Universidad Politecnica de Valencia, Spain
Blast2GO (B2G) is bioinformatics software developed with the aim of providing biologists with a user-friendly tool that joins in one application similarity search-based functional annotation of (novel) genes or proteins and the functional analysis of genomics data. Basically, B2G performs blast to find sequences similar to a query set and assigns GO annotation based on an annotation rule that takes into consideration the multiplicity of homologues, the amount of similarity and the annotation quality of the hit sequences. Blast searches can be performed remotely against the NCBI using default parameters or locally on specific databases. B2G offers the possibility of direct statistical analysis on gene function information and visualization of relevant functional features on an interactive and highlighted GO directed acyclic graph (DAG). Powerful navigation algorithms with interactive functions have been incorporated for explorative analysis. The application includes various descriptive statistics features for summarizing results obtained at the blasting, GO-mapping and annotation steps, as well as functional data mining functions such as significance analysis for GO-term over- or under-representation in sets of sequences.
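Significance analysis of GO-term over-representation is commonly framed as a one-sided hypergeometric (Fisher-type) test. The sketch below shows that general calculation only; the abstract does not state which test Blast2GO applies, and the function and parameter names are assumptions made for this example.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric tail probability P(X >= k).

    N: number of sequences in the reference set
    K: reference sequences annotated with the GO term
    n: number of sequences in the test set (drawn from the reference set)
    k: test-set sequences annotated with the GO term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the probability of observing k, k+1, ..., min(n, K) annotated sequences.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total
```

When many GO terms are tested at once, the resulting p-values would normally also need a multiple-testing correction, which is omitted here.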
Processing steps are configurable, and data import and export are supported at any stage. The application has been endowed with an array of data input possibilities. Not only can raw sequences be processed, but pre-existing blast or annotation files are also accepted and taken to subsequent steps. While blast results can be imported in XML format, existing sequence annotations are supported as Gene-Symbol or Accession identifier lists.
Improvements have recently been incorporated at different stages of the annotation and analysis process. The blast module has been enhanced with several options for performing massive blast searches in less time, also permitting searches against multiple DB sources. A species filter can be applied to select results obtained from extensive DBs like the non-redundant (nr) NCBI sequence DB. During the species-independent annotation mining, blast result hits are now matched with Gene Ontology terms, enzyme codes, KEGG pathways and, in the near future, InterPro domains. The automatic annotation module offers the possibility of manual curation making use of various linked web resources. Two extra functionalities are available for modulating the intensity of annotation: the "Second Layer" approach, which enhances GO annotation by considering GO term cross-relationships, and GO-Slim simplification, which is supported for various species-specific "slim" mappings.
Blast2GO is a freely available desktop Java application that can be started by Java Web Start technology. B2G is platform-independent, installable with minimal requirements and automatically updated via the internet. Additionally, the annotation module can be integrated into annotation pipelines through the B2G4Pipe command line interface.
The software demo will include the following aspects:
- Sequence annotation using a diversity of databases
- Modulation of annotation intensity and diversity
- Formats and alternative ways for data import and export
- Explorative analysis through interactive combined GO Graphs
- Generation of annotation statistics and visualization
- Statistical analysis of GO term distributions in functional genomics data
What's new with ArrayExpress
Misha Kapushesky, EBI, UK
ArrayExpress is a public repository for microarray data, which is aimed at storing MIAME-compliant data in accordance with MGED recommendations. The ArrayExpress Data Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository. We will demonstrate a new interface to the ArrayExpress database, new workflow and web service capabilities of its data analysis tools, REST interfaces for automated database queries, novel functionality for the gene expression warehouse and advanced data submission tools.
CFinder: Locating Cliques and Overlapping Modules in Biological Networks
Gergely Palla, Biological Physics Research Group of the Hungarian Academy of Sciences,
High-throughput experimental techniques, e.g., protein-protein interaction (PPI) and mRNA expression methods, have greatly advanced our knowledge about the functioning of the cell. Gene (protein) association networks integrate the broadest possible set of evidence -- including high-throughput data -- on protein linkages: they provide an integrated list of binary interactions [1, 2] and allow the discovery of previously uncharacterized cellular systems [3]. One major goal of current research efforts is to elucidate how the observed behaviors of an entire cell can be understood in terms of the interactions of its protein modules. To identify such modules, a common approach is to search for groups of densely interconnected nodes in the cell's protein association network [4, 5]. Note, however, that modules strongly overlap. According to the CYGD database [6], in Saccharomyces cerevisiae the number of proteins in known protein complexes (modules where the participating proteins physically interact at the same time) vs. the sum of the sizes of these complexes is 2750/8932. Thus, most protein modules probably share many of their proteins with other modules.
We introduce CFinder [7], a platform-independent application locating overlapping groups of densely interconnected nodes in graphs, and illustrate its use on biological networks. Generic graph visualization and analysis programs [8] are frequently used for the layout and structural analysis of networks. Recent bioinformatics software platforms [9], on the other hand, enable the user to integrate many different types of data, e.g., PPI, expression levels, and annotation information. CFinder reads a list of binary interactions, performs a search for dense subgraphs (groups), and -- unlike several currently used algorithms [10] -- it allows for any node to belong to more than one group.
Due to its algorithm and implementation, CFinder is efficient for networks with millions of nodes and, as a byproduct of its search, the full clique overlap matrix of the network is determined. Below we will show that in gene association networks CFinder's results can be used to predict novel modules and novel individual protein functions.
Overview of CFinder
The input of CFinder is a file containing strings and numbers ordered into three columns; in each row the first two strings correspond to the two end points of a link and the third item is the weight of this link.
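Reading such a three-column link list can be sketched as below. The whitespace delimiter, the function name `read_links` and the optional `lower`/`upper` weight cutoffs (mirroring the cutoffs the abstract mentions for weighted links) are assumptions made for this example, not a description of CFinder's own parser.

```python
def read_links(lines, lower=None, upper=None):
    """Parse 'node_a node_b weight' rows into (node_a, node_b, weight) tuples.

    Rows that do not have exactly three whitespace-separated fields are
    skipped. Links whose weight falls outside [lower, upper] are dropped
    when the corresponding cutoff is given.
    """
    links = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        node_a, node_b, weight = parts[0], parts[1], float(parts[2])
        if lower is not None and weight < lower:
            continue
        if upper is not None and weight > upper:
            continue
        links.append((node_a, node_b, weight))
    return links
```

Because the function accepts any iterable of strings, it can be called on an open file object as well as on a list of lines.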
The computational core of CFinder was implemented in C++, while the visualization and analysis components were written in Java. The search algorithm uses the Clique Percolation Method (CPM, see [11, 12]) to locate the k-clique percolation clusters of the network that we interpret as modules. A k-clique is a complete subgraph on k nodes (k = 3, 4, ...), and two k-cliques are said to be adjacent if they share exactly k-1 nodes. A k-clique percolation cluster consists of (i) all nodes that can be reached via chains of adjacent k-cliques from each other and (ii) the links in these cliques. Note that larger values of k correspond to a higher stringency during the identification of dense groups and provide smaller groups with a higher density of links inside them. For both local and global analysis in a network, we suggest using such a value of k (typically between 4 and 6) that provides the user with the richest group structure (see [12]). In the presence of link weights CFinder can apply lower and upper cutoff values to keep only the set of connections judged to be significant by the user.
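The definitions above translate directly into a small reference implementation. The sketch below is a brute-force illustration of k-clique percolation for small graphs only; it is not CFinder's C++ algorithm, and the function names are assumptions made for this example.

```python
from itertools import combinations


def k_cliques(adjacency, k):
    """Yield every complete subgraph on k nodes (brute force, small graphs only)."""
    for nodes in combinations(sorted(adjacency), k):
        if all(v in adjacency[u] for u, v in combinations(nodes, 2)):
            yield frozenset(nodes)


def clique_percolation(edges, k):
    """Return the k-clique percolation clusters as sorted lists of node names."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    cliques = list(k_cliques(adjacency, k))

    # Union-find over cliques: adjacent cliques end up in the same set.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        # Two k-cliques are adjacent if they share exactly k-1 nodes.
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    clusters = {}
    for i, clique in enumerate(cliques):
        clusters.setdefault(find(i), set()).update(clique)
    return sorted(sorted(nodes) for nodes in clusters.values())
```

For two triangles sharing an edge plus a third triangle attached through a single node, the call with k = 3 yields two clusters, and the connecting node belongs to both of them, which is the overlap property that distinguishes this method from partition-based clustering.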
The user interface of CFinder offers several views of the analyzed network and its module structure. As an example, Fig. 1 shows the modules of the protein Pwp2 in the DIP "yeast core" network [2] at clique size k = 4. Alternative views currently available in CFinder are "Communities" (displaying the identified modules), "Cliques", "Stats" (statistics of, e.g., module and overlap sizes) and "Graph of communities". The special buttons "forward", "back", "zoom" and "walk" allow quick navigation between the views. A wide variety of visualization settings can be adjusted in the "Tools" menu.
In Fig. 2 we display the network of modules produced by CFinder (k = 4) in the DIP "yeast full" data set. In the complete map (a) each node represents a module, the area of a node is proportional to the number of proteins in the corresponding module, and the width of a link is proportional to the number of proteins shared by the two modules. Panel (b) shows a previously known complex identified by CFinder. Panels (c) and (d) both display a known complex grouped together with one additional protein (Msh2 and Vps8, respectively), leading to an improved functional annotation of that protein. In panel (e) Eeb1 (function currently unknown) is grouped together with proteins participating in vesicle-mediated transport; thus, we predict this to be a key function of Eeb1. Proteins in the marked dark blue and brown groups of panel (e) cooperate on the establishment of cell polarity, a function performed by a total of 103 proteins in the cell. We anticipate that these two groups are biologically meaningful, novel modules within that larger set of 103 proteins. (Gene names and annotations were handled with Perl tools, e.g., GO::TermFinder [13].)
[1] von Mering, C. et al. (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res., 33, D433-D437.
[2] Salwinski, L. et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32, D449-451.
[3] Date, S.V. and Marcotte, E.M. (2003) Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotechnol., 21, 1055-1062.
[4] Bader, G.D. and Hogue, C.W.V. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2.
[5] Rives, A.W. and Galitski, T. (2003) Modular organization of cellular networks. Proc. Natl. Acad. Sci. U S A, 100, 1128-1133.
[6] Guldener, U. et al. (2005) CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res., 33, D364-D368.
[7] Adamcsek, B. et al. (2006) CFinder: Locating cliques and overlapping modules in biological networks. Bioinformatics, 22, 1021-1023.
[8] Batagelj, V. and Mrvar, A. (1998) Pajek - program for large network analysis. Connections, 21, 47-57.
[9] Shannon, P. et al. (2003) Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498-2504.
[10] Newman, M.E.J. (2004) Detecting community structure in networks. Eur. Phys. J. B, 38, 321-330.
[11] Derenyi, I. et al. (2005) Clique percolation in random networks. Phys. Rev. Lett., 94, 160202.
[12] Palla, G. et al. (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435, 814-818.
[13] Boyle, E.I. et al. (2004) GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20, 3710-3715.
SABIO-RK (System for the Analysis of Biochemical Pathways - Reaction Kinetics)
Olga Krebs, Scientific Databases and Visualization Group, EML Research gGmbH, Germany
The simulation of biochemical reaction networks requires experimental data about substance concentrations, metabolic fluxes, enzyme activities, or reaction kinetics, describing the dynamics of the reactions with their respective parameters determined under certain experimental conditions. These data are widely scattered through various publications and described in many different formats.
To address these needs we have developed an integrated biological pathway database system, SABIO-RK (System for the Analysis of Biochemical Pathways - Reaction Kinetics), to store and offer access to information about biochemical reactions and their kinetics in a comprehensive and standardised manner. It stores the fundamental information about biochemical pathways, like reactions and their participants (enzymes, compounds, modifiers) and descriptive information about proteins, protein complexes and genes, all this linked to organism (including strains) and to biochemical reactions (in the case of enzymes). Most of the reactions, their associations with biochemical pathways, and their enzymatic classifications (EC number of the International Union of Biochemistry and Molecular Biology) are extracted from the KEGG database (Kyoto Encyclopedia of Genes and Genomes).
The kinetic data is mainly manually extracted from published scientific articles and then verified by curators. It includes data about the organism, tissue, and cellular location where the reaction takes place, as well as the type of the kinetic law and the reaction's rate equation. The latter is shown with its parameters (Km, Vmax, concentration, etc.) and the experimental conditions (e.g. pH, temperature, buffer) under which the parameters were determined.
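As an illustration of how a stored rate equation and its parameters are used together, the sketch below evaluates the Michaelis-Menten rate law, v = Vmax * [S] / (Km + [S]), one common kinetic law type. The function name and the choice of this particular law are assumptions made for the example; entries in the database may follow other kinetic laws.

```python
def michaelis_menten_rate(substrate, vmax, km):
    """Reaction rate v = Vmax * [S] / (Km + [S]).

    substrate and km must be in the same concentration unit;
    the result has the unit of vmax.
    """
    if substrate < 0 or vmax < 0 or km <= 0:
        raise ValueError("substrate and vmax must be >= 0 and km must be > 0")
    return vmax * substrate / (km + substrate)
```

At a substrate concentration equal to Km the function returns half of Vmax, which is the defining property of Km. Because the parameters depend on pH, temperature and buffer, values taken from different entries should only be combined when their experimental conditions are comparable.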
During the curation process, the data is unified and structured consistently in order to facilitate the comparison of the kinetic data extracted from different sources, since they are usually obtained under different experimental conditions or from different organisms, tissues etc.
The data integration process is supported by a web-based input interface which allows users to load and store information about reactions and enzyme kinetics. It conforms in most respects to the current recommendations for standards by the STRENDA commission (http://www.strenda.org/), especially with respect to the kinetics of enzymes and reactions.
SABIO-RK has a web-based user interface that enables the user to search for biochemical reactions and their kinetics, based on the characteristics of the reactions and on the environmental conditions under which their kinetics were obtained. For example, the user can define a pathway, e.g. glycolysis, or can specify reaction participants (compounds or enzymes), organisms, tissues, or cell types in which the reaction is reported to occur. Additional search terms include cellular locations, environmental conditions (pH and temperature), or publications in which kinetic data are reported. The interface enables the export of data in SBML (Systems Biology Mark-Up Language) format that can then be used as the basis to create a simulation model of a biochemical network.
The SimpAT Package: Integrating SIMAP into Your Own Applications
Roland Arnold, Thomas Rattei, Volker Stumpflen and Hans-Werner Mewes
MIPS/IBI Institute for Bioinformatics at the GSF - National Research Center for Environment and Health, GmbH, Germany
Searching for sequences similar to a query protein within a set of other proteins is one of the basic procedures in many in silico analyses. Such searches are used to detect homologous sequences in order to transfer annotation, detect members of a protein family, find putative orthologs, create phylogenetic profiles, and in many more applications. Very often, these similarity searches query for sequences that are already known in public databases, and they are therefore redone again and again by different scientists or even in different rounds of one analysis. Because of its speed, BLAST (1) is the most prominent heuristic for similarity searches.
SIMAP (2), the Similarity Matrix of Proteins, is a database containing pre-calculated similarity searches of protein sequences based on the Fasta 3 (3) procedure. The data points in SIMAP are saved as ordered lists of hits, which can be retrieved very rapidly. The database covers a comprehensive list of public databases such as SWISSPROT/UNIPROT (4) or NCBI GenBank (6) and contains almost every sequenced genome in the public domain. Taxonomic information is added to each protein entry.
SIMAP is accessible via Web Service (http://www.w3.org/2002/ws/) technology. To simplify the integration of the SIMAP Web Service into users' own applications, we developed SimpAT, the SIMAP Access Tool. SimpAT is a Java package which can be used without any knowledge of Web Service technology. This integration can be done either directly in Java code or in any other program via a wrapper application. This wrapper application can also be used as a stand-alone client. Since SimpAT makes use of Web Service technology to invoke SIMAP via the internet, no parts of SIMAP itself have to be downloaded and the data fetched are always up to date.
SimpAT retrieves the result as a comprehensive XML document. This document is automatically parsed into convenient Java classes, so no custom parsing is necessary.
The query scheme is very flexible and based on the selection of appropriate cut-offs and search spaces. These parameters are evaluated on the server side to minimize download traffic. Queries to SIMAP use the unique md5 key (5) of an amino-acid sequence, or the sequence itself, or a protein identifier in combination with a specific database, avoiding identifier problems across different databases. The search space can be defined by the selection of (several) individual datasets or types of datasets (such as 'complete genomes'), by including and excluding branches of the taxonomy, or by a combination of these. Therefore queries like "return all hits with an E-value better than 10e-5 in UNIPROT but only of eukaryotic origin which are not human" can be formulated in a few lines of code.
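Using a digest of the sequence as a database-independent key can be sketched with the Python standard library. The normalisation applied before hashing (removing whitespace and converting to upper case) and the function name `sequence_key` are assumptions made for this example; the exact convention SIMAP uses is not stated in the abstract, and SimpAT itself is a Java package.

```python
import hashlib


def sequence_key(sequence):
    """Return the MD5 hex digest of a normalised amino-acid sequence.

    Normalisation (an assumption): all whitespace is removed and the
    sequence is converted to upper case, so that formatting differences
    do not change the key.
    """
    normalised = "".join(sequence.split()).upper()
    if not normalised:
        raise ValueError("empty sequence")
    return hashlib.md5(normalised.encode("ascii")).hexdigest()
```

Two database entries with identical sequences then map to the same 32-character key regardless of which identifier each database assigns, which is what avoids the identifier problems mentioned above.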
The result attributes include taxonomic information for each protein entry returned, referencing the NCBI taxonomic database (5), and a link-out to the database of origin as an HTML link. Synonymous protein entries in other databases are also returned. The result further contains E-values, similarity scores such as the Smith-Waterman alignment score, the bit-score, the identity, and the coordinates of the Smith-Waterman alignment for both sequences (for details see (2)).
SimpAT also directly supports access to data of putative orthologs defined by bi-directional hits and putative in-paralogs.
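The bi-directional hit criterion for putative orthologs can be sketched as follows. The input dictionaries mapping each protein to its best hit in the other genome are hypothetical data structures chosen for this example; they do not represent the SimpAT API.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (protein_a, protein_b) pairs that are each other's best hit.

    best_a_to_b maps each protein of genome A to its best hit in genome B,
    and best_b_to_a does the same in the opposite direction.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )
```

A protein whose best hit points back to a different protein is left out, which is typically what happens after a gene duplication in one of the two lineages.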
In the software demonstration, we will perform a live implementation of two typical use-cases.
The first part of the demonstration will show how to retrieve a hit list for a given sequence. We will demonstrate how to define customized search spaces and how to use taxonomic filtering, and discuss the possible cut-off parameters. We will implement an example application which answers questions like "which sequences of a set of input sequences have close homologs only in bacteria but not in eukaryota" in a few lines of code.
The second part will demonstrate how to use SIMAP for the prediction of orthologs. As examples, we will discuss a very short implementation to create phylogenetic profiles and a function which returns the number of putative gene duplications after a speciation event, both for the complete protein set of a genome using one rapid whole-genome query.
The presentation is aimed at software developers in projects dealing with large numbers of similarity searches, but also at non-programmers, since SIMAP can be accessed using the wrapper application without further programming. Knowledge of Java is therefore desirable but not required.
1. Altschul, S. F., Gish, W., Miller, W., Myers, G. and Lipman, D. J. (1990) A basic local alignment search tool. J. Mol. Biol., 215, 403-410
2. Arnold R, Rattei T, Tischler P, Truong MD, Stumpflen V, Mewes W. (2005) SIMAP--The similarity matrix of proteins. Bioinformatics. Sep 1;21 Suppl 2:ii42-ii46
3. Pearson, W. R. (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132, 185-219
4. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005 Jan 1;33(Database issue):D154-9
5. Smith M, Kunin V, Goldovsky L, Enright AJ, Ouzounis CA. (2005) MagicMatch--cross-referencing sequence identifiers across databases. Bioinformatics, Aug 15;21(16):3429-30.
6. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Helmberg W, Kapustin Y, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E. (2006) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D173-80.
COLOMBO/SIGI-HMM: Prediction of Genomic Islands in Procaryotic Genomes Using HMMs
Oliver Keller, Institut fur Informatik, Goettingen, Germany
The process of acquiring genes from foreign species, with or without a closer taxonomical relation to the host, is called horizontal gene transfer. This phenomenon is frequent among microbial species and is considered a major means of rapid adaptation to changing environmental demands. Researchers seek algorithms that detect anomalies characterising, for example, horizontally transferred genes, which are often found in larger contiguous regions called genomic islands (GIs). We have implemented the algorithm SIGI-HMM, which predicts GIs in procaryotic genomes as well as the putative source of each individual alien gene.
Benchmark tests showed that the output of SIGI-HMM agrees with known findings and is consistent both with annotated GIs and with predictions generated by other methods. We conclude that SIGI-HMM is a valuable tool for identifying GIs in microbial genomes. It is publicly available and ready to use. It processes annotated genomic sequences in EMBL format and can also produce GFF output.
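As a rough illustration of how a hidden Markov model can label a run of genes as native or alien, the sketch below runs Viterbi decoding on a toy two-state model. The state names, the "typical"/"atypical" observations and all probabilities are assumptions chosen for this example; they are not the model or the parameters used by SIGI-HMM.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a sequence of observations."""
    # Log space avoids numerical underflow on long gene lists.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            row[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][obs]))
            ptr[s] = best_prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: each gene is summarised as having "typical" or
# "atypical" codon usage relative to the host genome.
STATES = ["native", "alien"]
START = {"native": 0.9, "alien": 0.1}
TRANS = {"native": {"native": 0.9, "alien": 0.1},
         "alien": {"native": 0.2, "alien": 0.8}}
EMIT = {"native": {"typical": 0.9, "atypical": 0.1},
        "alien": {"typical": 0.3, "atypical": 0.7}}

genes = ["typical", "typical", "atypical", "atypical", "atypical",
         "typical", "typical"]
labels = viterbi(genes, STATES, START, TRANS, EMIT)
print(labels)
```

In this toy run, the three consecutive atypical genes are labelled as one contiguous alien block, which is the kind of region a genomic island would correspond to.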
The Scriptome: Protocols for Manipulating Biological Data
Amir Karger, Bauer Center for Genomics Research, Harvard, USA
The Scriptome tools perform simple analysis and formatting of biological data. The focus is on basic biological data manipulation, not complex statistics or algorithms. There are currently six tool categories: Calculate, Change, Choose, Fetch, Merge, and Sort. Popular tools include:
- Choosing lines where the value in a given column exceeds a certain threshold. This can be especially useful with files larger than Excel's limit of 65,536 rows.
- Merging files together based on shared values in certain columns. This tool essentially performs a SQL join.
- Changing FASTA files to a tabular format. The output can be viewed in Excel, or filtered with other Scriptome tools.
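The first two tools in the list can be sketched as follows. This is a minimal Python illustration, not the Scriptome's own code: the function names `choose_rows` and `merge_on_column`, the tab-delimited layout and the sample data are all assumptions made for the example.

```python
import csv

def choose_rows(lines, column, threshold, delimiter="\t"):
    """Yield rows whose value in the 0-based `column` exceeds `threshold`."""
    for row in csv.reader(lines, delimiter=delimiter):
        try:
            value = float(row[column])
        except (IndexError, ValueError):
            continue  # skip headers, short rows and non-numeric cells
        if value > threshold:
            yield row

def merge_on_column(left_lines, right_lines, left_col=0, right_col=0,
                    delimiter="\t"):
    """Inner join of two tables on a shared key column."""
    index = {}
    for row in csv.reader(right_lines, delimiter=delimiter):
        if len(row) > right_col:
            index.setdefault(row[right_col], []).append(row)
    for row in csv.reader(left_lines, delimiter=delimiter):
        if len(row) <= left_col:
            continue
        for match in index.get(row[left_col], []):
            # Append the right-hand columns, leaving out the repeated key.
            yield row + [cell for i, cell in enumerate(match) if i != right_col]

# Hypothetical sample data: gene name with a score, and gene name with a label.
expression = ["geneA\t2.5", "geneB\t0.4", "geneC\t7.1"]
annotation = ["geneA\tkinase", "geneC\ttransporter", "geneD\tunknown"]

high = list(choose_rows(expression, column=1, threshold=1.0))
merged = list(merge_on_column(expression, annotation))
print(high)
print(merged)
```

Both functions read their input one row at a time, so they are not bound by a spreadsheet's row limit. The join keeps only the right-hand table in memory.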
Each tool is a short Perl script embedded in a one-line shell command. These tools can be cut and pasted from the website onto the command line. The tools' simplicity makes it possible to develop tools rapidly, to keep up with biologists' changing needs.
The interface is extremely lightweight. Users browse the web site's hierarchical table of contents to find a tool, then cut and paste the tool text onto the command line. They edit filenames and parameters, which are highlighted in red on the website, and then hit Enter to run the tool.
Once biologists can use single tools, they may begin the more difficult process of chaining tools together to form protocols (workflows). In the command-line interface, protocols become batch scripts; alternatively, some users have chosen to save the tools in a Word document. Users can therefore create reproducible, commented protocols.
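A protocol of this kind might chain two steps: convert a FASTA file to a table, then keep only the records whose sequence exceeds a length threshold. The Python sketch below illustrates the idea under assumed names (`fasta_to_table`, `long_records`) and made-up sequences; it is not taken from the Scriptome website.

```python
def fasta_to_table(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)

def _record(header, chunks):
    # The identifier is the text up to the first space of the header line.
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)

def long_records(records, min_length):
    """Keep records whose sequence is longer than `min_length` residues."""
    return [r for r in records if len(r[2]) > min_length]

# Hypothetical input; in practice the lines would come from an open file.
fasta = [">seq1 first protein", "MKT", "AYIAK", ">seq2", "MK"]

table = list(fasta_to_table(fasta))       # step 1: FASTA to table
selected = long_records(table, 5)         # step 2: choose by length
for ident, desc, seq in selected:
    print("\t".join([ident, desc, seq]))  # tab-delimited, opens in Excel
```

Because each step takes and returns plain rows, further steps (sorting, merging with another table) can be appended without changing the earlier ones.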
Ease of installation
The Scriptome requires no installation at all on UNIX and Macintosh, since Perl is standard on those platforms. Windows users need only a one-click installation of Perl from ActiveState, http://www.activestate.com/Products/ActivePerl. The Scriptome has been tested for compatibility with versions of Perl as old as 5.005_03, released in 1999. A few tools also require BioPerl, http://www.bioperl.org.
Ease of learning
The Scriptome avoids introducing a new, graphical user interface that will require user training. The tools perform simple tasks requiring only a few parameters. The website features a hierarchical, English table of contents (so that users do not need to memorize command names), along with concise documentation and examples. Most users can therefore begin using tools after only a few minutes of training.
High-throughput data analysis without programming
By using the Scriptome, experienced programmers can avoid rewriting commonly used tools. Novice programmers can also learn from its short, working, relevant Perl examples. But most importantly, the Scriptome allows non-programmers to perform high-throughput analysis without learning any programming at all.
Some argue that all biologists should learn how to program. However, many biologists focus on performing laboratory experiments, spending only a limited time analyzing data. There would not be enough return on the investment of time needed to become effective programmers. In addition, in the weeks or months before the next analysis, they could lose their programming skills.
Despite the wide range of available software tools and programming languages, most biologists still rely on older methods for high-throughput data analysis. When confronted with large data sets in incompatible formats, the average biologist will hand-edit in Excel, depend on bioinformaticists, or just give up. The most important factor behind this problem is the barrier to entry: biologists are simply too busy keeping up with the rapid development of their own field to learn how to program or operate complicated software. The Scriptome focuses on ease of learning and installation to surmount this initial barrier. In doing so, it deliberately sacrifices the scope and complexity of more sophisticated software tools or full-fledged programming.
The Scriptome's early users have quickly recognized that the Scriptome empowers them. On their own, without extensive training, they can filter and reformat data, perform high-throughput analysis, and explore their data in new ways.
DeltaProt: Molecular Comparison of Proteins Based on Sequence Alignments
A Matlab(c) Companion Toolbox
Steinar Thorvaldsen, Tor Flå and Nils P. Willassen, University of Tromso and Norwegian Structural Biology Centre, Tromso, Norway
We present statistical methods, trend-tests and visualisations that are useful when the protein sequences in alignments can be divided into two or more populations based on known phenotypic traits such as preference of temperature, pH, salt concentration or pressure. The algorithms have been successfully applied in the research on extremophile organisms.
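To make the two-population setting concrete, the sketch below counts residues with a chosen property (here, charged residues) in two groups of aligned sequences and computes a Pearson chi-square statistic on the resulting 2x2 table. It is written in Python rather than Matlab, and the property set, function names and sequences are assumptions for illustration only; it does not reproduce the DeltaProt algorithms.

```python
CHARGED = set("DEKR")  # assumed property class for this example

def property_counts(sequences, residues):
    """Return (residues with the property, other residues), ignoring gaps."""
    hits = total = 0
    for seq in sequences:
        for aa in seq.upper():
            if aa == "-":
                continue  # alignment gap
            total += 1
            if aa in residues:
                hits += 1
    return hits, total - hits

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 d.f., no continuity correction)
    for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical aligned sequences from two populations.
cold_adapted = ["MKDE-LA", "MKDEQLA"]
warm_adapted = ["MARLGLA", "MAALGLA"]

cold = property_counts(cold_adapted, CHARGED)
warm = property_counts(warm_adapted, CHARGED)
statistic = chi_square_2x2(cold[0], cold[1], warm[0], warm[1])
print(cold, warm, round(statistic, 3))
```

With counts this small the chi-square approximation is unreliable; the example only shows the shape of the comparison, and a real analysis would use far more sequences or an exact test.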
S. Thorvaldsen, T. Flå and N. P. Willassen: Extracting molecular diversity between populations through sequence alignments. Lecture Notes in Bioinformatics, Vol. 3745, Springer-Verlag 2005, pp. 317-328.
S. Thorvaldsen, E. Ytterstad and T. Flå: Property-dependent analysis of aligned proteins from two or more populations. Proceedings of the 4th Asia-Pacific Bioinformatics Conference (Eds.: T. Jiang et al.). Imperial College Press 2006, pp. 169-178.
Download Matlab code of the DeltaProt Toolbox here!
This software can be used freely for academic, non-profit purposes.
Note: Matlab Statistical Toolbox is used in some functions.
CoryneRegNet: An Integrative Bioinformatics Platform for the Analysis of Transcription Factors and Regulatory Networks
Jan Baumbach, Sven Rahmann and Andreas Tauch, Bielefeld University, Germany
Background: Recently, several whole-genome sequencing projects have generated huge amounts of data related to various organisms, including gene sequences and their functional annotations. Since gene activity varies under different conditions, the goal is to understand the process of transcriptional regulation. The application of post-genomic analysis techniques to bacterial genome sequences provides knowledge about the encoded proteins involved in gene regulation. This data, along with literature-derived knowledge on the regulation of gene expression, has opened the way for genome-wide reconstruction of transcriptional regulatory networks. These large-scale reconstructions can be converted into in silico models of corynebacterial cells that allow systematic analysis of network behaviour in response to changing environmental conditions. Besides pathogenic corynebacterial species of medical importance, such as C. jeikeium and C. diphtheriae, which causes the upper respiratory tract illness diphtheria, other corynebacteria such as C. glutamicum and C. efficiens are traditionally used in biotechnological production processes, particularly of amino acids.
Description: CoryneRegNet is an ontology-based data warehouse designed to facilitate the genome-wide reconstruction of transcriptional regulatory networks of corynebacteria relevant in biotechnology and human medicine. CoryneRegNet is based on a multi-layered, hierarchical and modular concept of transcriptional regulation and was implemented using an ontology-based data structure. We integrated PoSSuMsearch, a fast and statistically sound method to detect transcription factor binding site motifs within and across species. CoryneRegNet provides a user-friendly interface to PoSSuMsearch, which is the only available software package that is fast enough to provide interactive response times for large-scale Position Specific Scoring Matrix (PSSM) searches and at the same time integrates exact statistics for p-value computations. Reconstructed regulatory networks can be visualized on a web interface and as graphs. Special graph layout algorithms have been developed and implemented to facilitate the comparison of gene regulatory networks across species and to assist biologists with the evaluation of predicted and graphically visualized networks.
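For readers unfamiliar with PSSM searches, the sketch below shows the basic operation: slide the matrix along a sequence, sum the per-position scores of each window, and report windows that reach a threshold. It is a naive scan written for illustration; the matrix values, the threshold and the function name `scan_pssm` are assumptions, and the code does not implement the enhanced suffix array method or the p-value statistics of PoSSuMsearch.

```python
def scan_pssm(sequence, pssm, threshold):
    """Return (start, score) for every window scoring at least `threshold`.

    `pssm` is a list with one dict per motif position, mapping a base to
    its score. Bases missing from a column get that column's lowest score.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(base, min(col.values()))
                    for col, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical integer scores for a TATA-like motif of width 4.
PSSM = [
    {"A": -2, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -2},
    {"A": -2, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -1},
]

hits = scan_pssm("GGTATACCTATT", PSSM, threshold=8)
print(hits)  # [(2, 12), (8, 8)]
```

The naive scan costs time proportional to sequence length times motif width for each matrix, which is why index-based methods matter when many matrices are searched against whole genomes.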
Conclusion: CoryneRegNet supports pertinent data management of regulatory interactions together with the genome-scale reconstruction of transcriptional regulatory networks. These models can further be combined with metabolic networks to build integrated models of cellular function.
The public release of CoryneRegNet is freely accessible at https://www.cebitec.uni-bielefeld.de/groups/gi/software/coryneregnet/. The final slash (/) is mandatory. To use the GraphVis feature, Java (version 1.4.2 or later) is required.
1. Baumbach J, Brinkrolf K, Czaja LF, Rahmann S, Tauch A: CoryneRegNet: An ontology-based data warehouse of corynebacterial transcription factors and regulatory networks. BMC Genomics 2006, 7(1):24.
2. Beckstette M, Strothmann D, Homann R, Giegerich R, Kurtz S: PoSSuMsearch: Fast and Sensitive Matching of Position Specific Scoring Matrices using Enhanced Suffix Arrays. GI Lecture Notes in Informatics 2004:53-64.