An Overview of the Biological Data Management and Technology Center

Background

Biological data management, which addresses the collection, storage, organization, management, retrieval, and integration of rapidly expanding, evolving, and heterogeneous biological data, is today considered one of the most critical areas of modern data-intensive biological research.[1] The main focus of the past several years has been the development of methods and technologies supporting high-throughput generation of biological data, such as DNA sequence and gene expression data. Compared to the rapid advances in instrumentation, biological data management is still relatively immature.

Commercial Biological Data Management

In pharmaceutical and biotech companies, biological data management supports research and development in various stages of drug development. Data from a variety of experiments, such as gene expression experiments, need to be collected, interpreted, validated, tracked, managed, and integrated. Native experimental data often need to be set in the context of extensive annotations collected from diverse public and private biological data repositories, and therefore further data collection, validation, and integration are required. Biological data management activities in industry settings are carried out by specialized bioinformatics groups, involve mainly commercial off-the-shelf (COTS) data management systems and tools, and are subject to data acquisition and tracking (data provenance) requirements, which are sometimes mandated.[2] Custom data management systems are sometimes developed in order to address application-specific requirements that cannot be satisfied using COTS systems or tools.[3]

Bioinformatics companies are mainly focused on developing tools, such as LION's DiscoveryCenter,[4] for managing, integrating, and/or analyzing biological data. Data from public biological data sources are sometimes also assembled, re-packaged, and provided as part of a data integration platform, such as LION's SRS system.

Public Biological Data Management

Public biological data are usually generated in specialized laboratories or centers, such as the Joint Genome Institute (JGI), and then collected by data centers, such as those at NCBI [5] and EBI,[6] where data may undergo some level of annotation.

Public data centers sometimes prefer using free, rather than commercial, data management systems, such as MySQL, or develop native data management systems, such as the AceDB system used for managing databases at the Sanger Institute.[7] Reasons for this preference include the high cost of commercial systems such as Oracle, compounded by the expense of the trained data management professionals needed to use such systems effectively, and the limited support commercial systems offer for structures and operations specific to biological data. Public data centers operate under less stringent data provenance requirements than their industry counterparts.

In addition to the large centers, such as NCBI and EBI, that manage community repositories, there are also numerous smaller scale centers, such as CBIL,[8] developing specialized biological databases. Such centers work under more restricted funding, and therefore follow less stringent maintenance policies. Some of these centers also have an educational role (for example, CBIL is part of the University of Pennsylvania's bioinformatics program) and engage in more forward-looking R&D projects.


Biological Data Management Challenges

Biological data management involves the traditional areas of data generation and acquisition, data modeling, data integration, and data analysis. Technology platforms for generating biological data present data management challenges arising from the need to capture, organize, interpret, and archive vast amounts of experimental data. Platforms keep evolving, with new versions benefiting from technological improvements, such as higher density arrays and better probe selection for microarrays.[9] This evolution raises the additional problem of collecting potentially incompatible data generated using different versions of the same platform, a problem encountered both when these data need to be integrated and when they need to be analyzed. Further challenges include qualifying the data generated using inherently imprecise tools and techniques and the high complexity of integrating data residing in diverse and poorly correlated repositories.

A number of biological data management challenges have been examined in the context of both traditional and scientific database applications. When considering these challenges, it is important to determine whether they require new or additional research, or can be addressed by adapting and/or applying existing data management tools and methods to the biological domain. Successful commercial biological data management systems and products[10] suggest that existing data management tools and methods, such as commercial database management systems, data warehousing tools, and statistical methods, can be adapted effectively to the biological domain. For example, the development of Gene Logic's gene expression data management system has involved modeling and analyzing microarray data in the context of gene annotations (including sequence data from a variety of sources), pathways, and sample (e.g., morphology, demography, clinical) annotations, and has been carried out using or adapting existing tools.[11] Dealing with data uncertainty or inconsistency for experimental data has required statistical, rather than traditional data management, methods; adapting statistical methods to gene expression data analysis at various levels of granularity has been the subject of intense research and development in recent years.[12] The most difficult problems have been encountered in the areas of data semantics and data slicing: the former concerns properly qualifying data values (e.g., an estimated expression value) and their relationships, especially in the context of continuously changing platforms and evolving biological knowledge, while the latter concerns identifying the logical units of data for analysis in order to allow effective data mining. While such problems are encountered across all biological data management areas, from data generation through data collection and integration to data analysis, the solutions require domain-specific knowledge and extensive data definition and curation work, with data management providing the framework (e.g., controlled vocabularies, ontologies) to address these problems.

A different, but no less serious, challenge is posed by the complexity of selecting methods and tools to develop a biological data management system. Such a system may involve a mix of commercial off-the-shelf (COTS) tools, open source software, and custom-developed software. COTS tool vendors, such as Oracle,[13] IBM,[14] and EMC,[15] have established Life Sciences divisions or programs that are dedicated to showing how their tools address key problems in a Life Science organization. However, the complexity of COTS tools poses a substantial challenge when devising a biological data management system. For example, while relational DBMSs have been used extensively for developing both commercial and public biological data management systems, employing a DBMS effectively is a demanding and complex task. Furthermore, COTS-based solutions could prove overly expensive and not necessarily optimal for a specific problem. Conversely, open source tools and software, such as MySQL, do not carry any up-front costs, but are sometimes more limited than COTS tools.

Solutions to biological data management challenges need to be considered in terms of complexity, cost, robustness, performance, and user and application-specific requirements, as well as in the context of well-defined timeframes: depending on context, partial but rapidly developed solutions may be more valuable than complete but time-consuming solutions. Systems that are appropriate in a given context may be inadequate in a different context. For example, a system that is appropriate for a small exploratory project confined to a small group is likely to be inadequate in a data-intensive environment with numerous users, where reliability, robustness, comprehensibility, and performance are critical.

Addressing data management challenges effectively requires expertise in several areas, such as data modeling, database administration, data sharing and security, software engineering, software and data management quality control, statistics, and data management infrastructure. Few organizations, especially in academia, can afford to set up data management groups because of the high complexity and cost involved. This problem can be addressed by pooling resources for a Data Management and Technology Center that can serve multiple organizations.


Biological Data Management and Technology Center

Rationale

The need for biological data, biocomputing, and software centers is discussed in DOE's Genomes to Life (GTL)[16] program and NIH's Roadmap for Accelerating Medical Discovery to Improve Health.[17] GTL envisions four different types of facilities generating data that would be organized in a variety of databases, including expression, proteomic, protein-function, chemistry, and pathway databases. Data generation in these facilities will be controlled using workflow management and/or Laboratory Information Management Systems (LIMS). Data will be collected, archived, and passed through a number of processing stages, including data annotation and integration. GTL also envisions computing infrastructure facilities in the form of software, biocomputing, and data centers. In particular, a "seamless and effectively centralized capability to deal with data" in the form of data centers that effectively collect and integrate large-scale biological data is seen as key to GTL's success.

Requirements for computing infrastructure have been discussed in a series of GTL workshops.[18] These workshops have identified a number of data management issues that are deemed important for GTL's success and that may require further research, but the workshops have not addressed the question of how the basic facilities and the various computing infrastructure facilities would interact. The specific goals and functions of a biological data center have also not been discussed at these workshops.

Structure and Functions

A data center as envisioned by the GTL initiative needs to address key data management challenges, including the massive and ongoing increase in the amount and range of biological data, the difficulty of quantifying the quality of data generated using inherently imprecise tools and techniques, and the high complexity of integrating data residing in diverse and sometimes poorly correlated repositories. Addressing these challenges requires a strategy for devising effective solutions that both respond to the immediate requirement of supporting ongoing data generation and pursue longer term goals.

A Biological Data Management and Technology Center should be based on the proven strengths of both commercial and public centers. Setting up such a center using industry practices in funding and organization helps maintain a focused effort in conjunction with the development of "industrial strength" databases and data management tools. A biological data center also needs high academic standards: the discipline and rigor that are required for the development of scientifically sound methods and techniques for generating and interpreting biological data.[19]

Rigorous data management practices and sound expertise are needed for addressing large-scale biological data generation, collection, and validation, which often involve complex data acquisition, tracking, and control systems. Such problems mainly require deploying or adapting existing tools and platforms, such as Laboratory Information Management Systems, Database Management Systems, and Data Warehouse tools. Accordingly, a biological data center needs qualified database management and administration professionals, software engineers for adapting and/or integrating tools, and (bio)statisticians for handling platform-specific data interpretation and validation. An important task for a biological data management center is to provide efficient, reliable, and secure access to its data to a large community of scientists as well as to other centers. This task can be addressed by using or adapting existing hardware or software technologies for data mirroring and data access.

A Biological Data Management and Technology Center also needs to pursue long term goals with regard to critical data management problems that cannot be resolved using existing technology. Since data management technology is evolving, the center must be involved in continuous and detailed technology assessment, including benchmarking[20] and cost assessment of potential solutions.[21] Cost effectiveness and the ability to take advantage of rapid technological advances without sacrificing quality, time, or cost should be built into data management solutions that are inherently evolving.

Research needs to be conducted in order to address critical problems that are not supported by existing technology. Collaboration with commercial companies, such as Oracle, Sun, and IBM, may defray the costs of such activities.

Goals

The main goal of the Biological Data Management and Technology Center will be to serve as a source of expertise in and provide support for data management activities at the Joint Genome Institute, Life Sciences and Physical Biosciences Divisions at LBNL, UCSF's Cancer Center, and other Biomedical and Biotechnology Centers in the Bay Area. The Center will provide services based on collaborations with these organizations. Collaborations with the Center will be cost effective by allowing multiple organizations to share the experience, skills, and data management technology at the Center.

Initially, the Center will focus on providing support to the Joint Genome Institute (JGI), where a number of areas that will benefit from the Center's services have been identified. JGI provides key sources of data to be managed at the Center as well as the Center's initial biological programmatic context. Several JGI data management areas that could be improved are briefly discussed below. Once the Center is established, additional areas in which the Center can provide support for JGI will be identified after further review of JGI's planned activities.

  • Sequence Data Organization and Retrieval. The Production Genomics Facility (PGF) produces about 2 million files per month of trace data, 100 assembled projects per month, and several very large assembled projects per year. PGF is currently increasing its sequencing capability, which increases the challenge of making data available online; online access to trace files may be required for quality control and functional genomics purposes. The Center will provide PGF with a solution to this problem. Specific tasks will include revising existing procedures for capturing information (metadata) about sequence data files and grouping these files in order to improve their organization at all levels of granularity, and developing mechanisms for automatic organization of these files as well as for their effective retrieval and processing.

  • JGI Portal. JGI makes its completed sequences available to the scientific community through a web portal. The Center will assist JGI in enhancing its portal. Specific tasks will include reviewing the organization of the current portal, working jointly with JGI's portal group to enhance its functionality (for example, through a closer integration of sequence data with genome annotations, such as functional and pathway information, in other public genome data resources), and extending the portal's search and query capabilities. Another area of improvement for the portal that will be considered is online management of sponsored sequencing projects, whereby sponsors would be able to follow the progress of their sequencing projects and gain access to data without delay.

  • Microbial Sequencing. JGI is ramping up its microbial sequencing efforts and is starting work in the new area of "community sequencing", which involves novel microbial genomes from environmental samples found in a diverse range of habitats. This new type of sequencing requires a change in the way sequence data are modeled, including support for acquiring and storing contextual (e.g., environmental) properties that are essential in characterizing the sequence data. Although completed sequences will continue to be deposited in GenBank to allow public access to these data, GenBank may not be sufficient for holding all the information associated with community sequencing data. The Center will address this problem by devising a new sequence data resource that would complement GenBank and include data that do not fit GenBank. Specific tasks will include gathering requirements specific to the community sequencing activities, designing and developing a data management system for acquiring both sequence and contextual data, and developing a data resource that stores these data and is available to the scientific community.
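
The file organization task described under Sequence Data Organization and Retrieval above can be illustrated with a minimal sketch. It assumes each trace file carries a small metadata record and that "levels of granularity" correspond to nested metadata fields. The record type `TraceFile`, its fields (`project`, `library`, `plate`), the helper `group_files`, and the sample paths are all hypothetical names chosen for illustration; they do not describe PGF's actual procedures or file layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TraceFile:
    """Hypothetical metadata record for one trace file (all fields are assumptions)."""
    path: str      # where the trace file is stored
    project: str   # sequencing project the file belongs to
    library: str   # library within the project
    plate: str     # plate within the library


def group_files(files: Iterable[TraceFile], *levels: str) -> Dict[Tuple[str, ...], List[TraceFile]]:
    """Group trace files by the named metadata fields, coarsest level first."""
    groups: Dict[Tuple[str, ...], List[TraceFile]] = defaultdict(list)
    for f in files:
        # The grouping key is the tuple of metadata values at the requested levels.
        key = tuple(getattr(f, level) for level in levels)
        groups[key].append(f)
    return dict(groups)


# Invented example records, for illustration only.
files = [
    TraceFile("traces/a001.scf", "projA", "lib1", "plate01"),
    TraceFile("traces/a002.scf", "projA", "lib1", "plate02"),
    TraceFile("traces/a003.scf", "projA", "lib2", "plate01"),
    TraceFile("traces/b001.scf", "projB", "lib1", "plate01"),
]

by_project = group_files(files, "project")             # coarse grouping
by_library = group_files(files, "project", "library")  # finer grouping

for key, members in sorted(by_library.items()):
    print(key, [m.path for m in members])
```

Grouping on a tuple of fields keeps the same function usable at any level of granularity; a production system would more likely keep such metadata in a database and express the grouping as queries.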

Longer term, the Center will establish collaborative relationships with and provide support for biological research programs at LBNL's Life Sciences and Physical Biosciences Divisions and the Biotechnology Programs at UCB and UCSF, and will be involved in future National Centers set up in the Bay Area. Dr. Joe Gray, head of LBNL's Life Sciences Division, has expressed strong support for establishing a Biological Data Management and Technology Center and has pledged to involve it in future proposals that have a data management component.

Lawrence Berkeley National Lab (LBNL), with its premier multidisciplinary research environment, provides an ideal location for a biological data center. In particular, NERSC and the ongoing image analysis, visualization, and scientific data management research in the Computational Research Division can complement the data interpretation, visualization, and analysis efforts in a Biological Data Management and Technology Center.

From an educational point of view, the center will provide an ideal environment for students to gain practical experience in large scale biological data management and analysis, and can draw upon and complement programs in the Computer Science, Statistics, and Bioengineering Departments at UC Berkeley.


1. "Bioinformatics: Getting Results in the Era of High-Throughput Genomics", Branca, M.A., Goodman, N., and Venkatesh, T.V., Cambridge Healthtech Institute Report 9, May 2001.

2. FDA, Guidance for Industry, Part 11, Electronic Records; Electronic Signatures — Scope and Application, http://www.fda.gov/cder/guidance/5667fnl.htm.

3. An example of such a system is described in "Process Biology: Managing Information Flow for Improved Decision Making in Preclinical R&D", Reidhaar-Olson, J.F., Ohkawa, H., Babiss, L.E., and Hammer, J., Preclinica, Vol. 1, No. 4, 2003.

4. LION DiscoveryCenter, http://www.lionbioscience.com/solutions/discoverycenter.

5. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/.

6. European Bioinformatics Institute Databases, http://www.ebi.ac.uk/Databases/.

7. AceDB, Sanger Institute, http://www.acedb.org/.

8. Computational Biology and Informatics Laboratory in the Center for Bioinformatics at the University of Pennsylvania, http://www.cbil.upenn.edu/.

9. "DNA Microarray Informatics: Key Technological Trends and Commercial Opportunities", Branca, M.A. and Goodman, N., Cambridge Healthtech Institute Report 19, February 2002.

10. For example, see gene expression data products such as Gene Logic's Genesis Enterprise System, http://www.genelogic.com/solutions/genesis/, Silicon Genetics's GeneSpring System, http://www.silicongenetics.com/cgi/SiG.cgi/Products/GeneSpring/index.smf, and Rosetta's Resolver System, http://www.rosettabio.com/products/resolver/default.htm.

11. Markowitz, V.M., Campbell, J., Chen, I.A., Kosky, A., Palaniappan, K., and Topaloglou, T., "Integration Challenges in Gene Expression Data Management." Chapter in Bioinformatics: Managing Scientific Data, Morgan Kaufmann Publishers (Elsevier Science), 2003, pp. 277-301.

12. See for example, http://oz.berkeley.edu/users/terry/zarray/Html/index.html.

13. Oracle, Solutions for Life Sciences, http://www.oracle.com/industries/life_sciences/index.html?content.html.

14. IBM Life Sciences, http://www-3.ibm.com/solutions/lifesciences/.

15. EMC, Life Sciences Infrastructure Solutions, http://www.emc.com/vertical/pdfs/life_sciences/interstitial_data_warehouse.jsp.

16. "User facilities for 21st Century Systems Biology: Providing Critical Technologies for the Research Community", Department of Energy, Office of Biological and Environmental Research, November 2002, http://www.doegenomestolife.org/pubs.html.

17. NIH Roadmap: Bioinformatics and Computational Biology, http://nihroadmap.nih.gov/bioinformatics/index.asp.

18. Mathematics for GTL Workshop, Gaithersburg, Maryland; March 18-19, 2002, http://www.doegenomestolife.org/pubs/GTLMath-6.pdf. Computer Science for GTL Workshop, Gaithersburg, Maryland; March 6-7, 2002, http://www.doegenomestolife.org/compbio/mtg_1_22_02/infrastructure.pdf.

19. For example, gene expression data interpretation methods have been improved in recent years mainly due to active academic research; see, for example, A Benchmark for Affymetrix GeneChip Expression Measures, http://affycomp.biostat.jhsph.edu/.

20. Benchmarking is needed in order to gain a good understanding of existing technologies, beyond the hype usually surrounding them.

21. Industry (so-called P&L) cost assessment is a good way of determining both the short and long term advantage of developing in-house solutions compared to acquiring off-the-shelf solutions.

 
