April 12, 2005
The Biological Data Management and Technology Center (BDMTC) at Lawrence Berkeley National Laboratory marked its first anniversary with the release of the Integrated Microbial Genomes (IMG) system, a complex data management system developed in collaboration with the Joint Genome Institute. Developed as a community resource, IMG will integrate JGI's microbial genome data with publicly available microbial genome data, and thus provide a powerful comparative context for microbial genome analysis.
While IMG is the first academic “product” BDMTC undertook, its success “demonstrates the viability of the center's rationale,” said BDMTC head Victor Markowitz, who launched the center in January 2004. “BDMTC is based on the premise that addressing effectively biological data management challenges requires extensive data management and system development experience and expertise consolidated in a central core,” according to Markowitz.
Creating a data management center at Berkeley Lab directly addresses issues raised in recent assessments by the DOE and National Institutes of Health (NIH) with regard to the need for improved data management and software development capabilities. In particular, the NIH documents recommend employing advanced data management technologies and software engineering principles for delivering robust and reliable tools and systems for biomedical research.
Rationale for BDMTC
Biological data management involves data generation and acquisition, data modeling, data integration and data analysis. Data management poses challenges on several fronts. First, there are the increasing amounts of experimental data generated by life science applications. Next is the difficulty of qualifying data generated using inherently imprecise tools and techniques. Finally, there is the complexity of integrating data residing in diverse and poorly correlated repositories.
At research institutions such as LBNL and the University of California, San Francisco (UCSF), biological data management systems have typically been developed with an eye toward rapid development and low cost. This often meant that minimal consideration was given to requirements analysis, system development practices, system evolution, maintenance and scalability. While such an approach was perceived as less expensive because it could be achieved without experienced data management professionals and software engineers, the savings came at the expense of overall system quality, including reliability, maintenance and evolution.
The problems associated with academic/research systems and software have been recognized and addressed in two NIH reports: “The Biomedical Information Science and Technology Initiative,” prepared by the Working Group on Biomedical Computing, Advisory Committee to the NIH Director, and the NIH Roadmap for Accelerating Medical Discovery to Improve Health. Both documents recommend employing advanced data management technologies for developing interoperable biomedical databases and software engineering practices for delivering robust and reliable systems and tools.
Following NIH's recommendations requires expertise in several areas, such as data modeling, data integration, database administration, data sharing and security, software engineering, and software and data management quality control. Due to the complexity and cost involved, few public institutions can afford to acquire such expertise. Therefore, a central core such as BDMTC could provide an effective solution to this problem. BDMTC's premise is also consistent with DOE's Genomes to Life (GTL) program, which envisions consolidated computing infrastructure facilities in the form of software, biocomputing and data centers. In particular, a “seamless and effectively centralized capability to deal with data” in the form of data centers collecting and effectively integrating large-scale biological data is seen as key to GTL's success.
Exploring Partnership Possibilities
Over the course of its first year, members of BDMTC approached a number of academic organizations in the Bay Area, both to assess their data management needs and to identify potential areas for collaboration. The organizations included the Berkeley Structural Genomics Center (BSGC), the Joint Genome Institute (JGI), the P50 Integrative Cancer Biology Program (ICBP) in the Life Sciences Division at LBNL, and the Immune Tolerance Network (ITN) at UCSF.
The analysis of JGI's data management goals subsequently led to the development of IMG. BSGC's data management needs, in particular in the area of experimental data tracking and work scheduling, were examined in order to prepare the Laboratory Information Management Systems (LIMS) and data management component of BSGC's PSI-II application for a Large Scale Structural Genomics Center. ICBP's data management core provided a concrete framework for exploring caBIG-related opportunities for providing better data management support for NCI-sponsored programs and centers.
BDMTC has also pursued collaborations with the Immune Tolerance Network (ITN) at UCSF and was part of UC Berkeley's proposal for a National Center for Biomedical Computing (NCBC): the former was not finalized because of budget cuts, and the latter was not selected for funding. However, these initiatives provide additional evidence for BDMTC's potential to establish collaborations. In particular, NIH's call for establishing NCBCs envisions software development and data management cores similar in scope to BDMTC.
Challenges and Plans
While there is clearly a growing need for enhanced data management tools, there is also a preference among many life science groups for a do-it-yourself approach, rather than collaborating with other groups or centers. An additional challenge is posed by the emphasis put on experimental results over data management: the relatively low budgets these groups assign to data management may have to be reconciled with the cost of outside collaborations.
In 2005, BDMTC will continue to pursue collaboration opportunities, primarily at LBNL, UC Berkeley and UCSF. Part of BDMTC's outreach efforts will be helping research groups realize that collaborations could lead to higher quality data management results, less duplicated effort, and savings from sharing resources and expertise.
BDMTC is involved in preparing a new GTL proposal, “Metagenomics enabled analysis of termite hindguts for biomass conversion and cleaner energy” (Eddy Rubin, PI). If funded, this project will involve the development of an Integrated Microbial Community Genome (IMCG) data management system that will serve as a community resource for metagenome data generated by this project, integrated with data collected from public sources. IMCG will benefit from the experience gained by BDMTC in developing IMG, as well as from sharing components with IMG. IMCG will also provide an opportunity for devising an effective way of applying NERSC's computational and storage infrastructure to life science data management applications. BDMTC will also provide support to scientists at JGI and LBNL's Genomics Division for preparing NIH R-01 proposals in the metagenome-biomedical domain.
BDMTC's involvement in discussing the data management needs for the P50 Integrative Cancer Biology Program at LBNL has led to the identification of new opportunities in the context of the National Cancer Institute's Center for Bioinformatics (NCICB) initiatives. NCICB is developing a suite of data management systems and software tools for the cancer research community, with the goal of improving data sharing and system interoperability through the use of common systems and tools. Adapting NCICB tools to program- or project-specific requirements can be addressed through the cancer Biomedical Informatics Grid (caBIG) program. BDMTC will work with the P50 group in pursuing caBIG-related funding in support of long-term data management goals for P50 and related programs at UCSF.
Developing productive collaborations will require a change in the way groups tend to operate, Markowitz said. Life science groups would benefit from a higher level of collaboration in the area of biological data management system and bioinformatics tool development. Data management and software engineering groups need to improve their ability to support life science applications through enhanced understanding of these applications. Prerequisites for such an endeavor include finding incentives to encourage collaborations and raising the awareness of the critical role played by biological data management in competing for large-scale projects or centers such as those envisioned by the GTL and NIH programs.