M3 Data Abstraction Library

1.0.0

Motivation

Complexity of CMB Data

Cosmic Microwave Background (CMB) data is usually stored in many files. These files are interrelated and here we will refer to these relationships as the data layout. Each of these files may represent the data in a different way on disk, and here we will refer to the representation as the data format.

Impact on Analysis Applications

The relationships between data files on disk are potentially quite complex, and CMB data files can come in a wide range of formats. This variability in layout and format affects the way CMB data analysis applications are written. A CMB data analysis application programmer has three possible ways of dealing with it. The first two have clear limitations.

Addressing the Problem

Three Possible Solutions

One solution is to convert the input data files to a fixed format and layout that is unique to the data analysis software. The second solution is for the data analysis application developer to maintain a different version of the application for every layout and formatting scheme. The third solution is to place a middle layer software interface between the application and the data, so that the application developer does not need to know the format and layout of the data.

Fixed File Format and Layout

The first solution is intensive in both disk and programmer resources. It can mean that a different version of the data must be created for every analysis application used with the data. For large data sets and large collaborations this quickly becomes a significant limitation. The first solution also duplicates programming effort: for every combination of data analysis code, data format, and data layout, someone must write software that efficiently converts the data.

Multiple Application Versions

The second solution is intensive in programmer resources. Maintaining multiple versions of a single code can be very difficult. One difficulty is not knowing whether the different versions behave identically apart from their handling of data layout and data format. A change made to one version of the analysis software may not be made in the others. This leads to more debugging and testing of each code, and to frustration for the developer. With this solution there is also no way to analyze data in multiple formats or layouts simultaneously without resorting to file format conversions.

Middle Layer Software Interface

The third solution addresses all of the above-mentioned shortcomings of the first two and adds some useful benefits. The primary benefits of the M3 interface are flexible data management, consolidated software development, the ability to chain analysis applications together, and the ability to manipulate data as it is read.

Benefits of the Middle Layer Approach

Data Volume and Flexibility

With the M3 middle layer data interface there is no need to reformat data or convert data layouts, so multiple versions of the data do not strain disk space. Developers of data analysis applications can access large and complicated data sets in an organized and cohesive fashion without knowing the data file format or layout. Through the M3 interface, applications can analyze data with mixed formats and layouts.

Single Solution

By using the M3 data interface, analysis application programmers inherit the functionality built into the library. M3 provides database functionality that has been designed specifically for CMB data analysis applications. The package provides extensive on-the-fly data manipulations (e.g. calibration, linear combinations, coordinate conversions). The package is growing in functionality, and modular extensions are being developed by several programmers in a growing collaboration. The authors of two analysis applications that use the M3 library on the same data can be sure that both applications interpret the data identically.
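To make "on-the-fly data manipulation" concrete, the following is a minimal sketch of calibration and a linear combination applied while samples are being read, so that the data on disk is never rewritten. It is written in Python purely for brevity and is not M3's API: the function names `read_with_ops` and `linear_combination`, the `gain` and `offset` parameters, and the sample values are all assumptions made for illustration.

```python
def read_with_ops(raw_samples, gain=1.0, offset=0.0):
    """Yield calibrated samples lazily; the raw input is left untouched.

    Hypothetical helper, not part of M3.
    """
    for s in raw_samples:
        yield gain * s + offset


def linear_combination(streams, weights):
    """Combine equal-length streams as sum_i weights[i] * streams[i]."""
    for values in zip(*streams):
        yield sum(w * v for w, v in zip(weights, values))


# In-memory stand-ins for two detector timestreams stored on disk.
raw_a = [1.0, 2.0, 3.0]
raw_b = [0.5, 0.5, 0.5]

# Each stream gets its own calibration as it is read.
cal_a = read_with_ops(raw_a, gain=2.0)
cal_b = read_with_ops(raw_b, gain=4.0, offset=-1.0)

# The application asks for the difference of the calibrated streams.
diff = list(linear_combination([cal_a, cal_b], [1.0, -1.0]))
```

Because the operations are applied as the samples stream through, the application only ever sees the calibrated, combined values, and `raw_a` and `raw_b` remain unchanged.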

Application Chaining

One of the great advantages of the M3 library is that analysis applications that use it can be chained together in a data analysis pipeline. The back end interface of the M3 library is an XML data description file. This file serves as a record on disk of exactly how a set of data files is to be interpreted together, and it can be updated and passed from one application to another. The XML record allows the output data from one analysis application to be passed as input to another. This concept can be extended to create a long analysis application chain, or analysis pipeline. Because input of data is format and layout blind, analysis applications in the chain can output their data in whatever format and layout they prefer. The library is not magic, so code to read new data formats must be added to the library, but these additions are modular and extensible, and they are inherited by all users of the library.
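The chaining idea can be sketched as follows, under loudly stated assumptions: the XML element and attribute names (`run`, `file`, `name`, `format`, `role`), the file names, and the helper functions are invented for this illustration and are not M3's schema or API. The sketch uses Python's standard `xml.etree.ElementTree` module and keeps the record in memory rather than on disk.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; element and attribute names are NOT M3's schema.
RECORD = """<run>
  <file name="tod_raw.dat" format="binary" role="timestream"/>
</run>"""


def list_files(record_xml, role):
    """Return (name, format) for every file entry with the given role."""
    root = ET.fromstring(record_xml)
    return [(f.get("name"), f.get("format"))
            for f in root.findall("file") if f.get("role") == role]


def register_output(record_xml, name, fmt, role):
    """Return a new record that also describes a file a stage just wrote."""
    root = ET.fromstring(record_xml)
    ET.SubElement(root, "file", name=name, format=fmt, role=role)
    return ET.tostring(root, encoding="unicode")


# Stage 1 looks up its inputs, writes its own output in whatever format
# it prefers (not shown), then records that output in the description.
inputs_stage1 = list_files(RECORD, "timestream")
updated = register_output(RECORD, "map.fits", "fits", "map")

# Stage 2 is handed the updated record and finds the new file through it.
inputs_stage2 = list_files(updated, "map")
```

The only thing passed between the two stages is the updated record, which is what lets each stage choose its own output format, provided the reading side recognizes that format.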

M3: The Middle Layer Paradigm

The M3 library serves as a middle layer of software that connects the data on disk to the application programmer. The M3 library can be thought of as a black box with two interfaces. The front end interface is an object-oriented application programmer interface (API) that can be called from C, C++, FORTRAN 77 and FORTRAN 90/95. This front end API provides the programmer with data navigation and data input routines. The back end interface is an XML-based, CMB-specific data structure that relates the files containing CMB data for an analysis run. The XML file can be edited by hand in a text or XML editor. Function calls are also available through the front end API for editing the data structure in memory and writing it to disk in XML format.
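As an illustration of the black-box idea, here is a minimal sketch of a format-blind reading layer. It is not M3's API, which is called from C, C++ and Fortran; Python is used only to keep the example short. The class names, the `read_samples` function, and the use of CSV and JSON as the two stand-in formats are all assumptions chosen so that the example is self-contained.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class FormatReader(ABC):
    """Hypothetical back-end reader: one subclass per on-disk format."""

    @abstractmethod
    def read(self, text):
        """Return the samples stored in `text` as a list of floats."""


class CsvReader(FormatReader):
    def read(self, text):
        rows = csv.reader(io.StringIO(text))
        return [float(v) for row in rows for v in row]


class JsonReader(FormatReader):
    def read(self, text):
        return [float(v) for v in json.loads(text)["samples"]]


# Supporting a new format means adding one entry here; callers do not change.
READERS = {"csv": CsvReader(), "json": JsonReader()}


def read_samples(entry):
    """Front-end call: the application never inspects entry['format']."""
    return READERS[entry["format"]].read(entry["payload"])


# Two in-memory "files" holding the same samples in different formats.
layout = [
    {"format": "csv", "payload": "1.0,2.5\n-0.5\n"},
    {"format": "json", "payload": '{"samples": [1.0, 2.5, -0.5]}'},
]
results = [read_samples(e) for e in layout]
```

The application code calls `read_samples` the same way for both entries, which mirrors how a middle layer lets one analysis run mix formats without the application knowing.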

middleDiagram.jpg

Diagram of M3 Functionality

In this diagram the red arrows correspond to data requests and the blue arrows represent data transfers. The M3 library serves as a format and layout blind reading interface to the data. It is important to note that data analysis applications do not write their output through the M3 library; output is left to the analysis programmer. Analysis applications that use the M3 library can be chained together by altering the XML back end interface to reflect the changes the analysis code has made on disk. Chaining has one additional functional requirement: the output format must be recognized by the M3 library. Changes to the XML can be made by hand in a text or XML editor, or through functions available as part of the front end interface.


Generated on Mon Nov 24 10:05:11 2008 for M3 by  doxygen 1.5.3-20071008