FastBit is an open-source data processing library in the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It recognizes user data in the column-oriented fashion similar to MonetDB and Vertica. Because it is available as a library, the users are free to build their own data processing system on top of it. In particular, the user data is NOT required to be under the control of FastBit software.
An index in a database system is a data structure that utilizes redundant information about the base data to speed up common searching and retrieval operations. Most of the commonly used indexes are variants of B-trees, such as B+-tree and B*-tree. FastBit implements a set of alternative indexes called compressed bitmap indexes. Compared with B-tree variants, these indexes provide very efficient searching and retrieval operations but are somewhat slower to update after a modification of an individual record.
In addition to the well-known strengths of bitmap indexes, FastBit has a special strength stemming from the bitmap compression scheme used. The compression method is called the Word-Aligned Hybrid (WAH) code. It reduces the bitmap indexes to reasonable sizes, and at the same time allows very efficient bitwise logical operations directly on the compressed bitmaps. Compared with the well-known compression methods such as LZ77 and Byte-aligned Bitmap code (BBC), WAH sacrifices some space efficiency for a significant improvement in operational efficiency [SSDBM 2002, DOLAP 2001]. Since the bitwise logical operations are the most important operations needed to answer queries, using WAH compression has been shown to answer queries significantly faster than using other compression schemes.
Theoretical analyses showed that WAH compressed bitmap indexes are optimal for one-dimensional range queries. Only the most efficient indexing schemes such as B+-tree and B*-tree have this optimality property. However, bitmap indexes are superior because they can efficiently answer multi-dimensional range queries by combining the answers to one-dimensional queries.
FastBit process queries on one table at a time. Currently, there are two sets of interface for query processing, one more abstract and the other more concrete. The more abstract interface is represented by the class ibis::table and the more concrete interface is represented by the class ibis::part. A table (with rows and columns) is divided into groups of rows called data partitions. Each data partition is stored in a column-wise organization known as vertical projections. At the abstract level, queries on a table produces another table in the spirit of the relational algebra. At the concrete level, the queries on data partitions produce bit vectors representing rows satisfying the user specified query conditions.
The main class representing this interface is ibis::table. The main query function of this class is ibis::table::select, whose functionality resembles a simplified form of the SELECT statement from the SQL language. This function takes two string as arguments, one corresponds to the select clause in SQL and the other corresponds to the where clause. In the following, we will call them the select clause and the where clause and discuss the requirements and restriction on these clauses. The function ibis::table::select returns a new ibis::table when it completes successfully. This new table can be used in further query operations.
The select clause passed to function ibis::table::select can only contain column names separated by comma (,). Aggregate operations such as MIN, MAX, AVG, SUM, VARPOP, VARSAMP, STDPOP, STDSAMP, or DISTINCT are supported through another function named ibis::table::groupby. A group-by operation normally specified as one SQL statement needs to be split into two FastBit, one to select the values and the other to perform the aggregation operations. We've taken this approach to simplify the implementation. These aggregation operations are not directly supported by bitmap indexes, therefore, they are not essential to demonstrate the effectiveness of the bitmap indexes.
The where clause passed to function ibis::table::select can be a combination of range conditions connected with logical operators such as AND, OR, XOR, and NOT. Assuming that temperature and pressure are names of two columns, the following are valid where clauses (one on each line),
Using FastBit, one can only append new records to a table. These operations for extending a table is defined in the class ibis::tablex.
For most fixed-sized data, such as integers and floating-point values, FastBit functions expects raw binary data and also store them as raw binary, therefore the data files and index files are not portable across different platforms. This is common to both ibis::table interface and ibis::part interface. However, one difference is that ibis::table handles string values as
std::vector<std::string>, while the lower level interface ibis::part handles strings as raw
char* with null terminators.
The two key classes for query processing on a data partition are ibis::part and ibis::query, where the first represents the user data (or base data) and the second represents a user query. An ibis::part is primarily a container of ibis::column objects and some common information about the columns in a data partition. The class ibis::column has two specialization for handling string values, ibis::category for categorical values (keys) and ibis::text for arbitrary text strings.
The user query is represented as an ibis::query object. Each query is associated with one ibis::part object. The functions of the query class can be divided into three groups, (1) specifying a query, (2) evaluating a query, and (3) retrieving information about the hits. The queries accepted by FastBit are a subset of the SQL SELECT statement. Each query may have a WHERE clause and optionally a SELECT clause. Note that the FROM clause is implicit in the association with an ibis::part. The WHERE clause is a set of range conditions joined together with logical operators, e.g.,
A = 5 AND (B between 6.5 and 8.2 OR C > sqrt(5*D)). The SELECT clause can contain a list of column names and some of the four functions AVG, MIN, MAX, SUM, VARPOP, VARSAMP, STDPOP, STDSAMP and DISTINCT. Each of the functions can only take a column name as its argument. If a SELECT clause is omitted, it is assumed to be "SELECT count(*)." We refer to this type of queries as count queries since their primary purpose is to count the number of hits.
To evaluate a query, one calls either ibis::query::estimate or ibis::query::evaluate. After a query is evaluated, one may call various function to find the number of hits (ibis::query::getNumHits), the values of selected rows (ibis::query::getQualifiedInts, ibis::query::getQualifiedFloats, and ibis::query::getQualifiedDoubles), or the bitvector that represents the hits (ibis::query::getHitVector).
The indexes are considered auxiliary data, therefore even though they involve much more source files than ibis::part and ibis::query, they are not essential from a users point of view. In FastBit, the indexes are usually built automatically as needed. However, there are functions to explicitly force FastBit to build them through ibis::table::buildIndex, ibis::part::buildIndex and their variants.
Currently, all indexes are in a single class hierarchy with ibis::index as the abstract base class. The most convenient way to create an index is calling the class function ibis::index::create. One can control what type of bitmap index to use by either specifying an index specification for a whole table by calling ibis::table::indexSpec, for a whole data partition by calling ibis::part::indexSpec, or for each individual column by calling ibis::column::indexSpec. The index specification along with other metadata are written to a file named
-part.txt in the directory containing the base data and the index files. The directory name is needed when constructing an ibis::part. This information may be indirectly provided through an RC file specified to the function ibis::init.
The software programer gratefully acknowledges the support from the research colleagues Kurt Stockinger, Ekow Otoo and Arie Shoshani. They are crucial in establishing the foundation of the FastBit system and applying the software to a number of applications. Many thanks to the early users. Their generous feedback and suggestions are invaluable to the development of the software. A full list of contributors is in the file AUTHORS. User feedback affecting the code and documentation is recorded in ChangeLog along with user names.
This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 and DE-AC03-76SF00098. It also uses resources of the National Energy Research Scientific Computing Center.
Send any comments, bug reports, and patches to firstname.lastname@example.org@m.sers@.email@example.com@firstname.lastname@example.org@m.v.