View on GitHub

Odoop

Opower's suite of useful Hadoop ecosystem libraries

The query planner is a system for running SQL-like queries against HBase. It is not intended as a complete implementation of SQL on top of HBase, but it is instead designed to be specific to HBase and thus uses its semantics. The current implementation is heavily biased towards helping run complex scans on wide tables.

Query Grammar

The currently implemented grammar is as follows (case sensitive):

query             := selectClause fromClause [ whereClause ]
selectClause      := scanClause
fromClause        := "from" tableName
whereClause       := "where" rowKeyConstraint
scanClause        := "scan" [ column { "," column } ]
column            := [ version ] family ":" qualifier [ timeRange ]
version           := "all versions of" | positiveNumber "versions of"
family            := /\w+/
qualifier         := literal | literal "*" | "*"
literal           := /([a-zA-Z0-9`~!@#$%^&()\-_=+\[\]\{\}\\|;:'".<>/?]|(\\x[0-9]{2}))+/
timeRange         := "between" parameter "and" parameter
tableName         := /\w[\w\-.]*/
rowKeyConstraint  := ( "rowkey" rowKeyOperator parameter | "rowkey between" parameter "and" parameter )
rowKeyOperator    := "<" | "<=" | ">" | ">=" | "="
parameter         := "{" /\w*/ "}"
positiveNumber    := /[1-9]\d*/

The data model for the example queries is roughly something like this:

HBase table named customer with a single column family: d. The row key is a synthetic primary key for a customer. Columns include:

Example Queries

Fetch the most recent address and all clicks that occurred between start and stop for a single row:

scan d:address,
     all versions of d:clicks between {start} and {stop}
from customer
where rowkey = {id}

Notice that the parameters of the query must be abstracted out. Literal values are not allowed. Instead all parameters must be demarcated with { and } and then set according to the parameter name.

Additional query functionality can include pattern matching at the column level like the following.

Fetch the four most recent versions of all columns from the d family:

scan 4 versions of d:* ...

Fetch the most recent version of all predictions:

scan d:preditions* ...

Additional constraints can be added to the where clause as well. This example shows a literal parameter value just for ease of understanding.

Assuming a click is deserialized as <target URL>|<referrer URL>, fetch all clicks coming from http://google.com:

scan all versions of d:clicks
from customer
where d:clicks =~ /.*\|http://google.com/

Fetch all customers between IDs 50 and 100, inclusive at the beginning, exclusive at the end:

scan * from customer where rowkey between 50 and 100

Example Usage

QueryPlanner planner = new QueryPlanner(new HTablePool(HBaseConfiguration.create()));
Query query = null;
ResultScanner scanner = null;
try {
    query = planner.parse("scan from customer where rowkey = {id}");
    query.setInt("id", 42);
    scanner = query.scan();
    for (Result result : scanner) {
        // do some real work
    }
}
finally {
    closeQuietly(scanner);
    closeQuietly(query);
    closeQuietly(planner);
}

The QueryPlanner is fully responsible for constructing an efficient Scan and managing the HTable and its connections.