Home
Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
Our mission is to become the standard means of accessing HBase data through a well-defined, industry standard API.
Tired of reading already and just want to get started? Jump over to our quick start guide here.
##SQL Support##
To see what's supported, go to our language reference. It includes all typical SQL query statement clauses, including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through our DDL commands. We try to follow the SQL standards wherever possible.
Use JDBC to get a connection to an HBase cluster like this:
Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
Connection conn = DriverManager.getConnection("jdbc:phoenix:server1,server2:3333");
where the connection string is composed of:
jdbc:phoenix [ :<zookeeper quorum> [ :<port number> ] [ :<root node> ] ]
For any omitted part, the corresponding property value (hbase.zookeeper.quorum, hbase.zookeeper.property.clientPort, or zookeeper.znode.parent, respectively) will be used from the hbase-site.xml configuration file.
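To make the composition rule concrete, here is a small sketch of how the optional URL parts combine. The helper class and method names are hypothetical, not part of the Phoenix driver:

```java
// Hypothetical helper illustrating how the optional parts of the Phoenix
// JDBC URL compose: jdbc:phoenix [ :<quorum> [ :<port> ] [ :<root node> ] ]
public class PhoenixUrl {
    public static String build(String quorum, Integer port, String rootNode) {
        StringBuilder url = new StringBuilder("jdbc:phoenix");
        if (quorum != null) {
            url.append(':').append(quorum);
            if (port != null) {
                url.append(':').append(port);  // else hbase.zookeeper.property.clientPort
            }
            if (rootNode != null) {
                url.append(':').append(rootNode);  // else zookeeper.znode.parent
            }
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Quorum and port given; root node falls back to hbase-site.xml.
        System.out.println(build("server1,server2", 3333, null));
        // Everything omitted: all three values come from hbase-site.xml.
        System.out.println(build(null, null, null));
    }
}
```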
Here's a list of what is currently not supported:
- Joins. Single table only currently.
- Derived tables. Nested queries along with TopN queries are coming soon.
- Relational operators. Union, Intersect, Minus.
- Miscellaneous built-in functions. These are easy to add - read this blog for step by step instructions.
Phoenix supports table creation and versioned incremental alterations through DDL commands. The table metadata is stored in an HBase table.
A Phoenix table is created through the CREATE TABLE DDL command and can either be:
- built from scratch, in which case the HBase table and column families will be created automatically.
- mapped to an existing HBase table, by creating either a read-write TABLE or a read-only VIEW, with the caveat that the binary representation of the row key and key values must match that of the Phoenix data types (see Data Types reference for the detail on the binary representation).
- For a read-write TABLE, column families will be created automatically if they don't already exist. An empty key value will be added to the first column family of each existing row to minimize the size of the projection for queries.
- For a read-only VIEW, all column families must already exist. The only change made to the HBase table will be the addition of the Phoenix coprocessors used for query processing. The primary use case for a VIEW is to transfer existing data into a Phoenix table, since data modifications are not allowed on a VIEW and query performance will likely be lower than with a TABLE.
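As a rough illustration of the two flavors (the table, family, and column names here are made up, not prescribed):

```sql
-- Built from scratch: the HBase table and the "cf" column family
-- are created automatically.
CREATE TABLE stats (
    host VARCHAR NOT NULL PRIMARY KEY,
    cf.response_time INTEGER
);

-- Mapped as a read-only VIEW over an existing HBase table; all column
-- families referenced must already exist, and the binary representation
-- of keys and values must match the Phoenix data types.
CREATE VIEW "ExistingHBaseTable" (
    k VARCHAR NOT NULL PRIMARY KEY,
    "cf1"."val" VARCHAR
);
```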
All schema is versioned, and prior versions are stored forever. Thus, snapshot queries over older data will pick up and use the correct schema for each row.
A table could also be declared as salted to prevent HBase region hot spotting. You just need to declare how many salt buckets your table has, and Phoenix will transparently manage the salting for you. You'll find more detail on this feature here, along with a nice comparison on write throughput between salted and unsalted tables here.
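A salted table declaration might look like the following sketch; the schema is illustrative, with the bucket count given through the SALT_BUCKETS table option:

```sql
-- Phoenix manages the leading salt byte transparently; 16 buckets here
-- spreads writes across regions even for monotonically increasing keys.
CREATE TABLE events (
    event_time DATE NOT NULL PRIMARY KEY,
    cf.payload VARCHAR
) SALT_BUCKETS = 16;
```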
Another schema-related feature allows columns to be defined dynamically at query time. This is useful in situations where you don't know in advance all of the columns at create time. You'll find more details on this feature here.
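A sketch of what a dynamic-column query might look like (table and column names are illustrative): columns listed in parentheses after the table name are defined only for the duration of the query and need not appear in the table's schema.

```sql
-- last_gc_time is declared at query time, not at CREATE TABLE time.
SELECT event_id, last_gc_time
FROM event_log (last_gc_time TIME)
WHERE event_id = 'abc';
```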
##Transactions##
The DML commands of Phoenix, UPSERT VALUES, UPSERT SELECT and DELETE, batch pending changes to HBase tables on the client side. The changes are sent to the server when the transaction is committed and discarded when the transaction is rolled back. The only transaction isolation level we support is TRANSACTION_READ_COMMITTED. This includes not being able to see your own uncommitted data. Phoenix does not provide any additional transactional semantics beyond what HBase supports when a batch of mutations is submitted to the server. If auto commit is turned on for a connection, then Phoenix will, whenever possible, execute the entire DML command through a coprocessor on the server side, improving performance.
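For instance, the three DML commands might look like this (table and column names are illustrative):

```sql
-- Batched on the client until commit (or, with auto commit on,
-- executed server-side through a coprocessor where possible):
UPSERT INTO stats VALUES ('host1', 123);          -- UPSERT VALUES
UPSERT INTO stats_archive SELECT * FROM stats;    -- UPSERT SELECT
DELETE FROM stats WHERE host = 'host1';           -- DELETE
```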
Most commonly, an application will let HBase manage timestamps. However, under some circumstances, an application needs to control the timestamps itself. In this case, a long-valued "CurrentSCN" property may be specified at connection time to control timestamps for any DDL, DML, or query. This capability may be used to run snapshot queries against prior row values, since Phoenix uses the value of this connection property as the max timestamp of scans.
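Supplying the property might look like the following sketch. The helper class is hypothetical, and an actual connection would of course require a live cluster, so the getConnection call is shown only as a comment:

```java
import java.util.Properties;

public class SnapshotConnection {
    public static Properties atTimestamp(long scn) {
        Properties props = new Properties();
        // Phoenix uses this value as the max timestamp of scans, so a
        // connection opened with it sees rows as of that point in time.
        props.setProperty("CurrentSCN", Long.toString(scn));
        return props;
    }

    public static void main(String[] args) {
        Properties props = atTimestamp(1368200000000L);
        System.out.println(props.getProperty("CurrentSCN"));
        // With a live cluster, the properties would be passed to the driver:
        // Connection conn = DriverManager.getConnection("jdbc:phoenix:server1", props);
    }
}
```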
The catalog of tables, their columns, primary keys, and types may be retrieved via the java.sql metadata interfaces: DatabaseMetaData, ParameterMetaData, and ResultSetMetaData. For retrieving schemas, tables, and columns through the DatabaseMetaData interface, the schema pattern, table pattern, and column pattern are specified as in a LIKE expression (i.e. % and _ are wildcards escaped through the \ character). The table catalog argument to the metadata APIs deviates from a more standard relational database model, and instead is used to specify a column family name (in particular to see all columns in a given column family).
##Download##
To download, see our Download page.
##Performance##
For a comparison of Phoenix versus Hive, Impala, and OpenTSDB, see our Performance page.
##Configuration and Tuning##
To get an idea of the knobs and dials offered by Phoenix to tune performance, see our Tuning page.
##Release Notes##
To see what has changed at each release of Phoenix, see our [Release Notes page](https://github.com/forcedotcom/phoenix/wiki/Release-Notes).
##Roadmap##
Our roadmap is driven by our user community. Below, in prioritized order, is the current plan for the Phoenix roadmap:
- Secondary Indexes. Allow users to create indexes through a new CREATE INDEX DDL command, and then, behind the scenes, create a separate HBase table with a different row key for the index. At query time, Phoenix will take care of choosing the best table to use based on how much of the row key can be formed. Initially two types of indexes would be supported: the simplest would be for immutable data loaded into HBase through HFiles. We already support getting at the uncommitted List<KeyValue> to allow an HFile to be built. We'd expand this to support getting the uncommitted List<KeyValue> for each index as well. This alleviates the need to do any kind of index maintenance as the data table changes. The second type of index supported would be for mutable data. In this case, as the data table changes, we'd augment the WAL entry to include enough information to guarantee that upon any failure, the index rows would always be written.
- Aggregation Enhancements. Although COUNT, SUM, AVG, MIN, and MAX are currently supported, COUNT DISTINCT and PERCENTILE are not. We should support both an exact version of these as well as an approximate version where the accuracy is configurable. The exact version would require returning all distinct values back to the client for the final merge operation, while the approximate version would not.
- Monitoring and Management. Provide visibility into cpu, physical io, logical io, wait time, blocking time, and transmission time spent for each thread of execution across the HBase cluster, within coprocessors, and within the client-side thread pools for each query. On top of this, we should expose things like active sessions and currently running queries. The EXPLAIN PLAN gives an idea of how a query will be executed, but we need more information to help users debug and tune their queries.
- Type Additions. The basic types are supported, but FLOAT, DOUBLE, TINYINT, and SMALLINT should be added. Additional work includes some general cleanup and support for DEFAULT and auto increment at DDL time (by surfacing the HBase put-and-increment functionality through the standard SQL sequence support) and support for the ARRAY type. It would be great to package this up and contribute it back to HBase as the basis of a type system.
- Derived Tables. Allow a SELECT clause to be used in the FROM clause to define a derived table. This would include support for pipelining queries when necessary.
- Hash Joins. Provide the ability to join together multiple tables, through a phased approach:
  - Equi-join. Support left, right, inner, and outer equi-joins where one side of the join is small enough to fit into memory.
  - Semi/anti-join. Support correlated subqueries for EXISTS and IN where one side of the join is small enough to fit into memory.
- Cost-based Optimizer. Once secondary indexing and joins are implemented, we'll need to collect and maintain stats and drive query optimization decisions based on them to produce the most efficient query plan.
- OLAP Extensions. Support the WINDOW, PARTITION OVER, RANK, etc. functionality.
- Parent/child Join. Unlike standard relational databases, HBase allows you the flexibility of dynamically creating as many key values in a row as you'd like. Phoenix could leverage this by providing a way to model child rows inside of a parent row. The child row would be comprised of the set of key values whose column qualifier is prefixed with a known name and appended with the primary key of the child row. Phoenix could hide all this complexity and allow querying over the nested children through joining to the parent row. Essentially, this would be an optimization of the general join case, but could support cases where both sides of the join are bigger than would fit into memory.
- Table Sampling. Support the TABLESAMPLE clause by implementing a filter that uses the guideposts established by stats gathering to only return n rows per region.
- Schema Evolution. Phoenix supports adding and removing columns through the [ALTER TABLE](http://forcedotcom.github.com/phoenix/index.html#alter_table) DDL command, but changing the data type of, or renaming, an existing column is not yet supported.
- Transactions. Support transactions by integrating a system that controls time stamps like OMID. For some ideas on how this might be done, see here.
- Nested-loop Join. Support joins where both sides are big enough that they wouldn't fit into memory. As projects like Apache Drill progress, the need for this may lessen, since these systems will be able to decompose the query and perform the join efficiently without Phoenix needing to, as described here.
We'd love to hear other ideas and have folks jump in and contribute. Join one of our Google groups and drop us a line. Or better yet, send us a Pull request.
##Recently Implemented##
As items are implemented from our road map, they are moved here to track the progress we've made:
- IN/OR/LIKE Optimizations. When an IN (or the equivalent OR) or a LIKE appears in a query using the leading row key columns, compile it into a skip scan filter to more efficiently retrieve the query results.
- Support ASC/DESC declaration of primary key columns. Allow a primary key column to be declared as ascending (the default) or descending such that the row key order can match the desired sort order (thus preventing an extra sort).
- Salting Row Key. To prevent hot spotting on writes, the row key may be "salted" by inserting a leading byte into the row key which is a mod over N buckets of the hash of the entire row key. This ensures even distribution of writes when the row key is a monotonically increasing value (often a timestamp representing the current time).
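The salting scheme described above can be sketched in a few lines of Java. The hash function here is a simple rolling hash chosen for illustration; Phoenix's actual hash and class names differ, so treat this as a model of the idea, not the implementation:

```java
public class Salter {
    // Leading salt byte: hash of the entire row key, mod the bucket count.
    public static byte saltByte(byte[] rowKey, int saltBuckets) {
        int hash = 0;
        for (byte b : rowKey) {
            hash = 31 * hash + b;  // illustrative rolling hash
        }
        return (byte) (Math.abs(hash) % saltBuckets);
    }

    // Prepend the salt byte so writes spread evenly across saltBuckets
    // ranges, even when the logical row key is monotonically increasing.
    public static byte[] saltedKey(byte[] rowKey, int saltBuckets) {
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = saltByte(rowKey, saltBuckets);
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }

    public static void main(String[] args) {
        byte[] key = "row-2013-05-10".getBytes();
        byte[] salted = saltedKey(key, 4);
        System.out.println(salted[0]);  // a bucket number in [0, 4)
    }
}
```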
- TopN Queries. Support a query that returns the top N rows, through support for ORDER BY used in conjunction with row limiting.
- Dynamic Columns. For some use cases, it's difficult to model a schema up front. You may have columns that you'd like to specify only at query time. This is possible in HBase, in that every row (and column family) contains a map of values with keys that can be specified at run time. So, we'd like to support that.
- Phoenix package for the Apache Bigtop distribution. See BIGTOP-993 for more information.
##Building##
Phoenix is a fully mavenized project. That means you can build simply by doing:
$ mvn package
which will build, test and package Phoenix and put the resulting jars (phoenix-1.0.jar and phoenix-1.0-client.jar) in the generated target/ directory.
To build, but skip running the tests, you can do:
$ mvn package -DskipTests
To only build the parser, you can do:
$ mvn process-sources
##Developing##
Use the m2e eclipse plugin and do Import->Maven Project and just pick the root 'phoenix' directory.
##Contributing##
Join one or both of our Google groups:
- Phoenix HBase User for users of Phoenix.
- Phoenix HBase Dev for developers of Phoenix.
and follow the Phoenix blog here.