Partitioning in Hive and Impala vs Jethro
In this post, I would like to explain how partitioning was implemented in Hive and Impala, why their design is very problematic, and how our implementation avoids those problems. Design matters!
Partitioning in Hive and Impala
In Hive and Impala, a partition is a directory of files. The partition column attribute is embedded in the directory name, so all rows in the same partition share the partition key. In other words, each partition covers a single value of the partition key. As we will see, that design has some unfortunate consequences.
Limitation – Partition Key Must Be Low Cardinality
A good partitioning key is one that helps in two scenarios. First, it should be a key that is used in most of the queries to enable partition pruning – automatically skipping irrelevant partitions. Second, it should be used as a basis for rolling window maintenance – like purging old data once a month. For example, most very large tables (fact tables) store an event per row, like phone calls, page views etc, so the most natural key for partitioning is the event timestamp. In that case, partitioning is a physical grouping of rows (events) per an interval of the event time (hourly or daily or monthly etc).
However, since Hive and Impala are limited to a single value per partition, users can not partition by the event timestamp column, as that will create millions of small directories and files (one per second or millisecond), which will hurt query performance and may also overwhelm HDFS. Even if the event timestamp is less granular (at the minute, hour or day level), it may make sense to partition the data at a different level.
For example, the TPC-DS benchmark generates five years of data. Even though the fact table has a day-level column, it may still make sense to partition by month (or quarter or year), as daily partitions leads to about 1800 small partitions (or about 20 million rows per day for 15TB scale factor).
So, as a workaround, in Hive and Impala you have to generate a separate column(s) from the event timestamp to represent the partition key and adapt your ETL scripts accordingly. Some people pick a string or a number that represents the day (like 20140925 or '2014-09-25'), others pick a set of separate columns like year, month, and day columns for multi-level partitioning.
As an example for that, in kit for Hive, they have modified the benchmark – added to the fact table store_sales an extra string column called ss_sold_date and rewrote the queries to explicitly use that column! In Cloudera TPC-DS benchmark kit they also manually rewrote the SQLs to explicitly refer to a range of partition keys (without adding an extra column). That is understandable due to their limited partitioning implementation (I'm sure both companies have mentioned this modification somewhere in the fine print). That it leads us to the second problem:
Limitation – Must Manually Rewrite All Queries To Use The Partition Key
When there is a separate partition key column, users that ask questions must manually rewritetheir SQL to include the partition key, so the database could do partition pruning. For example, maybe your query searched for data from the last eight hours: event_ts_column >= now() – 8*60*60. If you use a separate partition key column, you must add it to your query as well: event_date_string >= '2014-05-24'. This is a manual, error-prone work for every SQL statement that users write. Of course, BI tools can't generate this condition automatically, so users of those tools must always add both conditions and change them in tandem.
Partitioning in Jethro
Jethro uses indexes to read only the relevant data from HDFS, with or without partitioning. So, we don't see partitioning as a major performance feature (though it can somewhat help performance and scalability occasionally).
Instead, partitioning in Jethro mostly helps ease of maintenance. This is especially important as Jethro format is append-only – so partitioning enables users to delete old data from fact tables or fix operational data problems (like loading the same file twice by mistake) without having to drop and reload large tables.
One design goal for our partitioning feature was ease of use. Partitioning a table should not require complicated CREATE TABLE syntax, manually adding partitions or having to explicitly name them.