Join Indexes in Jethro 1.2.0
November 24, 2015 by Boaz Raufman
The Jethro 1.2.0 release packs another powerful performance feature called Join Indexes. Join indexes accelerate performance of join queries with where clause on a dimension value. Combined with Jethro star transformation optimization, join indexes ensure that Jethro uses the most optimized index-based plan whenever possible for join queries.
Index-based Join Optimization – Typical Scenario Walkthrough
To explain how index-based join optimization works in Jethro, we will follow this example: a fact table store_sales is joined with item dimension via item sk columns (ss_item_sk=i_item_sk). Each item has name (i_item_name) attribute and color (i_item_color) attribute. While item name attribute is unique (different value for each item) item color attribute have repeated value, where multiple item may have the same color.
See following figure to illustrate the example:
A common analysis request could be: aggregate some measures from the fact with specific related item color(s). For this example we will use this SQL query:
SELECT SUM(SS_SALE_PRICE) FROM STORE_SALES JOIN ITEM ON SS_ITEM_SK=I_ITEM_SK WHERE I_ITEM_COLOR='BLACK';
Any non-index based query engine would execute this join by fully scanning all of the relevant columns from the fact table and joining each fact table with the dimension. With Jethro indexes, a much more optimized plan is used: this query is rewritten so that IN clause replaces the where on the dimension table. Rewrite SQL would be:
SELECT SUM(SS_SALE_PRICE) FROM STORE_SALES WHERE SS_ITEM_SK IN (SELECT I_ITEM_SK FROM ITEM WHERE I_ITEM_COLOR='BLACK');
Following this rewrite, Jethro now uses the SS_ITEM_SK index to fetch only the fact values that match the where criteria, thereby saving an expensive full-scan of all columns. In addition, the original join may be eliminated (as in the above example) if the join column on the dimension is unique for non-NULL values, thus saving the join calculation costs.
The above rewrite is called star transformation, and it is an extremely effective method for Jethro to accelerate queries performance in ratios of X10 to X1000. However, sometimes this is not enough.
Jehtro Star Transformation Limitations
Transforming join queries to index-based queries provides great performance acceleration but has its own limitation. When the number of values in the IN list becomes very large, performance can degrade as the number of index merges (OR operations) grows. For example, if item color BLACK represent over a 10,000 items the IN subquery returns a huge list of 10,000 i_item_sk values, which means 10,000 index merge operations in the fact table. In such a scenario Jethro planner may revert to full scan execution, to avoid an expensive multi-index values merge. To resolve this scenario we will use join indexes.
What is a Join Index
Join index is an index on one table, based on the values of a column in another table (dimension) and on a specific join criteria. Typically, it is an index on a large fact table based on the values of a dimension attribute. Join index accelerates queries by eliminating both the fetch of the join key from the fact table and the join implementation (hash join or IN - merging indexes).Join indexes are relevant when you have a relatively large dimension (few K values or more), and the attribute (the column in the dimension) is low cardinality, so that each value in the attribute represents many join key values..
A simple way to understand a join index is to envision it as a transparent denormalized index view of the schema. In fact, a join index is implemented as an additional virtual index only column on the fact table that is generated from the results of a join expression between the fact and the dimension. Let's include in our example a join index over the item.i_item_color; column using join keys store_sales.ss_item_sk and item.i_item_sk:
When the Jethro planner chooses to do star transformation, it checks if a join index exists for the relevant join condition and attribute column. If the join index is found it will be used and the where results are received within a single index access per attribute value.
Typical Join Index Use Case
Typically a join index is defined if the average ratio between unique dimension attribute and the related join keys value is 1000 or more, but if fact table is large (more than few billion) it is recommended to define join index for attribute with smaller number of related join keys per value.
How to Create a Join Index
A new join index is created using the command CREATE JOIN INDEX. A preliminary constraint requires that the dimension join key will be unique so it must be defined as a PRIMARY KEY.
In our example we will use the following commands:
To define i_item_sk as primary key:
ALTER TABLE item ADD PRIMARY KEY (i_item_sk)
To create join index over item.i_item_color, by using store_sales.ss_item_sk and item.i_item_sk as join keys:
CREATE JOIN INDEX store_sales_by_item_color_idx ON store_sales(item.item_color) FROM store_sales JOIN item ON ss_item_sk = i_item_sk;
Join Index Maintenance
The great thing about join indexes is that using this feature requires no maintenance by the user. Once defined, a join index is transparently updated following any change in the fact or related dimension data. The related join indexes are updated automatically and are always kept up-to-date whenever new data are appended to the fact or dimension or whenever the fact or dimension is overwritten or truncated.
Join indexes is yet another great performance enhancement tool by Jethro that leverages the power of Jethro indexes to run super fast join queries over normalized schemas.