Tips for Installing Jethro

So you've already downloaded Jethro, now you're ready to install it and start accelerating your database performance.
Follow these below tips to get you started on the right foot. You can always
contact us with any problems or questions along the way.

Top Tips for Optimizing Jethro Performance

Optimal Queries – Use as Many Filters as Possible
Jethro process queries by first evaluating the WHERE clause and determining the rows needed for the query. It then fetched column data only for those rows. The narrower the query, the faster it performs

Optimal Data Types
Use numeric formats (INT/BIGINT, FLOAT/DOUBLE) whenever possible – any string column that holds only numeric values should be converted. This is especially true for high cardinality columns.
Use TIMESTAMP for Date/Time columns. Jethro creates multiple date-related indexes for such columns to improve performance of date-range queries.

Partitions for Large FACT Tables
A TIMESTAMP column is typically best choice for partitions as it simplifies maintenance tasks like purging old data. 
Jethro recommends a total of 5-25 partitions although it comfortably supports hundreds of partitions.
J
ethro partition key can use range values. For example: PARTITION BY RANGE(ts_col) EVERY (INTERVAL '7' DAY)

Cache Space
Jethro uses server-side caching for metadata and frequently used file fragments. The greater the space the more data it will be able to store. Note that the benefit of the cache will be realized over time as filling up the cache can take some time. Cache space can be defined when an instance is created or updated later on by editing the local-conf.ini file. 
Jethro automatically enables query-result cache. The query-result cache is stored in HDFS and does not require local disk space.

Consolidate Tables When Possible 
While Jethro optimizes JOINs and automatically performs Star-Transformation, it is better to avoid them when not required. 
Jethro's columnar format and effective compression minimize the storage impact of such denormalization.

Hardware considerations: more is better!
More CPU and RAM Improves query speed as Jethro takes advantage of multi-threading. It also improves concurrency as more user/queries can be served in parallel.
10Gb network connectivity to cluster will speed up HDFS access. Local drives for caching – SSD is preferable. Trial servers can start with as little as 64GB and 8 cores.

Use a Cluster of Jethro Servers 
Multiple servers linearly increase Jethro's capacity for concurrent users and queries. 
When performing frequent incremental loads, it is recommended to run the JethroLoader on a different server.

Data sorting can improve performance 
If a large number of queries filter by a specific column (that is not already a partition column) it could be beneficial to pre-sort the input data by such column before it's loaded into Jethro.

Join Indexes
When attributes of large dimensions are often used as a filter it is recommended to define them as a JOIN INDEX on the fact table. There is no limit to the number of JOIN INDEXES that can be defined.

Jethro without Hadoop
Jethro is capable of using other storage systems besides Hadoop's HDFS. These include a local filesystem, cloud storage (eg S3) or network storage (SAN/NAS).
When the dataset used with Jethro can fit in a local filesystem it is often the best solution as it avoids Hadoop overhead.

Load "Overwrite" for table update with no downtime
When a dimension changes and need to be reloaded you can use Jethro's Load Overwrite feature. It loads the updated table and only when the process is completed the tables are swapped.

Use ALTER TABLE to add columns on the fly
Jethro, being a column oriented design, can dynamically add (or drop) columns without having to reload the table. The value NULL will be used for the new column over existing rows

Use Jethro's SHOW SQL command to learn about Jethro internals:

SHOW [SESSION | GLOBAL] PARAM parameter | ALL (show parameter values) 
SHOW TABLES [EXTENDED | MAINT] (show all tables, size stats, fragmanation) 
SHOW TABLE PARTITIONS table_name (show table's partition stats) 
SHOW TABLE COLUMNS [FULL] table_name (show column stats) 
SHOW VIEWS [EXTENDED] (show views) 
SHOW LOCAL CACHE (show local file cache usage) 
SHOW ADAPTIVE CACHE (show query result cache) 
SHOW ACTIVE QUERIES (show currently running and queued queries) 
SHOW SCHEMAS (show defined schemas) 
SHOW JOIN INDEXES (show all defined JOIN indexes)