Analyzing Data
Analyzing the data helps you determine the correct and optimal data type for each column of the table which you will create on your new Jethro instance. Performance-wise, it is better to use numeric formats (INT/BIGINT, FLOAT/DOUBLE) whenever possible and any string column that holds only numeric values should be converted. The AnalyzaData tool will also create the table creation script and description files that are needed for loading data. The description file describes the structure of the input file, the way to map the input file to the table, and special formatting rules.
Downloading the Data Analysis Tool
- Go to the following URL: https://github.com/JethroData/AnalyzeData/archive/master.zip
- Unzip the AnalyzeData-master file and go to AnalyzeData-master directory.
Convert the AnalyzaData file to an executable file, by running the command:
chmod +x AnalyzaData.py
Analyzing Sample Data
Switch to user root by running the command:
sudo su - root
This script requires python to be installed on the server (version 2.7.5 or higher).
The script also uses an optional python package called tabulate, which enables displaying the generated data analysis in a formatted table view instead of a tab-delimited file. Installing tabulate requires pip. If pip is not installed, retrieve and install it by running the commands:wget https://bootstrap.pypa.io/get-pip.py python get-pip.py
To install tabulate, run the command:
TABULATE_INSTALL=lib-only pip install tabulate
Run AnalyzaData with the following optional parameters:
AnalyzaData.py [-i <rows to read>] [-d <delimiter>] [-q <quote char>] [-n] [-c] [-g <table name>] [<input file>]
- i <rows to read> - Number of rows to read from the input file . Default=unlimited.
- d <delimiter> - The input data delimiter. Default=',' (comma).
- q <quote char> - The input data quote character; allows a delimiter character inside quotes as part of the data. Default=" (double quotes). This parameter may be used in files arriving from Hive, when the fields appear inside quotes.
- n - If specified, treats the first row as a header row, which contains the columns' names.
- c - CSV formatted output, namely: the output report is generated as a tab-delimited file rather than a formatted table. Installing tabulate is not required in this mode.
- g <table name> - Generate a create table script and a description file, by using the given table name.
- <input file> - The input file to read. If not specified, read from standard input.
It is recommended to limit the input up to 1 million rows:
-i 1000000
This will generate a report such as the following:
- Number - The column serial number.
- Name - The column name, if the data contains header. Otherwise it is c1..cN.
- Rows - The number of rows for the column.
- Type - The suggested type to use, based on the data. A non-string type is suggested, in case non-string values are found and there are no more than 5 distinct exceptional values.
- Category - For certain values, a category can be detected based on regular expressions. It also specifies "Primary Key" when a column has only unique values and "High Cardinality" if the column contains many unique values.
- Percent - The percentage of the values of the suggested type out of all values of the column.
- Exceptions - A list of up to 5 exceptional values, namely: values that do not match the suggested column type.
- Distinct - The number of distinct values.
- Samples - Sample values of the suggested type.
The following image shows the result of running the command:
./AnalyzaData.py -i 10000 -d '|' -n -g sales_demo sales_demo_Analyze.csv
The -g parameter is specified with a table name, therefor a create table script (*.ddl) and a description (*.desc) files are generated based on the data. For the above data, the following scripts files are generated as sales_demo.ddl and sales_demo.desc when given the table name sales_demo:
Create table script file sales_demo.ddl:
create table sales_demo ( customer_sk INTEGER, customer_salutation STRING, customer_first_name STRING, customer_last_name STRING, customer_preferred_flag STRING, customer_birth_year INTEGER, customer_birth_country STRING, customer_gender STRING, customer_marital_status STRING, customer_education_status STRING, store_sk INTEGER, store_name STRING, store_city STRING, store_county STRING, store_state STRING, store_country STRING, item_product_name STRING, item_class STRING, item_category STRING, item_size STRING, item_color STRING, sale_date TIMESTAMP, sale_is_holiday STRING, sale_is_weekend STRING, quantity INTEGER, list_price DOUBLE, net_paid DOUBLE, net_profit DOUBLE );
Description file sales_demo.desc:
table sales_demo row format delimited fields terminated by '|' null defined as 'NULL' OPTIONS SKIP 1 ( customer_sk, customer_salutation, customer_first_name, customer_last_name, customer_preferred_flag, customer_birth_year, customer_birth_country, customer_gender, customer_marital_status, customer_education_status, store_sk, store_name, store_city, store_county, store_state, store_country, item_product_name, item_class, item_category, item_size, item_color, sale_date format='yyyy-M-d H:m:s.SSS', sale_is_holiday, sale_is_weekend, quantity, list_price, net_paid, net_profit )
You can view the report to determine the followings:
- Is the suggested types acceptable?
- Are there more than one exception values?
Because Jethro only supports one NULL value for each column, if multiple exception values are found you either need to modify the values before loading or load this column as STRING type. - How many distinct values are there?
For a string column with many distinct values, you may choose not to load the column or to truncate the values so they become less unique (for example, in the case of URLs).
Then, you can modify the table create script and description files if necessary:
- Assign column names, if they were not provided as part of the input file (headers line).
- Change types as required.
- Add a partition definition.
For example, to partition the above table by sale_date every month, add the following script just before the create table ending semicolon:
Partition by range(sale_date) every (interval, '1', month)
- Change description file options, such as rows or columns to skip.
To learn more about partitions and description file, consult the Reference Guide.
Once the data is analyzed and the scripts are created, you can start loading your data.