How to Generate Data Mock for Testing
Do you want to test Jethro but don't have any data to use? With the csvgen utility you can very easily generate mock data and load it into Jethro.
The utility generates csv files that can be loaded into Jethro based on your specifications. This document describes how to use the csvgen utility to generate data and load into Jethro.
csvgen
To use csvgen:
- browse csvgen github page at https://github.com/JethroData/csvgen
- click on Download Zip to download a zip file containing all the required files.
- Extract the zip file, and open the README.md file for instructions on how to install and use csvgen.
Before generating the data, you need to create the input file that instructs csvgen how to generate the data. The input file contains text where each row corresponds to a column in the table and contains a description of how to generate data for that column.
Each row begins with the data type and may contain a list of parameters, as follows:
Data Type | Description | Parameters | Examples |
---|---|---|---|
row_number | Generates a unique row number for each row, starting at 1 or at the number provided as parameter |
|
|
boolean | Generates a Boolean value, true or false. |
|
|
uuid | Generates a unique id |
|
|
ip_address | Generates an IP address. |
|
|
regex | Generates a string based on a regular expression. |
|
|
word | Generates a random (meaningless) word. |
|
|
number | Generates a whole or decimal number. |
|
|
date | Generates a date with a specified format within a given period. |
| date "01-01-2012 00:00:00" "12-31-2012 23:59:59" "%m-%d-%Y %H:%M:%S" 30 date "05-27-2014" "06-31-2014" "%m-%d-%Y" 25 |
fixed | Generates a string provided as parameter. |
|
|
list | Generates a string from a list of values. |
| list "[dog cat mouse]" 20 * You must enclose the list with quotes. |
custom list | Generates a string from a custom list of strings. For a given type, there should be a file {type}.csv in the data directory. The file should contain a list of values, each one on a separate line. One of these values will be picked randomly. The data directory already contains several type files, such as first_name, last_name, state, and country. You can add new custom types by adding such a file to the data directory. |
| first_name color 10 brands 100 10 |
Generating Data and Loading it into Jethro
Let's say that we want to generate data for a customer table that has the following definition:
CREATE TABLE customer ( customer_id INTEGER, first_name STRING, last_name STRING, age INTEGER, birth_country STRING, gender STRING, phone_number STRING, last_transaction_date TIMESTAMP, balance FLOAT );
Save the above data definition in a file called customer.ddl, and run the following command from the Jethro server console as user Jethro to create the table:
JethroClient demo localhost:9111 -p jethro -i customer.ddl
Next we create a Jethro loader description file that will be used to load the data. Save the following in a file called customer.desc:
table customer overwrite row format delimited fields terminated by '|' ( customer_id, first_name, last_name, age, birth_country, gender STRING, phone_number, last_transaction_date format='MM-dd-yyyy', balance )
Now, we need to create the input file that instructs csvgen how to generate the data. Here is an example of such an input file for the customer table:
row_number 1000000 word 1 4 1000 word 1 4 3000 number 18 100 0 country list '[Male Female]' regex '[1-9]\d{2}\d{3}\d{4}' date '01-01-2000' '01-01-2016' '%m-%d-%Y' 25 number -999 999 2 10
It will generate the following for each column:
- customer_id – A unique number starting at 1000000 for the first row and incrementing by one for subsequent rows.
- first_name – A random word, 1 to 4 syllables, out of a list of 1000 words.
- last_name – A random word, 1 to 4 syllables, out of a list of 3000 words.
- age – A whole number between 18 and 100.
- birth_country – A value from a list of countries. The list is taken from an existing file named country.csv in the data directory.
- Gender – Either "Male" or "Female".
- phone_number – A string of 10 digits not starting with 0 generated by a regex with the format xxx xxx xxxx.
- last_transaction_date – A date between 01-01-2000 and 01-01-2016 with 25% of NULL values.
- balance – A decimal numbers with 2 places after the decimal point between -999 and 999 with 10% of NULL values.
Save the input text in a file called customer.in.
Verify that the data is generated correctly by generating 100 rows with this command:
./csvgen.py -i 100 -d '|' customer.in
./csvgen.py -i 100 -d '|' customer.in
It will send the output to the console. Here is an example of the last few lines of the output:
1000093|Paf|Yuc|76|Venezuela|Male|403-740-4749||706.68 1000094|Col|Axxa|64|Argentina|Male|371-486-9796|03-12-2009| 1000095|Uz|Ec|99|Italy|Male|624-579-8640|11-03-2000|-115.25 1000096|Bakdi|Iv|71|Iceland|Male|496-820-5813|03-13-2013|172.57 1000097|Vot|Iccife|71|United States|Male|637-522-6319||556.36 1000098|Nauhel|Owle|82|Senegal|Male|998-729-3208|11-24-2007|543.28 1000099|Ibri|Afzu|18|Slovenia|Female|629-826-0867|05-25-2013|432.62
Next, we will generate the data to be loaded to Jethro.
Let's generate data for 10000 customers, using the delimiter '|' and store it in a file customer.csv:
$./csvgen.py -i 10000 -d '|' -o customer.csv customer.in
Now that we have the csv file ready, we can load it into Jethro. Run the following command on the Jethro server console with user Jethro:
JethroLoader demo customer.desc customer.csv
You should see the name of the loader log file in the console. When the load finishes, view the log file to make sure that the load finished successfully. You should see lines like the following indicating that all the data was loaded:
Input records: 10000 Records loaded: 10000 Records skipped: 0 ( 0 empty lines, 0 no matching rules, 0 header lines, 0 outside partitions boundary ) Records rejected (see reject rep0 t file 'loader_rejects_20160304_212349_46837.out':Table rows - before :0 Table rows - truncated: 0 Table rows - after: 10000
Another option is to generate the data and pass it to the loader using a pipe instead of saving it in a file. This can be achieved using the following:
./csvgen.py -i 10000 -d '|' customer.in | JethroLoader demo customer.desc stdin
Speeding Up Data Generation
In order to generate a large amount of data, you can use the generate.sh script which can run multiple csvgen.py in parallel. This speeds up the data generation process and can be useful to generate data for large fact tables.
It spawns as many processes as requested and creates temporary output file for each. When all processes are done, it combines the output to one file.
To use it, run the following:
./generate.sh {name} {rows} {number of processes} {delimiter}
Where:
{name} is a description file named {name}.in. It also uses the same name for the output file which is generated as {name}.csv. This parameter is mandatory.
{rows} is the number of rows to generate for each process. Default =1M
{number of processes} is the number of processes to run in parallel. Default=1
{delimiter} is the column delimiter to use. Default='|'
To generate data for 1000000 customers using 20 parallel processes, issue the command:
./generate.sh customer 50000 20 '|'
Note that you need to specify the number of rows for each process and not the total number of rows. For best results, align the number of processes with the number of CPU's that the server has.