CREATE EXTERNAL TABLE

Maps a foreign data source table (such as Hive tables or files) into Jethro.


Syntax

CREATE EXTERNAL TABLE [schema_name.]new_external_table_name
(<columns_list_spec>)
DATA SOURCE=existing_external_data_source_name
LOCATION=<location_spec>
[<file_format_spec>] -- Relevant only for 'files' type of external data source
[<values_format_spec>]
[<rejects_handling_spec>]
[<rejects_thresholds_spec>]


<columns_list_spec> ::=
<column_spec> [,<column_spec>...n]

<column_spec> ::=
{
  column name
  column_type
  [<values_format_spec>]
}

<file_format_spec> ::= -- Relevant only for 'files' type of external data source
{ 
    [ROW FORMAT DELIMITED
        [FIELDS TERMINATED BY '<char>']
        [QUOTED BY '<char>' | NONE]
        [COMMENTED BY '<char>' | NONE]
        [LINES TERMINATED BY '<string>']
    ]
    [OPTIONS SKIP n]
}

<values_format_spec> ::=
{
    [NULL DEFINED AS ('<string>' [,'<string>'...])]
    [DATETIME FORMAT <ISO 8601 date format string>]
    [TIMEZONE=<timezone>]
}

<location_spec> ::=
{
	<table_name_from_the_external_data_source> | 
	<list_of_files>
}

<list_of_files> ::=
'<path>' [,'<path>'...]

<path> ::=
[HDFS://]folder[/file_name]

<rejects_handling_spec> ::=
REJECT_POLICY=
{
    FILL_NULL |
    SKIP_ROW |
    FAIL
}

<rejects_thresholds_spec> ::=
{
    [REJECT_LIMIT_COUNT=<a integer value specifying max number of allowed rows with reject>] |
    [
        REJECT_SAMPLE=<minimum number of rows to process before making reject decision>
        REJECT_LIMIT_RATIO=<a decimal between 0 and 1>
    ]
}

Description

Creates an external table in Jethro schema, mapped to a table on an external data source, or to a file(s) located on a local file system or on HDFS. Typically external tables are used for loading data from external data sources into Jethro table using INSERT INTO command.

When a policy or a formatting rule is set for an external table, it will override the policy/formatting rule of the external data source it is belonging to. Otherwise, it will inherit the settings of its external data source.

When a formatting rule is set for a column, it will override the formatting rule of the external table/data source it is belonging to. Otherwise, it will inherit the settings of its external table/data source.

<location_spec> - Specifies the location within the external data source from which the data will be fetched.

table_name_from_the_external_data_source - The name of the table in Hive.

list_of_files - The name or names of files to fetch the data from. When reading more than one file, it is important to varify that they are all using the same structure, order, and date types as defined during the external table creation. It is possible to use a direct full path for a specific file, or to use a hint which will allow access to multiple files, such as: folder path, or a file path with an hint char ('?' represents any single char, '*' represents any set of chars). Regular expression is not supported for the directories components of the path, only for the files. Anyhow the file location must include the full path.

<values_format_spec> - Specifies how to convert values, according to the following parameters:

[NULL DEFINED AS ('<string>' [,'<string>'...])] - A set of strings which should be converted to a NULL value once fetched.

[DATETIME FORMAT <ISO 8601 date format string>] - The format in which the source data is stored as. The default format in Jethro is 'yyyy/MM/dd HH:mm:ss'.

[TIMEZONE=<timezone>] - Converts a timestamp in UTC to a given timezone.

 For detailed information about available TIMESTAMP formats and timezone, click here to expand...

Timestamp Formats

The default timestamp format in Jethro is 'yyyy-MM-dd HH:mm:ss' (without a sub-second component).

However, an alternative format can be specified. For example:

1

DATETIME FORMAT 'dd-MM-yyyy HH:mm:ss.SSS'

Timestamp format strings are case sensitive. The valid format elements are:

Format Element

Meaning

yyyy

4-digit year (1970-2038)

M

1 or 2 digit month (1-12)

MM

2-digit month (01-12)

MMM

3-character month (Jan-Dec)

d

1 or 2 digit day (1-31)

dd

2-digit day (01-31)

H

1 or 2 digit hour (0-23)

HH

2-digit hour (00-23)

m

1 or 2 digit minute (0-59)

mm

2-digit minute (00-59)

s

1 or 2 digit second (0-59)

ss

2-digit second (00-59)

SSS...

1-6 -digit sub-second element 
For example: S → 1 digit, SSS → 3 digits, SSSSSS → 6 digits

unix_timestamp

The number of seconds since 1/1/1970 (microseconds/milliseconds optional after the decimal point). The format cannot be mixed with other format elements.

If the input string of the field is longer than the format defined, the rest of the field content is ignored. For example: for the format 'yyyy/MM/dd', the input field '2014-02-14 15:16:17' will be stored as '2014-02-14 00:00:00'. This allows truncating the input record to a lower precision (as in the example above - from a second-level to a day-level).

Time Zones

The query engine by default does not perform any timezone manipulation – the input is loaded as is. In other words, Jethro stores any TIMESTAMP data in UTC timezone and it assumes all input is already in UTC. 

However, in some cases you may want to adjust the timezone of an input field. For example, you may want to load log files from different data centers – each file has timestamps in the local timezone, so each needs its own adjustment.

You can ask for a timezone adjustment by using the TIMEZONE property. In that case, the query engine will compute the current offset of that timezone vs. UTC once (at the beginning of its run) and will apply it to all the values of those fields. Example:


1

TIMEZONE='America/New_York'

<rejects_handling_spec> - Specifies a policy for handling rejects at row/columns level. The allowed polices are:

FILL_NULL (default) - Replace the rejected value with a NULL.
SKIP_ROW - Skip the entire row when any value is rejected.
FAIL - Fail the entire request on the first time it encountered with a rejected value (Choosing this policy will ignore the rejects threshold defined under <rejects_thresholds_spec>).
Setting <rejects_handling_spec> override spec set at data source level via CREATE EXTERNAL DATA SOURCE command.

<rejects_thresholds_spec> - Specifies a rejects threshold policy for handling of multiple rejects at request level. The total number of rows with rejects are counted, and if that number reaches the defined threshold, the request fails (A row with 2 rejects will be counted as 1). The allowed rejects thresholds are:
REJECT_LIMIT_COUNT (default) - An integer value specifying the maximum number of allowed rows with rejects. If the number of rows with rejects reaches the limit, the request will fail. The default value is: 0.
REJECT_LIMIT_RATIO - A decimal value between 0 and 1, specifying the maximum ratio allowed for rows with rejects, from a sample of fetched rows (see REJECT_SAMPLE). If the actual sampled ratio is greater than the limit specified, the request will fail.
REJECT_SAMPLE - The minimum number of rows to process before making a reject decision (based on REJECT_LIMIT_RATIO). If not specified, the default value is '1000000'.

Reject policy & thresholds apply only within INSERT INTO SELECT operation. Otherwise fetching data from an external table using SELECT will fail, and issue an error whenever a mismatch between external table data type and the actual fetched value from the external data source is encountered.

Examples

1) Creation of an external data source and an external table for a single simple csv file, such as:

int1|int2|big|flt|dbl|str|dt
-1|-11|-111|-0.1|-0.11|negative|1999-01-01 00:00:00
0|0|0|0|0|zero|1971-01-01 00:00:00
1|11|111|0.1|0.11|positive|2017-09-17 00:00:00
3|33|333|3.3|3.33|pos2|2017-09-18 00:00:00
1|null|NULL|NULL|NULL|NULL|NULL

...

a. Analysis of the data:

  • The delimiter is '|'
  • Most of the Null value are represented as 'NULL', but only int2 field is using 'null'.
  • The first line consist of header names, and therefore it should be skipped.
  • The time format is: 'yyyy-MM-dd HH:mm:ss'.

b. Possible business preferences:

  • The timestamp fields should be adjusted to New York timezone.
  • The data is sometimes generated with mistakes, but the BI needs for updated data is more important than its need for accuracy. Therefore, the Reject policy should not cause a failur for every reject found. Partial data might be good enough, as long as the loss of possible data is not too big. Up to 10% of data loss is a limit that can be tolerated, but more than that should cause a failur and require a human inspection. 
  • Since the file is very big, a sample of 100 rows should be enough in order to understand quickly if this process should be stopped or to continue.
  • If a line includes one or more values which are rejected, it would be better to skip the whole line, rather than filling NULL values instead of the rejected values.
  • There could be more than one file that should be used for that table, and there are no other files under that directory, so every file from that directory should be accessed.

c. According to steps a+b, these are the commands that should be sent to Jethro:

CREATE EXTERNAL DATA SOURCE my_csv_files
TYPE=FILES
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
OPTIONS SKIP 1
null defined as ('NULL')
REJECT_POLICY=SKIP_ROW
REJECT_SAMPLE=100
REJECT_LIMIT_RATIO=0.1


CREATE EXTERNAL TABLE my_data
(
int1 INTEGER,
int2 INTEGER null defined as ('null'),
big BIGINT,
flt FLOAT,
dbl DOUBLE,
str STRING,
dt TIMESTAMP DATETIME FORMAT 'yyyy-MM-dd HH:mm:ss' TIMEZONE='America/New_York'
)
DATA SOURCE=my_csv_files
LOCATION='/home/jethro/data/';

2) Creating an external table for an Hive table called 'customer_address', with REJECT_POLICY=FAIL:

CREATE EXTERNAL TABLE hive_ext_customer_address
(
    ca_address_sk             INTEGER,
    ca_address_id             STRING,
    ca_street_number          STRING,
    ca_street_name            STRING,
    ca_street_type            STRING,
    ca_suite_number           STRING,
    ca_city                   STRING,
    ca_county                 STRING,
    ca_state                  STRING,
    ca_zip                    STRING,
    ca_country                STRING,
    ca_gmt_offset             FLOAT,
    ca_location_type          STRING
)
DATA SOURCE=exds1
LOCATION='customer_address'
REJECT_POLICY=FAIL;

3) Creating an external table for a directory of files, located on HDFS:

CREATE EXTERNAL TABLE my_folder_call_center 
( 
	cc_call_center_sk INTEGER, 
	cc_call_center_id STRING , 
	cc_rec_start_date TIMESTAMP, 
	cc_rec_end_date TIMESTAMP DATETIME FORMAT 'yyyy-MM-dd', 
	cc_closed_date_sk INTEGER, 
	cc_open_date_sk INTEGER , 
	cc_name STRING, 
	cc_class STRING , 
	cc_employees INTEGER, 
	cc_sq_ft INTEGER, 
	cc_hours STRING, 
	cc_manager STRING, 
	cc_mkt_id INTEGER, 
	cc_mkt_class STRING, 
	cc_mkt_desc STRING, 
	cc_market_manager STRING, 
	cc_division INTEGER, 
	cc_division_name STRING, 
	cc_company INTEGER, 
	cc_company_name STRING, 
	cc_street_number STRING, 
	cc_street_name STRING, 
	cc_street_type STRING, 
	cc_suite_number STRING, 
	cc_city STRING, 
	cc_county STRING, 
	cc_state STRING, 
	cc_zip STRING,
	cc_country STRING, 
	cc_gmt_offset FLOAT, 
	cc_tax_percentage DOUBLE 
)
DATA SOURCE=my_file_tpcds
LOCATION='HDFS:///jethro/data/tpcds/call_center/'
REJECT_POLICY=FILL_NULL
REJECT_LIMIT_COUNT=10;

4) Creating an external table for files that are located in a directory which is mixed with unrelated data. Therefore, hint signs are used to identify only the relevant files:

CREATE EXTERNAL TABLE my_files_call_center 
( 
	cc_call_center_sk INTEGER, 
	cc_call_center_id STRING , 
	cc_rec_start_date TIMESTAMP, 
	cc_rec_end_date TIMESTAMP DATETIME FORMAT 'yyyy-MM-dd', 
	cc_closed_date_sk INTEGER, 
	cc_open_date_sk INTEGER, 
	cc_name STRING null defined as ('NULL_column'), 
	cc_class STRING, 
	cc_employees INTEGER, 
	cc_sq_ft INTEGER, 
	cc_hours STRING, 
	cc_manager STRING, 
	cc_mkt_id INTEGER, 
	cc_mkt_class STRING, 
	cc_mkt_desc STRING, 
	cc_market_manager STRING, 
	cc_division INTEGER, 
	cc_division_name STRING, 
	cc_company INTEGER, 
	cc_company_name STRING, 
	cc_street_number STRING, 
	cc_street_name STRING, 
	cc_street_type STRING, 
	cc_suite_number STRING, 
	cc_city STRING, 
	cc_county STRING, 
	cc_state STRING, 
	cc_zip STRING, 
	cc_country STRING, 
	cc_gmt_offset FLOAT, 
	cc_tax_percentage DOUBLE 
)
DATA SOURCE=FILE_sanity_tpcds
LOCATION='/home/jethro/ext_table/Test1/call_center.csv','/home/jethro/ext_table/?all_center',
		 '/home/jethro/ext_table/Test1/call_center1.c*','/home/jethro/ext_table/Test1/c?ll_center?da?a'
REJECT_POLICY=SKIP_ROW
REJECT_LIMIT_COUNT=10;

See Also

INSERT INTO

CREATE TABLE

CREATE EXTERNAL DATA SOURCE

SHOW EXTERNAL DATA SOURCE[S]

SHOW EXTERNAL TABLES