CREATE EXTERNAL DATA SOURCE
Map a foreign data source (such as Hive table) into Jethro external tables.
Syntax
-- For HIVE: CREATE EXTERNAL DATA SOURCE new_data_source_name TYPE = HIVE LOCATION = '[sub-protocol://]<host>:<port>/<db>[;property=value[;...]]' CREDENTIALS=<username/password> [<rejects_handling_spec>] [<rejects_threshold_spec>] -- For files: CREATE EXTERNAL DATA SOURCE new_data_source_name TYPE = FILES [<file_format_spec>] [<values_format_spec>] [<rejects_handling_spec>] [<rejects_threshold_spec>] -- For GENERIC connections: CREATE EXTERNAL DATA SOURCE new_data_source_name TYPE = GENERIC_JDBC JDBC_LIBRARY=<jdbc jar path> JDBC_CLASSNAME=<jdbc class name> LOCATION = '[sub-protocol://]<host>:<port>/<db>[;property=value[;...]]' CREDENTIALS=<username/password> <file_format_spec> ::= { [STORED AS {PARQUET|ORC}] | [ROW FORMAT DELIMITED [FIELDS TERMINATED BY '<char>'] [QUOTED BY '<char>' | NONE] [COMMENTED BY '<char>' | NONE] [LINES TERMINATED BY '<string>'] ] [OPTIONS SKIP n] } <values_format_spec> ::= { [NULL DEFINED AS ('<string>' [,'<string>'...])] [DATETIME FORMAT <ISO 8601 date format string>] [TIMEZONE=<timezone>] } <rejects_handling_spec> ::= REJECT_POLICY= { FILL_NULL | SKIP_ROW | FAIL } <rejects_thresholds_spec> ::= { [REJECT_LIMIT_COUNT=<a integer value specifying max number of allowed rows with reject>] | [ REJECT_SAMPLE=<minimum number of rows to process before making reject decision> REJECT_LIMIT_RATIO=<a decimal between 0 and 1> ] }
Description
The following parameters sets the default behavior for all the tables which will be defined to use that data source. Data source defaults can be overrided per table, via the command CREATE EXTERNAL TABLE.
<file_format_spec> - Specifies how to parse the file, according to the following parameters:
[STORED AS {PARQUET|ORC}] - The default FILE format for external data sources, is CSV. If PARQUET or ORC are specified, row format delimiters should not be specified (field terminators, quating chars, comment chars, line termination chars).
[FIELDS TERMINATED BY '<char>'] - The default field teminator is a comma ',', unless other char is specified.
[QUOTED BY '<char>' | NONE] - Quotes can be used to prevent a char from being interpreted as a field delimiter. If a quoting char is not specified, the defailt value used for quoting would be double quotation marks ("). When 'NONE' is specified, any dlimiter char found would be considered as a field delimiter.
[COMMENTED BY '<char>' | NONE] - To ignore a single line of values during the load process, use a comment char. If a comment char is not specified, the defailt value used for commenting a line would be '#'. When 'NONE' is specified, no line would be ignored.
[LINES TERMINATED BY '<string>'] - The default strings that marks a line termination is: '\n' for Linux, and '\r\n' for Windows.
[OPTIONS SKIP n] - Skips n header lines (default is 0 - no skipping).
<values_format_spec> - Specifies how to convert values, according to the following parameters:
[NULL DEFINED AS ('<string>' [,'<string>'...])] - A set of strings which should be converted to a NULL value once fetched.
[DATETIME FORMAT <ISO 8601 date format string>] - The format in which the source data is stored as. The default format in Jethro is 'yyyy/MM/dd HH:mm:ss'.
[TIMEZONE=<timezone>] - Converts a timestamp in UTC to a given timezone.
<rejects_handling_spec> - Specifies a policy for handling rejects at row/columns level. The allowed polices are:
FILL_NULL (default) - Replace the rejected value with a NULL.
SKIP_ROW - Skip the entire row when any value is rejected.
FAIL - Fail the entire request on the first time it encountered with a rejected value (Choosing this policy will ignore the rejects threshold defined under <rejects_thresholds_spec>).<rejects_handling_spec> sets the defaults behavior for all columns of all tables defined on the data source. Data source defaults can be override per table columns and/or per specific column via the command CREATE EXTERNAL TABLE.
<rejects_thresholds_spec> - Specifies a rejects threshold policy for handling of multiple rejects at request level. The total number of rows with rejects are counted, and if that number reaches the defined threshold, the request fails (A row with 2 rejects will be counted as 1). The allowed rejects thresholds are:
REJECT_LIMIT_COUNT (default) - An integer value specifying the maximum number of allowed rows with rejects. If the number of rows with rejects reaches the limit, the request will fail. The default value is: 0.
REJECT_LIMIT_RATIO - A decimal value between 0 and 1, specifying the maximum ratio allowed for rows with rejects, from a sample of fetched rows (see REJECT_SAMPLE). If the actual sampled ratio is greater than the limit specified, the request will fail.
REJECT_SAMPLE - The minimum number of rows to process before making a reject decision (based on REJECT_LIMIT_RATIO). If not specified, the default value is '1000000'.
CREDENTIALS - Please note that the provided user and password will be used by Jethro for accessing the external data source. Any Jethro user who will query this external data source, will be passing to it queries as the user provided in the external data source settings. To some organizations, this might cause a possible conflict or risk with the intended privileges assigned for each user in the external data source. The solution for such use case is to enable and configure "impersonation". Impersonation means allowing one user account to act on behalf of another user account. It allows users to execute queries on external data sources, as the connected user, rather than the super user provided in the external source settings. For more information about Impersonation in Jethro, visit Authentication.
Examples
CREATE EXTERNAL DATA SOURCE exds1 TYPE = HIVE LOCATION = 'jdbc:hive2://10.1.1.199:1000/tpcds;UID=admin;PWD=password' CREDENTIALS= ''/'' REJECT_POLICY = FILL_NULL REJECT_LIMIT_COUNT = 500
CREATE EXTERNAL DATA SOURCE my_file_tpcds TYPE=FILES ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' QUOTED BY '"' COMMENTED BY '#' OPTIONS SKIP 3 null defined as ('na') REJECT_POLICY=SKIP_ROW REJECT_LIMIT_COUNT=10
CREATE EXTERNAL DATA SOURCE my_file_tpcds TYPE=FILES OPTIONS SKIP 1 null defined as ('NULL','null') REJECT_POLICY=FILL_NULL REJECT_SAMPLE=100 REJECT_LIMIT_RATIO=0.2
CREATE EXTERNAL DATA SOURCE my_parquet_tpcds TYPE=FILES STORED AS PARQUET;
CREATE EXTERNAL DATA SOURCE my_generic_tpcds TYPE=GENERIC_JDBC JDBC_LIBRARY='/usr/lib/hive/lib/hive-jdbc-standalone.jar' JDBC_CLASSNAME='org.apache.hive.jdbc.HiveDriver' LOCATION='jdbc:hive2://hdfs.comp.internal:10000/tpcds' CREDENTIALS=''/'';