Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Map a foreign data source (such as Hive table) into Jethro external tables.

Syntax

Code Block
languagesql
linenumberstrue
-- For HIVE:
CREATE EXTERNAL DATA SOURCE new_data_source_name
	TYPE = HIVE
	LOCATION = [sub-protocol://]<host>:<port>/<db>[;property=value[;...]]
	CREDENTIALS=<username/password>
	[<rejects_handling_spec>]
	[<rejects_threshold_spec>]

-- For files:
CREATE EXTERNAL DATA SOURCE new_data_source_name
	TYPE = FILES
	[<file_format_spec>]
	[<values_format_spec>]
	[<rejects_handling_spec>]
	[<rejects_threshold_spec>]

	
<file_format_spec> ::=  
{  
	[ROW FORMAT DELIMITED 
		[FIELDS TERMINATED BY '<char>']
		[QUOTED BY '<char>' | NONE]
        [COMMENTED BY '<char>' | NONE]
        [LINES TERMINATED BY '<string>']
	]
	[OPTIONS SKIP n]
}

<values_format_spec> ::=
{
	[NULL DEFINED AS ('<string>' [,'<string>'...])]
	[DATETIME FORMAT <ISO 8601 date format string>]
	[TIMEZONE=<timezone>]
}

<rejects_handling_spec> ::=
REJECT_POLICY=
{
	FILL_NULL |
	SKIP_ROW |
	FAIL
}

<rejects_thresholds_spec> ::=
{

	[REJECT_LIMIT_COUNT=<a integer value specifying max number of allowed rows with reject>] |
	[
		REJECT_SAMPLE=<minimum number of rows to process before making reject decision>
		REJECT_LIMIT_RATIO=<a decimal between 0 and 1>
	]
}

...

[LINES TERMINATED BY '<string>'] - The default strings that marks a line termination is: '\n' for Linux, and '\r\n' for Windows. 

[OPTIONS SKIP n] - Skips n header lines (default is 0 - no skipping).

<values_format_spec> - Specifies how to convert values, according to the following parameters:

...

Expand
titleFor detailed information about available TIMESTAMP formats and timezone, click here to expand...

Timestamp Formats

The default timestamp format in Jethro is 'yyyy-MM-dd HH:mm:ss' (without a sub-second component).

However, an alternative format can be specified. For example:

1

DATETIME FORMAT 'dd-MM-yyyy HH:mm:ss.SSS'

Timestamp format strings are case sensitive. The valid format elements are:

Format Element

Meaning

yyyy

4-digit year (1970-2038)

M

1 or 2 digit month (1-12)

MM

2-digit month (01-12)

MMM

3-character month (Jan-Dec)

d

1 or 2 digit day (1-31)

dd

2-digit day (01-31)

H

1 or 2 digit hour (0-23)

HH

2-digit hour (00-23)

m

1 or 2 digit minute (0-59)

mm

2-digit minute (00-59)

s

1 or 2 digit second (0-59)

ss

2-digit second (00-59)

SSS...

1-6 -digit sub-second element 
For example: S → 1 digit, SSS → 3 digits, SSSSSS → 6 digits

unix_timestamp

The number of seconds since 1/1/1970 (microseconds/milliseconds optional after the decimal point). The format cannot be mixed with other format elements.

If the input string of the field is longer than the format defined, the rest of the field content is ignored. For example: for the format 'yyyy/MM/dd', the input field '2014-02-14 15:16:17' will be stored as '2014-02-14 00:00:00'. This allows truncating the input record to a lower precision (as in the example above - from a second-level to a day-level).

Time Zones

The query engine by default does not perform any timezone manipulation – the input is loaded as is. In other words, Jethro stores any TIMESTAMP data in UTC timezone and it assumes all input is already in UTC. 

However, in some cases you may want to adjust the timezone of an input field. For example, you may want to load log files from different data centers – each file has timestamps in the local timezone, so each needs its own adjustment.

You can ask for a timezone adjustment by using the TIMEZONE property. In that case, the query engine will compute the current offset of that timezone vs. UTC once (at the beginning of its run) and will apply it to all the values of those fields. Example:

1

TIMEZONE='America/New_York'

...