Glossary
This is a reference list of terms related to Jethro products. Additional information is available from a number of resources, including the Reference Guide Glossary.
A
Adaptive Cache
As users work with their favorite BI tools, the sequence of SQL queries that these tools generate and send to Jethro follows predictable patterns. For example, most users start from a dashboard (which sends a predictable list of queries), then typically add filter conditions and aggregate conditions one at a time.
Adaptive cache is a unique JethroData feature, which leverages those patterns. It is a cache of re-usable, intermediate bitmaps and query results that is automatically maintained and used by the JethroServer engine on its local storage. The cache is also incremental – after new rows are loaded, the next time a cached query is executed it will in most cases automatically combine previously cached results, and computation will only be performed over the newly loaded rows.
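The incremental behavior described above can be illustrated with a minimal Python sketch. It is not Jethro's implementation: the class IncrementalCountCache, its method names, and the modeling of a table as a plain list of rows are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = dict


class IncrementalCountCache:
    """Toy cache for filtered row counts over an append-only list of rows.

    Hypothetical illustration only; not the JethroServer implementation.
    """

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        # query key -> (number of rows already scanned, cached count)
        self._cache: Dict[str, Tuple[int, int]] = {}
        self.rows_scanned = 0  # bookkeeping that shows how much work was done

    def count_where(self, key: str, predicate: Callable[[Row], bool]) -> int:
        scanned, count = self._cache.get(key, (0, 0))
        new_rows = self.rows[scanned:]  # only rows loaded since the last run
        self.rows_scanned += len(new_rows)
        count += sum(1 for row in new_rows if predicate(row))
        self._cache[key] = (len(self.rows), count)
        return count


def is_us(row: Row) -> bool:
    return row["country"] == "US"


rows = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]
cache = IncrementalCountCache(rows)

first = cache.count_where("country=US", is_us)   # scans all three rows
rows.append({"country": "US"})                   # a new row is loaded
second = cache.count_where("country=US", is_us)  # scans only the new row
```

The second call combines the previously cached count with a scan of the single newly loaded row, so four rows are scanned in total rather than seven.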
For further details, see section Managing Adaptive Cache under chapter Administering Jethro.
Adaptive Index Cache
The adaptive index cache is part of the adaptive cache that holds intermediate bitmap index entries that were computed on the fly during query execution.
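As a rough illustration of what an intermediate bitmap might look like, the following Python sketch represents each bitmap as an integer whose bit i is set when row i matches, and caches the combined bitmap for a set of filter conditions. The function names, the cache layout, and the sample rows are assumptions; Jethro's actual index format is not described here.

```python
from functools import reduce

rows = [
    {"country": "US", "year": 2023},
    {"country": "DE", "year": 2023},
    {"country": "US", "year": 2024},
    {"country": "US", "year": 2023},
]


def build_bitmap(column, value):
    """Return an int whose bit i is 1 when rows[i][column] == value."""
    bits = 0
    for position, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << position
    return bits


# Hypothetical cache: set of (column, value) conditions -> combined bitmap
intermediate_cache = {}


def filter_bitmap(conditions):
    """AND together the bitmaps of all conditions, reusing cached results."""
    key = frozenset(conditions)
    if key not in intermediate_cache:
        all_rows = (1 << len(rows)) - 1  # identity element for AND
        bitmaps = [build_bitmap(col, val) for col, val in conditions]
        intermediate_cache[key] = reduce(lambda a, b: a & b, bitmaps, all_rows)
    return intermediate_cache[key]


def matching_positions(bitmap):
    """List the row positions whose bit is set in the bitmap."""
    return [i for i in range(len(rows)) if (bitmap >> i) & 1]


combined = filter_bitmap([("country", "US"), ("year", 2023)])
positions = matching_positions(combined)
```

A later query with the same pair of conditions would find the combined bitmap in intermediate_cache and skip the recomputation.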
For further details, see section Managing Adaptive Cache under chapter Administering Jethro.
Adaptive Query Cache
The adaptive query cache is part of the adaptive cache that holds intermediate query result sets (for SELECT statements only).
A query is considered for inclusion in the adaptive query cache if it completed without errors and took longer than adaptive.cache.query.min.response.time, which defaults to one second. In addition, the cached result set must contain fewer rows than adaptive.cache.query.max.results, which defaults to 100,000 rows.
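The eligibility rule above can be written as a small predicate. This is a sketch of the rule as stated in this entry, not Jethro's code: the function name is_cache_candidate and its arguments are hypothetical, and the response time is assumed to be measured in seconds.

```python
# Defaults as described in this glossary entry.
MIN_RESPONSE_TIME_SECONDS = 1.0   # adaptive.cache.query.min.response.time
MAX_RESULT_ROWS = 100_000         # adaptive.cache.query.max.results


def is_cache_candidate(
    is_select: bool,
    succeeded: bool,
    elapsed_seconds: float,
    result_rows: int,
    min_response_time: float = MIN_RESPONSE_TIME_SECONDS,
    max_results: int = MAX_RESULT_ROWS,
) -> bool:
    """Return True when a finished query qualifies for the query cache."""
    return (
        is_select                                # SELECT statements only
        and succeeded                            # completed without errors
        and elapsed_seconds > min_response_time  # slow enough to be worth caching
        and result_rows < max_results            # result set small enough to keep
    )
```

For example, a successful SELECT that ran for 2.5 seconds and returned 500 rows qualifies, while the same query finishing in 0.2 seconds does not.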
Append-only
Append-only data stores allow only adding data into a table; there is no possibility to remove or change data. Data deletion is only enabled by dropping an object.
Jethro format is append-only, and supports partitioning. As a result, dropping a partition is the only way to delete data.
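The following Python sketch models an append-only, partitioned table to make the constraint concrete. The class AppendOnlyTable and its methods are hypothetical and do not reflect Jethro's storage format; the point is only that the interface offers no update or row-level delete.

```python
from collections import defaultdict


class AppendOnlyTable:
    """Toy append-only table: rows can be added, partitions can be dropped."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self._partitions = defaultdict(list)  # partition value -> rows

    def append(self, row):
        """Add a row to the partition selected by its partition key value."""
        self._partitions[row[self.partition_key]].append(dict(row))

    def drop_partition(self, value):
        """Remove a whole partition and return how many rows it held."""
        return len(self._partitions.pop(value, []))

    def rows(self):
        """Return all rows across the remaining partitions."""
        return [row for part in self._partitions.values() for row in part]


table = AppendOnlyTable(partition_key="month")
table.append({"month": "2024-01", "amount": 10})
table.append({"month": "2024-01", "amount": 20})
table.append({"month": "2024-02", "amount": 30})

dropped = table.drop_partition("2024-01")  # the only way to delete data
remaining = table.rows()
```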
B
BI
Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable information to help corporate executives, business managers, and other end users make more informed business decisions. BI refers to a wide variety of tools, strategies, applications, data products, and methodologies that enable companies and organizations to spot, collect, analyze, present, and disseminate business information that arrives from internal systems and external sources. BI tools allow companies to develop and run queries against the data, and to create reports, dashboards, and data visualizations that make the analytical results available to corporate decision makers as well as operational workers.
BI tools help users understand the current situation and identify patterns, based on both historical information and new data gathered from source systems as it is generated. This information can assist in accelerating and improving decision making; optimizing internal business processes; increasing operational efficiency; driving new revenues; and gaining competitive advantages over business rivals. In addition, BI systems can help companies predict future market trends and spot possible business issues that need to be prevented or addressed.
BI tools support both basic operating decisions, such as product positioning or pricing, and strategic business decisions, such as priorities and goals.
While initially BI tools were primarily used by data analysts and other IT professionals who ran analyses and produced reports with query results for business users, the emergence of self-service BI and data discovery tools such as Tableau and QlikSense/QlikView made these tools available to business executives and non-IT company workers.
Big Data
Big data is a term that describes data sets, whether structured, semistructured, or unstructured, that are too large or complex to be processed by traditional database and software techniques. The complexity of handling big data is often a result of its main characteristics, known as the 3Vs: the extreme volume of data, the wide variety of data types, and the velocity at which the data must be processed.
While the term itself does not denote any specific volume of data, it is often used to describe terabytes, petabytes, and even exabytes of data captured over time.
Big data is mostly used by organizations and companies, as it assists in improving operations and making faster, more intelligent decisions. When this data is captured, formatted, manipulated, stored, and analyzed, it can assist in gaining useful insight to increase revenues, get or retain customers, and improve operations.
C
Columnar Database
A columnar database, also known as a column-oriented database, is a database management system (DBMS) that stores data in columns rather than in rows, as traditional row-oriented relational DBMSs do. Storing data in columns rather than rows allows the database to access more precisely the data required for answering a query, instead of scanning and discarding unwanted data in rows. As a result, query performance often improves, particularly on very large data sets.
One of the main benefits of a columnar database is that data can be highly compressed, which allows for a very rapid execution of columnar operations such as MIN, MAX, SUM, COUNT, and AVG. Another benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a relational database management system (RDBMS) that contains the same data. However, the loading process can take time depending on the size of data that is involved.
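The difference between the two layouts can be sketched in a few lines of Python. The sample data and variable names are made up for illustration, and run-length encoding is used here only as one simple example of column compression, not as a description of any particular product.

```python
from itertools import groupby

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Paris", "amount": 10.0},
    {"id": 2, "city": "Paris", "amount": 20.0},
    {"id": 3, "city": "Rome", "amount": 5.0},
    {"id": 4, "city": "Rome", "amount": 7.5},
    {"id": 5, "city": "Rome", "amount": 2.5},
]

# Column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# SUM(amount) on rows touches every field of every record ...
row_sum = sum(row["amount"] for row in rows)
# ... while on columns it reads only the one list it needs.
col_sum = sum(columns["amount"])

# Values of the same column sit next to each other, so repeated
# values compress well, for example with run-length encoding.
city_runs = [(city, len(list(run))) for city, run in groupby(columns["city"])]
```

Both sums give the same result; the columnar version simply never looks at the id or city values.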
D
Database
A database is a collection of information that is organized to be easily accessed, managed, and updated by a collection of programs known as a database management system (DBMS). Computer databases typically contain aggregations of data records or files, such as sales transactions, product catalogs and inventories, and customer profiles.
DBMSs, which are sometimes loosely referred to as "databases", are accessed by applications through standards such as SQL, JDBC, and ODBC, thereby allowing a single application (for example, Tableau) to work with multiple DBMSs.
Modern DBMS are largely divided into two main types:
Relational databases (Oracle, SQL Server, DB2 and so on)
NoSQL (big data) databases (Cassandra, Hadoop and so on)
NoSQL databases are used for handling rapidly growing volumes of unstructured data and for scaling out easily. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data, or data that is stored remotely on multiple virtual servers in the cloud. It is therefore used by companies that hold such massive amounts of data, such as LinkedIn and Twitter.
DataNode
HDFS has a master/slave architecture, with a single master server called NameNode that manages the file system namespace and regulates access to files by clients, and multiple DataNodes, usually one per node in the cluster, with data replicated across them.
DataNodes store the actual data in HDFS - namely, a series of named blocks - and serve read and write requests from the file system's clients by allowing client code to read these blocks or to write new block data. Upon startup, each DataNode communicates with the NameNode to announce itself and the list of blocks for which it is responsible, and maintains constant communication with the NameNode as long as the DataNode is running (up). Each DataNode also communicates with client code and other DataNodes from time to time.
The replication of data between DataNodes means that when a DataNode is down, the availability of the data and of the cluster is not affected; the NameNode ensures that the blocks managed by the DataNode that is down are replicated to other DataNodes.
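The recovery step can be pictured with a small simulation. This is a simplified model rather than HDFS code: the function re_replicate, the node and block names, and the replication factor of 2 are assumptions chosen to keep the example short.

```python
REPLICATION_FACTOR = 2  # assumed for this sketch


def re_replicate(block_locations, live_nodes, replication=REPLICATION_FACTOR):
    """Return a new block -> DataNodes map after some DataNodes went down.

    Replicas on dead nodes are forgotten, and each block that still has at
    least one surviving replica is copied to other live nodes until it is
    back at the target replication.
    """
    result = {}
    for block, nodes in block_locations.items():
        holders = [node for node in nodes if node in live_nodes]
        spare = [node for node in sorted(live_nodes) if node not in holders]
        # A block can only be copied if at least one replica survived.
        while holders and len(holders) < replication and spare:
            holders.append(spare.pop(0))
        result[block] = holders
    return result


blocks = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]}
after = re_replicate(blocks, live_nodes={"dn1", "dn3"})  # dn2 is down
```

After dn2 goes down, both blocks are still readable from a surviving replica, and the simulated NameNode logic restores two copies of each on the remaining nodes.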
Data Types
A data type, in computer science and computer programming, is a classification that specifies which type of value a variable has, what the data means, how values of that type can be stored, and which mathematical, relational, or logical operations can be applied to it without causing an error. A string, for example, is a data type that is used to classify text, a float is used for classifying numbers with a decimal point (3.14, for example), and an integer is a data type used to classify whole numbers (5, 15 and so on).
The data type defines which operations can safely be performed to create, transform, and use the variable in another computation; for example, a float can be multiplied by an integer (1.5 x 5), but not by a string (1.5 x "Dutch"). In addition, data types define the maximum length of the values that can be stored. For example, in MySQL the TEXT data type can store up to 65,535 bytes, and can therefore perhaps hold the text of a single article but is not suitable for storing an entire book.
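The multiplication example above can be tried directly in Python, which is used here only as a convenient illustration; other languages report the same kind of mismatch in their own way.

```python
product = 1.5 * 5  # float x integer is a defined operation

try:
    1.5 * "Dutch"  # float x string is not defined
    error = None
except TypeError as exc:
    error = exc    # Python rejects the operation with a TypeError
```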
H
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system and a framework for the analysis and transformation of very large data sets. HDFS is designed to reliably store very large files across low-cost machines in a large cluster, and to stream those data sets at high bandwidth to user applications. By distributing storage and computation across many low-cost servers, the resource can grow with demand while remaining economical at every size; computation capacity, storage capacity, and I/O bandwidth are scaled simply by adding commodity servers.
HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
I
Index
A database index is a data structure that improves the speed of data retrieval operations on a database table (in SQL, SELECT queries and WHERE clauses) by providing a pointer to data in the table. This pointer, known as the index key, is used for fast retrieval of data, and can be based either on the primary key (the unique identifier of a row, such as an ID number) or on any other, non-unique data such as first name or department name.
An index is a small copy of a database table sorted by key values, without which query languages such as SQL may have to scan the entire table from top to bottom to select relevant rows.
While indexes speed up retrieval operations, they slow down write operations such as UPDATE and INSERT, because the index must be updated whenever the underlying table changes.
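The trade-off can be sketched with a dictionary acting as a non-unique index on a department column. The table contents and the function names are hypothetical, and a real DBMS would typically use a structure such as a B-tree or a bitmap rather than a Python dictionary.

```python
from collections import defaultdict

table = [
    {"id": 1, "name": "Ann", "department": "Sales"},
    {"id": 2, "name": "Bob", "department": "R&D"},
    {"id": 3, "name": "Cy", "department": "Sales"},
]


def full_scan(department):
    """Without an index: examine every row from top to bottom."""
    return [row for row in table if row["department"] == department]


# Index: key value -> positions of the rows that hold that value.
department_index = defaultdict(list)
for position, row in enumerate(table):
    department_index[row["department"]].append(position)


def index_lookup(department):
    """With an index: jump straight to the matching row positions."""
    return [table[pos] for pos in department_index.get(department, [])]


def insert(row):
    """Every insert pays extra work to keep the index up to date."""
    table.append(row)
    department_index[row["department"]].append(len(table) - 1)


insert({"id": 4, "name": "Dee", "department": "R&D"})
```

Both lookups return the same rows; the indexed version avoids reading rows from other departments, at the cost of the extra bookkeeping in insert.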
J
JDBC Driver
A JDBC driver is a software component that allows a Java application to connect to databases that support SQL.
JDBC (the Java Database Connectivity application programming interface, or API) requires a driver for each database in order to carry out the following operations:
Establishing a connection with a data source
Sending queries and update statements to the data source
Processing the results
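The three operations above follow a pattern shared by most database access APIs. The sketch below shows that pattern using Python's standard sqlite3 module and an in-memory database. Note that this is Python's DB-API, not JDBC, and the sales table is made up for the example; it is meant only as an analogy for the connect, send, and process steps.

```python
import sqlite3

# 1. Establish a connection with a data source (an in-memory database here).
conn = sqlite3.connect(":memory:")
try:
    # 2. Send update statements and queries to the data source.
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("EU", 10.0), ("US", 5.0), ("EU", 2.5)],
    )
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )

    # 3. Process the results.
    totals = {region: total for region, total in cursor}
finally:
    conn.close()
```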