Data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization but relational data volumes are increased day by day.
To overcome the challenges, distributed data warehouse system shares data across multiple data repositories for the purpose of Online Analytical Processing(OLAP). Each data warehouse may belong to one or more organizations. It performs load balancing and scalability. Metadata is replicated and centrally distributed.
Apache Tajo is a distributed data warehouse system which uses Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of MapReduce framework.
Hadoop is an open-source framework that allows to store and process big data in a distributed environment. It is extremely fast and powerful. However, Hadoop has limited querying capabilities so its performance can be made even better with the help of SQL on Hadoop. This allows users to interact with Hadoop through easy SQL commands.
Some of the examples of SQL on Hadoop applications are Hive, Impala, Drill, Presto, Spark, HAWQ and Apache Tajo.
Apache Tajo is a relational and distributed data processing framework. It is designed for low latency and scalable ad-hoc query analysis.
Tajo supports standard SQL and various data formats. Most of the Tajo queries can be executed without any modification.
Tajo has fault-tolerance through a restart mechanism for failed tasks and extensible query rewrite engine.
Tajo performs the necessary ETL (Extract Transform and Load process) operations to summarize large datasets stored on HDFS. It is an alternative choice to Hive/Pig.
The latest version of Tajo has greater connectivity to Java programs and third-party databases such as Oracle and PostGreSQL.
Apache Tajo has the following features −
Apache Tajo offers the following benefits −
The following are some of the use cases of Apache Tajo −
Korea’s SK Telecom firm ran Tajo against 1.7 terabytes worth of data and found it could complete queries with greater speed than either Hive or Impala.
The Korean music streaming service Melon uses Tajo for analytical processing. Tajo executes ETL (extract-transform-load process) jobs 1.5 to 10 times faster than Hive.
Bluehole Studio, a Korean based company developed TERA — a fantasy multiplayer online game. The company uses Tajo for game log analysis and finding principal causes of service quality interrupts.
Apache Tajo supports the following data formats −
Tajo supports the following storage formats −
The following illustration depicts the architecture of Apache Tajo.
The following table describes each of the components in detail.
S.No. | Component & Description |
---|---|
1 | Client Client submits the SQL statements to the Tajo Master to get the result. |
2 | Master Master is the main daemon. It is responsible for query planning and is the coordinator for workers. |
3 | Catalog server Maintains the table and index descriptions. It is embedded in the Master daemon. The catalog server uses Apache Derby as the storage layer and connects via JDBC client. |
4 | Worker Master node assigns task to worker nodes. TajoWorker processes data. As the number of TajoWorkers increases, the processing capacity also increases linearly. |
5 | Query Master Tajo master assigns query to the Query Master. The Query Master is responsible for controlling a distributed execution plan. It launches the TaskRunner and schedules tasks to TaskRunner. The main role of the Query Master is to monitor the running tasks and report them to the Master node. |
6 | Node Managers Manages the resource of the worker node. It decides on allocating requests to the node. |
7 | TaskRunner Acts as a local query execution engine. It is used to run and monitor query process. The TaskRunner processes one task at a time. It has the following three main attributes −
|
8 | Query Executor It is used to execute a query. |
9 | Storage service Connects the underlying data storage to Tajo. |
Tajo uses Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes.
The master is mainly responsible for query planning and the coordinator for workers. The master divides a query into small tasks and assigns to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators.
In addition, Tajo can control distributed data flow more flexible than that of MapReduce and supports indexing techniques.
The web-based interface of Tajo has the following capabilities −
To install Apache Tajo, you must have the following software on your system −
Let us now continue with the following steps to install Tajo.
Hopefully, you have already installed Java version 8 on your machine. Now, you just need to proceed by verifying it.
To verify, use the following command −
$ java -version
If Java is successfully installed on your machine, you could see the present version of the installed Java. If Java is not installed follow these steps to install Java 8 on your machine.
Download the latest version of JDK by visiting the following link and then, download the latest version.
The latest version is JDK 8u 92 and the file is “jdk-8u92-linux-x64.tar.gz”. Please download the file on your machine. Following this, extract the files and move them to a specific directory. Now, set the Java alternatives. Finally, Java is installed on your machine.
You have already installed Hadoop on your system. Now, verify it using the following command −
$ hadoop version
If everything is fine with your setup, then you could see the version of Hadoop. If Hadoop is not installed, download and install Hadoop by visiting the following link − https://www.apache.org
Apache Tajo provides two execution modes — local mode and fully distributed mode. After verifying Java and Hadoop installation proceed with the following steps to install Tajo cluster on your machine. A local mode Tajo instance requires very easy configurations.
Download the latest version of Tajo by visiting the following link − https://www.apache.org/dyn/closer.cgi/tajo
Now you can download the file “tajo-0.11.3.tar.gz” from your machine.
Extract the tar file by using the following command −
$ cd opt/ $ tar tajo-0.11.3.tar.gz $ cd tajo-0.11.3
Add the following changes to “conf/tajo-env.sh” file
$ cd tajo-0.11.3 $ vi conf/tajo-env.sh # Hadoop home. Required export HADOOP_HOME = /Users/path/to/Hadoop/hadoop-2.6.2 # The java implementation to use. Required. export JAVA_HOME = /path/to/jdk1.8.0_92.jdk/
Here, you must specify Hadoop and Java path to “tajo-env.sh” file. After the changes are made, save the file and quit the terminal.
To launch the Tajo server, execute the following command −
$ bin/start-tajo.sh
You will receive a response similar to the following −
Starting single TajoMaster starting master, logging to /Users/path/to/Tajo/tajo-0.11.3/bin/../ localhost: starting worker, logging to /Users/path/toe/Tajo/tajo-0.11.3/bin/../logs/ Tajo master web UI: http://local:26080 Tajo Client Service: local:26002
Now, type the command “jps” to see the running daemons.
$ jps 1010 TajoWorker 1140 Jps 933 TajoMaster
To launch the Tajo shell client, use the following command −
$ bin/tsql
You will receive the following output −
welcome to _____ ___ _____ ___ /_ _/ _ |/_ _/ / / // /_| |_/ // / / /_//_/ /_/___/ \__/ 0.11.3 Try \? for help.
Execute the following command to quit Tsql −
default> \q bye!
Here, the default refers to the catalog in Tajo.
Type the following URL to launch the Tajo web UI − http://localhost:26080/
You will now see the following screen which is similar to the ExecuteQuery option.
To stop the Tajo server, use the following command −
$ bin/stop-tajo.sh
You will get the following response −
localhost: stopping worker stopping master
Tajo’s configuration is based on Hadoop’s configuration system. This chapter explains Tajo configuration settings in detail.
Tajo uses the following two config files −
Distributed mode setup runs on Hadoop Distributed File System (HDFS). Let’s follow the steps to configure Tajo distributed mode setup.
This file is available @ /path/to/tajo/conf directory and acts as configuration for other Tajo modules. To access Tajo in a distributed mode, apply the following changes to “tajo-site.xml”.
<property> <name>tajo.rootdir</name> <value>hdfs://hostname:port/tajo</value> </property> <property> <name>tajo.master.umbilical-rpc.address</name> <value>hostname:26001</value> </property> <property> <name>tajo.master.client-rpc.address</name> <value>hostname:26002</value> </property> <property> <name>tajo.catalog.client-rpc.address</name> <value>hostname:26005</value> </property>
Tajo uses HDFS as a primary storage type. The configuration is as follows and should be added to “tajo-site.xml”.
<property> <name>tajo.rootdir</name> <value>hdfs://namenode_hostname:port/path</value> </property>
If you want to customize the catalog service, copy $path/to/Tajo/conf/catalogsite.xml.template to $path/to/Tajo/conf/catalog-site.xml and add any of the following configuration as needed.
For example, if you use “Hive catalog store” to access Tajo, then the configuration should be like the following −
<property> <name>tajo.catalog.store.class</name> <value>org.apache.tajo.catalog.store.HCatalogStore</value> </property>
If you need to store MySQL catalog, then apply the following changes −
<property> <name>tajo.catalog.store.class</name> <value>org.apache.tajo.catalog.store.MySQLStore</value> </property> <property> <name>tajo.catalog.jdbc.connection.id</name> <value><mysql user name></value> </property> <property> <name>tajo.catalog.jdbc.connection.password</name> <value><mysql user password></value> </property> <property> <name>tajo.catalog.jdbc.uri</name> <value>jdbc:mysql://<mysql host name>:<mysql port>/<database name for tajo> ?createDatabaseIfNotExist = true</value> </property>
Similarly, you can register the other Tajo supported catalogs in the configuration file.
By default, the TajoWorker stores temporary data on the local file system. It is defined in the “tajo-site.xml” file as follows −
<property> <name>tajo.worker.tmpdir.locations</name> <value>/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir</value> </property>
To increase the capacity of running tasks of each worker resource, choose the following configuration −
<property> <name>tajo.worker.resource.cpu-cores</name> <value>12</value> </property> <property> <name>tajo.task.resource.min.memory-mb</name> <value>2000</value> </property> <property> <name>tajo.worker.resource.disks</name> <value>4</value> </property>
To make the Tajo worker run in a dedicated mode, choose the following configuration −
<property> <name>tajo.worker.resource.dedicated</name> <value>true</value> </property>
In this chapter, we will understand the Tajo Shell commands in detail.
To execute the Tajo shell commands, you need to start the Tajo server and the Tajo shell using the following commands −
$ bin/start-tajo.sh
$ bin/tsql
The above commands are now ready for execution.
Let us now discuss the Meta Commands. Tsql meta commands start with a backslash (‘\’).
“\?” Command is used to show the help option.
Query
default> \?
Result
The above \? Command list out all the basic usage options in Tajo. You will receive the following output −
To list out all the databases in Tajo, use the following command −
Query
default> \l
Result
You will receive the following output −
information_schema default
At present, we have not created any database so it shows two built in Tajo databases.
\c option is used to display the current database name.
Query
default> \c
Result
You are now connected to database "default" as user “username”.
To list out all the built-in function, type the query as follows −
Query
default> \df
Result
You will receive the following output −
\df function name − This query returns the complete description of the given function.
Query
default> \df sqrt
Result
You will receive the following output −
To quit the terminal, type the following query −
Query
default> \q
Result
You will receive the following output −
bye!
Tajo shell provides \admin option to list out all the admin features.
Query
default> \admin
Result
You will receive the following output −
To display the cluster information in Tajo, use the following query
Query
default> \admin -cluster
Result
You will receive the following output −
The following query displays the current master information.
Query
default> \admin -showmasters
Result
localhost
Similarly, you can try other admin commands.
The Tajo client connects to the Master via a unique session id. The session is live until the client is disconnected or expires.
The following command is used to list out all session variables.
Query
default> \set
Result
'SESSION_LAST_ACCESS_TIME' = '1470206387146' 'CURRENT_DATABASE' = 'default' ‘USERNAME’ = 'user' 'SESSION_ID' = 'c60c9b20-dfba-404a-822f-182bc95d6c7c' 'TIMEZONE' = 'Asia/Kolkata' 'FETCH_ROWNUM' = '200' ‘COMPRESSED_RESULT_TRANSFER' = 'false'
The \set key val will set the session variable named key with the value val. For example,
Query
default> \set ‘current_database’='default'
Result
usage: \set [[NAME] VALUE]
Here, you can assign the key and value in the \set command. If you need to revert the changes then use the \unset command.
To execute a query in a Tajo shell, open your terminal and move to the Tajo installed directory and then type the following command −
$ bin/tsql
You will now see the response as shown in the following program −
default>
You can now execute your queries. Otherwise you can run your queries through web console application to the following URL − http://localhost:26080/
Apache Tajo supports the following list of primitive data types −
S.No. | Data type & Description |
---|---|
1 | integer Used for storing integer value with 4 bytes storage. |
2 | tinyint Tiny integer value is 1 byte |
3 | smallint Used for storing small size integer 2 bytes value. |
4 | bigint Big range integer value has 8 bytes storage. |
5 | boolean Returns true/false. |
6 | real Used for storing real value. Size is 4 bytes. |
7 | float Floating point precision value which has 4 or 8 bytes storage space. |
8 | double Double point precision value stored in 8 bytes. |
9 | char[(n)] Character value. |
10 | varchar[(n)] Variable-length non-Unicode data. |
11 | number Decimal values. |
12 | binary Binary values. |
13 | date Calendar date (year, month, day). Example − DATE '2016-08-22' |
14 | time Time of day (hour, minute, second, millisecond) without a time zone. Values of this type are parsed and rendered in the session time zone. |
15 | timezone Time of day (hour, minute, second, millisecond) with a time zone. Values of this type are rendered using the time zone from the value. Example − TIME '01:02:03.456 Asia/kolkata' |
16 | timestamp Instant in time that includes the date and time of day without a time zone. Example − TIMESTAMP '2016-08-22 03:04:05.321' |
17 | text Variable-length Unicode text. |
The following operators are used in Tajo to perform desired operations.
S.No. | Operator & Description |
---|---|
1 | Arithmetic operators
Presto supports arithmetic operators such as +, −, *, /, %. |
2 | Relational operators
<, >, <=, >=, =, <> |
3 | Logical operators
AND, OR, NOT |
4 | String operators
The ‘||’ operator performs string concatenation. |
5 | Range operators
Range operator is used to test the value in a specific range. Tajo supports BETWEEN, IS NULL, IS NOT NULL operators. |
As of now, you were aware of running simple basic queries on Tajo. In the next subsequent few chapters, we will discuss the following SQL functions −
Math functions operate on mathematical formulae. The following table describes the list of functions in detail.
S.No. | Function & Description |
---|---|
1 | abs(x)
Returns the absolute value of x. |
2 | cbrt(x)
Returns the cube root of x. |
3 | ceil(x)
Returns x value rounded up to the nearest integer. |
4 | floor(x)
Returns x rounded down to the nearest integer. |
5 | pi()
Returns pi value. Result will be returned as double value. |
6 | radians(x)
converts the angle x in degree radians. |
7 | degrees(x)
Returns degree value for x. |
8 | pow(x,p)
Returns power of value‘p’ to the x value. |
9 | div(x,y)
Returns the division result for the given two x,y integer values. |
10 | exp(x)
Returns Euler’s number e raised to the power of a number. |
11 | sqrt(x)
Returns the square root of x. |
12 | sign(x)
Returns the signum function of x, that is −
|
13 | mod(n,m)
Returns the modulus (remainder) of n divided by m. |
14 | round(x)
Returns the rounded value for x. |
15 | cos(x)
Returns the cosine value(x). |
16 | asin(x)
Returns the inverse sine value(x). |
17 | acos(x)
Returns the inverse cosine value(x). |
18 | atan(x)
Returns the inverse tangent value(x). |
19 | atan2(y,x)
Returns the inverse tangent value(y/x). |
The following table lists out the data type functions available in Apache Tajo.
S.No. | Function & Description |
---|---|
1 | to_bin(x)
Returns the binary representation of integer. |
2 | to_char(int,text)
Converts integer to string. |
3 | to_hex(x)
Converts the x value into hexadecimal. |
The following table lists out the string functions in Tajo.
S.No. | Function & Description |
---|---|
1 | concat(string1, ..., stringN)
Concatenate the given strings. |
2 | length(string)
Returns the length of the given string. |
3 | lower(string)
Returns the lowercase format for the string. |
4 | upper(string)
Returns the uppercase format for the given string. |
5 | ascii(string text)
Returns the ASCII code of the first character of the text. |
6 | bit_length(string text)
Returns the number of bits in a string. |
7 | char_length(string text)
Returns the number of characters in a string. |
8 | octet_length(string text)
Returns the number of bytes in a string. |
9 | digest(input text, method text)
Calculates the Digest hash of string. Here, the second arg method refers to the hash method. |
10 | initcap(string text)
Converts the first letter of each word to upper case. |
11 | md5(string text)
Calculates the MD5 hash of string. |
12 | left(string text, int size)
Returns the first n characters in the string. |
13 | right(string text, int size)
Returns the last n characters in the string. |
14 | locate(source text, target text, start_index)
Returns the location of specified substring. |
15 | strposb(source text, target text)
Returns the binary location of specified substring. |
16 | substr(source text, start index, length)
Returns the substring for the specified length. |
17 | trim(string text[, characters text])
Removes the characters (a space by default) from the start/end/both ends of the string. |
18 | split_part(string text, delimiter text, field int)
Splits a string on delimiter and returns the given field (counting from one). |
19 | regexp_replace(string text, pattern text, replacement text)
Replaces substrings matched to a given regular expression pattern. |
20 | reverse(string)
Reverse operation performed for the string. |
Apache Tajo supports the following DateTime functions.
S.No. | Function & Description |
---|---|
1 | add_days(date date or timestamp, int day
Returns date added by the given day value. |
2 | add_months(date date or timestamp, int month)
Returns date added by the given month value. |
3 | current_date()
Returns today’s date. |
4 | current_time()
Returns today’s time. |
5 | extract(century from date/timestamp)
Extracts century from the given parameter. |
6 | extract(day from date/timestamp)
Extracts day from the given parameter. |
7 | extract(decade from date/timestamp)
Extracts decade from the given parameter. |
8 | extract(day dow date/timestamp)
Extracts day of week from the given parameter. |
9 | extract(doy from date/timestamp)
Extracts day of year from the given parameter. |
10 | select extract(hour from timestamp)
Extracts hour from the given parameter. |
11 | select extract(isodow from timestamp)
Extracts day of week from the given parameter. This is identical to dow except for Sunday. This matches the ISO 8601 day of the week numbering. |
12 | select extract(isoyear from date)
Extracts ISO year from the specified date. ISO year may be different from the Gregorian year. |
13 | extract(microseconds from time)
Extracts microseconds from the given parameter. The seconds field, including fractional parts, multiplied by 1 000 000; |
14 | extract(millennium from timestamp )
Extracts millennium from the given parameter.one millennium corresponds to 1000 years. Hence, the third millennium started January 1, 2001. |
15 | extract(milliseconds from time)
Extracts milliseconds from the given parameter. |
16 | extract(minute from timestamp )
Extracts minute from the given parameter. |
17 | extract(quarter from timestamp)
Extracts quarter of the year(1 - 4) from the given parameter. |
18 | date_part(field text, source date or timestamp or time)
Extracts date field from text. |
19 | now()
Returns current timestamp. |
20 | to_char(timestamp, format text)
Converts timestamp to text. |
21 | to_date(src text, format text)
Converts text to date. |
22 | to_timestamp(src text, format text)
Converts text to timestamp. |
The JSON functions are listed in the following table −
S.No. | Function & Description |
---|---|
1 | json_extract_path_text(js on text, json_path text)
Extracts JSON string from a JSON string based on json path specified. |
2 | json_array_get(json_array text, index int4)
Returns the element at the specified index into the JSON array. |
3 | json_array_contains(json_ array text, value any)
Determine if the given value exists in the JSON array. |
4 | json_array_length(json_ar ray text)
Returns the length of json array. |
This section explains the Tajo DDL commands. Tajo has a built-in database named default.
Create Database is a statement used to create a database in Tajo. The syntax for this statement is as follows −
CREATE DATABASE [IF NOT EXISTS] <database_name>
default> default> create database if not exists test;
The above query will generate the following result.
OK
Database is the namespace in Tajo. A database can contain multiple tables with a unique name.
To check the current database name, issue the following command −
default> \c
The above query will generate the following result.
You are now connected to database "default" as user “user1". default>
As of now, you have created a database named “test”. The following syntax is used to connect the “test” database.
\c <database name>
default> \c test
The above query will generate the following result.
You are now connected to database "test" as user “user1”. test>
You can now see the prompt changes from default database to test database.
To drop a database, use the following syntax −
DROP DATABASE <database-name>
test> \c default You are now connected to database "default" as user “user1". default> drop database test;
The above query will generate the following result.
OK
A table is a logical view of one data source. It consists of a logical schema, partitions, URL, and various properties. A Tajo table can be a directory in HDFS, a single file, one HBase table, or a RDBMS table.
Tajo supports the following two types of tables −
External table needs the location property when the table is created. For example, if your data is already there as Text/JSON files or HBase table, you can register it as Tajo external table.
The following query is an example of external table creation.
create external table sample(col1 int,col2 text,col3 int) location ‘hdfs://path/to/table';
Here,
External keyword − This is used to create an external table. This helps to create a table in the specified location.
Sample refers to the table name.
Location − It is a directory for HDFS,Amazon S3, HBase or local file system. To assign a location property for directories, use the below URI examples −
HDFS − hdfs://localhost:port/path/to/table
Amazon S3 − s3://bucket-name/table
local file system − file:///path/to/table
Openstack Swift − swift://bucket-name/table
An external table has the following properties −
TimeZone − Users can specify a time zone for reading or writing a table.
Compression format − Used to make data size compact. For example, the text/json file uses compression.codec property.
A Internal table is also called an Managed Table. It is created in a pre-defined physical location called the Tablespace.
create table table1(col1 int,col2 text);
By default, Tajo uses “tajo.warehouse.directory” located in “conf/tajo-site.xml” . To assign new location for the table, you can use Tablespace configuration.
Tablespace is used to define locations in the storage system. It is supported for only internal tables. You can access the tablespaces by their names. Each tablespace can use a different storage type. If you don’t specify tablespaces then, Tajo uses the default tablespace in the root directory.
You have “conf/tajo-site.xml.template” in Tajo. Copy the file and rename it to “storagesite.json”. This file will act as a configuration for Tablespaces. Tajo data formats uses the following configuration −
$ vi conf/storage-site.json { "spaces": { "${tablespace_name}": { "uri": “hdfs://localhost:9000/path/to/Tajo" } } }
$ vi conf/storage-site.json { "spaces": { "${tablespace_name}": { "uri": “hbase:zk://quorum1:port,quorum2:port/" } } }
$ vi conf/storage-site.json { "spaces": { "${tablespace_name}": { “uri”: “hdfs://localhost:9000/path/to/Tajo” } } }
Tajo’s internal table records can be accessed from another table only. You can configure it with tablespace.
CREATE TABLE [IF NOT EXISTS] <table_name> [(column_list)] [TABLESPACE tablespace_name] [using <storage_type> [with (<key> = <value>, ...)]] [AS <select_statement>]
Here,
IF NOT EXISTS − This avoids an error if the same table has not been created already.
TABLESPACE − This clause is used to assign the tablespace name.
Storage type − Tajo data supports formats like text,JSON,HBase,Parquet,Sequencefile and ORC.
AS select statement − Select records from another table.
Start your Hadoop services and open the file “conf/storage-site.json”, then add the following changes −
$ vi conf/storage-site.json { "spaces": { “space1”: { "uri": “hdfs://localhost:9000/path/to/Tajo" } } }
Here, Tajo will refer to the data from HDFS location and space1 is the tablespace name. If you do not start Hadoop services, you can’t register tablespace.
default> create table table1(num1 int,num2 text,num3 float) tablespace space1;
The above query creates a table named “table1” and “space1” refers to the tablespace name.
Tajo supports data formats. Let’s go through each of the formats one by one in detail.
A character-separated values’ plain text file represents a tabular data set consisting of rows and columns. Each row is a plain text line.
default> create external table customer(id int,name text,address text,age int) using text with('text.delimiter'=',') location ‘file:/Users/workspace/Tajo/customers.csv’;
Here, “customers.csv” file refers to a comma separated value file located in the Tajo installation directory.
To create internal table using text format, use the following query −
default> create table customer(id int,name text,address text,age int) using text;
In the above query, you have not assigned any tablespace so it will take Tajo’s default tablespace.
A text file format has the following properties −
text.delimiter − This is a delimiter character. Default is ‘|’.
compression.codec − This is a compression format. By default, it is disabled. you can change the settings using specified algorithm.
timezone − The table used for reading or writing.
text.error-tolerance.max-num − The maximum number of tolerance levels.
text.skip.headerlines − The number of header lines per skipped.
text.serde − This is serialization property.
Apache Tajo supports JSON format for querying data. Tajo treats a JSON object as SQL record. One object equals one row in a Tajo table. Let’s consider “array.json” as follows −
$ hdfs dfs -cat /json/array.json { "num1" : 10, "num2" : "simple json array", "num3" : 50.5 }
After you create this file, switch to the Tajo shell and type the following query to create a table using the JSON format.
default> create external table sample (num1 int,num2 text,num3 float) using json location ‘json/array.json’;
Always remember that the file data must match with the table schema. Otherwise, you can omit the column names and use * which doesn’t require columns list.
To create an internal table, use the following query −
default> create table sample (num1 int,num2 text,num3 float) using json;
Parquet is a columnar storage format. Tajo uses Parquet format for easy, fast and efficient access.
The following query is an example for table creation −
CREATE TABLE parquet (num1 int,num2 text,num3 float) USING PARQUET;
Parquet file format has the following properties −
parquet.block.size − size of a row group being buffered in memory.
parquet.page.size − The page size is for compression.
parquet.compression − The compression algorithm used to compress pages.
parquet.enable.dictionary − The boolean value is to enable/disable dictionary encoding.
RCFile is the Record Columnar File. It consists of binary key/value pairs.
The following query is an example for table creation −
CREATE TABLE Record(num1 int,num2 text,num3 float) USING RCFILE;
RCFile has the following properties −
rcfile.serde − custom deserializer class.
compression.codec − compression algorithm.
rcfile.null − NULL character.
SequenceFile is a basic file format in Hadoop which consists of key/value pairs.
The following query is an example for table creation −
CREATE TABLE seq(num1 int,num2 text,num3 float) USING sequencefile;
This sequence file has Hive compatibility. This can be written in Hive as,
CREATE TABLE table1 (id int, name string, score float, type string) STORED AS sequencefile;
ORC (Optimized Row Columnar) is a columnar storage format from Hive.
The following query is an example for table creation −
CREATE TABLE optimized(num1 int,num2 text,num3 float) USING ORC;
The ORC format has the following properties −
orc.max.merge.distance − ORC file is read, it merges when the distance is lower.
orc.stripe.size − This is the size of each stripe.
orc.buffer.size − The default is 256KB.
orc.rowindex.stride − This is the ORC index stride in number of rows.
In the previous chapter, you have understood how to create tables in Tajo. This chapter explains about the SQL statement in Tajo.
Before moving to create a table, create a text file “students.csv” in Tajo installation directory path as follows −
students.csv
Id | Name | Address | Age | Marks |
---|---|---|---|---|
1 | Adam | 23 New Street | 21 | 90 |
2 | Amit | 12 Old Street | 13 | 95 |
3 | Bob | 10 Cross Street | 12 | 80 |
4 | David | 15 Express Avenue | 12 | 85 |
5 | Esha | 20 Garden Street | 13 | 50 |
6 | Ganga | 25 North Street | 12 | 55 |
7 | Jack | 2 Park Street | 12 | 60 |
8 | Leena | 24 South Street | 12 | 70 |
9 | Mary | 5 West Street | 12 | 75 |
10 | Peter | 16 Park Avenue | 12 | 95 |
After the file has been created, move to the terminal and start the Tajo server and shell one by one.
Create a new database using the following command −
default> create database sampledb; OK
Connect to the database “sampledb” which is now created.
default> \c sampledb You are now connected to database "sampledb" as user “user1”.
Then, create a table in “sampledb” as follows −
sampledb> create external table mytable(id int,name text,address text,age int,mark int) using text with('text.delimiter' = ',') location ‘file:/Users/workspace/Tajo/students.csv’;
The above query will generate the following result.
OK
Here, the external table is created. Now, you just have to enter the file location. If you have to assign the table from hdfs then use hdfs instead of file.
Next, the “students.csv” file contains comma separated values. The text.delimiter field is assigned with ‘,’.
You have now created “mytable” successfully in “sampledb”.
To show tables in Tajo, use the following query.
sampledb> \d mytable sampledb> \d mytable
The above query will generate the following result.
table name: sampledb.mytable table uri: file:/Users/workspace/Tajo/students.csv store type: TEXT number of rows: unknown volume: 261 B Options: 'timezone' = 'Asia/Kolkata' 'text.null' = '\\N' 'text.delimiter' = ',' schema: id INT4 name TEXT address TEXT age INT4 mark INT4
To fetch all the records in the table, type the following query −
sampledb> select * from mytable;
The above query will generate the following result.
Tajo uses the following syntax to insert records in table.
create table table1 (col1 int8, col2 text, col3 text); --schema should be same for target table schema Insert overwrite into table1 select * from table2; (or) Insert overwrite into LOCATION '/dir/subdir' select * from table;
Tajo’s insert statement is similar to the INSERT INTO SELECT statement of SQL.
Let’s create a table to overwrite table data of an existing table.
sampledb> create table test(sno int,name text,addr text,age int,mark int); OK sampledb> \d
The above query will generate the following result.
mytable test
To insert records in the “test” table, type the following query.
sampledb> insert overwrite into test select * from mytable;
The above query will generate the following result.
Progress: 100%, response time: 0.518 sec
Here, “mytable" records overwrite the “test” table. If you don’t want to create the “test” table, then straight away assign the physical path location as mentioned in an alternative option for insert query.
Use the following query to list out all the records in the “test” table −
sampledb> select * from test;
The above query will generate the following result.
This statement is used to add, remove or modify columns of an existing table.
To rename the table use the following syntax −
Alter table table1 RENAME TO table2;
sampledb> alter table test rename to students;
The above query will generate the following result.
OK
To check the changed table name, use the following query.
sampledb> \d mytable students
Now the table “test” is changed to “students” table.
To insert new column in the “students” table, type the following syntax −
Alter table <table_name> ADD COLUMN <column_name> <data_type>
sampledb> alter table students add column grade text;
The above query will generate the following result.
OK
This property is used to change the table’s property.
sampledb> ALTER TABLE students SET PROPERTY 'compression.type' = 'RECORD', 'compression.codec' = 'org.apache.hadoop.io.compress.Snappy Codec' ; OK
Here, compression type and codec properties are assigned.
To change the text delimiter property, use the following −
ALTER TABLE students SET PROPERTY ‘text.delimiter'=','; OK
The above query will generate the following result.
sampledb> \d students table name: sampledb.students table uri: file:/tmp/tajo-user1/warehouse/sampledb/students store type: TEXT number of rows: 10 volume: 228 B Options: 'compression.type' = 'RECORD' 'timezone' = 'Asia/Kolkata' 'text.null' = '\\N' 'compression.codec' = 'org.apache.hadoop.io.compress.SnappyCodec' 'text.delimiter' = ',' schema: id INT4 name TEXT addr TEXT age INT4 mark INT4 grade TEXT
The above result shows that the table’s properties are changed using the “SET” property.
The SELECT statement is used to select data from a database.
The syntax for the Select statement is as follows −
SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...] [FROM <table reference> [[AS] <table alias name>] [, ...]] [WHERE <condition>] [GROUP BY <expression> [, ...]] [HAVING <condition>] [ORDER BY <expression> [ASC|DESC] [NULLS (FIRST|LAST)] [, …]]
The Where clause is used to filter records from the table.
sampledb> select * from mytable where id > 5;
The above query will generate the following result.
The query returns the records of those students whose id is greater than 5.
sampledb> select * from mytable where name = ‘Peter’;
The above query will generate the following result.
Progress: 100%, response time: 0.117 sec id, name, address, age ------------------------------- 10, Peter, 16 park avenue , 12
The result filters Peter’s records only.
A table column may contain duplicate values. The DISTINCT keyword can be used to return only distinct (different) values.
SELECT DISTINCT column1,column2 FROM table_name;
sampledb> select distinct age from mytable;
The above query will generate the following result.
Progress: 100%, response time: 0.216 sec age ------------------------------- 13 12
The query returns the distinct age of students from mytable.
The GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups.
SELECT column1, column2 FROM table_name WHERE [ conditions ] GROUP BY column1, column2;
select age,sum(mark) as sumofmarks from mytable group by age;
The above query will generate the following result.
age, sumofmarks ------------------------------- 13, 145 12, 610
Here, the “mytable" column has two types of ages — 12 and 13. Now the query groups the records by age and produces the sum of marks for the corresponding ages of students.
The HAVING clause enables you to specify conditions that filter which group results appear in the final results. The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on the groups created by the GROUP BY clause.
SELECT column1, column2 FROM table1 GROUP BY column HAVING [ conditions ]
sampledb> select age from mytable group by age having sum(mark) > 200;
The above query will generate the following result.
age ------------------------------- 12
The query groups the records by age and returns the age when the condition result sum(mark) > 200.
The ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. The Tajo database sorts query results in ascending order by default.
SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN] [ASC | DESC];
sampledb> select * from mytable where mark > 60 order by name desc;
The above query will generate the following result.
The query returns the names of those students in descending order whose marks are greater than 60.
The CREATE INDEX statement is used to create indexes in tables. Index is used for fast retrieval of data. Current version supports index for only plain TEXT formats stored on HDFS.
CREATE INDEX [ name ] ON table_name ( { column_name | ( expression ) }
create index student_index on mytable(id);
The above query will generate the following result.
id ———————————————
To view assigned index for the column, type the following query.
default> \d mytable table name: default.mytable table uri: file:/Users/deiva/workspace/Tajo/students.csv store type: TEXT number of rows: unknown volume: 307 B Options: 'timezone' = 'Asia/Kolkata' 'text.null' = '\\N' 'text.delimiter' = ',' schema: id INT4 name TEXT address TEXT age INT4 mark INT4 Indexes: "student_index" TWO_LEVEL_BIN_TREE (id ASC NULLS LAST )
Here, TWO_LEVEL_BIN_TREE method is used by default in Tajo.
The Drop Table Statement is used to drop a table from the database.
drop table table name;
sampledb> drop table mytable;
To check if the table has been dropped from the table, type the following query.
sampledb> \d mytable;
The above query will generate the following result.
ERROR: relation 'mytable' does not exist
You can also check the query using “\d” command to list out the available Tajo tables.
This chapter explains the aggregate and window functions in detail.
Aggregate functions produce a single result from a set of input values. The following table describes the list of aggregate functions in detail.
S.No. | Function & Description |
---|---|
1 | AVG(exp)
Averages a column of all records in a data source. |
2 | CORR(expression1, expression2)
Returns the coefficient of correlation between a set of number pairs. |
3 | COUNT()
Returns the number rows. |
4 | MAX(expression)
Returns the largest value of the selected column. |
5 | MIN(expression)
Returns the smallest value of the selected column. |
6 | SUM(expression)
Returns the sum of the given column. |
7 | LAST_VALUE(expression)
Returns the last value of the given column. |
The Window functions execute on a set of rows and return a single value for each row from the query. The term window has the meaning of set of row for the function.
The Window function in a query, defines the window using the OVER() clause.
The OVER() clause has the following capabilities −
The following table describes the window functions in detail.
Function | Return type | Description |
---|---|---|
rank() | int | Returns rank of the current row with gaps. |
row_num() | int | Returns the current row within its partition, counting from 1. |
lead(value[, offset integer[, default any]]) | Same as input type | Returns value evaluated at the row that is offset rows after the current row within the partition. If there is no such row, default value will be returned. |
lag(value[, offset integer[, default any]]) | Same as input type | Returns value evaluated at the row that is offset rows before the current row within the partition. |
first_value(value) | Same as input type | Returns the first value of input rows. |
last_value(value) | Same as input type | Returns the last value of input rows. |
This chapter explains about the following significant Queries.
Let us proceed and perform the queries.
Predicate is an expression which is used to evaluate true/false values and UNKNOWN. Predicates are used in the search condition of WHERE clauses and HAVING clauses and other constructs where a Boolean value is required.
Determines whether the value of expression to test matches any value in the subquery or the list. Subquery is an ordinary SELECT statement that has a result set of one column and one or more rows. This column or all expressions in the list must have the same data type as the expression to test.
Syntax
IN::= <expression to test> [NOT] IN (<subquery>) | (<expression1>,...)
Query
select id,name,address from mytable where id in(2,3,4);
Result
The above query will generate the following result.
id, name, address ------------------------------- 2, Amit, 12 old street 3, Bob, 10 cross street 4, David, 15 express avenue
The query returns records from mytable for the students id 2,3 and 4.
Query
select id,name,address from mytable where id not in(2,3,4);
Result
The above query will generate the following result.
id, name, address ------------------------------- 1, Adam, 23 new street 5, Esha, 20 garden street 6, Ganga, 25 north street 7, Jack, 2 park street 8, Leena, 24 south street 9, Mary, 5 west street 10, Peter, 16 park avenue
The above query returns records from mytable where students is not in 2,3 and 4.
The LIKE predicate compares the string specified in the first expression for calculating the string value, which is refered to as a value to test, with the pattern that is defined in the second expression for calculating the string value.
The pattern may contain any combination of wildcards such as −
Underline symbol (_), which can be used instead of any single character in the value to test.
Percent sign (%), which replaces any string of zero or more characters in the value to test.
Syntax
LIKE::= <expression for calculating the string value> [NOT] LIKE <expression for calculating the string value> [ESCAPE <symbol>]
Query
select * from mytable where name like ‘A%';
Result
The above query will generate the following result.
id, name, address, age, mark ------------------------------- 1, Adam, 23 new street, 12, 90 2, Amit, 12 old street, 13, 95
The query returns records from mytable of those students whose names are starting with ‘A’.
Query
select * from mytable where name like ‘_a%';
Result
The above query will generate the following result.
id, name, address, age, mark ——————————————————————————————————————- 4, David, 15 express avenue, 12, 85 6, Ganga, 25 north street, 12, 55 7, Jack, 2 park street, 12, 60 9, Mary, 5 west street, 12, 75
The query returns records from mytable of those students whose names are starting with ‘a’ as the second char.
Let us now understand how to use NULL Value in the search conditions.
Syntax
Predicate IS [NOT] NULL
Query
select name from mytable where name is not null;
Result
The above query will generate the following result.
name ------------------------------- Adam Amit Bob David Esha Ganga Jack Leena Mary Peter (10 rows, 0.076 sec, 163 B selected)
Here, the result is true so it returns all the names from table.
Query
Let us now check the query with NULL condition.
default> select name from mytable where name is null;
Result
The above query will generate the following result.
name ------------------------------- (0 rows, 0.068 sec, 0 B selected)
Explain is used to obtain a query execution plan. It shows a logical and global plan execution of a statement.
explain select * from mytable; explain ------------------------------- => target list: default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) => out schema: { (5) default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) } => in schema: { (5) default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) }
Result
The above query will generate the following result.
The query result shows a logical plan format for the given table. The Logical plan returns the following three results −
explain global select * from mytable; explain ------------------------------- ------------------------------------------------------------------------------- Execution Block Graph (TERMINAL - eb_0000000000000_0000_000002) ------------------------------------------------------------------------------- |-eb_0000000000000_0000_000002 |-eb_0000000000000_0000_000001 ------------------------------------------------------------------------------- Order of Execution ------------------------------------------------------------------------------- 1: eb_0000000000000_0000_000001 2: eb_0000000000000_0000_000002 ------------------------------------------------------------------------------- ======================================================= Block Id: eb_0000000000000_0000_000001 [ROOT] ======================================================= SCAN(0) on default.mytable => target list: default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) => out schema: { (5) default.mytable.id (INT4), default.mytable.name (TEXT),default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) } => in schema: { (5) default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) } ======================================================= Block Id: eb_0000000000000_0000_000002 [TERMINAL] ======================================================= (24 rows, 0.065 sec, 0 B selected)
Result
The above query will generate the following result.
Here, Global plan shows execution block ID, order of execution and its information.
SQL joins are used to combine rows from two or more tables. The following are the different types of SQL Joins −
Consider the following two tables to perform joins operations.
Id | Name | Address | Age |
---|---|---|---|
1 | Customer 1 | 23 Old Street | 21 |
2 | Customer 2 | 12 New Street | 23 |
3 | Customer 3 | 10 Express Avenue | 22 |
4 | Customer 4 | 15 Express Avenue | 22 |
5 | Customer 5 | 20 Garden Street | 33 |
6 | Customer 6 | 21 North Street | 25 |
Id | Order Id | Emp Id |
---|---|---|
1 | 1 | 101 |
2 | 2 | 102 |
3 | 3 | 103 |
4 | 4 | 104 |
5 | 5 | 105 |
Let us now proceed and perform the SQL joins operations on the above two tables.
The Inner join selects all rows from both the tables when there is a match between the columns in both tables.
Syntax
SELECT column_name(s) FROM table1 INNER JOIN table2 ON table1.column_name = table2.column_name;
Query
default> select c.age,c1.empid from customers c inner join customer_order c1 on c.id = c1.id;
Result
The above query will generate the following result.
age, empid ------------------------------- 21, 101 23, 102 22, 103 22, 104 33, 105
The query matches five rows from both the tables. Hence, it returns the matched rows age from the first table.
A left outer join retains all of the rows of the “left” table, regardless of whether there is a row that matches on the “right” table or not.
Query
select c.name,c1.empid from customers c left outer join customer_order c1 on c.id = c1.id;
Result
The above query will generate the following result.
name, empid ------------------------------- customer1, 101 customer2, 102 customer3, 103 customer4, 104 customer5, 105 customer6,
Here, the left outer join returns name column rows from the customers(left) table and empid column matched rows from the customer_order(right) table.
A right outer join retains all of the rows of the “right” table, regardless of whether there is a row that matches on the “left” table.
Query
select c.name,c1.empid from customers c right outer join customer_order c1 on c.id = c1.id;
Result
The above query will generate the following result.
name, empid ------------------------------- customer1, 101 customer2, 102 customer3, 103 customer4, 104 customer5, 105
Here, the Right Outer Join returns the empid rows from the customer_order(right) table and the name column matched rows from customers table.
The Full Outer Join retains all rows from both the left and the right table.
Query
select * from customers c full outer join customer_order c1 on c.id = c1.id;
Result
The above query will generate the following result.
The query returns all the matching and non-matching rows from both the customers and the customer_order tables.
This returns the Cartesian product of the sets of records from the two or more joined tables.
Syntax
SELECT * FROM table1 CROSS JOIN table2;
Query
select orderid,name,address from customers,customer_order;
Result
The above query will generate the following result.
The above query returns the Cartesian product of the table.
A Natural Join does not use any comparison operator. It does not concatenate the way a Cartesian product does. We can perform a Natural Join only if there is at least one common attribute that exists between the two relations.
Syntax
SELECT * FROM table1 NATURAL JOIN table2;
Query
select * from customers natural join customer_order;
Result
The above query will generate the following result.
Here, there is one common column id that exists between two tables. Using that common column, the Natural Join joins both the tables.
The SQL SELF JOIN is used to join a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.
Syntax
SELECT a.column_name, b.column_name... FROM table1 a, table1 b WHERE a.common_filed = b.common_field
Query
default> select c.id,c1.name from customers c, customers c1 where c.id = c1.id;
Result
The above query will generate the following result.
id, name ------------------------------- 1, customer1 2, customer2 3, customer3 4, customer4 5, customer5 6, customer6
The query joins a customer table to itself.
Tajo supports various storage formats. To register storage plugin configuration, you should add the changes to the configuration file “storage-site.json”.
The structure is defined as follows −
{ "storages": { “storage plugin name“: { "handler": "${class name}”, "default-format": “plugin name" } } }
Each storage instance is identified by URI.
Tajo supports PostgreSQL storage handler. It enables user queries to access database objects in PostgreSQL. It is the default storage handler in Tajo so you can easily configure it.
{ "spaces": { "postgre": { "uri": "jdbc:postgresql://hostname:port/database1" "configs": { "mapped_database": “sampledb” "connection_properties": { "user":“tajo", "password": "pwd" } } } } }
Here, “database1” refers to the postgreSQL database which is mapped to the database “sampledb” in Tajo.
Apache Tajo supports HBase integration. This enables us to access HBase tables in Tajo. HBase is a distributed column-oriented database built on top of the Hadoop file system. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System. The following steps are required to configure HBase integration.
Add the following changes to “conf/tajo-env.sh” file.
$ vi conf/tajo-env.sh # HBase home directory. It is opitional but is required mandatorily to use HBase. # export HBASE_HOME = path/to/HBase
After you have included the HBase path, Tajo will set the HBase library file to the classpath.
Create an external table using the following syntax −
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <table_name> [(<column_name> <data_type>, ... )] USING hbase WITH ('table' = '<hbase_table_name>' , 'columns' = ':key,<column_family_name>:<qualifier_name>, ...' , 'hbase.zookeeper.quorum' = '<zookeeper_address>' , 'hbase.zookeeper.property.clientPort' = '<zookeeper_client_port>') [LOCATION 'hbase:zk://<hostname>:<port>/'] ;
To access HBase tables, you must configure the tablespace location.
Here,
Table − Set hbase origin table name. If you want to create an external table, the table must exists on HBase.
Columns − Key refers to the HBase row key. The number of columns entry need to be equal to the number of Tajo table columns.
hbase.zookeeper.quorum − Set zookeeper quorum address.
hbase.zookeeper.property.clientPort − Set zookeeper client port.
Query
CREATE EXTERNAL TABLE students (rowkey text,id int,name text) USING hbase WITH ('table' = 'students', 'columns' = ':key,info:id,content:name') LOCATION 'hbase:zk://<hostname>:<port>/';
Here, the Location path field sets the zookeeper client port id. If you don’t set the port, Tajo will refer the property of hbase-site.xml file.
You can start the HBase interactive shell using the “hbase shell” command as shown in the following query.
Query
/bin/hbase shell
Result
The above query will generate the following result.
hbase(main):001:0>
To query HBase, you should complete the following steps −
Step 1 − Pipe the following commands to the HBase shell to create a “tutorial” table.
Query
hbase(main):001:0> create ‘students’,{NAME => ’info’},{NAME => ’content’} put 'students', ‘row-01', 'content:name', 'Adam' put 'students', ‘row-01', 'info:id', '001' put 'students', ‘row-02', 'content:name', 'Amit' put 'students', ‘row-02', 'info:id', '002' put 'students', ‘row-03', 'content:name', 'Bob' put 'students', ‘row-03', 'info:id', ‘003'
Step 2 − Now, issue the following command in hbase shell to load the data into a table.
main):001:0> cat ../hbase/hbase-students.txt | bin/hbase shell
Step 3 − Now, return to the Tajo shell and execute the following command to view the metadata of the table −
default> \d students; table name: default.students table path: store type: HBASE number of rows: unknown volume: 0 B Options: 'columns' = ':key,info:id,content:name' 'table' = 'students' schema: rowkey TEXT id INT4 name TEXT
Step 4 − To fetch the results from the table, use the following query −
Query
default> select * from students
Result
The above query will fetch the following result −
rowkey, id, name ------------------------------- row-01, 001, Adam row-02, 002, Amit row-03 003, Bob
Tajo supports the HiveCatalogStore to integrate with Apache Hive. This integration allows Tajo to access tables in Apache Hive.
Add the following changes to “conf/tajo-env.sh” file.
$ vi conf/tajo-env.sh export HIVE_HOME = /path/to/hive
After you have included the Hive path, Tajo will set the Hive library file to the classpath.
Add the following changes to the “conf/catalog-site.xml” file.
$ vi conf/catalog-site.xml <property> <name>tajo.catalog.store.class</name> <value>org.apache.tajo.catalog.store.HiveCatalogStore</value> </property>
Once HiveCatalogStore is configured, you can access Hive’s table in Tajo.
Swift is a distributed and consistent object/blob store. Swift offers cloud storage software so that you can store and retrieve lots of data with a simple API. Tajo supports Swift integration.
The following are the prerequisites of Swift Integration −
Add the following changes to the hadoop “core-site.xml” file −
<property> <name>fs.swift.impl</name> <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value> <description>File system implementation for Swift</description> </property> <property> <name>fs.swift.blocksize</name> <value>131072</value> <description>Split size in KB</description> </property>
This will be used for Hadoop to access the Swift objects. After you made all the changes move to the Tajo directory to set Swift environment variable.
Open the Tajo configuration file and add set the environment variable as follows −
$ vi conf/tajo-env.h export TAJO_CLASSPATH = $HADOOP_HOME/share/hadoop/tools/lib/hadoop-openstack-x.x.x.jar
Now, Tajo will be able to query the data using Swift.
Let’s create an external table to access Swift objects in Tajo as follows −
default> create external table swift(num1 int, num2 text, num3 float) using text with ('text.delimiter' = '|') location 'swift://bucket-name/table1';
After the table has been created, you can run the SQL queries.
Apache Tajo provides JDBC interface to connect and execute queries. We can use the same JDBC interface to connect Tajo from our Java based application. Let us now understand how to connect Tajo and execute the commands in our sample Java application using JDBC interface in this section.
Download the JDBC driver by visiting the following link − http://apache.org/dyn/closer.cgi/tajo/tajo-0.11.3/tajo-jdbc-0.11.3.jar.
Now, “tajo-jdbc-0.11.3.jar” file has been downloaded on your machine.
To make use of the JDBC driver in your program, set the class path as follows −
CLASSPATH = path/to/tajo-jdbc-0.11.3.jar:$CLASSPATH
Apache Tajo provides a JDBC driver as a single jar file and it is available @ /path/to/tajo/share/jdbc-dist/tajo-jdbc-0.11.3.jar.
The connection string to connect the Apache Tajo is of the following format −
jdbc:tajo://host/ jdbc:tajo://host/database jdbc:tajo://host:port/ jdbc:tajo://host:port/database
Here,
host − The hostname of the TajoMaster.
port − The port number that server is listening. Default port number is 26002.
database − The database name. The default database name is default.
Let us now understand Java application.
import java.sql.*; import org.apache.tajo.jdbc.TajoDriver; public class TajoJdbcSample { public static void main(String[] args) { Connection connection = null; Statement statement = null; try { Class.forName("org.apache.tajo.jdbc.TajoDriver"); connection = DriverManager.getConnection(“jdbc:tajo://localhost/default"); statement = connection.createStatement(); String sql; sql = "select * from mytable”; // fetch records from mytable. ResultSet resultSet = statement.executeQuery(sql); while(resultSet.next()){ int id = resultSet.getInt("id"); String name = resultSet.getString("name"); System.out.print("ID: " + id + ";\nName: " + name + "\n"); } resultSet.close(); statement.close(); connection.close(); }catch(SQLException sqlException){ sqlException.printStackTrace(); }catch(Exception exception){ exception.printStackTrace(); } } }
The application can be compiled and run using the following commands.
javac -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample.java
java -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample
The above commands will generate the following result −
ID: 1; Name: Adam ID: 2; Name: Amit ID: 3; Name: Bob ID: 4; Name: David ID: 5; Name: Esha ID: 6; Name: Ganga ID: 7; Name: Jack ID: 8; Name: Leena ID: 9; Name: Mary ID: 10; Name: Peter
Apache Tajo supports the custom / user defined functions (UDFs). The custom functions can be created in python.
The custom functions are just plain python functions with decorator “@output_type(<tajo sql datatype>)” as follows −
@ouput_type(“integer”) def sum_py(a, b): return a + b;
The python scripts with UDFs can be registered by adding the below configuration in “tajosite.xml”.
<property> <name>tajo.function.python.code-dir</name> <value>file:///path/to/script1.py,file:///path/to/script2.py</value> </property>
Once the scripts are registered, restart the cluster and the UDFs will be available right in the SQL query as follows −
select sum_py(10, 10) as pyfn;
Apache Tajo supports user defined aggregate functions as well but does not support user defined window functions.