In the previous chapter, you have understood how to create tables in Tajo. This chapter explains about the SQL statement in Tajo.
Before moving to create a table, create a text file “students.csv” in Tajo installation directory path as follows −
students.csv
Id | Name | Address | Age | Marks |
---|---|---|---|---|
1 | Adam | 23 New Street | 21 | 90 |
2 | Amit | 12 Old Street | 13 | 95 |
3 | Bob | 10 Cross Street | 12 | 80 |
4 | David | 15 Express Avenue | 12 | 85 |
5 | Esha | 20 Garden Street | 13 | 50 |
6 | Ganga | 25 North Street | 12 | 55 |
7 | Jack | 2 Park Street | 12 | 60 |
8 | Leena | 24 South Street | 12 | 70 |
9 | Mary | 5 West Street | 12 | 75 |
10 | Peter | 16 Park Avenue | 12 | 95 |
After the file has been created, move to the terminal and start the Tajo server and shell one by one.
Create a new database using the following command −
default> create database sampledb; OK
Connect to the database “sampledb” which is now created.
default> \c sampledb You are now connected to database "sampledb" as user “user1”.
Then, create a table in “sampledb” as follows −
sampledb> create external table mytable(id int,name text,address text,age int,mark int) using text with('text.delimiter' = ',') location ‘file:/Users/workspace/Tajo/students.csv’;
The above query will generate the following result.
OK
Here, the external table is created. Now, you just have to enter the file location. If you have to assign the table from hdfs then use hdfs instead of file.
Next, the “students.csv” file contains comma separated values. The text.delimiter field is assigned with ‘,’.
You have now created “mytable” successfully in “sampledb”.
To show tables in Tajo, use the following query.
sampledb> \d mytable sampledb> \d mytable
The above query will generate the following result.
table name: sampledb.mytable table uri: file:/Users/workspace/Tajo/students.csv store type: TEXT number of rows: unknown volume: 261 B Options: 'timezone' = 'Asia/Kolkata' 'text.null' = '\\N' 'text.delimiter' = ',' schema: id INT4 name TEXT address TEXT age INT4 mark INT4
To fetch all the records in the table, type the following query −
sampledb> select * from mytable;
The above query will generate the following result.
Tajo uses the following syntax to insert records in table.
create table table1 (col1 int8, col2 text, col3 text); --schema should be same for target table schema Insert overwrite into table1 select * from table2; (or) Insert overwrite into LOCATION '/dir/subdir' select * from table;
Tajo’s insert statement is similar to the INSERT INTO SELECT statement of SQL.
Let’s create a table to overwrite table data of an existing table.
sampledb> create table test(sno int,name text,addr text,age int,mark int); OK sampledb> \d
The above query will generate the following result.
mytable test
To insert records in the “test” table, type the following query.
sampledb> insert overwrite into test select * from mytable;
The above query will generate the following result.
Progress: 100%, response time: 0.518 sec
Here, “mytable" records overwrite the “test” table. If you don’t want to create the “test” table, then straight away assign the physical path location as mentioned in an alternative option for insert query.
Use the following query to list out all the records in the “test” table −
sampledb> select * from test;
The above query will generate the following result.
This statement is used to add, remove or modify columns of an existing table.
To rename the table use the following syntax −
Alter table table1 RENAME TO table2;
sampledb> alter table test rename to students;
The above query will generate the following result.
OK
To check the changed table name, use the following query.
sampledb> \d mytable students
Now the table “test” is changed to “students” table.
To insert new column in the “students” table, type the following syntax −
Alter table <table_name> ADD COLUMN <column_name> <data_type>
sampledb> alter table students add column grade text;
The above query will generate the following result.
OK
This property is used to change the table’s property.
sampledb> ALTER TABLE students SET PROPERTY 'compression.type' = 'RECORD', 'compression.codec' = 'org.apache.hadoop.io.compress.Snappy Codec' ; OK
Here, compression type and codec properties are assigned.
To change the text delimiter property, use the following −
ALTER TABLE students SET PROPERTY ‘text.delimiter'=','; OK
The above query will generate the following result.
sampledb> \d students table name: sampledb.students table uri: file:/tmp/tajo-user1/warehouse/sampledb/students store type: TEXT number of rows: 10 volume: 228 B Options: 'compression.type' = 'RECORD' 'timezone' = 'Asia/Kolkata' 'text.null' = '\\N' 'compression.codec' = 'org.apache.hadoop.io.compress.SnappyCodec' 'text.delimiter' = ',' schema: id INT4 name TEXT addr TEXT age INT4 mark INT4 grade TEXT
The above result shows that the table’s properties are changed using the “SET” property.
The SELECT statement is used to select data from a database.
The syntax for the Select statement is as follows −
SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...] [FROM <table reference> [[AS] <table alias name>] [, ...]] [WHERE <condition>] [GROUP BY <expression> [, ...]] [HAVING <condition>] [ORDER BY <expression> [ASC|DESC] [NULLS (FIRST|LAST)] [, …]]
The Where clause is used to filter records from the table.
sampledb> select * from mytable where id > 5;
The above query will generate the following result.
The query returns the records of those students whose id is greater than 5.
sampledb> select * from mytable where name = ‘Peter’;
The above query will generate the following result.
Progress: 100%, response time: 0.117 sec id, name, address, age ------------------------------- 10, Peter, 16 park avenue , 12
The result filters Peter’s records only.
A table column may contain duplicate values. The DISTINCT keyword can be used to return only distinct (different) values.
SELECT DISTINCT column1,column2 FROM table_name;
sampledb> select distinct age from mytable;
The above query will generate the following result.
Progress: 100%, response time: 0.216 sec age ------------------------------- 13 12
The query returns the distinct age of students from mytable.
The GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups.
SELECT column1, column2 FROM table_name WHERE [ conditions ] GROUP BY column1, column2;
select age,sum(mark) as sumofmarks from mytable group by age;
The above query will generate the following result.
age, sumofmarks ------------------------------- 13, 145 12, 610
Here, the “mytable" column has two types of ages — 12 and 13. Now the query groups the records by age and produces the sum of marks for the corresponding ages of students.
The HAVING clause enables you to specify conditions that filter which group results appear in the final results. The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on the groups created by the GROUP BY clause.
SELECT column1, column2 FROM table1 GROUP BY column HAVING [ conditions ]
sampledb> select age from mytable group by age having sum(mark) > 200;
The above query will generate the following result.
age ------------------------------- 12
The query groups the records by age and returns the age when the condition result sum(mark) > 200.
The ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. The Tajo database sorts query results in ascending order by default.
SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN] [ASC | DESC];
sampledb> select * from mytable where mark > 60 order by name desc;
The above query will generate the following result.
The query returns the names of those students in descending order whose marks are greater than 60.
The CREATE INDEX statement is used to create indexes in tables. Index is used for fast retrieval of data. Current version supports index for only plain TEXT formats stored on HDFS.
CREATE INDEX [ name ] ON table_name ( { column_name | ( expression ) }
create index student_index on mytable(id);
The above query will generate the following result.
id ———————————————
To view assigned index for the column, type the following query.
default> \d mytable table name: default.mytable table uri: file:/Users/deiva/workspace/Tajo/students.csv store type: TEXT number of rows: unknown volume: 307 B Options: 'timezone' = 'Asia/Kolkata' 'text.null' = '\\N' 'text.delimiter' = ',' schema: id INT4 name TEXT address TEXT age INT4 mark INT4 Indexes: "student_index" TWO_LEVEL_BIN_TREE (id ASC NULLS LAST )
Here, TWO_LEVEL_BIN_TREE method is used by default in Tajo.
The Drop Table Statement is used to drop a table from the database.
drop table table name;
sampledb> drop table mytable;
To check if the table has been dropped from the table, type the following query.
sampledb> \d mytable;
The above query will generate the following result.
ERROR: relation 'mytable' does not exist
You can also check the query using “\d” command to list out the available Tajo tables.