The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −
This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00 2,Khilan,25,Delhi,1500.00 3,kaushik,23,Kota,2000.00 4,Chaitali,25,Mumbai,6500.00 5,Hardik,27,Bhopal,8500.00 6,Komal,22,MP,4500.00 7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000 100,2009-10-08 00:00:00,3,1500 101,2009-11-20 00:00:00,2,1560 103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
Let us now perform various Join operations on these two relations.
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names). Therefore let us load the contents of the file customers.txt as two tables as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;
It will produce the following output, displaying the contents of the relation customers.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000) (2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500) (3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000) (4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500) (5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500) (6,Komal,22,MP,4500,6,Komal,22,MP,4500) (7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Let us perform inner join operation on the two relations customers and orders as shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
You will get the following output that will the contents of the relation named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Note −
Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −
The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation.
Given below is the syntax of performing left outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Let us perform left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,) (2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060) (5,Hardik,27,Bhopal,8500,,,,) (6,Komal,22,MP,4500,,,,) (7,Muffy,24,Indore,10000,,,,)
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
Given below is the syntax of performing right outer join operation using the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Let us perform right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verify the relation outer_right using the DUMP operator as shown below.
grunt> Dump outer_right
It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
The full outer join operation returns rows when there is a match in one of the relations.
Given below is the syntax of performing full outer join using the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Let us perform full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Verify the relation outer_full using the DUMP operator as shown below.
grun> Dump outer_full;
It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,) (2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060) (5,Hardik,27,Bhopal,8500,,,,) (6,Komal,22,MP,4500,,,,) (7,Muffy,24,Indore,10000,,,,)
We can perform JOIN operation using multiple keys.
Here is how you can perform a JOIN operation on two tables using multiple keys.
grunt> Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);
Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown below.
employee.txt
001,Rajiv,Reddy,21,programmer,003 002,siddarth,Battacharya,22,programmer,003 003,Rajesh,Khanna,22,programmer,003 004,Preethi,Agarwal,21,programmer,003 005,Trupthi,Mohanthy,23,programmer,003 006,Archana,Mishra,23,programmer,003 007,Komal,Nayak,24,teamlead,002 008,Bharathi,Nambiayar,24,manager,001
employee_contact.txt
001,9848022337,Rajiv@gmail.com,Hyderabad,003 002,9848022338,siddarth@gmail.com,Kolkata,003 003,9848022339,Rajesh@gmail.com,Delhi,003 004,9848022330,Preethi@gmail.com,Pune,003 005,9848022336,Trupthi@gmail.com,Bhuwaneshwar,003 006,9848022335,Archana@gmail.com,Chennai,003 007,9848022334,Komal@gmail.com,trivendram,002 008,9848022333,Bharathi@gmail.com,Chennai,001
And we have loaded these two files into Pig with relations employee and employee_contact as shown below.
grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int); grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);
Now, let us join the contents of these two relations using the JOIN operator as shown below.
grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
Verify the relation emp using the DUMP operator as shown below.
grunt> Dump emp;
It will produce the following output, displaying the contents of the relation named emp as shown below.
(1,Rajiv,Reddy,21,programmer,113,1,9848022337,Rajiv@gmail.com,Hyderabad,113) (2,siddarth,Battacharya,22,programmer,113,2,9848022338,siddarth@gmail.com,Kolka ta,113) (3,Rajesh,Khanna,22,programmer,113,3,9848022339,Rajesh@gmail.com,Delhi,113) (4,Preethi,Agarwal,21,programmer,113,4,9848022330,Preethi@gmail.com,Pune,113) (5,Trupthi,Mohanthy,23,programmer,113,5,9848022336,Trupthi@gmail.com,Bhuwaneshw ar,113) (6,Archana,Mishra,23,programmer,113,6,9848022335,Archana@gmail.com,Chennai,113) (7,Komal,Nayak,24,teamlead,112,7,9848022334,Komal@gmail.com,trivendram,112) (8,Bharathi,Nambiayar,24,manager,111,8,9848022333,Bharathi@gmail.com,Chennai,111)