Quick Learning- Apache Pig
Apache Pig is a high level scripting language used in hadoop.
After installation type 'pig' to get grunt prompt
grunt>
Create a Sample Dataset like below in hdfs and copy your dataset to that hdfs location. In the below example I have stored my dataset in /input and the name of dataset is custs. Column headings are not required I am using column headings just to understand.
CustId fname lname age prof
C001 Ram Shah 45 engineer
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid, fname, lname, age, profession);
grunt>dump loadCusts;
=================================================
grunt>store loadcusts into '/input/mylocation';
Data will be available in below location.
$hdfs dfs -ls /input/mylocation
=================================================
limit command:
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararry, lname:chararray, age:long, profession:chararry);
grunt>dump loadCusts;
grunt>record = limit cust 10;
grunt>dump record;
======
generate,distinct command:
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararry, lname:chararray, age:long, profession:chararry);
grunt>foreachCusts = foreach loadCusts generate profession;
grunt>distinctCusts = distinct foreachCusts;
grunt>dump distinctCusts;
=========
filter command:
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararry, lname:chararray, age:long, profession:chararry);
grunt>filterCusts = filter loadCusts by fname == 'Ray';
grunt>foreachCustsFilter = foreach filterCusts generate fname, lname, age;
grunt>dump foreachCustsFilter;
====
Filter More Than One
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararry, lname:chararray, age:long, profession:chararry);
grunt>filterTwoCusts = filter loadCusts by (profession == 'Coach' or profession == 'Judge') and fname == 'Ray';
grunt>foreachTwoCustsFilter = foreach filterTwoCusts generate fname, lname, profession;
grunt>dump foreachTwoCustsFilter;
====
Filter Wild card
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararry, lname:chararray, age:long, profession:chararry);
grunt>filterCusts = filter loadCusts by (fname matches '.*lor.*');
grunt>dump filterCusts;
====
Filter On Numeric value
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararry, lname:chararray, age:long, profession:chararry);
grunt>filterCusts = filter loadCusts by (age>= 35 and agedump filterCusts;
===
order command:
grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararry, lname:chararray, age:long, profession:chararry);
grunt>orderbyCusts = order loadCusts by age, fname desc;
grunt>foreachOrderbyCusts = foreach orderbyCusts generate fname, lname, age;
grunt>dump foreachOrderbyCusts;
========
Join in Pig:
Create two datasets
staff
Staffid,name,joiningdate
s001,Amit Kumar,05/07/1985
s002,Sumit Kar,08/09/1988
s003,Abhinav,01/01/1995
s004,Varsha,05/05/2000
promotion
Staffid,Grade,joiningdate,PromotionDate
s001,Amit Kumar,05/07/1985,10/12/2005
s002,Sumit Kar,08/09/1988,1/10/2007
s004,Varsha,05/05/2000,08/09/2013
//Put the above files to input directory
Inner Join Example
grunt> staff = load '/input/staff' using PigStorage(',') as(staffid:chararray,name:chararray,DOJ:long);
grunt> promoted = load '/input/promotion' using PigStorage (',')as (staffid:chararray,grade:chararray,DOP:long);
grunt> result = join staff by staffid, promoted by staffid;
grunt> dump result;