Mukund Kumar Mishra

Quick Learning- Apache Pig

Apache Pig is a high-level scripting platform used with Hadoop; its scripts are written in the Pig Latin language.
After installation, type 'pig' to get the grunt prompt:

grunt>

Create a sample dataset like the one below and copy it to an HDFS location. In this example the dataset is stored in /input and is named custs. The fields are separated by commas because the load statements use PigStorage(','). The column headings are shown only to explain the fields; do not include them in the file.

CustId,fname,lname,age,prof
C001,Ram,Shah,45,engineer

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid, fname, lname, age, profession);
grunt>dump loadCusts;
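
The load above gives no types in the as clause, so Pig treats every field as bytearray. A minimal sketch of checking this with describe, assuming the same /input/custs file; the alias typedCusts is only illustrative and is not used elsewhere in this guide:

```pig
-- Load without types: every field defaults to bytearray.
loadCusts = load '/input/custs' using PigStorage(',') as (custid, fname, lname, age, profession);
describe loadCusts;

-- Load with explicit types (typedCusts is a hypothetical alias).
typedCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
describe typedCusts;
```

describe only prints the schema of an alias; it does not launch a job, so it is a quick way to check field names and types before running dump.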
=================================================

grunt>store loadCusts into '/input/mylocation';

The data will be available in the location below.
$ hdfs dfs -ls /input/mylocation
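
A store without a using clause writes its output with PigStorage's default delimiter, which is a tab. The sketch below shows one way to keep the output comma-separated instead; the output path /input/mylocation_csv is a made-up name for illustration:

```pig
loadCusts = load '/input/custs' using PigStorage(',') as (custid, fname, lname, age, profession);

-- Write comma-separated output instead of the default tab-separated output.
-- /input/mylocation_csv is a hypothetical path; the directory must not
-- already exist, otherwise the store fails.
store loadCusts into '/input/mylocation_csv' using PigStorage(',');
```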
=================================================

 

limit command:

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
grunt>dump loadCusts;
grunt>record = limit loadCusts 10;
grunt>dump record;
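
limit on its own does not promise which 10 records come back. If a specific 10 are wanted, a common pattern is to order first and then limit. A small sketch, assuming the same custs dataset; the aliases orderedCusts and oldestCusts are illustrative:

```pig
loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);

-- Sort by age, oldest first, then keep the first 10 tuples of the sorted relation.
orderedCusts = order loadCusts by age desc;
oldestCusts = limit orderedCusts 10;
dump oldestCusts;
```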
======

 

generate,distinct command:

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
grunt>foreachCusts = foreach loadCusts generate profession;
grunt>distinctCusts = distinct foreachCusts;
grunt>dump distinctCusts;
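
distinct lists each profession once. To also see how many customers have each profession, group and COUNT can be combined as in the sketch below; the aliases groupedCusts and countCusts and the field name total are illustrative:

```pig
loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);

-- group produces one tuple per profession: (group, {bag of matching loadCusts tuples}).
groupedCusts = group loadCusts by profession;

-- COUNT counts the tuples in each bag.
countCusts = foreach groupedCusts generate group as profession, COUNT(loadCusts) as total;
dump countCusts;
```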
=========

 

filter command:

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
grunt>filterCusts = filter loadCusts by fname == 'Ray';
grunt>foreachCustsFilter = foreach filterCusts generate fname, lname, age;
grunt>dump foreachCustsFilter;

====

 

Filter on more than one condition:

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
grunt>filterTwoCusts = filter loadCusts by (profession == 'Coach' or profession == 'Judge') and fname == 'Ray';
grunt>foreachTwoCustsFilter = foreach filterTwoCusts generate fname, lname, profession;

grunt>dump foreachTwoCustsFilter;
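
Conditions can also be negated with not, and empty fields can be tested with is null / is not null. A short sketch of the opposite selection (everyone who is neither a Coach nor a Judge), assuming the same dataset; the alias filterOthers is illustrative:

```pig
loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);

-- Keep rows that have a profession and whose profession is neither Coach nor Judge.
filterOthers = filter loadCusts by profession is not null and not (profession == 'Coach' or profession == 'Judge');
dump filterOthers;
```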

====

 

Filter with a wildcard:

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
grunt>filterCusts = filter loadCusts by (fname matches '.*lor.*');
grunt>dump filterCusts;
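
matches takes a Java regular expression and the pattern has to match the whole field, which is why the example wraps lor in .* on both sides. Two more patterns as a sketch; the aliases startsWithRa and anyCaseLor are illustrative:

```pig
loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);

-- First names that begin with "Ra" (no leading .* because the match starts at the first character).
startsWithRa = filter loadCusts by fname matches 'Ra.*';
dump startsWithRa;

-- First names that contain "lor" in any letter case; (?i) is the Java regex case-insensitive flag.
anyCaseLor = filter loadCusts by fname matches '(?i).*lor.*';
dump anyCaseLor;
```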

====

 

Filter on a numeric value:

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
grunt>filterCusts = filter loadCusts by (age >= 35 and age < ...);
grunt>dump filterCusts;

===

 

order command:

grunt>loadCusts = load '/input/custs' using PigStorage(',') as (custid:chararray, fname:chararray, lname:chararray, age:long, profession:chararray);
grunt>orderbyCusts = order loadCusts by age, fname desc;
grunt>foreachOrderbyCusts = foreach orderbyCusts generate fname, lname, age;
grunt>dump foreachOrderbyCusts;
========

Join in Pig:

 

Create two datasets

 

staff

Staffid,name,joiningdate
s001,Amit Kumar,05/07/1985
s002,Sumit Kar,08/09/1988
s003,Abhinav,01/01/1995
s004,Varsha,05/05/2000

 

promotion

Staffid,name,joiningdate,PromotionDate
s001,Amit Kumar,05/07/1985,10/12/2005
s002,Sumit Kar,08/09/1988,1/10/2007
s004,Varsha,05/05/2000,08/09/2013

//Put the above files (without the heading rows) in the /input directory in HDFS

 

Inner Join Example

grunt> staff = load '/input/staff' using PigStorage(',') as (staffid:chararray, name:chararray, DOJ:chararray);

grunt> promoted = load '/input/promotion' using PigStorage(',') as (staffid:chararray, name:chararray, DOJ:chararray, DOP:chararray);

grunt> result = join staff by staffid, promoted by staffid;

grunt> dump result;
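
The inner join returns only staff who appear in both files, so s003 is left out. A left outer join keeps every staff row and leaves the promotion fields empty where there is no match. A minimal sketch, assuming the two files above are stored without heading rows and that the dates are loaded as chararray (values such as 05/07/1985 cannot be read as long); the aliases leftJoined and report are illustrative:

```pig
staff = load '/input/staff' using PigStorage(',') as (staffid:chararray, name:chararray, DOJ:chararray);
promoted = load '/input/promotion' using PigStorage(',') as (staffid:chararray, name:chararray, DOJ:chararray, DOP:chararray);

-- Keep every staff row; promotion fields are null when there is no matching staffid.
leftJoined = join staff by staffid left outer, promoted by staffid;

-- After a join, fields that exist in both inputs are referred to as alias::field.
report = foreach leftJoined generate staff::staffid as staffid, staff::name as name, promoted::DOP as promotiondate;
dump report;
```

With the sample data, s003 should appear in the output with an empty promotion date.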
