Apache pig

profileNick122
Chapter5-1.pdf

Exercise 5-1: Question: For this exercise, use the information shown in Figure 5-30. These two files can be found as tab-delimited text formats in the companion website and are named investor.txt and stockprice.txt. The first file provides a list of investors and the number of shares that they have purchased in a given stock. The second file stores the stock prices.

Figure 0-1: Investor and Stock Data Using Pig Latin commands, perform the following operations:

1. Upload the two files to HDFS. 2. Load the files as investors and stock_prices. 3. Display both files to make sure they are loaded correctly. 4. Join the two files (investors and stock_prices) by stock symbol. 5. Display the joined file. 6. Group the joined file by the "lastname" of the investors and display the results. 7. Calculate the total shares (simply the sum of shares among all stocks) and

display the results. 8. Calculate the total dollar amount that each investor has invested (shares per

each stock multiplied by the stock price) and display the results. 9. Filter the top two investors who have invested the most and display the results.

Solution: Upload the two files to HDFS. We create the directories, where the files will be placed.

Load the files as investors and stock_prices. We load the files by invoking ‘pig’ command first.

Display both files to make sure they are loaded correctly.

The results of command are given in the output file. Sample output records:

Join the two files (investors and stock_prices) by stock symbol.

$ hadoop fs -mkdir PG $ hadoop fs -mkdir PG/Exercise1 $ hadoop fs -mkdir PG/Exercise1/input $ hadoop fs -copyFromLocal investor.txt PG/Exercise1/input $ hadoop fs -copyFromLocal stockprices.txt PG/Exercise1/input

$ pig grunt> investors = LOAD 'PG/Exercise1/input/investor.txt' AS(TransactionNumber:int,CustomerFName:chararray,CustomerLName:chararray,StockSymbol:chararray,StockShares:int); grunt> stock_prices = LOAD 'PG/Exercise1/input/stockprices.txt' AS(StockSymbol:chararray,StockPrice:int);

grunt> DUMP investors; grunt> DUMP stock_prices;

(BAC,25)

(CAH,48)

(GE,12)

(MCD,170)

(PFE,40)

(1,Pippa,Dickens,BAC,240)

(2,Gavin,Thomson,CAH,60)

(3,Brian,Johnston,GE,850)

(4,Jessica,Henderson,MCD,200)

(5,Andrea,Arnold,PFE,130)

(6,Vanessa,Robertson,BAC,275)

(7,Megan,Clark,CAH,65)

Display the joined file.

Sample output records:

Group the joined file by the "lastname" of the investors and display the results.

Sample output records:

Calculate the total shares (simply the sum of shares among all stocks) and display the results.

Sample output records:

grunt> investors_stockPrices = JOIN investors BY StockSymbol, stock_prices BY StockSymbol;

grunt> DUMP investors_stockPrices;

(1,Pippa,Dickens,BAC,240,BAC,25)

(6,Vanessa,Robertson,BAC,275,BAC,25)

(15,Andrea,Arnold,BAC,105,BAC,25)

(20,Joe,Chapman,BAC,115,BAC,25)

(25,Lisa,Butler,BAC,125,BAC,25)

(7,Megan,Clark,CAH,65,CAH,48)

(16,Vanessa,Robertson,CAH,250,CAH,48)

(11,Pippa,Dickens,CAH,280,CAH,48)

(26,Amanda,Piper,CAH,35,CAH,48)

grunt> group_by_investor = GROUP investors_stockPrices BY CustomerLName; grunt> DUMP group_by_investor;

(Clark,{(7,Megan,Clark,CAH,65,CAH,48),(24,Megan,Clark,PFE,220,PFE,40),(17,Megan,Clark,GE,75,GE,12)})

(Piper,{(19,Amanda,Piper,PFE,225,PFE,40),(26,Amanda,Piper,CAH,35,CAH,48),(9,Amanda,Piper,MCD,210,MCD,170)})

(Arnold,{(5,Andrea,Arnold,PFE,130,PFE,40),(22,Andrea,Arnold,GE,65,GE,12),(15,Andrea,Arnold,BAC,105,BAC,25)})

(Butler,{(8,Lisa,Butler,GE,800,GE,12),(25,Lisa,Butler,BAC,125,BAC,25),(18,Lisa,Butler,MCD,700,MCD,170)})

(Chapman,{(10,Joe,Chapman,PFE,125,PFE,40),(20,Joe,Chapman,BAC,115,BAC,25),(27,Joe,Chapman,GE,810,GE,12)})

grunt> sum_of_shares = FOREACH group_by_investor GENERATE investors_stockPrices, SUM(investors_stockPrices.StockShares) as Sum; grunt> DUMP sum_of_shares;

Calculate the total dollar amount that each investor has invested (shares per each stock multiplied by the stock price) and display the results. To find total dollar amount that each investor has invested, we perform the following commands.

Sample output records:

Filter the top two investors who have invested the most and display the results. We order the result that we got in question 8 in descending order and get the top 2.

Sample output records:

({(7,Megan,Clark,CAH,65,CAH,48),(24,Megan,Clark,PFE,220,PFE,40),(17,Megan,Clark,GE,75,GE,12)},360)

({(19,Amanda,Piper,PFE,225,PFE,40),(26,Amanda,Piper,CAH,35,CAH,48),(9,Amanda,Piper,MCD,210,MCD,170)},470)

({(5,Andrea,Arnold,PFE,130,PFE,40),(22,Andrea,Arnold,GE,65,GE,12),(15,Andrea,Arnold,BAC,105,BAC,25)},300)

({(8,Lisa,Butler,GE,800,GE,12),(25,Lisa,Butler,BAC,125,BAC,25),(18,Lisa,Butler,MCD,700,MCD,170)},1625)

({(10,Joe,Chapman,PFE,125,PFE,40),(20,Joe,Chapman,BAC,115,BAC,25),(27,Joe,Chapman,GE,810,GE,12)},1050)

({(1,Pippa,Dickens,BAC,240,BAC,25),(11,Pippa,Dickens,CAH,280,CAH,48)},520)

({(12,Gavin,Thomson,GE,35,GE,12),(2,Gavin,Thomson,CAH,60,CAH,48)},95)

({(13,Brian,Johnston,MCD,810,MCD,170),(3,Brian,Johnston,GE,850,GE,12)},1660)

grunt> final_details = FOREACH investors_stockPrices GENERATE investors::CustomerFName, investors::CustomerLName as lastName, investors::StockSymbol, investors::StockShares, stock_prices::StockPrice, (investors::StockShares*stock_prices::StockPrice) as amount; grunt> grouped_data = GROUP final_details BY lastName; grunt> amount = FOREACH grouped_data GENERATE final_details, SUM(final_details.amount) as Total; DUMP amount;

({(Megan,Clark,CAH,65,48,3120),(Megan,Clark,PFE,220,40,8800),(Megan,Clark,GE,75,12,900)},12820)

({(Amanda,Piper,PFE,225,40,9000),(Amanda,Piper,CAH,35,48,1680),(Amanda,Piper,MCD,210,170,35700)},46380)

({(Andrea,Arnold,PFE,130,40,5200),(Andrea,Arnold,GE,65,12,780),(Andrea,Arnold,BAC,105,25,2625)},8605)

({(Lisa,Butler,GE,800,12,9600),(Lisa,Butler,BAC,125,25,3125),(Lisa,Butler,MCD,700,170,119000)},131725)

({(Joe,Chapman,PFE,125,40,5000),(Joe,Chapman,BAC,115,25,2875),(Joe,Chapman,GE,810,12,9720)},17595)

({(Pippa,Dickens,BAC,240,25,6000),(Pippa,Dickens,CAH,280,48,13440)},19440)

({(Gavin,Thomson,GE,35,12,420),(Gavin,Thomson,CAH,60,48,2880)},3300)

grunt> OrderedList = ORDER amount BY Total DESC; grunt> TopInvestors = LIMIT OrderedList 2; grunt> DUMP topInvestors;

({(Vanessa,Robertson,BAC,275,25,6875),(Vanessa,Robertson,CAH,250,48,12000),(Vanessa,Robertson,MCD,875,170,148750)},167625)

({(Brian,Johnston,MCD,810,170,137700),(Brian,Johnston,GE,850,12,10200)},147900)

  • Exercise 5-1: