Sql query
3 years ago
5
Week8ETLMemoTemplate1.docx
Candy_part_1_skeleton_for_students.sql
Additional_Clarification_ETL_Candy_v02.docx
- IFSM330Candyv051.docx
- pd2018.csv
- pd2017.csv
- pd2019.csv
Week8ETLMemoTemplate1.docx
Memo
To: Manager
Cc: < insert your instructor’s name here >
From: < insert your name here >
Date: < insert date here >
Re: ETL Data Transformation
Introduction/Background: This memo provides an overview of what was done to the yearly data to input it into the staging table for later usage. <continue this text here in needed>
Document How Each Year was Tranformed:
· 2017: <write up what was done to the provided 2017 data to make it fit in the staging table>
· 2018: <write up what was done to the provided 2018 data to make it fit in the staging table>
· 2019: <write up what was done to the provided 2019 data to make it fit in the staging table>
Conclusion: <Any next steps that need to be done>.
Candy_part_1_skeleton_for_students.sql
--Skeleton Candy Part 1 - extract from 2017, 2018, 2019 tables and put into stagingTable --Students, see notes below where it says --you need to fill this in here --Create the staging table DROP TABLE IF EXISTS stagingTable; CREATE TABLE stagingTable ( yearInt INT(4), monthInt INT(2), State CHAR(2), country CHAR(3), region VARCHAR(25), customer_id(4), Product_Name VARCHAR(25, unitPrice INT(4), quantityDiscount INT(4), quantity INT(6), orderTotal INT(6) ); --Insert 2017 Data INSERT INTO stagingTable("monthInt", "state", "country", "region", "Product_Name", "unitPrice" --you need to fill this in here) SELECT "Month", "State", "Country", "Region", "Product", "Per-Unit_price", "Quantity", "Order_Total" FROM pd2017 ; UPDATE stagingTable SET yearInt=2017; --Insert 2018 Data INSERT INTO stagingTable("monthInt", "region", "customer_id", "Product_Name", "unitPrice","quantity") -- you need to fill this in here --) SELECT "Month", "Region", "Customer_ID", "Product", "Per-unitPrice", "Quantity_1 + Quantity_2" AS quantity FROM pd2018; FROM pd2018 ; UPDATE stagingTable SET yearInt=2018 WHERE yearInt ISNULL; --Insert 2019 Data INSERT INTO stagingTable("monthInt", "country", "region", "state", "Product_Name", "unitPrice", "Order_Total", "quantityDiscount") -- you need to fill this in here --) SELECT "Month", "country", "region", "state", "Product", "Per-unitPrice", "quantity", "per-unit price x quantity" AS "orderSubtotal", "quantityDiscount"FROM pd2019 ; UPDATE stagingTable ( -- you need to fill this in here ); SELECT * FROM stagingTable;
Additional_Clarification_ETL_Candy_v02.docx
Additional clarification regarding ETL Candy Assignment.
The difficult part about this week's work is not necessarily the SQL, it's the Extract-Transform-Load part.
First a few general comments on INSERT INTO and populating a database:
INSERT INTO destinationTable("x","y","z")
SELECT a,b,c FROM sourceTable;
Is different from
INSERT INTO desinationTable ("x","y","z")
VALUES (1,2,3);
The latter defines and inserts the VALUES on the spot. But in both the order is extremely important.
In the former and certainly when conducting ETL, the two tables, source and destination may not have the same column names, but it is essential that in the statement that you carefully choose the destinationTable column names for INSERT INTO and the sourceTable for SELECT so they match. But choosing x,y,z for INSERT INTO and then a,b,c for SELECT is the same as INSERT INTO z,y,x with SELECT c,b,a.
Also, as is the case in this Candy assignment, the data itself may need to be manipulated/multiplied/added/etc in some way to get it into a standard form across years.
For example, the following gets FirstName and LastName directly from sourceTable. There is NO SUCH field as AnnualSalary in sourceTable, however there is a MonthlySalary. So the code then multiplies MonthlySalary by 12 to get AnnualSalary. Here you have to be careful about the commas. Until the first comma, FirstName and LastName are being INSERTed INTO FirstName and LastName respectively, after the second comma, the calculation is being INSERTed INTO AnnualSalary.
After AnnualSalary is calculated from “MonthlySalary”, we see that there are two values for bonus pay, Bonus1 and Bonus2. We can calculate TotalBonusPay by adding the two values together in the SELECT statement.
CREATE TABLE destinationTable (FirstName TEXT, LastName TEXT, AnnualSalary INTEGER, TotalBonusPay INTEGER);
INSERT INTO destinationTable("FirstName","LastName","AnnualSalary","TotalBonusPay")
SELECT "FirstName","LastName","MonthlySalary" * 12, "Bonus1" + "Bonus2" FROM sourceTable;
Now let’s look specifically at our Candy dataset.
If we had 'perfect' data, an INSERT INTO and SELECT statement to populate stagingTable might look something like this, 11 fields in stagingTable from 11 fields in pd2017, all nicely arranged with the same names, same values. However, note they sometimes use different field names:
stagingTable uses Product_Name and unitPrice, while pd2017 uses Product and Per-Unit_Price.
What’s important here is that the field names for INSERT INTO correspond to stagingTable (where we’re inserting the data) and SELECT correspond to pd2017 (or pd2018 or pd2019).
INSERT INTO stagingTable("monthInt", "state", "country", "region", "Product_Name", "unitPrice" --you need to fill this in here)
SELECT "Month", "State", "Country", "Region", "Product", "Per-Unit_price", "Quantity", "Order_Total" FROM pd2017;
HOWEVER, rarely does data comes in a neat format. So you first need to look at the data (you can do this in Excel). Start with 2017 and you'll see the following columns, not 11 fields, only 8:
2017:
2018:
2019:
You should note how the fields/columns changed over the years. Attention: the following list is NOT complete. Important information is given in each year's Data Notes (part c) and conduct your own analysis of the differences in data. This is the heart of the ETL process and you will document this in your one page write up as a part of this assignment.
Now look at the stagingTable structure:
Each year MUST conform to this stagingTable structure so that all three years can be analyzed.
· However, we're not going to 'make up' customer_id where there is none.
· Also don’t worry about adding data to country where it might be listed as ‘null’.
· The most important fields that must be properly populated in stagingTable for proper analysis are quantity and orderTotal.
2017 will be imported relatively straightforward via an INSERT INTO and SELECT command.
2018 and 2019, however, will require you to choose the correct fields (from stagingTable) in the INSERT INTO command and then SELECT fields FROM pd2018 or pd2019 which will sometimes require some basic math to 'transform' the data into a common, usable format.
Again, this ‘Additional Clarification’ does not describe ALL of the differences in data that you will need to address. Properly identifying differences and discrepancies and transforming the data into a usable format is essential to the ETL task.
Finally, you're not making up any data, just transforming what is there. Also note that 'year' is not an element in any of the individual year table, which is why you need an UPDATE command after your INSERT INTO and SELECT command after each year. This for 2018:
UPDATE stagingTable SET yearInt=2018 WHERE yearInt ISNULL;