CSE50 final analysis
SQL Tutorial - CSE 50 This tutorial guides you through the process of setting up a database environment in your web browser and performing essential SQL commands to get hands-on experience initiating and managing databases. It is important to follow the instructions herein carefully for a seamless experience. We will be using Jupyter Lab notebook for setting up the databases and executing SQL commands. PART 1: SETTING UP Step 1: Opening your workspace Go to https://jupyter.org/try You should see the following web page. Click on the Try JupyterLab icon.
You will see the following for a few seconds. DO NOT click on anything as it will automatically redirect you to your workspace.
Once redirected to your workspace, you should view this page:
Click on the + sign on the top left and then choose Python3 as shown here:
You will then see a new tab opened for you:
Step 2: Installing SQL interface for Jupyter To install the SQL interface, type the following command in the cell and click on the play icon to execute it.
!pip install ipython-sql
Once you hit the play button, the notebook will automatically create a new cell for you and after the installation is complete, you will see the output indicating installation has been successful:
Step 3: Importing SQLAlchemy SQLAlchemy is a python-based toolkit for SQL. We need to import SQLAlchemy into our Jupyter notebook to be able to handle database operations. Type in the command
import sqlalchemy in the cell and hit the play/triangle button to execute:
Step 4: Setting up the Engine After performing steps 1,2, and 3, we set up a SQLite database engine here.. You will see how this will get used later.
engine = sqlalchemy.create_engine('sqlite:///new.db')
Step 2: Establishing Connection In this step, we establish a connection to the engine by executing this command:
con = engine.raw_connection() So, in effect, we are storing our connection to the engine in a variable called con. This will be our gateway to the database and performing queries. Once you run this command, you will see the new.db in the left hand-side panel.
Step 3: Creating a Cursor A cursor is what we use to manipulate the database. All commands will be executed through the cursor. To establish your cursor, run the following command:
cursor = con.cursor() Obviously, you need to have established your connection (here connection is in the variable con) before creating your cursor.
Step 4: Simple Example of Running a SQL Command using Cursor So, as mentioned before, cursor is what we use to run our SQL commands through it. For instance, we can type in the code below:
cursor.execute("CREATE TABLE names (name TEXT, last_name TEXT);") Which creates a table in our database called names that has two columns name and last_name both storing data of type TEXT. Note that the underlined text in the parentheses (passed as argument to the execute function) are the actual SQL code for creating a table.
We can then use the cursor again to insert data into our table
cursor.execute("INSERT INTO names VALUES ('John', 'Doe');") This will insert into the table the values of ‘John’ and ‘Doe’ in the name and last_name columns of the names table. IMPORTANT: After creating the table and inserting a row into it, we need to COMMIT the changes made to our database; otherwise we would lose the data. So we run this command to commit the changes: Remember that con here is the name of the variable that stores the connection to the database.
con.commit()
To view/query the data inserted to the table, we run the following command: for row in cursor.execute("SELECT * FROM names;"): print(row) SELECT * FROM names; is the SQL command for displaying all the rows in table names. So the above command is essentially going through all the rows in the query and printing them.
And since we had only one row, we get the corresponding data back from our query. SECTION 2: Large Scale SQL Operations Now that you are familiar with basic SQL operations like CREATE, INSERT, and SELECT, you can see that it would be cumbersome to manually INSERT data into a table or pull large amounts of data from the database. Furthermore, it is usually the case that your data is coming from other sources/pipelines and you need to handle the data using SQL. In this section, we will practice taking advantage of SQL commands on data files with considerably large number of records. Part 1: Importing a CSV file The first step is to upload your CSV into your jupyter lab workspace. To do this, click on the top left up-arrow icon and choose the CSV file to upload it.
Once you do that, you can see the file added to the left-hand side list:
Part 2: Storing the CSV data into a database table Now, we want to store the data contained in the CSV file into a table in our database. There are different ways of automatically importing a CSV file into a database table depending on the specific database type (MySQL, PostgreSQL, SQLite, etc.) But fortunately, since we are using
the Python interface for our database handling, we can easily do this using a method from the pandas package. To do this, first we import the pandas package into our jupyterlab workspace by calling: import pandas as pd
And then we read in the CSV file contents into a pandas dataframe that we call data:
data = pd.read_csv('flights-2.csv')
And then we just use the .to_sql function to automatically move the data into a table in our database:
data.to_sql('flights', con, if_exists='append', index=False)
So we call the function .to_sql on the dataframe that we created earlier from the CSV file (we named it data). The parameters/arguments to this function are as follows:
● The first parameter is the name of the table that we would like to create in the database. So here we want to call the table flights. Remember that this needs to be a string.
● The second parameter is the connection to the database that we had previously established and called it con
● The third argument basically says that if a table of the same name already exists in the database, just append the data to the existing table
● The last parameter indicates whether you want to have the indices of the data as a separate column or not.
Note: It is okay if you see a warning like the one in the screenshot above. They relate to automatic type conversion conventions of different packages. Part 3: Getting Table Information Now that you have imported your data into a table in your database, let’s see if the data is correctly stored in the table. So, as before, we use the cursor as our means of handling the database and run the following command: for row in cursor.execute("SELECT * FROM flights LIMIT 10;"):
print(row)
Note 1: Since we are only looking at the data here, we do not need to commit after running the command. Note 2: The LIMIT keyword in the SQL command here limits the number of returned items in the query to 10 (or whatever number the user specified) Part 4: Table Information To get the list of the table’s column names and the corresponding data types, we can run the following command: for row in cursor.execute("PRAGMA table_info('flights');"): print(row)
As you can see, this prints out the name of each column, for instance, Depart, Origin, Arrival, etc and the corresponding data types such as TEXT and REAL (number). Part 5: (EXTRA) Multi-Line Strings in Python The usual signature of of strings in python is the double quotations like str1 = “this is a string” However, since going forward, we are going to be writing longer SQL commands, we will be introducing multi-line strings that are denoted by triple quotations. Str2 = “””This is a multiline string””” You can now write your SQL commands as a multi-line string and save it in a variable and then just pass that variable into your cursor. For example: Command = “””SELECT * FROM flights LIMIT 10;””” for row in cursor.execute(Command): print(row) Part 6: Selecting specific columns Instead of using SELECT * THAT selects data from all columns, we can choose which columns to select from. command = "SELECT Depart_Time, Origin, Aircraft FROM flights LIMIT 10;" for row in cursor.execute(command):
print(row) As mentioned above, we are saving our SQL query in a variable called command and then just pass that string variable into the cursor for execution. In this SQL query, we are only choosing to select Depart Time, Origin, and Aircraft columns.
Part 7: Ordering We now introduce the ORDER BY functionality in SQL. Let’s assume we want to find the 10 longest flights in our database and report their duration along with their origin and flight number. So we need to choose data from Flight, Origin, and duration columns and order them by their duration. Since we want the longest flights, we need to sort them in descending order and we only need the 10 longest flights so we limit the query by 10. Our SQL query would then look like: q= "SELECT Flight, Origin, duration FROM flights ORDER BY duration DESC LIMIT 10;" for row in cursor.execute(q): print(row)
Notice the ORDER BY keyword that is followed by the column according to which we want to sort our report. After the column name we specify whether we want an ascending (ASC) or descending (DESC) ordering.
Part 7: Conditioning and WHERE clause Now, let’s find the flight numbers, origin and duration of the top 10 operating carriers with the longest duration flights arriving in SFO. Note that operating carriers have codeshare = 0. This is similar to the previous query except here we’re only interested in flights that land in SFO and have codeshare zero. So we need to condition the query on these two columns (Destination and codeshare). So our query would be: q= """SELECT Flight, Origin, duration FROM flights WHERE Destination = 'SFO' and codeshare = 1 ORDER BY duration DESC LIMIT 10;""" for row in cursor.execute(q): print(row) Note that here we are using multi-line strings using the triple quotations.
SECTION 3: JOIN Commands JOINs are an important feature of SQL databases that allow aggregation of multiple tables into a single table/schema to optimize queries and information retrieval. In this section, we will practice JOIN commands in SQL and get familiar with a few other functionalities of SQL databases. Part 1: Import aircrafts.csv Similar to part 1 in Section 2, we import aircrafts.csv into our jupyter lab environment:
Part 2: Create an aircrafts table in the database from the CSV file Again, using the pandas package, we create a table in our database and call it aircrafts that stores the data in the aircrafts.csv file. df = pd.read_csv('aircrafts.csv') df.to_sql('aircrafts', con, if_exists='append', index=False) As in part 2 of Section 2, the first line of code stores the contents of the CSV file in a pandas dataframe and the second line uses that dataframe to create a table in the database.
Part 3: Exploring the Data Again, to familiarize ourselves with the data, let’s take a look at the first few rows of this new table and see what data is stored in it. q= "SELECT * FROM aircrafts LIMIT 10;" for row in cursor.execute(q): print(row)
So here we are storing the SQL query for selecting the first 10 rows in the table in a string variable q and then printing the resulting report. We can also see the column names and data type stored in each column:
q = "PRAGMA table_info('aircrafts');" for row in cursor.execute(q): print(row) We can see that the columns are named Aircraft and Seats that store TEXT and INTEGER values respectively.
Part 4: JOIN Now, if we go back to the flights table, we can see we also have a column named Aircraft that stores TEXT values indicating the type of aircraft corresponding to each flight. Since our aircrafts table has the number of seats for each aircraft type, we can combine the information in both flights and aircrafts table to produce more effective reports; for instance we can compute the capacity of each travel route by summing up the number of seats based on the aircrafts that fly to those destinations. To do this, we need to “join” the two tables and create a new table and then query the new table that has the combined information. Before performing JOIN operations, let’s quickly go over a few key concepts that are essential in understanding the mechanics of SQL JOIN operation. Primary Key: A primary key is a unique identifier for each row (or record) of data. For instance, if we have a database of student information, student ID is good choice of primary key because each student has a unique ID and we cannot find two students that have the same ID number. Last name, however, is not a good choice of a primary key for a table containing student information as two students may share the same last name. In the case of our flights table, we can think of flight number as a choice of primary key. However, we might have multiple instances of the same flight number where we have, for instance, a weekly flight from SFO to JFK. So flight number is not a unique identifier as two flights (at different times) may share the same flight number. The combination of flight number and Depart time, however, is necessarily unique. SQL allows us to define a primary key as the combination of two columns. Step 1) Dropping a table Since we imported the CSV file into our table without specifying a primary key, we need to DROP the table and create the table anew to be able to define the primary key for that table. q = "DROP TABLE IF EXISTS flights;" cursor.execute(q) con.commit() So here, we are dropping the table flights and then committing the changes made to the database.
Step 2) Creating a table with a primary key Knowing the columns of our table and the values they store, we can create the flights table over again, except this time we are specifying a primary key for this table that is the combination of departure time and flight number. q = """CREATE TABLE IF NOT EXISTS flights (Depart_Time TEXT, Origin TEXT, Arrival_Time TEXT, Destination TEXT, Flight TEXT, Aircraft TEXT, Stops TEXT, hours INTEGER, minutes INTEGER, duration REAL, codeshare INTEGER, PRIMARY KEY(Depart_Time, Flight) );""" cursor.execute(q) con.commit() Again, notice that we are using a multi-line string denoted by the triple quotations (“””) and also committing the changes after creating the table. The primary key is defined after specifying the column names and their corresponding types. Since we are defining the primary key as the combination of flight number and departure time, the two columns are grouped together using parentheses to denote a composite primary key.
Step 3) Importing data from CSV Now that we have created our table with the right primary key, we can import the data from the CSV file as before: import pandas as pd df = pd.read_csv('flights-2.csv') df.to_sql('flights', con, if_exists='append', index = False)
And again, to look at the first few rows/records:
Similarly, we re-create the aircrafts table by choosing the aircraft type as its primary key. So first we drop the table:
q = "DROP TABLE IF EXISTS aircrafts;" cursor.execute(q) con.commit() Again, notice that when creating or dropping tables, we need to commit our changes to the database.
Now, let’s create the new aircrafts table by specifying a primary key for it: q = "CREATE TABLE IF NOT EXISTS aircrafts (Aircraft TEXT PRIMARY KEY, Seats INTEGER);" cursor.execute(q) con.commit() Here, because the primary key is not composite and is only one column (in this case, Aircraft column), we just specify that immediately after indicating the type of the column as shown in the code.
Now, we read in the data from the CSV file (aircrafts.csv) and import that into the newly created table.
import pandas as pd df = pd.read_csv('aircrafts.csv') df.to_sql('aircrafts', con, if_exists='append', index = False)
And we can look at the first few rows of the tables:
Step 4) LEFT JOIN If we have two tables: table1(left) and table2 (right), the LEFT JOIN is the intersection of all elements of table 1 with table 2:
This means that the new (joined) table is guaranteed to have all elements of table1 along with those elements from table2 that are shared with table1. To see an example of this, consider the following query on the our database: Generate a report showing the total number of seats for all Airbus 320 aircraft flights arriving in SFO from LAX. Since we need information of ALL flights for this query, we LEFT JOIN the flights table with aircrafts table ON aircraft column that is shared between the two tables.
q = """SELECT SUM(Seats) FROM flights LEFT JOIN aircrafts ON flights.Aircraft = aircrafts.Aircraft WHERE flights.aircraft = '320' AND flights.Origin = 'LAX' AND flights.Destination = 'SFO';""" for row in cursor.execute(q): print(row) Notice that we are summing up the number of seats in this query using the SUM() function.
As another example, let’s ask this question from our database: What is the aircraft that has maximum number of passengers that can arrive in SFO from London (LHR) or Frankfurt (FRA)? Indeed, here we’re looking for all flights from LHR or FRA to SFO and looking for the aircraft that has the largest number of seats. Our query would be: q = """SELECT flights.Aircraft, MAX(Seats) FROM flights LEFT JOIN aircrafts ON flights.Aircraft = aircrafts.Aircraft WHERE flights.Origin = 'LHR' OR flights.Origin = 'FRA' AND flights.Destination = 'SFO';""" for row in cursor.execute(q): print(row)
Notice that we’re using the MAX function here.
INNER JOIN Unlike LEFT JOIN, INNER JOIN only includes those elements that exist in both tables. It is the equivalent of the intersection of two sets.
In this data set, because all aircraft types in the flights table also exist in the aircrafts table, INNER JOIN and LEFT JOIN would yield the same results.
Step 5) LIKE Sometime we’re not sure on the name of the column or want to aggregate data. In this case, we can use the LIKE keyword. For instance: What is the maximum number of passengers that can arrive in SFO on 747 class aircraft (i.e: 744, 747, 74E, 742 aircraft)? Here, again we can use the MAX() function along with the keyword LIKE: q = """SELECT flights.Aircraft, MAX(Seats) FROM flights LEFT JOIN aircrafts ON flights.Aircraft = aircrafts.Aircraft WHERE flights.Aircraft LIKE '%74%' AND flights.Destination = 'SFO';""" for row in cursor.execute(q): print(row) Notice that we want to look for values in the Aircraft column that contain TEXT data type containing ‘74’ so we use the '%74%' after the LIKE keyword.
Step 6) GROUP BY
Sometimes we wish to collapse the data to compute some aggregate function/statistic of our data. For instance, we might be interested to find the number of flights each airplane type flying from London (LHR) to San Francisco (SFO) has. To do this, we need to GROUP the data based on airplanes; meaning that we get all the flights from LHR to SFO and then find the unique aircraft types and count how many times each aircraft type shows up in our data. The SQL query for this report is: q = """SELECT flights.Aircraft, COUNT(Flight) FROM flights LEFT JOIN aircrafts ON flights.Aircraft = aircrafts.Aircraft WHERE flights.Origin = 'LHR' AND flights.Destination = 'SFO' GROUP BY flights.Aircraft;""" for row in cursor.execute(q): print(row) We are using the COUNT function here and calling it on the Flight column. Note that GROUP BY clause must come after the WHERE clause; because first we filter the data based on the conditions (what WHERE clause does) and then compute the aggregate functions like COUNT, etc using the GROUP BY clause.