Computer Science- Python Python Homework Assignment using Jupyter Notebook

profileRocket Coffey
In-classexample.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic 3: Exploring data relations and SEC filing\n", "\n", "<span style=\"color:darkblue;text-align:center; background-color:lightgreen; opacity: 0.3\"> Date modified: Jan, 2024; Code code: AF5123, MAFA </span>\n", "\n", "\n", "### 1. Exploring data relations (basics)\n", "\n", "* In this notebook, we will further learn to use Seaborn library to explore the relation between two factors (i.e.,variables, or columns) \n", "* Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.\n", "* To have a complete understanding of Seaborn, please refer to the following link [Seaborn](https://seaborn.pydata.org/tutorial.html)\n", "* Below is an example that relates to our course topic:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "The objective of this code to plot the relation beween sales and SG&A for P&G inc.\n", "'''\n", "\n", "# import libraries \n", "import datetime as dt\n", "import sqlite3\n", "import os\n", "#please change D:\\Temp to your working directory\n", "os.chdir(r\"D:\\Temp\") \n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns #import seaborn for plotting data\n", "%matplotlib inline \n", "print(\"The most recent time to run this code: {}\".format(dt.datetime.now()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We have provided a data set ( *`sales_022024.db`* -- **table name**: `PG_MSFT_30yr`) that includes:\n", " * **sale**: sales \n", " * **xsga**: SG&A\n", " * **cogs**: COGS \n", "information since 1990 for P&G and Microsoft. \n", "- Note: Please copy the data file (`sales_022024.db`) to your working folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conn=sqlite3.connect(r\"D:\\Temp\\sales_022024.db\") ##establish the connection with sqlite 3 database\n", "firms='0000080424' #P&G's cik code\n", "sales=pd.read_sql('''select cik, fyear, datadate, sale, xsga, cogs \n", " from PG_MSFT_30yr\n", " where cik=?\n", " ''', con=conn, params=(firms,)) #read sales, SG&A and COGS for P&G over the past 20 years\n", "\n", "conn.close()\n", "## We still can plot scatter points without sns\n", "def plot_scatter(sales, cols, col_x = 'sale'):\n", " for col in cols:\n", " fig = plt.figure(figsize=(7,6)) # define plot area\n", " ax = fig.gca() # define axis \n", " sales.plot.scatter(x = col_x, y = col, color='red', ax = ax)\n", " ax.set_title('Scatter plot of ' + col + ' vs. ' + col_x) # Give the plot a main title\n", " ax.set_ylabel(col) # Set text for the x axis\n", " ax.set_xlabel(col_x)# Set text for y axis\n", " plt.show()\n", "\n", "num_cols = ['xsga']\n", "plot_scatter(sales, num_cols)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.set() #default seaborn setting\n", "#make a scatter plot with sns and indicate the pandemic period\n", "sales['COVID-19']=np.where((sales['fyear']>=2020) & (sales['fyear']<=2023), \"YES\", \"NO\")\n", "sns.scatterplot(x='sale', y='xsga', data=sales, hue='COVID-19')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##we use seaborn so that we can fit the data points in the linear regression model\n", "sns.regplot(x=\"sale\", y=\"xsga\", data=sales) #linear model - plotting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* <span style=\"color:darkorange;text-align:center;\">Your turn</span>: please use *Microsoft* data since 1990 and provide scatter plots for the relations as follows: (1) the one between COGS and sales; \n", "(2) the other between SG&A and sales; " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* You also can easily plot two companies on one figure" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conn=sqlite3.connect(r\"D:\\Temp\\sales_022024.db\") ##establish the connection with sqlite 3 database\n", "sales=pd.read_sql('''select conm, fyear, datadate, sale, xsga, cogs \n", " from PG_MSFT_30yr\n", " ''', con=conn) #read sales, SG&A and COGS for P&G over the past 20 years\n", "\n", "conn.close()\n", "sns.scatterplot(x='sale', y='xsga', data=sales, hue='conm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. SEC EDGAR - crawling data files\n", "* SEC EDGAR provides an easy access to all the disclosures of the public companies in the US. Please See [SEC EDGAR Info.](https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm) \n", "You can use crawler files to download all the disclosures. However, please keep in mind that you only download the data you need. \n", "* Let us download one crawler file and explore its structure \n", "* Let us write a code to download information for 2023 (four quarters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "The objective of this code is to download crawling information for 10-K files of SP500\n", "'''\n", "## import libs\n", "import datetime\n", "import os\n", "os.chdir(r\"D:\\Temp\")\n", "import requests\n", "import pandas as pd\n", "import sqlite3\n", "headers = {\n", " \"User-Agent\": \"@@USE your own e-mail@@\",\n", " \"Accept-Encoding\":\"gzip, deflate\", \n", " \"Host\": \"www.sec.gov\" \n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##Step 1: find out the list of SP500 (CIK)\n", "conn1=sqlite3.connect(r\"D:\\Temp\\sp500_mafa2024.db\") #we create the SP500 list in Topic 2\n", "sp_list=pd.read_sql('''\n", " select distinct cik \n", " from sp_list;\n", " ''', con=conn1)#SP 500 cik list\n", "cik_list=list(sp_list['cik'].astype(int)) #convert the framework to a integer list\n", "\n", "##Step 2: set up the period and prepare the weblink list\n", "start_year=2023 \n", "end_year=2023\n", "years = list(range(start_year, end_year+1)) #create a list of years for downloading\n", "quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4'] #Quarter list\n", "history=[] #create a list of tuples for years and quarters\n", "for y in years:\n", " for q in quarters:\n", " history.append((y,q))\n", "'''\n", "An efficient way to create a list is:\n", "history = [(y, q) for y in years for q in quarters]\n", "'''\n", "urls=[] #create a list of links \n", "for x in history:\n", " urls.append('https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]))\n", "'''\n", "An efficient way to create this list is:\n", "urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]\n", "'''\n", "# a standardized way to do the string formatting is:\n", "#urls = ['https://www.sec.gov/Archives/edgar/full-index/{year:d}/{quarter}/crawler.idx'\\\n", "#.format(year=x[0], quarter=x[1]) for x in history]\n", "urls.sort()\n", "\n", "## Step 3: start to download the crawler files with filtering\n", "cur=conn1.cursor()#create a cursor object so that we can run SQL statements\n", "cur.execute('DROP TABLE IF EXISTS sp10K') #drop idx table if this table exists\n", "cur.execute('CREATE TABLE sp10K (conm TEXT, type TEXT, cik INTEGER, date TEXT, path TEXT)') #create a table\n", "\n", "for url in urls: #loop over the list of crawler files\n", " lines = requests.get(url, headers=headers).content.decode(\"utf-8\", errors='ignore').splitlines() \n", " nameloc = lines[7].find('Company Name')#locate the position of Company name\n", " typeloc = lines[7].find('Form Type') #locate the position of File type \n", " cikloc = lines[7].find('CIK') # Locate the positin of CIK\n", " dateloc = lines[7].find('Date Filed') # Locate the positin of Date Filed\n", " urlloc = lines[7].find('URL') # Locate the position of URL\n", " records=[]\n", " for line in lines[9:]:\n", " records.append(tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(),\n", " line[dateloc:urlloc].strip(), line[urlloc:].strip()]))\n", " '''\n", " The efficient way to create a list is as follows:\n", " records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(),\n", " line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]\n", " '''\n", " for record in records:\n", " if record[1]==\"10-K\" and int(record[2]) in cik_list:\n", " cur.execute('INSERT INTO sp10K VALUES (?, ?, ?, ?, ?)', record)\n", " print(url, 'downloaded and wrote to SQLite')\n", "\n", "conn1.commit()\n", "conn1.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* With the URL list in the data table, we can download all the 10 K files for all the SP500 companies that filed annual reports in 2023\n", "* Note: you need to review what we have done in Topic 1 (computing work)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }