General data processing and using big data
Practical 3 - Data Acquisition
1
SIT742 Modern Data Science (2018 T1) - Practical Prepared by Guangyan Huang
Practicals 1-2: Python Basic
Practical 3: Data Acquisition √ Practical 4: Data Cleaning and Preparation
Practical 5: Data Integration
Practical 6: Plotting and Visualization
Practical 7: K-means Clustering
Practical 8: Principal Component Analysis
Practical 9: Support Vector Machines
Practical 10: Time Series Basic
Practical 11: Time Series Applications
Reference
[1]. Wes Mckinney, Python for Data Analysis, 2nd Edition, O'Reilly, 2018. (Python 3.6, Anaconda)
[2] Jake VanderPlas, Python Data Science Hand Book, O’Reilly, 2017.
Please download data for Practical 3 at:
https://d2l.deakin.edu.au/d2l/le/content/631152/viewContent/3939713/View
Practical 3 - Data Acquisition
2
Accessing data is a necessary first step for using most of the tools in this unit. Package pandas will be used here, though there are numerous tools in other libraries to help with reading and writing data in various formats. Input and output typically falls into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.
Part 1 Reading and Writing Data in Text Format
Pandas features a number of functions for reading tabular data as a DataFrame object. Table 1 summarizes some of them, though read_csv and read_table are likely the ones you will use the most.
Tab. 1. Parsing functions in pandas.
(1) CSV Data Read csv file Let’s start with reading a small comma-separated (CSV) text file – ex1.csv, since this is comma-delimited, we can use read_csv to read it into a DataFrame:
Practical 3 - Data Acquisition
3
We could also have used read_table and specified the delimiter:
Practical 3 - Data Acquisition
4
A file will not always have a header row – ex2.csv. To read this file, you have a couple of options. You can allow pandas to assign default column names:
Or you can specify names yourself:
Practical 3 - Data Acquisition
5
In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields (see ex3.csv with notepad++). While you could do some munging by hand, the fields here are separated by a variable amount of whitespace. In these cases, you can pass a regular expression as a delimiter for read_table. This can be expressed by the regular expression \s+, so we have then. (This is optional to practice.)
Practical 3 - Data Acquisition
6
The parser functions have many additional arguments to help you handle the wide variety of exception file formats that occur. For example, you can skip the first, third, and fourth rows of a file with skiprows in ex4.csv:
Practical 3 - Data Acquisition
7
Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL (see ex5.csv):
Practical 3 - Data Acquisition
8
Reading Text Files in Pieces When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file. Before we look at a large file, we make the pandas display settings more compact (ex6.csv). (This is optional to practice.)
… … … … … …
Practical 3 - Data Acquisition
9
If you want to only read a small number of rows (avoiding reading the entire file), specify that with nrows. (This is optional to practice.)
To read a file in pieces, specify a chunksize as a number of rows. (This is optional to practice.)
Practical 3 - Data Acquisition
10
The TextParser object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so: (This is optional to practice.)
Practical 3 - Data Acquisition
11
Writing Data to Text Format Data can also be exported to a delimited format. Let’s consider one of the CSV files read before (ex5.csv), using DataFrame’s to_csv method, we can write the data out to a comma-separated file (out.csv):
Practical 3 - Data Acquisition
12
(2) JSON Data. JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. JSON is very nearly valid Python code with the exception of its null value null and some other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data. Module json will be used here, as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads:
Practical 3 - Data Acquisition
13
The pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. The default options for pandas.read_json assume that each object in the JSON array is a row in the table:
Practical 3 - Data Acquisition
14
Part 2 Binary Data Formats
One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python’s built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format. You can read any “pickled” object stored in a file by using the built-in pickle directly, or even more conveniently using pandas.read_pickle:
Practical 3 - Data Acquisition
15
Using HDF5 Format HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages, including Java, Julia, MATLAB, and Python. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file can store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compression modes, enabling data with repeated patterns to be stored more efficiently. HDF5 can be a good choice for working with very large datasets that don’t fit into memory, as you can efficiently read and write small sections of much larger arrays. While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame object. The HDFStore class works like a dict and handles the low-level details: (This is optional to practice.)
… ….
Practical 3 - Data Acquisition
16
Reading Microsoft Excel Files pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or pandas.read_excel function. Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. You may need to install these manually with pip or conda. To use ExcelFile, create an instance by passing a path to an xls or xlsx file:
If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can also simply pass the filename to pandas.read_excel:
Practical 3 - Data Acquisition
17
To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using pandas objects’ to_excel method:
Practical 3 - Data Acquisition
18
Part 3 Interacting with Web APIs
Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package. To find the last 30 GitHub issues for pandas on GitHub, we can make a GET HTTP request using the add-on requests library:
Practical 3 - Data Acquisition
19
Part 4 Interacting with Databases
In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application. Loading data from SQL into a DataFrame is straightforward, and pandas has some functions to simplify the process. As an example, a SQLite database is created using Python’s built-in sqlite3 driver: (This is optional to practice.)
------------------------------------The End of Practical 2---------------------------------------
- Reference
- [1]. Wes Mckinney, Python for Data Analysis, 2nd Edition, O'Reilly, 2018. (Python 3.6, Anaconda)
- Please download data for Practical 3 at:
- https://d2l.deakin.edu.au/d2l/le/content/631152/viewContent/3939713/View
- Part 1 Reading and Writing Data in Text Format
- Part 2 Binary Data Formats
- One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python’s built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format. You can read an...
- Part 3 Interacting with Web APIs
- Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package. To find the last 30 GitHub issues for pand...
- Part 4 Interacting with Databases
- In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is ...