General data processing and using big data
Practical 1 - Python Basic
SIT742 Modern Data Science (2018 T1) - Practical
Prepared by Guangyan Huang
Practical 1: Python Basic √
Practical 2: Data Acquisition
Practical 3: Data Cleaning and Preparation
Practical 4: Data Integration
Practical 5: Plotting and Visualization
Practical 6: Demo Display
Practical 7: K-means Clustering
Practical 8: Principal Component Analysis
Practical 9: Support Vector Machines
Practical 10: Time Series Basic
Practical 11: Time Series Applications
References
[1] Wes McKinney, Python for Data Analysis, 2nd Edition, O'Reilly, 2018. (Python 3.6, Anaconda)
[2] Jake VanderPlas, Python Data Science Handbook, O'Reilly, 2017.
Part I: Set Up Anaconda Python Environment
The first step in setting up your environment with the required Anaconda distribution is
downloading the required installation package from https://www.anaconda.com/download/.
(1) Download the Anaconda3-4.2.0-Windows-x86_64 package (the one with Python 3.5) from https://repo.continuum.io/archive/. A screenshot of the target page is shown in Figure 1.
Fig. 1. Downloading "Anaconda3-4.2.0-Windows-x86_64.exe".
(2) Installing the downloaded file is as simple as double-clicking the file and letting the installer take care of the entire process. To check if the installation was successful, just open a command prompt or terminal and start up Python. You should be
greeted with the message shown in Figure 2 identifying the Python and the Anaconda version.
We also recommend that you use the IPython shell (the command is ipython) instead of the
regular Python shell, because you get a lot of features including inline plots, autocomplete,
and so on.
Fig. 2. Verifying installation with the Python shell.
To exit the shell, press Ctrl-D (on Linux or macOS), press Ctrl-Z followed by Enter (on Windows), or type the
command exit() and press Enter.
Test it as shown in the following screen:
This should complete the process of setting up your Python environment for Data Science and
Machine Learning.
(3) Set up Jupyter notebooks
Jupyter notebooks, formerly known as IPython notebooks, are an interactive computational environment that can be used to develop Python-based data science analyses,
with an emphasis on reproducible research. No additional installation is required for
Jupyter notebooks, as they are already installed by the Anaconda distribution. We can start the
Jupyter notebook by executing the following command at the command prompt or terminal.
Please make sure to run it from the Anaconda folder.
C:\Anaconda3all> jupyter notebook
This will start a notebook server at the address localhost:8888 of your machine. An important
point to note here is that you access the notebook using a browser so you can even initiate it
on a remote server and use it locally using techniques like ssh tunneling. This feature is
extremely useful in case you have a powerful computing resource that you can only access
remotely but lack a GUI for it. Jupyter notebook allows you to access those resources in a
visually interactive shell. Once you invoke this command, you can navigate to the address
localhost:8888 in your browser, to find the landing page depicted in Figure 3, which can be
used to access existing notebooks or create new ones.
Fig. 3. Jupyter notebook landing page.
Part II: IPython Basic
(1) Tab Completion
On the surface, the IPython shell looks like a cosmetically different version of the standard
terminal Python interpreter (invoked with python). One of the major improvements over the
standard Python shell is tab completion, found in many IDEs or other interactive computing
analysis environments. While entering expressions in the shell, pressing the Tab key will
search the namespace for any variables (objects, functions, etc.) matching the characters you
have typed so far:
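Tab completion is interactive, so it cannot be reproduced directly in a script. As a rough illustration only, the sketch below filters a small namespace by prefix, which is approximately what pressing Tab after typing an would list in IPython (IPython would also offer keywords and built-ins such as and and any). The variable names an_apple and an_example are hypothetical.

```python
# Hypothetical variables, defined only for this illustration.
an_apple = 27
an_example = 42

# In IPython you would type `an` and press Tab.
# A rough, non-interactive analog: list the user-defined names starting with "an".
namespace = {"an_apple": an_apple, "an_example": an_example, "other": 0}
matches = sorted(name for name in namespace if name.startswith("an"))
print(matches)  # ['an_apple', 'an_example']
```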
(2) Introspection
Using a question mark (?) before or after a variable will display some general information
about the object:
This is referred to as object introspection. If the object is a function or instance method, the
docstring, if defined, will also be shown. Suppose we’d written the following function (which
you can reproduce in IPython or Jupyter):
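The original screenshot is not available, so the function below is an assumed stand-in: a small function with a docstring, which is all that is needed to try ? and ?? on it.

```python
def add_numbers(a, b):
    """
    Add two numbers together.

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b


# Outside IPython, the same docstring is available through the __doc__ attribute.
print(add_numbers.__doc__)
```

In IPython or Jupyter, typing add_numbers? after running this cell displays the signature and the docstring.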
Then using ? shows us the docstring:
Using ?? will also show the function’s source code if possible:
? has a final usage, which is for searching the IPython namespace in a manner similar to the
standard Unix or Windows command line. A number of characters combined with the
wildcard (*) will show all names matching the wildcard expression. For example, we could
get a list of all functions in the top-level NumPy namespace containing load:
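In IPython the search itself would be typed as np.*load*? (assuming NumPy has been imported as np). As a plain-Python sketch of the same idea, the block below applies a wildcard pattern to the names of a module using the standard fnmatch module. The json module is used here only as a stand-in so the example runs without NumPy.

```python
import fnmatch
import json  # stand-in module; in IPython you would search np instead

# Rough analog of the IPython query `json.*load*?`:
# keep every attribute name of the module that matches the wildcard pattern.
matches = fnmatch.filter(dir(json), "*load*")
print(matches)
```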
(3) The %run Command
You can run any file as a Python program inside the environment of your IPython session
using the %run command. Suppose you had the following simple script stored in
c:\Anaconda3all\test.py:
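The script from the original screenshot is not available. The following is an assumed example of what such a simple test.py could contain: one function, a few variables, and a result.

```python
# Assumed contents of test.py (for illustration only).
def f(x, y, z):
    return (x + y) / z


a = 5
b = 6
c = 7.5

result = f(a, b, c)
print(result)
```

Running %run test.py in IPython executes the file, after which the names it defines (f, a, b, c, and result) are available in the interactive session.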
In Jupyter notebook:
(4) The %load command:
(5) About Magic Commands
IPython’s special commands (which are not built into Python itself) are known as “magic”
commands. These are designed to facilitate common tasks and enable you to easily control the
behavior of the IPython system. A magic command is any command prefixed by the percent
symbol %. For example, you can check the execution time of any Python statement, such as a
matrix multiplication, using the %timeit magic function (which will be discussed in more
detail later):
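%timeit only works inside IPython or Jupyter. As a sketch of the same measurement in plain Python, the block below uses the standard timeit module on a small pure-Python matrix multiplication. The helper matmul and the 2x2 matrix are hypothetical and chosen only to keep the example self-contained.

```python
import timeit


def matmul(x, y):
    """Multiply two matrices given as lists of rows (pure Python)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


m = [[1.0, 2.0], [3.0, 4.0]]
product = matmul(m, m)
print(product)  # [[7.0, 10.0], [15.0, 22.0]]

# Rough analog of `%timeit matmul(m, m)`: total seconds for 1000 calls.
elapsed = timeit.timeit(lambda: matmul(m, m), number=1000)
print(f"{elapsed / 1000:.2e} seconds per call")
```

In IPython the equivalent one-liner would be %timeit matmul(m, m), which chooses the number of repetitions automatically.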
Useful Keyboard Shortcut:
Useful Command:
(6) Matplotlib Integration
One reason for IPython’s popularity in analytical computing is that it integrates well with data
visualization and other user interface libraries like matplotlib. The %matplotlib magic
function configures its integration with the IPython shell or Jupyter notebook. This is
important, as otherwise plots you create will either not appear (notebook) or take control of
the session until closed (shell). In the IPython shell, running %matplotlib sets up the
integration so you can create multiple plot windows without interfering with the console
session:
Part III: Python Language Basics
Now let's go through the essential Python programming concepts and language mechanics.
(1) Language Semantics
The Python language design is distinguished by its emphasis on readability, simplicity,
and explicitness. Some people go so far as to liken it to “executable pseudocode.”
Indentation, not braces
Python uses whitespace (tabs or spaces) to structure code instead of using braces as in many
other languages like R, C++, Java, and Perl. We will introduce this later.
Variables and argument passing
When assigning a variable (or name) in Python, you are creating a reference to the object on
the righthand side of the equals sign. In practical terms, consider a list of integers:
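A minimal sketch of this behavior, with an assumed three-element list: assigning a to b does not copy the list, so a change made through one name is visible through the other.

```python
a = [1, 2, 3]
b = a          # b now refers to the same list object as a; no copy is made
a.append(4)
print(b)       # [1, 2, 3, 4]
```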
When you pass objects as arguments to a function, new local variables are created referencing
the original objects without any copying. If you bind a new object to a variable inside a
function, that change will not be reflected in the parent scope. It is, however, possible to alter
the internals of a mutable argument. Suppose we had the following function:
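The function from the original screenshot is not available. As an assumed example, append_element below mutates the list it receives, and the caller sees the change.

```python
def append_element(some_list, element):
    # Mutates the list object passed in by the caller.
    some_list.append(element)


data = [1, 2, 3]
append_element(data, 4)
print(data)  # [1, 2, 3, 4]
```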
Dynamic references, strong types
In contrast with many compiled languages, such as Java and C++, object references in Python
have no type associated with them. There is no problem with the following:
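As a sketch with assumed values, the same name can be rebound to objects of different types, because the type information is stored in the object rather than in the name. The second half shows that Python still refuses to mix incompatible types silently.

```python
a = 5
first_type = type(a)     # int
a = "foo"
second_type = type(a)    # str
print(first_type, second_type)

# Adding a string and an integer is not converted implicitly; it raises TypeError.
try:
    "5" + 5
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```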
In this regard Python is considered a strongly typed language, which means that every object
has a specific type (or class), and implicit conversions will occur only in certain obvious
circumstances, such as the following:
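For instance, dividing a float by an int converts the int implicitly. The values below are assumed for illustration.

```python
a = 4.5
b = 2
print(f"a is {type(a)}, b is {type(b)}")
result = a / b   # the int 2 is converted to a float for the division
print(result)    # 2.25
```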
Knowing the type of an object is important, and it’s useful to be able to write functions that
can handle many different kinds of input. You can check that an object is an instance of a
particular type using the isinstance function:
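A short sketch with assumed values. isinstance also accepts a tuple of types, in which case it checks whether the object is an instance of any of them.

```python
a = 5
b = 4.5
check_int = isinstance(a, int)              # True
check_either = isinstance(b, (int, float))  # True: b is a float
check_wrong = isinstance(b, int)            # False
print(check_int, check_either, check_wrong)
```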
Attributes and methods
Objects in Python typically have both attributes (other Python objects stored “inside” the
object) and methods (functions associated with an object that can have access to the object’s
internal data). Both of them are accessed via the syntax obj.attribute_name. In IPython, typing
the object name followed by a period and pressing Tab lists the available attributes and methods.
Attributes and methods can also be accessed by name via the getattr function:
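A minimal sketch, using an assumed string value: getattr looks up the attribute whose name is given as a string and returns it, here a bound method that can then be called.

```python
a = "foo"
split_method = getattr(a, "split")   # same object you get from a.split
print(split_method())                # ['foo']
upper_value = getattr(a, "upper")()  # look up and call in one step
print(upper_value)                   # FOO
```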
Duck typing
Often you may not care about the type of an object but rather only whether it has certain
methods or behavior. This is sometimes called “duck typing,” after the saying “If it walks like
a duck and quacks like a duck, then it’s a duck.” For example, you can verify that an object is
iterable if it implements the iterator protocol. For many objects, this means it has a __iter__
“magic method,” though an alternative and better way to check is to try using the iter
function:
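One possible way to write such a check is sketched below. The helper name isiterable is an assumption; it simply tries iter and reports whether that succeeded.

```python
def isiterable(obj):
    try:
        iter(obj)
        return True
    except TypeError:  # raised when the object is not iterable
        return False


print(isiterable("a string"))  # True
print(isiterable([1, 2, 3]))   # True
print(isiterable(5))           # False
```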
Imports
In Python a module is simply a file with the .py extension containing Python code. Suppose
that we had the following module:
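The module listing from the original screenshot is not available. The contents below are assumed; they only need to define the names PI, f, and g that the import statements in this part refer to.

```python
# some_module.py (contents assumed for illustration)
PI = 3.14159


def f(x):
    return x + 2


def g(a, b):
    return a + b
```

Saved as some_module.py, these definitions can be imported from another file in the same directory.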
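To use the variables and functions defined in some_module.py from another file in the same directory, we could write import some_module and then refer to some_module.f and some_module.PI. Or equivalently: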
from some_module import f, g, PI
result = g(5, PI)
By using the as keyword you can give imports different variable names:
import some_module as sm
from some_module import PI as pi, g as gf
r1 = sm.f(pi)
r2 = gf(6, pi)
Binary operators and comparisons
Most of the binary math operations and comparisons are as you might expect:
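A few assumed examples of arithmetic and comparison operators:

```python
difference = 5 - 7      # -2
total = 12 + 21.5       # 33.5 (int plus float gives float)
comparison = 5 <= 2     # False
print(difference, total, comparison)
```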
To check if two references refer to the same object, use the is keyword. is not is also perfectly
valid if you want to check that two objects are not the same:
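The sketch below, with assumed lists, contrasts identity (is) with equality (==). Since list always creates a new list, c is a distinct object that holds equal contents.

```python
a = [1, 2, 3]
b = a           # same object
c = list(a)     # new list with the same contents

same_object = a is b        # True
different_object = a is not c   # True
equal_contents = a == c     # True
print(same_object, different_object, equal_contents)
```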
Binary operators:
Mutable and immutable objects
Most objects in Python, such as lists, dicts, NumPy arrays, and most user-defined types
(classes), are mutable. This means that the object or values that they contain can be modified:
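As an assumed illustration, an element of a list can be replaced in place. For contrast, the same operation on a tuple (an immutable type) raises TypeError.

```python
a_list = ["foo", 2, [4, 5]]
a_list[2] = (3, 4)          # lists are mutable: the element is replaced in place
print(a_list)               # ['foo', 2, (3, 4)]

a_tuple = (3, 5, (4, 5))
try:
    a_tuple[1] = "four"     # tuples are immutable: item assignment fails
    tuple_changed = True
except TypeError as exc:
    tuple_changed = False
    print("TypeError:", exc)
```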
Remember that just because you can mutate an object does not mean that you always should.
Such actions are known as side effects. For example, when writing a function, any side effects
should be explicitly communicated to the user in the function’s documentation or comments.
If possible, I recommend trying to avoid side effects and favor immutability, even though
there may be mutable objects involved.
(2) Scalar Types
Python along with its standard library has a small set of built-in types for handling numerical
data, strings, boolean (True or False) values, and dates and time. These “single value” types
are sometimes called scalar types and we refer to them in this practical as scalars. See the
following table for a list of the main scalar types. Date and time handling will be discussed
separately, as these are provided by the datetime module in the standard library.
Numeric types
The primary Python types for numbers are int and float. An int can store arbitrarily large
numbers. Integer division not resulting in a whole number will always yield a floating-point
number. To get integer division that drops the fractional part (it rounds toward negative infinity,
which matches C-style integer division for positive operands), use the floor division operator //:
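A short sketch with assumed values, showing arbitrary-precision integers, true division, and floor division (including the negative case, where the result is rounded down rather than truncated).

```python
ival = 17239871
big = ival ** 6          # ints grow as needed; no overflow
print(big)

true_div = 3 / 2         # 1.5
floor_div = 3 // 2       # 1
neg_floor_div = -3 // 2  # -2 (rounded toward negative infinity)
print(true_div, floor_div, neg_floor_div)
```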
Strings
Many people use Python for its powerful and flexible built-in string processing capabilities.
You can write string literals using either single quotes ' or double quotes ":
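Both quoting styles produce the same kind of object. The strings below are assumed examples; using one style makes it easy to embed the other quote character.

```python
a = 'one way of writing a string'
b = "another way"
c = "it's easy to embed a single quote inside double quotes"
same = 'abc' == "abc"   # True: the quote style does not affect the value
print(a, b, c, same, sep="\n")
```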
Many Python objects can be converted to a string using the str function. Strings are
sequences of Unicode characters and can be treated like other sequences, such as lists and tuples:
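A minimal sketch with assumed values: str converts a float to its text form, and a string can be turned into a list of characters or sliced like any other sequence.

```python
a = 5.6
s = str(a)
print(s)              # 5.6

s = "python"
chars = list(s)
prefix = s[:3]
print(chars)          # ['p', 'y', 't', 'h', 'o', 'n']
print(prefix)         # pyt
```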
The syntax s[:3] is called slicing and is implemented for many kinds of Python sequences.
This will be explained in more detail later on, as it is used extensively. The backslash
character \ is an escape character, meaning that it is used to specify special characters like
newline \n or Unicode characters. To write a string literal with backslashes, you need to
escape them:
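As an assumed example, the string below contains one literal backslash, written as two in the source code. A raw string literal (prefix r) is an alternative that treats backslashes as ordinary characters.

```python
s = "12\\34"
print(s)         # 12\34
print(len(s))    # 5: the escaped backslash is a single character

raw = r"this\has\no\special\characters"
print(raw)
```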
Bytes and Unicode
In modern Python (i.e., Python 3.0 and up), Unicode has become the first-class string type to
enable more consistent handling of ASCII and non-ASCII text. In older versions of Python,
strings were all bytes without any explicit Unicode encoding. You could convert to Unicode
assuming you knew the character encoding. Let’s look at an example:
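A sketch using an assumed non-ASCII string: encode turns a Unicode string into bytes for a given encoding, and decode reverses the step when the encoding is known.

```python
val = "español"
val_utf8 = val.encode("utf-8")
print(val_utf8)                    # b'espa\xc3\xb1ol'
print(type(val_utf8))              # <class 'bytes'>

round_trip = val_utf8.decode("utf-8")
print(round_trip)                  # español
```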
Booleans
The two boolean values in Python are written as True and False. Comparisons and other
conditional expressions evaluate to either True or False. Boolean values are combined with
the and and or keywords:
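For example:

```python
both = True and True     # True
either = False or True   # True
neither = False and True # False
print(both, either, neither)
```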
Type casting
The str, bool, int, and float types are also functions that can be used to cast values to those
types:
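A short sketch with an assumed input string. Note that int truncates the fractional part and that bool treats any nonzero number as True.

```python
s = "3.14159"
fval = float(s)
print(type(fval))     # <class 'float'>

as_int = int(fval)    # 3
as_bool = bool(fval)  # True (nonzero)
zero_bool = bool(0)   # False
print(as_int, as_bool, zero_bool)
```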
None
None is the Python null value type. If a function does not explicitly return a value, it
implicitly returns None:
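A minimal sketch; the function name no_return is hypothetical. None is usually tested with is rather than ==.

```python
a = None
print(a is None)        # True


def no_return(x):
    x + 1               # computed but never returned


result = no_return(1)
print(result is None)   # True
```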
Dates and times
The built-in Python datetime module provides datetime, date, and time types. The datetime
type, as you may imagine, combines the information stored in date and time and is the most
commonly used:
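The sketch below uses an assumed date and time to show how a datetime is built, how its parts are read, and how it can be formatted as text with strftime.

```python
from datetime import datetime

dt = datetime(2011, 10, 29, 20, 30, 21)
print(dt.day)       # 29
print(dt.minute)    # 30
print(dt.date())    # 2011-10-29
print(dt.time())    # 20:30:21

formatted = dt.strftime("%m/%d/%Y %H:%M")
print(formatted)    # 10/29/2011 20:30
```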
Datetime format specification (ISO C89 compatible)
(3) Control Flow
Python has several built-in keywords for conditional logic, loops, and other standard control
flow concepts found in other programming languages.
if, elif, and else
The if statement is one of the most well-known control flow statement types. It checks a
condition that, if True, evaluates the code in the block that follows:
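An if statement can be followed by one or more elif blocks and a final else block, which runs when none of the conditions hold. The value of x below is assumed.

```python
x = -3  # assumed value for illustration

if x < 0:
    label = "negative"
elif x == 0:
    label = "zero"
else:
    label = "positive"

print(label)  # negative
```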
for loops
for loops are for iterating over a collection (like a list or tuple) or an iterator. The standard
syntax for a for loop is:
for value in collection:
    # do something with value
You can advance a for loop to the next iteration, skipping the remainder of the block, using
the continue keyword. Consider this code, which sums up integers in a list and skips None
values:
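A sketch of that loop, with an assumed input list:

```python
sequence = [1, 2, None, 4, None, 5]
total = 0

for value in sequence:
    if value is None:
        continue        # skip the rest of this iteration
    total += value

print(total)  # 12
```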
As we will see in more detail, if the elements in the collection or iterator are sequences (tuples
or lists, say), they can be conveniently unpacked into variables in the for loop statement:
for a, b, c in iterator:
    # do something
while loops
A while loop specifies a condition and a block of code that is to be executed until the
condition evaluates to False or the loop is explicitly ended with break:
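As an assumed example, the loop below halves x on each pass and stops early with break once the running total exceeds 500.

```python
x = 256
total = 0

while x > 0:
    if total > 500:
        break           # leave the loop before the condition becomes False
    total += x
    x = x // 2

print(total, x)  # 504 4
```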
pass
pass is the “no-op” statement in Python. It can be used in blocks where no action is to be
taken (or as a placeholder for code not yet implemented); it is only required because Python
uses whitespace to delimit blocks:
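A minimal sketch with an assumed value of x. The elif branch has nothing to do yet, so pass keeps the block syntactically valid.

```python
x = 0            # assumed value for illustration
message = None

if x < 0:
    message = "negative!"
elif x == 0:
    # Placeholder: nothing to do for zero yet.
    pass
else:
    message = "positive!"

print(message)   # None
```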
range
The range function returns an iterable object that yields a sequence of evenly spaced integers:
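A few assumed calls are sketched below. range accepts a start, an end, and a step (which may be negative), and the end value is not included. Wrapping the result in list makes the generated integers visible.

```python
first_ten = list(range(10))
evens = list(range(0, 20, 2))
countdown = list(range(5, 0, -1))

print(first_ten)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(evens)       # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(countdown)   # [5, 4, 3, 2, 1]

# A common use: iterating over the positions of a sequence.
seq = [1, 2, 3, 4]
for i in range(len(seq)):
    print(i, seq[i])
```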
------------------------------------The End of Practical 1---------------------------------------