General data processing and using big data


Practical 1 - Python Basic

 

SIT742 Modern Data Science (2018 T1) - Practical

Prepared by Guangyan Huang

Practical 1: Python Basic √

Practical 2: Data Acquisition

Practical 3: Data Cleaning and Preparation

Practical 4: Data Integration

Practical 5: Plotting and Visualization

Practical 6: Demo Display

Practical 7: K-means Clustering

Practical 8: Principal Component Analysis

Practical 9: Support Vector Machines

Practical 10: Time Series Basic

Practical 11: Time Series Applications

Reference

[1] Wes McKinney, Python for Data Analysis, 2nd Edition, O'Reilly, 2018. (Python 3.6, Anaconda)

[2] Jake VanderPlas, Python Data Science Handbook, O'Reilly, 2017.


Part I: Set Up Anaconda Python Environment

The first step in setting up your environment is to download the required Anaconda installation package from https://www.anaconda.com/download/.

(1) Download the Anaconda3-4.2.0-Windows-x86_64 package (the one with Python 3.5) from https://repo.continuum.io/archive/. A screenshot of the target page is shown in Figure 1.

Fig. 1. Downloading "Anaconda3-4.2.0-Windows-x86_64.exe".

(2) Installing the downloaded file is as simple as double-clicking it and letting the installer take care of the entire process. To check that the installation was successful, open a command prompt or terminal and start Python. You should be greeted with the message shown in Figure 2, which identifies the Python and Anaconda versions.

We also recommend that you use the IPython shell (the command is ipython) instead of the regular Python shell, because it offers many additional features, including inline plots and autocompletion.


Fig. 2. Verifying installation with the Python shell.

To exit the shell, press Ctrl-D (on Linux or macOS), press Ctrl-Z followed by Enter (on Windows), or type the command exit() and press Enter.

Test it as shown in the following screen:
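As a quick check from inside the shell, a minimal sketch such as the following prints the interpreter version. The exact version string depends on your installation, so no expected output is shown here.

```python
import sys

# Report the interpreter version; the Anaconda package named above ships Python 3.5.
print(sys.version)

# Stop early if the shell turns out to be running Python 2 by mistake.
assert sys.version_info.major == 3
```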

This completes the process of setting up your Python environment for Data Science and Machine Learning.

(3) Set up Jupyter notebooks. Jupyter notebooks, formerly known as IPython notebooks, are an interactive computational environment that can be used to develop Python-based data science analyses with an emphasis on reproducible research. No additional installation is required, because Jupyter is already installed by the Anaconda distribution. We can start the Jupyter notebook by executing the following command at the command prompt or terminal. Make sure you run it from the Anaconda folder.


C:\Anaconda3all\jupyter notebook

This will start a notebook server at the address localhost:8888 on your machine. An important point to note is that you access the notebook through a browser, so you can start it on a remote server and use it locally with techniques such as SSH tunneling. This is extremely useful when you have a powerful computing resource that you can only access remotely and that lacks a GUI: Jupyter notebook lets you use that resource from a visually interactive shell. Once you have run the command, navigate to localhost:8888 in your browser to find the landing page depicted in Figure 3, which can be used to open existing notebooks or create new ones.

Fig. 3. Jupyter notebook landing page.


Part II: IPython Basic

(1) Tab Completion

On the surface, the IPython shell looks like a cosmetically different version of the standard terminal Python interpreter (invoked with python). One of its major improvements over the standard Python shell is tab completion, a feature found in many IDEs and other interactive computing environments. While you enter expressions in the shell, pressing the Tab key searches the namespace for any variables (objects, functions, etc.) matching the characters you have typed so far:
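Tab completion itself is interactive, but the lookup it performs can be approximated in plain Python. The sketch below is only an illustration of the idea, not IPython's actual implementation: it lists the attributes of the built-in list type that start with a given prefix, which is roughly what pressing Tab after typing a list variable, a dot, and "co" would offer. The prefix is an arbitrary choice for the example.

```python
# Prefix the user is assumed to have typed after the dot (illustrative choice).
prefix = "co"

# dir() returns the attribute names of an object, sorted alphabetically.
candidates = [name for name in dir(list) if name.startswith(prefix)]

print(candidates)  # ['copy', 'count']
```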


(2) Introspection

Using a question mark (?) before or after a variable will display some general information about the object:

This is referred to as object introspection. If the object is a function or instance method, its docstring, if defined, will also be shown. Suppose we'd written the following function (which you can reproduce in IPython or Jupyter):

Then using ? shows us the docstring:
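The original screenshots are not reproduced here, so the following is a stand-in sketch. The function name add_numbers is chosen for illustration only. In IPython, typing add_numbers? would display the docstring; in plain Python the same text is available through inspect.getdoc.

```python
import inspect


def add_numbers(a, b):
    """
    Add two numbers together.

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b


# Plain-Python counterpart of `add_numbers?`: read the cleaned-up docstring.
print(inspect.getdoc(add_numbers))
```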


Using ?? will also show the function’s source code if possible:

? has a final usage, which is searching the IPython namespace in a manner similar to the standard Unix or Windows command line. A number of characters combined with the wildcard (*) will show all names matching the wildcard expression. For example, we could get a list of all functions in the top-level NumPy namespace containing load:


(3) The %run Command

You can run any file as a Python program inside the environment of your IPython session using the %run command. Suppose you had the following simple script stored in c:\Anaconda3all\test.py:
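The script from the original screenshot is not available, so the contents below are a hypothetical example of what such a test.py might look like. After running it with %run test.py, the names it defines (f, a, b, c, and result) would be available in the IPython session.

```python
# Hypothetical contents of test.py (illustrative only).
def f(x, y, z):
    # Add the first two arguments and divide by the third.
    return (x + y) / z


a = 5
b = 6
c = 7.5

result = f(a, b, c)
print(result)
```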

In Jupyter notebook:


(4) The %load command:

(5) About Magic Commands

IPython's special commands (which are not built into Python itself) are known as “magic” commands. They are designed to facilitate common tasks and let you easily control the behavior of the IPython system. A magic command is any command prefixed by the percent symbol %. For example, you can check the execution time of any Python statement, such as a matrix multiplication, using the %timeit magic function (which will be discussed in more detail later):
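Outside IPython, the standard timeit module offers a comparable measurement. The sketch below rests on two assumptions: the matmul helper is a hypothetical pure-Python stand-in for a NumPy matrix product, and the repeat count of 100 is fixed by hand, whereas %timeit chooses it automatically. Timings vary by machine, so no expected output is given.

```python
import timeit


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))  # columns of b, so each entry is a row-column dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


# A 20 x 20 test matrix with arbitrary values.
m = [[float(i + j) for j in range(20)] for i in range(20)]

# Total wall-clock time for 100 multiplications.
elapsed = timeit.timeit(lambda: matmul(m, m), number=100)
print("%.6f seconds for 100 runs" % elapsed)
```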


Useful Keyboard Shortcut:

Useful Command:


(6) Matplotlib Integration

One reason for IPython's popularity in analytical computing is that it integrates well with data visualization and other user interface libraries like matplotlib. The %matplotlib magic function configures this integration with the IPython shell or Jupyter notebook. This is important, as otherwise plots you create will either not appear (notebook) or take control of the session until closed (shell). In the IPython shell, running %matplotlib sets up the integration so you can create multiple plot windows without interfering with the console session:


Part III: Python Language Basics

Now let's go through the essential Python programming concepts and language mechanics.

(1) Language Semantics

The Python language design is distinguished by its emphasis on readability, simplicity, and explicitness. Some people go so far as to liken it to “executable pseudocode.”

Indentation, not braces

Python uses whitespace (tabs or spaces) to structure code instead of using braces as in many other languages like R, C++, Java, and Perl. We will return to this later.

Variables and argument passing

When assigning a variable (or name) in Python, you are creating a reference to the object on the right-hand side of the equals sign. In practical terms, consider a list of integers:
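A minimal sketch of this behavior, with arbitrarily chosen variable names: assigning a to b does not copy the list, so a change made through one name is visible through the other.

```python
a = [1, 2, 3]
b = a          # b now refers to the same list object; nothing is copied

a.append(4)    # modify the list through the name a

print(b)       # [1, 2, 3, 4]
print(a is b)  # True
```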

When you pass objects as arguments to a function, new local variables are created referencing the original objects without any copying. If you bind a new object to a variable inside a function, that change will not be reflected in the parent scope. Because the local variable still refers to the original object, however, it is possible to alter the internals of a mutable argument. Suppose we had the following function:
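The sketch below contrasts the two cases. Both function names are illustrative: append_element mutates the list it receives, while rebind only reassigns its local name and so leaves the caller's list untouched.

```python
def append_element(some_list, element):
    # Mutates the caller's list in place.
    some_list.append(element)


def rebind(some_list):
    # Rebinds only the local name; the caller's list is unaffected.
    some_list = [0]


data = [1, 2, 3]
append_element(data, 4)
rebind(data)

print(data)  # [1, 2, 3, 4]
```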


Dynamic references, strong types

In contrast with many compiled languages, such as Java and C++, object references in Python have no type associated with them. There is no problem with the following:
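For instance, the same name can refer first to an integer and then to a string. The variable names here are arbitrary.

```python
a = 5
first_type = type(a)
print(first_type)   # <class 'int'>

a = "foo"           # the same name now refers to a str object
second_type = type(a)
print(second_type)  # <class 'str'>
```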


The type information is stored in the object itself, so Python is nevertheless considered a strongly typed language: every object has a specific type (or class), and implicit conversions will occur only in certain obvious circumstances, such as the following:
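A short sketch of both sides of this rule: mixing an int with a float converts implicitly, whereas mixing a string with an int raises a TypeError. The exact error message differs between Python versions, so it is printed rather than quoted.

```python
a = 4.5
b = 2

# The int is implicitly converted to a float for the division.
print(a / b)  # 2.25

try:
    "5" + 5
except TypeError as exc:
    # A str and an int are never combined implicitly.
    error = exc
    print("TypeError:", exc)
```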

Knowing the type of an object is important, and it's useful to be able to write functions that can handle many different kinds of input. You can check that an object is an instance of a particular type using the isinstance function:
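As a small illustration, isinstance accepts either a single type or a tuple of types, in which case it checks whether the object matches any of them.

```python
a = 5
b = 4.5

print(isinstance(a, int))           # True
print(isinstance(a, (int, float)))  # True
print(isinstance(b, (int, float)))  # True
print(isinstance(b, int))           # False
```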


Attributes and methods

Objects in Python typically have both attributes (other Python objects stored “inside” the object) and methods (functions associated with an object that can have access to the object's internal data). Both are accessed via the syntax obj.attribute_name; in IPython, typing obj. and pressing Tab lists the available names.

Attributes and methods can also be accessed by name via the getattr function:
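A brief sketch using a string object; the attribute name "missing" is deliberately one that does not exist, to show the optional default argument.

```python
a = "foo"

split_method = getattr(a, "split")  # equivalent to a.split
print(split_method())               # ['foo']

print(hasattr(a, "upper"))          # True
print(getattr(a, "missing", None))  # None (default returned instead of raising)
```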

Duck typing

Often you may not care about the type of an object but rather only whether it has certain methods or behavior. This is sometimes called “duck typing,” after the saying “If it walks like a duck and quacks like a duck, then it's a duck.” For example, you can verify that an object is iterable if it implements the iterator protocol. For many objects, this means it has an __iter__ “magic method,” though an alternative and better way to check is to try using the iter function:
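One way to write such a check is sketched below; the helper name isiterable is chosen for illustration.

```python
def isiterable(obj):
    """Return True if iter() accepts the object, False otherwise."""
    try:
        iter(obj)
        return True
    except TypeError:  # raised when the object is not iterable
        return False


print(isiterable("a string"))  # True
print(isiterable([1, 2, 3]))   # True
print(isiterable(5))           # False
```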


Imports

In Python a module is simply a file with the .py extension containing Python code. Suppose that we had the following module:
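The module listing from the original is missing, so the contents below are a hypothetical some_module.py. The names f, g, and PI come from the import statements used in this section; the function bodies and the value of PI are assumptions.

```python
# some_module.py (hypothetical contents, for illustration only)
PI = 3.14159


def f(x):
    return x + 2


def g(a, b):
    return a + b
```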


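The names defined in some_module.py can be used from another file in the same directory by writing import some_module and referring to them as some_module.f, some_module.g, and some_module.PI. Or equivalently: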

from some_module import f, g, PI
result = g(5, PI)

By using the as keyword you can give imports different variable names:

import some_module as sm
from some_module import PI as pi, g as gf

r1 = sm.f(pi)
r2 = gf(6, pi)

Binary operators and comparisons

Most of the binary math operations and comparisons are as you might expect:
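A few representative cases, with arbitrarily chosen operands:

```python
print(5 - 7)      # -2
print(12 + 21.5)  # 33.5
print(5 <= 2)     # False

# Floor division, remainder, and exponentiation.
print(7 // 2, 7 % 2, 2 ** 3)  # 3 1 8
```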


To check if two references refer to the same object, use the is keyword. is not is also valid if you want to check that two objects are not the same:
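The sketch below shows the difference between identity (is) and equality (==). Because list() always creates a new list, c is equal to a but is not the same object.

```python
a = [1, 2, 3]
b = a        # same object
c = list(a)  # new list with the same contents

print(a is b)      # True
print(a is not c)  # True
print(a == c)      # True
```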

Binary operators:


Mutable and immutable objects

Most objects in Python, such as lists, dicts, NumPy arrays, and most user-defined types (classes), are mutable. This means that the object or values that they contain can be modified:
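As an illustration, a list element can be replaced in place. By contrast, tuples and strings are immutable, so the same kind of assignment on a tuple raises a TypeError. The variable names are arbitrary.

```python
a_list = ["foo", 2, [4, 5]]
a_list[2] = (3, 4)  # lists can be modified in place
print(a_list)       # ['foo', 2, (3, 4)]

a_tuple = (3, 5, (4, 5))
tuple_is_immutable = False
try:
    a_tuple[1] = "four"
except TypeError:
    # Tuples do not support item assignment.
    tuple_is_immutable = True

print(tuple_is_immutable)  # True
```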

Remember that just because you can mutate an object does not mean that you always should. Such actions are known as side effects. For example, when writing a function, any side effects should be explicitly communicated to the user in the function's documentation or comments. If possible, I recommend trying to avoid side effects and favor immutability, even though there may be mutable objects involved.

(2) Scalar Types

Python along with its standard library has a small set of built-in types for handling numerical data, strings, boolean (True or False) values, and dates and time. These “single value” types are sometimes called scalar types and we refer to them in this practical as scalars. See the following table for a list of the main scalar types. Date and time handling will be discussed separately, as these are provided by the datetime module in the standard library.


Numeric types

The primary Python types for numbers are int and float. An int can store arbitrarily large numbers. Dividing two integers with / always yields a floating-point number. To get C-style integer division (which drops the fractional part if the result is not a whole number), use the floor division operator //. Note that for negative results // rounds down toward negative infinity rather than toward zero:
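A short sketch of the two division operators and of arbitrary-precision integers; the operands are chosen only for illustration.

```python
print(3 / 2)    # 1.5
print(4 / 2)    # 2.0  (still a float)
print(3 // 2)   # 1
print(-7 // 2)  # -4   (floor division rounds toward negative infinity)

# Integers are not limited to 64 bits.
big = 2 ** 100
print(big)
```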

Strings

Many people use Python for its powerful and flexible built-in string processing capabilities. You can write string literals using either single quotes ' or double quotes ":
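For example, the two quoting styles are interchangeable. The sketch also shows triple quotes, which the text above does not mention but which Python provides for strings that span several lines.

```python
a = 'one way of writing a string'
b = "another way"

# Triple quotes allow a literal to span multiple lines.
c = """
This is a longer string that
spans multiple lines
"""

print(c.count("\n"))  # 3
```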


Many Python objects can be converted to a string using the str function. Strings are sequences of Unicode characters and can be treated like other sequences, e.g., lists and tuples:
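A minimal sketch of both points, using arbitrary values: str converts a float to text, and a string can be turned into a list of characters or sliced like any other sequence.

```python
a = 5.6
converted = str(a)
print(converted)  # 5.6

s = "python"
print(list(s))    # ['p', 'y', 't', 'h', 'o', 'n']
print(s[:3])      # pyt
```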


The syntax s[:3] is called slicing and is implemented for many kinds of Python sequences. This will be explained in more detail later on, as it is used extensively. The backslash character \ is an escape character, meaning that it is used to specify special characters like newline \n or Unicode characters. To write a string literal with backslashes, you need to escape them:
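The sketch below escapes a single backslash. It also shows a raw string (prefix r), an alternative not covered in the text above, in which backslashes are taken literally.

```python
s = "12\\34"  # the doubled backslash produces one backslash character
print(s)       # 12\34
print(len(s))  # 5

# In a raw string, backslashes need no escaping.
raw = r"this\has\no\special\characters"
print(raw)
```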

Bytes and Unicode

In modern Python (i.e., Python 3.0 and up), Unicode has become the first-class string type to enable more consistent handling of ASCII and non-ASCII text. In older versions of Python, strings were all bytes without any explicit Unicode encoding. You could convert to Unicode assuming you knew the character encoding. Let's look at an example:
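A small sketch with a string containing one non-ASCII character; the word is an arbitrary choice, written with an escape sequence so the example does not depend on the file's encoding. Encoding to UTF-8 produces a bytes object that is one byte longer than the string has characters, and decoding restores the original.

```python
val = "espa\u00f1ol"  # the Spanish word "español"

val_utf8 = val.encode("utf-8")
print(type(val_utf8))           # <class 'bytes'>
print(len(val), len(val_utf8))  # 7 8

# Decoding with the same encoding gives back the original string.
print(val_utf8.decode("utf-8") == val)  # True
```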


Booleans

The two boolean values in Python are written as True and False. Comparisons and other conditional expressions evaluate to either True or False. Boolean values are combined with the and and or keywords:
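For example, with arbitrarily chosen values (the not keyword, which negates a boolean, is included for completeness):

```python
print(True and True)  # True
print(False or True)  # True

a, b = 5, 7
print(a < b and b < 10)  # True
print(not a < b)         # False
```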

Type casting

The str, bool, int, and float types are also functions that can be used to cast values to those types:
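A brief sketch, starting from a string that holds a number; the value is arbitrary.

```python
s = "3.14159"

fval = float(s)
print(type(fval))  # <class 'float'>
print(int(fval))   # 3
print(bool(fval))  # True
print(bool(0))     # False
print(str(42))     # 42
```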


None

None is the Python null value type. If a function does not explicitly return a value, it implicitly returns None:
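The following sketch illustrates both points. The function names are illustrative; the second function shows a common use of None as the default value for an optional argument.

```python
a = None
print(a is None)  # True


def no_return(x):
    x + 1  # the result is computed but never returned


print(no_return(1) is None)  # True


def add_and_maybe_multiply(a, b, c=None):
    result = a + b
    if c is not None:  # multiply only when a third argument was supplied
        result = result * c
    return result


print(add_and_maybe_multiply(5, 3))      # 8
print(add_and_maybe_multiply(5, 3, 10))  # 80
```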


Dates and times

The built-in Python datetime module provides datetime, date, and time types. The datetime type, as you may imagine, combines the information stored in date and time and is the most commonly used:
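A minimal sketch with an arbitrary timestamp. It reads individual fields, splits the value into its date and time parts, formats it with strftime, and parses a string with strptime.

```python
from datetime import datetime

dt = datetime(2011, 10, 29, 20, 30, 21)

print(dt.day, dt.minute)  # 29 30
print(dt.date())          # 2011-10-29
print(dt.time())          # 20:30:21

# Format as text, then parse a different string back into a datetime.
formatted = dt.strftime("%m/%d/%Y %H:%M")
print(formatted)          # 10/29/2011 20:30

parsed = datetime.strptime("20091031", "%Y%m%d")
print(parsed)             # 2009-10-31 00:00:00
```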


Datetime format specification (ISO C89 compatible)


(3) Control Flow

Python has several built-in keywords for conditional logic, loops, and other standard control flow concepts found in other programming languages.

if, elif, and else

The if statement is one of the most well-known control flow statement types. It checks a condition that, if True, evaluates the code in the block that follows:
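An if statement can be followed by any number of elif blocks and a final catch-all else block. The sketch below wraps the chain in a function, named describe for illustration, so that each branch can be exercised.

```python
def describe(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    elif 0 < x < 5:
        return "positive but smaller than 5"
    else:
        return "positive and larger than or equal to 5"


print(describe(-3))  # negative
print(describe(7))   # positive and larger than or equal to 5
```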

for loops

for loops are for iterating over a collection (like a list or tuple) or an iterator. The standard syntax for a for loop is:

for value in collection:
    # do something with value

You can advance a for loop to the next iteration, skipping the remainder of the block, using the continue keyword. Consider this code, which sums up integers in a list and skips None values:
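A sketch of such a loop, with an arbitrary input list:

```python
sequence = [1, 2, None, 4, None, 5]
total = 0

for value in sequence:
    if value is None:
        continue  # skip to the next element
    total += value

print(total)  # 12
```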


As we will see in more detail, if the elements in the collection or iterator are sequences (tuples or lists, say), they can be conveniently unpacked into variables in the for loop statement:

for a, b, c in iterator:
    # do something

while loops

A while loop specifies a condition and a block of code that is to be executed until the condition evaluates to False or the loop is explicitly ended with break:
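In the sketch below, with arbitrarily chosen numbers, the loop would normally stop when x reaches 0, but the break statement ends it earlier, as soon as the running total exceeds 500.

```python
x = 256
total = 0

while x > 0:
    if total > 500:
        break  # leave the loop before the condition becomes False
    total += x
    x = x // 2

print(total, x)  # 504 4
```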


pass

pass is the “no-op” statement in Python. It can be used in blocks where no action is to be taken (or as a placeholder for code not yet implemented); it is only required because Python uses whitespace to delimit blocks:
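For example, in the sketch below (function name chosen for illustration) the middle branch is left as a placeholder. Without pass, the empty block would be a syntax error.

```python
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        # Placeholder: handling of zero has not been decided yet.
        pass
    else:
        return "positive"


print(classify(-1))  # negative
print(classify(0))   # None
print(classify(3))   # positive
```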

range

The range function returns an iterable object that yields a sequence of evenly spaced integers:
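A short sketch of the three call forms (stop only; start and stop; start, stop, and step). The end point is never included. A common use is iterating over a sequence by index, shown here with an arbitrary list.

```python
print(list(range(10)))        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(range(0, 20, 2)))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(list(range(5, 0, -1)))  # [5, 4, 3, 2, 1]

seq = [1, 2, 3, 4]
doubled = []
for i in range(len(seq)):
    doubled.append(seq[i] * 2)

print(doubled)  # [2, 4, 6, 8]
```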


------------------------------------The End of Practical 1---------------------------------------