week 8-data science
Data Science and Big Data Analytics
Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics
1
Chapter Contents
11.1 SQL Essentials
11.1.1 Joins
11.1.2 Set Operations
11.1.3 Grouping Extensions
11.2 In-Database Text Analysis
11.3 Advanced SQL
11.3.1 Window Functions
11.3.2 User-Defined Functions and Aggregates
11.3.3 Ordered Aggregates
11.3.4 MADlib
Summary
2
Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics
In-database analytics is a broad term that describes the processing of data within its repository
This is in contrast to extracting data from a source and loading it into a sandbox or workspace like R
Advantages and disadvantages
Advantage: Eliminates need to move the data
Advantage: Fast – often almost real-time results
Disadvantage: Data must be mostly structured
Disadvantage: Data must be limited, not too huge
Applications – Credit card transaction fraud detection, product recommendations, web advertisement selection
3
11.1 SQL Essentials Relational Database – Entity Relationship Diagram
11.1 SQL Essentials
Tables
Records (rows)
Fields
Primary Keys
Foreign Keys
Normalization
reduces dup
SQL queries
4
11.2 In-Database Text Analysis
SQL offers basic text string functions
Example – extract zip code from text string
5
11.2 In-Database Text Analysis
Example – identify invalid zip codes
6
11.3 Advanced SQL
Window functions – moving averages
7
11.3 Advanced SQL
EWMA = Exponentially Weighted Moving Average
8
11.3 Advanced SQL
MADlib – open-source library for in-database analytics
MAD
Magnetic – attract all the data sources within organization
Agile – physical and logical contents in continuous rapid evolution
Deep – sophisticated statistical methods
9
11.3 Advanced SQL MADlib Modules
10
Summary
SQL can perform in-database analytics
Although SQL is usually associated with structured data, SQL tables often contain unstructured data – comments, descriptions, etc. – that can be restructured for further analysis
Complex SQL queries can utilize window functions and user-defined functions
Libraries such as MADlib can be used to conduct statistical analyses within a database
11