week 8-data science

profilerav
Chapter_111.pptx

Data Science and Big Data Analytics

Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics

1

Chapter Contents

11.1 SQL Essentials

11.1.1 Joins

11.1.2 Set Operations

11.1.3 Grouping Extensions

11.2 In-Database Text Analysis

11.3 Advanced SQL

11.3.1 Window Functions

11.3.2 User-Defined Functions and Aggregates

11.3.3 Ordered Aggregates

11.3.4 MADlib

Summary

2

Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics

In-database analytics is a broad term that describes the processing of data within its repository

This is in contrast to extracting data from a source and loading it into a sandbox or workspace like R

Advantages and disadvantages

Advantage: Eliminates need to move the data

Advantage: Fast – often almost real-time results

Disadvantage: Data must be mostly structured

Disadvantage: Data must be limited, not too huge

Applications – Credit card transaction fraud detection, product recommendations, web advertisement selection

3

11.1 SQL Essentials Relational Database – Entity Relationship Diagram

11.1 SQL Essentials

Tables

Records (rows)

Fields

Primary Keys

Foreign Keys

Normalization

reduces dup

SQL queries

4

11.2 In-Database Text Analysis

SQL offers basic text string functions

Example – extract zip code from text string

5

11.2 In-Database Text Analysis

Example – identify invalid zip codes

6

11.3 Advanced SQL

Window functions – moving averages

7

11.3 Advanced SQL

EWMA = Exponentially Weighted Moving Average

8

11.3 Advanced SQL

MADlib – open-source library for in-database analytics

MAD

Magnetic – attract all the data sources within organization

Agile – physical and logical contents in continuous rapid evolution

Deep – sophisticated statistical methods

http://doc.madlib.net/latest/modules.html

9

11.3 Advanced SQL MADlib Modules

10

Summary

SQL can perform in-database analytics

Although SQL is usually associated with structured data, SQL tables often contain unstructured data – comments, descriptions, etc. – that can be restructured for further analysis

Complex SQL queries can utilize window functions and user-defined functions

Libraries such as MADlib can be used to conduct statistical analyses within a database

11