Big Data Management Question Paper

Magnum_1993

4.2NewSQL.pptx

Home >Computer Science homework help >Big Data Management Question Paper

NewSQL

Big Data Management

Phil Bartie [email protected] EM G.29

Big Data Management

NewSQL

Materials released under CC-BY License

You are free to:

Share — copy and redistribute the material in any medium or format

Adapt — remix, transform, and build upon the material for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Big Data Management

Outline

Relational Database Storage

Row oriented

Column oriented

Different from wide-column stores

NewSQL

Relational community reaction to NoSQL!

Big Data Management

Relational

Oracle

MySQL

MS SQL Server

PostgreSQL

DB2

MS Access

SQLite

Teradata

SAP Adaptive Server

FileMaker

Hive

MariaDB

Informix

Vertica

Database Landscape

Big Data Management

NoSQL

Document

MongoDB

Elasticsearch

DynamoDB

CouchBase

Key-Value

Redis

Memcached

Riak KV

Aerospike

SimpleDB

Graph

Neo4J

Titan

Giraph

InfiniteGraph

RDF

Virtuoso

Stardog

GraphDB

Blazegraph

Jena

RDF4J

NewSQL

SAP HANA

Google Spanner

Clustrix

VoltDB

MemSQL

NuoDB

Object

Caché

Db4o

Versant

ObjectStore

Wide-Column

Cassandra

HBase

Accumulo

HyperTable

XML

MarkLogic

Sedna

Tamino

BaseX

eXist-db

Systems in blue are column-stores

NewSQL

Big Data Management

A look at the Big data Landscape… … - we’ve just been talking about the {CLICK} NoSQL database part of this landscape.

Today going to take a look at {CLICK} NewSQL databases

Not talked about Hadoop and Spark --- for analytics --- Hadoop is a framework with a distributed files system, that can use Apache Spark functions (eg machine learning).

-- there is another course option that does use Hadoop. (F21DP)

-- Hadoop will use files on disk (HDFS) + good for READ only like analysis of log files

NewSQL

Big Data Management

https://youtu.be/v5e_PasMdXc?t=538

Which NoSQL database to use for a particular project?

Frank Kane

Frank spent 9 years at Amazon and IMDb

http://www.sundog-education.com/

Frank has own company running courses online, and also works as instructor with udemy

{CLICK to play video clip from YouTube - about 10mins}

Examples:

(1) mySQL – in house experience

(2) Hadoop / Spark for log file analytics

(3) Cassandra

(4) MongoDB ( or NewSQL or could be Hbase but lacking strong external support so MongoDB better choice for this example )

NewSQL

Big Data Management

https://www.youtube.com/watch?v=w2hCJIZz3n8&feature=youtu.be&t=186

Database Universe – value of data over time

VoltDB Co-Founder and Chief Strategy Officer Scott Jarr on the value of data in real-time.

{CLICK image to watch video - play until 5m 57s }

Looking at the DB universe and figuring out where things might be most useful… always evolving.

Apache Hadoop a framework that allows for the distributed processing of large data sets across clusters -

- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data

- Map Reduce – division of jobs into smaller ones and results joined up (eg vote counting by district – sum to total votes)

Spark – compute engine provides supports for ETL, machine learning, stream processing, and graph computation and more

NewSQL

Big Data Management

History in No-tation

Big Data Management

http://strataconf.com/london/public/schedule/detail/32351

1970: NoSQL = We have no SQL

SQL does not yet exist

1980: NoSQL = Know SQL

SQL has been standardised

Used for all data retrieval

2000: NoSQL = NoSQL!

SQL is slow and hard, let’s ditch it

2005: NoSQL = Not only SQL

There are more than one query language to get data

2013: NoSQL = No, SQL!

We need SQL, NoSQL stores starting to put SQL frontends on

NewSQL

Big Data Management

History in No-tation

1970: NoSQL = We have no SQL

SQL does not yet exist

1980: NoSQL = Know SQL

SQL has been standardised

Used for all data retrieval

2000: NoSQL = NoSQL!

SQL is slow and hard, let’s ditch it

2005: NoSQL = Not only SQL

There are more than one query language to get data

2013: NoSQL = No, SQL!

We need SQL, NoSQL stores starting to put SQL frontends on

Big Data Management

Traditional DBMS Architecture

Big Data Management

https://www.cs.oberlin.edu/~jdonalds/311/fig02-01.png

Magnetic disk

Optimise data distribution on disk and access

Request data transfer to working memory

Security and cataloguing

Logging

Concurrency control

(locks)

Transactions

Schema management

Query compilation

Query execution

Assumptions:

Operating on single machine

Data on disk – disk access slow

Limited memory

A lot of subsystems for management purposes

Big Data Management

NewSQL

Work performed by RDBMS

Big Data Management

http://blog.jooq.org/2013/08/24/mit-prof-michael-stonebraker-the-traditional-rdbms-wisdom-is-all-wrong/

Do you spot any issues with this data?

Do you trust it?

Latching: protects memory data structures from concurrent threads

{CLICK x2}

Probably not as bad as this

Looks made up since everything is 24%

Data has come from a Dr Michael Stonebraker talk based on data from 2007

(Shore DBMS, instrumented on TPC-C benchmark)

--------

Dr Michael Stonebraker's career can be broadly divided into two phases:

University of California, Berkeley - Ingres and Postgres

Massachusetts Institute of Technology (MIT) where he developed more novel data management techniques such as C-Store, H-Store and SciDB.

Major prizes include the Turing Award in 2015.

Stonebraker is currently a Professor Emeritus at UC Berkeley and an adjunct professor at MIT.

NewSQL

Big Data Management

Latching Recovery Work Buffer Pool Locking 24 24 4 24 24

Work performed by RDBMS

Big Data Management

https://downloads.voltdb.com/datasheets_collateral/technical_overview.pdf

This looks more plausible

Based on actual executions

Moved from 4% work to 12%

NewSQL

Big Data Management

Percentage

Index Management Logging Locking Latching Buffer Management Work 11 20 18 10 29 12

Traditional RDBMS Assumptions

Operating on single machine

Data on disk – disk access slow

Limited memory

Single copy of each record

Many subsystems for management!

Big Data Management

http://hadoop4japan.files.wordpress.com/2012/04/kb_scale_outup.png

Column-oriented DBMS

Big Data Management

One of the bottlenecks in RDBMS is disk access.

Usually ROW based but also some systems support COLUMN based storage to disk

NewSQL

Big Data Management

Storing Relations

Column Storage

Only read relevant data

Whole record retrieval is rare

Efficient storage on disc

Compression 1:10 vs 1:3

Row storage

Traditional approach

Efficient scans of relations

Efficient record insert

Big Data Management

Row storage

Column storage

Better compression due to columns containing the same type of data - so can get 10x smaller with column storage rather than 3x smaller with row storage

All operators need to be native column implementations

Big Data Management

NewSQL

Compression

Rows contain values from different types

More entropy between values

Difficult to dense-pack

Columns have a set type

Less entropy in values

CPU consideration in compression

Big Data Management

Row Based Storage

Big Data Management

001:Wednesday,Addams,01412318743,[email protected];002:Bart, Simpson,0131444777,[email protected];003:Lisa,Simpson,NULL, [email protected];

004:Pugsley,Addams,01911211538,[email protected];

How many disk blocks need to be accessed to count the number of Simpson’s ?

	Firstname	Lastname	Number	Email
001	Wednesday	Addams	01412318743	[email protected]
004	Pugsley	Addams	01911211538	[email protected]
002	Bart	Simpson	0131444777	[email protected]
003	Lisa	Simpson	NULL	[email protected]

{CLICK}

Each section is a new disk block,

i.e. block 1 holds 3 records and block 2 holds 1 record

{CLICK}

Need to access both disk blocks to get all the SIMPSONs

Big Data Management

NewSQL

Column Based Storage

Big Data Management

	Firstname	Lastname	Number	Email
001	Wednesday	Addams	01412318743	[email protected]
004	Pugsley	Addams	01911211538	[email protected]
002	Bart	Simpson	0131444777	[email protected]
003	Lisa	Simpson	NULL	[email protected]

Wednesday:001,Bart:002,Lisa:003,Pugsley:004;Addams:001,004, Simpson:002,003;01412318743:001,0131444777:002, NULL:003,01911211538:004;

How many disk blocks need to be accessed to count the number of ‘Simpson’?

[email protected]:001,[email protected]:002,[email protected]:003,[email protected]:004;

Note removal of repeated column values

{CLICK to show data for BLOCK 2}

{CLICK to show in colours}

red - firstname data

green - lastname

purple - numbers

blue - email

{CLICK for question}

Only need to access the first disk block

Big Data Management

NewSQL

Hardware Affects

Traditional HW

Slow CPU

Limited memory

Large spinning disks

Slow access

Favours row storage!

Modern HW

Fast CPUs

Multiple caches

Large memory

Flash disk

Fast access

Favours column storage!

Big Data Management

https://www.youtube.com/watch?v=8KGVFB3kVHQ

Column-Orientated DBMS

Column Oriented Applications

Data warehousing

Mainly read operations

Information Retrieval

Efficient use of indexes

Scientific databases e.g. SLOAN Digital Sky Survey

Wide tuples, focus on small number at a time

RDF

Matches query patterns based on graphs

Big Data Management

Column Orientated Relational Database Systems

Big Data Management

C-Store

Commercial and Open Source

Many still use SQL as the query language!

Big Data Management

NewSQL

New SQL

Big Data Management

Builds on strengths of RDBMS

RDBMS re-visited

Big Data Management

NewSQL

451 Group’s Definition of New SQL

A DBMS that delivers the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID, or to improve performance for appropriate workloads.

https://451research.com/

https://451research.com/report-short?entityId=66963

Big Data Management

451 Research is a global research and advisory firm.

NewSQL

Big Data Management

Stonebraker’s Definition

SQL as primary interface

ACID support for transactions

Non-locking concurrency control

High per-node performance

Parallel, shared-nothing architecture

M. Stonebraker. New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps. Blog@Communications of the ACM. 16 June 2011. http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext

Big Data Management

http://hadoop4japan.files.wordpress.com/2012/04/kb_scale_outup.png

If the system is going to perform well then each node needs to be high performance

NewSQL

Big Data Management

Application Areas

Big Data Management

http://imagens.canaltech.com.br/50452.69272-OLTP-OLAP.png

OLTP: Online Transaction Processing

Financial

ATMs

Order systems

Retail sales

…

Day-to-day operations

Requires ACID guarantees

What about CAP?

Require guarantees, not possible with NoSQL

Want scalability of NoSQL

Big Data Management

NewSQL

Key requirements

Big Data Management

http://hadoop4japan.files.wordpress.com/2012/04/kb_scale_outup.png

Strong transactional guarantees – ACID

Not offered by most NoSQL

Consistency requirements

Scale-out not scale-up

Support distribution through

Concurrency control

Flow control

Query processing

NewSQL

Big Data Management

Transaction Bottlenecks

Disk reads/writes

Persist data

Undo/redo logs

Network Communications

Intra-node

Client-server

Concurrency control

Locking: restricts concurrent data access

Latching: restricts concurrent index access

Big Data Management

Disks and communications are the bottlenecks in RDBMS

NewSQL

Big Data Management

Processing

Transactions Revisited Solutions

Solutions

Short-lived:

No user interaction

Time limit (5-10 seconds)

Access (small) subset of data

Data must be indexed

No full table scans

Repetitive interactions

Same queries with different inputs

Strategy

Precompile transactions

Parameterized

Serialize transactions

No delays for disk or user

Big Data Management

Want to offer transactions that are

- quick

- access small subset of the data (not full table scans)

- precompile transactions with input parameters

- reduce locking of memory

{next slide for solutions implemented}

NewSQL

Big Data Management

VoltDB Solution

Big Data Management

https://downloads.voltdb.com/datasheets_collateral/technical_overview.pdf

Timestamp CC

MVCC

In-memory

Single threaded transactions

Replication

Failover

- multi version concurrency control

Work performed by RDBMS

Solutions to the bottlenecks:

{CLICK}

Concurrency Control (CC) is not by locking and 2 phase commit

- MVCC= multi version concurrency control

{CLICK} - use memory

{CLICK}

Single threaded: stop multi-core machines being blocked on shared memory access

{CLICK} have nodes replicate each other for fault tolerance

NewSQL

Big Data Management

Percentage

Index Management Logging Locking Latching Buffer Management Work 11 20 18 10 29 12

Volt DB Video

Big Data Management

https://www.youtube.com/watch?v=XPA7zNRshRU

Ryan Betts, VoltDB's Chief Technology Officer speaks in-depth about VoltDB's architecture.

{CLICK on image to go to YouTube Video - 3m 41s video}

NewSQL

Big Data Management

System Architecture

Redesigned for latest hardware

Big Data Management

New Architecture Assumptions

Distributed shared-nothing cluster

Move queries to the data

Aim for stateless query execution

Data resides in memory

Backed-up to disk

Multiple copies

No buffer pools

Avoid locking (serialised transactions)

Minimise logging: DDL statements only

Big Data Management

In-memory requires rethink of physical representation

No longer constrained by disk assumptions

Big Data Management

NewSQL

Limitations

Latency: still going to wait to get data

Limited by speed of light

Limited by bandwidth

SQL subset

No negated sub-queries

Subset of aggregates

Ad hoc schema changes

Big Data Management

But (like Dynamo), tables are tunable towards CP

Where are NewSQL?

Big Data Management

CA: Guarantees to give a correct response but only while network works fine

(Centralised / Traditional)

CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)

AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)

Big Data Management

NewSQL

NewSQL Systems

Big Data Management

Google F1 not strictly NewSQL. Hierarchical schema. Used for AdWords (the bit of google that makes all their money)

VoltDB is a commercialization of H-Store

NuoDB: MVCC + append only key-value storage

VoltDB: In-memory, multi-master replication, with disk snapshots

ScaleDB: Cluster manager for MySQL with locks

FoundationDB: Key-value storage with SQL and ACID layer

Clustrix: self-managed MySQL replacement for a cluster

MemSQL: Column-oriented disk persistence, in memory row oriented, lock-free, compiled plans. Replicas

Big Data Management

NewSQL

Distributed DBMS: mySQL cluster

MySQL Cluster is implemented through the NDB or NDBCLUSTER storage engine for MySQL ("NDB" stands for Network Database)

The MySQL Cluster is a fault tolerant in-memory clustered database designed for high availability (99.9%) and fast automatic fail overall running on cost-effective commodity hardware.

Big Data Management

https://medium.com/@dilekamadushan/introduction-to-mysql-cluster-84393a12e2f9

Distributed DBMS: Postgresql-XL

Postgres-XL is a horizontally scalable open source SQL database cluster, flexible enough to handle varying database workloads:

OLTP write-intensive workloads

Business Intelligence requiring MPP parallelism

Operational data store

Key-value store

GIS Geospatial

Mixed-workload environments

Multi-tenant provider hosted environments

Big Data Management

Citus

https://www.citusdata.com/product

Big Data Management

Video from link – extension to scale out horizontally - sharding

NewSQL

Big Data Management

OmniSci

GPU databases

Big Data Management

SQL

https://www.holistics.io/blog/the-rise-of-sql-based-data-modeling-and-dataops/

Big Data Management

Couchbase N1QL – (nikel)

N1QL (pronounced “nickel”) is Couchbase’s next-generation query language. N1QL aims to meet the query needs of distributed document-oriented databases. This document specifies the syntax and semantics of the SELECT statement in N1QL.

The N1QL data model derives its name from the non-first normal form, which is a superset and generalization of the relational first normal form (1NF).

https://query-tutorial.couchbase.com/tutorial/#1

Big Data Management

Different architecture choices

NuoDB: MVCC + append only key-value storage

VoltDB: In-memory, multi-master replication, with disk snapshots

ScaleDB: Cluster manager for MySQL with locks

FoundationDB: Key-value storage with SQL and ACID layer

Clustrix: self-managed MySQL replacement for a cluster

MemSQL: Column-oriented disk persistence, in memory row oriented, lock-free, compiled plans. Replicas

Big Data Management

NewSQL

Big Data Management

Summary

RDBMS evolution due to

Out-of-date hardware assumptions

First relational column-stores (different from wide-column stores)

Second NewSQL

NewSQL: Reaction to NoSQL

Scale-out not up

Replication

Big Data Management

Relational Column Orientated Databases

Benefits

OLAP

Data warehouses

Column-based operations:

Aggregate queries

Adjust price of values

Reduced storage space

Disadvantages

OLTP

Row-based operations

New record

Retrieve multiple values

Big Data Management

S. Harizopoulos, D. Abadi and P. Boncz, "Column-Oriented Database Systems", VLDB 2009 Tutorial, p. 5. http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf

No faster for select * queries

Data compressiong is better for COLUMNS than ROWS… typically get 1:10 compression rather than 1:3. - because COLUMNS more similar data (better for compressing)

OLTP – online transaction processing - typically, OLTP systems are used for order entry, financial transactions, customer relationship management (CRM) and retail sales.

Big Data Management

NewSQL

New SQL (response to NoSQL)

Distributed shared-nothing cluster

Scale-out

Data resides in memory

Fast transactions

No user interaction

Tight upper time limits

ACID guarantees

Target OLTP

Big Data Management

Links to check out:

Big Data Management

Why SQL is awesome (video)

https://www.youtube.com/watch?v=wTPGW1PNy_Y&app=desktop

N1QL – CouchBase (web page and interactive shell)

https://www.couchbase.com/products/n1ql

https://query-tutorial.couchbase.com/tutorial/#1

Video on why SQL is awesome!!

NewSQL

Big Data Management

Data Files Catalog Logs Index Structure Database

Storage Manager

Transaction Manager

Security Control

Disk Manager

Buffer Manager

Code Generator

Physical Optimiser

Logical Optimiser

Query Parser

Query Processor

SQL Queries

Data FilesCatalog Logs

Index

Structure

Database

Storage

Manager

Transaction

Manager

Security

Control

Disk

Manager

Buffer

Manager

Code

Generator

Physical

Optimiser

Logical

Optimiser

Query

Parser

Query

Processor

SQL Queries

VLDB 2009 Tutorial Column-Oriented Database Systems 2

What is a column-store?

VLDB 2009 Tutorial Column-Oriented Database Systems 2

row-store column-store Date CustomerProductStore

+ easy to add/modify a record

- might read in unnecessary data

+ only need to read in relevant data

- tuple writes require multiple accesses

=> suitable for read-mostly, read-intensive, large data repositories

Date Store Product Customer Price Price

VLDB 2009 Tutorial Column-Oriented Database Systems 2

What is a column-store?

VLDB 2009 Tutorial Column-Oriented Database Systems 2

row-store column-store Date CustomerProductStore

+ easy to add/modify a record

- might read in unnecessary data

+ only need to read in relevant data

- tuple writes require multiple accesses

=> suitable for read-mostly, read-intensive, large data repositories

Date Store Product Customer Price Price