Big Data Management Question Paper
NewSQL
Big Data Management
Phil Bartie phil.bartie@hw.ac.uk EM G.29
1
Big Data Management
NewSQL
Materials released under CC-BY License
You are free to:
Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Big Data Management
2
Outline
Relational Database Storage
Row oriented
Column oriented
Different from wide-column stores
NewSQL
Relational community reaction to NoSQL!
Big Data Management
3
Relational
Oracle
MySQL
MS SQL Server
PostgreSQL
DB2
MS Access
SQLite
Teradata
SAP Adaptive Server
FileMaker
Hive
MariaDB
Informix
Vertica
Database Landscape
Big Data Management
4
NoSQL
Document
MongoDB
Elasticsearch
DynamoDB
CouchBase
Key-Value
Redis
Memcached
Riak KV
Aerospike
SimpleDB
Graph
Neo4J
Titan
Giraph
InfiniteGraph
RDF
Virtuoso
Stardog
GraphDB
Blazegraph
Jena
RDF4J
NewSQL
SAP HANA
Google Spanner
Clustrix
VoltDB
MemSQL
NuoDB
Object
Caché
Db4o
Versant
ObjectStore
Wide-Column
Cassandra
HBase
Accumulo
HyperTable
XML
MarkLogic
Sedna
Tamino
BaseX
eXist-db
Systems in blue are column-stores
NewSQL
Big Data Management
4
5
A look at the Big data Landscape… … - we’ve just been talking about the {CLICK} NoSQL database part of this landscape.
Today going to take a look at {CLICK} NewSQL databases
Not talked about Hadoop and Spark --- for analytics --- Hadoop is a framework with a distributed files system, that can use Apache Spark functions (eg machine learning).
-- there is another course option that does use Hadoop. (F21DP)
-- Hadoop will use files on disk (HDFS) + good for READ only like analysis of log files
NewSQL
Big Data Management
5
6
https://youtu.be/v5e_PasMdXc?t=538
Which NoSQL database to use for a particular project?
Frank Kane
Frank spent 9 years at Amazon and IMDb
http://www.sundog-education.com/
Frank has own company running courses online, and also works as instructor with udemy
{CLICK to play video clip from YouTube - about 10mins}
Examples:
(1) mySQL – in house experience
(2) Hadoop / Spark for log file analytics
(3) Cassandra
(4) MongoDB ( or NewSQL or could be Hbase but lacking strong external support so MongoDB better choice for this example )
NewSQL
Big Data Management
6
7
https://www.youtube.com/watch?v=w2hCJIZz3n8&feature=youtu.be&t=186
Database Universe – value of data over time
VoltDB Co-Founder and Chief Strategy Officer Scott Jarr on the value of data in real-time.
{CLICK image to watch video - play until 5m 57s }
Looking at the DB universe and figuring out where things might be most useful… always evolving.
Apache Hadoop a framework that allows for the distributed processing of large data sets across clusters -
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data
- Map Reduce – division of jobs into smaller ones and results joined up (eg vote counting by district – sum to total votes)
Spark – compute engine provides supports for ETL, machine learning, stream processing, and graph computation and more
NewSQL
Big Data Management
7
History in No-tation
Big Data Management
8
http://strataconf.com/london/public/schedule/detail/32351
1970: NoSQL = We have no SQL
SQL does not yet exist
1980: NoSQL = Know SQL
SQL has been standardised
Used for all data retrieval
2000: NoSQL = NoSQL!
SQL is slow and hard, let’s ditch it
2005: NoSQL = Not only SQL
There are more than one query language to get data
2013: NoSQL = No, SQL!
We need SQL, NoSQL stores starting to put SQL frontends on
NewSQL
Big Data Management
8
History in No-tation
1970: NoSQL = We have no SQL
SQL does not yet exist
1980: NoSQL = Know SQL
SQL has been standardised
Used for all data retrieval
2000: NoSQL = NoSQL!
SQL is slow and hard, let’s ditch it
2005: NoSQL = Not only SQL
There are more than one query language to get data
2013: NoSQL = No, SQL!
We need SQL, NoSQL stores starting to put SQL frontends on
Big Data Management
9
Traditional DBMS Architecture
Big Data Management
10
https://www.cs.oberlin.edu/~jdonalds/311/fig02-01.png
Magnetic disk
Optimise data distribution on disk and access
Request data transfer to working memory
Security and cataloguing
Logging
Concurrency control
(locks)
Transactions
Schema management
Query compilation
Query execution
Assumptions:
Operating on single machine
Data on disk – disk access slow
Limited memory
A lot of subsystems for management purposes
10
Big Data Management
NewSQL
Work performed by RDBMS
Big Data Management
11
Do you spot any issues with this data?
Do you trust it?
Latching: protects memory data structures from concurrent threads
{CLICK x2}
Probably not as bad as this
Looks made up since everything is 24%
Data has come from a Dr Michael Stonebraker talk based on data from 2007
(Shore DBMS, instrumented on TPC-C benchmark)
--------
Dr Michael Stonebraker's career can be broadly divided into two phases:
University of California, Berkeley - Ingres and Postgres
Massachusetts Institute of Technology (MIT) where he developed more novel data management techniques such as C-Store, H-Store and SciDB.
Major prizes include the Turing Award in 2015.
Stonebraker is currently a Professor Emeritus at UC Berkeley and an adjunct professor at MIT.
NewSQL
Big Data Management
11
Latching Recovery Work Buffer Pool Locking 24 24 4 24 24
Work performed by RDBMS
Big Data Management
12
https://downloads.voltdb.com/datasheets_collateral/technical_overview.pdf
This looks more plausible
Based on actual executions
Moved from 4% work to 12%
NewSQL
Big Data Management
12
Index Management Logging Locking Latching Buffer Management Work 11 20 18 10 29 12
Traditional RDBMS Assumptions
Operating on single machine
Data on disk – disk access slow
Limited memory
Single copy of each record
Many subsystems for management!
Big Data Management
13
http://hadoop4japan.files.wordpress.com/2012/04/kb_scale_outup.png
Column-oriented DBMS
Big Data Management
14
One of the bottlenecks in RDBMS is disk access.
Usually ROW based but also some systems support COLUMN based storage to disk
NewSQL
Big Data Management
14
Storing Relations
Column Storage
Only read relevant data
Whole record retrieval is rare
Efficient storage on disc
Compression 1:10 vs 1:3
Row storage
Traditional approach
Efficient scans of relations
Efficient record insert
Big Data Management
15
Row storage
Column storage
Better compression due to columns containing the same type of data - so can get 10x smaller with column storage rather than 3x smaller with row storage
All operators need to be native column implementations
15
Big Data Management
NewSQL
Compression
Rows contain values from different types
More entropy between values
Difficult to dense-pack
Columns have a set type
Less entropy in values
CPU consideration in compression
Big Data Management
16
Row Based Storage
Big Data Management
17
001:Wednesday,Addams,01412318743,wed@addams.com;002:Bart, Simpson,0131444777,bart@example.com;003:Lisa,Simpson,NULL, lisa@macs.ac.uk;
004:Pugsley,Addams,01911211538,pug@addams.com;
How many disk blocks need to be accessed to count the number of Simpson’s ?
| Firstname | Lastname | Number | ||
| 001 | Wednesday | Addams | 01412318743 | wed@addams.com |
| 004 | Pugsley | Addams | 01911211538 | pug@addams.com |
| 002 | Bart | Simpson | 0131444777 | bart@example.com |
| 003 | Lisa | Simpson | NULL | lisa@macs.hw.ac.uk |
{CLICK}
Each section is a new disk block,
i.e. block 1 holds 3 records and block 2 holds 1 record
{CLICK}
Need to access both disk blocks to get all the SIMPSONs
17
Big Data Management
NewSQL
Column Based Storage
Big Data Management
18
| Firstname | Lastname | Number | ||
| 001 | Wednesday | Addams | 01412318743 | wed@addams.com |
| 004 | Pugsley | Addams | 01911211538 | pug@addams.com |
| 002 | Bart | Simpson | 0131444777 | bart@example.com |
| 003 | Lisa | Simpson | NULL | lisa@macs.hw.ac.uk |
Wednesday:001,Bart:002,Lisa:003,Pugsley:004;Addams:001,004, Simpson:002,003;01412318743:001,0131444777:002, NULL:003,01911211538:004;
Wednesday:001,Bart:002,Lisa:003,Pugsley:004;Addams:001,004, Simpson:002,003;01412318743:001,0131444777:002, NULL:003,01911211538:004;
How many disk blocks need to be accessed to count the number of ‘Simpson’?
wed@addams.com:001,bart@example.com:002,lisa@macs.hw.ac.uk:003,pug@addams.com:004;
wed@addams.com:001,bart@example.com:002,lisa@macs.hw.ac.uk:003,pug@addams.com:004;
Note removal of repeated column values
{CLICK to show data for BLOCK 2}
{CLICK to show in colours}
red - firstname data
green - lastname
purple - numbers
blue - email
{CLICK for question}
Only need to access the first disk block
18
Big Data Management
NewSQL
Hardware Affects
Traditional HW
Slow CPU
Limited memory
Large spinning disks
Slow access
Favours row storage!
Modern HW
Fast CPUs
Multiple caches
Large memory
Flash disk
Fast access
Favours column storage!
Big Data Management
19
https://www.youtube.com/watch?v=8KGVFB3kVHQ
Column-Orientated DBMS
Column Oriented Applications
Data warehousing
Mainly read operations
Information Retrieval
Efficient use of indexes
Scientific databases e.g. SLOAN Digital Sky Survey
Wide tuples, focus on small number at a time
RDF
Matches query patterns based on graphs
Big Data Management
21
Column Orientated Relational Database Systems
Big Data Management
22
C-Store
Commercial and Open Source
Many still use SQL as the query language!
22
Big Data Management
NewSQL
New SQL
Big Data Management
23
Builds on strengths of RDBMS
RDBMS re-visited
23
Big Data Management
NewSQL
451 Group’s Definition of New SQL
A DBMS that delivers the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID, or to improve performance for appropriate workloads.
https://451research.com/report-short?entityId=66963
Big Data Management
24
451 Research is a global research and advisory firm.
NewSQL
Big Data Management
24
Stonebraker’s Definition
SQL as primary interface
ACID support for transactions
Non-locking concurrency control
High per-node performance
Parallel, shared-nothing architecture
M. Stonebraker. New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps. Blog@Communications of the ACM. 16 June 2011. http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext
Big Data Management
25
http://hadoop4japan.files.wordpress.com/2012/04/kb_scale_outup.png
If the system is going to perform well then each node needs to be high performance
NewSQL
Big Data Management
25
Application Areas
Big Data Management
26
http://imagens.canaltech.com.br/50452.69272-OLTP-OLAP.png
OLTP: Online Transaction Processing
Financial
ATMs
Order systems
Retail sales
…
Day-to-day operations
Requires ACID guarantees
What about CAP?
Require guarantees, not possible with NoSQL
Want scalability of NoSQL
26
Big Data Management
NewSQL
Key requirements
Big Data Management
27
http://hadoop4japan.files.wordpress.com/2012/04/kb_scale_outup.png
Strong transactional guarantees – ACID
Not offered by most NoSQL
Consistency requirements
Scale-out not scale-up
Support distribution through
Concurrency control
Flow control
Query processing
NewSQL
Big Data Management
27
Transaction Bottlenecks
Disk reads/writes
Persist data
Undo/redo logs
Network Communications
Intra-node
Client-server
Concurrency control
Locking: restricts concurrent data access
Latching: restricts concurrent index access
Big Data Management
28
Disks and communications are the bottlenecks in RDBMS
NewSQL
Big Data Management
28
T1
T2
T3
Processing
Transactions Revisited Solutions
Solutions
Short-lived:
No user interaction
Time limit (5-10 seconds)
Access (small) subset of data
Data must be indexed
No full table scans
Repetitive interactions
Same queries with different inputs
Strategy
Precompile transactions
Parameterized
Serialize transactions
No delays for disk or user
Big Data Management
29
Want to offer transactions that are
- quick
- access small subset of the data (not full table scans)
- precompile transactions with input parameters
- reduce locking of memory
{next slide for solutions implemented}
NewSQL
Big Data Management
29
VoltDB Solution
Big Data Management
30
https://downloads.voltdb.com/datasheets_collateral/technical_overview.pdf
Timestamp CC
MVCC
In-memory
Single threaded transactions
Replication
Failover
- multi version concurrency control
Work performed by RDBMS
Solutions to the bottlenecks:
{CLICK}
Concurrency Control (CC) is not by locking and 2 phase commit
- MVCC= multi version concurrency control
{CLICK} - use memory
{CLICK}
Single threaded: stop multi-core machines being blocked on shared memory access
{CLICK} have nodes replicate each other for fault tolerance
NewSQL
Big Data Management
30
Index Management Logging Locking Latching Buffer Management Work 11 20 18 10 29 12
Volt DB Video
Big Data Management
31
https://www.youtube.com/watch?v=XPA7zNRshRU
Ryan Betts, VoltDB's Chief Technology Officer speaks in-depth about VoltDB's architecture.
{CLICK on image to go to YouTube Video - 3m 41s video}
NewSQL
Big Data Management
31
System Architecture
Redesigned for latest hardware
Big Data Management
32
New Architecture Assumptions
Distributed shared-nothing cluster
Move queries to the data
Aim for stateless query execution
Data resides in memory
Backed-up to disk
Multiple copies
No buffer pools
Avoid locking (serialised transactions)
Minimise logging: DDL statements only
Big Data Management
33
In-memory requires rethink of physical representation
No longer constrained by disk assumptions
33
Big Data Management
NewSQL
Limitations
Latency: still going to wait to get data
Limited by speed of light
Limited by bandwidth
SQL subset
No negated sub-queries
Subset of aggregates
Ad hoc schema changes
Big Data Management
34
But (like Dynamo), tables are tunable towards CP
Where are NewSQL?
Big Data Management
35
C
A
P
CA: Guarantees to give a correct response but only while network works fine
(Centralised / Traditional)
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)
CA
35
Big Data Management
NewSQL
NewSQL Systems
Big Data Management
36
Google F1 not strictly NewSQL. Hierarchical schema. Used for AdWords (the bit of google that makes all their money)
VoltDB is a commercialization of H-Store
NuoDB: MVCC + append only key-value storage
VoltDB: In-memory, multi-master replication, with disk snapshots
ScaleDB: Cluster manager for MySQL with locks
FoundationDB: Key-value storage with SQL and ACID layer
Clustrix: self-managed MySQL replacement for a cluster
MemSQL: Column-oriented disk persistence, in memory row oriented, lock-free, compiled plans. Replicas
36
Big Data Management
NewSQL
Distributed DBMS: mySQL cluster
MySQL Cluster is implemented through the NDB or NDBCLUSTER storage engine for MySQL ("NDB" stands for Network Database)
The MySQL Cluster is a fault tolerant in-memory clustered database designed for high availability (99.9%) and fast automatic fail overall running on cost-effective commodity hardware.
Big Data Management
37
Distributed DBMS: Postgresql-XL
Postgres-XL is a horizontally scalable open source SQL database cluster, flexible enough to handle varying database workloads:
OLTP write-intensive workloads
Business Intelligence requiring MPP parallelism
Operational data store
Key-value store
GIS Geospatial
Mixed-workload environments
Multi-tenant provider hosted environments
Big Data Management
38
Citus
https://www.citusdata.com/product
Big Data Management
39
Video from link – extension to scale out horizontally - sharding
NewSQL
Big Data Management
39
OmniSci
GPU databases
Big Data Management
40
SQL
https://www.holistics.io/blog/the-rise-of-sql-based-data-modeling-and-dataops/
Big Data Management
41
Couchbase N1QL – (nikel)
N1QL (pronounced “nickel”) is Couchbase’s next-generation query language. N1QL aims to meet the query needs of distributed document-oriented databases. This document specifies the syntax and semantics of the SELECT statement in N1QL.
The N1QL data model derives its name from the non-first normal form, which is a superset and generalization of the relational first normal form (1NF).
https://query-tutorial.couchbase.com/tutorial/#1
Big Data Management
42
Different architecture choices
NuoDB: MVCC + append only key-value storage
VoltDB: In-memory, multi-master replication, with disk snapshots
ScaleDB: Cluster manager for MySQL with locks
FoundationDB: Key-value storage with SQL and ACID layer
Clustrix: self-managed MySQL replacement for a cluster
MemSQL: Column-oriented disk persistence, in memory row oriented, lock-free, compiled plans. Replicas
Big Data Management
43
NewSQL
Big Data Management
43
Summary
RDBMS evolution due to
Out-of-date hardware assumptions
First relational column-stores (different from wide-column stores)
Second NewSQL
NewSQL: Reaction to NoSQL
Scale-out not up
Replication
Big Data Management
44
Relational Column Orientated Databases
Benefits
OLAP
Data warehouses
Column-based operations:
Aggregate queries
Adjust price of values
Reduced storage space
Disadvantages
OLTP
Row-based operations
New record
Retrieve multiple values
Big Data Management
45
S. Harizopoulos, D. Abadi and P. Boncz, "Column-Oriented Database Systems", VLDB 2009 Tutorial, p. 5. http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
No faster for select * queries
Data compressiong is better for COLUMNS than ROWS… typically get 1:10 compression rather than 1:3. - because COLUMNS more similar data (better for compressing)
OLTP – online transaction processing - typically, OLTP systems are used for order entry, financial transactions, customer relationship management (CRM) and retail sales.
45
Big Data Management
NewSQL
New SQL (response to NoSQL)
Distributed shared-nothing cluster
Scale-out
Data resides in memory
Fast transactions
No user interaction
Tight upper time limits
ACID guarantees
Target OLTP
Big Data Management
46
Links to check out:
Big Data Management
47
Why SQL is awesome (video)
https://www.youtube.com/watch?v=wTPGW1PNy_Y&app=desktop
N1QL – CouchBase (web page and interactive shell)
https://www.couchbase.com/products/n1ql
https://query-tutorial.couchbase.com/tutorial/#1
Video on why SQL is awesome!!
NewSQL
Big Data Management
47
Data Files Catalog Logs Index Structure Database
Storage Manager
Transaction Manager
Security Control
Disk Manager
Buffer Manager
Code Generator
Physical Optimiser
Logical Optimiser
Query Parser
Query Processor
SQL Queries
Data FilesCatalog Logs
Index
Structure
Database
Storage
Manager
Transaction
Manager
Security
Control
Disk
Manager
Buffer
Manager
Code
Generator
Physical
Optimiser
Logical
Optimiser
Query
Parser
Query
Processor
SQL Queries
VLDB 2009 Tutorial Column-Oriented Database Systems 2
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What is a column-store?
VLDB 2009 Tutorial Column-Oriented Database Systems 2
row-store column-store Date CustomerProductStore
+ easy to add/modify a record
- might read in unnecessary data
+ only need to read in relevant data
- tuple writes require multiple accesses
=> suitable for read-mostly, read-intensive, large data repositories
Date Store Product Customer Price Price
VLDB 2009 Tutorial Column-Oriented Database Systems 2
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What is a column-store?
VLDB 2009 Tutorial Column-Oriented Database Systems 2
row-store column-store Date CustomerProductStore
+ easy to add/modify a record
- might read in unnecessary data
+ only need to read in relevant data
- tuple writes require multiple accesses
=> suitable for read-mostly, read-intensive, large data repositories
Date Store Product Customer Price Price