Big Data Management Question Paper
DBMS Landscape
Big Data Management
Phil Bartie phil.bartie@hw.ac.uk EM G.29
Material from : Alasdair Gray, Heriot-Watt Aidan Hogan, Universidad de Chile Tim Wellhausen, Freelance Software Developer
Limitations of distributed computing
Brewer’s CAP Theorem
Big Data Management
3
http://learnyousomeerlang.com/static/img/cap.png
But first … ACID For traditional (non-distributed) databases …
1. Atomicity: ¤ Transactions all or nothing: fail cleanly
2. Consistency: ¤ Doesn’t break constraints/rules
3. Isolation: ¤ Transaction doesn’t show elsewhere until COMMIT / ROLLBACK
4. Durability: ¤ Once completed a transaction is permanent (saved to non-volatile memory)
Big Data Management
4
Transaction Example
Big Data Management
5
SCENARIO: Transfer £125 from Bob to Mia - to do this requires multiple steps.
ATOMIC: Transaction only valid if all the steps work (all or nothing)
CONSISTENT: If part of the transaction fails then undo everything in the transaction
tr a
ns a
c tio
n
Transaction Example: Isolation
Big Data Management
6
USER 1
USER 2
SELECT * from account;
id | fname | amount ----+-------+-------- 1 | frank | 300 2 | sarah | 1500 3 | jen | 2500 4 | mia | 300 5 | bob | 600
(5 rows)
Starting Position
Transaction Example: Isolation
Big Data Management
7
USER1 carries out the COMMIT ...and now both users see the same values in the table
ISOLATION: the results from the transaction don’t show elsewhere until COMMIT or ROLLBACK
USER 1 USER 2
What is CAP?
Guarantees for a distributed system:
1. Consistency: ¤ All nodes have a consistent view of the system
2. Availability: ¤ Every read/write is acted upon
3. Partition-tolerance: ¤ The system works even if messages are lost
Big Data Management
8
Distributed Replication
Big Data Management
9
Consistency Big Data Management
10
There’s 891 users in ‘M’
There’s 891 users in ‘M’
Availability Big Data Management
11
891How many users start with ‘M’
Partition-Tolerance Big Data Management
12
891 How many users start with ‘M’
The CAP Question
Can a distributed system guarantee consistency (all nodes have the same up-to-date view),
availability (every read/write is acted upon) and
partition-tolerance (the system works even if messages are lost)
at the same time?
Big Data Management
13
What do you think?
The CAP Answer
Big Data Management
14
The CAP “Proof”
Big Data Management
15
How many users start with ‘M’
There’s 891 users in ‘M’
There’s 891 users in ‘M’
891
There’s 892 users in ‘M’
The CAP “Proof” ¤Consider machines m1 and m2 on either side of a
partition: ¤ If an update is allowed on m2 (Availability), then m1 cannot see
the change (lose Consistency) ¤ To make sure that m1 and m2 have the same, up-to-date view
(Consistency), neither m1 nor m2 can accept any updates (lose Availability)
¤ Thus, only when m1 and m2 can communicate (lose Partition tolerance) can Availability and Consistency be guaranteed
Big Data Management
16
The CAP Theorem A distributed system cannot guarantee consistency
(all nodes have the same up-to-date view), availability
(every read/write is acted upon) and partition-tolerance
(the system works even if messages are lost) at the same time. (“Proof” as shown on previous slide J)
Big Data Management
17
18
Big Data Management
???
Imagine this Scenario
Big Data Management
19
• Working in cafe with a friend on a film script
• Each with a copy of the script
• You go home & your friend stays in the cafe
• You carry on editing the script, so does your friend talking over the phone.
• Your friend’s mobile then runs out of power [i.e. the network now is down / you are partitioned]
• Your flat mate asks a question about the script…
• You can give them an answer based on your version of the script (AVAILABLE but not consistent) or say don’t know yet until you sync edits with the other author when phone is charged (CONSISTENT but not currently available)
The CAP Triangle
Big Data Management
20
C
A P
CAP Systems
Big Data Management
21
C
A P
(No intersection)
CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)
CA System
Big Data Management
22
How many users start with ‘M’
There’s 891 users in ‘M’
There’s 891 users in ‘M’
There’s 892 users in ‘M’
There’s 892 users in ‘M’892
No support for network partition
CP System
Big Data Management
23
How many users start with ‘M’
There’s 891 users in ‘M’
There’s 891 users in ‘M’891
Can’t allow the update to take place
AP System
Big Data Management
24
How many users start with ‘M’
There’s 891 users in ‘M’
There’s 891 users in ‘M’891
There’s 892 users in ‘M’
BASE (AP) ¤Basically Available
¤ Pretty much always “up”
¤Soft State ¤ Replicated, cached data – not always consistent
¤Eventual Consistency ¤ Stale data tolerated, but data will be copied across at some
point
Examples: Amazon, eBay, Google, DNS …
Big Data Management
25
The CAP Theorem ¤C,A in CAP ≠ C,A in ACID
¤Simplified model ¤Partitions are rare ¤Systems may be a mix of CA/CP/AP ¤C/A/P often continuous in reality!
¤But concept useful/frequently discussed: ¤How to handle Partitions?
¤ Availability? or ¤ Consistency?
Big Data Management
26
Extension to CAP theorem: PACELC theorem (pass-elk)
Big Data Management
27
If there is a Partition:
there is a trade-off between Availability and Consistency
Else :
there is a trade-off between Latency vs Consistency
PACELC (pass-elk)
Big Data Management
28
https://youtu.be/hUd_9FENShA?t=989
Relational Databases Storing Structured Data
Big Data Management
29
Relational Databases
Big Data Management
30
¤ Regular tabular structure
¤ Rigid defined schema
Database System
Big Data Management
31
Database
Application Program
Application Program
User User
User User
User User
User User User
Queries
RDBMS Architecture
Big Data Management
33
Data Files Catalog Logs Index Structure Database
Storage Manager
Transaction Manager
Security Control
Disk Manager
Buffer Manager
Code Generator
Physical Optimiser
Logical Optimiser
Query Parser
Query Processor
SQL Queries
ANSI-SPARC Three Tier Architecture
Big Data Management
35
Database
Physical Schema
Logical Schema
Application View
Application View
Application View
Application View
Logical Data Independence
Physical Data Independence
RDBMS: Performance Overheads ¤ Structured Query Language (SQL):
¤ Declarative Language ¤ Lots of Rich Features ¤ Difficult to Optimise!
¤ Atomicity, Consistency, Isolation, Durability (ACID): ¤ Makes sure your database stays correct
¤ Even if there’s a lot of traffic! ¤ Transactions incur a lot of overhead
¤ Multi-phase locks, multi-versioning, write ahead logging
¤Distribution not straightforward
Big Data Management
37
Relational Databases: One Size Fits All?
Big Data Management
39
M. Stonebraker and U. Cetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in 21st International Conference on Data Engineering (ICDE’05), 2005, pp. 2–11.
Alternatives to Relational Databases The rise of the NoSQL movement
Big Data Management
40
Relational Oracle MySQL MS SQL Server
PostgreSQL DB2
MS Access SQLite
Teradata
SAP Adaptive Server
FileMaker Hive
MariaDB Informix Vertica
Database Landscape
Big Data Management
41
NoSQL
DocumentMongoDB
Elasticsearch
DynamoDB CouchBase
Key-Value Redis Memcached
Riak KV Aerospike SimpleDB
Graph Neo4J Titan Giraph InfiniteGraph
RDF Virtuoso Stardog
GraphDB Blazegraph
Jena RDF4J
NewSQL SAP HANA Google Spanner Clustrix VoltDB MemSQL NuoDB
Object Caché Db4o Versant
ObjectStore
Wide-Column Cassandra HBase Accumulo
HyperTable
XML MarkLogic
Sedna Tamino
BaseX eXist-db
DB-Engines Ranking
Big Data Management
43
The DB-Engines Ranking is a list of database management systems ranked by their current popularity. We measure the popularity of a system by using the following parameters:
•Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google, Bing and Yandex for this measurement. In order to count only relevant results, we are searching for <system name> together with the term database, e.g. "Oracle" and "database".
•General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
•Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
•Number of job offers, in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.
•Number of profiles in professional networks, in which the system is mentioned.We use the internationally most popular professional networks LinkedIn and Upwork.
•Relevance in social networks. We count the number of Twitter tweets, in which the system is mentioned.
Ranking Trends Oct 2018
Big Data Management
44http://db-engines.com/en/ranking
http://db-engines.com/en/ranking
45
Big Data Management
NOSQL
Big Data Management
46
47
Big Data Management
Key-Value Stores
Big Data Management
48
¤ Also known as Tuple Stores
¤ Blob-stores are a specialised subset ¤ Operations for streaming files over the Internet
Document Store
Big Data Management
50
Wide-Column Store
Big Data Management
52
¤ Also known as ¤ Extensible Record Store ¤ Big Table (after Google implementation)
Azure Table Storage
Graph-Stores
Big Data Management
54
NoSQL
NoSQL systems aim to be:
¤Flexible: no fixed schema
¤Highly available: replicated across commodity compute clusters ¤Scale-out
¤Web-scale: (automatic) horizontal sharding
Big Data Management
56
NoSQL Drop support for:
¤Schema: Data model can evolve ¤ Data typically exchanged as JSON
¤SQL: No standard query language ¤ Proliferation of APIs ¤ Simpler lighter-weight interactions
¤Transaction processing: ¤ BASE consistency: Basically Available, Soft state, Eventually
consistent
Big Data Management
57
NoSQL ¤ Distributed!
¤ Sharding: splitting data over servers “horizontally” ¤ Replication
¤ Lower-level than RDBMS/SQL ¤ Simpler ad hoc APIs ¤ But you build the application (programming not querying) ¤ Operations simple and cheap
¤ Different flavours (for different scenarios) ¤ Different CAP emphasis ¤ Different scalability profiles ¤ Different query functionality ¤ Different data models
Big Data Management
58
Recap
¤Relational Databases don’t solve everything ¤SQL and ACID add overhead ¤Distribution not so easy
¤NoSQL: what if you don’t need SQL or ACID? ¤Something simpler ¤Something more scalable ¤Trade efficiency against guarantees
Big Data Management
59
Summary
Big Data Management
60
CAP not ACID
Big Data Management
61
C
A P
CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)
Relational Oracle MySQL MS SQL Server
PostgreSQL DB2
MS Access SQLite
Teradata
SAP Adaptive Server
FileMaker Hive
MariaDB Informix Vertica
Database Landscape
Big Data Management
62
NoSQL
DocumentMongoDB
Elasticsearch
DynamoDB CouchBase
Key-Value Redis Memcached
Riak KV Aerospike SimpleDB
Graph Neo4J Titan Giraph InfiniteGraph
RDF Virtuoso Stardog
GraphDB Blazegraph
Jena RDF4J
NewSQL SAP HANA Google Spanner Clustrix VoltDB MemSQL NuoDB
Object Caché Db4o Versant
ObjectStore
Wide-Column Cassandra HBase Accumulo
HyperTable
XML MarkLogic
Sedna Tamino
BaseX eXist-db
List of NoSQL databases
Big Data Management
63
http://nosql-database.org/
Big Data Discussion
¤What is Big Data?
¤When did Big Data come into being?
¤Why did Big Data become a major issue at this time?
2018Big Data Management
65
http://makingdatameaningful.com/ wp-content/uploads/2012/12/value- big-data-150x150.png
CAP Discussion
¤ What is the CAP theorem?
¤ Does it hold for relational databases?
¤ Are C, A, and P equally important?
2018Big Data Management
66
http://learnyousomeerlang.com/static/img/cap.png
q PACELC – builds on CAP (Abadi, 2010)
If network Partition then need to choose between Availability and Consistency (like CAP theorem)
Else (even) when no network issues choice between Latency and Consistency
PACELC (pass-elk)
Big Data Management
67
https://youtu.be/hUd_9FENShA?t=989
Reading ¤ Essential
¤ Brewer, E. (2012). CAP twelve years later: How the “rules” have changed. Computer, 45(2), 23–29. http://doi.org/10.1109/MC.2012.37 (Sections 1 & 2)
¤ Wellhausen, T. (2012). Highly Scalable , Ultra-Fast and Lots of Choices A Pattern Approach to NoSQL. In EuroPLoP 2012 (pp. B1–1–B1–9).
¤ Useful ¤ Lourenço, J. R., Cabral, B., Carreiro, P., Vieira, M., & Bernardino, J. (2015).
Choosing the right NoSQL database for the job: a quality attribute evaluation. Journal of Big Data, 2(1), 18. http://doi.org/10.1186/s40537-015- 0025-0 (Introduction & Background + table 1)
¤ Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12. http://doi.org/10.1145/1978915.1978919
Big Data Management
68