Big Data Management Question Paper

Magnum_1993
2.1DBMSLandscape.pdf

DBMS Landscape

Big Data Management

Phil Bartie phil.bartie@hw.ac.uk EM G.29

Material from : Alasdair Gray, Heriot-Watt Aidan Hogan, Universidad de Chile Tim Wellhausen, Freelance Software Developer

Limitations of distributed computing

Brewer’s CAP Theorem

Big Data Management

3

http://learnyousomeerlang.com/static/img/cap.png

But first … ACID For traditional (non-distributed) databases …

1. Atomicity: ¤ Transactions all or nothing: fail cleanly

2. Consistency: ¤ Doesn’t break constraints/rules

3. Isolation: ¤ Transaction doesn’t show elsewhere until COMMIT / ROLLBACK

4. Durability: ¤ Once completed a transaction is permanent (saved to non-volatile memory)

Big Data Management

4

Transaction Example

Big Data Management

5

SCENARIO: Transfer £125 from Bob to Mia - to do this requires multiple steps.

ATOMIC: Transaction only valid if all the steps work (all or nothing)

CONSISTENT: If part of the transaction fails then undo everything in the transaction

tr a

ns a

c tio

n

Transaction Example: Isolation

Big Data Management

6

USER 1

USER 2

SELECT * from account;

id | fname | amount ----+-------+-------- 1 | frank | 300 2 | sarah | 1500 3 | jen | 2500 4 | mia | 300 5 | bob | 600

(5 rows)

Starting Position

Transaction Example: Isolation

Big Data Management

7

USER1 carries out the COMMIT ...and now both users see the same values in the table

ISOLATION: the results from the transaction don’t show elsewhere until COMMIT or ROLLBACK

USER 1 USER 2

What is CAP?

Guarantees for a distributed system:

1. Consistency: ¤ All nodes have a consistent view of the system

2. Availability: ¤ Every read/write is acted upon

3. Partition-tolerance: ¤ The system works even if messages are lost

Big Data Management

8

Distributed Replication

Big Data Management

9

Consistency Big Data Management

10

There’s 891 users in ‘M’

There’s 891 users in ‘M’

Availability Big Data Management

11

891How many users start with ‘M’

Partition-Tolerance Big Data Management

12

891 How many users start with ‘M’

The CAP Question

Can a distributed system guarantee consistency (all nodes have the same up-to-date view),

availability (every read/write is acted upon) and

partition-tolerance (the system works even if messages are lost)

at the same time?

Big Data Management

13

What do you think?

The CAP Answer

Big Data Management

14

The CAP “Proof”

Big Data Management

15

How many users start with ‘M’

There’s 891 users in ‘M’

There’s 891 users in ‘M’

891

There’s 892 users in ‘M’

The CAP “Proof” ¤Consider machines m1 and m2 on either side of a

partition: ¤ If an update is allowed on m2 (Availability), then m1 cannot see

the change (lose Consistency) ¤ To make sure that m1 and m2 have the same, up-to-date view

(Consistency), neither m1 nor m2 can accept any updates (lose Availability)

¤ Thus, only when m1 and m2 can communicate (lose Partition tolerance) can Availability and Consistency be guaranteed

Big Data Management

16

The CAP Theorem A distributed system cannot guarantee consistency

(all nodes have the same up-to-date view), availability

(every read/write is acted upon) and partition-tolerance

(the system works even if messages are lost) at the same time. (“Proof” as shown on previous slide J)

Big Data Management

17

18

Big Data Management

???

Imagine this Scenario

Big Data Management

19

• Working in cafe with a friend on a film script

• Each with a copy of the script

• You go home & your friend stays in the cafe

• You carry on editing the script, so does your friend talking over the phone.

• Your friend’s mobile then runs out of power [i.e. the network now is down / you are partitioned]

• Your flat mate asks a question about the script…

• You can give them an answer based on your version of the script (AVAILABLE but not consistent) or say don’t know yet until you sync edits with the other author when phone is charged (CONSISTENT but not currently available)

The CAP Triangle

Big Data Management

20

C

A P

CAP Systems

Big Data Management

21

C

A P

(No intersection)

CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)

CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)

AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)

CA System

Big Data Management

22

How many users start with ‘M’

There’s 891 users in ‘M’

There’s 891 users in ‘M’

There’s 892 users in ‘M’

There’s 892 users in ‘M’892

No support for network partition

CP System

Big Data Management

23

How many users start with ‘M’

There’s 891 users in ‘M’

There’s 891 users in ‘M’891

Can’t allow the update to take place

AP System

Big Data Management

24

How many users start with ‘M’

There’s 891 users in ‘M’

There’s 891 users in ‘M’891

There’s 892 users in ‘M’

BASE (AP) ¤Basically Available

¤ Pretty much always “up”

¤Soft State ¤ Replicated, cached data – not always consistent

¤Eventual Consistency ¤ Stale data tolerated, but data will be copied across at some

point

Examples: Amazon, eBay, Google, DNS …

Big Data Management

25

The CAP Theorem ¤C,A in CAP ≠ C,A in ACID

¤Simplified model ¤Partitions are rare ¤Systems may be a mix of CA/CP/AP ¤C/A/P often continuous in reality!

¤But concept useful/frequently discussed: ¤How to handle Partitions?

¤ Availability? or ¤ Consistency?

Big Data Management

26

Extension to CAP theorem: PACELC theorem (pass-elk)

Big Data Management

27

If there is a Partition:

there is a trade-off between Availability and Consistency

Else :

there is a trade-off between Latency vs Consistency

PACELC (pass-elk)

Big Data Management

28

https://youtu.be/hUd_9FENShA?t=989

Relational Databases Storing Structured Data

Big Data Management

29

Relational Databases

Big Data Management

30

¤ Regular tabular structure

¤ Rigid defined schema

Database System

Big Data Management

31

Database

Application Program

Application Program

User User

User User

User User

User User User

Queries

RDBMS Architecture

Big Data Management

33

Data Files Catalog Logs Index Structure Database

Storage Manager

Transaction Manager

Security Control

Disk Manager

Buffer Manager

Code Generator

Physical Optimiser

Logical Optimiser

Query Parser

Query Processor

SQL Queries

ANSI-SPARC Three Tier Architecture

Big Data Management

35

Database

Physical Schema

Logical Schema

Application View

Application View

Application View

Application View

Logical Data Independence

Physical Data Independence

RDBMS: Performance Overheads ¤ Structured Query Language (SQL):

¤ Declarative Language ¤ Lots of Rich Features ¤ Difficult to Optimise!

¤ Atomicity, Consistency, Isolation, Durability (ACID): ¤ Makes sure your database stays correct

¤ Even if there’s a lot of traffic! ¤ Transactions incur a lot of overhead

¤ Multi-phase locks, multi-versioning, write ahead logging

¤Distribution not straightforward

Big Data Management

37

Relational Databases: One Size Fits All?

Big Data Management

39

M. Stonebraker and U. Cetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in 21st International Conference on Data Engineering (ICDE’05), 2005, pp. 2–11.

Alternatives to Relational Databases The rise of the NoSQL movement

Big Data Management

40

Relational Oracle MySQL MS SQL Server

PostgreSQL DB2

MS Access SQLite

Teradata

SAP Adaptive Server

FileMaker Hive

MariaDB Informix Vertica

Database Landscape

Big Data Management

41

NoSQL

DocumentMongoDB

Elasticsearch

DynamoDB CouchBase

Key-Value Redis Memcached

Riak KV Aerospike SimpleDB

Graph Neo4J Titan Giraph InfiniteGraph

RDF Virtuoso Stardog

GraphDB Blazegraph

Jena RDF4J

NewSQL SAP HANA Google Spanner Clustrix VoltDB MemSQL NuoDB

Object Caché Db4o Versant

ObjectStore

Wide-Column Cassandra HBase Accumulo

HyperTable

XML MarkLogic

Sedna Tamino

BaseX eXist-db

DB-Engines Ranking

Big Data Management

43

The DB-Engines Ranking is a list of database management systems ranked by their current popularity. We measure the popularity of a system by using the following parameters:

•Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google, Bing and Yandex for this measurement. In order to count only relevant results, we are searching for <system name> together with the term database, e.g. "Oracle" and "database".

•General interest in the system. For this measurement, we use the frequency of searches in Google Trends.

•Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.

•Number of job offers, in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.

•Number of profiles in professional networks, in which the system is mentioned.We use the internationally most popular professional networks LinkedIn and Upwork.

•Relevance in social networks. We count the number of Twitter tweets, in which the system is mentioned.

Ranking Trends Oct 2018

Big Data Management

44http://db-engines.com/en/ranking

http://db-engines.com/en/ranking

45

Big Data Management

NOSQL

Big Data Management

46

47

Big Data Management

Key-Value Stores

Big Data Management

48

¤ Also known as Tuple Stores

¤ Blob-stores are a specialised subset ¤ Operations for streaming files over the Internet

Document Store

Big Data Management

50

Wide-Column Store

Big Data Management

52

¤ Also known as ¤ Extensible Record Store ¤ Big Table (after Google implementation)

Azure Table Storage

Graph-Stores

Big Data Management

54

NoSQL

NoSQL systems aim to be:

¤Flexible: no fixed schema

¤Highly available: replicated across commodity compute clusters ¤Scale-out

¤Web-scale: (automatic) horizontal sharding

Big Data Management

56

NoSQL Drop support for:

¤Schema: Data model can evolve ¤ Data typically exchanged as JSON

¤SQL: No standard query language ¤ Proliferation of APIs ¤ Simpler lighter-weight interactions

¤Transaction processing: ¤ BASE consistency: Basically Available, Soft state, Eventually

consistent

Big Data Management

57

NoSQL ¤ Distributed!

¤ Sharding: splitting data over servers “horizontally” ¤ Replication

¤ Lower-level than RDBMS/SQL ¤ Simpler ad hoc APIs ¤ But you build the application (programming not querying) ¤ Operations simple and cheap

¤ Different flavours (for different scenarios) ¤ Different CAP emphasis ¤ Different scalability profiles ¤ Different query functionality ¤ Different data models

Big Data Management

58

Recap

¤Relational Databases don’t solve everything ¤SQL and ACID add overhead ¤Distribution not so easy

¤NoSQL: what if you don’t need SQL or ACID? ¤Something simpler ¤Something more scalable ¤Trade efficiency against guarantees

Big Data Management

59

Summary

Big Data Management

60

CAP not ACID

Big Data Management

61

C

A P

CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)

CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)

AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)

Relational Oracle MySQL MS SQL Server

PostgreSQL DB2

MS Access SQLite

Teradata

SAP Adaptive Server

FileMaker Hive

MariaDB Informix Vertica

Database Landscape

Big Data Management

62

NoSQL

DocumentMongoDB

Elasticsearch

DynamoDB CouchBase

Key-Value Redis Memcached

Riak KV Aerospike SimpleDB

Graph Neo4J Titan Giraph InfiniteGraph

RDF Virtuoso Stardog

GraphDB Blazegraph

Jena RDF4J

NewSQL SAP HANA Google Spanner Clustrix VoltDB MemSQL NuoDB

Object Caché Db4o Versant

ObjectStore

Wide-Column Cassandra HBase Accumulo

HyperTable

XML MarkLogic

Sedna Tamino

BaseX eXist-db

List of NoSQL databases

Big Data Management

63

http://nosql-database.org/

Big Data Discussion

¤What is Big Data?

¤When did Big Data come into being?

¤Why did Big Data become a major issue at this time?

2018Big Data Management

65

http://makingdatameaningful.com/ wp-content/uploads/2012/12/value- big-data-150x150.png

CAP Discussion

¤ What is the CAP theorem?

¤ Does it hold for relational databases?

¤ Are C, A, and P equally important?

2018Big Data Management

66

http://learnyousomeerlang.com/static/img/cap.png

q PACELC – builds on CAP (Abadi, 2010)

If network Partition then need to choose between Availability and Consistency (like CAP theorem)

Else (even) when no network issues choice between Latency and Consistency

PACELC (pass-elk)

Big Data Management

67

https://youtu.be/hUd_9FENShA?t=989

Reading ¤ Essential

¤ Brewer, E. (2012). CAP twelve years later: How the “rules” have changed. Computer, 45(2), 23–29. http://doi.org/10.1109/MC.2012.37 (Sections 1 & 2)

¤ Wellhausen, T. (2012). Highly Scalable , Ultra-Fast and Lots of Choices A Pattern Approach to NoSQL. In EuroPLoP 2012 (pp. B1–1–B1–9).

¤ Useful ¤ Lourenço, J. R., Cabral, B., Carreiro, P., Vieira, M., & Bernardino, J. (2015).

Choosing the right NoSQL database for the job: a quality attribute evaluation. Journal of Big Data, 2(1), 18. http://doi.org/10.1186/s40537-015- 0025-0 (Introduction & Background + table 1)

¤ Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12. http://doi.org/10.1145/1978915.1978919

Big Data Management

68