Big Data Management Question Paper

profileMagnum_1993
2.2DocumentStores.pptx

Document Stores

Big Data Management

Phil Bartie [email protected] EM G.29

Based on material from: Alasdair Gray, Heriot-Watt University Kenneth McLeod, Heriot-Watt University

+ MongoDB demo print out needed

1

Big Data Management

Document Stores

Materials released under CC-BY License

You are free to:

Share — copy and redistribute the material in any medium or format

Adapt — remix, transform, and build upon the material for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Big Data Management

2

MongoDB accounts?

Did you all get an email about your MongoDB account at MACS?

If not please email MACS helpdesk today: [email protected]

Include your HWU email address, username (eg ab12), and course (F20BD;F21BD) – and ask for a MongoDB account to be setup.

cc the email to me: [email protected]

Big Data Management

3

Relational

Oracle

MySQL

MS SQL Server

PostgreSQL

DB2

MS Access

SQLite

Teradata

SAP Adaptive Server

FileMaker

Hive

MariaDB

Informix

Vertica

Database Landscape

Big Data Management

4

NoSQL

Document

MongoDB

Elasticsearch

DynamoDB

CouchBase

Key-Value

Redis

Memcached

Riak KV

Aerospike

SimpleDB

Graph

Neo4J

Titan

Giraph

InfiniteGraph

RDF

Virtuoso

Stardog

GraphDB

Blazegraph

Jena

RDF4J

NewSQL

SAP HANA

Google Spanner

Clustrix

VoltDB

MemSQL

NuoDB

Object

Caché

Db4o

Versant

ObjectStore

Wide-Column

Cassandra

HBase

Accumulo

HyperTable

XML

MarkLogic

Sedna

Tamino

BaseX

eXist-db

DynamoDB - talked about last time can be used as both a Key-Value store and Document store.

Document Stores are a SUBCLASS of Key-Value stores.

Elasticsearch is a search engine based on the Lucene library.

CouchDB vs Couchbase vs MongoDB

https://assist-software.net/blog/couchbase-vs-couchdb-vs-mongodb. -- comparison

-- Couchbase is CouchDB with Membase -- introducing N1QL. (like SQL) for Couchbase.

-- CouchDB has REST API – can use CURL to get data

-- CouchDB uses a concept called MVCC (Multi Version Concurrency Control).

Document Stores

Big Data Management

4

Document Stores

aka semi-structured data

Big Data Management

5

Subclass of key-value stores

– the difference is that Document Stores requires the VALUE part (of K-V) to be stored in a DATA format it can understand. e.g. JSON

-- this allows the user to use NESTED Key-Value pairs, and supports searching the VALUE

Document Stores

Big Data Management

5

Document stores

JSON/XML Documents (not MS Office documents) 

Very flexible, good for high volume & variety

Example: http://www.mongodb.com

{

"name": "ken",

"age": 37,

"office": "G43"

}

Any data that can be modelled in JSON can be stored in MongoDB.

Document Representation

XML

JSON

Big Data Management

7

Same data represented in 2 different ways – XML and JSON

XML - Extensible Markup Language.

- lots of variations (GML + KML for map data)

JSON – JavaScript Object Notation – very popular on web as it is so lightweight – self-des

Depends on the exact DBMS if you can use XML or JSON

--- MongoDB uses JSON

JSON --> easy to understand / supports nesting

Document Stores

Big Data Management

7

BSON

JSON: JavaScript Object Notation

BSON: Binary JSON

Binary representation

Additional types

Date

Byte array

Efficient data access

Smaller representation

Big Data Management

8

JSON:

{

"make": "vw",

"model": "golf"

}

BSON: (viewed as bytes)

Doc length | value type | key | value | value type | key | value

https://onlinejsontools.com/convert-json-to-bson

MongoDB represents JSON documents in binary-encoded format behind the scenes, called BSON.

BSON extends the JSON model to provide additional data types, ordered fields, and to be efficient for encoding and decoding within different programming languages.

Document Stores

Big Data Management

8

Document Query Language

XPath and XQuery for XML

Common for all XML Stores

(Not covered in this course)

JSONiq proposed for JSON http://www.jsoniq.org/

Based on XQuery

Very limited uptake

APIs for JSON and BSON

No common API

Support CRUD operations

Create

Read

Update

Delete

Big Data Management

9

No standard document query language

NoSQL databases don’t have a common query language like SQL….

Couchbase have introduced N1QL (Nickle)

– a SQL like language for CouchBase

– take a look in another lecture later on.

https://query-tutorial.couchbase.com/tutorial/#1

Document Stores

Big Data Management

9

mongoDB

Most widely used NoSQL system

BSON Document Store

Open source

MongoDB Inc => sell consultancy + training

Written in C++

Runs on standard hardware

Easy to distribute

Big Data Management

10

http://db-engines.com/en/ranking

2018

Mongo more widely used by quite some margin – 5th top in the list of DB ENGINES in 2018

2019 stats

- 40 million downloads

- used by about 1 million Uni Students

Businesses like it because it is free but it is from a company who offer support, updates, training.

- They also offer MongoDB Atlas which is an online hosted service

-- there are also Mobile versions and new Charting tools (dashboards)

Document Stores

Big Data Management

10

So how do they make money?

You get the basic DBMS for free, but you pay for the rest ...

They act as consultants and developers building & specialising Info Systems to your needs...

MongoDB inc

An MongoDB IDE Compass:

Alternative: Robo 3T

https:// robomongo.org /

F28DM: Database Connectivity

13

mongoDB vs RDBMS

Most applications can be done equally

well by mongoDB & RDBMS

RDBMS only:

joins & transactions

mongoDB only:

big data

80% can

be done

by both

Slide replicated from mongoDB Inc presentation

Not entirely true

Some truth here

A lot of truth here

THIS SLIDE is from a MongoDB presentation….

[SHORT DISCUSSION]

{3 clicks}

I’m not convinced that it is as much as 80% that mongoDB can do.

Transactions are very prevalent, that is why RDBMS has been the prevalent model and remains to be (Oracle, SQL Server and MySQL are by far the dominant DBMSs)

I’m not sure it is fair to say that RDBMS cannot do big data. However, mongo can do it for less money.

Benefits

Multi-Versioning Concurrency Control (MVCC)

Big Data Management

16

e.g. mySQL / PostgreSQL

/ DynamoDB / MongoDB

e.g.

https:// docs.mongodb.com/manual/core/wiredtiger

Some DBMS like MongoDB, DynamoDB and CouchBD maintain high

WRITE performance by using

MVCC – Multi-Versioning Concurrency Control

READS don’t wait for the TABLE to be UNLOCKED where a WRITE is going on….

Document Stores

Big Data Management

16

In the News….

https://www.theregister.co.uk/2019/01/10/amazon_documentdb/

Jan 2019

10th Jan 2019 announcement from Amazon

- already have DynamoDB but now offering DocumentDB

Document Stores

Big Data Management

17

MongoDB vs RDBMS: terminology

MongoDB vs RDBMS: terminology

Database = same

Collection = like a table

MongoDB vs RDBMS: terminology

{ "people" : [

{ "name" : "bob smith", "age" : 22},

{ "name" : "ben taylor", "age" : 32},

{ "name" : "rob snow", "age" : 12},

]}

in a collection are documents - a bit like rows of a table

BUT a document could map to many rows of an RDBMS

MongoDB vs RDBMS: terminology

{ "people" : [

{ "name" : "bob smith", "age" : 22},

{ "name" : "ben taylor", "age" : 32},

{ "name" : "rob snow", "age" : 12},

]}

Field maps to a Column

Data Modelling

mongo: denormalised

RDBMS: normalised

Consider queries:

Get name and address of bob smith

mongo: single operation

RDBMS: join between people and addresses

Get all names

mongo must touch every document

RDBMS single operation

Documents have fields = like database fields /columns of table

Inserting a JSON document

db.dbis.insert({"name": "ken", "age": 40})

Notice: double quotes around both the key and value, unless the value is a number

You can type this into the terminal window when the mongodb client is running

dbis is the name of the collection

F28DM: Database Connectivity

22

Retrieving a JSON document

db.dbis.find({})

db.dbis.find({"name": "ken"})

Finds all documents in the database

Finds all documents that include the name-value pair "name" : "ken"

F28DM: Database Connectivity

23

Updating a JSON document

db.dbis.update({"name": "ken"}, {$set: {"office" : "G43"})

The document(s) to change

What is added to the document(s)

Document changed to: ({"name": "ken", "age": 40, "office": "G43"})

F28DM: Database Connectivity

24

http://docs.mongodb.org/manual/core/sharding-introduction/

Sharding

Documents split by shard key

Aims for uniform distribution

Timestamp is good

Limited domain, e.g. gender, is bad

Key must be

Immutable

Indexed

512 bytes max

Big Data Management

25

Choose your own shard key.

{CLICK for example}

{CLICK remove example}

Also necessary to split processor load evenly!

Once chosen a shard key it must be fixed so that the data isn't being moved around as it gets updated

(e.g. not going to choose number of tennis games played as the player plays another game so the document might have to be moved - not a good idea - instead use a Primary Key => eg social security number)

Replication

By default:

Client reads/writes to primary

Primary pushes updates to secondaries.

Then responds to client

Strong consistency!

Big Data Management

26

DATA is SHARDED across many servers based on the SHARD KEY… each server will also be replicated to other servers.

By Default MongoDB is CP -- consistency is main objective

Pushes to secondaries before accepting the write

Distributed transaction

Details: https://docs.mongodb.com/manual/core/replica-set-primary/

All members of the replica set can accept read operations. However, by default, an application directs its read operations to the primary member. 

Document Stores

Big Data Management

26

Client

Primary

Secondary

Secondary

Secondary

CA: Guarantees to give a correct response but only while network works fine

(Centralised / Traditional)

Tunable towards AP

Where is mongoDB?

Big Data Management

27

C

A

P

CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)

AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)

CP: strong consistency

Tunable to AP

27

Big Data Management

Document Stores

Tunable to AP

Client reads from secondary

Useful for some workloads, e.g.

Aggregation of values

Locality of data

Big Data Management

28

Clients WRITEs to PRIMARY / READs from SECONDARY

Higher AVAILABILITY but lower CONSISTENCY

Document Stores

Big Data Management

28

Client

Primary

Secondary

Secondary

Secondary

Client

Data Modelling

https://www.flickr.com/photos/18378305@N00/

Big Data Management

How to model relationships within JSON (also works for XML)

Aside on Schemas

Big Data Management

30

Aside on Schemas

Database Schema: database structure described in a formal language

Blueprint for the data

Relational Schema: defines the relations, attributes, constraints, view, indexes, etc, in SQL

Enforced by RDBMS

Upfront design effort: ER diagrams

Big Data Management

31

Company Example: ER Diagram

Big Data Management

32

Department
number {PK}
name {AK} location [1..5]
Employee
ssn {PK}
name first Name last Name dateOfBirth gender /age

manages

1

1..*

works for

1

0..1

start date

supervises

0..1

0..*

Project
project number {PK}
project name {AK}

controls

1

1..*

hours

works on

1..*

0..*

Deadline
date
type

1

0..*

has

Captures entities of interest and their relationships

Captures constraints

Time consuming to do right

Needs change over time!

/age  derived field not stored value…

AK = ALTERNATE KEY

--- also uniquely identifies a row may be used like PK for WHERE but not usually used in JOINS

PK should be immutable – ie never changed

Notice in ERD - no Foreign Keys

Can have self joins

Can have many relations between entities (manages, works for)

Direction of the arrow is how you read it

Multiplicity of a Relation (0..1 to 0..* etc)

ER Diagrams

F28DM Database Systems – A.J.G. Gray

32

NoSQL Schemas

Schema-less/schema-free: no structure defined

Example: Objects in key-value store

Cannot be enforced

Flexible/dynamic schema: no structure defined with DBMS

Initial structure drawn-up for objects

Structure not enforced, (but could be)

Structure allowed to evolve as needs change

Note: most material does not distinguish between these

Big Data Management

33

TWO NoSQL Schemas

1) No structure defined at all.. just store VALUES in KV pairs in DOCUMENT

Hard to know what to expect is in each document in advance

2) Don't specify a structure in the DBMS but still do an initial design

using ER diagrams!

Flexible schema promotes prototyping – popular with developers

Document Stores

Big Data Management

33

Example: Employee

ER Entity

Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }

JSON Object

Big Data Management

34

Employee
id {PK}
name first Name last Name dateOfBirth gender /age

/age is derived => something to implement in the application or a VIEW based on the dateofBirth and now()

SCHEMA is defined by the document as the KEYS for this person.

Could be different for the next person's document.

first names as an array [ ]

Document Stores

Big Data Management

34

JSON Schema

Metadata about a JSON document

Describes

Data structure

Data types

Constraints: min/max values, required, etc

Human and machine readable

Validation: checks a document against a schema

http://json-schema.org/

Big Data Management

35

JSON schema is a way of validating your data inputs / updates

Document Stores

Big Data Management

35

Example: Employee

JSON Document

Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }

JSON Schema

{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { … }, "required": [ "_id", "name", "date of birth", "gender” ] }

Big Data Management

36

Declares

- schema

required properties

Can be used in MongoDB to Validate input

WILL LOOK AT DETAILS ON NEXT SLIDE

Main things to notice here are that it sets out what is REQUIRED

{NEXT SLIDE for more details on properties}

Document Stores

Big Data Management

36

Example: Employee (properties)

JSON Document

Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }

JSON Schema

"properties": { "_id": { "type": "integer", "minimum": 1 }, "name": { … }, "date of birth": { "type": "string", "format": "date" }, "gender": { "type": "string", "enum": "[”M", ”F"]", }, "salary": { "type": "integer", "minimum": 10000 } }

Big Data Management

37

Defines properties

Names

Types

Format / restrictions NOTE: Look at the name section on the next slide

Document Stores

Big Data Management

37

Example: Employee (name)

JSON Document

Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }

JSON Schema

"name":

{ "type": "object", "properties": { "first names": { "type": "array", "items": { "type": "string” } }, "surname": { "type": "string” } }, "additionalProperties": false, "required": [ "first names", "surname” ] },

Big Data Management

38

Declares complex structure for name

first names as a string array [ ]

Eliminates the possibility of additional properties

Also note that this nested document has its own requirements

(for NAME) which are in addition to the higher level requirements (name, date of birth, gender)

Document Stores

Big Data Management

38

39

Big Data Management

db.createCollection( "people" , {

validator: { $jsonSchema: {

bsonType: "object",

required: [ "name", "surname", "email" ],

properties: {

name: {

bsonType: "string",

description: "required and must be a string" },

surname: {

bsonType: "string",

description: "required and must be a string" },

email: {

bsonType: "string",

pattern: "^.+\@.+$",

description: "required and must be a valid email address" },

year_of_birth: {

bsonType: "int",

minimum: 1900,

maximum: 2018,

description: "the value must be in the range 1900-2018" },

gender: {

enum: [ "M", "F" ],

description: "can be only M or F" }

}

}

}})

https://www.percona.com/blog/2018/08/16/mongodb-how-to-use-json-schema-validator/

MongoDB > db.people.insert( { name : "John", surname : "Smith" } )

WriteResult({

"nInserted" : 0,

"writeError" : {

"code" : 121,

"errmsg" : "Document failed validation"

}

})

MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "[email protected]" } )

WriteResult({ "nInserted" : 1 })

MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "john.smith.gmail.com" } )

WriteResult({

"nInserted" : 0,

"writeError" : {

"code" : 121,

"errmsg" : "Document failed validation"

}

})

EXAMPLE of using JSON Schema in MongoDB

Example of using JSON Schema for Validation

Fails as no email given

Fails as email is not a valid address (reg expression @ needed)

Works Not very helpful msgs - could say WHY it failed!!

Document Stores

Big Data Management

39

JSON Schema

Many more types of restrictions:

Pattern matching

Ranges

Referenced dependencies

References:

JSON Schema: http://json-schema.org/

Understanding JSON Schema Book https://spacetelescope.github.io/understanding-json-schema/index.html

Big Data Management

40

Schema Design for MongoDb

Big Data Management

41

Flexible Schema

Flexible:

Person & Cat have different data models.

Heterogenous data types!

Exercise!

Did that last JSON document make sense?

No!

Why would you have cat and human in an array called People?

Notice there is no recorded relationship between them!

The cat has nothing to do with Ken. 

Example: Cars and People

Is there a relationship between Ken and the Toyota car?

mongoDB has no referential integrity

No join support

Application logic and two data calls

{ "_id": "ken", "name": "ken mcleod", "car": "ej59kjj" }

{ "_id": "ej59kjj", "make": "toyota" }

Big Data Management

45

$lookup function equivalent to LEFT JOIN

Just 2 documents but can see car reg is in both

However MongoDB has no ref integrity

{CLICK}

$lookup function in MONGODB does the same thing as a left join. --- those things in the LEFT entity that can also be found in the RIGHT entity are retrieved

Document Stores

Big Data Management

45

Nesting (aka Embedding)

Captures 1:1 relationships

Relationship explicit

Data retrieval in one call

Denormalised data

Data replication

{ "_id": "ken", "name": "ken mcleod", "car": { "_id": "ej59kjj", "make": "toyota" } }

Big Data Management

46

Nested Documents

Data is denormalised - different entities being stored together

in RDBMS we'd ensure all attributes relate to the PK in the entity

So information about people in a Person table

and information about Cars in a Vehicles table

Linked by a PK - FK relationship

Normal forms (1NF, 2NF, 3NF..)

Data replication if holding any other details about those cars in other documents.. (eg MOT date, last serviced, previous owners)

Document Stores

Big Data Management

46

Nesting (aka Embedding)

What if Ken owns two cars?

ej59kjj toyota ej61tcd ford

Big Data Management

47

Document Stores

Big Data Management

47

Nesting (aka Embedding)

What if Ken owns two cars?

Captures 1:M relationships

Relationship explicit

Data retrieval in one call

Denormalised data

Data replication

{ "_id": "ken", "name": "ken mcleod", "car": [ { "_id": "ej59kjj", "make": "toyota" }, { "_id": "ej61tcd”, "make": "ford” } ] }

Big Data Management

48

Data replication if holding any other details about those cars in other documents.. (eg MOT date, last serviced, previous owners)

Document Stores

Big Data Management

48

Nesting (aka Embedding)

What if Ken owns ten or more cars?

Embedding fails if M too large

16MB limit per document

Can use GridFS

Chunks document to get around 16MB limitation

Not a portable solution

{ "_id": "ken", "name": "ken mcleod", "car": [ { "_id": "ej59kjj", "make": "toyota" }, { "_id": "ej61tcd”, "make": "ford” }, ... ]}

Big Data Management

49

http://docs.mongodb.org/manual/core/gridfs/

GridFS comes in useful for storing media (images) to get around the 16MB limit per document

{CLICK} - ex of 10+ cars using Normalisation

Document Stores

Big Data Management

49

Normalisation

What if Ken owns ten or more cars?

Good for large M

Embed reference

Akin to foreign key

No referential integrity

_id Primary key: either

Automatically generated, or

User supplied

{ "_id": "ken", "name": "ken mcleod" }

{ "_id": "ej59kjj", "make": "toyota", "owner": "ken" }

{ "_id": "ej61tcd", "make": "ford", "owner": "ken" }

Big Data Management

50

http://docs.mongodb.org/manual/core/data-modeling-introduction/

HERE putting owner as field in each CAR document (FK)

mongoDB does not force constraints, unlike a RDBMS

– no check that the foreign key is a primary key in another ‘collection’

If you wish to force/check relationships, do that in your application code – means the DBMS is simpler but the software developed has to take care of referential integrity

Anyone think of any issues with this?

> What if you have a web page, desktop client, and a native mobile app?

>> Each application needs to work the same way – each dev team needs to implement the same methods – rather than pushing the logic a single place on the DBMS server where the data are stored = can increase risk of data integrity issues

Document Stores

Big Data Management

50

Normalisation

Data Integrity

mongoDB does not force constraints

unlike an RDBMS.

Application code must check relationships

Move responsibility to application developer

Simplifies DBMS

Big Data Management

51

Can validate data against a JSON schema – but optional if you create this and use it

Each document can be storing very different things – but then hard to actually use – how do you know names of the KEYs

Document Stores

Big Data Management

51

Normalisation: Trees

Big Data Management

52

Could store data as:

> Parent references:

The PARENT ID is STORED In the CHILD NODE (see tree figure).

-- would need many DBMS calls to trace the tree back following the reference links

> OR Child references: The child’s id stored in parent node

> OR Optionally array of all ancestors

Select approach according to the application to:

Maximise performance

Minimise updating

Reduce redundancy

Document Stores

Big Data Management

52

{"_id": "alasdair", "name": "Alasdair Gray", "mother": "jane",

"father": "alex"}

{"_id": "jane", "name": "Jane McLeod", "mother": "mary",

"father": "oliver"}

{"_id": "mary", ...}

{"_id": "oliver", ...}

{"_id": "alex", "name": "Alexander Gray", "mother": "may",

"father": "alexander"

{"_id": "may", ...}

{"_id": "alexander", ...}

Normalisation: Trees

Can capture tree structures

Parent references: parent’s id stored in child node

Child references: child’s id stored in parent node

Optionally array of all ancestors

Materialised path: ids of path to route

Application must follow reference links

Multiple data calls

Approach depends on application domain. Select approach that will

Maximise performance

Minimise updating

Reduce redundancy

https://docs.mongodb.com/manual/applications/data-models-tree-structures/

Big Data Management

53

Mongo Example: Query a Nested Document

mongoDB to work with a NESTED DOCUMENTS use dot notation

e.g.

db.inventory.insertMany ([

{item:”book”, qty:25,size {h:140,w:200}},

{item:”pencil”, qty:10,size {h:100,w:15}},

{item:”paper”, qty:100,size {h:25,w:30}}

]);

 

db.inventory.find ( { “size.h” : {$gt:50} } )

Big Data Management

54

Use dot notation to access the nested document fields

Summary

Document

Semi-structured

XML

YAML

JSON

BSON

No standard query language

API access

Flexible-schema

JSON Schema

Denormalise for data retrieval

mongoDB

BSON

“Very relational”

No referential integrity

Complexity pushed to apps

Sharding key

Replication: primary/secondary

CAP

CP: Strong consistency (default)

AP: Eventual consistency (tunable)

Big Data Management

55

Compare SQL commands to MongoDB

https://docs.mongodb.com/manual/reference/sql-comparison/

Big Data Management

56

mongoDB Lab

Covers:

Import

Create

Read

Update

Delete

Optimisation & indices

Export

Will take around 1 hour

Big Data Management

57

No lecture on Thu this week

Document Stores

Big Data Management

57

Inserting a JSON document

db.dbis.insert({"name": "ken", "age": 40})

Notice: double quotes around both the key and value, unless the value is a number

You can type this into the terminal window when the mongodb client is running

dbis is the name of the collection

F28DM: Database Connectivity

58

Retrieving a JSON document

db.dbis.find({})

db.dbis.find({"name": "ken"})

Finds all documents in the database

Finds all documents that include the name-value pair "name" : "ken"

F28DM: Database Connectivity

59

Updating a JSON document

db.dbis.update({"name": "ken"}, {$set: {"office" : "G43"})

The document(s) to change

What is added to the document(s)

Document changed to: ({"name": "ken", "age": 40, "office": "G43"})

F28DM: Database Connectivity

60