Big Data Management Question Paper
Document Stores
Big Data Management
Phil Bartie [email protected] EM G.29
Based on material from: Alasdair Gray, Heriot-Watt University Kenneth McLeod, Heriot-Watt University
+ MongoDB demo print out needed
1
Big Data Management
Document Stores
Materials released under CC-BY License
You are free to:
Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Big Data Management
2
MongoDB accounts?
Did you all get an email about your MongoDB account at MACS?
If not please email MACS helpdesk today: [email protected]
Include your HWU email address, username (eg ab12), and course (F20BD;F21BD) – and ask for a MongoDB account to be setup.
cc the email to me: [email protected]
Big Data Management
3
Relational
Oracle
MySQL
MS SQL Server
PostgreSQL
DB2
MS Access
SQLite
Teradata
SAP Adaptive Server
FileMaker
Hive
MariaDB
Informix
Vertica
Database Landscape
Big Data Management
4
NoSQL
Document
MongoDB
Elasticsearch
DynamoDB
CouchBase
Key-Value
Redis
Memcached
Riak KV
Aerospike
SimpleDB
Graph
Neo4J
Titan
Giraph
InfiniteGraph
RDF
Virtuoso
Stardog
GraphDB
Blazegraph
Jena
RDF4J
NewSQL
SAP HANA
Google Spanner
Clustrix
VoltDB
MemSQL
NuoDB
Object
Caché
Db4o
Versant
ObjectStore
Wide-Column
Cassandra
HBase
Accumulo
HyperTable
XML
MarkLogic
Sedna
Tamino
BaseX
eXist-db
DynamoDB - talked about last time can be used as both a Key-Value store and Document store.
Document Stores are a SUBCLASS of Key-Value stores.
Elasticsearch is a search engine based on the Lucene library.
CouchDB vs Couchbase vs MongoDB
https://assist-software.net/blog/couchbase-vs-couchdb-vs-mongodb. -- comparison
-- Couchbase is CouchDB with Membase -- introducing N1QL. (like SQL) for Couchbase.
-- CouchDB has REST API – can use CURL to get data
-- CouchDB uses a concept called MVCC (Multi Version Concurrency Control).
Document Stores
Big Data Management
4
Document Stores
aka semi-structured data
Big Data Management
5
Subclass of key-value stores
– the difference is that Document Stores requires the VALUE part (of K-V) to be stored in a DATA format it can understand. e.g. JSON
-- this allows the user to use NESTED Key-Value pairs, and supports searching the VALUE
Document Stores
Big Data Management
5
Document stores
JSON/XML Documents (not MS Office documents)
Very flexible, good for high volume & variety
Example: http://www.mongodb.com
{
"name": "ken",
"age": 37,
"office": "G43"
}
Any data that can be modelled in JSON can be stored in MongoDB.
Document Representation
XML
JSON
Big Data Management
7
Same data represented in 2 different ways – XML and JSON
XML - Extensible Markup Language.
- lots of variations (GML + KML for map data)
JSON – JavaScript Object Notation – very popular on web as it is so lightweight – self-des
Depends on the exact DBMS if you can use XML or JSON
--- MongoDB uses JSON
JSON --> easy to understand / supports nesting
Document Stores
Big Data Management
7
BSON
JSON: JavaScript Object Notation
BSON: Binary JSON
Binary representation
Additional types
Date
Byte array
Efficient data access
Smaller representation
Big Data Management
8
JSON:
{
"make": "vw",
"model": "golf"
}
BSON: (viewed as bytes)
Doc length | value type | key | value | value type | key | value
MongoDB represents JSON documents in binary-encoded format behind the scenes, called BSON.
BSON extends the JSON model to provide additional data types, ordered fields, and to be efficient for encoding and decoding within different programming languages.
Document Stores
Big Data Management
8
Document Query Language
XPath and XQuery for XML
Common for all XML Stores
(Not covered in this course)
JSONiq proposed for JSON http://www.jsoniq.org/
Based on XQuery
Very limited uptake
APIs for JSON and BSON
No common API
Support CRUD operations
Create
Read
Update
Delete
Big Data Management
9
No standard document query language
NoSQL databases don’t have a common query language like SQL….
Couchbase have introduced N1QL (Nickle)
– a SQL like language for CouchBase
– take a look in another lecture later on.
https://query-tutorial.couchbase.com/tutorial/#1
Document Stores
Big Data Management
9
mongoDB
Most widely used NoSQL system
BSON Document Store
Open source
MongoDB Inc => sell consultancy + training
Written in C++
Runs on standard hardware
Easy to distribute
Big Data Management
10
http://db-engines.com/en/ranking
2018
Mongo more widely used by quite some margin – 5th top in the list of DB ENGINES in 2018
2019 stats
- 40 million downloads
- used by about 1 million Uni Students
Businesses like it because it is free but it is from a company who offer support, updates, training.
- They also offer MongoDB Atlas which is an online hosted service
-- there are also Mobile versions and new Charting tools (dashboards)
Document Stores
Big Data Management
10
So how do they make money?
You get the basic DBMS for free, but you pay for the rest ...
They act as consultants and developers building & specialising Info Systems to your needs...
MongoDB inc
An MongoDB IDE Compass:
Alternative: Robo 3T
F28DM: Database Connectivity
13
mongoDB vs RDBMS
Most applications can be done equally
well by mongoDB & RDBMS
RDBMS only:
joins & transactions
mongoDB only:
big data
80% can
be done
by both
Slide replicated from mongoDB Inc presentation
Not entirely true
Some truth here
A lot of truth here
THIS SLIDE is from a MongoDB presentation….
[SHORT DISCUSSION]
{3 clicks}
I’m not convinced that it is as much as 80% that mongoDB can do.
Transactions are very prevalent, that is why RDBMS has been the prevalent model and remains to be (Oracle, SQL Server and MySQL are by far the dominant DBMSs)
I’m not sure it is fair to say that RDBMS cannot do big data. However, mongo can do it for less money.
Benefits
https://www.flickr.com/photos/n0rthw1nd/
https://www.flickr.com/photos/thoth-god/
https://www.flickr.com/photos/61056899@N06/
https://www.flickr.com/photos/wainwright/
Lightweight
Horizontal
Scaling
Flexible schema
Cheap!
Multi-Versioning Concurrency Control (MVCC)
Big Data Management
16
Some DBMS like MongoDB, DynamoDB and CouchBD maintain high
WRITE performance by using
MVCC – Multi-Versioning Concurrency Control
READS don’t wait for the TABLE to be UNLOCKED where a WRITE is going on….
Document Stores
Big Data Management
16
In the News….
https://www.theregister.co.uk/2019/01/10/amazon_documentdb/
Jan 2019
10th Jan 2019 announcement from Amazon
- already have DynamoDB but now offering DocumentDB
Document Stores
Big Data Management
17
MongoDB vs RDBMS: terminology
MongoDB vs RDBMS: terminology
Database = same
Collection = like a table
MongoDB vs RDBMS: terminology
{ "people" : [
{ "name" : "bob smith", "age" : 22},
{ "name" : "ben taylor", "age" : 32},
{ "name" : "rob snow", "age" : 12},
]}
in a collection are documents - a bit like rows of a table
BUT a document could map to many rows of an RDBMS
MongoDB vs RDBMS: terminology
{ "people" : [
{ "name" : "bob smith", "age" : 22},
{ "name" : "ben taylor", "age" : 32},
{ "name" : "rob snow", "age" : 12},
]}
Field maps to a Column
Data Modelling
mongo: denormalised
RDBMS: normalised
Consider queries:
Get name and address of bob smith
mongo: single operation
RDBMS: join between people and addresses
Get all names
mongo must touch every document
RDBMS single operation
Documents have fields = like database fields /columns of table
Inserting a JSON document
db.dbis.insert({"name": "ken", "age": 40})
Notice: double quotes around both the key and value, unless the value is a number
You can type this into the terminal window when the mongodb client is running
dbis is the name of the collection
F28DM: Database Connectivity
22
Retrieving a JSON document
db.dbis.find({})
db.dbis.find({"name": "ken"})
Finds all documents in the database
Finds all documents that include the name-value pair "name" : "ken"
F28DM: Database Connectivity
23
Updating a JSON document
db.dbis.update({"name": "ken"}, {$set: {"office" : "G43"})
The document(s) to change
What is added to the document(s)
Document changed to: ({"name": "ken", "age": 40, "office": "G43"})
F28DM: Database Connectivity
24
http://docs.mongodb.org/manual/core/sharding-introduction/
Sharding
Documents split by shard key
Aims for uniform distribution
Timestamp is good
Limited domain, e.g. gender, is bad
Key must be
Immutable
Indexed
512 bytes max
Big Data Management
25
Choose your own shard key.
{CLICK for example}
{CLICK remove example}
Also necessary to split processor load evenly!
Once chosen a shard key it must be fixed so that the data isn't being moved around as it gets updated
(e.g. not going to choose number of tennis games played as the player plays another game so the document might have to be moved - not a good idea - instead use a Primary Key => eg social security number)
Replication
By default:
Client reads/writes to primary
Primary pushes updates to secondaries.
Then responds to client
Strong consistency!
Big Data Management
26
DATA is SHARDED across many servers based on the SHARD KEY… each server will also be replicated to other servers.
By Default MongoDB is CP -- consistency is main objective
Pushes to secondaries before accepting the write
Distributed transaction
Details: https://docs.mongodb.com/manual/core/replica-set-primary/
All members of the replica set can accept read operations. However, by default, an application directs its read operations to the primary member.
Document Stores
Big Data Management
26
Client
Primary
Secondary
Secondary
Secondary
CA: Guarantees to give a correct response but only while network works fine
(Centralised / Traditional)
Tunable towards AP
Where is mongoDB?
Big Data Management
27
C
A
P
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)
CP: strong consistency
Tunable to AP
27
Big Data Management
Document Stores
Tunable to AP
Client reads from secondary
Useful for some workloads, e.g.
Aggregation of values
Locality of data
Big Data Management
28
Clients WRITEs to PRIMARY / READs from SECONDARY
Higher AVAILABILITY but lower CONSISTENCY
Document Stores
Big Data Management
28
Client
Primary
Secondary
Secondary
Secondary
Client
Data Modelling
https://www.flickr.com/photos/18378305@N00/
Big Data Management
How to model relationships within JSON (also works for XML)
Aside on Schemas
Big Data Management
30
Aside on Schemas
Database Schema: database structure described in a formal language
Blueprint for the data
Relational Schema: defines the relations, attributes, constraints, view, indexes, etc, in SQL
Enforced by RDBMS
Upfront design effort: ER diagrams
Big Data Management
31
Company Example: ER Diagram
Big Data Management
32
| Department |
| number {PK} |
| name {AK} location [1..5] |
| Employee |
| ssn {PK} |
| name first Name last Name dateOfBirth gender /age |
manages
1
1..*
works for
1
0..1
| start date |
supervises
0..1
0..*
| Project |
| project number {PK} |
| project name {AK} |
controls
1
1..*
| hours |
works on
1..*
0..*
| Deadline |
| date |
| type |
1
0..*
has
Captures entities of interest and their relationships
Captures constraints
Time consuming to do right
Needs change over time!
/age derived field not stored value…
AK = ALTERNATE KEY
--- also uniquely identifies a row may be used like PK for WHERE but not usually used in JOINS
PK should be immutable – ie never changed
Notice in ERD - no Foreign Keys
Can have self joins
Can have many relations between entities (manages, works for)
Direction of the arrow is how you read it
Multiplicity of a Relation (0..1 to 0..* etc)
ER Diagrams
F28DM Database Systems – A.J.G. Gray
32
NoSQL Schemas
Schema-less/schema-free: no structure defined
Example: Objects in key-value store
Cannot be enforced
Flexible/dynamic schema: no structure defined with DBMS
Initial structure drawn-up for objects
Structure not enforced, (but could be)
Structure allowed to evolve as needs change
Note: most material does not distinguish between these
Big Data Management
33
TWO NoSQL Schemas
1) No structure defined at all.. just store VALUES in KV pairs in DOCUMENT
Hard to know what to expect is in each document in advance
2) Don't specify a structure in the DBMS but still do an initial design
using ER diagrams!
Flexible schema promotes prototyping – popular with developers
Document Stores
Big Data Management
33
Example: Employee
ER Entity
Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }
JSON Object
Big Data Management
34
| Employee |
| id {PK} |
| name first Name last Name dateOfBirth gender /age |
/age is derived => something to implement in the application or a VIEW based on the dateofBirth and now()
SCHEMA is defined by the document as the KEYS for this person.
Could be different for the next person's document.
first names as an array [ ]
Document Stores
Big Data Management
34
JSON Schema
Metadata about a JSON document
Describes
Data structure
Data types
Constraints: min/max values, required, etc
Human and machine readable
Validation: checks a document against a schema
http://json-schema.org/
Big Data Management
35
JSON schema is a way of validating your data inputs / updates
Document Stores
Big Data Management
35
Example: Employee
JSON Document
Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }
JSON Schema
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { … }, "required": [ "_id", "name", "date of birth", "gender” ] }
Big Data Management
36
Declares
- schema
required properties
Can be used in MongoDB to Validate input
WILL LOOK AT DETAILS ON NEXT SLIDE
Main things to notice here are that it sets out what is REQUIRED
{NEXT SLIDE for more details on properties}
Document Stores
Big Data Management
36
Example: Employee (properties)
JSON Document
Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }
JSON Schema
"properties": { "_id": { "type": "integer", "minimum": 1 }, "name": { … }, "date of birth": { "type": "string", "format": "date" }, "gender": { "type": "string", "enum": "[”M", ”F"]", }, "salary": { "type": "integer", "minimum": 10000 } }
Big Data Management
37
Defines properties
Names
Types
Format / restrictions NOTE: Look at the name section on the next slide
Document Stores
Big Data Management
37
Example: Employee (name)
JSON Document
Employee { "_id": 1, "name": { "first names": ["Mary", "Ann"], "surname": "McLeod" }, "date of birth": "1980-02-29", "gender": "F", "salary": 54000 }
JSON Schema
"name":
{ "type": "object", "properties": { "first names": { "type": "array", "items": { "type": "string” } }, "surname": { "type": "string” } }, "additionalProperties": false, "required": [ "first names", "surname” ] },
Big Data Management
38
Declares complex structure for name
first names as a string array [ ]
Eliminates the possibility of additional properties
Also note that this nested document has its own requirements
(for NAME) which are in addition to the higher level requirements (name, date of birth, gender)
Document Stores
Big Data Management
38
39
Big Data Management
db.createCollection( "people" , {
validator: { $jsonSchema: {
bsonType: "object",
required: [ "name", "surname", "email" ],
properties: {
name: {
bsonType: "string",
description: "required and must be a string" },
surname: {
bsonType: "string",
description: "required and must be a string" },
email: {
bsonType: "string",
pattern: "^.+\@.+$",
description: "required and must be a valid email address" },
year_of_birth: {
bsonType: "int",
minimum: 1900,
maximum: 2018,
description: "the value must be in the range 1900-2018" },
gender: {
enum: [ "M", "F" ],
description: "can be only M or F" }
}
}
}})
https://www.percona.com/blog/2018/08/16/mongodb-how-to-use-json-schema-validator/
MongoDB > db.people.insert( { name : "John", surname : "Smith" } )
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "[email protected]" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "john.smith.gmail.com" } )
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
EXAMPLE of using JSON Schema in MongoDB
Example of using JSON Schema for Validation
Fails as no email given
Fails as email is not a valid address (reg expression @ needed)
Works Not very helpful msgs - could say WHY it failed!!
Document Stores
Big Data Management
39
JSON Schema
Many more types of restrictions:
Pattern matching
Ranges
Referenced dependencies
…
References:
JSON Schema: http://json-schema.org/
Understanding JSON Schema Book https://spacetelescope.github.io/understanding-json-schema/index.html
Big Data Management
40
Schema Design for MongoDb
Big Data Management
41
Flexible Schema
Flexible:
Person & Cat have different data models.
Heterogenous data types!
Exercise!
Did that last JSON document make sense?
No!
Why would you have cat and human in an array called People?
Notice there is no recorded relationship between them!
The cat has nothing to do with Ken.
Example: Cars and People
Is there a relationship between Ken and the Toyota car?
mongoDB has no referential integrity
No join support
Application logic and two data calls
{ "_id": "ken", "name": "ken mcleod", "car": "ej59kjj" }
{ "_id": "ej59kjj", "make": "toyota" }
Big Data Management
45
$lookup function equivalent to LEFT JOIN
Just 2 documents but can see car reg is in both
However MongoDB has no ref integrity
{CLICK}
$lookup function in MONGODB does the same thing as a left join. --- those things in the LEFT entity that can also be found in the RIGHT entity are retrieved
Document Stores
Big Data Management
45
Nesting (aka Embedding)
Captures 1:1 relationships
Relationship explicit
Data retrieval in one call
Denormalised data
Data replication
{ "_id": "ken", "name": "ken mcleod", "car": { "_id": "ej59kjj", "make": "toyota" } }
Big Data Management
46
Nested Documents
Data is denormalised - different entities being stored together
in RDBMS we'd ensure all attributes relate to the PK in the entity
So information about people in a Person table
and information about Cars in a Vehicles table
Linked by a PK - FK relationship
Normal forms (1NF, 2NF, 3NF..)
Data replication if holding any other details about those cars in other documents.. (eg MOT date, last serviced, previous owners)
Document Stores
Big Data Management
46
Nesting (aka Embedding)
What if Ken owns two cars?
ej59kjj toyota ej61tcd ford
Big Data Management
47
Document Stores
Big Data Management
47
Nesting (aka Embedding)
What if Ken owns two cars?
Captures 1:M relationships
Relationship explicit
Data retrieval in one call
Denormalised data
Data replication
{ "_id": "ken", "name": "ken mcleod", "car": [ { "_id": "ej59kjj", "make": "toyota" }, { "_id": "ej61tcd”, "make": "ford” } ] }
Big Data Management
48
Data replication if holding any other details about those cars in other documents.. (eg MOT date, last serviced, previous owners)
Document Stores
Big Data Management
48
Nesting (aka Embedding)
What if Ken owns ten or more cars?
Embedding fails if M too large
16MB limit per document
Can use GridFS
Chunks document to get around 16MB limitation
Not a portable solution
{ "_id": "ken", "name": "ken mcleod", "car": [ { "_id": "ej59kjj", "make": "toyota" }, { "_id": "ej61tcd”, "make": "ford” }, ... ]}
Big Data Management
49
http://docs.mongodb.org/manual/core/gridfs/
GridFS comes in useful for storing media (images) to get around the 16MB limit per document
{CLICK} - ex of 10+ cars using Normalisation
Document Stores
Big Data Management
49
Normalisation
What if Ken owns ten or more cars?
Good for large M
Embed reference
Akin to foreign key
No referential integrity
_id Primary key: either
Automatically generated, or
User supplied
{ "_id": "ken", "name": "ken mcleod" }
{ "_id": "ej59kjj", "make": "toyota", "owner": "ken" }
{ "_id": "ej61tcd", "make": "ford", "owner": "ken" }
Big Data Management
50
http://docs.mongodb.org/manual/core/data-modeling-introduction/
HERE putting owner as field in each CAR document (FK)
mongoDB does not force constraints, unlike a RDBMS
– no check that the foreign key is a primary key in another ‘collection’
If you wish to force/check relationships, do that in your application code – means the DBMS is simpler but the software developed has to take care of referential integrity
Anyone think of any issues with this?
> What if you have a web page, desktop client, and a native mobile app?
>> Each application needs to work the same way – each dev team needs to implement the same methods – rather than pushing the logic a single place on the DBMS server where the data are stored = can increase risk of data integrity issues
Document Stores
Big Data Management
50
Normalisation
Data Integrity
mongoDB does not force constraints
unlike an RDBMS.
Application code must check relationships
Move responsibility to application developer
Simplifies DBMS
Big Data Management
51
Can validate data against a JSON schema – but optional if you create this and use it
Each document can be storing very different things – but then hard to actually use – how do you know names of the KEYs
Document Stores
Big Data Management
51
Normalisation: Trees
Big Data Management
52
Could store data as:
> Parent references:
The PARENT ID is STORED In the CHILD NODE (see tree figure).
-- would need many DBMS calls to trace the tree back following the reference links
> OR Child references: The child’s id stored in parent node
> OR Optionally array of all ancestors
Select approach according to the application to:
Maximise performance
Minimise updating
Reduce redundancy
Document Stores
Big Data Management
52
{"_id": "alasdair", "name": "Alasdair Gray", "mother": "jane",
"father": "alex"}
{"_id": "jane", "name": "Jane McLeod", "mother": "mary",
"father": "oliver"}
{"_id": "mary", ...}
{"_id": "oliver", ...}
{"_id": "alex", "name": "Alexander Gray", "mother": "may",
"father": "alexander"
{"_id": "may", ...}
{"_id": "alexander", ...}
Normalisation: Trees
Can capture tree structures
Parent references: parent’s id stored in child node
Child references: child’s id stored in parent node
Optionally array of all ancestors
Materialised path: ids of path to route
Application must follow reference links
Multiple data calls
Approach depends on application domain. Select approach that will
Maximise performance
Minimise updating
Reduce redundancy
https://docs.mongodb.com/manual/applications/data-models-tree-structures/
Big Data Management
53
Mongo Example: Query a Nested Document
mongoDB to work with a NESTED DOCUMENTS use dot notation
e.g.
db.inventory.insertMany ([
{item:”book”, qty:25,size {h:140,w:200}},
{item:”pencil”, qty:10,size {h:100,w:15}},
{item:”paper”, qty:100,size {h:25,w:30}}
]);
db.inventory.find ( { “size.h” : {$gt:50} } )
Big Data Management
54
Use dot notation to access the nested document fields
Summary
Document
Semi-structured
XML
YAML
JSON
BSON
No standard query language
API access
Flexible-schema
JSON Schema
Denormalise for data retrieval
mongoDB
BSON
“Very relational”
No referential integrity
Complexity pushed to apps
Sharding key
Replication: primary/secondary
CAP
CP: Strong consistency (default)
AP: Eventual consistency (tunable)
Big Data Management
55
Compare SQL commands to MongoDB
https://docs.mongodb.com/manual/reference/sql-comparison/
Big Data Management
56
mongoDB Lab
Covers:
Import
Create
Read
Update
Delete
Optimisation & indices
Export
Will take around 1 hour
Big Data Management
57
No lecture on Thu this week
Document Stores
Big Data Management
57
Inserting a JSON document
db.dbis.insert({"name": "ken", "age": 40})
Notice: double quotes around both the key and value, unless the value is a number
You can type this into the terminal window when the mongodb client is running
dbis is the name of the collection
F28DM: Database Connectivity
58
Retrieving a JSON document
db.dbis.find({})
db.dbis.find({"name": "ken"})
Finds all documents in the database
Finds all documents that include the name-value pair "name" : "ken"
F28DM: Database Connectivity
59
Updating a JSON document
db.dbis.update({"name": "ken"}, {$set: {"office" : "G43"})
The document(s) to change
What is added to the document(s)
Document changed to: ({"name": "ken", "age": 40, "office": "G43"})
F28DM: Database Connectivity
60