Big Data Management Question Paper
ppt/slides/slide1.xml
Wide-Column Stores Big Data Management Phil Bartie [email protected] EM G.29 Using material from Alasdair Gray, HWU Aidan Hogan, Universidad de Chile Guillaume Marquis https://www.tutorialspoint.com/cassandra/ https://pandaforme.gitbooks.io/introduction-to-cassandra/
ppt/slides/slide2.xml
Materials released under CC-BY License You are free to: Share — copy and redistribute the material in any medium or format Adapt — remix, transform, and build upon the material for any purpose, even commercially. The licensor cannot revoke these freedoms as long as you follow the license terms. Under the following terms: Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. Big Data Management 2
ppt/slides/slide3.xml
Database Landscape
Relational: Oracle, MySQL, MS SQL Server, PostgreSQL, DB2, MS Access, SQLite, Teradata, SAP Adaptive Server, FileMaker, Hive, MariaDB, Informix, Vertica
NoSQL
  Document: MongoDB, Elasticsearch, DynamoDB, CouchBase
  Key-Value: Redis, Memcached, Riak KV, Aerospike, SimpleDB
  Graph: Neo4J, Titan, Giraph, InfiniteGraph
  RDF: Virtuoso, Stardog, GraphDB, Blazegraph, Jena, RDF4J
  NewSQL: SAP HANA, Google Spanner, Clustrix, VoltDB, MemSQL, NuoDB
  Object: Caché, Db4o, Versant, ObjectStore
  Wide-Column: Cassandra, HBase, Accumulo, HyperTable
  XML: MarkLogic, Sedna, Tamino, BaseX, eXist-db
ppt/slides/slide4.xml
Relational Databases Recap Two-dimensional tables Relationships between tables Fixed schema Homogeneous Highly structured NULLs – arrghh! Source : http://excel.quebec/attachments/Image/excel-quebec-requete-sql-excel-1.jpg Big Data Management 4
ppt/slides/slide5.xml
Relational Databases
Two-dimensional: array of arrays
Fixed schema:
  Defined in advance
  Hard to change
Highly structured
Homogeneous structure:
  All rows have the same columns
  Every value in a column has the same data type
NULLs are problematic:
  Ambiguous meaning
  Variable semantics in query processing
ppt/slides/slide6.xml
Key-Value and Tabular

Key-Value = a Distributed Map
Countries
  Primary Key  | Value
  Afghanistan  | capital:Kabul,continent:Asia,pop:31108077#2011
  Albania      | capital:Tirana,continent:Europe,pop:3011405#2013
  …            | …

Tabular = Two-dimensional Maps
Countries
  Primary Key  | capital | continent | pop-value | pop-year
  Afghanistan  | Kabul   | Asia      | 31108077  | 2011
  Albania      | Tirana  | Europe    | 3011405   | 2013
  …            | …       | …         | …         | …
ppt/slides/slide7.xml
Wide-Column Stores
"a sparse, distributed, persistent, multi-dimensional, sorted map"
Sparse: not every column has a value (i.e. not a dense square)
Distributed: each node has the same role, so there is no single point of failure
Masterless: each node can service any request
New nodes can be added without downtime
Keyspace: container for column families
Column Family: container for rows
Rows: ordered columns
https://www.tutorialspoint.com/cassandra/cassandra_data_model.htm
ppt/slides/slide8.xml
Wide-column model
"a sparse, distributed, persistent, multi-dimensional, sorted map."
  sparse: not all values form a dense square
  distributed: lots of machines
  persistent: disk storage (GFS)
  multi-dimensional: multiple values in columns
  sorted: sorting lexicographically by row key
  map: look up a key, get a value
ppt/slides/slide9.xml
Wide-Column Store Model
Keyspace: container for one or more column families
  Defines replication factor (n) and strategy
  Similar to a database in the relational model
Column Family: container for an ordered collection of rows
  Static: columns defined in advance
  Dynamic: columns defined each time a row is inserted
Row: set of ordered columns
  Columns not predefined
  Each row defines its columns
  Arbitrary number of columns, which can be very large
  Related data
Column: basic data structure consisting of
  Name (key): the identifier for the column
  Value: the contents of the cell
  Timestamp: when the value was last updated
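The model above can be read as a map of maps. The following is a minimal Python sketch of that reading, not how any real store is implemented: the names `Column`, `ColumnFamily`, `put` and `get_row` are hypothetical, plain dictionaries stand in for the storage engine, and the "newest timestamp wins" rule is an illustrative assumption about how the timestamp might be used.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """A column is a (name, value, timestamp) triple."""
    name: str
    value: object
    timestamp: int


class ColumnFamily:
    """Container for rows; every row carries its own set of columns."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row key -> {column name -> Column}

    def put(self, row_key, column_name, value, timestamp):
        row = self.rows.setdefault(row_key, {})
        current = row.get(column_name)
        # Assumption for this sketch: the newest timestamp wins.
        if current is None or timestamp >= current.timestamp:
            row[column_name] = Column(column_name, value, timestamp)

    def get_row(self, row_key):
        """Return the columns of one row, ordered by column name."""
        row = self.rows.get(row_key, {})
        return [row[name] for name in sorted(row)]


# A keyspace is a container for column families.
keyspace = {"countries": ColumnFamily("countries")}
countries = keyspace["countries"]

countries.put("Afghanistan", "continent", "Asia", timestamp=1)
countries.put("Afghanistan", "capital", "Kabul", timestamp=1)
# Sparse: this row simply has no "continent" column.
countries.put("Albania", "capital", "Tirana", timestamp=1)

for column in countries.get_row("Afghanistan"):
    print(column.name, column.value, column.timestamp)
```

Rows in the same column family need not share columns, which is what "sparse" means here: a missing column takes no space and needs no NULL.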
ppt/slides/slide10.xml
Column Family (Table)
ppt/slides/slide11.xml
Row Big Data Management 11 Row: smallest unit that stores related data Data partition mechanism https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html
ppt/slides/slide12.xml
Row
Row: stores related data
Rows: stored within a column family
  Unit of partitioning: each row in a family can go on a different node
Row key: uniquely identifies a row in a column family
  Used to partition data
Row: a set of columns
Column is a triple: (key, value, timestamp)
  Column key: uniquely identifies a column value in a row
  Column value: stores one value or a collection of values
  Column timestamp: captures last edit time
https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html
ppt/slides/slide13.xml
Keys Big Data Management 13 Composite Row Key Composite Column Key https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html
ppt/slides/slide14.xml
Column Family View: Single-row partitions Big Data Management 14 https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html
ppt/slides/slide15.xml
Column Family: Multi-row partitions Big Data Management 15 https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html
ppt/slides/slide16.xml
Wide-Column Advantages Highly scalable: designed for distributing across: Cluster Data centres Data manipulation: includes limited query language Data stored in sorted order Wide-columns: increased granularity of operation Not affected by increasing number of rows Big Data Management 16
ppt/slides/slide17.xml
Cassandra Wide-Column Store Big Data Management
ppt/slides/slide18.xml
FB Messenger
1.3 billion users
21 billion images sent per month
Search requires an inverted index
  Search term to message id
Continuous data arrival
Instantaneous responses
Cassandra developed as a solution
https://www.messenger.com/
Facebook Stats (2018): https://www.messenger.com/messengerfacts
ppt/slides/slide19.xml
Cassandra History
Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009 it became an Apache Incubator project. On February 17, 2010 it graduated to a top-level project. Facebook developers named their database after the Trojan mythological prophet Cassandra, with classical allusions to a curse on an oracle.
https://en.wikipedia.org/wiki/Apache_Cassandra
Free and open-source
Distributed
Wide Column Store
NoSQL database
Masterless replication
  Each node has same role
Low latency
Can add more hardware nodes with no downtime
Should always be able to read/write to Cassandra
Consistency can be adjusted, at the expense of availability
Secondary Index support is weak (single columns only; equality comparisons only)
ppt/slides/slide20.xml
20 Big Data Management http://cassandra.apache.org/
ppt/slides/slide21.xml
21 CONSISTENT HASHING https://www.scnsoft.com/blog/cassandra-performance
ppt/slides/slide22.xml
22 https://www.scnsoft.com/blog/cassandra-performance
ppt/slides/slide23.xml
Cassandra Write Path
QUORUM consistency: ⌊n/2⌋ + 1 replicas (n/2 rounded down, plus 1), where n = replication factor
ppt/slides/slide24.xml
24 https://www.scnsoft.com/blog/cassandra-performance
ppt/slides/slide25.xml
Distributed, Replicated and Fault Tolerant
Consistent Hashing
  Hashed to ring
  Order preserving hash function
Gossip style membership algorithm
Data replication
Eventual Consistency
Merkle Tree
ppt/slides/slide26.xml
Where is Cassandra?
C A P
CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a "best-effort" response even in presence of network failures (Eventual consistency)
But (like Dynamo), tables are tunable towards CP
ppt/slides/slide27.xml
Tuneable Consistency
Write = Commit Log + Memtable
Quorum = Majority of replicas: ⌊R/2⌋+1 for R the replication factor
Hinted handoff: central 3 hour TODO log (not readable)

Level    Explanation
ANY      One replica node or a hinted handoff
ONE      One replica node (hinted handoff not enough)
TWO      Two replica nodes
THREE    Three replica nodes
QUORUM   A quorum of replica nodes
ALL      All replica nodes

The levels run from highest availability (ANY) to highest consistency (ALL).
ppt/slides/slide28.xml
Tuneable Consistency
https://blog.imaginea.com/consistency-tuning-in-cassandra
For write operations, ANY is the lowest consistency (but highest availability), and ALL is the highest consistency (but lowest availability).
For read operations, ONE is the lowest consistency (but highest availability), and ALL is the highest consistency (but lowest availability).
QUORUM is a good middle ground, ensuring strong consistency while still tolerating some level of failure.

Level    Explanation
ANY      One replica node or a hinted handoff
ONE      One replica node (hinted handoff not enough)
TWO      Two replica nodes
THREE    Three replica nodes
QUORUM   A quorum of replica nodes
ALL      All replica nodes

The size of the quorum is calculated as (replication_factor / 2) + 1, with the division rounded down.
Replication factor: the total number of replicas across the cluster.
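The quorum arithmetic can be checked with a few lines of Python. This is a sketch for working through the numbers only; the function names are hypothetical, and the overlap rule in `read_sees_latest_write` (read replicas + write replicas > replication factor) is the usual textbook condition for a read to overlap the latest write, added here as background rather than taken from the slides.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


for rf in (1, 2, 3, 5):
    q = quorum(rf)
    # Replicas that can be down while QUORUM requests still succeed.
    print(f"RF={rf}: quorum={q}, tolerated failures={rf - q}")
```

With a replication factor of 3 the quorum is 2, so one replica can be down. Reading and writing at QUORUM gives 2 + 2 > 3, whereas reading and writing at ONE gives 1 + 1, which is not greater than 3, so a read may miss the latest write.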
ppt/slides/slide29.xml
Cassandra Query Language (CQL) SQL-like declarative query language Big Data Management 29
ppt/slides/slide30.xml
CQL
SQL-like declarative query language
  Lowers the entry barrier for RDBMS folk
Filters only on indexed columns
No joins
Aggregation:
  Functions: max, min, sum, avg, count
  Group by: limited to key columns
ppt/slides/slide31.xml
CQL: Create Keyspace (Database)
CQL
  Create Keyspace
    CREATE KEYSPACE MyKeySpace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
  Load in keyspace
    USE MyKeySpace;
MySQL (equivalent)
  Create Database
    CREATE DATABASE MyKeySpace;
  Load in Database
    USE MyKeySpace;
ppt/slides/slide32.xml
CQL: Create Column Family (Table)
CQL
  Create Column Family
    CREATE COLUMNFAMILY MyColumns ( id varint, lastname varchar, firstname varchar, PRIMARY KEY ( id ));
  Load data
    INSERT INTO MyColumns ( id, lastname, firstname ) VALUES ( 1, 'Doe', 'John' );
MySQL (equivalent)
  Create Table
    CREATE TABLE MyColumns ( id int NOT NULL, lastname varchar(50), firstname varchar(100), PRIMARY KEY ( id ));
  Load data
    INSERT INTO MyColumns ( id, lastname, firstname ) VALUES ( 1, 'Doe', 'John' );
ppt/slides/slide33.xml
CQL: Retrieve data CQL Retrieve all rows SELECT * FROM MyColumns ; MySQL (equivalent) Retrieve all rows SELECT * FROM MyColumns ; Big Data Management 33
ppt/slides/slide34.xml
CQL: Retrieve data CQL Retrieve row 1 SELECT * FROM MyColumns WHERE id = 1 ; MySQL (equivalent) Retrieve id 1 SELECT * FROM MyColumns WHERE id = 1 ; Big Data Management 34
ppt/slides/slide35.xml
CQL: Retrieve data
CQL
  Retrieve all Johns
    SELECT * FROM MyColumns WHERE firstname = 'John';
  This fails with:
    Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.
  Index the column first:
    CREATE INDEX ON MyColumns (firstname);
MySQL (equivalent)
  Retrieve all Johns
    SELECT * FROM MyColumns WHERE firstname = 'John';
ppt/slides/slide36.xml
How is this different from RDBMS?
  ALTER TABLE users ADD birth_date INT;
New columns can be added on the fly while running and processing queries.
In a static-column storage engine, each row must reserve space for every column.
https://www.datastax.com/dev/blog/schema-in-cassandra-1-1
ppt/slides/slide37.xml
Using Columns

RDBMS approach (one row per reading)
  siteid | date       | mean_temp
  1      | 2012-09-01 | 20.6
  1      | 2012-09-02 | 21.9
  1      | 2012-09-03 | 21.7

CASSANDRA approach (one wide row per site, one column per date)
  siteid | 2012-09-01 | 2012-09-02 | 2012-09-03
  1      | 20.6       | 21.9       | 21.7
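The difference between the two layouts can be sketched in Python by pivoting one-row-per-reading data into one wide row per site. This is an illustration of the data shape only; `to_wide_rows` is a hypothetical helper, and the three readings are the site 1 values for 2012-09-01 to 2012-09-03 from the slide.

```python
from collections import defaultdict

# One tuple per reading: (siteid, date, mean_temp).
readings = [
    (1, "2012-09-03", 21.7),
    (1, "2012-09-01", 20.6),
    (1, "2012-09-02", 21.9),
]


def to_wide_rows(rows):
    """Group readings by site; each date becomes a column name."""
    wide = defaultdict(dict)
    for site_id, date, mean_temp in rows:
        wide[site_id][date] = mean_temp
    # Keep the columns of every row sorted by column name (the date).
    return {site: dict(sorted(cols.items())) for site, cols in wide.items()}


wide_rows = to_wide_rows(readings)
print(wide_rows[1])
```

Because the column names are dates and columns are kept in sorted order, all readings for a site sit together and a date range is a contiguous slice of one row.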
ppt/slides/slide38.xml
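CQL: Consistency
  SELECT totalsales FROM sales USING CONSISTENCY QUORUM WHERE customerid = 5;
  UPDATE sales USING CONSISTENCY ONE SET totalsales = 50000 WHERE customerid = 4;
Note: the USING CONSISTENCY clause belongs to early CQL versions. In CQL 3 the consistency level is set by the client rather than in the statement, for example with the cqlsh command CONSISTENCY QUORUM; issued before the query.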
ppt/slides/slide39.xml
Limitations of CQL No join or subquery support, and limited support for aggregation. - This is by design , to force you to denormalize into partitions that can be efficiently queried from a single replica, instead of having to gather data from across the entire cluster. A single column value may not be larger than 2GB - in practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob values. The maximum number of cells (rows x columns) in a single partition is 2 billion. Big Data Management 39 https://wiki.apache.org/cassandra/CassandraLimitations
ppt/slides/slide40.xml
Using Bloom Filters for Fast Data Retrieval
Each SSTable (Sorted Strings Table) has an associated Bloom Filter
Bloom Filter
  Stored in memory
  Highly efficient
  Can produce false positives
ppt/slides/slide41.xml
Bloom Filters
Efficient test for data location
Hash object on insert using k hash functions
  Set bit to 1
Hash object on read using k hash functions
  Any 0s then not present
  Bit would have been set to 1 on insert
They can give FALSE POSITIVES, but not FALSE NEGATIVES, so they are a good way to check whether data has been processed before
Video on Bloom Filters: https://youtu.be/bEmBh1HtYrw
ppt/slides/slide42.xml
Bloom Filters: Insert A
Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters
Hash object on insert using k hash functions
  Set bit to 1
e.g. input word = 'aardvark'
  Output from hash function 1 = 3
  Output from hash function 2 = 1
  Output from hash function 3 = 14
ppt/slides/slide43.xml
Bloom Filters: Insert B
Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters
Hash object on insert using k hash functions
  Set bit to 1
e.g. input word = 'bat'
  Output from hash function 1 = 16
  Output from hash function 2 = 1
  Output from hash function 3 = 7
ppt/slides/slide44.xml
Bloom Filters: Read Y
Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters
Hash object on read using k hash functions
  Any 0s then not present (the bit would have been set to 1 on insert)
e.g. input word = 'elephant'
  Y not present
ppt/slides/slide45.xml
Bloom Filters: Read X
Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters
Hash object on read using k hash functions
  All 1s then data may be present (the bits would have been set to 1 on insert)
e.g. input word = 'bat'
  Hash results [16, 1, 7]
e.g. input word = 'snake'
  Hash results [1, 14, 16]: all three bits were set by 'aardvark' and 'bat', so 'snake' looks present although it was never inserted (a false positive)
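The insert and read slides can be replayed in a few lines of Python. This sketch does not hash anything: it reuses the hash outputs quoted on the slides for 'aardvark', 'bat' and 'snake'. The array size of 18 bits is an assumption (the slides only show that position 16 exists), and `slide_hashes`, `insert` and `might_contain` are hypothetical names.

```python
NUM_BITS = 18  # assumed size; must be larger than the highest position used (16)
bits = [0] * NUM_BITS

# Hash outputs as given on the slides (k = 3 hash functions).
slide_hashes = {
    "aardvark": [3, 1, 14],
    "bat": [16, 1, 7],
    "snake": [1, 14, 16],
}


def insert(word):
    """Set the k bits for this word to 1."""
    for position in slide_hashes[word]:
        bits[position] = 1


def might_contain(word):
    """False means definitely absent; True means possibly present."""
    return all(bits[position] == 1 for position in slide_hashes[word])


insert("aardvark")
insert("bat")

print(might_contain("bat"))    # True: it was inserted
print(might_contain("snake"))  # True: never inserted, a false positive
```

'snake' is reported as possibly present because positions 1 and 14 were set by 'aardvark' and position 16 by 'bat'. This is why a positive answer still needs a real lookup, while a negative answer can be trusted.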
ppt/slides/slide46.xml
Bloom Filters Bloom Filters on locality groups – avoid searching Efficient test for set membership: member(key) true/false False => definitely not in the set, no need for lookup True => probably is in the set (do lookup to make sure and get value) Used in Google BigTable Avoid reading all SSTables for elements that are not present (at least mostly avoid it) Saves many seeks 2017 Big Data Management 46
ppt/slides/slide47.xml
Bloom Filters
m bit positions x[0], …, x[m-1], initialised to 0
k hash functions hash_1, …, hash_k: each maps an object (obj) to one position
for insert: compute k bit locations, set them to 1
  set x[hash_1(obj)], …, x[hash_k(obj)] to 1
for lookup: compute k bit locations x[hash_1(obj)], …, x[hash_k(obj)]
  all bits = 1 => return true (may be wrong)
  any bit = 0 => return false
1% error rate ~ 10 bits/element
Good to have some a priori idea of the target set size
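The insert and lookup rules above can be sketched as a small Python class. This is a minimal illustration, not how Cassandra or Bigtable implement their filters: the class and function names are hypothetical, and deriving the k hash functions by salting SHA-256 with the function index is an assumption chosen for simplicity. The sizing helpers use the standard formulas for an optimally configured filter, which is where the "1% error rate ~ 10 bits/element" rule of thumb comes from.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits      # m
        self.num_hashes = num_hashes  # k
        self.bits = [0] * num_bits

    def _positions(self, obj):
        # Assumption: simulate k hash functions by salting one hash with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{obj}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, obj):
        for position in self._positions(obj):
            self.bits[position] = 1

    def lookup(self, obj):
        """True may be wrong (false positive); False is always right."""
        return all(self.bits[position] for position in self._positions(obj))


def bits_per_element(error_rate):
    """m / n for an optimally sized filter: -ln(p) / (ln 2)^2."""
    return -math.log(error_rate) / (math.log(2) ** 2)


def optimal_num_hashes(error_rate):
    """k = (m / n) * ln 2, rounded to a whole number of hash functions."""
    return round(bits_per_element(error_rate) * math.log(2))


expected_elements = 1000
m = math.ceil(bits_per_element(0.01) * expected_elements)
bloom = BloomFilter(m, optimal_num_hashes(0.01))
for word in ("aardvark", "bat"):
    bloom.insert(word)

print(round(bits_per_element(0.01), 2), optimal_num_hashes(0.01))
print(bloom.lookup("bat"))
```

The sizing step needs the expected number of elements up front, which is why it helps to have an a priori idea of the target set size.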
ppt/slides/slide48.xml
DEMO of a BLOOM Filter https://llimllib.github.io/bloomfilter-tutorial/
ppt/slides/slide49.xml
Cassandra vs MongoDB
https://scalegrid.io/blog/cassandra-vs-mongodb/
Problem domain needs a rich data model = MongoDB
Need secondary indexes and flexibility in the query model = MongoDB (by contrast, Cassandra secondary indexes only support single columns and equality comparisons)
100% uptime = Cassandra
Write scalability = Cassandra
Query language support = CQL is similar to SQL
The Apache Cassandra database is a good choice when you need scalability and high availability without compromising performance, and with no single point of failure.
ppt/slides/slide50.xml
Summary: Wide-column stores Big Data Management 50 Conceptually a big table Sparse: not all cells have values Distributed and persistent Multi-dimensional: multiple values Sorted Map Model: Keyspace: container for column families Column Family: container for rows Rows: Set of ordered columns Column: (name, value, timestamp) Cassandra: Distributed, replicated, and fault tolerant SQL-like query language Bloom filters: efficient data presence testing
ppt/slides/slide51.xml
Reading Big Data Management 51 https://blog.panoply.io/cassandra-vs-mongodb Cassandra Vs MongoDB In 2018 by Matan Sarig https://www.youtube.com/watch?v=B_HTdrTgGNs Apache Cassandra introduction video
ppt/notesSlides/notesSlide1.xml
Remind students of the key difference between document and key-value stores.
Document databases can INDEX on non-key attributes (e.g. MongoDB): the VALUE (i.e. the document) is in a format the DBMS 'understands' (e.g. JSON / BSON).
Key-Value stores only index on the KEY: the value is not 'understood' by the DBMS.
This week we are taking a look at Wide-Column Stores, or BigTables.
ppt/notesSlides/notesSlide2.xml
The last of the 4 NoSQL database types we'll be looking at... Wide-Column Stores Big Data Management 3
ppt/notesSlides/notesSlide3.xml
Two-dimensional: array of arrays
Fixed schema:
  Defined in advance
  Hard to change
Highly structured
Homogeneous structure:
  All rows have the same columns
  Every value in a column has the same data type
NULLs are problematic: need to use field IS NULL rather than field = NULL etc.
(Question for the class) Does NULL = NULL? In MySQL and PostgreSQL the answer would be NULL (makes sense); but MS SQL Server gives answer: false !?
ppt/notesSlides/notesSlide5.xml
Key – Value ---- value can be anything like REDIS {CLICK} Tabular – can have lots of columns – 1 value per column - each column has a header row - 1 of cols holds the PK We call this a ROW-ORIENTED database Wide-Column Stores Big Data Management 6
ppt/notesSlides/notesSlide6.xml
sparse : not all values form a dense square distributed : lots of machines persistent : disk storage multi-dimensional : multiple values in columns sorted map : look up a key, get a value -- the keys are sorted Keyspace : container for one or more column families + A keyspace is like a database in a relational DBMS + Defines replication factor (e.g. 3) and strategy Column Family: container for an ordered collection of rows -- like a TABLE in a RDBMS Static: columns defined in advance Dynamic: columns defined each time a row is inserted Row: set of ordered columns Columns not predefined Each row defines its columns Arbitrary number of columns – can be very large Related data Column: basic data structure consisting of Name (key): the identifier for the column Value: the contents of the cell Timestamp: when was the value updated Wide-Column Stores Big Data Management 7
ppt/notesSlides/notesSlide8.xml
WideColumnStores / BigTable : set of rows with a similar structure {CLICK} The columns can be different on each row Wide-Column Stores Big Data Management 10
ppt/notesSlides/notesSlide9.xml
Column family (Table) consists of ROWS Row: stores related data - has a row key, set of columns Row key: uniquely identifies a row in a column family Used to partition data Row: a set of columns Column is a triple: (key, value, timestamp) Column key: uniquely identifies a column value in a row Column value: stores one value or a collection of values Column timestamp: captures last edit time {CLICK – to see next slide on COL KEY approaches} Wide-Column Stores Big Data Management 11
ppt/notesSlides/notesSlide11.xml
COMPOSITE ROW KEY: Album:year (colon separator)
COLUMN KEY could also be composite
There is a choice in modelling the sub-structure of tracks, for example all the track names in a single column, with components separated by a colon, OR the column key as the track number, storing the title as the value.
ppt/notesSlides/notesSlide12.xml
Looking at the DATA as a TABLE and as a COLUMN FAMILY VIEW… e.g. Artist details stored using single-row partitions Artist as key -- column keys (e.g. born) and values Although these are column families the overall data is stored in a ROW – so this is a ROW ORIENTED DBMS Wide-Column Stores Big Data Management 14
ppt/notesSlides/notesSlide13.xml
Or for an ALBUM might be more sensible to store row key as Album title : Year column keys for each track number and value is the track title might wish to store another set of columns for track:length value = track length in minutes 1:length 3.2mins 2: length 3.3mins etc Wide-Column Stores Big Data Management 15
ppt/notesSlides/notesSlide14.xml
Apache open source datastore Takes ideas from Bigtable and Amazon Dynamo Combination key-value and column-family store Wide-Column Stores Big Data Management 17
ppt/notesSlides/notesSlide15.xml
Cassandra developed at Facebook Was released as an Open Source project in 2008 Wide-Column Stores Big Data Management 18
ppt/notesSlides/notesSlide16.xml
The name Cassandra comes from the Trojan myth of the prophet Cassandra, with an allusion to a curse on an oracle (obviously a nod to Oracle DB).
Can ADD more NODES without downtime
Masterless
Secondary INDEX support is weak
  single columns only
  equality comparisons only
ppt/notesSlides/notesSlide17.xml
Used by some big companies like:
  Apple (75k nodes)
  Netflix (2500 nodes, 420TB data, 1 trillion requests per day)
  eBay (100 nodes, 250TB per day)
  Chinese search engine Easou (270 nodes, 300TB, 800M requests per day)
Decentralised, with no single point of failure
Every node in the cluster can do all jobs (read/write)
Fault tolerant with automatic replication between nodes
Open source, but with various supporting companies such as DataStax, which offer training, support etc.
ppt/notesSlides/notesSlide18.xml
CONSISTENT HASHING HASH of key determines first node to write the data…. ...data is then replicated to other nodes based on replication factor ( eg 3) {CLICK for next slide view at each node of the write procedure} Wide-Column Stores Big Data Management 21
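The placement rule in these notes (hash the key to find the first node, then replicate to further nodes according to the replication factor) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `Ring`, `token` and the node names are hypothetical, SHA-256 is a stand-in for the partitioner's hash function, each node owns a single token, and replicas are simply the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (stand-in hash function)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds number of nodes")
        self.replication_factor = replication_factor
        # Each node is placed on the ring at the hash of its name.
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, row_key):
        """First node clockwise from the key's token, then the next ones."""
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
print(ring.replicas("Afghanistan"))
print(ring.replicas("Albania"))
```

Any node can compute the same replica list from the key alone, which is what lets a masterless cluster route requests without a central directory.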
ppt/notesSlides/notesSlide19.xml
After being directed to a specific node, a write request first gets to the commit log (it stores all the info about in-cache writes). At the same time, the data gets stored in the memtable . At some point (for instance, when the memtable is full), Cassandra flushes the data from cache onto the disk – into SSTables ( Sorted Strings Table) After a node writes the data, it notifies the coordinator node about the successfully completed operation. Wide-Column Stores Big Data Management 22
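The write path described in these notes can be sketched as a toy Python class. It is a deliberately simplified model, not Cassandra's implementation: the `Node` class and its method names are hypothetical, the "disk" is just a list of sorted lists, and the flush threshold of three entries is an arbitrary assumption. The `read` method is only there to show that data stays reachable after a flush.

```python
class Node:
    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory copy of recent writes
        self.sstables = []     # flushed tables, each sorted by key
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # The write goes to the commit log and to the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # acknowledgement sent back to the coordinator

    def flush(self):
        """Write the memtable out as a sorted, immutable table."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest flushed table first, so later writes shadow older ones.
        for table in reversed(self.sstables):
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None


node = Node()
for key, value in (("b", 2), ("a", 1), ("c", 3)):
    node.write(key, value)   # the third write fills the memtable and flushes
node.write("a", 10)          # newer value, still only in the memtable

print(node.sstables)
print(node.read("a"), node.read("b"), node.read("z"))
```

Writes never modify a flushed table; a newer value simply shadows the older one, which keeps the write path to an append plus an in-memory update.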
ppt/notesSlides/notesSlide20.xml
Cassandra is masterless, so the coordinator could be any node. The coordinator is responsible for satisfying the client's request. The consistency level determines the number of nodes that the coordinator needs to hear from in order to notify the client of a successful WRITE.
All inter-node requests are sent through a messaging service and in an asynchronous manner. So here the write is sent to nodes 1, 2 and 3. The coordinator waits for a response from the number of nodes required to satisfy the consistency level.
QUORUM is a commonly used consistency level which refers to a majority of the replicas, ⌊n/2⌋ + 1, where n is the replication factor. QUORUM for a replication factor of 3 needs 2 nodes to respond. If it doesn't hear back, the coordinator will wait for at most 10 seconds (default setting) and then send the outcome to the client (success or failure of the write request).
If a NODE is down, the coordinator stores the write as a missed write and will try again later: this is a hinted handoff. Hints are only kept for a limited window (3 hours by default); a node that stays down for longer has to be brought back in sync by a repair.
ppt/notesSlides/notesSlide21.xml
When a read request starts its journey, the data's partition key is used to find the node the data is on (based on the range held on each node).
The node checks the memtable. If the data is not there, it checks the row key cache (if enabled), then the Bloom filter and then the partition key cache.
Cassandra uses Bloom filters to determine whether an SSTable (Sorted Strings Table) has data for a particular partition. Bloom filters are not used for range scans, but are used for index scans.
We'll talk about Bloom filters in more detail shortly.
ppt/media/image18.tif
ppt/notesSlides/notesSlide22.xml
Gossip Style Membership Algorithm:
> Cassandra uses gossiping for peer discovery and metadata propagation.
> The gossip process runs every second on every node and exchanges state messages with up to three other nodes in the cluster.
> Since the whole process is decentralized, nothing coordinates the nodes' gossiping. Each node independently selects one to three peers to gossip with. (Like Amazon DynamoDB, which we talked about in Key-Value stores.)
Cassandra's anti-entropy service uses Merkle trees to detect inconsistencies in data between replicas. (Also like Amazon DynamoDB, which we talked about in Key-Value stores.) When the nodetool repair command is executed, the target node specified with the -h option coordinates the repair of each column family in each keyspace. The repair coordinator node requests a Merkle tree from each replica for a specific token range in order to compare them. Each replica builds its Merkle tree by scanning the data stored locally in the requested token range. The repair coordinator node compares the Merkle trees, finds all the sub token ranges that differ between the replicas, and repairs the data in those ranges.
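The Merkle tree comparison can be sketched as below. This is a minimal illustration, assuming a power-of-two number of leaves, SHA-256 as the hash, and one leaf per sub token range; the function names and sample data are hypothetical and do not reflect Cassandra's internal tree layout.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).digest()


def build_tree(sub_ranges):
    """Return the tree as a list of levels: leaf hashes first, root last.
    Assumes len(sub_ranges) is a power of two."""
    level = [digest(repr(content).encode("utf-8")) for content in sub_ranges]
    levels = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    top = len(tree_a) - 1
    suspects = [0] if tree_a[top][0] != tree_b[top][0] else []
    for depth in range(top - 1, -1, -1):
        suspects = [child
                    for parent in suspects
                    for child in (2 * parent, 2 * parent + 1)
                    if tree_a[depth][child] != tree_b[depth][child]]
    return suspects


replica_a = ["rows 0-24", "rows 25-49", "rows 50-74", "rows 75-99"]
replica_b = ["rows 0-24", "rows 25-49", "rows 50-74 (stale)", "rows 75-99"]
print(differing_leaves(build_tree(replica_a), build_tree(replica_b)))  # [2]
```

Only the sub range at index 2 differs, so only that range would need to be streamed between the replicas; identical subtrees are skipped after a single hash comparison.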
ppt/media/image19.jpeg
ppt/media/image20.png
ppt/media/image21.png
ppt/notesSlides/notesSlide23.xml
AP (availability and partition tolerance, in CAP terms)
ppt/notesSlides/notesSlide24.xml
Cassandra can be tuned to be more AVAILABLE or more CONSISTENT, depending on the requirements.
ppt/notesSlides/notesSlide25.xml
A replication factor of 2 implies that there are two copies of each row and each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster.
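One simple way to picture replica placement is to walk the ring clockwise from the node that owns the row's token. The sketch below assumes that placement rule; the node names and the `place_replicas` function are hypothetical, and data-centre-aware placement is not modelled.

```python
def place_replicas(ring, first_owner, replication_factor):
    """Return the nodes holding a row: the owner plus the next nodes clockwise."""
    if replication_factor > len(ring):
        # Each copy must live on a different node, so this cannot be satisfied.
        raise ValueError("replication factor exceeds the number of nodes")
    start = ring.index(first_owner)
    return [ring[(start + step) % len(ring)] for step in range(replication_factor)]


ring = ["node1", "node2", "node3", "node4"]
print(place_replicas(ring, "node4", 2))  # ['node4', 'node1']
```

The `ValueError` branch is the rule of thumb above in executable form: with four nodes, a replication factor of five has nowhere to put the fifth copy.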
ppt/notesSlides/notesSlide26.xml
This is a unique feature
Lowers the entry barrier for DBMS folk
Limited expressivity: no joins
Do live coding against Cassandra
ppt/notesSlides/notesSlide27.xml
ppt/notesSlides/notesSlide28.xml
ppt/media/image22.wmf
 id | firstname | lastname
----+-----------+----------
  1 |      John |      Doe
ppt/notesSlides/notesSlide29.xml
ppt/notesSlides/notesSlide30.xml
{CLICK: to create an index on firstname} You can specify an index name if you wish; if none is given, Cassandra generates one of the form table_name_column_name_idx.
ppt/notesSlides/notesSlide31.xml
The big difference is in how Cassandra's engine stores the data. Rather than storing empty cells (as a static-column storage engine does), it only saves values where the column is present. You can have thousands, or even millions, of columns. For example {CLICK TO NEXT SLIDE}
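The storage difference can be sketched with nested dictionaries: a row only holds the columns that were actually written. The class `SparseTable` and the sample rows are hypothetical and only illustrate the idea, not Cassandra's on-disk format.

```python
class SparseTable:
    """Toy wide-column table: each row stores only the columns that have a value."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)  # None if the cell is absent

    def stored_cells(self):
        return sum(len(columns) for columns in self.rows.values())

    def grid_cells(self):
        """Cells a fixed-column layout would need: rows x distinct columns."""
        all_columns = {name for columns in self.rows.values() for name in columns}
        return len(self.rows) * len(all_columns)


table = SparseTable()
table.put(1, "firstname", "John")
table.put(1, "lastname", "Doe")
table.put(2, "firstname", "Jane")
table.put(2, "city", "Edinburgh")
print(table.stored_cells(), table.grid_cells())  # 4 6
```

The gap between the two numbers grows quickly as rows gain columns that other rows never use.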
ppt/media/image23.png
ppt/media/image24.png
ppt/media/image25.png
ppt/notesSlides/notesSlide32.xml
In Cassandra you could have thousands of columns, all for the key siteid 1.
ppt/notesSlides/notesSlide33.xml
You can specify the consistency required per statement, i.e. whether all replicas must be in sync (CONSISTENCY) or whether to favour higher AVAILABILITY with eventual consistency.
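A commonly cited rule of thumb for this trade-off is that a read is guaranteed to overlap the latest write when the number of replicas read plus the number written exceeds the replication factor. The sketch below encodes that rule for three of Cassandra's consistency levels; the helper names are hypothetical.

```python
# Replicas that must respond for each consistency level, given the replication factor.
LEVELS = {
    "ONE": lambda rf: 1,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def replicas_required(level, replication_factor):
    return LEVELS[level](replication_factor)


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor


for write_level, read_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ALL", "ONE")]:
    print(write_level, read_level, reads_see_latest_write(write_level, read_level, 3))
```

With a replication factor of 3, QUORUM/QUORUM and ALL/ONE satisfy the rule, while ONE/ONE trades that guarantee for lower latency and higher availability.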
ppt/notesSlides/notesSlide34.xml
{CLICK: Solution is to use Bloom filters}
ppt/notesSlides/notesSlide35.xml
Bloom filters date from the 1970s; they are fast and space-efficient. A Bloom filter doesn't keep a copy of all the data, just a record of whether an item was 'probably' seen before or not. It may give false positives, but not false negatives.
ppt/notesSlides/notesSlide36.xml
ppt/media/image26.png
ppt/notesSlides/notesSlide37.xml
Note that the first input is still stored in the array (positions 1, 3 and 14 for 'aardvark'). Position 1 was already set to 1, so it is not changed by this input. This is why false positives are possible: once enough data has been entered, all the positions for an input can already be set even though that input hasn't been seen before. However, if any of an input's array positions holds a 0, that input can't have been seen before.
ppt/media/image27.png
ppt/notesSlides/notesSlide38.xml
{CLICK for input word - 'elephant'} 'elephant' hashes to 16, 2, 7. It is definitely not present, as array position 2 holds 0, which means this input can't have been stored before.
BLOOM FILTERS HISTORY:
1) First used for spelling checkers: fire all the English words you know at the filter, then check a document for words not seen before (there will be some false positives, so a few misspellings slip through), but for the 70s this was a memory-efficient solution to checking word spellings.
2) Bloom filters have also been used for forbidden password lists: pass a blacklist of words through the filter and don't worry about any false positives; these are words not permitted in a password.
3) Used in network routers for packet forwarding: fast, with minimal memory required.
ppt/media/image28.png
ppt/notesSlides/notesSlide39.xml
{CLICK for input word - bat} May be present. It could be a false positive if the inputs seen so far have resulted in those locations being set to TRUE. Here 'bat' has been stored, and locations 16, 1 and 7 were set by the input of that word. {CLICK for input word - snake} However, 'snake' hasn't been seen, and no single input has hashed to 1, 14, 16 before; those locations have been set to TRUE by a combination of 'bat' and 'aardvark', hence a false positive. Cassandra keeps Bloom filters in memory to check whether an SSTable holds the data, reducing seek operations and time on disk. It has tools to check the FALSE POSITIVE rate, and the Bloom filter strategy can be edited (e.g. for a lower false positive rate, based on the hashing algorithms), but at a MEMORY cost, as Bloom filters are held in memory.
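The worked example from these slides can be replayed in a few lines. To reproduce the exact positions quoted in the notes, the three hash functions are replaced by a hard-coded lookup table; that table, the array size of 20 and the class name `ToyBloomFilter` are assumptions for illustration. A real filter would compute the positions with hash functions.

```python
# Array positions quoted in the notes for each word (stand-in for three hash functions).
SLIDE_POSITIONS = {
    "aardvark": (1, 3, 14),
    "bat": (16, 1, 7),
    "elephant": (16, 2, 7),
    "snake": (1, 14, 16),
}


class ToyBloomFilter:
    def __init__(self, size, positions_for):
        self.bits = [0] * size
        self.positions_for = positions_for  # function: item -> array positions

    def add(self, item):
        for position in self.positions_for(item):
            self.bits[position] = 1  # already-set positions simply stay at 1

    def might_contain(self, item):
        # Any 0 means "definitely not seen"; all 1s only means "possibly seen".
        return all(self.bits[position] for position in self.positions_for(item))


bloom = ToyBloomFilter(size=20, positions_for=SLIDE_POSITIONS.__getitem__)
bloom.add("aardvark")
bloom.add("bat")

print(bloom.might_contain("bat"))       # True: it was stored
print(bloom.might_contain("elephant"))  # False: position 2 is still 0
print(bloom.might_contain("snake"))     # True: a false positive
```

'snake' was never added, yet positions 1, 14 and 16 were all set by 'aardvark' and 'bat', which is exactly the false positive described above.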
ppt/media/image29.png
ppt/notesSlides/notesSlide40.xml
Try Demo
ppt/media/image30.png
ppt/notesSlides/notesSlide41.xml
Distributed – each node has the same role – no single point of failure
Masterless – each node can service any request
New nodes can be added without downtime
ppt/changesInfos/changesInfo1.xml
ppt/revisionInfo.xml
docProps/core.xml
CC5212-1 Procesamiento Masivo de Datos 2014 Aidan Hogan Microsoft Office User 899 2019-01-27T12:48:14Z 2006-08-16T00:00:00Z 2020-02-03T11:31:18Z en-GB
docProps/app.xml
docProps/custom.xml
_rels/.rels
ppt/_rels/presentation.xml.rels