Big Data Management Question Paper
Graph Databases Big Data Management
Phil Bartie [email protected] EM G.29
Based on material from: Alasdair Gray, Heriot-Watt University
Relational Oracle MySQL MS SQL Server
PostgreSQL DB2
MS Access SQLite
Teradata
SAP Adaptive Server
FileMaker Hive
MariaDB Informix Vertica
Database Landscape
Big Data Management
4
NoSQL
DocumentMongoDB
Elasticsearch
DynamoDB CouchBase
Key-Value Redis Memcached
Riak KV Aerospike SimpleDB
Graph Neo4J Titan Giraph InfiniteGraph
RDF Virtuoso Stardog
GraphDB Blazegraph
Jena RDF4J
NewSQL SAP HANA Google Spanner Clustrix VoltDB MemSQL NuoDB
Object Caché Db4o Versant
ObjectStore
Wide-Column Cassandra HBase Accumulo
HyperTable
XML MarkLogic
Sedna Tamino
BaseX eXist-db
Outline
¤Examples of graphs
¤Theory of graphs
¤Storing data graphs
¤Neo4j
¤Querying a graph store
Big Data Management
5
Graphs
These are charts, not graphs!
A Simple Graph
A node (vertex) represents an entity
Nodes are connected by relationships (edges)
Relationships can have no direction, one direction or two directions
Can you think of use cases in which graphs feature?
Big Data Management
12
London Underground
Big Data Management
13
http://www.animalsontheunderground.com
Network Diagram
Big Data Management
20http://www.conceptdraw.com/How-To-Guide/picture/Network-diagram-System-design.png
Black Mirror - Bandersnatch (Netflix)
Big Data Management
21Graph Source: Asim Chaudhry @AsimC86
6 degrees of Kevin Bacon ¤Can all actors be linked to Kevin Bacon in under 6
hops? http://oracleofbacon.org/
Facebook claims 3.74 degrees of freedom between ‘strangers’ (Nov 2016)
Big Data Management
22
http://www.classtools.net/blog/wp-content/uploads/2015/04/bacon.jpg
https://research.fb.com/blog/2016/02/th ree-and-a-half-degrees-of-separation/
What is a graph?
¤Single relationship ¤ Set of Vertices (nodes):
¤ No properties other than a name ¤ Set of Edges (relationships):
¤ No properties ¤ Directed or undirected ¤ May be weighted
¤Operations ¤ Adjacency ¤ Neighbours
¤Algorithms: ¤ Path Finding ¤ Least cost path ¤ Centrality ¤ Degree (indegree,outdegree for directed graphs)
Big Data Management
23
G = (V,E)
Degree (number shows connections per node)
Adjacency: Example
Tube map
Transport for London This diagram is an evolution of the original design conceived in 1931 by Harry Beck · Correct at time of going to print · Poster December 2014
Station in both fare zones Bank/Waterloo Waterloo & City line open between Bank and Waterloo 0621-0030 Mondays to Fridays and 0802-0030 Saturdays. Between Waterloo and Bank 0615-0030 Mondays to Fridays and 0800-0030 Saturdays. Closed Sundays and public holidays. Camden Town Open Sundays 1300 -1730 for interchange and exit only. Covent Garden Exit only from late February until early November 2015. Also, on Saturdays and Sundays, westbound trains will not stop. Please use Leicester Square instead. Emirates Air Line Emirates Greenwich Peninsula and Emirates Royal Docks For full information about operating times and fares please visit tfl.gov.uk/emiratesairline Hounslow West Step-free access for manual wheelchairs only. Stanmore Step-free access via a steep ramp. Tottenham Court Road Central line trains will not stop at this station from early January until early December 2015. Turnham Green Served by Piccadilly line trains until 0650 Mondays to Saturdays, 0745 Sundays and after 2230 every evening. At other times use District line. West Ham Until late February 2015 step-free access will only be to the DLR and c2c. West India Quay Not served by DLR trains from Bank towards Lewisham at certain times.
Moor ParkMetropolitan
Victoria
Circle Central Bakerloo
DLR
London Overground
Piccadilly
Waterloo & City
Jubilee
Hammersmith & City
Northern
District District open weekends, public holidays and some Olympia events
Emirates Air Line
National Rail
Riverboat services
Airport
Tramlink
Interchange stations
Step-free access from street to platform
Step-free access from street to train
Emirates Air Line
River Thames
2 2
2 2
2
5
8 8 6
2
4
4
6 5
41
3
2
4 3
3
33 1
1
3
3
59 7 7 Special fares apply
5
5
4
4
46
Chorleywood
Mill Hill East
Rickmansworth
Perivale
Kentish Town West
Camden Road
Dalston Kingsland
Wanstead Park
Hanger Lane
Edgware
Burnt Oak
Colindale
Hendon Central
Brent Cross
Golders Green
West Silvertown
Emirates Royal Docks
Emirates Greenwich Peninsula Pontoon Dock
King George V
Hampstead
Belsize Park
Chalk Farm
Chesham
Amersham
Chalfont & Latimer
Moor Park
Northwood Northwood Hills
Pinner
North Harrow
Custom House for ExCeL
Prince Regent
Royal Albert
Beckton Park
Cyprus
Gallions Reach
Beckton
Watford
Croxley
Fulham Broadway
Lambeth North
Heathrow Terminal 4
Kensal Rise
Bethnal Green
Westferry Blackwall
Brondesbury Park
Hampstead Heath
Harringay Green Lanes
Leytonstone High Road
Leyton Midland Road
Hackney Central
Northwick Park
Preston Road
Royal Victoria
Wembley Park
Rayners Lane
Watford High Street
Ruislip Gardens
South Ruislip
Greenford
Northolt
South Harrow
Sudbury Hill
Sudbury Town
Alperton
Pimlico
Park Royal
North Ealing
Acton Central
South Acton
Ealing Broadway
West Ruislip
Carpenders Park
Hatch End
North Wembley
Ealing Common
South Kenton
Kenton
Kensal Green
Queen’s Park
Gunnersbury
Kew Gardens
Stockwell
Bow Church
Stonebridge Park
Harlesden
Camden Town
Willesden Junction
Headstone Lane
Parsons Green
Putney Bridge
East Putney
Southfields
Wimbledon Park
Island Gardens
Deptford Bridge
South Quay
Crossharbour
Mudchute
Heron Quays
West India Quay
Elverson Road
Oakwood
Cockfosters
Southgate
Arnos Grove
Bounds Green
Theydon Bois
Epping
Debden
Loughton
Buckhurst Hill
Walthamstow Queen’s Road
Woodgrange Park
Leytonstone
Leyton
Wood Green
Turnpike Lane
Manor House
Stanmore
Canons Park
Queensbury
Kingsbury
High Barnet
Totteridge & Whetstone
Woodside Park
West Finchley
Finchley Central Woodford
South Woodford
Snaresbrook
Hainault
Fairlop
Barkingside
Newbury Park
East Finchley
Highgate
Archway
Devons Road
Langdon Park
All Saints
Tufnell Park Neasden
Dollis Hill
Willesden Green
South Tottenham
Swiss Cottage
Kilburn
Blackhorse Road
Acton Town
Canning Town
Finchley Road
Canary Wharf
Stepney Green
East Ham
Plaistow
Upton Park
Poplar
Upper Holloway
Pudding Mill Lane
Kennington
Borough
Elm Park Dagenham
East
Dagenham Heathway
Becontree
Upney
Heathrow Terminal 5
Finchley Road & Frognal
Crouch Hill
Northfields
Boston Manor
South Ealing
Osterley
Hounslow Central
Hounslow East
Clapham North
Clapham High Street
Oval
Clapham Common
Clapham South
Tooting Bec
Tooting Broadway
Colliers Wood
South Wimbledon
Arsenal
Holloway Road
Caledonian Road
Morden
Hounslow West
Hatton Cross
Heathrow Terminals 1, 2, 3
West Harrow
Brondesbury Caledonian Road &
Barnsbury
Hackney Wick
Homerton
West Acton
East India
Chiswick Park
Roding Valley
Grange Hill
Chigwell
Redbridge
Gants Hill
Wanstead
Ickenham
Turnham Green
Uxbridge
Hillingdon Ruislip
Gospel Oak
Mile End
Bow Road
Bromley- by-Bow
Upminster Bridge
Hornchurch
London City Airport
Richmond
West Brompton
Clapham Junction
Balham
Norwood Junction
Forest Hill
Anerley
Penge West
Honor Oak Park
Brockley
Brixton
Elephant & Castle
Wembley Central
Harrow & Wealdstone
Bushey
Watford Junction
Kentish Town Highbury & Islington
Finsbury Park
Seven Sisters
Tottenham Hale
Walthamstow Central
Stratford
Upminster
Barking
West Ham
Limehouse
Woolwich Arsenal
Wimbledon
West Croydon
Cutty Sark for Maritime Greenwich
Ruislip Manor
Eastcote
West Hampstead
Wapping
New Cross Gate
Sydenham
New Cross
Crystal Palace
Queens Road Peckham
Peckham Rye
Denmark Hill
Canada Water
Surrey Quays
Whitechapel
Greenwich
Lewisham
Kilburn Park
Regent’s Park
Kilburn High Road
South Hampstead
Goodge Street
Shepherd’s Bush Market
Goldhawk Road
Bayswater
Warren Street
Aldgate
Barbican Russell Square
Mornington Crescent
High Street Kensington
St. John’s Wood
Green Park
Baker Street
Notting Hill Gate
Aldgate East
Mansion House
Oxford Circus
Bond Street
Piccadilly Circus
Holborn
Tower Gateway
Monument
Leicester Square
St. Paul’s
Hyde Park Corner
Knightsbridge
Stamford Brook
Ravenscourt Park
West Kensington
North Acton
Holland Park
Angel
Queensway Marble Arch
South Kensington
Sloane Square
Wandsworth Road
Covent Garden
Great Portland
Street
Bank
East Acton
Chancery Lane
Lancaster Gate
Kensington (Olympia)
Marylebone
Victoria
Charing Cross
Cannon Street
Farringdon
Euston
Westminster London Bridge
Liverpool Street
Stratford International
Old Street
Moorgate
Warwick Avenue Maida Vale
Tower Hill
Fenchurch Street
Paddington
Barons Court
Gloucester Road St. James’s
Park Temple
Latimer Road Ladbroke Grove
Royal Oak
Westbourne Park
Bermondsey
Rotherhithe
Shoreditch High Street
Dalston Junction
Haggerston
Hoxton
Wood Lane
White City
Shepherd’s Bush
King’s Cross St. Pancras
Euston SquareEdgware
Road
Southwark
Embankment
Stratford High Street
Abbey Road
Star Lane
Vauxhall
Blackfriars
Waterloo
Tottenham Court Road
Canonbury
Shadwell
Earl’s Court
North Greenwich
Hammersmith
Edgware Road
Harrow- on-the-Hill
Imperial Wharf
Big Data Management
24
Marble Arch is adjacent (connected) to: • Bond Street • Lancaster Gate
Neighbourhood: Example
Big Data Management
25
Tube map
Transport for London This diagram is an evolution of the original design conceived in 1931 by Harry Beck · Correct at time of going to print · Poster December 2014
Station in both fare zones Bank/Waterloo Waterloo & City line open between Bank and Waterloo 0621-0030 Mondays to Fridays and 0802-0030 Saturdays. Between Waterloo and Bank 0615-0030 Mondays to Fridays and 0800-0030 Saturdays. Closed Sundays and public holidays. Camden Town Open Sundays 1300 -1730 for interchange and exit only. Covent Garden Exit only from late February until early November 2015. Also, on Saturdays and Sundays, westbound trains will not stop. Please use Leicester Square instead. Emirates Air Line Emirates Greenwich Peninsula and Emirates Royal Docks For full information about operating times and fares please visit tfl.gov.uk/emiratesairline Hounslow West Step-free access for manual wheelchairs only. Stanmore Step-free access via a steep ramp. Tottenham Court Road Central line trains will not stop at this station from early January until early December 2015. Turnham Green Served by Piccadilly line trains until 0650 Mondays to Saturdays, 0745 Sundays and after 2230 every evening. At other times use District line. West Ham Until late February 2015 step-free access will only be to the DLR and c2c. West India Quay Not served by DLR trains from Bank towards Lewisham at certain times.
Moor ParkMetropolitan
Victoria
Circle Central Bakerloo
DLR
London Overground
Piccadilly
Waterloo & City
Jubilee
Hammersmith & City
Northern
District District open weekends, public holidays and some Olympia events
Emirates Air Line
National Rail
Riverboat services
Airport
Tramlink
Interchange stations
Step-free access from street to platform
Step-free access from street to train
Emirates Air Line
River Thames
2 2
2 2
2
5
8 8 6
2
4
4
6 5
41
3
2
4 3
3
33 1
1
3
3
59 7 7 Special fares apply
5
5
4
4
46
Chorleywood
Mill Hill East
Rickmansworth
Perivale
Kentish Town West
Camden Road
Dalston Kingsland
Wanstead Park
Hanger Lane
Edgware
Burnt Oak
Colindale
Hendon Central
Brent Cross
Golders Green
West Silvertown
Emirates Royal Docks
Emirates Greenwich Peninsula Pontoon Dock
King George V
Hampstead
Belsize Park
Chalk Farm
Chesham
Amersham
Chalfont & Latimer
Moor Park
Northwood Northwood Hills
Pinner
North Harrow
Custom House for ExCeL
Prince Regent
Royal Albert
Beckton Park
Cyprus
Gallions Reach
Beckton
Watford
Croxley
Fulham Broadway
Lambeth North
Heathrow Terminal 4
Kensal Rise
Bethnal Green
Westferry Blackwall
Brondesbury Park
Hampstead Heath
Harringay Green Lanes
Leytonstone High Road
Leyton Midland Road
Hackney Central
Northwick Park
Preston Road
Royal Victoria
Wembley Park
Rayners Lane
Watford High Street
Ruislip Gardens
South Ruislip
Greenford
Northolt
South Harrow
Sudbury Hill
Sudbury Town
Alperton
Pimlico
Park Royal
North Ealing
Acton Central
South Acton
Ealing Broadway
West Ruislip
Carpenders Park
Hatch End
North Wembley
Ealing Common
South Kenton
Kenton
Kensal Green
Queen’s Park
Gunnersbury
Kew Gardens
Stockwell
Bow Church
Stonebridge Park
Harlesden
Camden Town
Willesden Junction
Headstone Lane
Parsons Green
Putney Bridge
East Putney
Southfields
Wimbledon Park
Island Gardens
Deptford Bridge
South Quay
Crossharbour
Mudchute
Heron Quays
West India Quay
Elverson Road
Oakwood
Cockfosters
Southgate
Arnos Grove
Bounds Green
Theydon Bois
Epping
Debden
Loughton
Buckhurst Hill
Walthamstow Queen’s Road
Woodgrange Park
Leytonstone
Leyton
Wood Green
Turnpike Lane
Manor House
Stanmore
Canons Park
Queensbury
Kingsbury
High Barnet
Totteridge & Whetstone
Woodside Park
West Finchley
Finchley Central Woodford
South Woodford
Snaresbrook
Hainault
Fairlop
Barkingside
Newbury Park
East Finchley
Highgate
Archway
Devons Road
Langdon Park
All Saints
Tufnell Park Neasden
Dollis Hill
Willesden Green
South Tottenham
Swiss Cottage
Kilburn
Blackhorse Road
Acton Town
Canning Town
Finchley Road
Canary Wharf
Stepney Green
East Ham
Plaistow
Upton Park
Poplar
Upper Holloway
Pudding Mill Lane
Kennington
Borough
Elm Park Dagenham
East
Dagenham Heathway
Becontree
Upney
Heathrow Terminal 5
Finchley Road & Frognal
Crouch Hill
Northfields
Boston Manor
South Ealing
Osterley
Hounslow Central
Hounslow East
Clapham North
Clapham High Street
Oval
Clapham Common
Clapham South
Tooting Bec
Tooting Broadway
Colliers Wood
South Wimbledon
Arsenal
Holloway Road
Caledonian Road
Morden
Hounslow West
Hatton Cross
Heathrow Terminals 1, 2, 3
West Harrow
Brondesbury Caledonian Road &
Barnsbury
Hackney Wick
Homerton
West Acton
East India
Chiswick Park
Roding Valley
Grange Hill
Chigwell
Redbridge
Gants Hill
Wanstead
Ickenham
Turnham Green
Uxbridge
Hillingdon Ruislip
Gospel Oak
Mile End
Bow Road
Bromley- by-Bow
Upminster Bridge
Hornchurch
London City Airport
Richmond
West Brompton
Clapham Junction
Balham
Norwood Junction
Forest Hill
Anerley
Penge West
Honor Oak Park
Brockley
Brixton
Elephant & Castle
Wembley Central
Harrow & Wealdstone
Bushey
Watford Junction
Kentish Town Highbury & Islington
Finsbury Park
Seven Sisters
Tottenham Hale
Walthamstow Central
Stratford
Upminster
Barking
West Ham
Limehouse
Woolwich Arsenal
Wimbledon
West Croydon
Cutty Sark for Maritime Greenwich
Ruislip Manor
Eastcote
West Hampstead
Wapping
New Cross Gate
Sydenham
New Cross
Crystal Palace
Queens Road Peckham
Peckham Rye
Denmark Hill
Canada Water
Surrey Quays
Whitechapel
Greenwich
Lewisham
Kilburn Park
Regent’s Park
Kilburn High Road
South Hampstead
Goodge Street
Shepherd’s Bush Market
Goldhawk Road
Bayswater
Warren Street
Aldgate
Barbican Russell Square
Mornington Crescent
High Street Kensington
St. John’s Wood
Green Park
Baker Street
Notting Hill Gate
Aldgate East
Mansion House
Oxford Circus
Bond Street
Piccadilly Circus
Holborn
Tower Gateway
Monument
Leicester Square
St. Paul’s
Hyde Park Corner
Knightsbridge
Stamford Brook
Ravenscourt Park
West Kensington
North Acton
Holland Park
Angel
Queensway Marble Arch
South Kensington
Sloane Square
Wandsworth Road
Covent Garden
Great Portland
Street
Bank
East Acton
Chancery Lane
Lancaster Gate
Kensington (Olympia)
Marylebone
Victoria
Charing Cross
Cannon Street
Farringdon
Euston
Westminster London Bridge
Liverpool Street
Stratford International
Old Street
Moorgate
Warwick Avenue Maida Vale
Tower Hill
Fenchurch Street
Paddington
Barons Court
Gloucester Road St. James’s
Park Temple
Latimer Road Ladbroke Grove
Royal Oak
Westbourne Park
Bermondsey
Rotherhithe
Shoreditch High Street
Dalston Junction
Haggerston
Hoxton
Wood Lane
White City
Shepherd’s Bush
King’s Cross St. Pancras
Euston SquareEdgware
Road
Southwark
Embankment
Stratford High Street
Abbey Road
Star Lane
Vauxhall
Blackfriars
Waterloo
Tottenham Court Road
Canonbury
Shadwell
Earl’s Court
North Greenwich
Hammersmith
Edgware Road
Harrow- on-the-Hill
Imperial Wharf
Marble Arch 2 step neighbourhood: • Bond Street • Baker Street • Oxford Circus • Green Park • Lancaster Gate • Queensway
7 Bridges of Königsberg
Big Data Management
26
Can you walk a route that crosses every bridge only once?
https://www.mathsisfun.com/activity/seven-bridges-konigsberg.html
Only if the number of
nodes with an odd degree
is 0 or 2
7 Bridges of Königsberg
Big Data Management
27https://www.mathsisfun.com/activity/seven-bridges-konigsberg.html
A route visiting every edge once is known as a Euler path
degree A(3) B(3) C(5) D(3) Nodes with ODD connections is 4 so Euler path not possible
(If removed edge t) degree A(3) B(3) C(4) D(2) 2 ODD = Euler path possible
Which of these have Euler Paths?
Big Data Management
28
Only possible if number of nodes with odd
degree is 0 or 2
https://www.mathsisfun.com/activity/seven-bridges-konigsberg.html
Applications ¤Social Networks:
¤ Friendship relationships ¤ Cliques ¤ Centres
¤Transportation: ¤ Shortest route ¤ Intercity air links ¤ Inter-transport ¤ Travelling Salesman
¤Technology ¤ Network links ¤ Software dependencies
¤Science ¤ Biological pathways ¤ Chemistry ¤ Electrical networks
¤Movies ¤ Actors, films, ¤ Locations
Big Data Management
29
Multi-Relational Graph
¤Multi-Relationship ¤ Set of Vertices: no properties ¤ Sets of Edges: may be
weighted
¤No properties on edges
Big Data Management
30
G = (V,E) E=(E0,E1,...,Em ⊆ (V xV))
http://www.w3.org/TR/rdf11-primer/
London Underground
Big Data Management
31
Tube map
Transport for London This diagram is an evolution of the original design conceived in 1931 by Harry Beck · Correct at time of going to print · Poster December 2014
Station in both fare zones Bank/Waterloo Waterloo & City line open between Bank and Waterloo 0621-0030 Mondays to Fridays and 0802-0030 Saturdays. Between Waterloo and Bank 0615-0030 Mondays to Fridays and 0800-0030 Saturdays. Closed Sundays and public holidays. Camden Town Open Sundays 1300 -1730 for interchange and exit only. Covent Garden Exit only from late February until early November 2015. Also, on Saturdays and Sundays, westbound trains will not stop. Please use Leicester Square instead. Emirates Air Line Emirates Greenwich Peninsula and Emirates Royal Docks For full information about operating times and fares please visit tfl.gov.uk/emiratesairline Hounslow West Step-free access for manual wheelchairs only. Stanmore Step-free access via a steep ramp. Tottenham Court Road Central line trains will not stop at this station from early January until early December 2015. Turnham Green Served by Piccadilly line trains until 0650 Mondays to Saturdays, 0745 Sundays and after 2230 every evening. At other times use District line. West Ham Until late February 2015 step-free access will only be to the DLR and c2c. West India Quay Not served by DLR trains from Bank towards Lewisham at certain times.
Moor ParkMetropolitan
Victoria
Circle Central Bakerloo
DLR
London Overground
Piccadilly
Waterloo & City
Jubilee
Hammersmith & City
Northern
District District open weekends, public holidays and some Olympia events
Emirates Air Line
National Rail
Riverboat services
Airport
Tramlink
Interchange stations
Step-free access from street to platform
Step-free access from street to train
Emirates Air Line
River Thames
2 2
2 2
2
5
8 8 6
2
4
4
6 5
41
3
2
4 3
3
33 1
1
3
3
59 7 7 Special fares apply
5
5
4
4
46
Chorleywood
Mill Hill East
Rickmansworth
Perivale
Kentish Town West
Camden Road
Dalston Kingsland
Wanstead Park
Hanger Lane
Edgware
Burnt Oak
Colindale
Hendon Central
Brent Cross
Golders Green
West Silvertown
Emirates Royal Docks
Emirates Greenwich Peninsula Pontoon Dock
King George V
Hampstead
Belsize Park
Chalk Farm
Chesham
Amersham
Chalfont & Latimer
Moor Park
Northwood Northwood Hills
Pinner
North Harrow
Custom House for ExCeL
Prince Regent
Royal Albert
Beckton Park
Cyprus
Gallions Reach
Beckton
Watford
Croxley
Fulham Broadway
Lambeth North
Heathrow Terminal 4
Kensal Rise
Bethnal Green
Westferry Blackwall
Brondesbury Park
Hampstead Heath
Harringay Green Lanes
Leytonstone High Road
Leyton Midland Road
Hackney Central
Northwick Park
Preston Road
Royal Victoria
Wembley Park
Rayners Lane
Watford High Street
Ruislip Gardens
South Ruislip
Greenford
Northolt
South Harrow
Sudbury Hill
Sudbury Town
Alperton
Pimlico
Park Royal
North Ealing
Acton Central
South Acton
Ealing Broadway
West Ruislip
Carpenders Park
Hatch End
North Wembley
Ealing Common
South Kenton
Kenton
Kensal Green
Queen’s Park
Gunnersbury
Kew Gardens
Stockwell
Bow Church
Stonebridge Park
Harlesden
Camden Town
Willesden Junction
Headstone Lane
Parsons Green
Putney Bridge
East Putney
Southfields
Wimbledon Park
Island Gardens
Deptford Bridge
South Quay
Crossharbour
Mudchute
Heron Quays
West India Quay
Elverson Road
Oakwood
Cockfosters
Southgate
Arnos Grove
Bounds Green
Theydon Bois
Epping
Debden
Loughton
Buckhurst Hill
Walthamstow Queen’s Road
Woodgrange Park
Leytonstone
Leyton
Wood Green
Turnpike Lane
Manor House
Stanmore
Canons Park
Queensbury
Kingsbury
High Barnet
Totteridge & Whetstone
Woodside Park
West Finchley
Finchley Central Woodford
South Woodford
Snaresbrook
Hainault
Fairlop
Barkingside
Newbury Park
East Finchley
Highgate
Archway
Devons Road
Langdon Park
All Saints
Tufnell Park Neasden
Dollis Hill
Willesden Green
South Tottenham
Swiss Cottage
Kilburn
Blackhorse Road
Acton Town
Canning Town
Finchley Road
Canary Wharf
Stepney Green
East Ham
Plaistow
Upton Park
Poplar
Upper Holloway
Pudding Mill Lane
Kennington
Borough
Elm Park Dagenham
East
Dagenham Heathway
Becontree
Upney
Heathrow Terminal 5
Finchley Road & Frognal
Crouch Hill
Northfields
Boston Manor
South Ealing
Osterley
Hounslow Central
Hounslow East
Clapham North
Clapham High Street
Oval
Clapham Common
Clapham South
Tooting Bec
Tooting Broadway
Colliers Wood
South Wimbledon
Arsenal
Holloway Road
Caledonian Road
Morden
Hounslow West
Hatton Cross
Heathrow Terminals 1, 2, 3
West Harrow
Brondesbury Caledonian Road &
Barnsbury
Hackney Wick
Homerton
West Acton
East India
Chiswick Park
Roding Valley
Grange Hill
Chigwell
Redbridge
Gants Hill
Wanstead
Ickenham
Turnham Green
Uxbridge
Hillingdon Ruislip
Gospel Oak
Mile End
Bow Road
Bromley- by-Bow
Upminster Bridge
Hornchurch
London City Airport
Richmond
West Brompton
Clapham Junction
Balham
Norwood Junction
Forest Hill
Anerley
Penge West
Honor Oak Park
Brockley
Brixton
Elephant & Castle
Wembley Central
Harrow & Wealdstone
Bushey
Watford Junction
Kentish Town Highbury & Islington
Finsbury Park
Seven Sisters
Tottenham Hale
Walthamstow Central
Stratford
Upminster
Barking
West Ham
Limehouse
Woolwich Arsenal
Wimbledon
West Croydon
Cutty Sark for Maritime Greenwich
Ruislip Manor
Eastcote
West Hampstead
Wapping
New Cross Gate
Sydenham
New Cross
Crystal Palace
Queens Road Peckham
Peckham Rye
Denmark Hill
Canada Water
Surrey Quays
Whitechapel
Greenwich
Lewisham
Kilburn Park
Regent’s Park
Kilburn High Road
South Hampstead
Goodge Street
Shepherd’s Bush Market
Goldhawk Road
Bayswater
Warren Street
Aldgate
Barbican Russell Square
Mornington Crescent
High Street Kensington
St. John’s Wood
Green Park
Baker Street
Notting Hill Gate
Aldgate East
Mansion House
Oxford Circus
Bond Street
Piccadilly Circus
Holborn
Tower Gateway
Monument
Leicester Square
St. Paul’s
Hyde Park Corner
Knightsbridge
Stamford Brook
Ravenscourt Park
West Kensington
North Acton
Holland Park
Angel
Queensway Marble Arch
South Kensington
Sloane Square
Wandsworth Road
Covent Garden
Great Portland
Street
Bank
East Acton
Chancery Lane
Lancaster Gate
Kensington (Olympia)
Marylebone
Victoria
Charing Cross
Cannon Street
Farringdon
Euston
Westminster London Bridge
Liverpool Street
Stratford International
Old Street
Moorgate
Warwick Avenue Maida Vale
Tower Hill
Fenchurch Street
Paddington
Barons Court
Gloucester Road St. James’s
Park Temple
Latimer Road Ladbroke Grove
Royal Oak
Westbourne Park
Bermondsey
Rotherhithe
Shoreditch High Street
Dalston Junction
Haggerston
Hoxton
Wood Lane
White City
Shepherd’s Bush
King’s Cross St. Pancras
Euston SquareEdgware
Road
Southwark
Embankment
Stratford High Street
Abbey Road
Star Lane
Vauxhall
Blackfriars
Waterloo
Tottenham Court Road
Canonbury
Shadwell
Earl’s Court
North Greenwich
Hammersmith
Edgware Road
Harrow- on-the-Hill
Imperial Wharf
We need node
properties!
How can a wheelchair
user get from Paddington to Green Park?
Graph Operations
¤Create: add a node or relationship
¤Read: retrieve data ¤Update: change a node
or relationship’s properties
¤Delete: remove a node or relationship
¤Adjacency: Find connected nodes
¤Neighbourhood: Find nodes up to x steps away
¤Shortest path: Find the shortest path between two nodes
Big Data Management
32
Graph storage
¤Native: Optimised and designed for graphs
¤Serialised: stored in some other form ¤Relational DBMS ¤Object store ¤Document store
Big Data Management
33Check here for more details: https://neo4j.com/blog/native-vs-non-native-graph-technology/
Native = data stored as a graph Non-native = data not stored as a graph
Storing Relationship in an RDBMS
Person ID Name 1 Alice 2 Bob … … 99 Zach
Big Data Management
34
Friendships PersonID FriendID 1 2 2 1 2 99 … … 99 1
1
99
2 FRIEND
FRIENDFRIEND
Bob’s Friends SELECT p2.Name FROM Person p1,
Friendship f1, Person p2
WHERE p1.ID = f1.PersonID AND p2.ID = f1.FriendID AND p1.Name = ‘Bob’
FRIEND
Storing Relationship in an RDBMS
Person ID Name 1 Alice 2 Bob … … 99 Zach
Big Data Management
35
Friendships PersonID FriendID 1 2 2 1 2 99 … … 99 1
1
99
2 FRIEND
FRIENDFRIEND
Bob’s Friends-of-friends SELECT p2.Name
AS Friend_of_friend FROM Person p1,
Friendship f1, Person p2, Friendship f2
WHERE p1.ID = f1.PersonID AND f1.FriendID =
f2.PersonID AND p2.ID = f2.FriendID AND p1.Name = ‘Bob’ AND f2.FriendID <> p1.ID
FRIEND
Storing Graphs in JSON
{ A : { outE : [B, C]
} B : { outE : []
} C : { outE : [D]
} D : { outE : [A]
} }
Representing a Graph in a Relational Database
outV | inV
------------
A | B
A | C
C | D
D | A
A
CB
D
Big Data Management
37
http://www.slideshare.net/slidarko/graph-databases-trends-in-the-web-of-data
Native Storage ¤Index-free adjacency
¤ Doesn’t rely on indexes ¤ Relationships modelled as links
between nodes ¤ Micro-index of adjacent
nodes ¤ Query time proportional to
amount of graph searched, not total size
InLinks OutLinks D B
C
Big Data Management
39
Representing a Graph in a Relational Database
outV | inV
------------
A | B
A | C
C | D
D | A
A
CB
D
Processing engine
¤Runs algorithms over the graph ¤Identify clusters ¤Questions about relationships
¤Traverses the graph ¤Following paths ¤Chasing pointers
Big Data Management
40
A
DC
B E
41
Figure 1-3. An overview of the graph database space
Figure 1-4 shows a common architecture for deploying a graph compute engine. The architecture includes a system of record (SOR) database with OLTP properties (such as MySQL, Oracle, or Neo4j), which serves, requests, and responds to queries from the application (and ultimately the users) at runtime. Periodically, an Extract, Transform, and Load (ETL) job moves data from the system of record database into the graph compute engine for offline querying and analysis.
Figure 1-4. A high-level view of a typical graph compute engine deployment
A variety of different types of graph compute engines exist. Most notably there are in- memory/single machine graph compute engines like Cassovary, and distributed graph compute engines like Pegasus or Giraph. Most distributed graph compute engines are
A High-Level View of the Graph Space | 7
Big Data Management
I. Robinson, J. Webber, and E. Eifrem, Graph Databases. O’Reilly Media, 2013.
Performance: Relationship queries
Relational ¤Performance degrades with
many joins
¤Multiple self-joins even worse ¤ Each self-join must create a
copy of the relation
¤Cost increases as cardinality increases
Graph ¤Performance stays constant
¤ Queries are localised to a node
¤ Nodes are linked via relationships
¤Cost constrained by query
Big Data Management
42
Find all friends of my friends
Query Performance: Set queries
Relational ¤Exploit data model structure
¤Matches “traditional” business query loads
Graphs ¤Have to visit every node
Big Data Management
43
For each film, how many actors are there? What is the average number of actor per film?
Flexibility: New Data Needs
Relational ¤Schema redesign
¤Implementing redesign implies data migration
Graphs ¤Add new nodes and
relationships
¤Naturally supported
Big Data Management
44
http://neo4j.com/
Native Graph Storage
For this course we will use…
..other native graph databases are available…
• Data model looks like graph you would draw on whiteboard, so simple to discuss with non technical people
• ACID compliant • Flexible schema • Cheap to run (e.g. compared to Oracle) • Relationships are no longer hidden, they are a 1st class entity • Fast
Benefits
• Correspond to entities
• Have properties that describe the entity; stored as JSON-like
• Have labels that define the class of entity
Nodes
HWU
KEN
{ name: “ken”, age: 26, job: “ra”
}
UNIVERSITY
PERSON
RESEARCH
TEACH
• Link 2 entities with a direction
• Have properties that describe the relationship; stored as JSON-like
• Can describe the nature (e.g., parent- child)
• Or be qualitative (e.g., start date)
• Have labels that define the class of relationship
Relationships
HWU
KEN
{ start : “1.4.2006”, contract : “rolling” grade : 7
}
EMPLOYS
• Relationships ALWAYS have 1 direction
• Relationships cannot have 2 directions
• You can have multiple relationships – do you need to?
Answer: no
¤You can navigate in reverse!
HOWEVER.. You might want to!
For example: HWU employs Ken Ken is employed by HWU it doesn’t make sense that Ken Employs HWU
BUT: Ken follows HWU (i.e. on Twitter) and HWU follows Ken
Exercise
Create a graph that models the relationships between Heriot-Watt University, Edinburgh and yourself.
52
STUDIES AT
LOCATED N
PERSON
LIVES IN
UNIVERSITY
CITY
STUDIES AT
LOCATED IN
PERSON
LIVES IN
{ name: “steve”, age: 23, job: “student”
}
UNIVERSITY
{ name: “hwu”, founded: 1821
}
{ name:
“Edinburgh” }
CITY
Queries generate sub-graphs
HWU
KEN
Query
Query
Treating relationships as 1st class entities, and storing quantitative or qualitative facts about the relationship enables very complex querying:
Find all men who are connected within three friends of my women friends who like sailing but not bowling and who live within 30 miles of my address.
• Still something of a niche product. Many use cases, but not a general purpose DBMS in the way that RDBMS are.
• Immature in comparison to RDBMS • Writes are slow (in comparison to something like MongoDB) • Flexible schema are not that flexible. You need to fix a schema in
order that your program knows what to ask for and how to ask for it. So if there is no schema defined at the DBMS level, you end up defining one in the programs that access the data.
• Neo4J Cypher pushing for a standard query language https://neo4j.com/press-releases/query-language-graph- databases-international-standard/
Weaknesses
Neo4j Architecture
at O(1) cost each. To find who is friends with Alice, we simply follow all of Alice’s incoming FRIEND relationships to their source, again at O(1) cost each. Given these costings, it’s clear that, in theory at least, graph traversals can be very effi! cient. But such high-performance traversals only become reality when they are sup! ported by an architecture designed for that purpose.
Native Graph Storage If index-free adjacency is the key to high-performance traversals, queries, and writes, then one key aspect of the design of a graph database is the way in which graphs are stored. An efficient, native graph storage format supports extremely rapid traversals for arbitrary graph algorithms—an important reason for using graphs. For illustrative pur! poses we’ll use the Neo4j database as an example of how a graph database is architected.
First, let’s contextualize our discussion by looking at Neo4j’s high-level architecture, presented in Figure 6-3. In what follows we’ll work bottom-up, from the files on disk, through the programmatic APIs, and up to the Cypher query language. Along the way we’ll discuss the performance and dependability characteristics of Neo4j, and the design decisions that make Neo4j a performant, reliable graph database.
Figure 6-3. Neo4j architecture
Neo4j stores graph data in a number of different store files. Each store file contains the data for a specific part of the graph (e.g., nodes, relationships, properties). The division of storage responsibilities—particularly the separation of graph structure from property data—facilitates performant graph traversals, even though it means the user’s view of
144 | Chapter 6: Graph Database Internals
Big Data Management
59I. Robinson, J. Webber, and E. Eifrem, Graph Databases. O’Reilly Media, 2013.
Nodes Relationships Properties
Neo4J hardware calculator
Big Data Management
60https://neo4j.com/hardware-sizing/
But (like Dynamo), tables are tunable towards CP
Where is Neo4j?
Big Data Management
61
C
A P
CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)
Versions of Neo4J
Big Data Management
62
• Community • Enterprise
• Both offer ACID transactions • Both offer Cypher graph query language • Both offer drivers for C#, Java, JS, Python…
• Only Enterprise offers clustering
Neo4J: Enterprise Edition – Horizontal Scaling
Big Data Management
63https://neo4j.com/docs/operations-manual/current/clustering/introduction/
Core Servers: main responsibility is to safeguard data – they replicate all transactions using the Raft protocol. Raft ensures that the data is safely durable before confirming transaction commit to the end user application.
Read Replicas are fully-fledged Neo4j databases capable of fulfilling arbitrary (read-only) graph queries and procedures.
Read Replicas' main responsibility is to scale out graph workloads. They are asynchronously replicated from Core Servers via transaction log shipping. Periodically (usually in the ms range) a Read Replica will poll a Core Server for any new transactions and the Core Server will ship those transactions to the Read Replica. Many Read Replicas can be fed data from a relatively small number of Core Servers, allowing for a large fan out of the query workload for scale.
Cypher Query Language
Neo4j specific
Big Data Management
http://neo4j.com/developer/cypher-query-language/
Neo4J
Big Data Management
Create simplest graph
Big Data Management
67
What happened?
Big Data Management
68
Displaying nodes
Big Data Management
69
Display Graph: nodes
Big Data Management
70
Display Graph
Big Data Management
71
Create a relationship
Big Data Management
72
Relationship created
Big Data Management
73
Display result using match
Big Data Management
74
Create link from 0 to 1
Big Data Management
75
Displaying the new edge
Big Data Management
76
Delete all nodes and edges
Big Data Management
77
More information
Big Data Management
78
Cypher Summary ¤ Create a node: create()
¤ Retrieve all nodes: match(n) return n
¤ Create two nodes and a relationship called :LINKS_TO between them: create()-[:LINKS_TO]-> ()
¤ Create LINKS_TO relationship between nodes 0 and 1: match (n),(m) where id(n)=0 and id(m)= 1 create(n) -[:LINKS_TO]-> (m)
¤ Delete all nodes and edges: match(n) optional match (n)-[r]-() delete n,r
79
Movie Schema
Big Data Management
80
Neo4J supports ACID ¤ In order to fully maintain data integrity and ensure good transactional behaviour, Neo4j supports the ACID properties:
¤ Atomicity: If any part of a transaction fails, the database state is left unchanged. ¤ Consistency: Any transaction will leave the database in a consistent state. ¤ Isolation: During a transaction, modified data cannot be accessed by other operations. ¤ Durability: The DBMS can always recover the results of a committed transaction.
¤ All database operations that access the graph, indexes, or the schema must be performed in a transaction. ¤ The default isolation level is READ_COMMITTED. ¤ Data retrieved by traversals is not protected from modification by other transactions. ¤ Non-repeatable reads may occur (i.e., only write locks are acquired and held until the end of the transaction). ¤ One can manually acquire write locks on nodes and relationships to achieve higher level of isolation ¤ Locks are acquired at the Node and Relationship level. ¤ Deadlock detection is built into the core transaction management.
Big Data Management
81https://neo4j.com/docs/java-reference/current/transactions/
TRANSACTION 1.Begin a transaction. 2.Perform database operations. 3.Mark the transaction as successful or not. 4.Finish the transaction.
Requires a write lock, and Cypher automatically acquires one. MATCH (n:X {id: 42}) SET n.prop = n.prop + 1
Summary ¤Graphs good for relationships ¤Network analysis
¤Property graphs ¤Edges are directed ¤Nodes and edges have properties ¤Multi-relational
¤Efficient native storage and processing ¤Neo4j contains pattern matching query language
Big Data Management
82
http://en.wikipedia.org/wiki/Graph_database
Cypher vs SQL
Big Data Management
83
Query that for each customer who bought a product, will recommend products that peer customers have purchased. The WHERE clause removes products that the customer has already purchased, since we don’t want to recommend something the customer has already bought.
SELECT product.product_name as Recommendation, count(1) as Frequency FROM product, customer_product_mapping, (SELECT cpm3.product_id, cpm3.customer_id FROM Customer_product_mapping cpm, Customer_product_mapping cpm2, Customer_product_mapping cpm3 WHERE cpm.customer_id = ‘customer-one’ and cpm.product_id = cpm2.product_id and cpm2.customer_id != ‘customer-one’ and cpm3.customer_id = cpm2.customer_id and cpm3.product_id not in (select distinct product_id FROM Customer_product_mapping cpm WHERE cpm.customer_id = ‘customer-one’) ) recommended_products WHERE customer_product_mapping.product_id = product.product_id and customer_product_mapping.product_id in recommended_products.product_id and customer_product_mapping.customer_id = recommended_products.customer_id GROUP BY product.product_name ORDER BY Frequency desc;
MATCH (u:Customer {customer_id:’customer-one’})- [:BOUGHT]->
(p:Product) <- [:BOUGHT]-(peer:Customer)- [:BOUGHT]->(reco:Product) WHERE not (u)-[:BOUGHT]->(reco) RETURN reco as Recommendation, count(*) as Frequency ORDER BY Frequency DESC LIMIT 5;
https://neo4j.com/blog/slow-sql-queries-killing-recommendation-engine/?ref=blog
CYPHER SQL
Lab Exercise
¤See instructions on Vision
¤Download neo4j, unzip, and run from /bin folder
¤Two labs available – do lab a (networks) and optionally lab b (data analysis)
¤Don’t forget to stop neo4j service when finished
Big Data Management
84
Download from: http://www.macs.hw.ac.uk/~pb56/neo4j.tar.gz (for Linux lab PCs 86MB zip file)
Neo4J for Very Large Scale Systems
¤ https://www.youtube.com/watch?v=BfPDZf2wmqg&feature=youtu.be
Big Data Management
85
Intro to Cypher Video Link
¤https://youtu.be/VdivJqlPzCI?t=2
Big Data Management
86