Neo4j Graph Database Homework
1
Graph Database Implementation and
Data Import
Summer, 2021
Dar-jen Chang
Computer Science and Engineering
University of Louisville
2
Sources and References
[1] The Neo4j Cypher Manual v4.2 2021 pdf version
(https://neo4j.com/docs/pdf/neo4j-cypher-manual-4.2.pdf)
or online version
(https://neo4j.com/docs/cypher-manual/current/ )
[2] Cypher Refcard 4.2
(https://neo4j.com/docs/pdf/cypher-refcard-4.2.pdf )
[3] APOC User Guide 4.2
(https://neo4j.com/labs/apoc/4.2/)
[4] Load CSV tips
https://neo4j.com/developer/guide-import-csv/
TOC
• Graph Database Design and Implementation
Process
• The GoT Graph Database
• Degree Distribution Exploration
• CSV File Format and CSV File Import
• Dataset Examples
3
4
Database Application Design Process
• Information and processing requirement analysis
• ER (Entity Relationship) Modeling
• Database schema design (e.g., relational model, graph
data model, etc.)
• Database implementation
• Application design and development
5
ER Model Example
6
ER Model Transformations
• To Relational data model
• To Graph data model
7
Graph Database Implementation
• Graph data model
• Constraints and indexes
• Data import
• Testing
8
Constraints and indexes
• We can specify unique constraints that guarantee
uniqueness of a certain property on nodes with a specific
label. These constraints are also used by the MERGE clause to make certain that a node only exists once.
• For example, we decide that every Movie node should
have a unique title:
CREATE CONSTRAINT ON (movie:Movie) ASSERT movie.title IS UNIQUE
Note that adding the unique constraint will implicitly add
an index on that property, so we won’t have to do that
separately.
9
Indexes
• The main reason for using indexes in a graph database is
to find the starting point in the graph as fast as possible.
• Indexes can be added at any time. Note that it will take
some time for an index to come online when there’s
existing data.
CREATE INDEX ON :Person(name)
10
APOC for Constraints and Indexes
• APOC provides a convenient function to create both indexes and
constraints. The function signature is given below:
apoc.schema.assert (indexes :: MAP?, constraints :: MAP?,
dropExisting = true :: BOOLEAN?) :: (label :: STRING?, key :: STRING?, keys :: LIST?OF
STRING?, unique :: BOOLEAN?, action :: STRING?)
11
APOC for Constraints and Indexes
Example
// Yelp datasets call apoc.schema.assert (
// Indexes {Category:['name']},
// Constraints {Business:['id'],User:['id'],Review:['id’]}
);
TOC
• Graph Database Design and Implementation
Process
• The GoT Graph Database
• Degree Distribution Exploration
• CSV File Format and CSV File Import
• Dataset Examples
12
13
Got
Andrew Beveridge and Jie Shan, NETWORK of THRONES, April
2016, Math Horizons https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf
“… Our network, shown in figure 2 (next slide), has sets of
vertices V and edges E. The 107 vertices represent the characters,
including ladies and lords, guards and mercenaries, councilmen
and consorts, villagers and savages. The vertices are joined by 352
integer-weighted edges, in which higher weights correspond to
stronger relationships between those characters. We generated the
edges using A Storm of Swords, the third book in the series. … We
parsed the ebook, incrementing the edge weight between two
characters when their names (or nicknames) appeared within 15
words of one another. Afterward, we performed some manual
validation and cleaning. Note that an edge between two characters
doesn’t necessarily mean that they are friends—it simply means
that they interact, speak of one another, or are mentioned together.
…”
14
Got Dataset
Stormofswords.csv
1 Source,Target,Weight
2 Aemon,Grenn,5
3 Aemon,Samwell,31
4 Aerys,Jaime,18
5 Aerys,Robert,6
6 Aerys,Tyrion,5
7 Aerys,Tywin,8
8 Alliser,Mance,5
9 Amory,Oberyn,5
10 Arya,Anguy,11
11 Arya,Beric,23
12 Arya,Bran,9
13 Arya,Brynden,6
14 Arya,Cersei,5 ... ........................... 352 Ygritte,Qhorin,7
353 Ygritte,Rattleshirt,9
15
The GoT Database - 1
create database Create a database called GoT (version 4.2.7) and add the plugins:
APOC and Graph Data Science Library(GDSL):
16
The GoT Database - 2
prepare for data import
Open import folder and copy stormofswords.csv to the folder.
17
The GoT Database - 3
create constraints and indices
• Start the GoT database.
• Open Neo4j Browser and enter:
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE
18
• Test read script
LOAD CSV WITH HEADERS FROM "file:///stormofswords.csv" AS row RETURN count(*)
The GoT Database - 4
test stormofswords.csv import
19
• Import data script
LOAD CSV WITH HEADERS FROM "file:///stormofswords.csv" AS row merge (p1:Person {name:row.Source}) merge (p2:Person {name:row.Target}) create (p1)-[c:CONN {weight:toInteger(row.Weight)}]->(p2)
The GoT Database - 5
import stormofswords.csv
20
• Match (n), ()-[r]->() return count(distinct n) as NumberOfPersons,
count(distinct r) as NumberOfConnections
The GoT Database - 6
test run
21
• call db.schema.visualization()
The GoT Database - 7
show schema
22
• Match(n) return n
The GoT Database - 8
show whole graph
23
• Match (p:Person) return p.name as Name, apoc.node.degree(p,'CONN') as Degree order by Degree desc
The GoT Database - 9
show vertex degrees
TOC
• Graph Database Design and Implementation
Process
• The GoT Graph Database
• Degree Distribution Exploration
• CSV File Format and CSV File Import
• Dataset Examples
24
25
Degree Distribution Exploration
• Vertex degree info is a very useful metric for graph exploration.
• Given a graph, we like to compute
summary statistics (e.g., min, max, mean, and std of degrees)
degree distribution (density and histogram)
• In this section, we demonstrate a general workflow of doing this
degree exploration with
Neo4j (the GoT graph data)
+
R or Python
26
Degree Distribution Exploration
Approach 1
Neo4j Graph Database
Export
R Python
Import Import
27
Degree Distribution Exploration
Approach 2
Neo4j Graph Database
R Neo4j client (neo4r) Python Neo4j client (neo4j or py2neo)
28
Degree Distribution Exploration
export degree.csv from GoT graph
match (p:Person) return id(p) as nid, apoc.node.degree(p,'CONN') as degree
order by nid
29
Degree Distribution Exploration
degree.csv file format warning
• The file, degree.csv, exported from Neo4j contains a 3-byte BOM
(Byte Order Mark): EF BB BF which indicates the file is
encoded in UTF-8.
• Since the file, degree.csv, only contains ASCII characters, we can
remove the BOM form the file (say, using the Visual Studio’s
Binary Editor) so that it can be read by any text reader (e.g., R’s
read_csv function) without any surprise.
30
Degree Distribution Exploration
degree.csv (GoT, 107 nodes)
1 nid,degree 2 0,5 3 1,4 4 2,3 5 3,1 6 4,19 7 5,6 8 6,4 9 7,6 10` 8,14 11 9,7 12 10,4 13 11,8 14 12,18 15 13,20
...... 108 106,1
31
Degree Distribution Exploration
degree.csv R recipe – read and view > dg = read.csv(file="degree.csv", header=TRUE) > colnames(dg) [1] "nid" "degree" > View(dg)
32
Degree Distribution Exploration
degree.csv R receipe – summary statistics
> summary(dg[,2]) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 4.000 6.579 7.000 36.000
> sd (dg[,2]) # standard deviation [1] 6.611935
33
Degree Distribution Exploration
degree.csv R recipe – plot degree vs nid (Bar)
> library(ggplot2) > ggplot(dg, aes(x = nid, y = degree)) + geom_col()
34
Degree Distribution Exploration
degree.csv R receipe – plot degree vs nid (line)
> library(ggplot2) > ggplot(dg, aes(x = nid, y = degree)) + geom_line()
35
Degree Distribution Exploration
degree.csv R recipe – plot degree histogram
> hist(dg[,2]) # easy, but not pleasing
36
Degree Distribution Exploration
degree.csv in R – plot degree histogram
> library(ggplot2) > ggplot(dg, aes(x=degree)) + geom_histogram(binwidth=1)
37
Degree Distribution Exploration
degree.csv R recipe – compute degree frequency
> freq = tabulate(dg[,2]+1) > max = max(dg[,2]) > max [1] 36 > degree = seq(from = 0, to = max) > degree_freq = data.frame(degree, freq) > View(degree_freq)
> # verify the freq > nrow(dg[dg$degree==1,]) [1] 16 > nrow(dg[dg$degree==max,]) [1] 1
38
Degree Distribution Exploration
degree.csv R recipe – plot degree frequency
> ggplot(degree_freq, aes(x = degree, y = freq)) + geom_col()
39
Degree Distribution Exploration
degree.csv R recipe – plot degree frequency
ggplot(degree_freq, aes(x = degree, y = freq)) + geom_line()
40
Degree Distribution Exploration
degree.csv R recipe – compute degree distribution
> sum = sum(degree_freq[,2]) > sum [1] 107 > degree_dist = degree_freq > degree_dist[,2] = degree_dist[,2]/ (1.0 * sum) > colnames(degree_dist) = c('degree', 'density') > sum(degree_dist[,2]) [1] 1 > View(degree_dist)
41
Degree Distribution Exploration
degree.csv R recipe – plot degree distribution
ggplot(degree_dist, aes(x = degree, y = density)) + geom_col()
42
Degree Distribution Exploration
degree.csv R recipe – plot degree distribution
ggplot(degree_dist, aes(x = degree, y = density)) + geom_line()
43
Degree Distribution Exploration
Yelp 2019 Dataset – Graph Data Model
44
Degree Distribution Exploration
Yelp 2019 Dataset – Neo4j match (u:User) return id(u) as user_id, apoc.node.degree(u,'WROTE') as number_reviews order by number_reviews desc
45
Degree Distribution Exploration
Yelp 2019 Dataset – Neo4j
MATCH (u:User) return u.id as id, apoc.node.degree(u,'WROTE') as number_reviews
46
Degree Distribution Exploration
Yelp 2019 Dataset – Neo4j
MATCH (u:User) return id(u) as id, apoc.node.degree(u,'WROTE') as number_reviews
47
Degree Distribution Exploration
Yelp 2019 Dataset – R scripts
> dg = read.csv(file="Yelp_2019_NoBOM.csv", header=TRUE) > colnames(dg) = c('nid', 'degree') > View(dg)
48
Degree Distribution Exploration
Yelp 2019 Dataset – R scripts
> freq = tabulate(dg[,2]+1) > degree = seq(from = 0, to = max) > degree_freq = data.frame(degree, freq) > View(degree_freq)
> sum(degree_freq[degree_freq$degree < 10,2]) / nrow(dg) [1] 0.9249764
# This means more than 92.5 percent of users wrote less # than 10 reviews
49
Degree Distribution Exploration
Yelp 2019 Dataset – R scripts
> degree_dist = degree_freq > degree_dist[,2] = degree_dist[,2]/ (1.0 * sum) > colnames(degree_dist) = c('degree', 'density') > View(degree_dist)
50
Degree Distribution Exploration
Yelp 2019 Dataset – R scripts
> library(ggplot2) > ggplot(degree_freq, aes(x = degree, y = freq)) + geom_line()
51
Degree Distribution Exploration
Yelp 2019 Dataset – R scripts
> ggplot(degree_freq, aes(x = degree, y = freq)) + geom_line() + coord_cartesian(xlim = c(0, 50), ylim = c(0, 0.6))
52
Degree Distribution Exploration
Yelp 2019 Dataset – R scripts
> ggplot(degree_freq, aes(x = degree, y = freq)) + geom_col() + coord_cartesian(xlim = c(0, 50), ylim = c(0, 0.6))
TOC
• Graph Database Design and Implementation
Process
• The GoT Graph Database
• Degree Distribution Exploration
• CSV File Format and CSV File Import
• Dataset Examples
53
54
Common Data File Formats
• CSV (Coma Separated Values)
• JSON (JavaScript Object Notation)
55
CSV File Format
• The character encoding must be UTF-8.
• The end line termination is system dependent, for example, \n on
Unix or \r\n on Windows.
• The terminator must be a comma , unless specified otherwise
using the FIELDTERMINATOR option.
• The character for string quotation is the double quote " (these are
stripped off when the data is read in).
• Any characters that need to be escaped can be escaped with the
backslash \ character.
56
CSV File Example
categories.csv (from the NorthWind dataset)
Header (optional)
R dataframe cat <- read.csv(file="categories.csv", header=TRUE, sep=",")
57
Neo4j Load (Import) Data Methods
• Cypher Load CSV command
• APOC's apoc.load.json procedure
58
Cypher LOAD CSV as an
ETL (Extract, Transform, Load) Tool
• supports loading / ingesting CSV data from an URI
• direct mapping of input data into complex graph/domain structure
• data conversion
• supports complex computations
• create or merge data, relationships and structure
59
Cypher LOAD CSV command
LOAD CSV [WITH HEADERS] FROM <file_spec> AS <var_name> [FIELDTERMINATOR <char>]
where
<file_spec> specifies CSV files stored on the database server and are then accessible using a file:/// URL. Alternatively, LOAD CSV also supports accessing
CSV files via HTTPS, HTTP, and FTP.
Example
(1) load csv with headers from "file:///order.csv" as line
(2) load csv with headers from "http://neo4j.com/order.csv" as line
(3) load csv with headers from "file:///order.csv" as line fieldterminator '\t'
60
LOAD CSV from Local (Server) File System
LOAD CSV WITH HEADERS FROM “file:///order.csv” AS line .......................................
where is the file, order.csv, in the local (i.e., server) file system?
61
Use LOAD CSV command to check csv files
(1) // check correct line count LOAD CSV WITH HEADERS FROM "file:///order.csv" AS line RETURN count(*)
(2) // display first few raw lines LOAD CSV FROM "file:///orders.csv" AS line RETURN line LIMIT 5
62
Using LOAD CSV Tips
• Empty fields have to be skipped or replaced with default values
during LOAD CSV
• All data from the CSV file is read as a string, you have to use
toInt, toFloat, split or similar functions to convert
• Split arrays in a cell by delimiter using split (combine with extract for conversions)
• Check your Cypher import statement for typos: labels, property
names and relationship types are case-sensitive
• Conditional conversions can be achieved with CASE
TOC
• Graph Database Design and Implementation
Process
• The GoT Graph Database
• Degree Distribution Exploration
• CSV File Format and CSV File Import
• Dataset Examples
63
64
Load CSV/JSON Case Study
• NorthWind DB
• Airport Network
Analyzing Airline Flight Data (the Graph Algorithms book,
pp.166 -181)
• Instacart datasets
• iMDB datasets
• Yelp datasets
65
Instacart Datasets
Instacart is an American company founded in 2012 that operates as a
same-day grocery delivery service. Customers select groceries
through a web application from various retailers and delivered by a
personal shopper.
66
Instacart Datasets
https://www.instacart.com/datasets/grocery-shopping-2017
67
Kaggle TMDB 5000 Dataset - 1
https://www.kaggle.com/tmdb/tmdb-movie-metadata
tmdb_5000_credits.csv
movie_id, title, cast, crew
19995,Avatar,
"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, ...............................................................................................]",
"[{""credit_id"": ""52fe48009251416c750aca23"", ""department"": ""Editing"", ""gender"": 0, ""id"": 1721, ""job"": ""Editor"", ""name"": ""Stephen E. Rivkin""}, {""credit_id"": ""539c47ecc3a36810e3001f87"", ""department"": ""Art"", ""gender"": 2, ""id"": 496, ""job"": ""Production Design"", ""name"": ""Rick Carter""}, {""credit_id"": ""54491c89c3a3680fb4001cf7"", ""department"": ""Sound"", ""gender"": 0, ""id"": 900, ""job"": ""Sound Designer"", ""name"": ""Christopher Boyes""}, .......................................................................]"
68
Kaggle TMDB 5000 Dataset - 2
tmdb_5000_movies.csv
budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,prod
uction_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,t
agline,title,vote_average,vote_count
237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""},
{""id"": 14, ""name"": ""Fantasy""}, {""id"": 878, ""name"": ""Science
Fiction""}]",http://www.avatarmovie.com/,19995, "[{""id"": 1463, ""name"": ""culture
clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""space war""},
{""id"": 3388, ""name"": ""space colony""}, {""id"": 3679, ""name"": ""society""}, {""id"":
3801, ""name"": ""space travel""}, {""id"": 9685, ""name"": ""futuristic""}, {""id"": 9840,
""name"": ""romance""}, {""id"": 9882, ""name"": ""space""}, {""id"": 9951, ""name"":
""alien""}, {""id"": 10148, ""name"": ""tribe""}, {""id"": 10158, ""name"": ""alien
planet""}, {""id"": 10987, ""name"": ""cgi""}, {""id"": 11399, ""name"": ""marine""},
{""id"": 13065, ""name"": ""soldier""}, {""id"": 14643, ""name"": ""battle""}, {""id"":
14720, ""name"": ""love affair""}, {""id"": 165431, ""name"": ""anti war""}, {""id"":
193554, ""name"": ""power relations""}, {""id"": 206690, ""name"": ""mind and soul""},
{""id"": 209714, ""name"": ""3d""}]",en,Avatar,"In the 22nd century, a paraplegic Marine is
dispatched to the moon Pandora on a unique mission, but becomes torn between following
orders and protecting an alien civilization.",150.437577, "[{""name"": ""Ingenious Film
Partners"", ""id"": 289}, {""name"": ""Twentieth Century Fox Film Corporation"", ""id"":
306}, {""name"": ""Dune Entertainment"", ""id"": 444}, {""name"": ""Lightstorm
Entertainment"", ""id"": 574}]","[{""iso_3166_1"": ""US"", ""name"": ""United States of
America""}, {""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}]",2009-12-
10,2787965087,162,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso_639_1"": ""es"",
""name"": ""Espa\u00f1ol""}]",Released,Enter the World of Pandora.,Avatar,7.2,11800
69
iMDB Movie Dataset
Check out this website:
www.imdb.com/interfaces/
70
Yelp Dataset
Check out this website:
https://www.yelp.com/dataset
Yelp 2019 Datasets
File Name File Size Number of records
user.json 1,847,071 KB 1,326,101
business.json 135,039 KB 192,609
review.json 4,099,872 KB 5,261,669
tip.json 192,928 KB 1,098,325