Neo4j Graph Database Homework

profilezfuson
Neo4j_4_GraphDatabaseImplementationandDataImport.pdf

1

Graph Database Implementation and

Data Import

Summer, 2021

Dar-jen Chang

Computer Science and Engineering

University of Louisville

2

Sources and References

[1] The Neo4j Cypher Manual v4.2 2021 pdf version

(https://neo4j.com/docs/pdf/neo4j-cypher-manual-4.2.pdf)

or online version

(https://neo4j.com/docs/cypher-manual/current/ )

[2] Cypher Refcard 4.2

(https://neo4j.com/docs/pdf/cypher-refcard-4.2.pdf )

[3] APOC User Guide 4.2

(https://neo4j.com/labs/apoc/4.2/)

[4] Load CSV tips

https://neo4j.com/developer/guide-import-csv/

TOC

• Graph Database Design and Implementation

Process

• The GoT Graph Database

• Degree Distribution Exploration

• CSV File Format and CSV File Import

• Dataset Examples

3

4

Database Application Design Process

• Information and processing requirement analysis

• ER (Entity Relationship) Modeling

• Database schema design (e.g., relational model, graph

data model, etc.)

• Database implementation

• Application design and development

5

ER Model Example

6

ER Model Transformations

• To Relational data model

• To Graph data model

7

Graph Database Implementation

• Graph data model

• Constraints and indexes

• Data import

• Testing

8

Constraints and indexes

• We can specify unique constraints that guarantee

uniqueness of a certain property on nodes with a specific

label. These constraints are also used by the MERGE clause to make certain that a node only exists once.

• For example, we decide that every Movie node should

have a unique title:

CREATE CONSTRAINT ON (movie:Movie) ASSERT movie.title IS UNIQUE

Note that adding the unique constraint will implicitly add

an index on that property, so we won’t have to do that

separately.

9

Indexes

• The main reason for using indexes in a graph database is

to find the starting point in the graph as fast as possible.

• Indexes can be added at any time. Note that it will take

some time for an index to come online when there’s

existing data.

CREATE INDEX ON :Person(name)

10

APOC for Constraints and Indexes

• APOC provides a convenient function to create both indexes and

constraints. The function signature is given below:

apoc.schema.assert (indexes :: MAP?, constraints :: MAP?,

dropExisting = true :: BOOLEAN?) :: (label :: STRING?, key :: STRING?, keys :: LIST?OF

STRING?, unique :: BOOLEAN?, action :: STRING?)

11

APOC for Constraints and Indexes

Example

// Yelp datasets call apoc.schema.assert (

// Indexes {Category:['name']},

// Constraints {Business:['id'],User:['id'],Review:['id’]}

);

TOC

• Graph Database Design and Implementation

Process

• The GoT Graph Database

• Degree Distribution Exploration

• CSV File Format and CSV File Import

• Dataset Examples

12

13

Got

Andrew Beveridge and Jie Shan, NETWORK of THRONES, April

2016, Math Horizons https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf

“… Our network, shown in figure 2 (next slide), has sets of

vertices V and edges E. The 107 vertices represent the characters,

including ladies and lords, guards and mercenaries, councilmen

and consorts, villagers and savages. The vertices are joined by 352

integer-weighted edges, in which higher weights correspond to

stronger relationships between those characters. We generated the

edges using A Storm of Swords, the third book in the series. … We

parsed the ebook, incrementing the edge weight between two

characters when their names (or nicknames) appeared within 15

words of one another. Afterward, we performed some manual

validation and cleaning. Note that an edge between two characters

doesn’t necessarily mean that they are friends—it simply means

that they interact, speak of one another, or are mentioned together.

…”

14

Got Dataset

Stormofswords.csv

1 Source,Target,Weight

2 Aemon,Grenn,5

3 Aemon,Samwell,31

4 Aerys,Jaime,18

5 Aerys,Robert,6

6 Aerys,Tyrion,5

7 Aerys,Tywin,8

8 Alliser,Mance,5

9 Amory,Oberyn,5

10 Arya,Anguy,11

11 Arya,Beric,23

12 Arya,Bran,9

13 Arya,Brynden,6

14 Arya,Cersei,5 ... ........................... 352 Ygritte,Qhorin,7

353 Ygritte,Rattleshirt,9

15

The GoT Database - 1

create database Create a database called GoT (version 4.2.7) and add the plugins:

APOC and Graph Data Science Library(GDSL):

16

The GoT Database - 2

prepare for data import

Open import folder and copy stormofswords.csv to the folder.

17

The GoT Database - 3

create constraints and indices

• Start the GoT database.

• Open Neo4j Browser and enter:

CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE

18

• Test read script

LOAD CSV WITH HEADERS FROM "file:///stormofswords.csv" AS row RETURN count(*)

The GoT Database - 4

test stormofswords.csv import

19

• Import data script

LOAD CSV WITH HEADERS FROM "file:///stormofswords.csv" AS row merge (p1:Person {name:row.Source}) merge (p2:Person {name:row.Target}) create (p1)-[c:CONN {weight:toInteger(row.Weight)}]->(p2)

The GoT Database - 5

import stormofswords.csv

20

• Match (n), ()-[r]->() return count(distinct n) as NumberOfPersons,

count(distinct r) as NumberOfConnections

The GoT Database - 6

test run

21

• call db.schema.visualization()

The GoT Database - 7

show schema

22

• Match(n) return n

The GoT Database - 8

show whole graph

23

• Match (p:Person) return p.name as Name, apoc.node.degree(p,'CONN') as Degree order by Degree desc

The GoT Database - 9

show vertex degrees

TOC

• Graph Database Design and Implementation

Process

• The GoT Graph Database

• Degree Distribution Exploration

• CSV File Format and CSV File Import

• Dataset Examples

24

25

Degree Distribution Exploration

• Vertex degree info is a very useful metric for graph exploration.

• Given a graph, we like to compute

summary statistics (e.g., min, max, mean, and std of degrees)

degree distribution (density and histogram)

• In this section, we demonstrate a general workflow of doing this

degree exploration with

Neo4j (the GoT graph data)

+

R or Python

26

Degree Distribution Exploration

Approach 1

Neo4j Graph Database

Export

R Python

Import Import

27

Degree Distribution Exploration

Approach 2

Neo4j Graph Database

R Neo4j client (neo4r) Python Neo4j client (neo4j or py2neo)

28

Degree Distribution Exploration

export degree.csv from GoT graph

match (p:Person) return id(p) as nid, apoc.node.degree(p,'CONN') as degree

order by nid

29

Degree Distribution Exploration

degree.csv file format warning

• The file, degree.csv, exported from Neo4j contains a 3-byte BOM

(Byte Order Mark): EF BB BF which indicates the file is

encoded in UTF-8.

• Since the file, degree.csv, only contains ASCII characters, we can

remove the BOM form the file (say, using the Visual Studio’s

Binary Editor) so that it can be read by any text reader (e.g., R’s

read_csv function) without any surprise.

30

Degree Distribution Exploration

degree.csv (GoT, 107 nodes)

1 nid,degree 2 0,5 3 1,4 4 2,3 5 3,1 6 4,19 7 5,6 8 6,4 9 7,6 10` 8,14 11 9,7 12 10,4 13 11,8 14 12,18 15 13,20

...... 108 106,1

31

Degree Distribution Exploration

degree.csv R recipe – read and view > dg = read.csv(file="degree.csv", header=TRUE) > colnames(dg) [1] "nid" "degree" > View(dg)

32

Degree Distribution Exploration

degree.csv R receipe – summary statistics

> summary(dg[,2]) Min. 1st Qu. Median Mean 3rd Qu. Max.

1.000 2.000 4.000 6.579 7.000 36.000

> sd (dg[,2]) # standard deviation [1] 6.611935

33

Degree Distribution Exploration

degree.csv R recipe – plot degree vs nid (Bar)

> library(ggplot2) > ggplot(dg, aes(x = nid, y = degree)) + geom_col()

34

Degree Distribution Exploration

degree.csv R receipe – plot degree vs nid (line)

> library(ggplot2) > ggplot(dg, aes(x = nid, y = degree)) + geom_line()

35

Degree Distribution Exploration

degree.csv R recipe – plot degree histogram

> hist(dg[,2]) # easy, but not pleasing

36

Degree Distribution Exploration

degree.csv in R – plot degree histogram

> library(ggplot2) > ggplot(dg, aes(x=degree)) + geom_histogram(binwidth=1)

37

Degree Distribution Exploration

degree.csv R recipe – compute degree frequency

> freq = tabulate(dg[,2]+1) > max = max(dg[,2]) > max [1] 36 > degree = seq(from = 0, to = max) > degree_freq = data.frame(degree, freq) > View(degree_freq)

> # verify the freq > nrow(dg[dg$degree==1,]) [1] 16 > nrow(dg[dg$degree==max,]) [1] 1

38

Degree Distribution Exploration

degree.csv R recipe – plot degree frequency

> ggplot(degree_freq, aes(x = degree, y = freq)) + geom_col()

39

Degree Distribution Exploration

degree.csv R recipe – plot degree frequency

ggplot(degree_freq, aes(x = degree, y = freq)) + geom_line()

40

Degree Distribution Exploration

degree.csv R recipe – compute degree distribution

> sum = sum(degree_freq[,2]) > sum [1] 107 > degree_dist = degree_freq > degree_dist[,2] = degree_dist[,2]/ (1.0 * sum) > colnames(degree_dist) = c('degree', 'density') > sum(degree_dist[,2]) [1] 1 > View(degree_dist)

41

Degree Distribution Exploration

degree.csv R recipe – plot degree distribution

ggplot(degree_dist, aes(x = degree, y = density)) + geom_col()

42

Degree Distribution Exploration

degree.csv R recipe – plot degree distribution

ggplot(degree_dist, aes(x = degree, y = density)) + geom_line()

43

Degree Distribution Exploration

Yelp 2019 Dataset – Graph Data Model

44

Degree Distribution Exploration

Yelp 2019 Dataset – Neo4j match (u:User) return id(u) as user_id, apoc.node.degree(u,'WROTE') as number_reviews order by number_reviews desc

45

Degree Distribution Exploration

Yelp 2019 Dataset – Neo4j

MATCH (u:User) return u.id as id, apoc.node.degree(u,'WROTE') as number_reviews

46

Degree Distribution Exploration

Yelp 2019 Dataset – Neo4j

MATCH (u:User) return id(u) as id, apoc.node.degree(u,'WROTE') as number_reviews

47

Degree Distribution Exploration

Yelp 2019 Dataset – R scripts

> dg = read.csv(file="Yelp_2019_NoBOM.csv", header=TRUE) > colnames(dg) = c('nid', 'degree') > View(dg)

48

Degree Distribution Exploration

Yelp 2019 Dataset – R scripts

> freq = tabulate(dg[,2]+1) > degree = seq(from = 0, to = max) > degree_freq = data.frame(degree, freq) > View(degree_freq)

> sum(degree_freq[degree_freq$degree < 10,2]) / nrow(dg) [1] 0.9249764

# This means more than 92.5 percent of users wrote less # than 10 reviews

49

Degree Distribution Exploration

Yelp 2019 Dataset – R scripts

> degree_dist = degree_freq > degree_dist[,2] = degree_dist[,2]/ (1.0 * sum) > colnames(degree_dist) = c('degree', 'density') > View(degree_dist)

50

Degree Distribution Exploration

Yelp 2019 Dataset – R scripts

> library(ggplot2) > ggplot(degree_freq, aes(x = degree, y = freq)) + geom_line()

51

Degree Distribution Exploration

Yelp 2019 Dataset – R scripts

> ggplot(degree_freq, aes(x = degree, y = freq)) + geom_line() + coord_cartesian(xlim = c(0, 50), ylim = c(0, 0.6))

52

Degree Distribution Exploration

Yelp 2019 Dataset – R scripts

> ggplot(degree_freq, aes(x = degree, y = freq)) + geom_col() + coord_cartesian(xlim = c(0, 50), ylim = c(0, 0.6))

TOC

• Graph Database Design and Implementation

Process

• The GoT Graph Database

• Degree Distribution Exploration

• CSV File Format and CSV File Import

• Dataset Examples

53

54

Common Data File Formats

• CSV (Coma Separated Values)

• JSON (JavaScript Object Notation)

55

CSV File Format

• The character encoding must be UTF-8.

• The end line termination is system dependent, for example, \n on

Unix or \r\n on Windows.

• The terminator must be a comma , unless specified otherwise

using the FIELDTERMINATOR option.

• The character for string quotation is the double quote " (these are

stripped off when the data is read in).

• Any characters that need to be escaped can be escaped with the

backslash \ character.

56

CSV File Example

categories.csv (from the NorthWind dataset)

Header (optional)

R dataframe cat <- read.csv(file="categories.csv", header=TRUE, sep=",")

57

Neo4j Load (Import) Data Methods

• Cypher Load CSV command

• APOC's apoc.load.json procedure

58

Cypher LOAD CSV as an

ETL (Extract, Transform, Load) Tool

• supports loading / ingesting CSV data from an URI

• direct mapping of input data into complex graph/domain structure

• data conversion

• supports complex computations

• create or merge data, relationships and structure

59

Cypher LOAD CSV command

LOAD CSV [WITH HEADERS] FROM <file_spec> AS <var_name> [FIELDTERMINATOR <char>]

where

<file_spec> specifies CSV files stored on the database server and are then accessible using a file:/// URL. Alternatively, LOAD CSV also supports accessing

CSV files via HTTPS, HTTP, and FTP.

Example

(1) load csv with headers from "file:///order.csv" as line

(2) load csv with headers from "http://neo4j.com/order.csv" as line

(3) load csv with headers from "file:///order.csv" as line fieldterminator '\t'

60

LOAD CSV from Local (Server) File System

LOAD CSV WITH HEADERS FROM “file:///order.csv” AS line .......................................

where is the file, order.csv, in the local (i.e., server) file system?

61

Use LOAD CSV command to check csv files

(1) // check correct line count LOAD CSV WITH HEADERS FROM "file:///order.csv" AS line RETURN count(*)

(2) // display first few raw lines LOAD CSV FROM "file:///orders.csv" AS line RETURN line LIMIT 5

62

Using LOAD CSV Tips

• Empty fields have to be skipped or replaced with default values

during LOAD CSV

• All data from the CSV file is read as a string, you have to use

toInt, toFloat, split or similar functions to convert

• Split arrays in a cell by delimiter using split (combine with extract for conversions)

• Check your Cypher import statement for typos: labels, property

names and relationship types are case-sensitive

• Conditional conversions can be achieved with CASE

TOC

• Graph Database Design and Implementation

Process

• The GoT Graph Database

• Degree Distribution Exploration

• CSV File Format and CSV File Import

• Dataset Examples

63

64

Load CSV/JSON Case Study

• NorthWind DB

• Airport Network

Analyzing Airline Flight Data (the Graph Algorithms book,

pp.166 -181)

• Instacart datasets

• iMDB datasets

• Yelp datasets

65

Instacart Datasets

Instacart is an American company founded in 2012 that operates as a

same-day grocery delivery service. Customers select groceries

through a web application from various retailers and delivered by a

personal shopper.

66

Instacart Datasets

https://www.instacart.com/datasets/grocery-shopping-2017

67

Kaggle TMDB 5000 Dataset - 1

https://www.kaggle.com/tmdb/tmdb-movie-metadata

tmdb_5000_credits.csv

movie_id, title, cast, crew

19995,Avatar,

"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, ...............................................................................................]",

"[{""credit_id"": ""52fe48009251416c750aca23"", ""department"": ""Editing"", ""gender"": 0, ""id"": 1721, ""job"": ""Editor"", ""name"": ""Stephen E. Rivkin""}, {""credit_id"": ""539c47ecc3a36810e3001f87"", ""department"": ""Art"", ""gender"": 2, ""id"": 496, ""job"": ""Production Design"", ""name"": ""Rick Carter""}, {""credit_id"": ""54491c89c3a3680fb4001cf7"", ""department"": ""Sound"", ""gender"": 0, ""id"": 900, ""job"": ""Sound Designer"", ""name"": ""Christopher Boyes""}, .......................................................................]"

68

Kaggle TMDB 5000 Dataset - 2

tmdb_5000_movies.csv

budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,prod

uction_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,t

agline,title,vote_average,vote_count

237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""},

{""id"": 14, ""name"": ""Fantasy""}, {""id"": 878, ""name"": ""Science

Fiction""}]",http://www.avatarmovie.com/,19995, "[{""id"": 1463, ""name"": ""culture

clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""space war""},

{""id"": 3388, ""name"": ""space colony""}, {""id"": 3679, ""name"": ""society""}, {""id"":

3801, ""name"": ""space travel""}, {""id"": 9685, ""name"": ""futuristic""}, {""id"": 9840,

""name"": ""romance""}, {""id"": 9882, ""name"": ""space""}, {""id"": 9951, ""name"":

""alien""}, {""id"": 10148, ""name"": ""tribe""}, {""id"": 10158, ""name"": ""alien

planet""}, {""id"": 10987, ""name"": ""cgi""}, {""id"": 11399, ""name"": ""marine""},

{""id"": 13065, ""name"": ""soldier""}, {""id"": 14643, ""name"": ""battle""}, {""id"":

14720, ""name"": ""love affair""}, {""id"": 165431, ""name"": ""anti war""}, {""id"":

193554, ""name"": ""power relations""}, {""id"": 206690, ""name"": ""mind and soul""},

{""id"": 209714, ""name"": ""3d""}]",en,Avatar,"In the 22nd century, a paraplegic Marine is

dispatched to the moon Pandora on a unique mission, but becomes torn between following

orders and protecting an alien civilization.",150.437577, "[{""name"": ""Ingenious Film

Partners"", ""id"": 289}, {""name"": ""Twentieth Century Fox Film Corporation"", ""id"":

306}, {""name"": ""Dune Entertainment"", ""id"": 444}, {""name"": ""Lightstorm

Entertainment"", ""id"": 574}]","[{""iso_3166_1"": ""US"", ""name"": ""United States of

America""}, {""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}]",2009-12-

10,2787965087,162,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso_639_1"": ""es"",

""name"": ""Espa\u00f1ol""}]",Released,Enter the World of Pandora.,Avatar,7.2,11800

69

iMDB Movie Dataset

Check out this website:

www.imdb.com/interfaces/

70

Yelp Dataset

Check out this website:

https://www.yelp.com/dataset

Yelp 2019 Datasets

File Name File Size Number of records

user.json 1,847,071 KB 1,326,101

business.json 135,039 KB 192,609

review.json 4,099,872 KB 5,261,669

tip.json 192,928 KB 1,098,325