Getting Started With Graph Analytics
In this tutorial, you’ll learn the basic workflow for doing graph analytics in Snowflake with RelationalAI. You will learn how to run a graph algorithm on data from the Open Movie Database (OMDB) to answer the question “Who are the most influential people in the movie business?”
Goal
After completing this tutorial, you’ll know how to:
- Model entity relationships as a graph.
- Describe graphs using a relationship table.
- Use RelationalAI to perform graph algorithms in Snowflake.
Prerequisites
You’ll need access to a Snowflake account with the RelationalAI Integration installed. This is required. Before reading this tutorial, you should complete the steps in the User Quick Start.
Scenario
How do you measure a person’s influence in the movie industry? You might start by counting how many other people that person has worked with. To be more accurate, though, you should consider the influence of those other people. That is, someone who has worked with many influential people should be more influential than someone who has worked with lots of less influential people.
Keeping track of who has worked with who, and how each person’s rank affects the rank of others, is a challenge. It’s easier to grasp the situation as a graph, like the one illustrated below, which represents people in the movie industry as nodes — circles in the diagram — and connects two people together with an edge — a line in the diagram — if they have collaborated with each other in a movie. Since this particular graph represents people who have collaborated with each other, it is called a collaboration graph.
For example, here’s a small portion of the nodes and edges in the OMDB collaboration graph:
One popular graph algorithm for measuring the influence of a node in a collaboration graph is the PageRank algorithm, which assigns to each node a value between 0 and 1 that takes into account the number of and PageRank values of adjacent nodes. The higher the PageRank value, the more influential the node. Many important graph algorithms are difficult, if not impossible, to implement in pure SQL. PageRank, for instance, requires advanced programming techniques, like recursion, and specialized scripting capabilities.
In this tutorial, you’ll use RelationalAI’s SQL Graph Analytics Library to create a collaboration graph using data in the Open Movie Database (OMDB) and run the PageRank algorithm on the graph to determine the most influential people in the database.
Steps
This tutorial is meant to be completed in a Snowflake SQL Worksheet. Before you continue, open a new worksheet and select a database and user role with access to the RelationalAI Snowflake Integration.
The complete SQL worksheet that accompanies this tutorial is available at the bottom of this page.
Load the OMDB Data Set
The OMDB contains tables for people, movies, and casting information with the schema described in the follwing entity relationship (ER) diagram:
The data for each entity is contained in four CSV files:
all_people.csv
all_movies.csv
job_names.csv
all_casts.csv
To load the all_people.csv
file, for instance,
run the following SQL in your worksheet:
USE DATABASE <sf_db_name>;
USE SCHEMA <schema_name>;
CREATE TABLE Person(
id INT PRIMARY KEY,
name VARCHAR,
birthday DATE,
deathday DATE,
gender INT
);
COPY INTO Person
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_people.csv'
FILE_FORMAT = (
TYPE = CSV
SKIP_HEADER = 1
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);
You can load the remaining CSV files in a similar fashion. See the complete worksheet for full details.
The CSV files used in this tutorial were downloaded from the Open Movie Database (opens in a new tab) on October 6, 2023 and are used under the CC BY 2.0 DE (opens in a new tab) license. Small formatting changes were made.
Create a Relationship Table
Next, execute the following to create a table named collaborated_with
containing pairs of Person
IDs corresponding to
people who have been cast together in the same movie:
CREATE TABLE collaborated_with AS (
SELECT
c1.person_id AS person1_id,
c2.person_id AS person2_id
FROM CastMember c1
JOIN CastMember c2 ON c1.movie_id = c2.movie_id
WHERE c1.person_id < c2.person_id -- To avoid duplicate people
);
The collaborated_with
table describes the
relationships between people in the OMDB collaboration graph.
Each row corresponds to an edge between two people in the graph.
Select RelationalAI Resources
Graphs are stored in RelationalAI databases, and RelationalAI engines supply the computational power for executing graph algorithms. So, before you can create a graph and run an algorithm, you must specify which RelationalAI resources to use.
For the best experience, use an engine of size S
or greater
for this tutorial.
RelationalAI resources may only be created by integration administrators.
See Integration Management: RAI Resources
for more information.
You can select RelationalAI resources
with the use_rai_database
and use_rai_engine
procedures:
CALL RAI.use_rai_database('<rai_db_name>');
/*+-------------------+
| USE_RAI_DATABASE |
+-------------------+
| <rai_db_name> |
+-------------------+ */
CALL RAI.use_rai_engine('<rai_engine_name>');
/*+-------------------+
| USE_RAI_ENGINE |
+-------------------+
| <rai_engine_name> |
+-------------------+ */
Create a Data Stream
Data streams are used to synchronize data in a Snowflake table with a RelationalAI database so that graph algorithms, and other kinds of workloads, may be performed on the data.
Use the create_data_stream
procedure to initiate this data synchronization:
CALL RAI.create_data_stream('collaborated_with');
/*+------------------------------------------------+
| { "account": "***", ..., "state": "CREATED" } |
+------------------------------------------------+ */
By default, data streams have the same name as the table from which they are created.
The create_data_stream
procedure finishes once the
data stream has been created.
However, this does not mean that the data stream is ready to use.
You must wait until the data stream has finished syncing.
This may take up to two minutes.
You may check the status of your data stream using
the get_data_stream_status
procedure:
CALL RAI.get_data_stream_status('collaborated_with');
/*+--------------------+------------------------------------------------+
| PROPERTY | VALUE |
+--------------------+------------------------------------------------+
| Source | "<sf_db_name>.<schema_name>.collaborated_with" |
| DB Link | "<sf_db_name>.rai" |
| Integration | "<integration_name>" |
| Data sync status | "Syncing: pending unloading from SF" |
| Data stream health | "Healthy" |
| ... | ... |
+--------------------+------------------------------------------------+ */
When Data sync status says Fully synced, your data stream is ready and you may proceed to the next step.
Create a Graph
Now you can create a graph in RelationalAI
from the collaborated_with
data stream
using the create_graph
procedure:
CALL RAI.create_graph('collaboration_graph', 'collaborated_with');
/*+------------------------------------------------------------------------------------------------+
| {"message":"Graph 'collaboration_graph' created in database '<rai_db_name>'.", "success":true} |
+------------------------------------------------------------------------------------------------+ */
Compute PageRank
To compute the PageRank for each node in the collaboration graph,
call the pagerank
function:
CREATE TABLE influence AS (
SELECT * FROM TABLE(RAI.pagerank('collaboration_graph'))
);
The collaboration_graph
has
over 1 million nodes and almost 9 million edges.
Computing the PageRank will take a minute or two.
In this example, the PageRank values are computed in RelationalAI
and stored in a Snowflake table named influence
.
The pagerank
function returns a table with two columns:
node
, containing the IDs of nodes in the graph.value
, containing the PageRank value of each node.
You can see this by inspecting the first few rows of the influence
table:
SELECT * FROM influence LIMIT 5;
/*+------+------------------+
| NODE | VALUE |
|------+------------------+
| 1 | 0.0000289266722 |
| 2 | 0.00004992577285 |
| 3 | 0.00003116103839 |
| 4 | 0.00002542779375 |
| 5 | 0.00003826305083 |
+------+------------------+ */
Find the Influencers
Node IDs in the influence
table correspond to
Person
IDs in the OMDB data set.
So, you can find the names of the most influential people in the OMDB
by matching people to node IDs in the influence
table
and sorting the results by the value
column:
SELECT Person.name, influence.value
FROM Person, influence
WHERE Person.id = influence.node
ORDER BY influence.value DESC
LIMIT 10;
/*+----------------------+------------------+
| NAME | VALUE |
|----------------------+------------------+
| Christine Westermann | 0.0003306634481 |
| Götz Alsmann | 0.0003304435814 |
| Roger Corman | 0.00003116103839 |
| Godfrey Ho | 0.0001110573419 |
| Lloyd Kaufman | 0.0001055304991 |
| James Burrows | 0.00009991588724 |
| Udo Kier | 0.00009568671509 |
| Ennio Morricone | 0.00009510038097 |
| Uwe Karpa | 0.00009370380723 |
| Jackie Chan | 0.00009297487171 |
+----------------------+------------------+ */
The output includes influential people from across the international film industry, such as directors Roger Corman (opens in a new tab) and Godfrey Ho (opens in a new tab), the composer Ennio Morricone (opens in a new tab), and the actor Jackie Chan (opens in a new tab).
Summary
Graph analytics can help you identify important people in a network based on the relationships in that network. In this tutorial, you saw this in action by using the PageRank algorithm from RelationalAI’s SQL Graph Analytics Library to identify influential people in the OMDB database. All without leaving a Snowflake SQL Worksheet!
In particular, you learned how to:
- Model relationships in a data set using a relationship table.
- Synchronize your data between Snowflake and RelationalAI using the
create_data_stream
procedure. - Create a graph in RelationalAI from a data stream
using the
create_graph
procedure. - Compute the PageRank of each node in a graph
using the
pagerank
function. - Match results from
pagerank
to entities in the data.
This process is the same for each new graph you create. And since data streams keep graphs in sync with their Snowflake source tables, every time you run an algorithm, the results will reflect the latest data.
Next Steps
In this tutorial, every node in the collaboration graph represents the same type of entity: a person. In the Schema Mapping Tutorial, you’ll learn how to create graphs involving multiple types of entities.
Check out the Graph Analytics Overview to see every algorithm available in RelationalAI’s SQL Graph Analytics Library. To learn more about the RAI Integration Services for Snowflake, see Working With the Integration.
Full Worksheet
Copy and paste the following code into a Snowflake worksheet to follow along with the tutorial.
-----------------------------
-- STEP 1: LOAD OMDB DATA SET
-----------------------------
-- Time: ~1m
USE DATABASE <sf_db_name>;
USE SCHEMA <schema_name>;
CREATE TABLE Person(
id INT PRIMARY KEY,
name VARCHAR,
birthday DATE,
deathday DATE,
gender INT
);
COPY INTO Person
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_people.csv'
FILE_FORMAT = (
TYPE = CSV
SKIP_HEADER = 1
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);
CREATE TABLE Movie(
id INT PRIMARY KEY,
name VARCHAR,
parent_id INT,
date date
);
COPY INTO Movie
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_movies.csv'
FILE_FORMAT = (
TYPE = CSV
SKIP_HEADER = 1
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);
CREATE TABLE Job(
job_id INT PRIMARY KEY,
name VARCHAR,
language_iso_639_1 VARCHAR
);
COPY INTO Job
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/job_names.csv'
FILE_FORMAT = (
TYPE = CSV
SKIP_HEADER = 1
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);
CREATE TABLE CastMember(
movie_id INT REFERENCES Movie(id),
person_id INT REFERENCES Person(id),
job_id INT REFERENCES Job(job_id),
role VARCHAR,
position INT,
PRIMARY KEY (movie_id, person_id, job_id)
);
COPY INTO CastMember
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_casts.csv'
FILE_FORMAT = (
TYPE = CSV
SKIP_HEADER = 1
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);
--------------------------------------
-- STEP 2: CREATE A RELATIONSHIP TABLE
--------------------------------------
-- Time: ~5s
CREATE TABLE collaborated_with AS (
SELECT
c1.person_id AS person1_id,
c2.person_id AS person2_id
FROM CastMember c1
JOIN CastMember c2 ON c1.movie_id = c2.movie_id
WHERE c1.person_id < c2.person_id -- To avoid duplicate people
);
----------------------------------------
-- STEP 3: SELECT RELATIONALAI RESOURCES
----------------------------------------
-- Time: ~3s
CALL RAI.use_rai_database('<rai_db_name>');
/*+-------------------+
| USE_RAI_DATABASE |
+-------------------+
| <rai_db_name> |
+-------------------+ */
CALL RAI.use_rai_engine('<rai_db_name>');
/*+-------------------+
| USE_RAI_ENGINE |
+-------------------+
| <rai_engine_name> |
+-------------------+ */
-------------------------------
-- STEP 4: CREATE A DATA STREAM
-------------------------------
-- Time: ~2m
CALL RAI.create_data_stream('collaborated_with');
/*+------------------------------------------------+
| { "account": "***", ..., "state": "CREATED" } |
+------------------------------------------------+ */
CALL RAI.get_data_stream_status('collaborated_with');
/*+--------------------+------------------------------------------------+
| PROPERTY | VALUE |
+--------------------+------------------------------------------------+
| Source | "<sf_db_name>.<schema_name>.collaborated_with" |
| DB Link | "<sf_db_name>.rai" |
| Integration | "<integration_name>" |
| Data sync status | "Syncing: pending unloading from SF" |
| Data stream health | "Healthy" |
| ... | ... |
+--------------------+------------------------------------------------+ */
-------------------------
-- STEP 5: CREATE A GRAPH
-------------------------
-- Time: ~35s
CALL RAI.create_graph('collaboration_graph', 'collaborated_with');
/*+------------------------------------------------------------------------------------------------+
| {"message":"Graph 'collaboration_graph' created in database '<rai_db_name>'.", "success":true} |
+------------------------------------------------------------------------------------------------+ */
---------------------------
-- STEP 6: COMPUTE PAGERANK
---------------------------
-- Time: ~1m 30s
CREATE TABLE influence AS (
SELECT * FROM TABLE(RAI.pagerank('collaboration_graph'))
);
SELECT * FROM influence LIMIT 5;
/*+------+------------------+
| NODE | VALUE |
|------+------------------+
| 1 | 0.0000289266722 |
| 2 | 0.00004992577285 |
| 3 | 0.00003116103839 |
| 4 | 0.00002542779375 |
| 5 | 0.00003826305083 |
+------+------------------+ */
-------------------------------
-- STEP 7: FIND THE INFLUENCERS
-------------------------------
-- Time: ~0.5s
SELECT Person.name, influence.value
FROM Person, influence
WHERE Person.id = influence.node
ORDER BY influence.value DESC
LIMIT 10;
/*+----------------------+------------------+
| NAME | VALUE |
|----------------------+------------------+
| Christine Westermann | 0.0003306634481 |
| Götz Alsmann | 0.0003304435814 |
| Roger Corman | 0.00003116103839 |
| Godfrey Ho | 0.0001110573419 |
| Lloyd Kaufman | 0.0001055304991 |
| James Burrows | 0.00009991588724 |
| Udo Kier | 0.00009568671509 |
| Ennio Morricone | 0.00009510038097 |
| Uwe Karpa | 0.00009370380723 |
| Jackie Chan | 0.00009297487171 |
+----------------------+------------------+ */
---------------------
-- CLEANUP (OPTIONAL)
---------------------
-- It is possible to create multiple graphs from the same data stream,
-- such as a directed and undirected version of the same graph.
-- You should delete every graph attached to a data stream before deleting
-- the data stream. You can delete graphs using the delete_graph prodedure,
-- and delete data streams using the delete_data_stream procedure.
CALL RAI.delete_graph('collaboration_graph');
/*+------------------------------------------------------------------------------------------------------+
| {"message":"Graph 'collaboration_graph' removed from RAI database '<rai_db_name>'.", "success":true} |
+------------------------------------------------------------------------------------------------------+ */
CALL RAI.delete_data_stream('collaborated_with');
/*+---------------------+
| DELETE_DATA_STREAM |
+---------------------+
| "ok" |
+---------------------+ */