Tutorials
Getting Started With Graph Analytics

# Getting Started With Graph Analytics

In this tutorial, youâ€™ll learn the basic workflow for doing graph analytics in Snowflake with RelationalAI. You will learn how to run a graph algorithm on data from the Open Movie Database (OMDB) to answer the question â€śWho are the most influential people in the movie business?â€ť

## Goal

After completing this tutorial, youâ€™ll know how to:

• Model entity relationships as a graph.
• Describe graphs using a relationship table.
• Use RelationalAI to perform graph algorithms in Snowflake.

## Prerequisites

Youâ€™ll need access to a Snowflake account with the RelationalAI Integration installed. This is required. Before reading this tutorial, you should complete the steps in the User Quick Start.

## Scenario

How do you measure a personâ€™s influence in the movie industry? You might start by counting how many other people that person has worked with. To be more accurate, though, you should consider the influence of those other people. That is, someone who has worked with many influential people should be more influential than someone who has worked with lots of less influential people.

Keeping track of who has worked with who, and how each personâ€™s rank affects the rank of others, is a challenge. Itâ€™s easier to grasp the situation as a graph, like the one illustrated below, which represents people in the movie industry as nodes â€” circles in the diagram â€” and connects two people together with an edge â€” a line in the diagram â€” if they have collaborated with each other in a movie. Since this particular graph represents people who have collaborated with each other, it is called a collaboration graph.

For example, hereâ€™s a small portion of the nodes and edges in the OMDB collaboration graph:

One popular graph algorithm for measuring the influence of a node in a collaboration graph is the PageRank algorithm, which assigns to each node a value between 0 and 1 that takes into account the number of and PageRank values of adjacent nodes. The higher the PageRank value, the more influential the node. Many important graph algorithms are difficult, if not impossible, to implement in pure SQL. PageRank, for instance, requires advanced programming techniques, like recursion, and specialized scripting capabilities.

In this tutorial, youâ€™ll use RelationalAIâ€™s SQL Graph Analytics Library to create a collaboration graph using data in the Open Movie Database (OMDB) and run the PageRank algorithm on the graph to determine the most influential people in the database.

## Steps

This tutorial is meant to be completed in a Snowflake SQL Worksheet. Before you continue, open a new worksheet and select a database and user role with access to the RelationalAI Snowflake Integration.

đź”Ž

The complete SQL worksheet that accompanies this tutorial is available at the bottom of this page.

### Load the OMDB Data Set

The OMDB contains tables for people, movies, and casting information with the schema described in the follwing entity relationship (ER) diagram:

The data for each entity is contained in four CSV files:

• `all_people.csv`
• `all_movies.csv`
• `job_names.csv`
• `all_casts.csv`

To load the `all_people.csv` file, for instance, run the following SQL in your worksheet:

``````USE DATABASE <sf_db_name>;
USE SCHEMA <schema_name>;

CREATE TABLE Person(
id INT PRIMARY KEY,
name VARCHAR,
birthday DATE,
deathday DATE,
gender INT
);

COPY INTO Person
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_people.csv'
FILE_FORMAT = (
TYPE = CSV
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);``````

You can load the remaining CSV files in a similar fashion. See the complete worksheet for full details.

The CSV files used in this tutorial were downloaded from the Open Movie Database (opens in a new tab) on October 6, 2023 and are used under the CC BY 2.0 DE (opens in a new tab) license. Small formatting changes were made.

### Create a Relationship Table

Next, execute the following to create a table named `collaborated_with` containing pairs of `Person` IDs corresponding to people who have been cast together in the same movie:

``````CREATE TABLE collaborated_with AS (
SELECT
c1.person_id AS person1_id,
c2.person_id AS person2_id
FROM CastMember c1
JOIN CastMember c2 ON c1.movie_id = c2.movie_id
WHERE c1.person_id < c2.person_id -- To avoid duplicate people
);``````

The `collaborated_with` table describes the relationships between people in the OMDB collaboration graph. Each row corresponds to an edge between two people in the graph.

### Select RelationalAI Resources

Graphs are stored in RelationalAI databases, and RelationalAI engines supply the computational power for executing graph algorithms. So, before you can create a graph and run an algorithm, you must specify which RelationalAI resources to use.

đź”Ž

For the best experience, use an engine of size `S` or greater for this tutorial. RelationalAI resources may only be created by integration administrators. See Integration Management: RAI Resources for more information.

You can select RelationalAI resources with the `use_rai_database` and `use_rai_engine` procedures:

``````CALL RAI.use_rai_database('<rai_db_name>');
/*+-------------------+
| USE_RAI_DATABASE  |
+-------------------+
| <rai_db_name>     |
+-------------------+ */

CALL RAI.use_rai_engine('<rai_engine_name>');
/*+-------------------+
| USE_RAI_ENGINE    |
+-------------------+
| <rai_engine_name> |
+-------------------+ */``````

### Create a Data Stream

Data streams are used to synchronize data in a Snowflake table with a RelationalAI database so that graph algorithms, and other kinds of workloads, may be performed on the data.

Use the `create_data_stream` procedure to initiate this data synchronization:

``````CALL RAI.create_data_stream('collaborated_with');
/*+------------------------------------------------+
| { "account": "***",  ..., "state": "CREATED" } |
+------------------------------------------------+ */``````

By default, data streams have the same name as the table from which they are created.

âš

The `create_data_stream` procedure finishes once the data stream has been created. However, this does not mean that the data stream is ready to use. You must wait until the data stream has finished syncing. This may take up to two minutes.

You may check the status of your data stream using the `get_data_stream_status` procedure:

``````CALL RAI.get_data_stream_status('collaborated_with');
/*+--------------------+------------------------------------------------+
| PROPERTY           | VALUE                                          |
+--------------------+------------------------------------------------+
| Source             | "<sf_db_name>.<schema_name>.collaborated_with" |
| DB Link            | "<sf_db_name>.rai"                             |
| Integration        | "<integration_name>"                           |
| Data stream health | "Healthy"                                      |
| ...                | ...                                            |
+--------------------+------------------------------------------------+ */``````
đź”Ž

When Data sync status says Fully synced, your data stream is ready and you may proceed to the next step.

### Create a Graph

Now you can create a graph in RelationalAI from the `collaborated_with` data stream using the `create_graph` procedure:

``````CALL RAI.create_graph('collaboration_graph', 'collaborated_with');
/*+------------------------------------------------------------------------------------------------+
| {"message":"Graph 'collaboration_graph' created in database '<rai_db_name>'.", "success":true} |
+------------------------------------------------------------------------------------------------+  */``````

### Compute PageRank

To compute the PageRank for each node in the collaboration graph, call the `pagerank` function:

``````CREATE TABLE influence AS (
SELECT * FROM TABLE(RAI.pagerank('collaboration_graph'))
);``````
đź”Ž

The `collaboration_graph` has over 1 million nodes and almost 9 million edges. Computing the PageRank will take a minute or two.

In this example, the PageRank values are computed in RelationalAI and stored in a Snowflake table named `influence`. The `pagerank` function returns a table with two columns:

• `node`, containing the IDs of nodes in the graph.
• `value`, containing the PageRank value of each node.

You can see this by inspecting the first few rows of the `influence` table:

``````SELECT * FROM influence LIMIT 5;
/*+------+------------------+
| NODE | VALUE            |
|------+------------------+
| 1    | 0.0000289266722  |
| 2    | 0.00004992577285 |
| 3    | 0.00003116103839 |
| 4    | 0.00002542779375 |
| 5    | 0.00003826305083 |
+------+------------------+ */``````

### Find the Influencers

Node IDs in the `influence` table correspond to `Person` IDs in the OMDB data set. So, you can find the names of the most influential people in the OMDB by matching people to node IDs in the `influence` table and sorting the results by the `value` column:

``````SELECT Person.name, influence.value
FROM Person, influence
WHERE Person.id = influence.node
ORDER BY influence.value DESC
LIMIT 10;
/*+----------------------+------------------+
| NAME                 | VALUE            |
|----------------------+------------------+
| Christine Westermann | 0.0003306634481  |
| GĂ¶tz Alsmann         | 0.0003304435814  |
| Roger Corman         | 0.00003116103839 |
| Godfrey Ho           | 0.0001110573419  |
| Lloyd Kaufman        | 0.0001055304991  |
| James Burrows        | 0.00009991588724 |
| Udo Kier             | 0.00009568671509 |
| Ennio Morricone      | 0.00009510038097 |
| Uwe Karpa            | 0.00009370380723 |
| Jackie Chan          | 0.00009297487171 |
+----------------------+------------------+ */``````

The output includes influential people from across the international film industry, such as directors Roger Corman (opens in a new tab) and Godfrey Ho (opens in a new tab), the composer Ennio Morricone (opens in a new tab), and the actor Jackie Chan (opens in a new tab).

## Summary

Graph analytics can help you identify important people in a network based on the relationships in that network. In this tutorial, you saw this in action by using the PageRank algorithm from RelationalAIâ€™s SQL Graph Analytics Library to identify influential people in the OMDB database. All without leaving a Snowflake SQL Worksheet!

In particular, you learned how to:

• Model relationships in a data set using a relationship table.
• Synchronize your data between Snowflake and RelationalAI using the `create_data_stream` procedure.
• Create a graph in RelationalAI from a data stream using the `create_graph` procedure.
• Compute the PageRank of each node in a graph using the `pagerank` function.
• Match results from `pagerank` to entities in the data.

This process is the same for each new graph you create. And since data streams keep graphs in sync with their Snowflake source tables, every time you run an algorithm, the results will reflect the latest data.

## Next Steps

In this tutorial, every node in the collaboration graph represents the same type of entity: a person. In the Schema Mapping Tutorial, youâ€™ll learn how to create graphs involving multiple types of entities.

Check out the Graph Analytics Overview to see every algorithm available in RelationalAIâ€™s SQL Graph Analytics Library. To learn more about the RAI Integration Services for Snowflake, see Working With the Integration.

## Full Worksheet

Copy and paste the following code into a Snowflake worksheet to follow along with the tutorial.

``````-----------------------------
-- STEP 1: LOAD OMDB DATA SET
-----------------------------

-- Time: ~1m

USE DATABASE <sf_db_name>;
USE SCHEMA <schema_name>;

CREATE TABLE Person(
id INT PRIMARY KEY,
name VARCHAR,
birthday DATE,
deathday DATE,
gender INT
);

COPY INTO Person
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_people.csv'
FILE_FORMAT = (
TYPE = CSV
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);

CREATE TABLE Movie(
id INT PRIMARY KEY,
name VARCHAR,
parent_id INT,
date date
);

COPY INTO Movie
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_movies.csv'
FILE_FORMAT = (
TYPE = CSV
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);

CREATE TABLE Job(
job_id INT PRIMARY KEY,
name VARCHAR,
language_iso_639_1 VARCHAR
);

COPY INTO Job
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/job_names.csv'
FILE_FORMAT = (
TYPE = CSV
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);

CREATE TABLE CastMember(
movie_id INT REFERENCES Movie(id),
person_id INT REFERENCES Person(id),
job_id INT REFERENCES Job(job_id),
role VARCHAR,
position INT,
PRIMARY KEY (movie_id, person_id, job_id)
);

COPY INTO CastMember
FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_casts.csv'
FILE_FORMAT = (
TYPE = CSV
DATE_FORMAT = 'YYYY-MM-DD'
EMPTY_FIELD_AS_NULL = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE = '\\'
);

--------------------------------------
-- STEP 2: CREATE A RELATIONSHIP TABLE
--------------------------------------

-- Time: ~5s

CREATE TABLE collaborated_with AS (
SELECT
c1.person_id AS person1_id,
c2.person_id AS person2_id
FROM CastMember c1
JOIN CastMember c2 ON c1.movie_id = c2.movie_id
WHERE c1.person_id < c2.person_id -- To avoid duplicate people
);

----------------------------------------
-- STEP 3: SELECT RELATIONALAI RESOURCES
----------------------------------------

-- Time:  ~3s

CALL RAI.use_rai_database('<rai_db_name>');
/*+-------------------+
| USE_RAI_DATABASE  |
+-------------------+
| <rai_db_name>     |
+-------------------+ */

CALL RAI.use_rai_engine('<rai_db_name>');
/*+-------------------+
| USE_RAI_ENGINE    |
+-------------------+
| <rai_engine_name> |
+-------------------+ */

-------------------------------
-- STEP 4: CREATE A DATA STREAM
-------------------------------

-- Time:  ~2m

CALL RAI.create_data_stream('collaborated_with');
/*+------------------------------------------------+
| { "account": "***",  ..., "state": "CREATED" } |
+------------------------------------------------+ */

CALL RAI.get_data_stream_status('collaborated_with');
/*+--------------------+------------------------------------------------+
| PROPERTY           | VALUE                                          |
+--------------------+------------------------------------------------+
| Source             | "<sf_db_name>.<schema_name>.collaborated_with" |
| DB Link            | "<sf_db_name>.rai"                             |
| Integration        | "<integration_name>"                           |
| Data stream health | "Healthy"                                      |
| ...                | ...                                            |
+--------------------+------------------------------------------------+ */

-------------------------
-- STEP 5: CREATE A GRAPH
-------------------------

-- Time: ~35s

CALL RAI.create_graph('collaboration_graph', 'collaborated_with');
/*+------------------------------------------------------------------------------------------------+
| {"message":"Graph 'collaboration_graph' created in database '<rai_db_name>'.", "success":true} |
+------------------------------------------------------------------------------------------------+  */

---------------------------
-- STEP 6: COMPUTE PAGERANK
---------------------------

-- Time: ~1m 30s

CREATE TABLE influence AS (
SELECT * FROM TABLE(RAI.pagerank('collaboration_graph'))
);

SELECT * FROM influence LIMIT 5;
/*+------+------------------+
| NODE | VALUE            |
|------+------------------+
| 1    | 0.0000289266722  |
| 2    | 0.00004992577285 |
| 3    | 0.00003116103839 |
| 4    | 0.00002542779375 |
| 5    | 0.00003826305083 |
+------+------------------+ */

-------------------------------
-- STEP 7: FIND THE INFLUENCERS
-------------------------------

-- Time: ~0.5s

SELECT Person.name, influence.value
FROM Person, influence
WHERE Person.id = influence.node
ORDER BY influence.value DESC
LIMIT 10;
/*+----------------------+------------------+
| NAME                 | VALUE            |
|----------------------+------------------+
| Christine Westermann | 0.0003306634481  |
| GĂ¶tz Alsmann         | 0.0003304435814  |
| Roger Corman         | 0.00003116103839 |
| Godfrey Ho           | 0.0001110573419  |
| Lloyd Kaufman        | 0.0001055304991  |
| James Burrows        | 0.00009991588724 |
| Udo Kier             | 0.00009568671509 |
| Ennio Morricone      | 0.00009510038097 |
| Uwe Karpa            | 0.00009370380723 |
| Jackie Chan          | 0.00009297487171 |
+----------------------+------------------+ */

---------------------
-- CLEANUP (OPTIONAL)
---------------------

-- It is possible to create multiple graphs from the same data stream,
-- such as a directed and undirected version of the same graph.
-- You should delete every graph attached to a data stream before deleting
-- the data stream. You can delete graphs using the delete_graph prodedure,
-- and delete data streams using the delete_data_stream procedure.
CALL RAI.delete_graph('collaboration_graph');
/*+------------------------------------------------------------------------------------------------------+
| {"message":"Graph 'collaboration_graph' removed from RAI database '<rai_db_name>'.", "success":true} |
+------------------------------------------------------------------------------------------------------+  */

CALL RAI.delete_data_stream('collaborated_with');
/*+---------------------+
| DELETE_DATA_STREAM  |
+---------------------+
| "ok"                |
+---------------------+ */

``````