Skip to content
Tutorials
Getting Started With Graph Analytics

Getting Started With Graph Analytics

In this tutorial, you’ll learn the basic workflow for doing graph analytics in Snowflake with RelationalAI. You will learn how to run a graph algorithm on data from the Open Movie Database (OMDB) to answer the question “Who are the most influential people in the movie business?”

Goal

After completing this tutorial, you’ll know how to:

  • Model entity relationships as a graph.
  • Describe graphs using a relationship table.
  • Use RelationalAI to perform graph algorithms in Snowflake.

Prerequisites

You’ll need access to a Snowflake account with the RelationalAI Integration installed. This is required. Before reading this tutorial, you should complete the steps in the User Quick Start.

Scenario

How do you measure a person’s influence in the movie industry? You might start by counting how many other people that person has worked with. To be more accurate, though, you should consider the influence of those other people. That is, someone who has worked with many influential people should be more influential than someone who has worked with lots of less influential people.

Keeping track of who has worked with who, and how each person’s rank affects the rank of others, is a challenge. It’s easier to grasp the situation as a graph, like the one illustrated below, which represents people in the movie industry as nodes — circles in the diagram — and connects two people together with an edge — a line in the diagram — if they have collaborated with each other in a movie. Since this particular graph represents people who have collaborated with each other, it is called a collaboration graph.

For example, here’s a small portion of the nodes and edges in the OMDB collaboration graph:

One popular graph algorithm for measuring the influence of a node in a collaboration graph is the PageRank algorithm, which assigns to each node a value between 0 and 1 that takes into account the number of and PageRank values of adjacent nodes. The higher the PageRank value, the more influential the node. Many important graph algorithms are difficult, if not impossible, to implement in pure SQL. PageRank, for instance, requires advanced programming techniques, like recursion, and specialized scripting capabilities.

In this tutorial, you’ll use RelationalAI’s SQL Graph Analytics Library to create a collaboration graph using data in the Open Movie Database (OMDB) and run the PageRank algorithm on the graph to determine the most influential people in the database.

Steps

This tutorial is meant to be completed in a Snowflake SQL Worksheet. Before you continue, open a new worksheet and select a database and user role with access to the RelationalAI Snowflake Integration.

🔎

The complete SQL worksheet that accompanies this tutorial is available at the bottom of this page.

Load the OMDB Data Set

The OMDB contains tables for people, movies, and casting information with the schema described in the follwing entity relationship (ER) diagram:

The data for each entity is contained in four CSV files:

  • all_people.csv
  • all_movies.csv
  • job_names.csv
  • all_casts.csv

To load the all_people.csv file, for instance, run the following SQL in your worksheet:

USE DATABASE <sf_db_name>;
USE SCHEMA <schema_name>;
 
CREATE TABLE Person(
    id INT PRIMARY KEY,
    name VARCHAR,
    birthday DATE,
    deathday DATE,
    gender INT
);
 
COPY INTO Person
    FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_people.csv'
    FILE_FORMAT = (
        TYPE = CSV
        SKIP_HEADER = 1
        DATE_FORMAT = 'YYYY-MM-DD'
        EMPTY_FIELD_AS_NULL = TRUE
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        ESCAPE = '\\'
    );

You can load the remaining CSV files in a similar fashion. See the complete worksheet for full details.

The CSV files used in this tutorial were downloaded from the Open Movie Database (opens in a new tab) on October 6, 2023 and are used under the CC BY 2.0 DE (opens in a new tab) license. Small formatting changes were made.

Create a Relationship Table

Next, execute the following to create a table named collaborated_with containing pairs of Person IDs corresponding to people who have been cast together in the same movie:

CREATE TABLE collaborated_with AS (
    SELECT
        c1.person_id AS person1_id,
        c2.person_id AS person2_id
    FROM CastMember c1
    JOIN CastMember c2 ON c1.movie_id = c2.movie_id
    WHERE c1.person_id < c2.person_id -- To avoid duplicate people
);

The collaborated_with table describes the relationships between people in the OMDB collaboration graph. Each row corresponds to an edge between two people in the graph.

Select RelationalAI Resources

Graphs are stored in RelationalAI databases, and RelationalAI engines supply the computational power for executing graph algorithms. So, before you can create a graph and run an algorithm, you must specify which RelationalAI resources to use.

🔎

For the best experience, use an engine of size S or greater for this tutorial. RelationalAI resources may only be created by integration administrators. See Integration Management: RAI Resources for more information.

You can select RelationalAI resources with the use_rai_database and use_rai_engine procedures:

CALL RAI.use_rai_database('<rai_db_name>');
/*+-------------------+
  | USE_RAI_DATABASE  |
  +-------------------+
  | <rai_db_name>     |
  +-------------------+ */
 
CALL RAI.use_rai_engine('<rai_engine_name>');
/*+-------------------+
  | USE_RAI_ENGINE    |
  +-------------------+
  | <rai_engine_name> |
  +-------------------+ */

Create a Data Stream

Data streams are used to synchronize data in a Snowflake table with a RelationalAI database so that graph algorithms, and other kinds of workloads, may be performed on the data.

Use the create_data_stream procedure to initiate this data synchronization:

CALL RAI.create_data_stream('collaborated_with');
/*+------------------------------------------------+
  | { "account": "***",  ..., "state": "CREATED" } |
  +------------------------------------------------+ */

By default, data streams have the same name as the table from which they are created.

The create_data_stream procedure finishes once the data stream has been created. However, this does not mean that the data stream is ready to use. You must wait until the data stream has finished syncing. This may take up to two minutes.

You may check the status of your data stream using the get_data_stream_status procedure:

CALL RAI.get_data_stream_status('collaborated_with');
/*+--------------------+------------------------------------------------+
  | PROPERTY           | VALUE                                          |
  +--------------------+------------------------------------------------+
  | Source             | "<sf_db_name>.<schema_name>.collaborated_with" |
  | DB Link            | "<sf_db_name>.rai"                             |
  | Integration        | "<integration_name>"                           |
  | Data sync status   | "Syncing: pending unloading from SF"           | 
  | Data stream health | "Healthy"                                      |
  | ...                | ...                                            |
  +--------------------+------------------------------------------------+ */
🔎

When Data sync status says Fully synced, your data stream is ready and you may proceed to the next step.

Create a Graph

Now you can create a graph in RelationalAI from the collaborated_with data stream using the create_graph procedure:

CALL RAI.create_graph('collaboration_graph', 'collaborated_with');
/*+------------------------------------------------------------------------------------------------+
  | {"message":"Graph 'collaboration_graph' created in database '<rai_db_name>'.", "success":true} |
  +------------------------------------------------------------------------------------------------+  */

Compute PageRank

To compute the PageRank for each node in the collaboration graph, call the pagerank function:

CREATE TABLE influence AS (
    SELECT * FROM TABLE(RAI.pagerank('collaboration_graph'))
);
🔎

The collaboration_graph has over 1 million nodes and almost 9 million edges. Computing the PageRank will take a minute or two.

In this example, the PageRank values are computed in RelationalAI and stored in a Snowflake table named influence. The pagerank function returns a table with two columns:

  • node, containing the IDs of nodes in the graph.
  • value, containing the PageRank value of each node.

You can see this by inspecting the first few rows of the influence table:

SELECT * FROM influence LIMIT 5;
/*+------+------------------+
  | NODE | VALUE            |
  |------+------------------+
  | 1    | 0.0000289266722  |
  | 2    | 0.00004992577285 |
  | 3    | 0.00003116103839 |
  | 4    | 0.00002542779375 |
  | 5    | 0.00003826305083 |
  +------+------------------+ */

Find the Influencers

Node IDs in the influence table correspond to Person IDs in the OMDB data set. So, you can find the names of the most influential people in the OMDB by matching people to node IDs in the influence table and sorting the results by the value column:

SELECT Person.name, influence.value
FROM Person, influence
WHERE Person.id = influence.node
ORDER BY influence.value DESC
LIMIT 10;
/*+----------------------+------------------+
  | NAME                 | VALUE            |
  |----------------------+------------------+
  | Christine Westermann | 0.0003306634481  |
  | Götz Alsmann         | 0.0003304435814  |
  | Roger Corman         | 0.00003116103839 |
  | Godfrey Ho           | 0.0001110573419  |
  | Lloyd Kaufman        | 0.0001055304991  |
  | James Burrows        | 0.00009991588724 |
  | Udo Kier             | 0.00009568671509 |
  | Ennio Morricone      | 0.00009510038097 |
  | Uwe Karpa            | 0.00009370380723 |
  | Jackie Chan          | 0.00009297487171 |
  +----------------------+------------------+ */

The output includes influential people from across the international film industry, such as directors Roger Corman (opens in a new tab) and Godfrey Ho (opens in a new tab), the composer Ennio Morricone (opens in a new tab), and the actor Jackie Chan (opens in a new tab).

Summary

Graph analytics can help you identify important people in a network based on the relationships in that network. In this tutorial, you saw this in action by using the PageRank algorithm from RelationalAI’s SQL Graph Analytics Library to identify influential people in the OMDB database. All without leaving a Snowflake SQL Worksheet!

In particular, you learned how to:

  • Model relationships in a data set using a relationship table.
  • Synchronize your data between Snowflake and RelationalAI using the create_data_stream procedure.
  • Create a graph in RelationalAI from a data stream using the create_graph procedure.
  • Compute the PageRank of each node in a graph using the pagerank function.
  • Match results from pagerank to entities in the data.

This process is the same for each new graph you create. And since data streams keep graphs in sync with their Snowflake source tables, every time you run an algorithm, the results will reflect the latest data.

Next Steps

In this tutorial, every node in the collaboration graph represents the same type of entity: a person. In the Schema Mapping Tutorial, you’ll learn how to create graphs involving multiple types of entities.

Check out the Graph Analytics Overview to see every algorithm available in RelationalAI’s SQL Graph Analytics Library. To learn more about the RAI Integration Services for Snowflake, see Working With the Integration.

Full Worksheet

Copy and paste the following code into a Snowflake worksheet to follow along with the tutorial.

-----------------------------
-- STEP 1: LOAD OMDB DATA SET
-----------------------------
 
-- Time: ~1m
 
USE DATABASE <sf_db_name>;
USE SCHEMA <schema_name>;
 
CREATE TABLE Person(
    id INT PRIMARY KEY,
    name VARCHAR,
    birthday DATE,
    deathday DATE,
    gender INT
);
 
COPY INTO Person
    FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_people.csv'
    FILE_FORMAT = (
        TYPE = CSV
        SKIP_HEADER = 1
        DATE_FORMAT = 'YYYY-MM-DD'
        EMPTY_FIELD_AS_NULL = TRUE
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        ESCAPE = '\\'
    );
 
CREATE TABLE Movie(
    id INT PRIMARY KEY,
    name VARCHAR,
    parent_id INT,
    date date
);
 
COPY INTO Movie
    FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_movies.csv'
    FILE_FORMAT = (
        TYPE = CSV
        SKIP_HEADER = 1
        DATE_FORMAT = 'YYYY-MM-DD'
        EMPTY_FIELD_AS_NULL = TRUE
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        ESCAPE = '\\'
    );
 
CREATE TABLE Job(
    job_id INT PRIMARY KEY,
    name VARCHAR,
    language_iso_639_1 VARCHAR
);
 
COPY INTO Job
    FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/job_names.csv'
    FILE_FORMAT = (
        TYPE = CSV
        SKIP_HEADER = 1
        DATE_FORMAT = 'YYYY-MM-DD'
        EMPTY_FIELD_AS_NULL = TRUE
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        ESCAPE = '\\'
    );
 
CREATE TABLE CastMember(
    movie_id INT REFERENCES Movie(id),
    person_id INT REFERENCES Person(id),
    job_id INT REFERENCES Job(job_id),
    role VARCHAR,
    position INT,
    PRIMARY KEY (movie_id, person_id, job_id)
);
 
COPY INTO CastMember
    FROM 'azure://raidocs.blob.core.windows.net/datasets/omdb/all_casts.csv'
    FILE_FORMAT = (
        TYPE = CSV
        SKIP_HEADER = 1
        DATE_FORMAT = 'YYYY-MM-DD'
        EMPTY_FIELD_AS_NULL = TRUE
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        ESCAPE = '\\'
    );
 
 
--------------------------------------
-- STEP 2: CREATE A RELATIONSHIP TABLE
--------------------------------------
 
-- Time: ~5s
 
CREATE TABLE collaborated_with AS (
    SELECT
        c1.person_id AS person1_id,
        c2.person_id AS person2_id
    FROM CastMember c1
    JOIN CastMember c2 ON c1.movie_id = c2.movie_id
    WHERE c1.person_id < c2.person_id -- To avoid duplicate people
);
 
 
----------------------------------------
-- STEP 3: SELECT RELATIONALAI RESOURCES
----------------------------------------
 
-- Time:  ~3s
 
CALL RAI.use_rai_database('<rai_db_name>');
/*+-------------------+
  | USE_RAI_DATABASE  |
  +-------------------+
  | <rai_db_name>     |
  +-------------------+ */
 
CALL RAI.use_rai_engine('<rai_db_name>');
/*+-------------------+
  | USE_RAI_ENGINE    |
  +-------------------+
  | <rai_engine_name> |
  +-------------------+ */
 
 
-------------------------------
-- STEP 4: CREATE A DATA STREAM
-------------------------------
 
-- Time:  ~2m
 
CALL RAI.create_data_stream('collaborated_with');
/*+------------------------------------------------+
  | { "account": "***",  ..., "state": "CREATED" } |
  +------------------------------------------------+ */
 
CALL RAI.get_data_stream_status('collaborated_with');
/*+--------------------+------------------------------------------------+
  | PROPERTY           | VALUE                                          |
  +--------------------+------------------------------------------------+
  | Source             | "<sf_db_name>.<schema_name>.collaborated_with" |
  | DB Link            | "<sf_db_name>.rai"                             |
  | Integration        | "<integration_name>"                           |
  | Data sync status   | "Syncing: pending unloading from SF"           | 
  | Data stream health | "Healthy"                                      |
  | ...                | ...                                            |
  +--------------------+------------------------------------------------+ */
 
 
-------------------------
-- STEP 5: CREATE A GRAPH
-------------------------
 
-- Time: ~35s
 
CALL RAI.create_graph('collaboration_graph', 'collaborated_with');
/*+------------------------------------------------------------------------------------------------+
  | {"message":"Graph 'collaboration_graph' created in database '<rai_db_name>'.", "success":true} |
  +------------------------------------------------------------------------------------------------+  */
 
 
---------------------------
-- STEP 6: COMPUTE PAGERANK
---------------------------
 
-- Time: ~1m 30s
 
CREATE TABLE influence AS (
    SELECT * FROM TABLE(RAI.pagerank('collaboration_graph'))
);
 
SELECT * FROM influence LIMIT 5;
/*+------+------------------+
  | NODE | VALUE            |
  |------+------------------+
  | 1    | 0.0000289266722  |
  | 2    | 0.00004992577285 |
  | 3    | 0.00003116103839 |
  | 4    | 0.00002542779375 |
  | 5    | 0.00003826305083 |
  +------+------------------+ */
 
 
-------------------------------
-- STEP 7: FIND THE INFLUENCERS
-------------------------------
 
-- Time: ~0.5s
 
SELECT Person.name, influence.value
FROM Person, influence
WHERE Person.id = influence.node
ORDER BY influence.value DESC
LIMIT 10;
/*+----------------------+------------------+
  | NAME                 | VALUE            |
  |----------------------+------------------+
  | Christine Westermann | 0.0003306634481  |
  | Götz Alsmann         | 0.0003304435814  |
  | Roger Corman         | 0.00003116103839 |
  | Godfrey Ho           | 0.0001110573419  |
  | Lloyd Kaufman        | 0.0001055304991  |
  | James Burrows        | 0.00009991588724 |
  | Udo Kier             | 0.00009568671509 |
  | Ennio Morricone      | 0.00009510038097 |
  | Uwe Karpa            | 0.00009370380723 |
  | Jackie Chan          | 0.00009297487171 |
  +----------------------+------------------+ */
 
 
---------------------
-- CLEANUP (OPTIONAL)
---------------------
 
-- It is possible to create multiple graphs from the same data stream,
-- such as a directed and undirected version of the same graph.
-- You should delete every graph attached to a data stream before deleting
-- the data stream. You can delete graphs using the delete_graph prodedure,
-- and delete data streams using the delete_data_stream procedure.
CALL RAI.delete_graph('collaboration_graph');
/*+------------------------------------------------------------------------------------------------------+
  | {"message":"Graph 'collaboration_graph' removed from RAI database '<rai_db_name>'.", "success":true} |
  +------------------------------------------------------------------------------------------------------+  */
 
CALL RAI.delete_data_stream('collaborated_with');
/*+---------------------+
  | DELETE_DATA_STREAM  |
  +---------------------+
  | "ok"                |
  +---------------------+ */
 
 
Was this doc helpful?