Working With Algorithms

Goal

This guide demonstrates how to execute graph algorithms and work with their results using RelationalAI’s SQL Graph Library for Snowflake.

Introduction

After you create a graph, you may execute algorithms on that graph using the various functions described in the Overview of Graph Algorithms. This guide explains how to:

You can find several examples of the patterns in practice in the Examples section, including how to:

Executing Algorithms

When you call an algorithm function from RelationalAI’s SQL Graph Library, all of the computation happens in RAI — not in Snowflake. The function blocks until the results of the algorithm are returned to Snowflake from RAI.

This section describes the concepts associated with running graph algorithms in RAI from Snowflake.

RAI Engines

Executing algorithms on a graph requires a RAI engine. You can either set the engine in a RAI context with the use_rai_engine procedure, or provide the name of the engine to the function’s arguments parameter:

-- Set the RAI context.
CALL RAI.use_rai_database('my_rai_db');
CALL RAI.use_rai_engine('my_rai_engine');
 
-- Create an undirected graph with three nodes and two edges.
CREATE TABLE my_edges(x INT, y INT) AS SELECT * FROM VALUES (1, 2), (2, 3);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Get the number of nodes in `'my_graph'`.
-- The engine `'my_rai_engine'` from the RAI context is used,
-- since no engine name is explicitly provided.
SELECT RAI.num_nodes('my_graph')
/*+---+
  | 3 |
  +---+  */
 
-- Get the number of edges in `'my_graph'` using an engine that is different
-- from the engine set in the RAI context. The engine name `'my_other_rai_engine'`
-- is explicitly provided to the function.
SELECT RAI.num_edges('my_graph', {'rai_engine': 'my_other_rai_engine'});
/*+---+
  | 2 |
  +---+  */

If no engine with the provided name exists, an error is returned and the algorithm is not executed. See Required Versus Optional Arguments for more information on function arguments.

Scalar Versus Tabular Functions

Some algorithms return a scalar value — such as min_degree, which returns the minimum degree of a graph. Other functions return a table — such as degree, which returns the degree of each node. Tabular functions must be called with SELECT * FROM TABLE(...), as in the following example:

-- Create an undirected graph with three nodes and two edges.
CREATE TABLE my_edges(x INT, y INT) AS SELECT * FROM VALUES (1, 2), (2, 3);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Compute the minimum degree of `'my_graph'`.
-- The return value is a scalar.
SELECT RAI.min_degree('my_graph');
/*+---+
  | 1 |
  +---+  */
 
-- Find the degree of each node in `'my_graph'`.
-- The return value is a table.
SELECT * FROM TABLE(RAI.degree('my_graph'));
/*+------+--------+
  | NODE | DEGREE |
  |------+--------+
  | 1    | 1      |
  | 2    | 2      |
  | 3    | 1      |
  +------+--------+ */

The table returned by tabular functions has named columns to help identify the meaning of each row. For example, the table returned by the degree function has a node column and a degree column, since each row pairs a node with its degree. See Results Schema for more information.

Required Versus Optional Arguments

Every algorithm function has two parameters:

graph_name: the name of the graph on which to execute the algorithm.
arguments: a JSON object containing additional arguments.

The graph_name is always required. Some functions, like the min_degree and degree functions in the preceding section’s example, do not require an arguments object. Other algorithms, such as shortest_path_length, do require arguments to be provided:

-- Create an undirected graph with three nodes and two edges.
CREATE TABLE my_edges(x INT, y INT) AS SELECT * FROM VALUES (1, 2), (2, 3);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Compute the distance between node 1 and every other node in the graph.
-- `shortest_path_length` requires a `'source' node to be specified in the
-- arguments object.
SELECT * FROM TABLE(RAI.shortest_path_length('my_graph', {'source': 1}));
/*+--------+--------+--------+
  | SOURCE | TARGET | LENGTH |
  |--------+--------+--------|
  | 1      | 1      | 0      |
  | 1      | 2      | 1      |
  | 1      | 3      | 2      |
  +--------+--------+--------+ */

If you call a function that requires an arguments object with only the graph name, Snowflake displays an error:

-- The required `'source'` node is missing, so an error is returned.
SELECT * FROM TABLE(RAI.shortest_path_length('my_graph'));
/* +----------------------------------------------------------------+
   | Invalid argument types for function 'RAI.SHORTEST_PATH_LENGTH' |
   +----------------------------------------------------------------+ */

🔎

Required arguments for each function are documented in the SQL Library Reference. You may access each algorithm’s reference documentation by clicking the function name in the Overview of Graph Algorithms.

Array Arguments

In many cases, arguments that accept a node — such as the shortest_path_length function’s source argument, or the degree function’s node argument — can also accept an array of nodes. For instance, when an array of two or more nodes is passed to the degree function’s node argument, the degrees of each node in the array are returned:

-- Create an undirected graph with three nodes and two edges.
CREATE TABLE my_edges(x INT, y INT) AS SELECT * FROM VALUES (1, 2), (2, 3);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Get the degree of node 2 in `'my_graph'`.
-- A single node is provided to the `'node'` argument.
SELECT * FROM TABLE(RAI.degree('my_graph'), {'node': 2})
/*+------+--------+
  | NODE | DEGREE |
  |------+--------+
  | 2    | 2      |
  +------+--------+ */
 
-- Get the degree of nodes 1 and 3 in `'my_graph'`.
-- Both nodes are provided to the `'node'` argument in an array.
SELECT * FROM TABLE(RAI.degree('my_graph'), {'node': [1, 3]});
/*+------+--------+
  | NODE | DEGREE |
  |------+--------+
  | 1    | 1      |
  | 3    | 1      |
  +------+--------+ */

Some functions, such as the shortest_path_length function, can accept multiple array arguments. For example, you may call shortest_path_length with arrays provided to both its source and target arguments:

-- Compute the length of a shortest path in `'my_graph'` starting at either
-- node 1 or node 2 and ending at either node 3 or node 4.
SELECT * FROM TABLE(RAI.shortest_path_length('my_graph', {'source': [1, 2], 'target': [3, 4]}));
/*+--------+--------+--------+
  | SOURCE | TARGET | LENGTH |
  |--------+--------+--------|
  | 1      | 3      | 2      |
  | 1      | 4      | 3      |
  | 2      | 3      | 1      |
  | 2      | 4      | 2      |
  +--------+--------+--------+ */

When multiple arrays are passed to two or more arguments, rows containing all possible node combinations are returned. For example, in the preceding statement, the array [1, 2] is passed to the source argument and the array [3, 4] is passed to the target argument. The table contains rows for all possible (source, target) pairs containing nodes from the arrays: (1, 3), (1, 4), (2, 3), and (2, 4).

Many functions accept a tuples argument that can be used to compute the desired value for specific node combinations. For example, to compute the shortest path length between the (source, target) pairs (1, 3) and (2, 3), you need to pass the array [[1, 3], [2, 3]] to the tuples argument:

-- Compute the shortest path length in `'my_graph'` between nodes 1 and 3
-- and nodes 2 and 3 using the `'tuples'` argument.
SELECT * FROM TABLE(RAI.shortest_path_length('my_graph', {'tuples': [[1, 3], [2, 3]]]}));
/*+--------+--------+--------+
  | SOURCE | TARGET | LENGTH |
  |--------+--------+--------|
  | 1      | 3      | 2      |
  | 2      | 3      | 1      |
  +--------+--------+--------+ */

Arguments that accept arrays are documented in the SQL Library Reference. See the Overview of Graph Algorithms to access each function’s reference documentation.

Scheduled Tasks

Algorithms may be executed as a Snowflake task (opens in a new tab). As such, you may schedule algorithms to run on specific dates and times or repeat with predetermined intervals. In the following example, the pagerank algorithm is executed every five minutes and the results are stored in a Snowflake table:

-- Create a task.
CREATE OR REPLACE TASK run_pagerank
  WAREHOUSE = my_sf_warehouse
  SCHEDULE = '5 MINUTE'
AS
BEGIN
    SELECT * FROM TABLE(RAI.pagerank('my_graph', {
        'rai_engine': 'my_rai_engine',
        'result_table': '<my_sf_db>.<my_sf_schema>.pagerank_results'
    }));
END;
 
-- Start the recurring task.
ALTER TASK run_pagerank RESUME;
 
-- After five minutes, the results are available.
SELECT * FROM pagerank_results;

When executing an algorithm in a task, you must:

Specify the Snowflake warehouse that executes the task.
Provide an engine name to the algorithm’s rai_engine argument, even if you have previously set a RAI context since the task cannot access the current session variables when it is executed.
Provide the fully qualified (opens in a new tab) name of a Snowflake table in which the results of the algorithm will be stored to the algorithm’s result_table argument.

Limitations

Executing algorithms takes time and, for large enough graphs, the execution time may exceed statement timeouts set for your Snowflake account. If you encounter timeouts, contact your Snowflake administrator to request an increase to your timeout limits.

Working With Results

Once an algorithm has run and the results are in Snowflake, you’ll want to do something with those results. This section describes concepts associated with using the results of algorithms run in RAI.

Results Schema

Tabular functions, such as degree, return tables with named columns. For instance, degree returns a table with a node column and a degree column:

-- Create an undirected graph with three nodes and two edges.
CREATE TABLE my_edges(x INT, y INT) AS SELECT * FROM VALUES (1, 2), (2, 3);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Find the degree of each node in `'my_graph'`.
-- The return value is a table with two columns named NODE and DEGREE.
SELECT * FROM TABLE(RAI.degree('my_graph'));
/*+------+--------+
  | NODE | DEGREE |
  |------+--------+
  | 1    | 1      |
  | 2    | 2      |
  | 3    | 1      |
  +------+--------+ */

🔎

The schema for each algorithm’s return table is available in the Overview of Graph Algorithms.

Saving Results in Snowflake

There are two ways to store the results of an algorithm in a Snowflake table:

`CREATE_TABLE`

You can execute an algorithm and create a table from the results using the CREATE TABLE (opens in a new tab) command:

-- Compute the degree of each node in `'my_graph'` and store the results
-- in a Snowflake table named `'my_results_table'`. Rather than using
-- the `result_table` argument, the table is created directly.
CREATE OR REPLACE TABLE my_result_table(node INT, degree INT) AS (
    SELECT * FROM RAI.degree('my_graph')
);
 
-- The result table has the correct schema.
SELECT * FROM my_result_table;
/*+------+--------+
  | NODE | DEGREE |
  |------+--------+
  | 1    | 1      |
  | 2    | 2      |
  | 3    | 1      |
  +------+--------+ */

The schema of my_result_table should match the algorithm’s result schema. In the preceding example, this means that the table should have a node and degree column, both of which are integers, since that is the schema of the table returned by the degree function.

`result_table`

You may execute an algorithm and store the results in a Snowflake table by providing the table’s fully qualified (opens in a new tab) name to the algorithm’s result_table argument:

-- Compute the degree of each node in `'my_graph'` and store the results
-- in a Snowflake table named `'my_result_table'`. Note that the function
-- returns an empty table when `result_table` is set.
SELECT * FROM TABLE(RAI.degree(
    'my_graph', {'result_table': '<my_sf_db>.<my_sf_schema>.my_result_table'}
));
/*+------+--------+
  | NODE | DEGREE |
  |------+--------+
  | Empty table.  |
  +------+--------+ */
 
-- Note that the columns of the result table are given the generic names
-- COL1 and COL2, instead of NODE and DEGREE.
SELECT * FROM my_result_table;
/*+------+------+
  | COL1 | COL2 |
  |------+------+
  | 1    | 1    |
  | 2    | 2    |
  | 3    | 1    |
  +------+------+ */

When you send the results of an algorithm to a Snowflake table using the result_table argument, the provided table is either created if it does not already exist, or is replaced with the new results. The columns of the table are given generic names, such as COL1 and COL2.

⚠

All data sent to a table using the result_table argument are converted to text.

The result_table argument is necessary for storing results of an algorithm executed in a scheduled task.

The table generated with result_table loses the algorithm’s schema and is not well suited for storing intermediate results in multi-step workloads. Using the CREATE TABLE command is a better option.

Multi-Step Workloads

Storing the results returned by an algorithm in a Snowflake table allows you to work with the results in other queries. In particular, you may create multi-step graph workloads that build off previous results.

For example, the following snippet computes the PageRank of the neighbors of a particular node:

-- Create an undirected graph with three nodes and two edges.
CREATE TABLE my_edges(x INT, y INT) AS SELECT * FROM VALUES (1, 2), (2, 3);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Find the neighbors of node 2 in `'my_graph'` and store the results
-- in a table called `neighbor_results`.
CREATE OR REPLACE TABLE neighbor_results(node1 INT, node2 INT) AS (
    SELECT * FROM TABLE(RAI.neighbor('my_graph', {'node': 2}))
)
 
-- Compute the PageRank of the neighbors of node 2. The array of
-- neighbors is built from the `neighbor_results` table.
SELECT * FROM TABLE(RAI.pagerank('my_graph', {
    'node': (SELECT array_agg(node2) FROM neighbor_results)
}));
/*+------+--------------+
  | NODE | VALUE        |
  |------+--------------+
  | 1    | 0.2567563672 |
  | 3    | 0.2567563672 |
  +------+--------------+ */
 
-- Remove the intermediate results table.
DROP TABLE neighbor_results;

In the preceding example, the array provided to the pagerank function’s node argument is prepared by selecting the neighbor_results table’s node2 column and using Snowflake’s built-in array_agg (opens in a new tab) function to build an array.

Examples

This section illustrates the concepts described in Executing Algorithms and Working With Results using several examples of common scenarios.

Plot a Degree Histogram

Viewing a graph’s degree histogram shows you how the degrees are distributed among nodes and may provide some insight into the structure of the graph.

You can count the number of nodes with each degree in a graph using the degree_histogram function:

USE ROLE my_sf_role;
USE WAREHOUSE my_sf_warehouse;
USE DATABASE my_sf_database;
 
CREATE OR REPLACE SCHEMA degree_histogram;
USE SCHEMA degree_histogram;
 
-- Set the RAI context.
CALL use_rai_engine('my_rai_engine');
CALL use_rai_database('my_rai_db');
 
-- Create an undirected graph.
CREATE OR REPLACE TABLE my_edges(node1 INT, node2 INT) AS (
    SELECT * FROM VALUES
    (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3),
    (2, 6), (3, 7), (4, 5), (4, 8), (5, 9)
);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Count the number of nodes with each degree.
SELECT * FROM TABLE(RAI.degree_histogram('my_graph'));
/*+--------+-------+
  | DEGREE | COUNT |
  |--------+-------+
  | 1      | 3     |
  | 2      | 1     |
  | 3      | 4     |
  | 5      | 1     |
  +--------+-------+ */

To view a histogram, click Chart above the results table for your query. You can use the configuration options on the right-hand panel to view the results as a bar chart with bars from the COUNT column plotted against the DEGREE column on the X-Axis:

You may customize the histogram using the Snowflake UI, including adding buckets. See Visualizing Worksheet Data (opens in a new tab) in the Snowflake documentation for more details.

Rank Nodes by Centrality

Centrality algorithms are used to measure a node’s importance or influence in a graph. The RAI SQL Graph Library implements several centrality algorithms, including:

This example uses the eigenvector_centrality function to sort a subset of nodes in the graph by importance. However, you can replace it with any of the other centrality functions. The example illustrates concepts such as array arguments and working with results schema.

USE ROLE my_sf_role;
USE WAREHOUSE my_sf_warehouse;
USE DATABASE my_sf_database;
 
CREATE OR REPLACE SCHEMA centrality_rank;
USE SCHEMA centrality_rank;
 
-- Set the RAI context.
CALL RAI.use_rai_database('my_rai_db');
CALL RAI.use_rai_engine('my_rai_engine');
 
-- Create an undirected graph with five nodes and five edges.
-- Visually, the graph looks like:  1 -- 3 -- 5
--                                  |    |
--                                  2 -- 4
CREATE OR REPLACE TABLE my_edges(node1 INT, node2 INT)
AS (
    SELECT * FROM VALUES
    (1, 2), (1, 3), (2, 4), (3, 4), (3, 5)
);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
-- Sort nodes 1, 3, and 5 by their eigenvector centrality.
-- from highest to lowest.
SELECT node
FROM TABLE(RAI.eigenvector_centrality(
    'my_graph', {'node': [1, 3, 5]}
))
ORDER BY value DESC;
/*+------+
  | NODE |
  |------+
  | 3    |
  | 1    |
  | 5    |
  +------+ */

You can find more information about centrality algorithms in the Centrality section of the Overview of Graph Algorithms.

Find Similar Nodes

The RAI SQL Graph Library implements several similarity algorithms, including:

This example uses the jaccard_similarity function to find the most similar node that is not a neighbor of a given node. However, you can replace it with any of the other similarity functions.

USE ROLE my_sf_role;
USE WAREHOUSE my_sf_warehouse;
USE DATABASE my_sf_database;
 
CREATE OR REPLACE SCHEMA similar_nodes;
USE SCHEMA similar_nodes;
 
-- Set the RAI context.
CALL RAI.use_rai_database('my_rai_db');
CALL RAI.use_rai_engine('my_rai_engine');
 
-- Create an undirected graph with five nodes and five edges.
-- Visually, the graph looks like:  1 -- 3 -- 5
--                                  |    |
--                                  2 -- 4
CREATE OR REPLACE TABLE my_edges(node1 INT, node2 INT) AS (
    SELECT * FROM VALUES
    (1, 2), (1, 3), (2, 4), (3, 4), (3, 5)
);
CALL RAI.create_data_stream('my_edges');
CALL RAI.create_graph('my_graph', 'my_edges');
 
 
-- Compute the most similar node to node 1 that is not already a neighbor of node 1.
SELECT jacc.node2
FROM
    TABLE(RAI.jaccard_similarity('my_graph', {'node1': 1})) as jacc
WHERE
    jacc.node2 NOT IN (
        SELECT node2 FROM TABLE(RAI.neighbor('my_graph', {'node1': 1}))
    )
    AND jacc.node2 != 1
ORDER BY jacc.score DESC LIMIT 1;
/*+------------+
  | JACC.NODE2 |
  |------------+
  | 4          |
  +------------+ */

Working With Algorithms

Goal

Introduction

Executing Algorithms

RAI Engines

Scalar Versus Tabular Functions

Required Versus Optional Arguments

Array Arguments

Scheduled Tasks

Limitations

Working With Results

Results Schema

Saving Results in Snowflake

CREATE_TABLE

result_table

Multi-Step Workloads

Examples

Plot a Degree Histogram

Rank Nodes by Centrality

Find Similar Nodes

See Also

`CREATE_TABLE`

`result_table`