# My First Knowledge Graph

This tutorial is designed to give users their first introduction to the concept of a knowledge graph.

## Goal

The goal of this tutorial is to introduce the concept of a knowledge graph and practice some simple Rel that will allow you to create your first knowledge graph.

## Scenario

Imagine that you want to analyze the performances of different Olympic athletes to find the most successful athlete. Success can be measured in various ways. You have two questions to ask about your data:

1. Which athlete participated in the most games?
2. Which athletes participated in both Winter and Summer games?

You can now begin to explore how to model this domain as a knowledge graph and how to express these questions as graph queries.

## Elements of a Knowledge Graph

Knowledge graphs allow you to organize information and define relationships between entities. If you think about the Olympic Games scenario, you can intuitively identify two major concepts: Olympic Games and athletes. They are referred to as “entity nodes” and are the solid bordered nodes on the Object-Role Modeling (ORM) diagram below.

These entities can have information associated with them, such as name, age, etc. Related information like this is modeled as a “value node”, and these are represented with dashed borders.

The relationships between concepts (for example, an athlete’s Olympics participation) and the relationship between concepts and values (for example, the athlete’s name) are modeled as edges that connect the nodes with each other in the knowledge graph. In ORM diagrams, edges are represented as the series of boxes connecting the nodes.

🔎

ORM diagrams are powerful tools to express and visualize complex data models at a conceptual level. Rel knowledge graphs and ORM diagrams support advanced representation of data such as hyperedges (edges between more than two nodes). You will find them throughout this tutorial illustrating your knowledge graph as it grows.

Here’s a summary of everything you just learned about knowledge graphs:

| Term | Description |
| --- | --- |
| Entity node | Main concepts. For example: athlete. |
| Value node | Information related to an entity node. For example: athlete’s age. |
| Edge | The relationship between two nodes. |

You can find a more advanced version of this table in Elements of a Relational Knowledge Graph.

To learn more about how entities and value types can be constructed in Rel, see Entities and Value Types. Note: This tutorial uses a simplified construction of entities and values.
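To make the three terms concrete, here is a minimal sketch in plain Python (not Rel). It treats each node as a tagged tuple and each edge label as a set of node pairs. The names `entity_nodes`, `value_nodes`, `edges`, and `neighbors` are illustrative assumptions and do not correspond to any Rel construct.

```python
# Nodes are tagged tuples: (type name, identifying value).
felix = ("Athlete", "Allyson Felix")
tokyo = ("Olympics", "Tokyo 2020")
felix_name = ("Name", "Allyson Michelle Felix")
tokyo_city = ("City", "Tokyo")

entity_nodes = {felix, tokyo}            # main concepts
value_nodes = {felix_name, tokyo_city}   # information attached to concepts

# Each edge label maps to a set of (source node, target node) pairs.
edges = {
    "has_name": {(felix, felix_name)},
    "hosted_in": {(tokyo, tokyo_city)},
    "participates_in": {(felix, tokyo)},
}


def neighbors(node, label):
    """Return every node reachable from `node` over edges named `label`."""
    return {target for source, target in edges[label] if source == node}
```

In this sketch, `neighbors(felix, "participates_in")` follows an edge between two entity nodes, while `neighbors(felix, "has_name")` follows an edge from an entity node to a value node. Both kinds of edge are stored the same way.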

## Modeling Your Knowledge Graph in Rel

You can build a model of your data by describing nodes and the edges between them. By giving these nodes and edges meaning, you are turning a graph into a knowledge graph.

Here is an ORM diagram of the knowledge graph you are about to build:

ORM diagrams allow you to give a visual representation of the categories of information you have in your knowledge graph such as “Olympics”. They are not used to represent specific data, such as “Tokyo 2020”.

### Structuring the Data

The easiest way to start is by listing some things that you know. You have information regarding:

• The Olympic Games (their names, the cities and years in which they took place, and their types).
• The athletes (their names and the sport(s) they performed in).
• The athletes’ participation in the Olympic Games.

Listing what you know is called defining the graph schema. You can find more details and examples of schemas in Elements of a Relational Knowledge Graph and in Graph Schema.

By defining the schema, you will list the entity nodes as well as the value nodes as follows:

``````// model

entity type Olympics = String
entity type Athlete = String

value type Sport = String
value type Name = String
value type City = String
value type GameType = "Summer"; "Winter"``````

Below is a visual representation of your schema. The entity nodes have solid borders and the value nodes have dashed borders:

Note that entity types are used for entity nodes and value types are used for value nodes. By writing `entity type Olympics = String` you declare that each Olympics is identified by a single string value.

Similarly, by writing `value type Sport = String` you declare that Sport is modeled as a value type containing a single string.

Both entity and value type declarations create a constructor relation, which starts with a caret `^`, that will be used to create the actual node instances.
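As a rough analogy in plain Python (not Rel), a constructor can be pictured as a small immutable class: building a node twice from the same identifier yields equal nodes, and a restricted value type only accepts the listed values. The class names mirror the schema above, but raising an error on an unlisted value is an assumption of this sketch, not documented Rel behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Olympics:
    """Entity type: each Olympics is identified by a single string."""
    key: str


@dataclass(frozen=True)
class GameType:
    """Value type restricted to the two values named in the schema."""
    value: str

    def __post_init__(self):
        # Assumption of this sketch: reject anything outside the listed values.
        if self.value not in ("Summer", "Winter"):
            raise ValueError(f"unknown game type: {self.value!r}")


# Constructing the same Olympics twice gives two equal, interchangeable nodes.
tokyo_a = Olympics("Tokyo 2020")
tokyo_b = Olympics("Tokyo 2020")

# Because the nodes are equal, the two pairs collapse into a single edge.
has_type = {(tokyo_a, GameType("Summer")), (tokyo_b, GameType("Summer"))}
```

Equality by identifier is what makes it possible to mention the same node in many separate edge definitions.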

🔎

Choosing what is an entity node and what is a value node is up to the author. You can find more information in Elements of a Relational Knowledge Graph.

It is not necessary to define the value node `Year` because `Year` is a value type that is already defined by the system via the Standard Library.

### Populating the Graph

Now that you have defined the schema, you can insert some data into your database. It is best practice to organize data with modules and group information together. When building a knowledge graph, it makes sense to insert all the data into a module where the module represents the knowledge graph.

First, the `Olympics` data gets inserted, then the `Athlete`, and finally the athlete’s participation in the Olympic Games.

Updating the data within the module requires two steps:

1. Defining the data within a temporary module (`olympic_game_info` in this case).
2. Storing the data in a base relation (`OlympicGraph` in this case) that persists in the database.
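The two steps can be pictured with a plain-Python sketch (not Rel), assuming that an insert adds tuples to whatever the base relation already holds. The `insert` helper and the dictionary layout are hypothetical and only illustrate the idea.

```python
# Persistent store: relation name -> set of tuples (stands in for the base relation).
olympic_graph = {}


def insert(store, module):
    """Add every tuple of a temporary module to the persistent store."""
    for relation_name, tuples in module.items():
        store.setdefault(relation_name, set()).update(tuples)


# Transaction 1: a temporary module holding facts about the Games.
olympic_game_info = {
    "has_name": {("Tokyo 2020", "Tokyo 2020")},
    "hosted_in": {("Tokyo 2020", "Tokyo")},
}
insert(olympic_graph, olympic_game_info)

# Transaction 2: the temporary name is reused with entirely new content.
olympic_game_info = {
    "has_name": {("Allyson Felix", "Allyson Michelle Felix")},
    "performs_in": {("Allyson Felix", "Athletics")},
}
insert(olympic_graph, olympic_game_info)
```

After both transactions, `olympic_graph["has_name"]` holds the names of the Games and of the athlete, even though the temporary module was overwritten in between.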

#### Olympic Games

First, define an entity node `Olympics` that describes Olympic Games and the edges (`has_name`, `hosted_in`, `happens_in`, and `has_type`) that relate the entity node to all its value nodes (`Name`, `City`, `Year`, and `GameType`).

``````// write query

// Define the data within a temporary `olympic_game_info` module.
module olympic_game_info

// Entity node: `Olympics`
def Olympics = {
^Olympics["Tokyo 2020"];
^Olympics["Rio 2016"];
^Olympics["London 2012"];
^Olympics["Beijing 2008"];
^Olympics["Athens 2004"];
^Olympics["PyeongChang 2018"];
^Olympics["Sochi 2014"]
}

// Edge: `has_name`
def has_name = {
(^Olympics["Tokyo 2020"], ^Name["Tokyo 2020"]);
(^Olympics["Rio 2016"], ^Name["Rio 2016"]);
(^Olympics["London 2012"], ^Name["London 2012"]);
(^Olympics["Beijing 2008"], ^Name["Beijing 2008"]);
(^Olympics["Athens 2004"], ^Name["Athens 2004"]);
(^Olympics["PyeongChang 2018"], ^Name["PyeongChang 2018"]);
(^Olympics["Sochi 2014"], ^Name["Sochi 2014"])
}

// Edge: `hosted_in`
def hosted_in = {
(^Olympics["Tokyo 2020"], ^City["Tokyo"]);
(^Olympics["Rio 2016"], ^City["Rio de Janeiro"]);
(^Olympics["London 2012"], ^City["London"]);
(^Olympics["Beijing 2008"], ^City["Beijing"]);
(^Olympics["Athens 2004"], ^City["Athens"]);
(^Olympics["PyeongChang 2018"], ^City["PyeongChang"]);
(^Olympics["Sochi 2014"], ^City["Sochi"])
}

// Edge: `happens_in`
def happens_in = {
(^Olympics["Tokyo 2020"], ^Year[2021]);
(^Olympics["Rio 2016"], ^Year[2016]);
(^Olympics["London 2012"], ^Year[2012]);
(^Olympics["Beijing 2008"], ^Year[2008]);
(^Olympics["Athens 2004"], ^Year[2004]);
(^Olympics["PyeongChang 2018"], ^Year[2018]);
(^Olympics["Sochi 2014"], ^Year[2014])
}

// Edge: `has_type`
def has_type = {
(^Olympics["Tokyo 2020"], ^GameType["Summer"]);
(^Olympics["Rio 2016"], ^GameType["Summer"]);
(^Olympics["London 2012"], ^GameType["Summer"]);
(^Olympics["Beijing 2008"], ^GameType["Summer"]);
(^Olympics["Athens 2004"], ^GameType["Summer"]);
(^Olympics["PyeongChang 2018"], ^GameType["Winter"]);
(^Olympics["Sochi 2014"], ^GameType["Winter"])
}

end

// Store the data in the `OlympicGraph` base relation.
def insert:OlympicGraph = olympic_game_info``````

As you can see, a module starts with the keyword `module` and ends with the keyword `end`. Indentation inside a module only aids readability; Rel does not require it.

So far, you have specified facts about recent Olympic Games and stored them in the `OlympicGraph` base relation. Notice that you have reused `^Olympics["Tokyo 2020"]` in several edge definitions. These edges associate the `^Olympics["Tokyo 2020"]` entity node with the different value nodes attributed to that particular Olympic Games.

You can visualize the entity and value nodes as well as their properties, which appear as edges, in the ORM diagram:

#### Athletes

Next, you need to add the `Athlete` information to the database. Although in the real world there are thousands of Olympic athletes, for illustration purposes you are going to use a short list.

``````// write query

// Define the data within a temporary `olympic_game_info` module.
module olympic_game_info

// Entity node: `Athlete`
def Athlete = {
^Athlete["Allyson Felix"];
^Athlete["Eddy Alvarez"];
^Athlete["Tom Daley"];
^Athlete["Simone Biles"]
}

// Edge: `has_name`
def has_name = {
(^Athlete["Allyson Felix"], ^Name["Allyson Michelle Felix"]);
(^Athlete["Eddy Alvarez"], ^Name["Eduardo Cortes Alvarez"]);
(^Athlete["Tom Daley"], ^Name["Thomas Robert Daley"]);
(^Athlete["Simone Biles"], ^Name["Simone Arianne Biles"])
}

// Edge: `performs_in`
def performs_in = {
(^Athlete["Allyson Felix"], ^Sport["Athletics"]);
(^Athlete["Eddy Alvarez"], ^Sport["Baseball"]);
(^Athlete["Eddy Alvarez"], ^Sport["Short track speed skating"]);
(^Athlete["Tom Daley"], ^Sport["Diving"]);
(^Athlete["Simone Biles"], ^Sport["Gymnastics"])
}

end

// Store the data in the `OlympicGraph` base relation.
def insert:OlympicGraph = olympic_game_info``````

Again, the Rel module syntax helps you organize your thinking around modeling with easy-to-recognize blocks. The module includes definitions of names and sports for each athlete.

🔎

Note: Here the temporary module is called `olympic_game_info` again. The previous use of it and the definitions it encapsulated were erased after the storing transaction. Only the information stored in the base relation `OlympicGraph` persists in the database.

Some athletes may play in multiple sports. Eddy Alvarez, for example, competed in both baseball and short track speed skating. But this fact doesn’t change how you specify the attribute `^Sport`. In the case of athletes who have competed in multiple sports, you have to define one edge for each sport that they performed in.
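The "one edge per sport" rule can be sketched in plain Python (not Rel) by storing `performs_in` as a set of pairs, so an athlete with two sports simply appears in two pairs. The variable names `sports_by_athlete` and `multi_sport_athletes` are made up for this illustration.

```python
# One (athlete, sport) pair per edge, using the data from the tutorial.
performs_in = {
    ("Allyson Felix", "Athletics"),
    ("Eddy Alvarez", "Baseball"),
    ("Eddy Alvarez", "Short track speed skating"),
    ("Tom Daley", "Diving"),
    ("Simone Biles", "Gymnastics"),
}

# Group the edges by athlete to see how many sports each one has.
sports_by_athlete = {}
for athlete, sport in performs_in:
    sports_by_athlete.setdefault(athlete, set()).add(sport)

# Athletes linked to more than one sport.
multi_sport_athletes = {
    athlete for athlete, sports in sports_by_athlete.items() if len(sports) > 1
}
```

No special structure is needed for athletes with several sports: the extra sport is just one more pair in the same set.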

You have specified facts for two entity nodes: `Olympics` and `Athlete` (solid bordered). Notice how your knowledge graph is growing as you define more and more entity nodes and link them to their values through properties, which are modeled by edges:

#### Participation

Now that you have defined all your entity nodes and related value nodes, you want to model athlete participation in Olympic Games as this is the relationship you are interested in. To do so, you write a relation called `participates_in` that captures athlete participation. This relation creates the edges in your knowledge graph that connect athlete nodes with the relevant Olympic Game nodes. Once again, you use the module syntax.

``````// write query

// Define the data within a temporary `olympic_game_info` module.
module olympic_game_info

// Edge: `participates_in`
def participates_in = {
(^Athlete["Eddy Alvarez"], ^Olympics["Tokyo 2020"]);
(^Athlete["Eddy Alvarez"], ^Olympics["Sochi 2014"]);

(^Athlete["Allyson Felix"], ^Olympics["Tokyo 2020"]);
(^Athlete["Allyson Felix"], ^Olympics["Rio 2016"]);
(^Athlete["Allyson Felix"], ^Olympics["London 2012"]);
(^Athlete["Allyson Felix"], ^Olympics["Beijing 2008"]);
(^Athlete["Allyson Felix"], ^Olympics["Athens 2004"]);

(^Athlete["Tom Daley"], ^Olympics["Tokyo 2020"]);
(^Athlete["Tom Daley"], ^Olympics["Rio 2016"]);
(^Athlete["Tom Daley"], ^Olympics["London 2012"]);
(^Athlete["Tom Daley"], ^Olympics["Beijing 2008"]);

(^Athlete["Simone Biles"], ^Olympics["Tokyo 2020"]);
(^Athlete["Simone Biles"], ^Olympics["Rio 2016"])
}

end

// Store the data in the `OlympicGraph` base relation.
def insert:OlympicGraph = olympic_game_info``````

As you visualize this update in the ORM diagram, notice that the edges between the entity nodes look the same as the edges between entity and value nodes:

## Querying Your Knowledge Graph in Rel

### Start Querying

When the two entities `Olympics` and `Athlete` are related, you can find additional insight by asking questions — queries — about specific entity properties as well as connections between entities. For example, you can find a collection of athletes who have participated in the “Tokyo 2020” Olympic Games. But with the `Sport` value node conveniently linked with the `Athlete` entity node by the edge `performs_in`, you can refine your query to find only athletes performing gymnastics in the “Tokyo 2020” Olympic Games.

``````// read query

def output(person, person_name) = {
OlympicGraph:performs_in(person, ^Sport["Gymnastics"])
and OlympicGraph:participates_in(person, ^Olympics["Tokyo 2020"])
and OlympicGraph:has_name(person, person_name)
}``````

Here’s the breakdown of that query:

• `def output(person, person_name)` indicates that the output will display two values: an entity hash, which is represented by the variable `person`, and the name of an athlete, which is represented by the variable `person_name`.
• `OlympicGraph:performs_in(person, ^Sport["Gymnastics"])` selects edges that associate a person with the sport “Gymnastics”. This information will be retrieved from the module `OlympicGraph`.
• `and OlympicGraph:participates_in(person, ^Olympics["Tokyo 2020"])` refines this selection to gymnasts that competed in the “Tokyo 2020” Olympic Games. This information will be retrieved from the module `OlympicGraph`.
• `and OlympicGraph:has_name(person, person_name)` connects the entity hash with the name of the athlete.

When you first defined the schema of your database, you established that `Olympics` and `Athlete` are modeled as entities. Entities are referenced by their unique system-generated number, called a “hash”. If you want to display the name of the athlete (`person_name`) rather than the athlete’s hash, you need to specify it in the query.

This example represents one “hop” on a knowledge graph, where you start at the `Olympics` node and jump via the edge `participates_in` to all neighboring nodes of type `Athlete` that fulfill your specific condition. This type of query is characteristic of working with knowledge graphs.
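For readers who find it helpful, the same query logic can be sketched in plain Python (not Rel) as a filter over sets of pairs. In this sketch the athlete’s identifying string stands in for the entity hash, and only a subset of the tutorial data is included; both are simplifications.

```python
performs_in = {
    ("Allyson Felix", "Athletics"),
    ("Tom Daley", "Diving"),
    ("Simone Biles", "Gymnastics"),
}
participates_in = {
    ("Allyson Felix", "Tokyo 2020"),
    ("Tom Daley", "Tokyo 2020"),
    ("Simone Biles", "Tokyo 2020"),
    ("Simone Biles", "Rio 2016"),
}
has_name = {
    ("Allyson Felix", "Allyson Michelle Felix"),
    ("Tom Daley", "Thomas Robert Daley"),
    ("Simone Biles", "Simone Arianne Biles"),
}

# Keep a (person, person_name) pair only when all three conditions hold,
# mirroring the three lines joined by `and` in the query.
output = {
    (person, person_name)
    for person, sport in performs_in
    if sport == "Gymnastics"
    if (person, "Tokyo 2020") in participates_in
    for named, person_name in has_name
    if named == person
}
```

The shared variable `person` is what ties the three conditions together, just as it does in the Rel query.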

### Most Experienced Athlete

Recall the two questions from the scenario. You want to find:

1. The athletes in your list who have participated in the most Olympic Games.
2. The athletes who have participated in both Summer and Winter Olympic Games.

You can use the information in the base relation `OlympicGraph` to find the most appropriate athletes from your data.

``````// read query

def experience[person] = count[olympics : OlympicGraph:participates_in(person, olympics)]

def most_experienced_athlete = argmax[experience]

def output(person, person_name) = {
most_experienced_athlete(person)
and OlympicGraph:has_name(person, person_name)
}``````

To answer the first question, you first define a list of athletes coupled with the number of Olympic Games they’ve participated in. You will call this relation `experience`. Here, you can use a built-in relation called `count` to aggregate over `olympics`, counting the number of Olympic Games per athlete.

Then building on the `experience` definition, you can use another built-in relation called `argmax` to define another relation called `most_experienced_athlete`. This function chooses from the list the athlete associated with the largest number of Olympic Games, which is stored in the relation `experience`.

Finally, using the relation `most_experienced_athlete`, you can query your database.
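The count-then-argmax pattern can be illustrated with a short plain-Python sketch (not Rel) over the participation data from this tutorial. Keeping every athlete tied for the top count is a choice made in this sketch.

```python
from collections import Counter

# (athlete, games) pairs taken from the `participates_in` edge above.
participates_in = {
    ("Eddy Alvarez", "Tokyo 2020"), ("Eddy Alvarez", "Sochi 2014"),
    ("Allyson Felix", "Tokyo 2020"), ("Allyson Felix", "Rio 2016"),
    ("Allyson Felix", "London 2012"), ("Allyson Felix", "Beijing 2008"),
    ("Allyson Felix", "Athens 2004"),
    ("Tom Daley", "Tokyo 2020"), ("Tom Daley", "Rio 2016"),
    ("Tom Daley", "London 2012"), ("Tom Daley", "Beijing 2008"),
    ("Simone Biles", "Tokyo 2020"), ("Simone Biles", "Rio 2016"),
}

# Step 1 (the role of `count`): number of distinct Games per athlete.
# The pairs live in a set, so each (athlete, games) pair is counted once.
experience = Counter(person for person, _ in participates_in)

# Step 2 (the role of `argmax`): athletes whose count equals the maximum.
top_count = max(experience.values())
most_experienced_athlete = {
    person for person, games_count in experience.items() if games_count == top_count
}
```

With this data, `experience` maps Allyson Felix to 5, Tom Daley to 4, and both Eddy Alvarez and Simone Biles to 2.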

🔎

Notice that the relation `most_experienced_athlete` is not defined within the base relation `OlympicGraph` and therefore referencing it doesn’t require the `OlympicGraph:` prefix. Furthermore, the relation only exists within this query and is not persisted in the database and can’t be accessed elsewhere.

### Multi-Season Athlete

Now you can ask, from the list of athletes, who has participated in both Winter and Summer Olympic Games?

``````// read query

def multi_season_athlete(person) {
exists(g1, g2:
OlympicGraph:participates_in(person, g1)
and OlympicGraph:has_type(g1, ^GameType["Winter"])
and OlympicGraph:participates_in(person, g2)
and OlympicGraph:has_type(g2, ^GameType["Summer"])
)
}

def output(person, person_name) {
multi_season_athlete(person)
and OlympicGraph:has_name(person, person_name)
}``````

You get one result back and see that Eddy Alvarez is the most diverse athlete: he has participated in both Winter and Summer Olympic Games (Sochi 2014 and Tokyo 2020). Here’s how the query works:

The code above defines the relation `multi_season_athlete`. This returns those athletes, if any, who have participated in both Summer and Winter Olympic Games.

The query is a little more complex than examples you have seen so far. It’s a navigational query, as it traverses a graph, finding the athletes from the connections you specify.

The variable `g1` refers to Olympic Games with the value `^GameType["Winter"]` and `g2` refers similarly to Olympic Games with the value `^GameType["Summer"]`.

You are asking the system to find the athlete(s) who participated in (`participates_in`) both `g1` and `g2` Games, that is, athletes who have participated in both Winter and Summer Olympic Games. The `exists` part of the query specifies that two games need to exist that fulfill the subsequent condition, but that you don’t care which ones they are. In technical terms, the variables `g1` and `g2` are existentially quantified.
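Existential quantification can be sketched in plain Python (not Rel) with `any`, which succeeds as soon as one suitable pair of Games is found and never reports which pair it was. The data is a subset of the tutorial’s, and the helper name simply echoes the Rel relation.

```python
participates_in = {
    ("Eddy Alvarez", "Tokyo 2020"), ("Eddy Alvarez", "Sochi 2014"),
    ("Simone Biles", "Tokyo 2020"), ("Simone Biles", "Rio 2016"),
}
has_type = {
    ("Tokyo 2020", "Summer"),
    ("Rio 2016", "Summer"),
    ("Sochi 2014", "Winter"),
}


def multi_season_athlete(person):
    """True if some g1 is a Winter Games and some g2 is a Summer Games for `person`."""
    games = {g for p, g in participates_in if p == person}
    return any(
        (g1, "Winter") in has_type and (g2, "Summer") in has_type
        for g1 in games
        for g2 in games
    )


athletes = {person for person, _ in participates_in}
result = {person for person in athletes if multi_season_athlete(person)}
```

Only the athlete appears in `result`; the Games bound to `g1` and `g2` stay internal to the check, which is what "you don’t care which ones they are" means in practice.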

By creating and querying your knowledge graph, you have analyzed your data and found the most successful athlete according to your chosen definitions of success. Allyson Felix is the most successful when success is defined as the number of Olympic Games an athlete participates in. If success is defined by diversity of Olympic Game types, Eddy Alvarez is the most successful athlete.

🔎

At a later stage, as you fully test out more complex definitions, you can add them to your model. This way, they can be saved in the database and shared across all your queries. This is how you extend the knowledge graph with new information, building up a semantic layer on top of your base data. See the test queries in The Lehigh University Benchmark for examples.

## Visualizing Your Data Graph in Rel

Throughout this tutorial you have used ORM diagrams to visualize the data at the schema level (for example, `Olympics`, `Athlete`), which is a simple yet effective way of looking at it. Rel also lets you visualize the data itself (for example, `Tokyo 2020`, `Eduardo Cortes Alvarez`). Here is how to build this visualization using Graphviz:

``````// read query

// Define the graph relation.
module DataGraph
with OlympicGraph use Athlete, Olympics, participates_in, has_name

def node(name) {
exists(e :
(Athlete(e) or Olympics(e))
and name = string[has_name[e]]
)
}

def edge(name1, name2) {
exists(e1, e2:
participates_in(e1, e2)
and name1 = string[has_name[e1]]
and name2 = string[has_name[e2]]
)
}

end

// Plot it.
def output = ::std::display::graphviz[DataGraph]
``````

Above is the Graphviz representation of a section of your data.

Here you are using a module again, called `DataGraph`, within which you will define the graph relations. First you state that within the `DataGraph` module you are using `Athlete`, `Olympics`, `participates_in`, and `has_name`, which are all stored in the `OlympicGraph` base relation. Then you define that `Athlete` and `Olympics` entities make up the nodes and `participates_in` is the only edge between those nodes that should be displayed.

As you have seen in previous queries, `has_name` is required in order to display the names of the Olympic Games and of the athletes. Before passing the data to Graphviz, the value type data is converted to a String with `string`.
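To show what "nodes and edges as strings" amounts to, here is a minimal plain-Python sketch (not Rel) that turns node names and edge pairs into Graphviz DOT text. The `to_dot` helper and the short keys such as `"eddy"` are hypothetical, and nothing here describes how Rel’s own Graphviz integration works internally.

```python
# Hypothetical short keys stand in for entity hashes.
has_name = {
    "eddy": "Eduardo Cortes Alvarez",
    "tokyo": "Tokyo 2020",
    "sochi": "Sochi 2014",
}
participates_in = {("eddy", "tokyo"), ("eddy", "sochi")}


def to_dot(nodes, edges):
    """Render node names and (source, target) pairs as DOT text.

    Assumes names contain no double quotes, so no escaping is done.
    """
    lines = ["digraph DataGraph {"]
    for name in sorted(nodes):
        lines.append(f'  "{name}";')
    for source, target in sorted(edges):
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Replace keys by display names, as `has_name` does in the Rel module.
node_names = set(has_name.values())
edge_names = {(has_name[a], has_name[g]) for a, g in participates_in}
dot = to_dot(node_names, edge_names)
```

The resulting `dot` string can be saved to a file and rendered with the Graphviz `dot` command-line tool.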

## Summary

This tutorial has introduced you to the concept of knowledge graphs, entity nodes, value nodes, and edges. You have learned how to model relationships between pieces of data, and how to express these relationships in Rel in the form of a knowledge graph. You also learned how to translate questions into graph queries in Rel that reason over your knowledge graph. You can learn more about those concepts as well as how to build and query a relational knowledge graph in Elements of a Relational Knowledge Graph.

Expressing queries across interesting, complex knowledge graphs is what Rel does best. You can find a concrete example of that in the Lehigh University Benchmark how-to guide.