Aggregate and group data

Use std.aggregates to compute aggregate values over sets of matches in your model. This guide covers how to use different aggregate functions, control grouping with .per(), scope which matches contribute with aggregate-local .where(), and handle duplicates with distinct().

Understand aggregates and explicit grouping

An aggregate is a value computed over a set of matches. In PyRel, aggregates do not “look at your selected columns” and then automatically group. You always control grouping explicitly.

Use this mental model:

  1. Define what can contribute. An aggregate counts, sums, or averages whatever matches its pattern. If a fact does not match the aggregate-local .where(), it does not contribute.
  2. Define what it is per. .per() tells PyRel which keys define a group. PyRel computes one aggregate value per distinct combination of those keys.

Output shape is the easiest way to sanity-check whether you grouped correctly:

  • No .per() → one value for the full matching set (a global aggregate).
  • .per(key) → one value per key.
  • .per(key1, key2, ...) → one value per key tuple.
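As a rough analogy only (plain Python dicts, not the PyRel API), the three output shapes look like this:

```python
# Plain-Python analogy of aggregate output shapes -- not the PyRel API.
tickets = [
    {"id": 101, "status": "open"},
    {"id": 102, "status": "open"},
    {"id": 103, "status": "closed"},
]

# No .per(): one global value for the full matching set.
global_count = len(tickets)

# .per(status): one value per distinct status key.
per_status = {}
for t in tickets:
    per_status[t["status"]] = per_status.get(t["status"], 0) + 1

print(global_count)  # 3
print(per_status)    # {'open': 2, 'closed': 1}
```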

Here is an example of a global count and a per-group count:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
m.define(
    Ticket.new(id=101, status="open"),
    Ticket.new(id=102, status="open"),
    Ticket.new(id=103, status="closed"),
)
global_ticket_count = aggregates.count(Ticket)
ticket_count_per_status = aggregates.count(Ticket).per(Ticket.status)
print(m.select(global_ticket_count.alias("global_tickets")).to_df())
print(m.select(Ticket.status, ticket_count_per_status.alias("tickets")).to_df())
  • aggregates.count(Ticket) counts matches of Ticket.
  • Without .per(), the result is a single global value.
  • With .per(Ticket.status), the result has one value per matching status.
  • When you group by a value key (like a status string), you usually select that same key next to the aggregate to validate the result.

Choose an aggregate function

Choose an aggregate based on the question you are answering and the value you are summarizing. If your match pattern can produce duplicates, use distinct() to control whether you are aggregating a multiset or a set.

Use this table to choose the right aggregate for your definition:

What to use | When to use it
----------- | --------------
count()     | Use when you need “how many matches exist” for a pattern.
sum()       | Use when you need a total of a numeric value across matches.
avg()       | Use when you need the mean of a numeric value across matches.
min()       | Use when you need the smallest or earliest comparable value across matches.
max()       | Use when you need the largest or latest comparable value across matches.

Use aggregates.count() when you need a global count or a per-group count of matching facts:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Account = m.Concept("Account", identify_by={"id": Integer})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.account = m.Relationship(f"{Ticket} belongs to {Account:account}")
m.define(
    a1 := Account.new(id=1),
    a2 := Account.new(id=2),
    t1 := Ticket.new(id=101, status="open"),
    t2 := Ticket.new(id=102, status="open"),
    t3 := Ticket.new(id=103, status="closed"),
    t1.account(a1),
    t2.account(a1),
    t3.account(a2),
)
ticket_count_per_account = (
    aggregates.count(Ticket)
    .per(Account)
    .where(Ticket.account == Account)
)
ticket_count_per_account_status = (
    aggregates.count(Ticket)
    .per(Account, Ticket.status)
    .where(Ticket.account == Account)
)
# Number of tickets per account
print(
    m.select(
        Account.id,
        ticket_count_per_account.alias("tickets")
    ).to_df()
)
# Number of tickets per account of each status
print(
    m.select(
        Account.id,
        Ticket.status,
        ticket_count_per_account_status.alias("tickets")
    ).to_df()
)
  • count(Ticket) counts matching Ticket facts.
  • .per(Account) produces one count per account.
  • .per(Account, Ticket.status) produces one count per (account, status) pair.
  • Ticket.account == Account associates each ticket with an account group.
  • count() counts matches, not “unique entities”. If your match pattern can produce duplicates (for example, through a join), use distinct().
  • Per-group counts only include rows that have a value for the grouping key. If a ticket is missing Ticket.status, it contributes to count(Ticket) but not to count(Ticket).per(..., Ticket.status).
  • Aggregates omit groups with zero matches. If an account has zero tickets, it won’t appear in the per-account output. If you expect a group to show up with 0, set a default value using the fallback operator.
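To see why zero-match groups disappear, here is a plain-Python sketch (ordinary dicts, not the PyRel API or its fallback operator):

```python
# Plain-Python analogy -- not the PyRel API.
accounts = [1, 2, 3]
ticket_accounts = [1, 1, 2]  # account 3 has no tickets

# A grouped count only creates entries for groups that actually match.
counts = {}
for a in ticket_accounts:
    counts[a] = counts.get(a, 0) + 1
print(counts)  # {1: 2, 2: 1} -- account 3 is absent

# Supplying a default makes every group appear, the way a fallback value does.
with_default = {a: counts.get(a, 0) for a in accounts}
print(with_default)  # {1: 2, 2: 1, 3: 0}
```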

Use aggregates.sum() when each match contributes a numeric value you want to total:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Account = m.Concept("Account", identify_by={"id": Integer})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.resolution_minutes = m.Property(f"{Ticket} resolution minutes is {Integer:minutes}")
Ticket.account = m.Relationship(f"{Ticket} belongs to {Account:account}")
m.define(
    a1 := Account.new(id=1),
    a2 := Account.new(id=2),
    t1 := Ticket.new(id=101, status="closed", resolution_minutes=120),
    t2 := Ticket.new(id=102, status="closed", resolution_minutes=45),
    t3 := Ticket.new(id=103, status="open", resolution_minutes=10),
    t1.account(a1),
    t2.account(a1),
    t3.account(a2),
)
total_resolution_minutes = (
    aggregates.sum(Ticket.resolution_minutes)
    .per(Account)
    .where(Ticket.account == Account, Ticket.status == "closed")
)
print(m.select(Account.id, total_resolution_minutes.alias("total_resolution_minutes")).to_df())
  • The summed value is Ticket.resolution_minutes.
  • Only closed tickets contribute because the aggregate-local .where(...) scopes the contributing set.
  • total_resolution_minutes produces one value per account because of .per(Account).
  • Missing resolution_minutes values do not contribute to the sum.
  • sum() only adds values that exist. If Ticket.resolution_minutes is missing for a matching ticket, that ticket contributes no value.
  • Grouped sums omit groups with zero matches. If an account has no matching closed tickets, it won’t appear in the output.
  • If your match pattern can produce duplicates (for example, through a join), your sum can be larger than expected. Consider distinct() only when treating duplicates as the same contribution is correct for your metric.

Use aggregates.avg() when you need a per-group mean of a numeric value:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Queue = m.Concept("Queue", identify_by={"name": String})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.first_response_minutes = m.Property(f"{Ticket} first response minutes is {Integer:minutes}")
Ticket.queue = m.Relationship(f"{Ticket} assigned to {Queue:queue}")
m.define(
    q1 := Queue.new(name="triage"),
    q2 := Queue.new(name="billing"),
    t1 := Ticket.new(id=101, status="open", first_response_minutes=12),
    t2 := Ticket.new(id=102, status="open", first_response_minutes=55),
    t3 := Ticket.new(id=103, status="closed", first_response_minutes=5),
    t1.queue(q1),
    t2.queue(q1),
    t3.queue(q2),
)
avg_first_response = (
    aggregates.avg(Ticket.first_response_minutes)
    .per(Queue)
    .where(Ticket.queue == Queue, Ticket.status == "open")
)
print(m.select(Queue.name, avg_first_response.alias("avg_first_response_minutes")).to_df())
  • The averaged value is Ticket.first_response_minutes.
  • Only open tickets contribute because the aggregate-local .where(...) scopes the contributing set.
  • avg_first_response produces one value per queue because of .per(Queue).
  • When debugging an average, validate the corresponding count() and sum() for the same grouping keys.
  • avg() averages only values that exist. If Ticket.first_response_minutes is missing for a matching ticket, that ticket contributes no value.
  • Grouped averages omit groups with zero matches. If a queue has no matching open tickets, it won’t appear in the output. If you expect a group to show up with 0 (or another default), set a default value using the fallback operator.
  • If your match pattern can produce duplicates (for example, through a join), your average can be skewed. Consider distinct() only when treating duplicates as the same contribution is correct for your metric.

Use aggregates.min() to find the smallest comparable value per group:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Account = m.Concept("Account", identify_by={"id": Integer})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.first_response_minutes = m.Property(f"{Ticket} first response minutes is {Integer:minutes}")
Ticket.account = m.Relationship(f"{Ticket} belongs to {Account:account}")
m.define(
    a1 := Account.new(id=1),
    a2 := Account.new(id=2),
    t1 := Ticket.new(id=101, first_response_minutes=25),
    t2 := Ticket.new(id=102, first_response_minutes=7),
    t3 := Ticket.new(id=103, first_response_minutes=18),
    t1.account(a1),
    t2.account(a1),
    t3.account(a2),
)
min_first_response = (
    aggregates.min(Ticket.first_response_minutes)
    .per(Account)
    .where(Ticket.account == Account)
)
print(m.select(Account.id, min_first_response.alias("min_first_response_minutes")).to_df())
  • The minimized value is Ticket.first_response_minutes.
  • min_first_response produces one value per account because of .per(Account).
  • Ticket.account == Account associates each ticket with an account group.
  • Groups with no matching tickets do not appear.
  • min() only considers values that exist. If Ticket.first_response_minutes is missing for a matching ticket, that ticket contributes no value.
  • Grouped minimums omit groups with zero matches. If an account has no tickets, it won’t appear in the output. If you expect a group to show up with a default value, set a default value using the fallback operator.
  • min() returns the minimum value, not “the row that produced it”. If you need the ticket ID for the minimum, select contributing rows (ticket + value) and validate which one is smallest.

Use aggregates.max() to find the largest comparable value per group:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Queue = m.Concept("Queue", identify_by={"name": String})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.resolution_minutes = m.Property(f"{Ticket} resolution minutes is {Integer:minutes}")
Ticket.queue = m.Relationship(f"{Ticket} assigned to {Queue:queue}")
m.define(
    q1 := Queue.new(name="triage"),
    q2 := Queue.new(name="billing"),
    t1 := Ticket.new(id=101, status="closed", resolution_minutes=240),
    t2 := Ticket.new(id=102, status="closed", resolution_minutes=45),
    t3 := Ticket.new(id=103, status="open", resolution_minutes=10),
    t1.queue(q1),
    t2.queue(q1),
    t3.queue(q2),
)
max_resolution = (
    aggregates.max(Ticket.resolution_minutes)
    .per(Queue)
    .where(Ticket.queue == Queue, Ticket.status == "closed")
)
print(m.select(Queue.name, max_resolution.alias("max_resolution_minutes")).to_df())
  • The maximized value is Ticket.resolution_minutes.
  • Only closed tickets contribute because the aggregate-local .where() scopes the contributing set.
  • max_resolution produces one value per queue because of .per(Queue).
  • Groups with no matching closed tickets do not appear.
  • max() only considers values that exist. If Ticket.resolution_minutes is missing for a matching ticket, that ticket contributes no value.
  • Grouped maxima omit groups with zero matches. If you expect a group to show up with a default value, set a default value using the fallback operator.
  • max() returns the maximum value, not “the row that produced it”. If you need the ticket ID for the maximum, select contributing rows (ticket + value) and validate which one is largest.

Group results with per()

Use .per() to define the grouping keys for an aggregate. You can group by one key or multiple keys.

As a rule of thumb:

  • Use one key when your question is “per thing” (for example “per account”).
  • Use multiple keys when your question is “per thing and per breakdown” (for example “per account, per queue”).

Adding keys usually makes the result more granular and increases the number of rows.
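A plain-Python sketch of the granularity difference (dicts keyed by tuples, not the PyRel API):

```python
# Plain-Python analogy -- not the PyRel API.
tickets = [
    {"account": 1, "queue": "triage"},
    {"account": 1, "queue": "billing"},
    {"account": 2, "queue": "triage"},
]

# One grouping key: one count per account.
per_account = {}
for t in tickets:
    per_account[t["account"]] = per_account.get(t["account"], 0) + 1

# Two grouping keys: one count per (account, queue) pair -- more rows.
per_account_queue = {}
for t in tickets:
    key = (t["account"], t["queue"])
    per_account_queue[key] = per_account_queue.get(key, 0) + 1

print(len(per_account))        # 2 groups
print(len(per_account_queue))  # 3 groups
```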

This example groups by both account and queue:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Account = m.Concept("Account", identify_by={"id": Integer})
Queue = m.Concept("Queue", identify_by={"name": String})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.account = m.Relationship(f"{Ticket} belongs to {Account:account}")
Ticket.queue = m.Relationship(f"{Ticket} assigned to {Queue:queue}")
m.define(
    a1 := Account.new(id=1),
    a2 := Account.new(id=2),
    q1 := Queue.new(name="triage"),
    q2 := Queue.new(name="billing"),
    t1 := Ticket.new(id=101, status="open"),
    t2 := Ticket.new(id=102, status="open"),
    t3 := Ticket.new(id=103, status="open"),
    t1.account(a1),
    t2.account(a1),
    t3.account(a2),
    t1.queue(q1),
    t2.queue(q2),
    t3.queue(q1),
)
open_tickets_per_account_queue = (
    aggregates.count(Ticket)
    .per(Account, Queue)
    .where(Ticket.account == Account, Ticket.queue == Queue, Ticket.status == "open")
)
df = m.select(
    Account.id,
    Queue.name,
    open_tickets_per_account_queue.alias("open_tickets"),
).to_df()
print(df)
  • .per(Account, Queue) produces one value per (account, queue) pair.
  • The aggregate-local .where(...) associates each ticket with its account group and queue group.
  • Select the grouping keys next to the aggregate value to validate output shape.
  • If your .per() keys do not match your intended grouping level, you can get “too many rows”.

You can also use the module-level aggregates.per() function to define a reusable grouped aggregate that you can apply to different measures, and then specialize it with an aggregate function and a .where():

open_tickets_per_account_queue = (
    aggregates.per(Account, Queue)
    .count(Ticket)
    .where(Ticket.account == Account, Ticket.queue == Queue, Ticket.status == "open")
)
  • .per(...) controls output shape. If you forget .per(...), you’ll compute a global aggregate.
  • Grouped aggregates only include matches that bind all the grouping keys. If a ticket is missing an account or queue, it won’t contribute to a per-(account, queue) result. See Handle missing data when groups or rows seem to “disappear”.
  • Prefer stable grouping keys. Grouping by concepts (like Account and Queue) or canonical identifiers (like Account.id) is usually easier to validate than grouping by free-form strings.

Scope aggregate inputs with aggregate-local .where()

Aggregate-local .where() controls which matches contribute to an aggregate. This is different from filtering the rows you select with Fragment.where().

Use this rule of thumb:

  • Put predicates in the aggregate-local .where() when they define the contributing matches. This includes join edges (like Ticket.has_comment(Comment)) and any “only count these” filters.
  • Put predicates in the fragment .where() when they define which groups you want to display. This does not change how the aggregate is computed, only which keys show up in your output.
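The distinction can be sketched in plain Python (not the PyRel API): filtering contributions changes the computed numbers, while filtering the display only hides rows:

```python
# Plain-Python analogy -- not the PyRel API.
tickets = [
    {"account": 1, "status": "open"},
    {"account": 1, "status": "closed"},
    {"account": 2, "status": "open"},
]

# Aggregate-local .where(): only open tickets contribute to each count.
contributing = {}
for t in tickets:
    if t["status"] == "open":
        contributing[t["account"]] = contributing.get(t["account"], 0) + 1
print(contributing)  # {1: 1, 2: 1}

# Fragment .where(): count everything, then hide groups from the output.
all_counts = {}
for t in tickets:
    all_counts[t["account"]] = all_counts.get(t["account"], 0) + 1
displayed = {a: c for a, c in all_counts.items() if a == 1}
print(displayed)  # {1: 2} -- the count itself is unchanged
```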

This example computes comment counts per ticket. The aggregate-local .where() makes the relationship predicate part of the aggregate’s contributing set:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Comment = m.Concept("Comment", identify_by={"id": Integer})
Ticket.has_comment = m.Relationship(f"{Ticket} has comment {Comment:comment}")
m.define(
    t1 := Ticket.new(id=1),
    t2 := Ticket.new(id=2),
    c1 := Comment.new(id=11),
    c2 := Comment.new(id=12),
    c3 := Comment.new(id=13),
    t1.has_comment(c1),
    t1.has_comment(c2),
    t2.has_comment(c3),
)
comment_count = aggregates.count(Comment).per(Ticket).where(Ticket.has_comment(Comment))
print(m.select(Ticket.id, comment_count.alias("comment_count")).to_df())

For example, this filters the output to one ticket (but keeps the same per-ticket aggregate definition):

q = m.select(Ticket.id, comment_count.alias("comment_count")).where(Ticket.id == 1)
print(q.to_df())
  • Ticket.has_comment(Comment) is part of the aggregate’s match pattern. Only those comment matches contribute to the count.
  • When you scope contributions with aggregate-local .where(), the output shape stays stable. You can select the grouping key and the aggregate side-by-side to validate it.
  • Aggregate-local .where() is the easiest way to keep output shape stable while changing which matches contribute. When results look wrong, validate the contributing rows first (for example Ticket.id + Comment.id).
  • Groups with zero matches do not appear in grouped aggregates. If you expect a ticket with 0 comments to show up, set a default value using the fallback operator.
  • You can also pre-filter by defining a derived concept (for example “TicketWithComment”) and aggregating over that. Aggregate-local .where() is often the most direct approach when the predicate is “inside” the aggregate.

Deduplicate aggregate inputs with distinct()

Use distinct() inside an aggregate when your match pattern can produce duplicates. This is common when a join expands one entity into multiple matching rows.

The key idea is that distinct() changes the contributing multiset into a set:

  • count(User, Comment) counts how many distinct (user, comment) pairs match. A user who comments twice contributes two pairs.
  • count(distinct(User)) counts unique users contributing to the aggregate.

This example counts comment-author matches per support ticket and compares that to a distinct-user count:

from relationalai.semantics import Integer, Model, String, distinct
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
User = m.Concept("User", identify_by={"id": Integer})
User.email = m.Property(f"{User} has email {String:email}")
Comment = m.Concept("Comment", identify_by={"id": Integer})
Ticket.has_comment = m.Relationship(f"{Ticket} has comment {Comment:comment}")
Comment.author = m.Relationship(f"{Comment} authored by {User:user}")
m.define(
    t1 := Ticket.new(id=1),
    t2 := Ticket.new(id=2),
    u1 := User.new(id=101, email="ava@example.com"),
    u2 := User.new(id=102, email="ben@example.com"),
    c1 := Comment.new(id=11),
    c2 := Comment.new(id=12),
    c3 := Comment.new(id=13),
    c4 := Comment.new(id=14),
    t1.has_comment(c1),
    t1.has_comment(c2),
    t2.has_comment(c3),
    t2.has_comment(c4),
    c1.author(u1),
    c2.author(u1),
    c3.author(u1),
    c4.author(u2),
)
comment_author_matches = (
    aggregates.count(User, Comment)
    .per(Ticket)
    .where(Ticket.has_comment(Comment), Comment.author == User)
)
unique_commenters = (
    aggregates.count(distinct(User))
    .per(Ticket)
    .where(Ticket.has_comment(Comment), Comment.author == User)
)
df = m.select(
    Ticket.id,
    comment_author_matches.alias("comment_author_matches"),
    unique_commenters.alias("unique_commenters"),
).to_df()
print(df)
  • count(User, Comment) counts comment-author matches. A user who comments twice contributes two matches (because the comment differs).
  • count(distinct(User)) counts unique users contributing to the aggregate.
  • When the numbers do not match your expectation, select the raw contributing rows first. That is usually faster than changing the aggregate.
  • distinct() placement matters. Wrap the expression that you want to treat as unique (for example distinct(User) or distinct(Ticket.resolution_minutes)).
  • distinct() can also change the meaning of .sum() and .avg() by deduplicating the value expression. Use it only when you want to treat duplicates as the same contribution.
  • distinct() inside an aggregate deduplicates the aggregate’s inputs. distinct() in a select() deduplicates output rows. See Deduplicate rows with distinct().
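The multiset-versus-set idea reduces to a plain-Python sketch (tuples and sets, not the PyRel API):

```python
# (user, comment) matches for one ticket; user 101 commented twice.
matches = [(101, 11), (101, 12), (102, 13)]

# count(User, Comment): every distinct pair contributes, like len() on the list.
pair_count = len(matches)

# count(distinct(User)): deduplicate on the user before counting.
unique_users = len({user for user, _ in matches})

print(pair_count)    # 3
print(unique_users)  # 2
```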

Rank results with rank()

Use aggregates.rank() when you want an ordinal rank value computed from one or more ordering keys.

Here’s the mental model:

  • rank(...) produces a rank column. It does not sort your output.
  • Use aggregates.asc() and aggregates.desc() to make ordering explicit.
  • Add a stable secondary key (like an ID) when ties matter.
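A plain-Python sketch of ranking with a tie-breaker (composite sort keys, not the PyRel API):

```python
# Plain-Python analogy -- not the PyRel API.
accounts = [
    {"id": 1, "open_tickets": 2},
    {"id": 3, "open_tickets": 2},  # ties with account 1 on the primary key
    {"id": 2, "open_tickets": 0},
]

# desc(open_tickets), asc(id): sort by (-count, id) so ties resolve by ID.
ordered = sorted(accounts, key=lambda a: (-a["open_tickets"], a["id"]))
ranks = {a["id"]: i + 1 for i, a in enumerate(ordered)}
print(ranks)  # {1: 1, 3: 2, 2: 3}
```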

This example ranks accounts by their derived open ticket count:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Account = m.Concept("Account", identify_by={"id": Integer})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.account = m.Relationship(f"{Ticket} belongs to {Account:account}")
Account.open_ticket_count = m.Property(f"{Account} open ticket count is {Integer:count}")
m.define(
    a1 := Account.new(id=1),
    a2 := Account.new(id=2),
    a3 := Account.new(id=3),
    t1 := Ticket.new(id=101, status="open"),
    t2 := Ticket.new(id=102, status="open"),
    t3 := Ticket.new(id=103, status="closed"),
    t4 := Ticket.new(id=104, status="open"),
    t1.account(a1),
    t2.account(a1),
    t3.account(a2),
    t4.account(a3),
)
open_ticket_count = (
    aggregates.count(Ticket)
    .per(Account)
    .where(Ticket.account == Account, Ticket.status == "open")
)
m.define(Account.open_ticket_count(open_ticket_count))
open_ticket_rank = aggregates.rank(
    aggregates.desc(Account.open_ticket_count),
    aggregates.asc(Account.id),
)
df = m.select(
    Account.id,
    Account.open_ticket_count,
    open_ticket_rank.alias("rank_by_open_tickets"),
).to_df()
print(df)
  • Account.open_ticket_count materializes the aggregate metric as a reusable property.
  • rank(desc(...), asc(Account.id)) adds a deterministic secondary ordering key when counts tie.
  • rank() produces a rank column. It does not sort the output DataFrame.
  • If you need a “top N” list, use top() or bottom() as predicates in where(). rank() by itself does not filter results.
  • You can rank within a group by calling .per() on the rank expression. For example, rank(desc(Product.revenue)).per(Store) ranks products per store.
  • When a rank looks wrong, validate the ordering keys first. Select the key, the metric you’re ranking by, and the rank() column side-by-side.

Pick top and bottom results

Use aggregates.top() and aggregates.bottom() as predicates in .where() when you want to keep only the highest or lowest groups.

Here’s how to think about it:

  • top() keeps the largest values. bottom() keeps the smallest values.
  • The first ordering argument is your primary sort key. Any additional arguments act as tie-breakers.
  • Add .per() when you need “top N per group”.
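A plain-Python sketch of top-N with a tie-breaker (not the PyRel API):

```python
# Plain-Python analogy -- not the PyRel API.
open_ticket_counts = {1: 2, 2: 1, 3: 1}  # accounts 2 and 3 tie at the boundary

# top(2, count, id): order by (-count, id) and keep the first two accounts.
kept = sorted(open_ticket_counts, key=lambda a: (-open_ticket_counts[a], a))[:2]
print(kept)  # [1, 2] -- the ID tie-breaker deterministically keeps account 2
```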

This example keeps only the 2 accounts with the highest open ticket counts:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Account = m.Concept("Account", identify_by={"id": Integer})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.account = m.Relationship(f"{Ticket} belongs to {Account:account}")
Account.open_ticket_count = m.Property(f"{Account} open ticket count is {Integer:count}")
m.define(
    a1 := Account.new(id=1),
    a2 := Account.new(id=2),
    a3 := Account.new(id=3),
    t1 := Ticket.new(id=101, status="open"),
    t2 := Ticket.new(id=102, status="open"),
    t3 := Ticket.new(id=103, status="open"),
    t4 := Ticket.new(id=104, status="open"),
    t1.account(a1),
    t2.account(a1),
    t3.account(a2),
    t4.account(a3),
)
open_ticket_count = (
    aggregates.count(Ticket)
    .per(Account)
    .where(Ticket.account == Account, Ticket.status == "open")
)
m.define(Account.open_ticket_count(open_ticket_count))
q = (
    m.select(Account.id, Account.open_ticket_count)
    .where(aggregates.top(2, Account.open_ticket_count, Account.id))
)
print(q.to_df())
  • top(2, Account.open_ticket_count, Account.id) keeps the highest counts.
  • The extra Account.id argument acts as a stable tie-breaker.
  • Use aggregates.bottom() to keep the smallest values instead.
  • top() and bottom() filter rows. They do not sort your output. If you want results ordered in the DataFrame, sort after materializing with .to_df().
  • Ties are the most common source of surprises. If multiple rows share the same boundary value, add a stable tie-breaker key (like an ID) so selection is deterministic.
  • To select extremes within each group, call .per() on the top() / bottom() expression. For example, top(3, Product.revenue).per(Store) keeps the top 3 products per store.

Build derived classifications from aggregates

Use this pattern when you want a named, reusable category based on an aggregate metric. This is common for definitions like “high value customer” or “at risk account” where you want to classify entities based on the results of an aggregate.

Here’s how it works:

  1. Compute a per-group metric

    Start with one metric, one grouping key, and one condition for contributions.

  2. Assign the metric to a property (optional)

    A property makes the metric reusable in ranking, top/bottom filters, and other definitions.

  3. Define the classification and validate it

    Select the key, the metric, and the classification side-by-side so you can spot mistakes quickly.

This example defines HighVolumeAccount for accounts with more than 2 open tickets:

from relationalai.semantics import Integer, Model, String
from relationalai.semantics.std import aggregates
m = Model("SupportModel")
Account = m.Concept("Account", identify_by={"id": Integer})
Ticket = m.Concept("Ticket", identify_by={"id": Integer})
Ticket.status = m.Property(f"{Ticket} status is {String:status}")
Ticket.account = m.Relationship(f"{Ticket} belongs to {Account:account}")
Account.open_ticket_count = m.Property(f"{Account} open ticket count is {Integer:count}")
HighVolumeAccount = m.Concept("HighVolumeAccount", extends=[Account])
m.define(
    a1 := Account.new(id=1),
    a2 := Account.new(id=2),
    t1 := Ticket.new(id=101, status="open"),
    t2 := Ticket.new(id=102, status="open"),
    t3 := Ticket.new(id=103, status="open"),
    t4 := Ticket.new(id=104, status="closed"),
    t1.account(a1),
    t2.account(a1),
    t3.account(a1),
    t4.account(a2),
)
# Compute the per-account open ticket count.
open_ticket_count = (
    aggregates.count(Ticket)
    .per(Account)
    .where(Ticket.account == Account, Ticket.status == "open")
)
m.define(Account.open_ticket_count(open_ticket_count))
# Define HighVolumeAccount as accounts with more than 2 open tickets.
m.define(HighVolumeAccount(Account)).where(Account.open_ticket_count > 2)
print(m.select(Account.id, Account.open_ticket_count).to_df())
print(m.select(HighVolumeAccount.id).to_df())
  • The metric is grouped with .per(Account), so it applies at the account level.
  • HighVolumeAccount(Account) defines concept membership for accounts that cross the threshold.
  • Selecting Account.id, open_ticket_count, and the derived concept makes the logic easy to validate.

Troubleshoot aggregate results

Most surprising aggregate outputs come from one of a few root causes:

  • Missing .per() (accidental global aggregation)
  • Duplicate-generating joins (more matches than you expected)
  • Scoping the wrong .where() (filtering contributions vs filtering displayed groups)

Before you rewrite your definition logic, validate the shape and inputs first.

Use this checklist when results are too large, too small, or unstable:

  • Confirm the output shape first. Select your grouping keys next to the aggregate value. A missing .per() produces a global result, and extra .per() keys can produce “too many rows”.
  • Inspect the contributing rows. When a count or sum looks wrong, select the raw match pattern first. This is the fastest way to see duplicates, missing links, or unexpected join expansion.
  • Separate contribution scoping from display filtering. Use aggregate-local .where() to define what contributes. Use fragment .where() to choose which groups you show.
  • Account for missing values and zero-match groups. Missing values do not contribute to sum() / avg() / min() / max(). Grouped aggregates omit groups with zero matches. If you need explicit zeros, set a default value using the fallback operator.
  • Make ordering deterministic when ties matter. For rank(), top(), and bottom(), add a stable secondary key (like an ID) when boundary ties matter.

Use this table to map common symptoms to likely causes and fixes:

Symptom | Likely cause | Fix
------- | ------------ | ---
You get a single row when you expected one row per group | Missing .per() | Add .per() and select the group key(s) next to the aggregate value. See Group results with per().
You get “too many rows” | The grouping keys are too granular | Remove unintended .per() keys and validate the grouping level by selecting only the key(s) you intend to group by.
Counts or sums are larger than expected | Duplicate-generating joins expanded the contributing matches | Inspect the raw match pattern and decide whether you need to deduplicate inputs with distinct(). See Deduplicate aggregate inputs with distinct().
Groups disappear after you add a grouping key | The grouping key is missing for some entities | Validate that the grouping key chain exists and handle missing values intentionally. See Handle missing data.
Counts are smaller than expected | Predicates are scoping away contributions | Re-check whether predicates belong in aggregate-local .where() (contributions) or fragment .where() (display). See Scope aggregate inputs with aggregate-local .where().
top() / bottom() picks surprising rows around the boundary | Ties and ambiguous ordering | Add a stable tie-breaker ordering key so selection is deterministic. See Pick top and bottom results.
Your rank column looks right but the DataFrame isn’t ordered | rank() does not sort output | Sort after materializing with .to_df() if you need a sorted display.