Test Data Generation
Determine feasible row counts for test database tables that satisfy schema constraints, then generate example synthetic rows.
What this template is for
QA and data engineering teams often need synthetic datasets that are large enough for realistic testing, but still satisfy core schema constraints like foreign keys and parent–child cardinality rules. Purely random generation tends to break referential integrity (for example, order lines pointing at non-existent orders) or produce unrealistic table size ratios.
This template uses RelationalAI’s prescriptive reasoning (optimization) capabilities to choose feasible row counts for each table that are as close as possible to your desired targets, subject to schema-derived constraints.
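To see the kind of tension the solver resolves, consider a hypothetical example (illustrative numbers, not the shipped sample data): if each order may have at most 3 line items, then 800 orders cap the line-item table at 2,400 rows, so a 5,000-row target cannot be met and the best feasible count sits at the cap:

```python
# Hypothetical cardinality check: child rows are capped by parent rows times
# the per-parent maximum, regardless of the stated target.
orders = 800
max_lines_per_order = 3
target_order_lines = 5000

feasible_order_lines = min(target_order_lines, orders * max_lines_per_order)
```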
Who this is for
- QA engineers and test automation developers who need repeatable datasets.
- Data engineers who need scalable dev/load-test datasets while maintaining realistic relationships.
- Anyone comfortable reading CSV schemas and running a Python script.
What you’ll build
- A linear program (LP) that chooses per-table row counts within min/max bounds.
- Constraints that link child and parent table sizes based on foreign keys plus optional cardinality/coverage metadata.
- A priority-weighted objective that keeps important tables closer to their targets.
- A seeded, procedural generator that prints a small sample of synthetic rows.
What’s included
- Model + solve script: `test_data_generation.py`
- Sample data: `data/testgen_schema.csv`, `data/testgen_constraints.csv`, `data/testgen_targets.csv`
- Outputs: solver status + objective, per-table row counts, and a preview of generated rows printed to stdout
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
Quickstart
Follow these steps to run the template with the included sample data.
1. Download the ZIP file for this template and extract it:

   ```sh
   curl -O https://private.relational.ai/templates/zips/v0.13/test_data_generation.zip
   unzip test_data_generation.zip
   cd test_data_generation
   ```

2. Create and activate a virtual environment:

   ```sh
   python -m venv .venv
   source .venv/bin/activate
   python -m pip install -U pip
   ```

3. Install dependencies:

   ```sh
   python -m pip install .
   ```

4. Configure Snowflake connection and RAI profile:

   ```sh
   rai init
   ```

5. Run the template:

   ```sh
   python test_data_generation.py
   ```

6. Expected output (abridged):

   ```
   Status: OPTIMAL
   Total weighted deviation: 18000.00
   Optimal row counts:
   Customer: 100 rows (target: 100, deviation: 0)
   Product: 500 rows (target: 500, deviation: 0)
   ...
   ==================================================
   GENERATED TEST DATA
   ==================================================
   Customer: 100 rows
   customer_id email region created_date
   ...
   ```
Template structure
```
.
├─ README.md
├─ pyproject.toml
├─ test_data_generation.py   # main runner / entrypoint
└─ data/                     # sample input CSVs
   ├─ testgen_constraints.csv
   ├─ testgen_schema.csv
   └─ testgen_targets.csv
```

Start here: `test_data_generation.py`
Sample data
All sample inputs live under data/.
testgen_targets.csv
Defines target row counts, min/max bounds, and per-table priorities.
| Column | Meaning |
|---|---|
| `table_name` | Table identifier used throughout the template |
| `target_rows` | Desired number of rows |
| `min_rows` | Minimum allowed rows |
| `max_rows` | Maximum allowed rows |
| `priority` | Priority used in the objective (1 is highest priority) |
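To make the format concrete, here is a small sketch that parses a couple of hypothetical rows following these headers (the table names and values are illustrative, not the shipped sample data):

```python
import csv
import io

# Hypothetical rows following the documented headers (not the shipped sample).
sample_targets = """table_name,target_rows,min_rows,max_rows,priority
Customer,100,50,200,1
OrderLine,5000,1000,10000,3
"""

# Each row parses into a dict keyed by the header names above.
targets = list(csv.DictReader(io.StringIO(sample_targets)))
```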
testgen_schema.csv
Describes tables and columns, including which columns are foreign keys and which parent table they reference. The solver uses this file to discover table pairs that should be linked by constraints.
| Column | Meaning |
|---|---|
| `table_name` | Table containing the column |
| `column_name` | Column name |
| `column_type` | Data type label (for example, int, string, date) |
| `is_primary_key` | Whether the column is a primary key |
| `is_foreign_key` | Whether the column is a foreign key |
| `references_table` | Parent table name (for foreign keys) |
| `references_column` | Parent column name (for foreign keys) |
| `is_unique` | Whether values must be unique |
| `is_nullable` | Whether null values are allowed |
| `min_value` | Optional numeric lower bound (when present) |
| `max_value` | Optional numeric upper bound (when present) |
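As a sketch of how foreign keys are expressed in this file, the snippet below parses two hypothetical schema rows (illustrative names, not the shipped sample) and filters for foreign-key columns the way the script does:

```python
import csv
import io

# Hypothetical schema rows (not the shipped sample): a primary key plus a
# foreign key on OrderLine.order_id referencing Order.order_id.
sample_schema = """table_name,column_name,column_type,is_primary_key,is_foreign_key,references_table,references_column,is_unique,is_nullable,min_value,max_value
OrderLine,line_id,int,True,False,,,True,False,,
OrderLine,order_id,int,False,True,Order,order_id,False,False,,
"""

rows = list(csv.DictReader(io.StringIO(sample_schema)))
# DictReader yields strings, so the boolean flag is compared as text here.
fk_rows = [r for r in rows if r["is_foreign_key"] == "True"]
```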
testgen_constraints.csv
Adds constraint metadata that can create tension with targets. In this template, the solver reads cardinality-style metadata per foreign-key table pair and (optionally) coverage percentages.
| Column | Meaning |
|---|---|
| `constraint_type` | Constraint kind (for example, cardinality_bound, mandatory_participation, coverage, frequency, value_domain) |
| `table_name` | Child/source table |
| `column_name` | Column the constraint applies to |
| `related_table` | Related/parent table (when applicable) |
| `related_column` | Related/parent column (when applicable) |
| `min_value` | Minimum value for the constraint (when applicable) |
| `max_value` | Maximum value for the constraint (when applicable) |
| `percentage` | Coverage percent (when applicable) |
| `description` | Human-readable summary |
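The sketch below shows two hypothetical constraint rows in this format (the values are illustrative, not the shipped sample): a cardinality bound and a coverage percentage, as the solver would read them:

```python
import csv
import io

# Hypothetical constraint rows (not the shipped sample data).
sample_constraints = """constraint_type,table_name,column_name,related_table,related_column,min_value,max_value,percentage,description
cardinality_bound,OrderLine,order_id,Order,order_id,1,3,,Each order has 1 to 3 line items
coverage,Order,customer_id,Customer,customer_id,,,80,Order rows must reach 80% of Customer rows
"""

constraints = list(csv.DictReader(io.StringIO(sample_constraints)))
```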
Model overview
The optimization model is built around a single concept (Table) populated from testgen_targets.csv, plus two decision properties used by the solver.
- Key entity: `Table`
- Inputs (loaded): `Table.table_name`, `Table.target_rows`, `Table.min_rows`, `Table.max_rows`, `Table.priority`
- Decision variables: `Table.x_actual_rows` (solver-chosen row count, continuous) and `Table.x_deviation` (auxiliary variable used to model the absolute deviation from the target)
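The role of `Table.x_deviation` can be sketched in plain Python. The model imposes `dev >= x - t` and `dev >= t - x` and then minimizes `dev`, which drives it down to `|x - t|` without needing a nonlinear `abs()` term:

```python
# The smallest dev satisfying both dev >= x - t and dev >= t - x is the
# larger of the two right-hand sides, which equals |x - t|.
def smallest_feasible_dev(x: float, t: float) -> float:
    return max(x - t, t - x)
```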
How it works
This section walks through the highlights in test_data_generation.py.
Import libraries and configure inputs
First, the script sets up a data directory and an optional scaling knob for the target sizes:
```python
DATA_DIR = Path(__file__).parent / "data"
SCALE_FACTOR = 1.0
```
```python
# Disable pandas inference of string types. This ensures that string columns
# in the CSVs are loaded as object dtype. This is only required when using
# relationalai versions prior to v1.0.
pd.options.future.infer_string = False
```

Load CSV inputs and build the model container
Next, it creates a Semantics Model, loads the three CSV inputs, and scales targets/bounds with SCALE_FACTOR:
```python
# Create a Semantics model container.
model = Model(
    f"test_data_generation_{time_ns()}",
    config=globals().get("config", None),
    use_lqp=False,
)
```
```python
# Load schema data from CSV.
schema_df = read_csv(DATA_DIR / "testgen_schema.csv")

# Load constraint data from CSV.
constraints_df = read_csv(DATA_DIR / "testgen_constraints.csv")

# Load row-count targets from CSV.
targets_df = read_csv(DATA_DIR / "testgen_targets.csv")
```
```python
# Scale row-count targets and bounds.
targets_df = targets_df.copy()
targets_df["target_rows"] = (targets_df["target_rows"] * SCALE_FACTOR).astype(int)
targets_df["min_rows"] = (targets_df["min_rows"] * SCALE_FACTOR).astype(int)
targets_df["max_rows"] = (targets_df["max_rows"] * SCALE_FACTOR).astype(int)
```

Define the Table concept and decision properties
Then it loads one Table entity per row in testgen_targets.csv and adds two solver-chosen properties (actual_rows and deviation):
```python
# Table concept: represents a table in the schema.
Table = model.Concept("Table")

# Load table data from CSV.
data(targets_df).into(Table, keys=["table_name"])

# Table.x_actual_rows decision property: solver-chosen row count per table.
Table.x_actual_rows = model.Property("{Table} has actual {actual_rows:float}")

# Table.x_deviation decision property: absolute deviation from the target.
Table.x_deviation = model.Property("{Table} has {deviation:float}")
```

Extract foreign keys and constraint metadata
With inputs loaded, the script scans testgen_schema.csv for foreign keys and joins in any per-table-pair metadata (like cardinality bounds and coverage percentages) from testgen_constraints.csv:
```python
# Extract foreign key relationships from the schema.
fk_df = schema_df[schema_df["is_foreign_key"] == True].copy()
cardinality_constraints = constraints_df[
    constraints_df["constraint_type"].isin(
        ["cardinality_bound", "mandatory_participation", "frequency"]
    )
]

fk_objs = []
for _, fk_row in fk_df.iterrows():
    child_table = fk_row["table_name"]
    parent_table = fk_row["references_table"]

    card = cardinality_constraints[
        (cardinality_constraints["table_name"] == child_table)
        & (cardinality_constraints["related_table"] == parent_table)
    ]

    min_per = 1
    max_per = 100

    for _, c in card.iterrows():
        if c["constraint_type"] == "cardinality_bound":
            min_per = int(c["min_value"]) if not pd.isna(c["min_value"]) else 1
            max_per = int(c["max_value"]) if not pd.isna(c["max_value"]) else 100
        elif c["constraint_type"] == "frequency":
            max_per = int(c["max_value"]) if not pd.isna(c["max_value"]) else 100

    coverage = constraints_df[
        (constraints_df["constraint_type"] == "coverage")
        & (constraints_df["table_name"] == child_table)
        & (constraints_df["related_table"] == parent_table)
    ]
    coverage_pct = 0.0
    if len(coverage) > 0:
        coverage_pct = float(coverage.iloc[0]["percentage"]) / 100.0

    fk_objs.append(
        {
            "child": child_table,
            "parent": parent_table,
            "min": min_per,
            "max": max_per,
            "coverage": coverage_pct,
        }
    )
```

Define variables, constraints, and objective
With the feasible region defined at the table level, the script declares two variable families and linearizes the absolute deviation from each target:
```python
s = SolverModel(model, "cont", use_pb=True)

# Variable: actual row counts for each table
s.solve_for(
    Table.x_actual_rows,
    name=["n", Table.table_name],
    lower=Table.min_rows,
    upper=Table.max_rows,
)

# Variable: deviation from target (for objective)
s.solve_for(
    Table.x_deviation,
    name=["dev", Table.table_name],
    lower=0,
)

# Constraint: deviation captures |actual - target| (linearized)
s.satisfy(require(Table.x_deviation >= Table.x_actual_rows - Table.target_rows))
s.satisfy(require(Table.x_deviation >= Table.target_rows - Table.x_actual_rows))
```

Then it links child and parent table sizes using `where(...).require(...)` for each discovered foreign key pair (including optional mandatory participation and coverage constraints):
```python
# Constraint: referential integrity - child rows bounded by parent capacity.
# These constraints link specific table pairs via their x_actual_rows variables.
Table2 = Table.ref()
for fk_info in fk_objs:
    child_name = fk_info["child"]
    parent_name = fk_info["parent"]

    # Upper bound: can't have more children than max per parent.
    s.satisfy(
        where(
            Table.table_name == child_name, Table2.table_name == parent_name
        ).require(Table.x_actual_rows <= Table2.x_actual_rows * fk_info["max"])
    )

    # Lower bound for mandatory participation
    mandatory = constraints_df[
        (constraints_df["constraint_type"] == "mandatory_participation")
        & (constraints_df["table_name"] == child_name)
        & (constraints_df["related_table"] == parent_name)
    ]
    if len(mandatory) > 0:
        min_per = (
            int(mandatory.iloc[0]["min_value"])
            if not pd.isna(mandatory.iloc[0]["min_value"])
            else 1
        )
        s.satisfy(
            where(
                Table.table_name == child_name, Table2.table_name == parent_name
            ).require(Table.x_actual_rows >= Table2.x_actual_rows * min_per)
        )

# Constraint: coverage requirements
for fk_info in fk_objs:
    if fk_info["coverage"] > 0:
        child_name = fk_info["child"]
        parent_name = fk_info["parent"]
        s.satisfy(
            where(
                Table.table_name == child_name, Table2.table_name == parent_name
            ).require(Table.x_actual_rows >= fk_info["coverage"] * Table2.x_actual_rows)
        )
```

Finally, it minimizes a priority-weighted deviation (higher priority means a higher penalty for missing the target):
```python
# Objective: minimize weighted deviation from targets
# Weight by priority (higher priority = more important to match target)
total_deviation = rai_sum(Table.x_deviation * (11 - Table.priority))
s.minimize(total_deviation)
```

Solve and generate example rows
The script solves with the HiGHS backend and prints a summary:
```python
solver = Solver("highs")
s.solve(solver, time_limit_sec=60)

print(f"Status: {s.termination_status}")
print(f"Total weighted deviation: {s.objective_value:.2f}")
```

Then it rounds the continuous solution to integers for reporting and for driving the generator:

```python
# Extract row counts
row_counts = {}
results_df = select(Table.table_name, Table.x_actual_rows, Table.target_rows).to_df()
for _, row in results_df.iterrows():
    actual = int(round(row["actual_rows"]))
    target = int(row["target_rows"])
    row_counts[row["table_name"]] = {"actual": actual, "target": target}
```

To close the loop, it generates a small set of synthetic records and prints a preview of each table:

```python
# Generate data and show samples.
print("\n" + "=" * 50)
print("GENERATED TEST DATA")
print("=" * 50)

generated_data = generate_test_data(row_counts)
for table, df in generated_data.items():
    print(f"\n{table}: {len(df)} rows")
    print(df.head(3).to_string(index=False))
```

Customize this template
Use your own schema and constraints
- Replace the files in `data/` with your own versions that follow the same headers.
- Start by updating `testgen_targets.csv` (targets + bounds), then ensure foreign keys are reflected in `testgen_schema.csv`.
Tune dataset size quickly
- Change `SCALE_FACTOR` in `test_data_generation.py` to scale all targets/bounds at once.
Adjust priorities
- If some tables must match targets more closely than others, change `priority` in `testgen_targets.csv`. Higher priority tables are penalized more heavily in the objective.
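Since the objective weights each deviation by `11 - priority`, the effect of a priority change is easy to reason about; this small sketch mirrors that formula:

```python
# Mirrors the objective's weighting: weight = 11 - priority, so priority 1
# is penalized ten times as heavily as priority 10.
def deviation_weight(priority: int) -> int:
    return 11 - priority
```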
Export generated data
- The template prints a preview. If you want CSV outputs, add `df.to_csv(...)` calls after generation.
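A minimal sketch of such an export step, assuming (as in the template) that `generate_test_data` returns a dict mapping table names to pandas DataFrames — the `export_tables` helper and `out_dir` layout below are illustrative, not part of the template:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sketch only: write each generated table to its own CSV file.
def export_tables(generated_data: dict, out_dir: str) -> list:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for table, df in generated_data.items():
        path = out / f"{table}.csv"
        df.to_csv(path, index=False)
        written.append(path)
    return written

# Tiny demo with a stand-in table.
demo = {"Customer": pd.DataFrame({"customer_id": [1, 2]})}
written = export_tables(demo, tempfile.mkdtemp())
```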
Troubleshooting
I get ModuleNotFoundError when running the script
- Confirm you activated your virtual environment: `source .venv/bin/activate`.
- Reinstall dependencies from the template root: `python -m pip install .`.
rai init fails or the script can’t authenticate
- Re-run `rai init` and confirm your Snowflake account has the RAI Native App installed.
- If you use multiple profiles, set `RAI_PROFILE` before running: `export RAI_PROFILE=<profile>`.
- Verify your Snowflake role/warehouse has access to the RelationalAI Native App.
The script fails reading CSV files (missing file or columns)
- Confirm you are running from the template folder that contains `data/`.
- Verify the input filenames exist: `testgen_schema.csv`, `testgen_constraints.csv`, `testgen_targets.csv`.
- Check that the CSV headers match the expected columns documented above.
The solver returns Status: INFEASIBLE
- This usually indicates the target bounds and the parent-child constraints cannot be satisfied at the same time.
- Start by widening `min_rows`/`max_rows` in `testgen_targets.csv` for the most constrained tables.
- Check whether tight constraints (for example, low max cardinality or high frequency minimums) force row counts beyond your bounds.
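One way to narrow down the cause is a pre-solve sanity check on the bounds. The sketch below (not part of the template; `bound_conflicts` is a hypothetical helper) flags child tables whose bounds cannot be reconciled with parent bounds under the extracted cardinality limits:

```python
# bounds: {table: (min_rows, max_rows)}; fk_objs shaped like the dicts the
# script builds ({"child", "parent", "min", "max", ...}).
def bound_conflicts(bounds: dict, fk_objs: list) -> list:
    conflicts = []
    for fk in fk_objs:
        child_min, child_max = bounds[fk["child"]]
        parent_min, parent_max = bounds[fk["parent"]]
        # Even at max parent rows and max children per parent, the child
        # cannot reach its own minimum.
        if child_min > parent_max * fk["max"]:
            conflicts.append((fk["child"], "min_rows exceeds parent capacity"))
        # Even at min parent rows, the minimum children per parent already
        # overshoot the child's maximum.
        if child_max < parent_min * fk["min"]:
            conflicts.append((fk["child"], "max_rows below mandatory minimum"))
    return conflicts

bounds = {"Order": (100, 800), "OrderLine": (3000, 10000)}
fks = [{"child": "OrderLine", "parent": "Order", "min": 1, "max": 3}]
conflicts = bound_conflicts(bounds, fks)
```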
The row counts look “off by 1” versus my targets
- The solver uses continuous variables and the script rounds to integers when printing and generating.
- If you need exact integer row counts, consider switching to an integer model (MILP) for `Table.x_actual_rows`.
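Note that rounding can, in edge cases, also nudge a tight constraint out of feasibility. The illustrative numbers below (not from the sample data) show how a feasible continuous pair can violate a cardinality bound after independent rounding:

```python
# With at most 3 children per parent, the continuous pair (3.4, 10.0) is
# feasible (10.0 <= 3 * 3.4), but independent rounding gives 10 > 3 * 3 = 9.
parent, child, max_per = 3.4, 10.0, 3

continuous_ok = child <= max_per * parent
rounded_ok = round(child) <= max_per * round(parent)
```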
I expected larger output tables
- The script prints only a preview of generated data using `df.head(3)`.
- To export full tables, add a `df.to_csv(...)` step after generation.
The generated tables don’t include all schema columns
- The included generator is intentionally small and only produces a subset of the columns described in `testgen_schema.csv`.
- Extend `generate_test_data(...)` to populate additional columns (including nullable fields, value domains, and min/max bounds).
What this template is for
Generating realistic test databases requires careful coordination of row counts across tables. Foreign key relationships, cardinality bounds, and coverage requirements create complex interdependencies — for example, if each order must have 1-3 line items, then 800 orders can have at most 2,400 order lines. Manually choosing row counts that satisfy all these constraints while staying close to target sizes is tedious and error-prone.
This template uses prescriptive reasoning to find optimal row counts for each table in a test database. It encodes referential integrity constraints, cardinality bounds, mandatory participation rules, and coverage requirements as a linear program, then minimizes weighted deviation from target row counts. Higher-priority tables stay closer to their targets.
After determining row counts, the script generates actual test data records with realistic values — emails, dates, prices, and proper foreign key references — giving you a complete, constraint-consistent test dataset.
Who this is for
- QA engineers who need realistic test databases with consistent referential integrity
- Database developers building test fixtures for integration testing
- Data engineers validating ETL pipelines with controlled synthetic data
What you’ll build
- A linear programming model that balances row count targets against schema constraints
- Referential integrity constraints encoding foreign key cardinality bounds
- Mandatory participation and coverage requirements
- A data generation phase that produces actual CSV-ready test records
What’s included
- `test_data_generation.py` — Main script for row count optimization and data generation
- `data/testgen_schema.csv` — Table and column definitions with types and foreign keys
- `data/testgen_constraints.csv` — Cardinality bounds, coverage, and participation rules
- `data/testgen_targets.csv` — Target, minimum, and maximum row counts per table with priorities
- `pyproject.toml` — Python package configuration
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
Quickstart
1. Download ZIP:

   ```shell
   curl -O https://docs.relational.ai/templates/zips/v1/test_data_generation.zip
   unzip test_data_generation.zip
   cd test_data_generation
   ```

2. Create venv:

   ```shell
   python -m venv .venv
   source .venv/bin/activate
   python -m pip install --upgrade pip
   ```

3. Install:

   ```shell
   python -m pip install .
   ```

4. Configure:

   ```shell
   rai init
   ```

5. Run:

   ```shell
   python test_data_generation.py
   ```

6. Expected output:

   ```
   Status: OPTIMAL
   Total weighted deviation: 1200.00

   Optimal row counts:
     Customer: 100 rows (target: 100, deviation: 0)
     Product: 500 rows (target: 500, deviation: 0)
     Order: 800 rows (target: 800, deviation: 0)
     OrderLine: 2400 rows (target: 5000, deviation: 2600)
     Supplier: 20 rows (target: 20, deviation: 0)
     SupplierProduct: 1500 rows (target: 1500, deviation: 0)

   ==================================================
   GENERATED TEST DATA
   ==================================================

   Customer: 100 rows
   customer_id  email                 region  created_date
   1            user1_abcd@gmail.com  North   2020-03-15

   Product: 500 rows
   product_id  name                category  price
   1           Electronics Item 1  Clothing  42.99

   Order: 800 rows
   order_id  customer_id  order_date  status
   1         47           2023-06-12  delivered
   ```
Template structure
```
.
├── README.md
├── pyproject.toml
├── test_data_generation.py
└── data/
    ├── testgen_constraints.csv
    ├── testgen_schema.csv
    └── testgen_targets.csv
```

How it works
1. Load schema metadata
The script reads three CSV files describing the database schema: table definitions with column types and foreign keys, constraints (cardinality bounds, coverage, mandatory participation), and row count targets with priorities:
```python
schema_df = read_csv(data_dir / "testgen_schema.csv")
constraints_df = read_csv(data_dir / "testgen_constraints.csv")
targets_df = read_csv(data_dir / "testgen_targets.csv")
```

2. Define decision variables
Each table gets two decision variables — actual row count and deviation from target:
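The deviation variable is the standard linearization of the absolute difference |actual − target|: with `dev >= actual - target` and `dev >= target - actual`, minimizing drives `dev` down to exactly the absolute difference. A plain-Python illustration (not template code):

```python
def min_feasible_deviation(actual: float, target: float) -> float:
    # Smallest dev satisfying both dev >= actual - target
    # and dev >= target - actual, i.e. |actual - target|.
    return max(actual - target, target - actual)

print(min_feasible_deviation(2400, 5000))  # 2600
```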
```python
Table.x_actual_rows = Property(f"{Table} has actual {Float:actual_rows}")
Table.x_deviation = Property(f"{Table} has {Float:deviation}")

problem.solve_for(Table.x_actual_rows, name=["n", Table.table_name], lower=Table.min_rows, upper=Table.max_rows)
problem.solve_for(Table.x_deviation, name=["dev", Table.table_name], lower=0)
```

3. Encode referential integrity constraints
Foreign key relationships impose bounds linking child and parent row counts. For example, if each order has 1-3 order lines, the OrderLine count must be between 1x and 3x the Order count:
```python
ParentTable = Table.ref()
problem.satisfy(model.require(
    Table.x_actual_rows <= ParentTable.x_actual_rows * fk_info['max']
).where(
    Table.table_name == child_name,
    ParentTable.table_name == parent_name
))
```

4. Minimize weighted deviation
The objective minimizes deviation from targets, weighted by priority so that critical tables stay closer to their targets:
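For intuition, a model with the same shape can be reproduced standalone with an off-the-shelf LP solver. This sketch assumes SciPy is installed, uses two made-up tables (Order, OrderLine) whose numbers mirror the sample output, and equal weights for simplicity; it is not part of the template:

```python
from scipy.optimize import linprog

# Variables: x = [n_order, n_line, dev_order, dev_line]
c = [0, 0, 1, 1]  # minimize dev_order + dev_line (equal weights)

A_ub = [
    [ 1,  0, -1,  0],   # n_order - dev_order <= 800    (dev >= n - target)
    [-1,  0, -1,  0],   # -n_order - dev_order <= -800  (dev >= target - n)
    [ 0,  1,  0, -1],   # dev_line >= n_line - 5000
    [ 0, -1,  0, -1],   # dev_line >= 5000 - n_line
    [-3,  1,  0,  0],   # n_line <= 3 * n_order (at most 3 lines per order)
    [ 1, -1,  0,  0],   # n_line >= 1 * n_order (at least 1 line per order)
]
b_ub = [800, -800, 5000, -5000, 0, 0]

bounds = [(100, 800), (0, 10_000), (0, None), (0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x[:2], res.fun)  # ~[800, 2400], total deviation 2600
```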
```python
total_deviation = rai_sum(Table.x_deviation * (11 - Table.priority))
problem.minimize(total_deviation)
```

5. Generate test records
After solving, the script generates actual data records with realistic values using the computed row counts:
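As a sketch of what such a generator can look like (column names and value choices here are illustrative, not the template's actual `generate_test_data` implementation), note the fixed seed, which makes the dataset reproducible across runs:

```python
import random

def generate_customers(n: int, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)  # fixed seed -> reproducible test data
    regions = ["North", "South", "East", "West"]
    return [
        {
            "customer_id": i,  # sequential PKs keep FK references easy to satisfy
            "email": f"user{i}_{rng.randrange(16**4):04x}@example.com",
            "region": rng.choice(regions),
        }
        for i in range(1, n + 1)
    ]

rows = generate_customers(3)
print(rows[0]["customer_id"], rows[0]["region"])
```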
```python
generated_data = generate_test_data(row_counts)
```

Customize this template
- Add your own schema by editing the three CSV files to match your database structure with its constraints and targets.
- Adjust the scale factor by changing `scale_factor` to proportionally scale all targets up or down.
- Extend data generation to produce additional column types or more realistic value distributions for your domain.
- Add uniqueness constraints, or weight the foreign key sampling so generated references follow realistic distributions rather than a uniform random draw.
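One way to get skewed foreign key references is Zipf-style weighted sampling of parent keys, so a few "hot" parents receive most child references. A hedged sketch (the function name and `skew` parameter are illustrative, not part of the template):

```python
import random

def weighted_parent_ids(parent_ids: list[int], n_children: int,
                        seed: int = 7, skew: float = 1.2) -> list[int]:
    """Sample FK values with a Zipf-like skew instead of uniformly."""
    rng = random.Random(seed)
    # Earlier ids get larger weights: weight ~ 1 / rank^skew
    weights = [1.0 / (rank + 1) ** skew for rank in range(len(parent_ids))]
    return rng.choices(parent_ids, weights=weights, k=n_children)

refs = weighted_parent_ids(list(range(1, 101)), 1000)
print(len(refs))  # 1000 FK values, all drawn from the 100 parent ids
```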
Troubleshooting
Solver returns INFEASIBLE
The constraints are contradictory — for example, a cardinality bound may require more child rows than the maximum allows. Review `testgen_constraints.csv` and check that the min/max row ranges in `testgen_targets.csv` are compatible with the cardinality bounds. Widen the min/max row ranges or relax cardinality constraints.
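A small pre-solve sanity check can point at the offending rows before the solver runs. A sketch with assumed dict/tuple shapes standing in for the two CSVs (the real files' column names may differ):

```python
def find_conflicts(row_ranges: dict[str, tuple[int, int]],
                   cardinalities: list[tuple[str, str, int, int]]) -> list[str]:
    """row_ranges: table -> (min_rows, max_rows);
    cardinalities: (child, parent, min_per_parent, max_per_parent) rows."""
    problems = []
    for child, parent, lo, hi in cardinalities:
        c_min, c_max = row_ranges[child]
        p_min, p_max = row_ranges[parent]
        if c_min > p_max * hi:  # child's floor exceeds what parents can support
            problems.append(f"{child}.min_rows={c_min} > {parent}.max_rows*{hi}")
        if lo > 0 and c_max < p_min * lo:  # mandatory children don't fit
            problems.append(f"{child}.max_rows={c_max} < {parent}.min_rows*{lo}")
    return problems

# 3,000 minimum OrderLines can't fit under 800 Orders * 3 lines each:
print(find_conflicts({"Order": (0, 800), "OrderLine": (3000, 10_000)},
                     [("OrderLine", "Order", 1, 3)]))
```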
Large deviations from target row counts
This is expected when constraints are tight. For example, if each order has at most 3 order lines and there are 800 orders, the OrderLine count cannot exceed 2,400 even if the target is 5,000. Review the constraints to understand which ones are binding.
ModuleNotFoundError: No module named 'relationalai'
Make sure you activated the virtual environment and ran `python -m pip install .` to install all dependencies listed in `pyproject.toml`.
Connection or authentication errors
Run `rai init` to configure your Snowflake connection. Verify that your account has the RAI Native App installed and that your user has the required permissions.