Test Data Generation
Determine feasible row counts for test database tables that satisfy schema constraints, then generate example synthetic rows.
What this template is for
QA and data engineering teams often need synthetic datasets that are large enough for realistic testing, but still satisfy core schema constraints like foreign keys and parent–child cardinality rules. Purely random generation tends to break referential integrity (for example, order lines pointing at non-existent orders) or produce unrealistic table size ratios.
This template uses RelationalAI’s prescriptive reasoning (optimization) capabilities to choose feasible row counts for each table that are as close as possible to your desired targets, subject to schema-derived constraints.
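To see the kind of tension the solver resolves, consider a hypothetical example (illustrative numbers, not the shipped sample data): if each order may have at most 3 line items, then 800 orders cap the line-item table at 2,400 rows, so a 5,000-row target cannot be met and the best feasible count sits at the cap:

```python
# Hypothetical cardinality check: child rows are capped by parent rows times
# the per-parent maximum, regardless of the stated target.
orders = 800
max_lines_per_order = 3
target_order_lines = 5000

feasible_order_lines = min(target_order_lines, orders * max_lines_per_order)
```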
Who this is for
- QA engineers and test automation developers who need repeatable datasets.
- Data engineers who need scalable dev/load-test datasets while maintaining realistic relationships.
- Anyone comfortable reading CSV schemas and running a Python script.
What you’ll build
- A linear program (LP) that chooses per-table row counts within min/max bounds.
- Constraints that link child and parent table sizes based on foreign keys plus optional cardinality/coverage metadata.
- A priority-weighted objective that keeps important tables closer to their targets.
- A seeded, procedural generator that prints a small sample of synthetic rows.
What’s included
- Model + solve script: `test_data_generation.py`
- Sample data: `data/testgen_schema.csv`, `data/testgen_constraints.csv`, `data/testgen_targets.csv`
- Outputs: solver status + objective, per-table row counts, and a preview of generated rows printed to stdout
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
Quickstart
Follow these steps to run the template with the included sample data.
1. Download the ZIP file for this template and extract it:

   ```sh
   curl -O https://private.relational.ai/templates/zips/v0.13/test_data_generation.zip
   unzip test_data_generation.zip
   cd test_data_generation
   ```

2. Create and activate a virtual environment:

   ```sh
   python -m venv .venv
   source .venv/bin/activate
   python -m pip install -U pip
   ```

3. Install dependencies:

   ```sh
   python -m pip install .
   ```

4. Configure Snowflake connection and RAI profile:

   ```sh
   rai init
   ```

5. Run the template:

   ```sh
   python test_data_generation.py
   ```

6. Expected output (abridged):

   ```
   Status: OPTIMAL
   Total weighted deviation: 18000.00
   Optimal row counts:
   Customer: 100 rows (target: 100, deviation: 0)
   Product: 500 rows (target: 500, deviation: 0)
   ...
   ==================================================
   GENERATED TEST DATA
   ==================================================
   Customer: 100 rows
   customer_id email region created_date
   ...
   ```
Template structure
```
.
├─ README.md
├─ pyproject.toml
├─ test_data_generation.py   # main runner / entrypoint
└─ data/                     # sample input CSVs
   ├─ testgen_constraints.csv
   ├─ testgen_schema.csv
   └─ testgen_targets.csv
```

Start here: `test_data_generation.py`
Sample data
All sample inputs live under data/.
testgen_targets.csv
Defines target row counts, min/max bounds, and per-table priorities.
| Column | Meaning |
|---|---|
| `table_name` | Table identifier used throughout the template |
| `target_rows` | Desired number of rows |
| `min_rows` | Minimum allowed rows |
| `max_rows` | Maximum allowed rows |
| `priority` | Priority used in the objective (1 is highest priority) |
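To make the format concrete, here is a small sketch that parses a couple of hypothetical rows following these headers (the table names and values are illustrative, not the shipped sample data):

```python
import csv
import io

# Hypothetical rows following the documented headers (not the shipped sample).
sample_targets = """table_name,target_rows,min_rows,max_rows,priority
Customer,100,50,200,1
OrderLine,5000,1000,10000,3
"""

# Each row parses into a dict keyed by the header names above.
targets = list(csv.DictReader(io.StringIO(sample_targets)))
```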
testgen_schema.csv
Describes tables and columns, including which columns are foreign keys and which parent table they reference. The solver uses this file to discover table pairs that should be linked by constraints.
| Column | Meaning |
|---|---|
| `table_name` | Table containing the column |
| `column_name` | Column name |
| `column_type` | Data type label (for example, int, string, date) |
| `is_primary_key` | Whether the column is a primary key |
| `is_foreign_key` | Whether the column is a foreign key |
| `references_table` | Parent table name (for foreign keys) |
| `references_column` | Parent column name (for foreign keys) |
| `is_unique` | Whether values must be unique |
| `is_nullable` | Whether null values are allowed |
| `min_value` | Optional numeric lower bound (when present) |
| `max_value` | Optional numeric upper bound (when present) |
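As a sketch of how foreign keys are expressed in this file, the snippet below parses two hypothetical schema rows (illustrative names, not the shipped sample) and filters for foreign-key columns the way the script does:

```python
import csv
import io

# Hypothetical schema rows (not the shipped sample): a primary key plus a
# foreign key on OrderLine.order_id referencing Order.order_id.
sample_schema = """table_name,column_name,column_type,is_primary_key,is_foreign_key,references_table,references_column,is_unique,is_nullable,min_value,max_value
OrderLine,line_id,int,True,False,,,True,False,,
OrderLine,order_id,int,False,True,Order,order_id,False,False,,
"""

rows = list(csv.DictReader(io.StringIO(sample_schema)))
# DictReader yields strings, so the boolean flag is compared as text here.
fk_rows = [r for r in rows if r["is_foreign_key"] == "True"]
```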
testgen_constraints.csv
Adds constraint metadata that can create tension with targets. In this template, the solver reads cardinality-style metadata per foreign-key table pair and (optionally) coverage percentages.
| Column | Meaning |
|---|---|
| `constraint_type` | Constraint kind (for example, cardinality_bound, mandatory_participation, coverage, frequency, value_domain) |
| `table_name` | Child/source table |
| `column_name` | Column the constraint applies to |
| `related_table` | Related/parent table (when applicable) |
| `related_column` | Related/parent column (when applicable) |
| `min_value` | Minimum value for the constraint (when applicable) |
| `max_value` | Maximum value for the constraint (when applicable) |
| `percentage` | Coverage percent (when applicable) |
| `description` | Human-readable summary |
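The sketch below shows two hypothetical constraint rows in this format (the values are illustrative, not the shipped sample): a cardinality bound and a coverage percentage, as the solver would read them:

```python
import csv
import io

# Hypothetical constraint rows (not the shipped sample data).
sample_constraints = """constraint_type,table_name,column_name,related_table,related_column,min_value,max_value,percentage,description
cardinality_bound,OrderLine,order_id,Order,order_id,1,3,,Each order has 1 to 3 line items
coverage,Order,customer_id,Customer,customer_id,,,80,Order rows must reach 80% of Customer rows
"""

constraints = list(csv.DictReader(io.StringIO(sample_constraints)))
```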
Model overview
The optimization model is built around a single concept (Table) populated from testgen_targets.csv, plus two decision properties used by the solver.
- Key entity: `Table`
- Inputs (loaded): `Table.table_name`, `Table.target_rows`, `Table.min_rows`, `Table.max_rows`, `Table.priority`
- Decision variables: `Table.x_actual_rows` (solver-chosen row count, continuous) and `Table.x_deviation` (auxiliary variable used to model the absolute deviation from the target)
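The role of `Table.x_deviation` can be sketched in plain Python. The model imposes `dev >= x - t` and `dev >= t - x` and then minimizes `dev`, which drives it down to `|x - t|` without needing a nonlinear `abs()` term:

```python
# The smallest dev satisfying both dev >= x - t and dev >= t - x is the
# larger of the two right-hand sides, which equals |x - t|.
def smallest_feasible_dev(x: float, t: float) -> float:
    return max(x - t, t - x)
```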
How it works
This section walks through the highlights in test_data_generation.py.
Import libraries and configure inputs
First, the script sets up a data directory and an optional scaling knob for the target sizes:
```python
DATA_DIR = Path(__file__).parent / "data"
SCALE_FACTOR = 1.0
```
```python
# Disable pandas inference of string types. This ensures that string columns
# in the CSVs are loaded as object dtype. This is only required when using
# relationalai versions prior to v1.0.
pd.options.future.infer_string = False
```

Load CSV inputs and build the model container
Next, it creates a Semantics Model, loads the three CSV inputs, and scales targets/bounds with SCALE_FACTOR:
```python
# Create a Semantics model container.
model = Model(
    f"test_data_generation_{time_ns()}",
    config=globals().get("config", None),
    use_lqp=False,
)
```
```python
# Load schema data from CSV.
schema_df = read_csv(DATA_DIR / "testgen_schema.csv")

# Load constraint data from CSV.
constraints_df = read_csv(DATA_DIR / "testgen_constraints.csv")

# Load row-count targets from CSV.
targets_df = read_csv(DATA_DIR / "testgen_targets.csv")
```
```python
# Scale row-count targets and bounds.
targets_df = targets_df.copy()
targets_df["target_rows"] = (targets_df["target_rows"] * SCALE_FACTOR).astype(int)
targets_df["min_rows"] = (targets_df["min_rows"] * SCALE_FACTOR).astype(int)
targets_df["max_rows"] = (targets_df["max_rows"] * SCALE_FACTOR).astype(int)
```

Define the Table concept and decision properties
Then it loads one Table entity per row in testgen_targets.csv and adds two solver-chosen properties (actual_rows and deviation):
```python
# Table concept: represents a table in the schema.
Table = model.Concept("Table")

# Load table data from CSV.
data(targets_df).into(Table, keys=["table_name"])

# Table.x_actual_rows decision property: solver-chosen row count per table.
Table.x_actual_rows = model.Property("{Table} has actual {actual_rows:float}")

# Table.x_deviation decision property: absolute deviation from the target.
Table.x_deviation = model.Property("{Table} has {deviation:float}")
```

Extract foreign keys and constraint metadata
With inputs loaded, the script scans testgen_schema.csv for foreign keys and joins in any per-table-pair metadata (like cardinality bounds and coverage percentages) from testgen_constraints.csv:
```python
# Extract foreign key relationships from the schema.
fk_df = schema_df[schema_df["is_foreign_key"] == True].copy()
cardinality_constraints = constraints_df[
    constraints_df["constraint_type"].isin(
        ["cardinality_bound", "mandatory_participation", "frequency"]
    )
]

fk_objs = []
for _, fk_row in fk_df.iterrows():
    child_table = fk_row["table_name"]
    parent_table = fk_row["references_table"]

    card = cardinality_constraints[
        (cardinality_constraints["table_name"] == child_table)
        & (cardinality_constraints["related_table"] == parent_table)
    ]

    min_per = 1
    max_per = 100

    for _, c in card.iterrows():
        if c["constraint_type"] == "cardinality_bound":
            min_per = int(c["min_value"]) if not pd.isna(c["min_value"]) else 1
            max_per = int(c["max_value"]) if not pd.isna(c["max_value"]) else 100
        elif c["constraint_type"] == "frequency":
            max_per = int(c["max_value"]) if not pd.isna(c["max_value"]) else 100

    coverage = constraints_df[
        (constraints_df["constraint_type"] == "coverage")
        & (constraints_df["table_name"] == child_table)
        & (constraints_df["related_table"] == parent_table)
    ]
    coverage_pct = 0.0
    if len(coverage) > 0:
        coverage_pct = float(coverage.iloc[0]["percentage"]) / 100.0

    fk_objs.append(
        {
            "child": child_table,
            "parent": parent_table,
            "min": min_per,
            "max": max_per,
            "coverage": coverage_pct,
        }
    )
```

Define variables, constraints, and objective
With the feasible region defined at the table level, the script declares two variable families and linearizes the absolute deviation from each target:
```python
s = SolverModel(model, "cont", use_pb=True)

# Variable: actual row counts for each table
s.solve_for(
    Table.x_actual_rows,
    name=["n", Table.table_name],
    lower=Table.min_rows,
    upper=Table.max_rows,
)

# Variable: deviation from target (for objective)
s.solve_for(
    Table.x_deviation,
    name=["dev", Table.table_name],
    lower=0,
)

# Constraint: deviation captures |actual - target| (linearized)
s.satisfy(require(Table.x_deviation >= Table.x_actual_rows - Table.target_rows))
s.satisfy(require(Table.x_deviation >= Table.target_rows - Table.x_actual_rows))
```

Then it links child and parent table sizes using `where(...).require(...)` for each discovered foreign key pair (including optional mandatory participation and coverage constraints):
```python
# Constraint: referential integrity - child rows bounded by parent capacity.
# These constraints link specific table pairs via their x_actual_rows variables.
Table2 = Table.ref()
for fk_info in fk_objs:
    child_name = fk_info["child"]
    parent_name = fk_info["parent"]

    # Upper bound: can't have more children than max per parent.
    s.satisfy(
        where(
            Table.table_name == child_name, Table2.table_name == parent_name
        ).require(Table.x_actual_rows <= Table2.x_actual_rows * fk_info["max"])
    )

    # Lower bound for mandatory participation
    mandatory = constraints_df[
        (constraints_df["constraint_type"] == "mandatory_participation")
        & (constraints_df["table_name"] == child_name)
        & (constraints_df["related_table"] == parent_name)
    ]
    if len(mandatory) > 0:
        min_per = (
            int(mandatory.iloc[0]["min_value"])
            if not pd.isna(mandatory.iloc[0]["min_value"])
            else 1
        )
        s.satisfy(
            where(
                Table.table_name == child_name, Table2.table_name == parent_name
            ).require(Table.x_actual_rows >= Table2.x_actual_rows * min_per)
        )

# Constraint: coverage requirements
for fk_info in fk_objs:
    if fk_info["coverage"] > 0:
        child_name = fk_info["child"]
        parent_name = fk_info["parent"]
        s.satisfy(
            where(
                Table.table_name == child_name, Table2.table_name == parent_name
            ).require(Table.x_actual_rows >= fk_info["coverage"] * Table2.x_actual_rows)
        )
```

Finally, it minimizes a priority-weighted deviation (higher priority means a higher penalty for missing the target):
```python
# Objective: minimize weighted deviation from targets
# Weight by priority (higher priority = more important to match target)
total_deviation = rai_sum(Table.x_deviation * (11 - Table.priority))
s.minimize(total_deviation)
```

Solve and generate example rows
The script solves with the HiGHS backend and prints a summary:
```python
solver = Solver("highs")
s.solve(solver, time_limit_sec=60)

print(f"Status: {s.termination_status}")
print(f"Total weighted deviation: {s.objective_value:.2f}")
```

Then it rounds the continuous solution to integers for reporting and for driving the generator:

```python
# Extract row counts
row_counts = {}
results_df = select(Table.table_name, Table.x_actual_rows, Table.target_rows).to_df()
for _, row in results_df.iterrows():
    actual = int(round(row["actual_rows"]))
    target = int(row["target_rows"])
    row_counts[row["table_name"]] = {"actual": actual, "target": target}
```

To close the loop, it generates a small set of synthetic records and prints a preview of each table:

```python
# Generate data and show samples.
print("\n" + "=" * 50)
print("GENERATED TEST DATA")
print("=" * 50)

generated_data = generate_test_data(row_counts)
for table, df in generated_data.items():
    print(f"\n{table}: {len(df)} rows")
    print(df.head(3).to_string(index=False))
```

Customize this template
Use your own schema and constraints
- Replace the files in `data/` with your own versions that follow the same headers.
- Start by updating `testgen_targets.csv` (targets + bounds), then ensure foreign keys are reflected in `testgen_schema.csv`.
Tune dataset size quickly
- Change `SCALE_FACTOR` in `test_data_generation.py` to scale all targets/bounds at once.
Adjust priorities
- If some tables must match targets more closely than others, change `priority` in `testgen_targets.csv`. Higher priority tables are penalized more heavily in the objective.
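Since the objective weights each deviation by `11 - priority`, the effect of a priority change is easy to reason about; this small sketch mirrors that formula:

```python
# Mirrors the objective's weighting: weight = 11 - priority, so priority 1
# is penalized ten times as heavily as priority 10.
def deviation_weight(priority: int) -> int:
    return 11 - priority
```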
Export generated data
- The template prints a preview. If you want CSV outputs, add `df.to_csv(...)` calls after generation.
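A minimal sketch of such an export step, assuming (as in the template) that `generate_test_data` returns a dict mapping table names to pandas DataFrames — the `export_tables` helper and `out_dir` layout below are illustrative, not part of the template:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sketch only: write each generated table to its own CSV file.
def export_tables(generated_data: dict, out_dir: str) -> list:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for table, df in generated_data.items():
        path = out / f"{table}.csv"
        df.to_csv(path, index=False)
        written.append(path)
    return written

# Tiny demo with a stand-in table.
demo = {"Customer": pd.DataFrame({"customer_id": [1, 2]})}
written = export_tables(demo, tempfile.mkdtemp())
```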
Troubleshooting
I get ModuleNotFoundError when running the script
- Confirm you activated your virtual environment: `source .venv/bin/activate`.
- Reinstall dependencies from the template root: `python -m pip install .`.
rai init fails or the script can’t authenticate
- Re-run `rai init` and confirm your Snowflake account has the RAI Native App installed.
- If you use multiple profiles, set `RAI_PROFILE` before running: `export RAI_PROFILE=<profile>`.
- Verify your Snowflake role/warehouse has access to the RelationalAI Native App.
The script fails reading CSV files (missing file or columns)
- Confirm you are running from the template folder that contains `data/`.
- Verify the input filenames exist: `testgen_schema.csv`, `testgen_constraints.csv`, `testgen_targets.csv`.
- Check that the CSV headers match the expected columns documented above.
The solver returns Status: INFEASIBLE
- This usually indicates the target bounds and the parent-child constraints cannot be satisfied at the same time.
- Start by widening `min_rows`/`max_rows` in `testgen_targets.csv` for the most constrained tables.
- Check whether tight constraints (for example, low max cardinality or high frequency minimums) force row counts beyond your bounds.
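One way to narrow down the cause is a pre-solve sanity check on the bounds. The sketch below (not part of the template; `bound_conflicts` is a hypothetical helper) flags child tables whose bounds cannot be reconciled with parent bounds under the extracted cardinality limits:

```python
# bounds: {table: (min_rows, max_rows)}; fk_objs shaped like the dicts the
# script builds ({"child", "parent", "min", "max", ...}).
def bound_conflicts(bounds: dict, fk_objs: list) -> list:
    conflicts = []
    for fk in fk_objs:
        child_min, child_max = bounds[fk["child"]]
        parent_min, parent_max = bounds[fk["parent"]]
        # Even at max parent rows and max children per parent, the child
        # cannot reach its own minimum.
        if child_min > parent_max * fk["max"]:
            conflicts.append((fk["child"], "min_rows exceeds parent capacity"))
        # Even at min parent rows, the minimum children per parent already
        # overshoot the child's maximum.
        if child_max < parent_min * fk["min"]:
            conflicts.append((fk["child"], "max_rows below mandatory minimum"))
    return conflicts

bounds = {"Order": (100, 800), "OrderLine": (3000, 10000)}
fks = [{"child": "OrderLine", "parent": "Order", "min": 1, "max": 3}]
conflicts = bound_conflicts(bounds, fks)
```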
The row counts look “off by 1” versus my targets
- The solver uses continuous variables and the script rounds to integers when printing and generating.
- If you need exact integer row counts, consider switching to an integer model (MILP) for `Table.x_actual_rows`.
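Note that rounding can, in edge cases, also nudge a tight constraint out of feasibility. The illustrative numbers below (not from the sample data) show how a feasible continuous pair can violate a cardinality bound after independent rounding:

```python
# With at most 3 children per parent, the continuous pair (3.4, 10.0) is
# feasible (10.0 <= 3 * 3.4), but independent rounding gives 10 > 3 * 3 = 9.
parent, child, max_per = 3.4, 10.0, 3

continuous_ok = child <= max_per * parent
rounded_ok = round(child) <= max_per * round(parent)
```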
I expected larger output tables
- The script prints only a preview of generated data using `df.head(3)`.
- To export full tables, add a `df.to_csv(...)` step after generation.
The generated tables don’t include all schema columns
- The included generator is intentionally small and only produces a subset of the columns described in `testgen_schema.csv`.
- Extend `generate_test_data(...)` to populate additional columns (including nullable fields, value domains, and min/max bounds).
What this template is for
Generating realistic test databases requires careful coordination of row counts across tables. Foreign key relationships, cardinality bounds, and coverage requirements create complex interdependencies — for example, if each order must have 1-3 line items, then 800 orders can have at most 2,400 order lines. Manually choosing row counts that satisfy all these constraints while staying close to target sizes is tedious and error-prone.
This template uses prescriptive reasoning to find optimal row counts for each table in a test database. It encodes referential integrity constraints, cardinality bounds, mandatory participation rules, and coverage requirements as a linear program, then minimizes weighted deviation from target row counts. Higher-priority tables stay closer to their targets.
After determining row counts, the script generates actual test data records with realistic values — emails, dates, prices, and proper foreign key references — giving you a complete, constraint-consistent test dataset.
Who this is for
- QA engineers who need realistic test databases with consistent referential integrity
- Database developers building test fixtures for integration testing
- Data engineers validating ETL pipelines with controlled synthetic data
What you’ll build
- A linear programming model that balances row count targets against schema constraints
- Referential integrity constraints encoding foreign key cardinality bounds
- Mandatory participation and coverage requirements
- A data generation phase that produces actual CSV-ready test records
What’s included
- `test_data_generation.py` — Main script for row count optimization and data generation
- `data/testgen_schema.csv` — Table and column definitions with types and foreign keys
- `data/testgen_constraints.csv` — Cardinality bounds, coverage, and participation rules
- `data/testgen_targets.csv` — Target, minimum, and maximum row counts per table with priorities
- `pyproject.toml` — Python package configuration
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
Quickstart
1. Download ZIP:

   ```shell
   curl -O https://docs.relational.ai/templates/zips/v1/test_data_generation.zip
   unzip test_data_generation.zip
   cd test_data_generation
   ```

2. Create venv:

   ```shell
   python -m venv .venv
   source .venv/bin/activate
   python -m pip install --upgrade pip
   ```

3. Install:

   ```shell
   python -m pip install .
   ```

4. Configure:

   ```shell
   rai init
   ```

5. Run:

   ```shell
   python test_data_generation.py
   ```

6. Expected output:

   ```
   Status: OPTIMAL
   Total weighted deviation: 1200.00

   Optimal row counts:
     Customer: 100 rows (target: 100, deviation: 0)
     Product: 500 rows (target: 500, deviation: 0)
     Order: 800 rows (target: 800, deviation: 0)
     OrderLine: 2400 rows (target: 5000, deviation: 2600)
     Supplier: 20 rows (target: 20, deviation: 0)
     SupplierProduct: 1500 rows (target: 1500, deviation: 0)

   ==================================================
   GENERATED TEST DATA
   ==================================================

   Customer: 100 rows
   customer_id  email                 region  created_date
   1            user1_abcd@gmail.com  North   2020-03-15

   Product: 500 rows
   product_id  name                category  price
   1           Electronics Item 1  Clothing  42.99

   Order: 800 rows
   order_id  customer_id  order_date  status
   1         47           2023-06-12  delivered
   ```
Template structure
```
.
├── README.md
├── pyproject.toml
├── test_data_generation.py
└── data/
    ├── testgen_constraints.csv
    ├── testgen_schema.csv
    └── testgen_targets.csv
```

How it works
1. Load schema metadata
The script reads three CSV files describing the database schema: table definitions with column types and foreign keys, constraints (cardinality bounds, coverage, mandatory participation), and row count targets with priorities:
```python
schema_df = read_csv(data_dir / "testgen_schema.csv")
constraints_df = read_csv(data_dir / "testgen_constraints.csv")
targets_df = read_csv(data_dir / "testgen_targets.csv")
```

2. Define decision variables
Each table gets two decision variables — actual row count and deviation from target:
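The deviation variable is the standard linearization of the absolute difference |actual − target|: with `dev >= actual - target` and `dev >= target - actual`, minimizing drives `dev` down to exactly the absolute difference. A plain-Python illustration (not template code):

```python
def min_feasible_deviation(actual: float, target: float) -> float:
    # Smallest dev satisfying both dev >= actual - target
    # and dev >= target - actual, i.e. |actual - target|.
    return max(actual - target, target - actual)

print(min_feasible_deviation(2400, 5000))  # 2600
```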
```python
Table.x_actual_rows = Property(f"{Table} has actual {Float:actual_rows}")
Table.x_deviation = Property(f"{Table} has {Float:deviation}")

problem.solve_for(Table.x_actual_rows, name=["n", Table.table_name], lower=Table.min_rows, upper=Table.max_rows)
problem.solve_for(Table.x_deviation, name=["dev", Table.table_name], lower=0)
```

3. Encode referential integrity constraints
Foreign key relationships impose bounds linking child and parent row counts. For example, if each order has 1-3 order lines, the OrderLine count must be between 1x and 3x the Order count:
```python
ParentTable = Table.ref()
problem.satisfy(model.require(
    Table.x_actual_rows <= ParentTable.x_actual_rows * fk_info['max']
).where(
    Table.table_name == child_name,
    ParentTable.table_name == parent_name
))
```

4. Minimize weighted deviation
The objective minimizes deviation from targets, weighted by priority so that critical tables stay closer to their targets:
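For intuition, a model with the same shape can be reproduced standalone with an off-the-shelf LP solver. This sketch assumes SciPy is installed, uses two made-up tables (Order, OrderLine) whose numbers mirror the sample output, and equal weights for simplicity; it is not part of the template:

```python
from scipy.optimize import linprog

# Variables: x = [n_order, n_line, dev_order, dev_line]
c = [0, 0, 1, 1]  # minimize dev_order + dev_line (equal weights)

A_ub = [
    [ 1,  0, -1,  0],   # n_order - dev_order <= 800    (dev >= n - target)
    [-1,  0, -1,  0],   # -n_order - dev_order <= -800  (dev >= target - n)
    [ 0,  1,  0, -1],   # dev_line >= n_line - 5000
    [ 0, -1,  0, -1],   # dev_line >= 5000 - n_line
    [-3,  1,  0,  0],   # n_line <= 3 * n_order (at most 3 lines per order)
    [ 1, -1,  0,  0],   # n_line >= 1 * n_order (at least 1 line per order)
]
b_ub = [800, -800, 5000, -5000, 0, 0]

bounds = [(100, 800), (0, 10_000), (0, None), (0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x[:2], res.fun)  # ~[800, 2400], total deviation 2600
```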
```python
total_deviation = rai_sum(Table.x_deviation * (11 - Table.priority))
problem.minimize(total_deviation)
```

5. Generate test records
After solving, the script generates actual data records with realistic values using the computed row counts:
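As a sketch of what such a generator can look like (column names and value choices here are illustrative, not the template's actual `generate_test_data` implementation), note the fixed seed, which makes the dataset reproducible across runs:

```python
import random

def generate_customers(n: int, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)  # fixed seed -> reproducible test data
    regions = ["North", "South", "East", "West"]
    return [
        {
            "customer_id": i,  # sequential PKs keep FK references easy to satisfy
            "email": f"user{i}_{rng.randrange(16**4):04x}@example.com",
            "region": rng.choice(regions),
        }
        for i in range(1, n + 1)
    ]

rows = generate_customers(3)
print(rows[0]["customer_id"], rows[0]["region"])
```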
```python
generated_data = generate_test_data(row_counts)
```

Customize this template
- Add your own schema by editing the three CSV files to match your database structure with its constraints and targets.
- Adjust the scale factor by changing `scale_factor` to proportionally scale all targets up or down.
- Extend data generation to produce additional column types or more realistic value distributions for your domain.
- Add uniqueness constraints, or weight the foreign key sampling so generated references follow realistic distributions rather than a uniform random draw.
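One way to get skewed foreign key references is Zipf-style weighted sampling of parent keys, so a few "hot" parents receive most child references. A hedged sketch (the function name and `skew` parameter are illustrative, not part of the template):

```python
import random

def weighted_parent_ids(parent_ids: list[int], n_children: int,
                        seed: int = 7, skew: float = 1.2) -> list[int]:
    """Sample FK values with a Zipf-like skew instead of uniformly."""
    rng = random.Random(seed)
    # Earlier ids get larger weights: weight ~ 1 / rank^skew
    weights = [1.0 / (rank + 1) ** skew for rank in range(len(parent_ids))]
    return rng.choices(parent_ids, weights=weights, k=n_children)

refs = weighted_parent_ids(list(range(1, 101)), 1000)
print(len(refs))  # 1000 FK values, all drawn from the 100 parent ids
```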
Troubleshooting
Solver returns INFEASIBLE
The constraints are contradictory — for example, a cardinality bound may require more child rows than the maximum allows. Review `testgen_constraints.csv` and check that the min/max row ranges in `testgen_targets.csv` are compatible with the cardinality bounds. Widen the min/max row ranges or relax cardinality constraints.
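A small pre-solve sanity check can point at the offending rows before the solver runs. A sketch with assumed dict/tuple shapes standing in for the two CSVs (the real files' column names may differ):

```python
def find_conflicts(row_ranges: dict[str, tuple[int, int]],
                   cardinalities: list[tuple[str, str, int, int]]) -> list[str]:
    """row_ranges: table -> (min_rows, max_rows);
    cardinalities: (child, parent, min_per_parent, max_per_parent) rows."""
    problems = []
    for child, parent, lo, hi in cardinalities:
        c_min, c_max = row_ranges[child]
        p_min, p_max = row_ranges[parent]
        if c_min > p_max * hi:  # child's floor exceeds what parents can support
            problems.append(f"{child}.min_rows={c_min} > {parent}.max_rows*{hi}")
        if lo > 0 and c_max < p_min * lo:  # mandatory children don't fit
            problems.append(f"{child}.max_rows={c_max} < {parent}.min_rows*{lo}")
    return problems

# 3,000 minimum OrderLines can't fit under 800 Orders * 3 lines each:
print(find_conflicts({"Order": (0, 800), "OrderLine": (3000, 10_000)},
                     [("OrderLine", "Order", 1, 3)]))
```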
Large deviations from target row counts
This is expected when constraints are tight. For example, if each order has at most 3 order lines and there are 800 orders, the OrderLine count cannot exceed 2,400 even if the target is 5,000. Review the constraints to understand which ones are binding.
ModuleNotFoundError: No module named 'relationalai'
Make sure you activated the virtual environment and ran `python -m pip install .` to install all dependencies listed in `pyproject.toml`.
Connection or authentication errors
Run `rai init` to configure your Snowflake connection. Verify that your account has the RAI Native App installed and that your user has the required permissions.