Smoker Status Prediction
Predict whether a person is a smoker from demographic and medical attributes plus a network of social connections.
What this template is for
Predicting health-related behaviors like smoking status from medical and demographic data is a common tabular machine learning task. In practice, though, these behaviors are also shaped by social context: friends, family, and peers often influence one another. This template demonstrates how to model both individual attributes and social relationships with a Graph Neural Network (GNN), using the RelationalAI Predictive reasoner to train a single end-to-end model.
Who this is for
- Data scientists who want to leverage the relational structure of data stored across connected tables
- ML engineers learning the RelationalAI Predictive reasoner workflow
- Health analytics teams interested in incorporating social or relational structure into predictive models
Assumes familiarity with Python and basic ML concepts (binary classification, train/val/test splits).
What you’ll build
- A graph model where rows of a `PEOPLE` table become nodes and rows of a `RELATED` edge list become edges.
- A `PropertyTransformer` that exposes 17 medical and demographic features (16 continuous + 1 binary category).
- A binary-classification GNN trained to predict each person’s smoking status.
- (Optional) A registered model in the Snowflake Model Registry that can be loaded and reused without retraining.
What’s included
- Scripts:
  - `smoker_status_prediction_local.py`: primary, runnable out of the box. Loads CSVs from `data/` via `model.data()`.
  - `smoker_status_prediction.py`: reference pattern for adapting the same pipeline to Snowflake-hosted tables.
- Sample data (`data/`): bundled demo dataset for end-to-end runs; predictions are illustrative of the GNN methodology, not clinically meaningful (see Sample data for what’s real vs. constructed):
  - `people.csv`: 38,984 individuals with demographic and medical features.
  - `related.csv`: 58,355 connection pairs between people.
  - `train.csv` / `validation.csv` / `test.csv`: labeled splits with smoking status.
Prerequisites
Access
To run the local demo (smoker_status_prediction_local.py) you need a Snowflake account with the RelationalAI Native App. The bundled CSVs in data/ ship with the template; GNN training runs on the RelationalAI engine, so the native app needs USAGE + CREATE EXPERIMENT / CREATE MODEL grants on the experiment schema (see the SQL block in step 5 of the Quickstart).
To adapt to your own Snowflake pipeline (smoker_status_prediction.py as reference) you’ll additionally need the CSVs uploaded to Snowflake tables (or your own schema-equivalent dataset). Quote column names when creating the tables so spaces and parentheses are preserved (e.g. "height(cm)", "fasting blood sugar").
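
As a minimal sketch of the quoting rule above, the helper below builds a `CREATE TABLE` statement from a list of column names, wrapping each name in double quotes so that Snowflake keeps spaces, parentheses, and case exactly as written. The function names `quote_identifier` and `create_table_sql`, and the single `NUMBER` type applied to every column, are assumptions made for illustration; they are not part of the template.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table: str, columns: list[str], col_type: str = "NUMBER") -> str:
    """Build a CREATE TABLE statement with every column name quoted.

    Hypothetical helper: one type is used for all columns to keep the
    sketch short; adjust types per column for real data.
    """
    cols = ", ".join(f"{quote_identifier(c)} {col_type}" for c in columns)
    return f"CREATE OR REPLACE TABLE {table} ({cols});"


header = ["Id", "height(cm)", "fasting blood sugar"]
print(create_table_sql("PEOPLE", header))
```

Quoted identifiers are case-sensitive in Snowflake, so later queries against these tables need the same quoted spelling.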
Tools
- Python >= 3.10
- RelationalAI Python SDK (`relationalai == 1.8`)
Quickstart
1. Download ZIP:

       curl -O https://docs.relational.ai/templates/zips/v1/smoker_status_prediction.zip
       unzip smoker_status_prediction.zip
       cd smoker_status_prediction

2. Create venv:

       python -m venv .venv
       source .venv/bin/activate
       python -m pip install --upgrade pip

3. Install:

       python -m pip install .

4. Configure:

       rai init

   After `rai init` generates the config file, add the following to your `raiconfig.yaml`:

       data:
         ensure_change_tracking: true

5. Grant the RelationalAI Native App access to a schema for experiment artifacts. The local script uses `SMOKER_STATUS_PREDICTION.EXPERIMENTS` by default (or change the constants at the top of the script). Update the `SET` statements below to match your database, schema, and Native App name, then run the following in a Snowflake SQL worksheet:

       SET db_name = 'SMOKER_STATUS_PREDICTION';
       SET schema_experiments = 'SMOKER_STATUS_PREDICTION.EXPERIMENTS';
       SET app_name = 'RELATIONALAI';  -- replace with your app name

       CREATE DATABASE IF NOT EXISTS identifier($db_name);
       CREATE SCHEMA IF NOT EXISTS identifier($schema_experiments);

       GRANT USAGE ON DATABASE identifier($db_name) TO APPLICATION identifier($app_name);
       GRANT USAGE ON SCHEMA identifier($schema_experiments) TO APPLICATION identifier($app_name);
       GRANT CREATE EXPERIMENT ON SCHEMA identifier($schema_experiments) TO APPLICATION identifier($app_name);
       GRANT CREATE MODEL ON SCHEMA identifier($schema_experiments) TO APPLICATION identifier($app_name);

6. Run the local demo on the bundled CSVs:

       python smoker_status_prediction_local.py
Expected output (abbreviated)
Real numbers from a verified end-to-end run on the bundled CSVs (smoker_status_prediction_local.py). Dataset shape: 38,984 People with 58,355 Related edges, split into 31,187 train / 3,898 validation / 3,899 test rows. Exact predicted probabilities shift slightly with numerical noise; metric values are stable within a few hundredths.

    ============================================================
    PREDICTIVE: Smoker status binary classification
    ============================================================

    === Start GNN Training ===
      ✓ Step 1 completed (4.77s)   # prepare trainer configuration
      ✓ Step 2 completed (16.65s)  # prepare dataset and GNN tables
      ✓ Step 3 completed (31.29s)  # submit training job and stream logs
    === GNN Training Complete (52.71s) ===

    === Start GNN Prediction ===
      ✓ Step 1 completed (5.50s)   # prepare test table
      ✓ Step 2 completed (4.85s)   # prepare model for prediction
      ✓ Step 3 completed (31.87s)  # submit prediction job and stream logs
      ✓ Step 4 completed (9.40s)   # load results into the logic engine
    === GNN Prediction Complete (51.62s) ===

    === Smoker predictions (sample) ===
    Id  predicted_label      prob  actual_smoking
     1                1  0.645295               1
    16                0  0.351588               0
    21                0  0.178555               0
    24                0  0.352124               0
    29                1  0.855304               1
    35                1  0.836192               0
    49                1  0.576184               1
    75                0  0.045360               0
    92                1  0.707472               1
    97                1  0.851055               1

    === Test-set metrics (n=3899) ===
      Test-set ROC-AUC:     0.8485
      Test-set accuracy:    0.7671
      Actual class dist:    {0: 0.6327, 1: 0.3673}
      Predicted class dist: {0: 0.6163, 1: 0.3837}

    ============================================================
    Local run complete.
    ============================================================

Adapting to your own Snowflake data
smoker_status_prediction.py shows the same pipeline against Snowflake-hosted tables. The script’s defaults expect five tables — PEOPLE, RELATED, TRAIN, VALIDATION, TEST — inside SMOKER_STATUS_PREDICTION.DATA. To adapt:
1. Create the database and `DATA` schema, and define the five tables. Run this in a Snowflake SQL worksheet. Column names are quoted so spaces and parentheses in feature names (`height(cm)`, `fasting blood sugar`, …) are preserved as-is.

       SET db_name = 'SMOKER_STATUS_PREDICTION';
       SET schema_data = 'SMOKER_STATUS_PREDICTION.DATA';

       CREATE DATABASE IF NOT EXISTS identifier($db_name);
       CREATE SCHEMA IF NOT EXISTS identifier($schema_data);
       USE SCHEMA identifier($schema_data);

       CREATE OR REPLACE TABLE PEOPLE (
         "Id" NUMBER,
         "age" NUMBER,
         "height(cm)" NUMBER,
         "weight(kg)" NUMBER,
         "systolic" NUMBER,
         "relaxation" NUMBER,
         "fasting blood sugar" NUMBER,
         "Cholesterol" NUMBER,
         "triglyceride" NUMBER,
         "HDL" NUMBER,
         "LDL" NUMBER,
         "hemoglobin" FLOAT,
         "Urine protein" NUMBER,
         "serum creatinine" FLOAT,
         "AST" NUMBER,
         "ALT" NUMBER,
         "Gtp" NUMBER,
         "dental caries" NUMBER
       );
       CREATE OR REPLACE TABLE RELATED ("person1" NUMBER, "person2" NUMBER);
       CREATE OR REPLACE TABLE TRAIN ("Id" NUMBER, "smoking" NUMBER);
       CREATE OR REPLACE TABLE VALIDATION ("Id" NUMBER, "smoking" NUMBER);
       CREATE OR REPLACE TABLE TEST ("Id" NUMBER, "smoking" NUMBER);

2. Load the bundled CSVs into the five tables. Two ways:

   - Snowsight UI: For each table, open the table page, click Load Data, and upload the matching CSV from `data/`. The wizard auto-detects headers; confirm that each column maps to the quoted name created above.
   - SnowSQL / Snowflake CLI (reproducible): Stage each CSV via `PUT` and load with `COPY INTO`. Run from a shell pointing at the template’s `data/` directory:

         CREATE OR REPLACE FILE FORMAT csv_with_header
           TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1
           FIELD_OPTIONALLY_ENCLOSED_BY = '"';

         PUT file://data/people.csv @%PEOPLE AUTO_COMPRESS=FALSE OVERWRITE=TRUE;
         PUT file://data/related.csv @%RELATED AUTO_COMPRESS=FALSE OVERWRITE=TRUE;
         PUT file://data/train.csv @%TRAIN AUTO_COMPRESS=FALSE OVERWRITE=TRUE;
         PUT file://data/validation.csv @%VALIDATION AUTO_COMPRESS=FALSE OVERWRITE=TRUE;
         PUT file://data/test.csv @%TEST AUTO_COMPRESS=FALSE OVERWRITE=TRUE;

         COPY INTO PEOPLE FROM @%PEOPLE FILE_FORMAT = (FORMAT_NAME = 'csv_with_header');
         COPY INTO RELATED FROM @%RELATED FILE_FORMAT = (FORMAT_NAME = 'csv_with_header');
         COPY INTO TRAIN FROM @%TRAIN FILE_FORMAT = (FORMAT_NAME = 'csv_with_header');
         COPY INTO VALIDATION FROM @%VALIDATION FILE_FORMAT = (FORMAT_NAME = 'csv_with_header');
         COPY INTO TEST FROM @%TEST FILE_FORMAT = (FORMAT_NAME = 'csv_with_header');

3. Grant the Native App access to the `EXPERIMENTS` schema for GNN artifacts, using the same SQL block as step 5 of the Quickstart.

4. Enable change tracking in your `raiconfig.yaml`. The Snowflake-path script imports tables via `Table(...).to_schema()`, which requires change tracking to be on:

       data:
         ensure_change_tracking: true

   Without it, the SDK warns at startup that “GNN workflows using Snowflake tables will fail without it” and the script will not be able to import the source tables. The local CSV path (`smoker_status_prediction_local.py`) does not need this.

5. Edit the table references at the top of the script if your database/schema names differ from the defaults:

       DATABASE = "YOUR_DB"
       SCHEMA = "YOUR_SCHEMA"            # holds PEOPLE and RELATED
       TASK_SCHEMA = "YOUR_TASK_SCHEMA"  # holds TRAIN, VALIDATION, TEST
       GNN_EXP_DATABASE = "YOUR_DB"
       GNN_EXP_SCHEMA = "EXPERIMENTS"

6. Run:

       python smoker_status_prediction.py
Template structure

    .
    ├── README.md                            # this file
    ├── pyproject.toml                       # dependencies
    ├── smoker_status_prediction_local.py    # primary: CSV-based script
    ├── smoker_status_prediction.py          # reference: Snowflake pipeline
    └── data/
        ├── people.csv                       # 38,984 rows, 18 columns (Id + 17 features)
        ├── related.csv                      # 58,355 (person1, person2) pairs
        ├── train.csv                        # 31,187 (Id, smoking) rows
        ├── validation.csv                   # 3,898 rows
        └── test.csv                         # 3,899 rows

Start here: smoker_status_prediction_local.py (no external setup beyond Snowflake grants). Use smoker_status_prediction.py as the adaptation reference when you wire this pattern into your own Snowflake data.
Sample data
The dataset contains medical and demographic attributes for a population of individuals in the PEOPLE table, along with a RELATED table of connected pairs. Connections were constructed so that linked individuals are more likely to share the same smoking status, giving the network a genuinely informative signal for the GNN.
- people.csv: an `Id` identifier plus 17 medical/demographic features, of which 16 are continuous (`age`, `height(cm)`, `weight(kg)`, blood pressure (`systolic`, `relaxation`), `fasting blood sugar`, lipids (`Cholesterol`, `triglyceride`, `HDL`, `LDL`), `hemoglobin`, urine and kidney markers (`Urine protein`, `serum creatinine`), liver enzymes (`AST`, `ALT`, `Gtp`)) and one is a binary `dental caries` category.
- related.csv: pairs of `Id`s representing connections between persons.
- train / validation / test: split-specific `(Id, smoking)` rows where `smoking ∈ {0, 1}`.
Model overview
Key entities
- People (`Id`): individuals with demographic and medical attributes.
- Related: pairs of people, used as the edge list of the GNN graph.
Pipeline stages

    People + Related (CSVs or Snowflake tables)
      → Build the graph: nodes as persons, self-referential edges People ↔ People (via Related)
      → Configure features: PropertyTransformer (continuous medical features + dental_caries category)
      → Define the task: Train / Validation / Test relationships
      → Train and predict: binary-classification GNN → predictions on the Test cohort
      → (Optional) Register and load the trained model

Concepts
People — individuals with demographic and medical attributes.
| Property | Type | Notes |
|---|---|---|
| Id | integer | Identifying; unique per person |
| age | integer | Age in years |
| height(cm) | integer | Height in centimeters |
| weight(kg) | integer | Weight in kilograms |
| systolic | integer | Systolic blood pressure |
| relaxation | integer | Diastolic blood pressure |
| fasting blood sugar | integer | Fasting blood glucose level |
| Cholesterol | integer | Total cholesterol |
| triglyceride | integer | Triglyceride level |
| HDL | integer | High-density lipoprotein |
| LDL | integer | Low-density lipoprotein |
| hemoglobin | float | Hemoglobin level |
| Urine protein | integer | Urine protein indicator |
| serum creatinine | float | Serum creatinine level |
| AST | integer | Aspartate aminotransferase (liver enzyme) |
| ALT | integer | Alanine aminotransferase (liver enzyme) |
| Gtp | integer | Gamma-glutamyl transferase (liver enzyme) |
| dental caries | integer | Binary indicator (0 / 1) |
Related — pairs of connected people; used to construct edges in the GNN graph. No primary key.
| Property | Type | Notes |
|---|---|---|
| person1 | integer | Foreign key into People.Id |
| person2 | integer | Foreign key into People.Id |
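
A rough sketch of how the `Related` rows map onto a graph, using plain Python rather than the RelationalAI SDK: each `(person1, person2)` pair becomes a directed edge, and any pair whose endpoint is missing from `People` is reported as a dangling foreign key. The inline CSV text and the helper name `load_edges` are made up for illustration; with the real data you would read `data/people.csv` and `data/related.csv` instead.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-ins for people.csv and related.csv (hypothetical values).
PEOPLE_CSV = "Id,age\n1,40\n2,35\n3,52\n"
RELATED_CSV = "person1,person2\n1,2\n2,3\n3,9\n"


def load_edges(people_text: str, related_text: str):
    """Return (adjacency, dangling) for an edge list keyed on People.Id."""
    people_ids = {int(row["Id"]) for row in csv.DictReader(io.StringIO(people_text))}
    adjacency = defaultdict(list)  # source Id -> list of destination Ids
    dangling = []                  # edges with an endpoint missing from People
    for row in csv.DictReader(io.StringIO(related_text)):
        src, dst = int(row["person1"]), int(row["person2"])
        if src in people_ids and dst in people_ids:
            adjacency[src].append(dst)
        else:
            dangling.append((src, dst))
    return dict(adjacency), dangling


adjacency, dangling = load_edges(PEOPLE_CSV, RELATED_CSV)
print(adjacency)
print(dangling)
```

In this toy input the pair `(3, 9)` is flagged because no person with `Id` 9 exists.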
TrainTable / ValidationTable / TestTable — split tables joined to People by Id to build the train, validation, and test relationships. TestTable.smoking is held out from the model.
| Property | Type | Notes |
|---|---|---|
| Id | integer | Foreign key into People.Id |
| smoking | integer | Binary label (0 / 1); held out for TestTable |
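
Because each split table is joined to `People` by `Id`, it can be worth confirming before training that the three splits do not overlap and that every split `Id` exists in `People`. The sketch below does this with Python sets; the function name `check_splits` and the toy Ids are assumptions for illustration.

```python
def check_splits(people_ids, train_ids, val_ids, test_ids):
    """Return a dict of problems found; empty sets mean the splits look sane."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    people = set(people_ids)
    return {
        # Ids that appear in more than one split would leak labels.
        "overlap": (train & val) | (train & test) | (val & test),
        # Ids with no matching People row cannot be joined.
        "unknown": (train | val | test) - people,
    }


# Toy Ids (hypothetical); the bundled data has 31,187 / 3,898 / 3,899 rows.
report = check_splits(
    people_ids=[1, 2, 3, 4, 5],
    train_ids=[1, 2, 3],
    val_ids=[3, 4],   # 3 is also in train
    test_ids=[5, 6],  # 6 is not in People
)
print(report)
```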
How it works
1. Build the graph
Each row of Related defines a directed edge from one People instance to another. Because both endpoints are the same concept, the destination uses .ref():

    PeopleRef = People.ref()
    model.define(Edge.new(src=People, dst=PeopleRef)).where(
        People.Id == Related.person1,
        PeopleRef.Id == Related.person2,
    )

2. Configure features
PropertyTransformer lists the medical and demographic features as continuous, treats the binary dental caries indicator as a category, and drops the Id so it isn’t fed into the model:

    pt = PropertyTransformer(
        continuous=[
            People.age,
            getattr(People, "height(cm)"),
            getattr(People, "weight(kg)"),
            ...
        ],
        category=[getattr(People, "dental caries")],
        drop=[People.Id],
    )

The getattr() calls handle column names with special characters (spaces, parentheses) that aren’t valid Python identifiers. The schema is preserved as-is so the same names work in Snowflake queries against the underlying tables.
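
To see why `getattr()` is needed here, the plain-Python sketch below uses a stand-in class (`Record`, hypothetical and unrelated to the RelationalAI `People` concept) with attributes whose names contain parentheses or spaces. Dot access such as `rec.height(cm)` would be parsed as a call on `rec.height`, so only the string-based lookup reaches the attribute.

```python
class Record:
    """Stand-in object; not a RelationalAI concept."""


rec = Record()
# setattr/getattr accept any string, even one that is not a valid identifier.
setattr(rec, "height(cm)", 170)
setattr(rec, "fasting blood sugar", 94)

print("height(cm)".isidentifier())          # False: cannot be written after a dot
print(getattr(rec, "height(cm)"))           # 170
print(getattr(rec, "fasting blood sugar"))  # 94
```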
3. Define the task
A simple node-classification task: each labeled row pairs a People instance with its smoking status. The Test relationship omits the label since it’s held out.

    Train = Relationship(f"{People} has {Any:smoking}")
    model.define(Train(People, TrainTable.smoking)).where(
        People.Id == TrainTable.Id
    )

    Test = Relationship(f"{People}")
    model.define(Test(People)).where(
        People.Id == TestTable.Id
    )

4. Train and predict
Instantiate the GNN with the graph, the PropertyTransformer, and the Train / Validation relationships, fit it, then attach predictions over the held-out Test cohort to each People instance:

    gnn = GNN(
        exp_database=GNN_EXP_DATABASE,
        exp_schema=GNN_EXP_SCHEMA,
        graph=gnn_graph,
        property_transformer=pt,
        train=Train,
        validation=Validation,
        task_type="binary_classification",
        eval_metric="roc_auc",
        device="cuda",
        n_epochs=5,
        lr=0.005,
        train_batch_size=256,
    )
    gnn.fit()
    People.predictions = gnn.predictions(domain=Test)

5. (Optional) Register and load
The bottom of each script has a commented-out block that registers the trained model in the Snowflake Model Registry, then loads it into a fresh GNN instance and predicts without retraining. Uncomment it to try that workflow.
Customize this template
Use your own data
- Replace the CSVs in `data/` with your own equivalent files (people, edges, splits). Column names need to match the `PropertyTransformer`, or you’ll need to edit the transformer.
- For Snowflake adaptation, edit the `DATABASE`, `SCHEMA`, `TASK_SCHEMA`, and `GNN_EXP_*` constants at the top of `smoker_status_prediction.py`.
Tune the model
- `n_epochs`: increase for better convergence on a larger dataset.
- `lr`: lower if training loss bounces.
- `eval_metric`: switch to `"accuracy"` or `"f1"` if those metrics suit your problem better.
- `device`: use `"cuda"` for faster training on a GPU-enabled engine.

For the full hyperparameter list, see the Configure a GNN guide.
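
To make the `eval_metric` choice concrete, the sketch below computes accuracy, F1, and ROC-AUC in plain Python for the ten sample rows printed under Expected output. It is independent of the SDK: the function names are made up for illustration, and the 0.5 threshold used to turn probabilities into labels is an assumption (it happens to reproduce the `predicted_label` column for these rows).

```python
# (prob, actual_smoking) for the ten sample rows shown under Expected output.
ROWS = [
    (0.645295, 1), (0.351588, 0), (0.178555, 0), (0.352124, 0), (0.855304, 1),
    (0.836192, 0), (0.576184, 1), (0.045360, 0), (0.707472, 1), (0.851055, 1),
]


def accuracy(rows, threshold=0.5):
    """Fraction of rows whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for p, y in rows) / len(rows)


def f1(rows, threshold=0.5):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p >= threshold and y == 1 for p, y in rows)
    fp = sum(p >= threshold and y == 0 for p, y in rows)
    fn = sum(p < threshold and y == 1 for p, y in rows)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def roc_auc(rows):
    """Probability that a random positive outranks a random negative."""
    pos = [p for p, y in rows if y == 1]
    neg = [p for p, y in rows if y == 0]
    # Ties count as half a win, following the usual rank-based definition.
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


print(round(accuracy(ROWS), 4), round(f1(ROWS), 4), round(roc_auc(ROWS), 4))
```

On ten rows these numbers are only illustrative; the reported test-set metrics come from all 3,899 test rows. Accuracy and F1 depend on the chosen threshold, while ROC-AUC depends only on how the probabilities rank the rows.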
Extend the model
- Add categorical demographics (e.g. occupation, income bracket): list them under `category=[...]` in the `PropertyTransformer`.
- Try a multiclass task: if your label has more than two values (e.g. never / light / heavy smoker), set `task_type="multiclass_classification"` and use `eval_metric="macro_f1"` or `"accuracy"`.
- Register the model for reuse: uncomment the bonus section at the bottom of either script.
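
For the multiclass variant, macro-averaged F1 gives each class equal weight regardless of how many rows it has. The sketch below computes it in plain Python on made-up labels for a hypothetical never / light / heavy task; it illustrates the metric itself and makes no claim about how the SDK computes `macro_f1` internally.

```python
def macro_f1(actual, predicted):
    """Unweighted mean of per-class F1 over the classes present in `actual`."""
    scores = []
    for cls in sorted(set(actual)):
        tp = sum(a == cls and p == cls for a, p in zip(actual, predicted))
        fp = sum(a != cls and p == cls for a, p in zip(actual, predicted))
        fn = sum(a == cls and p != cls for a, p in zip(actual, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 0 = never, 1 = light, 2 = heavy.
actual    = [0, 0, 0, 0, 1, 1, 2, 2]
predicted = [0, 0, 0, 1, 1, 1, 2, 0]
print(round(macro_f1(actual, predicted), 4))
```

A model that ignores the rare "heavy" class can still score well on accuracy, which is one reason to look at a macro-averaged metric when classes are imbalanced.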
Troubleshooting
Predictions are all the same class
Check the class balance in train.csv — if it’s heavily skewed, consider re-sampling or training for more epochs with a lower learning rate. Also verify that the RELATED edges correlate with shared smoking status; without an informative network, the GNN reduces to a tabular model.
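
Both checks can be sketched in a few lines of plain Python. `class_balance` reports the share of each label, and `edge_homophily` reports the fraction of edges whose two endpoints carry the same label, counting only edges where both labels are known. The function names and toy data are assumptions for illustration; with the real files you would build `labels` from `train.csv` and `edges` from `related.csv`.

```python
from collections import Counter


def class_balance(labels: dict[int, int]) -> dict[int, float]:
    """Share of each label value among the labeled Ids."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}


def edge_homophily(edges, labels):
    """Fraction of edges joining two people with the same label.

    Edges with an unlabeled endpoint are skipped; returns None if no
    edge has both endpoints labeled.
    """
    known = [(a, b) for a, b in edges if a in labels and b in labels]
    if not known:
        return None
    return sum(labels[a] == labels[b] for a, b in known) / len(known)


# Toy data (hypothetical): Id -> smoking label, plus (person1, person2) pairs.
labels = {1: 1, 2: 1, 3: 0, 4: 0, 5: 0}
edges = [(1, 2), (3, 4), (4, 5), (2, 3), (5, 9)]

print(class_balance(labels))
print(edge_homophily(edges, labels))
```

As a rough yardstick, with class shares p and 1 - p, randomly wired edges would give a homophily of about p² + (1 - p)²; values well above that suggest the edges carry signal about smoking status.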
GNN training is very slow
GNN training runs on the RelationalAI engine you’ve provisioned. The device flag in the script picks which engine flavor to use: "cuda" for GPU-enabled engines (default; significantly faster), "cpu" for CPU-only engines. If training is slow with device="cuda", your engine may not actually have a GPU; check the engine type or fall back to "cpu".
Permissions error from RelationalAI native app
The native app needs USAGE on the database and the experiment schema, plus CREATE EXPERIMENT / CREATE MODEL on the experiment schema. Run:

    SET db_name = '<db>';
    SET schema_experiments = '<db>.<exp_schema>';
    SET app_name = 'RELATIONALAI';  -- replace with your app name

    CREATE DATABASE IF NOT EXISTS identifier($db_name);
    CREATE SCHEMA IF NOT EXISTS identifier($schema_experiments);

    GRANT USAGE ON DATABASE identifier($db_name) TO APPLICATION identifier($app_name);
    GRANT USAGE ON SCHEMA identifier($schema_experiments) TO APPLICATION identifier($app_name);
    GRANT CREATE EXPERIMENT ON SCHEMA identifier($schema_experiments) TO APPLICATION identifier($app_name);
    GRANT CREATE MODEL ON SCHEMA identifier($schema_experiments) TO APPLICATION identifier($app_name);

rai init fails or connection errors
Ensure your Snowflake credentials are configured correctly and that the RAI Native App is installed on your account. Run rai init again and verify the connection settings.