Patient Cohort Recruitment
Build a clinical-research cohort over a patient knowledge graph: close a kinase-pathway sub-ontology, lift the closure to per-patient eligibility and coverage facts, then pick K patients whose joint coverage spans enough distinct genes, therapies, and adverse events to generalize.
What this template is for
Clinical research and pharma R&D teams build patient cohorts to power studies. A typical ask: “find patients with a mutation in some gene from the kinase pathway who received a therapy and developed an adverse event within 90 days, and pick K of them so the cohort spans enough distinct genes, therapies, and toxicity profiles for the analysis to generalize.”
The pieces are recognizable across healthcare data. A gene ontology with is_a edges defines the pathway. A patient knowledge graph carries mutations / therapies / adverse events with timestamps. The cohort itself is a small set of patients chosen against multiple coverage criteria.
This template encodes that ask as a three-reasoner pipeline. The Graph reasoner runs a single reachable(full=True) call to close is_a over the gene ontology, returning every gene in the kinase-pathway sub-ontology in one step. Pure relational Rules then lift the closure to per-patient eligibility — represented as a sub-concept EligiblePatient extends Patient whose membership is the eligibility conjunction — and to per-(patient, gene), per-(patient, therapy), and per-(patient, adverse-event) coverage facts. The Prescriptive reasoner (CSP, MiniZinc / Chuffed) selects the cohort: binary is_in_cohort decisions are scoped to EligiblePatient rows only, plus is_covered indicators on three sub-concepts (CoverableGene, CoverableTherapy, CoverableAdverseEvent) that the upper-bound ICs link back to the patient decisions.
The same pattern applies to other knowledge-graph cohort / set-cover problems where eligibility is a relational predicate over a labelled subgraph and the chosen set must witness a minimum spread on several attributes: insurance claim audits (find K members spanning N procedures and M comorbidities), grant-applicant diversification (find K applicants spanning institutions, fields, and career stages), security alert triage (find K alerts spanning attack categories and asset classes).
Who this is for
- Clinical research teams enrolling cohorts for pathway-targeted studies
- Pharma R&D teams running biomarker-driven feasibility analyses
- Healthcare data engineers building cohort-discovery pipelines on top of patient knowledge graphs
- Operations researchers learning multi-reasoner (Graph + Rules + CSP) composition over an OMOP / FHIR-class ontology
What you’ll build
- A gene ontology + patient knowledge graph: `Gene`, `GeneIsA`, `Patient`, `MutationEvent`, `TherapyEvent`, `AdverseEventOcc`, `Therapy`, `AdverseEvent`
- A Graph-reasoner closure of the kinase-pathway sub-ontology via `graph.reachable(full=True)`, producing a `KinaseGene extends Gene` sub-concept
- A 3-arity `Patient.qualifying_pair` relationship over `(Patient, TherapyEvent, AdverseEventOcc)` triples where the AE follows the therapy within `MAX_THERAPY_TO_AE_DAYS`, single-sourcing the AE-window predicate
- Pure-relational rules that derive sub-concepts `KinaseMutationCarrier`, `QualifyingPairPatient`, and the eligibility conjunction `EligiblePatient extends Patient`, plus per-axis coverage relations (`Patient.covers_kinase_gene`, `Patient.covers_therapy`, `Patient.covers_ae`) projected from `qualifying_pair`, and the coverable sub-concepts `CoverableGene`, `CoverableTherapy`, `CoverableAdverseEvent` scoped to eligible-patient coverage
- A constraint model with four binary decision streams targeting sub-concepts directly: `EligiblePatient.is_in_cohort` (eligible patients only), `CoverableGene.is_covered` (coverable kinase genes only), `CoverableTherapy.is_covered`, and `CoverableAdverseEvent.is_covered`
- Per-axis coverage upper-bound ICs that link `is_covered` to in-cohort patient decisions (`Sub.is_covered <= sum(EligiblePatient.is_in_cohort).per(Sub)`), and lower-bound ICs (`sum(Sub.is_covered) >= MIN_*`) that force the cohort to span enough distinct values
- Pre-solve Python invariants that catch silent-failure modes (duplicate keys, dangling foreign keys, missing kinase root, negative timestamps) before the solver runs
- Post-solve verification via `problem.verify()` confirming every IC holds in the returned solution, plus a `termination_status() == "OPTIMAL"` assertion
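As an illustration of the pre-solve invariants listed above, here is a minimal plain-Python sketch. It does not use the RelationalAI API, and it is not the template's own check code. The function name `check_invariants`, the row layout, and the `patient_id` / `gene_id` column names are assumptions made for the example; only `t_days` and the four failure modes come from the template description.

```python
from collections import Counter


def check_invariants(genes, patients, mutation_events, therapy_events, kinase_root_id):
    """Return a list of human-readable problems; an empty list means no check failed."""
    problems = []

    # Duplicate primary keys silently collapse or multiply rows downstream.
    for name, rows in (("genes", genes), ("patients", patients)):
        dupes = [k for k, n in Counter(r["id"] for r in rows).items() if n > 1]
        if dupes:
            problems.append(f"{name}: duplicate ids {sorted(dupes)}")

    gene_ids = {r["id"] for r in genes}
    patient_ids = {r["id"] for r in patients}

    # A missing root yields an empty closure and therefore no eligible patients.
    if kinase_root_id not in gene_ids:
        problems.append(f"kinase root {kinase_root_id} not in genes")

    # Dangling foreign keys drop events from joins without raising any error.
    for ev in mutation_events:
        if ev["patient_id"] not in patient_ids:
            problems.append(f"mutation event {ev['id']}: unknown patient {ev['patient_id']}")
        if ev["gene_id"] not in gene_ids:
            problems.append(f"mutation event {ev['id']}: unknown gene {ev['gene_id']}")

    # Negative day offsets break the therapy-to-AE window arithmetic.
    for ev in therapy_events:
        if ev["t_days"] < 0:
            problems.append(f"therapy event {ev['id']}: negative t_days {ev['t_days']}")

    return problems


# Hypothetical rows that trip each check once.
genes = [{"id": 1}, {"id": 2}, {"id": 2}]
patients = [{"id": 10}]
mutation_events = [{"id": 100, "patient_id": 11, "gene_id": 2}]
therapy_events = [{"id": 200, "patient_id": 10, "t_days": -5}]

problems = check_invariants(genes, patients, mutation_events, therapy_events, kinase_root_id=7)
for p in problems:
    print(p)
```

Collecting every problem into a list, rather than raising on the first one, lets a single run report all data issues before any solver time is spent.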
What’s included
The bundled CSVs are illustrative, fully synthetic demo data (e.g. patient names P_Alpha…P_Oscar, fictional gene/therapy/AE labels) sized so the pipeline runs end-to-end in a few seconds; swap in your own ontology and patient KG to apply the template to real cohorts.
- `patient_cohort_recruitment.py`: main script with concepts, the Graph closure, the rules, the decisions and constraints, and the solver call
- `data/genes.csv`: 10 sample genes, 7 in the kinase pathway sub-ontology (a root plus two intermediate sub-roots and four leaves) and 3 unrelated metabolism genes
- `data/gene_is_a.csv`: 8 `is_a` edges that lay out the kinase-pathway tree and a parallel unrelated tree
- `data/patients.csv`: 15 synthetic patients with names and ages
- `data/mutation_events.csv`: 26 illustrative mutation events spanning kinase and non-kinase genes; some patients carry only non-kinase mutations and are correctly excluded from the cohort
- `data/therapy_events.csv`: 15 therapy events across 5 therapies
- `data/adverse_events.csv`: 12 adverse-event occurrences across 4 AE terms; some patients have AEs outside the 90-day window from any therapy and are excluded by the qualifying-pair rule
- `data/therapies.csv`: 5 therapy concepts (3 kinase inhibitors, 2 unrelated)
- `data/ae_terms.csv`: 4 AE terms
- `pyproject.toml`: Python package configuration
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
Quickstart
- Download ZIP:

      curl -O https://docs.relational.ai/templates/zips/v1/patient_cohort_recruitment.zip
      unzip patient_cohort_recruitment.zip
      cd patient_cohort_recruitment

- Create venv:

      python -m venv .venv
      source .venv/bin/activate
      python -m pip install --upgrade pip

- Install:

      python -m pip install .

- Configure:

      rai init

- Run:

      python patient_cohort_recruitment.py

Expected output (the bundled data has eight eligible patients and several feasible cohorts that hit the coverage floors; the exact choice may vary across solver versions):

    Solve result:
    • status: OPTIMAL
    • objective: 0
    • solve time: ~1s
    • num_points: 1
    • solver: MiniZinc_unknown

    Kinase-pathway gene closure (reachable from KINASE_ROOT_GENE_ID):
    gene_id  gene_name
    1        KinasePathwayRoot
    2        SerineThreonineKinase
    3        TyrosineKinase
    4        EGFR
    5        HER2
    6        BRAF
    7        MEK1

    Eligible patients (carry a kinase mutation and have a qualifying pair):
    patient_id  patient_name
    1           P_Alpha
    2           P_Bravo
    3           P_Charlie
    4           P_Delta
    5           P_Echo
    7           P_Golf
    8           P_Hotel
    9           P_India

    Selected cohort:
    patient_id  patient_name  age_years
    2           P_Bravo       61
    7           P_Golf        63
    8           P_Hotel       49
    9           P_India       57

    Kinase-pathway genes covered by the cohort:
    gene_id  gene_name
    4        EGFR
    5        HER2
    6        BRAF
    7        MEK1

    Therapies covered by the cohort:
    therapy_id  therapy_name
    1           EGFR_Inhibitor
    2           HER2_Inhibitor
    3           MEK_Inhibitor

    Adverse events covered by the cohort:
    ae_id  ae_term
    1      Rash
    2      Hepatotoxicity
    3      Cardiomyopathy

Eight of the 15 patients are eligible (the seven excluded patients each fail either the kinase-mutation test or the qualifying-pair test within the 90-day window). The kinase-pathway closure transitively covers the root, the two sub-roots, and all four leaves. Several four-patient cohorts hit the `MIN_GENES_COVERED = 3` / `MIN_THERAPIES_COVERED = 2` / `MIN_AES_COVERED = 2` floors; the solver returns one of them (the specific cohort can vary across runs). `Statin` and `GLP1_Agonist` (therapies 4 and 5) and `Myalgia` (AE 4) appear in the data but are unreachable for any cohort because no eligible patient has a qualifying pair on them, so the `Coverable*` sub-concepts correctly exclude them from the count.
Template structure
    .
    ├── README.md
    ├── pyproject.toml
    ├── patient_cohort_recruitment.py
    └── data/
        ├── genes.csv
        ├── gene_is_a.csv
        ├── patients.csv
        ├── mutation_events.csv
        ├── therapy_events.csv
        ├── adverse_events.csv
        ├── therapies.csv
        └── ae_terms.csv

How it works
The pipeline runs three stages in order: Graph closes the ontology, Rules lift the closure to patient-level facts, and the CSP solver selects the cohort.
Graph reasoner: one call to close the ontology. The is_a CSV is in OMOP / SNOMED convention (child -> parent), but the Graph constructor takes the same edge concept with edge_src_relationship=GeneIsA.parent and edge_dst_relationship=GeneIsA.child — so reachability from a root flows downwards through the subclass tree onto every member. reachable(full=True) returns every (ancestor, descendant) pair, the full transitive closure. The closure is then materialized as a sub-concept KinaseGene extends Gene:

    gene_reachable = Graph(
        model,
        directed=True,
        weighted=False,
        node_concept=Gene,
        edge_concept=GeneIsA,
        edge_src_relationship=GeneIsA.parent,
        edge_dst_relationship=GeneIsA.child,
    ).reachable(full=True)

    KinaseGene = model.Concept("KinaseGene", extends=[Gene])
    KinaseRootGene = Gene.ref()
    model.define(KinaseGene(Gene)).where(
        KinaseRootGene.id == KINASE_ROOT_GENE_ID,
        gene_reachable(KinaseRootGene, Gene),
    )

Rules: lift the closure to patient-level sub-concepts. Pure relational arithmetic, no decisions. Predicates are encoded as sub-concepts (their membership is the predicate), so downstream rules and the CSP just check `Sub(Parent)` to test the predicate (cheaper and clearer than Boolean indicator properties). The qualifying-pair AE-window predicate is lifted into a 3-arity `Patient.qualifying_pair(TherapyEvent, AdverseEventOcc)` relationship and the three downstream rules (eligibility, therapy coverage, AE coverage) all project from it, so changing the qualifying-pair definition (severity matching, treatment duration, multi-event sequencing) is an edit to one rule rather than three:

    KinaseMutationCarrier = model.Concept("KinaseMutationCarrier", extends=[Patient])
    model.define(KinaseMutationCarrier(Patient)).where(
        MutationEvent.patient == Patient,
        KinaseGene(MutationEvent.gene),
    )

    Patient.qualifying_pair = model.Relationship(
        f"{Patient} qualifies on {TherapyEvent:therapy_event} and {AdverseEventOcc:ae_occ}"
    )
    model.define(Patient.qualifying_pair(TherapyEvent, AdverseEventOcc)).where(
        TherapyEvent.patient == Patient,
        AdverseEventOcc.patient == Patient,
        AdverseEventOcc.t_days - TherapyEvent.t_days >= 0,
        AdverseEventOcc.t_days - TherapyEvent.t_days <= MAX_THERAPY_TO_AE_DAYS,
    )

    QualifyingPairPatient = model.Concept("QualifyingPairPatient", extends=[Patient])
    model.define(QualifyingPairPatient(Patient)).where(
        Patient.qualifying_pair(TherapyEvent, AdverseEventOcc),
    )

    EligiblePatient = model.Concept("EligiblePatient", extends=[Patient])
    model.define(EligiblePatient(Patient)).where(
        KinaseMutationCarrier(Patient),
        QualifyingPairPatient(Patient),
    )

    # Per-axis coverage projects from `qualifying_pair`:
    Patient.covers_therapy = model.Relationship(f"{Patient} covers {Therapy:therapy}")
    model.define(Patient.covers_therapy(Therapy)).where(
        Patient.qualifying_pair(TherapyEvent, AdverseEventOcc),
        TherapyEvent.therapy == Therapy,
    )

Prescriptive reasoner: cohort selection as a CSP. Decisions target the sub-concept directly (`EligiblePatient.is_in_cohort`, `CoverableTherapy.is_covered`, …), which creates one binary variable per sub-concept row. The `Coverable*` sub-concepts are themselves scoped to eligible-patient coverage, not any-patient coverage: a Y covered only by ineligible patients would otherwise sit in `Coverable*` with no upper-bound IC binding (the per-pair `where` body has no eligible-patient row for it), and the solver would mark it covered to satisfy the lower bound trivially. Scoping `Coverable*` to `EligiblePatient.covers_*` closes that gap; scoping `solve_for` to `Coverable*` then ensures every `is_covered` decision has a real upper bound.

    problem.solve_for(
        EligiblePatient.is_in_cohort,
        type="bin",
        name=["is_in_cohort", EligiblePatient.id],
    )
    problem.solve_for(
        CoverableTherapy.is_covered,
        type="bin",
        name=["therapy_covered", CoverableTherapy.id],
    )

Coverage upper bound + per-pair lower bound + floor is the CSP signature. For each coverable Y, `Y.is_covered` is bounded above by the number of in-cohort patients that cover it (so an unsupported Y cannot be marked covered) and bounded below per pair by `EligiblePatient.is_in_cohort` (so any in-cohort patient covering Y forces `is_covered` to saturate to 1). The two bounds together pin `is_covered` to the actual coverage. The floor IC `sum(is_covered) >= MIN_*` then constrains the cohort to span at least `MIN_*` distinct values. Without the per-pair lower bound the solver could leave indicators at 0 even when the cohort actually covers them, making the `inspect()` output underreport.

    gene_cover_ub_ic = model.where(EligiblePatient.covers_kinase_gene(CoverableGene)).require(
        CoverableGene.is_covered <= sum(EligiblePatient.is_in_cohort).per(CoverableGene)
    )
    gene_cover_lb_ic = model.where(EligiblePatient.covers_kinase_gene(CoverableGene)).require(
        CoverableGene.is_covered >= EligiblePatient.is_in_cohort
    )
    gene_min_ic = model.require(sum(CoverableGene.is_covered) >= MIN_GENES_COVERED)

All ten ICs are pure relational arithmetic, so `problem.verify()` re-evaluates every one in the returned solution; no constraint is solver-only.
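To make the semantics of the first two stages concrete, the sketch below recomputes the ontology closure and the eligibility conjunction in plain Python on a toy dataset. It is a reference for what the Graph and Rules stages are described as deriving, not the RelationalAI implementation. The toy genes, patients, and the helper names `descendants` and `has_qualifying_pair` are hypothetical. As in the expected output, the root is treated as part of its own closure.

```python
from collections import defaultdict, deque

# Hypothetical toy data; edges are (child, parent), as in the is_a CSV.
IS_A = [(2, 1), (3, 1), (4, 3), (9, 8)]
KINASE_ROOT_GENE_ID = 1
MAX_THERAPY_TO_AE_DAYS = 90

mutations = {"A": [4], "B": [9], "C": [2]}      # patient -> mutated gene ids
therapy_days = {"A": [0], "B": [0], "C": [10]}  # patient -> therapy day offsets
ae_days = {"A": [30], "B": [20], "C": [200]}    # patient -> adverse-event day offsets


def descendants(edges, root):
    """Genes reachable from root when walking parent -> child (root included)."""
    children = defaultdict(list)
    for child, parent in edges:
        children[parent].append(child)  # flip child -> parent into parent -> child
    seen, queue = {root}, deque([root])
    while queue:
        for nxt in children[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def has_qualifying_pair(patient):
    """True if some AE falls on or after some therapy and within the window."""
    return any(
        0 <= ae - th <= MAX_THERAPY_TO_AE_DAYS
        for th in therapy_days[patient]
        for ae in ae_days[patient]
    )


kinase_genes = descendants(IS_A, KINASE_ROOT_GENE_ID)

# Eligibility is the conjunction: kinase-mutation carrier AND qualifying pair.
eligible = sorted(
    p for p in mutations
    if any(g in kinase_genes for g in mutations[p]) and has_qualifying_pair(p)
)

print(sorted(kinase_genes))
print(eligible)
```

Patient B fails the kinase-mutation test (gene 9 sits in the unrelated tree) and patient C fails the window test (190 days), so only A is eligible. A cross-check like this on a small extract can help confirm that the relational rules select the rows you expect.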
Customize this template
- Use your own data by replacing the eight CSV files with your gene ontology and patient knowledge graph. The constraint structure does not change. If your ontology already stores `is_a` parent -> child, drop the `parent`/`child` flip in the `Graph` constructor. If you don’t have ontology data, define `KinaseGene` membership directly on the genes you care about and skip the Graph step.
- Change the cohort target by adjusting `COHORT_SIZE` and the three `MIN_*_COVERED` floors at the top. Tightening any one of them shrinks the feasible region; setting `MIN_GENES_COVERED = COHORT_SIZE` requires as many distinct covered genes as there are cohort members, which, when each patient carries a single kinase mutation, rules out two patients with identical mutation patterns.
- Move from feasibility to optimization. This template is a satisfaction model: any cohort that hits the floors is correct. To rank, swap `problem.solve(...)` for `problem.maximize(sum(CoverableGene.is_covered) + sum(CoverableTherapy.is_covered) + sum(CoverableAdverseEvent.is_covered))` to find the cohort with the broadest joint span, or `problem.minimize(sum(EligiblePatient.is_in_cohort * EligiblePatient.age_years))` for a younger cohort. MiniZinc / Chuffed handles both.
- Keep aggregates on a single sub-concept. Aggregations over a decision must reference the sub-concept the decision was scoped to; mixing parent and sub-concept references in a single aggregate triggers a `TypeError`.
- Tighten the qualifying window by editing `MAX_THERAPY_TO_AE_DAYS`. The 90-day window is a common attribution choice for treatment-emergent AEs in oncology trials; some indications use 28 days for acute toxicity, others 180 days for late-onset events.
- Add patient-level eligibility rules (minimum age, treatment-naive status, organ-function thresholds) by adding more conjuncts to the `EligiblePatient` definition (or by introducing further `extends=[Patient]` sub-concepts). Each extra rule narrows the eligible set; the CSP automatically drops decisions for newly-ineligible patients.
- Add cohort-level fairness rules. A balanced-cohort study might require a minimum count from each of two demographic strata. Add a stratum property (`Patient.stratum`) and an IC `sum(EligiblePatient.is_in_cohort).per(EligiblePatient.stratum) >= MIN_PER_STRATUM` to enforce minimum representation per stratum. The decision-side aggregate must key on the sub-concept (`EligiblePatient`), not the parent (`Patient`).
- Anchor on a different ontology root by changing `KINASE_ROOT_GENE_ID`. Multi-pathway studies can run several queries with different roots and union the results.
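Before changing `MAX_THERAPY_TO_AE_DAYS`, it can help to see how the qualifying set moves with the window. The following plain-Python sketch applies the same two-sided window predicate (AE on or after the therapy, and within the window) to a handful of hypothetical `(patient, therapy_day, ae_day)` triples; the data and the helper name `qualifying_patients` are invented for the example.

```python
# Hypothetical (patient, therapy_day, ae_day) triples.
pairs = [("P1", 0, 14), ("P2", 0, 60), ("P3", 5, 150), ("P4", 40, 30)]


def qualifying_patients(pairs, max_days):
    """Patients with an AE on or after the therapy day and within max_days of it."""
    return sorted({p for p, th, ae in pairs if 0 <= ae - th <= max_days})


by_window = {window: qualifying_patients(pairs, window) for window in (28, 90, 180)}
for window, patients in by_window.items():
    print(window, patients)
```

P4 never qualifies because its AE precedes the therapy, whatever the window; the other three patients enter as the window widens from 28 to 180 days.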
Learn more
Cohort discovery and patient knowledge graphs (the domain background for relational cohort enumeration on a labelled patient graph):
- Wang, W. et al., Building Patient Cohorts with NLP and Knowledge Graphs. End-to-end pipeline shape, ontology-driven cohort enumeration.
- Xu et al., Enhanced pre-recruitment framework through KG + LLMs. Knowledge-graph-driven trial-eligibility screening.
Subgraph and set-cover techniques (the academic backbone for “find K nodes whose joint coverage spans enough labels”):
- McCreesh, Prosser & Trimble, The Glasgow Subgraph Solver. State-of-the-art constraint-based subgraph isomorphism.
- Caprara, Toth & Fischetti, Algorithms for the Set Covering Problem. The classical IP/CP encoding behind the coverage upper-bound + lower-bound pattern.
Healthcare data standards (the data shapes the patient KG mirrors):
- OHDSI, OMOP Common Data Model v6. The standardised relational schema for observational patient data.
- HL7, FHIR R5. The FHIR resource graph for clinical data exchange.
Troubleshooting
Solver returns INFEASIBLE
- The data may not contain a feasible cohort under the current floors. Loosen one constraint at a time (drop `MIN_GENES_COVERED`, drop `MIN_THERAPIES_COVERED`, lower `COHORT_SIZE`) to confirm whether the data or a specific floor is the bottleneck.
- The kinase-pathway closure may be empty: if `KINASE_ROOT_GENE_ID` doesn’t appear in the gene table, `KinaseGene` is empty and no patient can be eligible. Print the closure (`model.select(KinaseGene.id).inspect()`) before the solve to confirm.
- The eligible-patient set may be smaller than `COHORT_SIZE`. Print `EligiblePatient.id` before the solve; if fewer than `COHORT_SIZE` patients are eligible, lower `COHORT_SIZE` or relax the qualifying-pair window.
- Coverage floors may be unsatisfiable in principle: if the eligible patients only cover two distinct therapies, `MIN_THERAPIES_COVERED >= 3` is infeasible. Inspect the `Patient.covers_therapy` / `Patient.covers_kinase_gene` / `Patient.covers_ae` relations to see what’s actually reachable.
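The checks in this list can be bundled into one pre-solve diagnostic. The sketch below is plain Python over data you would export from the model (for example via `inspect()`); the function name `necessary_conditions` and the dictionary layout are assumptions for illustration. It only tests necessary conditions, so an empty result does not prove that a feasible cohort exists.

```python
def necessary_conditions(eligible, covers, cohort_size, floors):
    """Return failed necessary conditions; an empty list does not prove feasibility.

    covers: axis -> {patient: set of covered values}
    floors: axis -> minimum number of distinct covered values
    """
    failures = []
    if len(eligible) < cohort_size:
        failures.append(f"only {len(eligible)} eligible patients for COHORT_SIZE={cohort_size}")
    for axis, floor in floors.items():
        # Everything any eligible patient could contribute on this axis.
        reachable = set().union(*(covers[axis].get(p, set()) for p in eligible))
        if len(reachable) < floor:
            failures.append(f"{axis}: {len(reachable)} reachable values, floor is {floor}")
    return failures


# Hypothetical export of the eligible set and two coverage relations.
eligible = ["A", "B", "C"]
covers = {
    "gene": {"A": {"EGFR"}, "B": {"EGFR", "BRAF"}, "C": {"BRAF"}},
    "therapy": {"A": {"T1"}, "B": {"T1"}, "C": {"T2"}},
}
failures = necessary_conditions(eligible, covers, cohort_size=4, floors={"gene": 3, "therapy": 2})
for f in failures:
    print(f)
```

Here the cohort size and the gene floor are both out of reach, while the therapy floor passes the check.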
Multiple feasible cohorts exist; which one does the solver return?
- This is constraint satisfaction, not optimization. Any cohort that hits the floors is a correct answer; the solver is free to return different ones across runs.
- To enumerate cohorts (e.g., for clinical-team review), pass `solution_limit=N` to `problem.solve(...)` and iterate over `problem.num_points()` solutions.
- To pin a single answer, switch to optimization, e.g. `problem.maximize(sum(CoverableGene.is_covered) + sum(CoverableTherapy.is_covered) + sum(CoverableAdverseEvent.is_covered))` returns the cohort with the broadest joint span.
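On small datasets, the set of feasible cohorts can also be cross-checked by brute force outside the solver. The sketch below enumerates every fixed-size subset with `itertools.combinations` and keeps those that meet the floors. It assumes the cohort must contain exactly `COHORT_SIZE` patients, and the patients, coverage sets, and floors are hypothetical. It scales combinatorially, so it is only a sanity check, not a replacement for the CSP.

```python
from itertools import combinations

COHORT_SIZE = 2
FLOORS = {"gene": 2, "therapy": 2}

# Hypothetical eligible patients and what each one covers per axis.
covers = {
    "A": {"gene": {"EGFR"}, "therapy": {"T1"}},
    "B": {"gene": {"BRAF"}, "therapy": {"T1"}},
    "C": {"gene": {"EGFR"}, "therapy": {"T2"}},
}


def feasible_cohorts(covers, size, floors):
    """Yield every cohort of the given size whose joint coverage meets all floors."""
    for cohort in combinations(sorted(covers), size):
        spans = {axis: set().union(*(covers[p][axis] for p in cohort)) for axis in floors}
        if all(len(spans[axis]) >= floors[axis] for axis in floors):
            yield cohort


cohorts = list(feasible_cohorts(covers, COHORT_SIZE, FLOORS))
print(cohorts)
```

Only the pair (B, C) spans two genes and two therapies; (A, B) shares a therapy and (A, C) shares a gene.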
A coverable Y still appears as is_covered = 1 with no in-cohort patient covering it
Two encoding pitfalls produce this symptom; both must be guarded against.
- Pitfall 1: `solve_for` targets the parent concept. With a per-pair `where` in the upper-bound IC, rows that no patient covers have no IC asserted (the `where` yields no rows there) and `is_covered` floats free. Fix: target the sub-concept directly in `solve_for(CoverableGene.is_covered, ...)` so unbounded decisions never get created. If you accidentally drop the sub-concept (`solve_for(Gene.is_covered, ...)`), the symptom comes back; restore the sub-concept target.

- Pitfall 2: `Coverable*` is scoped to any-patient coverage. A kinase gene mutated only in patients with no qualifying therapy/AE pair sits in `Coverable*` but has no eligible covering patient in the upper-bound IC’s `where` body, so its decision floats free again. Fix: scope `Coverable*` to eligible-patient coverage:

      model.define(CoverableGene(Gene)).where(EligiblePatient.covers_kinase_gene(Gene))

  If you scope to `Patient.covers_kinase_gene` instead, the symptom comes back; swap the sub-concept’s `where` back to `EligiblePatient.covers_*`.
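Both pitfalls come down to the same mechanism: a per-pair constraint is only asserted where a `(patient, Y)` pair exists. The plain-Python sketch below mimics that for one indicator at a time and lists which values of `is_covered` the two per-pair bounds still allow. The function `admissible_values` and the toy data are hypothetical, and the sketch models the constraint logic only, not the solver.

```python
def admissible_values(y, cohort, coverers):
    """Values of is_covered[y] allowed by the per-pair upper and lower bounds.

    coverers: y -> list of eligible patients that cover y (missing key = no pairs).
    """
    pairs = coverers.get(y, [])
    in_cohort_coverers = [p for p in pairs if p in cohort]
    allowed = []
    for value in (0, 1):
        # Upper bound exists only if at least one (patient, y) pair exists.
        ub_ok = not pairs or value <= len(in_cohort_coverers)
        # Lower bound: each in-cohort coverer forces the indicator up to 1.
        lb_ok = all(value >= 1 for _ in in_cohort_coverers)
        if ub_ok and lb_ok:
            allowed.append(value)
    return allowed


# Hypothetical eligible-patient coverage; MEK1 has no eligible coverer at all.
coverers = {"EGFR": ["A"], "BRAF": ["B"]}
cohort = {"A"}

result = {y: admissible_values(y, cohort, coverers) for y in ("EGFR", "BRAF", "MEK1")}
print(result)
```

`EGFR` is pinned to 1 and `BRAF` to 0, but `MEK1` can take either value because no constraint mentions it. Keeping such rows out of the decision scope is what the two fixes above achieve.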
Import error for relationalai
- Confirm your virtual environment is active: `which python` should point to `.venv`.
- Reinstall dependencies: `python -m pip install .`
Authentication or configuration errors
- Run `rai init` to create or update your RelationalAI/Snowflake configuration.
- If you have multiple profiles, set `export RAI_PROFILE=<your_profile>`.
MiniZinc solver not available
- This template uses the MiniZinc constraint solver. Ensure the RAI Native App version supports MiniZinc.
- HiGHS is not appropriate here — this is a discrete satisfaction model with categorical decisions and structural propagation, not LP/MILP.