CI/CD Runner Allocation
Assign CI/CD workflow jobs to the cheapest compatible runner type, subject to concurrency limits, with scenario analysis across capacity levels and conflict analysis to diagnose an infeasible outage.
What this template is for
This template uses prescriptive reasoning (optimization) to assign CI/CD workflow jobs to runner types at minimum cost. Given a set of runners (GitHub Actions runner types with CPU, memory, OS, and per-minute cost) and workflow jobs (with resource requirements and estimated durations), the optimizer assigns each job to the cheapest compatible runner while respecting per-runner concurrency limits.
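The assignment rule described above can be illustrated without any solver. The following is a minimal brute-force sketch in plain Python, not the template's implementation: the runner names, costs, caps, and jobs are hypothetical toy data, and the helper `cheapest_assignment` is invented for this illustration. It enumerates every compatible assignment, discards those that exceed a concurrency cap, and keeps the cheapest.

```python
from itertools import product

# Hypothetical toy data (not the template's CSVs): per-minute cost and concurrency cap.
runners = {
    "small": {"cost_per_minute": 0.008, "max_concurrent": 2},
    "large": {"cost_per_minute": 0.016, "max_concurrent": 2},
}
# Each job lists its estimated minutes and the runners it is compatible with.
jobs = {
    "lint": {"minutes": 2, "compatible": ["small", "large"]},
    "unit-tests": {"minutes": 10, "compatible": ["small", "large"]},
    "docker-build": {"minutes": 15, "compatible": ["large"]},
    "deploy": {"minutes": 5, "compatible": ["small", "large"]},
}


def cheapest_assignment(jobs, runners):
    """Enumerate every compatible assignment and keep the cheapest one within the caps."""
    names = list(jobs)
    best_cost, best_plan = None, None
    for choice in product(*(jobs[j]["compatible"] for j in names)):
        # Concurrency rule: no runner may take more jobs than its cap.
        if any(choice.count(r) > runners[r]["max_concurrent"] for r in runners):
            continue
        cost = sum(
            runners[r]["cost_per_minute"] * jobs[j]["minutes"]
            for j, r in zip(names, choice)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(names, choice))
    return best_cost, best_plan


cost, plan = cheapest_assignment(jobs, runners)
print(f"${cost:.3f}", plan)
```

In this toy instance the cheap runner can hold only two jobs, so the shortest job (`lint`) is the one pushed onto the pricier runner. Enumeration grows exponentially with the number of jobs, which is why the template hands the same model to an optimizer instead.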
The template also demonstrates scenario analysis by sweeping concurrency multipliers (0.5x, 1.0x, 1.5x) to show how pipeline cost changes under capacity constraints — useful for evaluating maintenance windows, cost reduction, or burst provisioning.
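Because each assignment is a 0/1 decision, the number of jobs on a runner is a whole number, so a fractional scaled cap behaves like its floor. The sketch below shows that arithmetic under assumed data: only the `ubuntu-xlarge` cap of 5 is quoted from this README, while `runner-a`, `runner-b`, and the helper `effective_cap` are hypothetical.

```python
import math

# Hypothetical caps for illustration; only ubuntu-xlarge's cap of 5 is quoted from this README.
max_concurrent = {"ubuntu-xlarge": 5, "runner-a": 8, "runner-b": 3}


def effective_cap(cap, multiplier):
    """Largest whole number of jobs that satisfies count <= multiplier * cap."""
    return math.floor(multiplier * cap)


for multiplier in (0.5, 1.0, 1.5):
    caps = {name: effective_cap(cap, multiplier) for name, cap in max_concurrent.items()}
    print(multiplier, caps)
```

At 0.5x a cap of 5 therefore admits only 2 jobs, not 2.5, which is worth keeping in mind when choosing multipliers for small caps.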
Finally, it shows conflict analysis (infeasibility diagnosis). A maintenance outage takes two well-connected Linux runners offline, which funnels every high-CPU Linux job onto the one surviving large runner — whose concurrency cap cannot hold them all, so the schedule is impossible. Rather than just reporting “infeasible,” the outage solve requests solve(conflict=True), which returns an irreducible infeasible subsystem (IIS): the minimal set of rules that cannot all hold at once. The diagnosis names the stranded jobs and the binding runner cap, so an operator knows exactly which cap to raise or which runner to restore.
Who this is for
- DevOps engineers optimizing CI/CD runner costs
- Platform teams sizing runner fleets for GitHub Actions or similar CI systems
- Anyone learning resource assignment optimization with RelationalAI
What you’ll build
- A binary assignment model that maps each workflow job to a compatible runner type
- Per-runner concurrency constraints scaled by a scenario parameter
- Cost minimization objective (runner cost_per_minute * job estimated_minutes)
- Scenario comparison showing cost impact of halving or increasing runner capacity
- A maintenance-outage diagnosis that reads the IIS (stranded jobs + binding cap) back by entity key
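Written out as a math program, the pieces listed above fit together roughly as follows. This is a sketch of the formulation under assumed notation, none of which appears in the template: $W$ is the set of workflow jobs, $R$ the set of runner types, $C \subseteq W \times R$ the compatible pairs, $c_r$ the runner's cost per minute, $d_w$ the job's estimated minutes, $k_r$ the runner's `max_concurrent`, and $m$ the scenario's concurrency multiplier.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{(w,r) \in C} c_r \, d_w \, x_{wr} \\
\text{s.t.} \quad & \sum_{r \,:\, (w,r) \in C} x_{wr} = 1 && \forall w \in W \\
& \sum_{w \,:\, (w,r) \in C} x_{wr} \le m \, k_r && \forall r \in R \\
& x_{wr} \in \{0, 1\} && \forall (w,r) \in C
\end{aligned}
```

Incompatible pairs have no variable at all, so compatibility is enforced by construction rather than by an extra constraint.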
What’s included
- cicd_runner_allocation.py: Main script with optimization and scenario analysis
- data/runners.csv: 8 runner types with CPU, memory, OS, cost, and concurrency limits
- data/workflows.csv: 20 CI/CD workflow jobs with resource requirements and durations
- data/compatibility.csv: Pre-computed runner-workflow compatibility pairs (OS + resource match)
- pyproject.toml: Python package configuration with dependencies
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
- RelationalAI Python SDK (relationalai) == 1.11.0
Quickstart
- Download ZIP:

```bash
curl -O https://docs.relational.ai/templates/zips/v1/cicd_runner_allocation.zip
unzip cicd_runner_allocation.zip
cd cicd_runner_allocation
```

- Create venv:

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
```

- Install:

```bash
python -m pip install .
```

- Configure:

```bash
rai init
```

- Run:

```bash
python cicd_runner_allocation.py
```

- Expected output (representative: equal-cost runners may be swapped between tied optima, and the IIS may name a different six of the seven stranded jobs; the statuses, costs, and binding runner cap are stable):
```text
Running scenario: concurrency_multiplier = 0.5
--------------------------------------------------
Status: OPTIMAL
Total pipeline cost: $10.18
Assignments:
macos-large (1 jobs): ios-testflight
macos-latest (2 jobs): build-mobile-ios, e2e-tests-safari
self-hosted-linux (4 jobs): e2e-tests-chrome, integration-tests, nightly-build, performance-tests
ubuntu-22.04 (2 jobs): build-api, release-notes
ubuntu-large (3 jobs): build-mobile-android, docker-build, unit-tests-api
ubuntu-latest (7 jobs): build-frontend, dependency-audit, deploy-production, deploy-staging, lint-and-format, security-scan, unit-tests-frontend
windows-latest (1 jobs): windows-installer

Running scenario: concurrency_multiplier = 1.0
--------------------------------------------------
Status: OPTIMAL
Total pipeline cost: $9.62
Assignments:
macos-large (1 jobs): ios-testflight
macos-latest (2 jobs): build-mobile-ios, e2e-tests-safari
self-hosted-linux (8 jobs): build-api, build-mobile-android, docker-build, e2e-tests-chrome, integration-tests, nightly-build, performance-tests, unit-tests-api
ubuntu-22.04 (1 jobs): release-notes
ubuntu-latest (7 jobs): build-frontend, dependency-audit, deploy-production, deploy-staging, lint-and-format, security-scan, unit-tests-frontend
windows-latest (1 jobs): windows-installer

Running scenario: concurrency_multiplier = 1.5
--------------------------------------------------
Status: OPTIMAL
Total pipeline cost: $9.53
Assignments:
macos-large (1 jobs): ios-testflight
macos-latest (2 jobs): build-mobile-ios, e2e-tests-safari
self-hosted-linux (12 jobs): build-api, build-frontend, build-mobile-android, dependency-audit, docker-build, e2e-tests-chrome, integration-tests, nightly-build, performance-tests, security-scan, unit-tests-api, unit-tests-frontend
ubuntu-22.04 (2 jobs): deploy-production, release-notes
ubuntu-latest (2 jobs): deploy-staging, lint-and-format
windows-latest (1 jobs): windows-installer

==================================================
Scenario Analysis Summary
==================================================
  concurrency_multiplier=0.5: OPTIMAL, cost=$10.18
  concurrency_multiplier=1.0: OPTIMAL, cost=$9.62
  concurrency_multiplier=1.5: OPTIMAL, cost=$9.53

==================================================
Maintenance outage: ubuntu-large, self-hosted-linux offline
==================================================
Solve result:
• status: INFEASIBLE
• primal status: NO_SOLUTION
• dual status: INFEASIBILITY_CERTIFICATE
• conflict status: CONFLICT_FOUND
...
Stranded jobs (assign-one rule in conflict):
workflow
build-mobile-android
docker-build
e2e-tests-chrome
integration-tests
performance-tests
unit-tests-api

Binding runner caps (concurrency rule in conflict):
runner         max_concurrent
ubuntu-xlarge  5

To restore feasibility, relax one member of the conflict: bring ubuntu-large or
self-hosted-linux back online, or raise ubuntu-xlarge's concurrency cap.
...
```

At full capacity (1.0x), self-hosted-linux absorbs 8 of 20 jobs. Raising the multiplier to 1.5x lowers the total by $0.09 (from $9.62 to $9.53) by pulling four more low-CPU jobs off the pricier ubuntu runners (the high-CPU jobs already fit on self-hosted at 1.0x). (How the cheap, low-CPU jobs split between the two equal-cost ubuntu runners is one of several tied optima; a different HiGHS build may place them differently at the same total cost.)

The maintenance outage is infeasible: with ubuntu-large and self-hosted-linux offline, the seven high-CPU Linux jobs can only run on ubuntu-xlarge (cap 5). solve(conflict=True) returns the IIS: six of the seven stranded jobs plus the ubuntu-xlarge concurrency rule (which six is solver-dependent, since any six already exceed the cap). The diagnosis points an operator straight at the fix: restore a runner or raise that one cap.
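The scenario figures in the expected output can be cross-checked with a few lines of arithmetic. The sketch below copies the three totals and the self-hosted-linux job counts from that output and uses `Decimal` to avoid floating-point noise in the cent values; the variable names are illustrative only.

```python
from decimal import Decimal

# Totals and self-hosted-linux job counts copied from the expected output above.
cost = {"0.5": Decimal("10.18"), "1.0": Decimal("9.62"), "1.5": Decimal("9.53")}
self_hosted_jobs = {"0.5": 4, "1.0": 8, "1.5": 12}

penalty_at_half = cost["0.5"] - cost["1.0"]    # extra cost of halving capacity
saving_at_burst = cost["1.0"] - cost["1.5"]    # saving from 1.5x capacity
extra_jobs_at_burst = self_hosted_jobs["1.5"] - self_hosted_jobs["1.0"]

print(f"0.5x vs 1.0x: +${penalty_at_half}")
print(f"1.5x vs 1.0x: -${saving_at_burst} ({extra_jobs_at_burst} more jobs on self-hosted-linux)")
```

Halving capacity costs noticeably more than burst capacity saves, because at 0.5x some high-CPU jobs are displaced onto pricier runners, whereas at 1.5x only short, low-CPU jobs move.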
Template structure
```text
.
├── README.md
├── pyproject.toml
├── cicd_runner_allocation.py
└── data/
    ├── runners.csv
    ├── workflows.csv
    └── compatibility.csv
```

How it works
This section walks through the highlights in cicd_runner_allocation.py.
Define concepts and load CSV data
The model defines three concepts. Runner represents CI/CD runner types with resource specs and cost:
```python
Runner = model.Concept("Runner", identify_by={"runner_id": Integer})
Runner.name = model.Property(f"{Runner} has {String:runner_name}")
Runner.os = model.Property(f"{Runner} has {String:runner_os}")
Runner.cpu = model.Property(f"{Runner} has {Integer:cpu}")
Runner.memory_gb = model.Property(f"{Runner} has {Integer:memory_gb}")
Runner.cost_per_minute = model.Property(f"{Runner} has {Float:cost_per_minute}")
Runner.max_concurrent = model.Property(f"{Runner} has {Integer:max_concurrent}")
```

Workflow represents CI/CD jobs with resource requirements:

```python
Workflow = model.Concept("Workflow", identify_by={"workflow_id": Integer})
Workflow.name = model.Property(f"{Workflow} has {String:workflow_name}")
Workflow.event = model.Property(f"{Workflow} has {String:workflow_event}")
Workflow.required_os = model.Property(f"{Workflow} has {String:required_os}")
Workflow.min_cpu = model.Property(f"{Workflow} has {Integer:min_cpu}")
Workflow.min_memory_gb = model.Property(f"{Workflow} has {Integer:min_memory_gb}")
Workflow.estimated_minutes = model.Property(
    f"{Workflow} has {Integer:estimated_minutes}"
)
```

Compatibility links workflows to runners that meet their OS and resource requirements. Assignment is the decision concept; only compatible (workflow, runner) pairs exist:

```python
Compatibility = model.Concept(
    "Compatibility", identify_by={"workflow": Workflow, "runner": Runner}
)

Assignment = model.Concept(
    "Assignment", identify_by={"workflow": Workflow, "runner": Runner}
)
Assignment.x_assigned = model.Property(f"{Assignment} assigned {Float:x_assigned}")
model.define(
    Assignment.new(workflow=Compatibility.workflow, runner=Compatibility.runner)
)
```

Define decision variables, constraints, and objective
Each assignment is a binary variable — assign this workflow to this runner or not:
```python
problem.solve_for(
    Assignment.x_assigned,
    type="bin",
    name=["assign", Assignment.workflow.name, Assignment.runner.name],
)
```

Two constraints enforce feasibility. Each is captured as a handle, named per entity (a readable label), and declared with keyed_by, the entity key that its conflict membership reads back through if the model turns out infeasible. First, each workflow must be assigned to exactly one runner:

```python
assign_one = problem.satisfy(
    model.require(
        sum(AssignRef.x_assigned).where(AssignRef.workflow == Workflow).per(Workflow) == 1
    ),
    name=["assign_one", Workflow.name],
    keyed_by={"workflow": Workflow},
)
```

Second, the number of workflows assigned to each runner cannot exceed its concurrency limit, scaled by the scenario multiplier:

```python
conc = problem.satisfy(
    model.require(
        sum(AssignRef.x_assigned).where(AssignRef.runner == Runner).per(Runner)
        <= concurrency_multiplier * Runner.max_concurrent
    ),
    name=["concurrency", Runner.name],
    keyed_by={"runner": Runner},
)
```

The objective minimizes total pipeline cost: the sum of (runner cost per minute * job duration) across all assignments:

```python
problem.minimize(
    sum(
        Assignment.x_assigned
        * Assignment.runner.cost_per_minute
        * Assignment.workflow.estimated_minutes
    )
)
```

Solve with scenario analysis
The script loops over three concurrency multipliers (0.5x, 1.0x, 1.5x), creating a fresh Problem for each. This reveals the cost of operating at reduced capacity (maintenance window) versus full or burst capacity:
```python
SCENARIO_VALUES = [0.5, 1.0, 1.5]

for multiplier in SCENARIO_VALUES:
    alloc = solve_allocation(multiplier)
```

After all scenarios, a summary table compares status and cost:

```python
for r in scenario_results:
    print(f"  {SCENARIO_PARAM}={r['scenario']}: "
          f"{r['status']}, cost=${r['objective']:.2f}")
```

Diagnose a maintenance outage with conflict analysis
The final section models a maintenance outage: ubuntu-large and self-hosted-linux go offline (their assignments are dropped with a where= filter). Every high-CPU Linux job (min_cpu >= 4) is compatible only with runners in {ubuntu-large, ubuntu-xlarge, self-hosted-linux} (the two heaviest jobs with just the latter two), so with two of those three down, all seven funnel onto ubuntu-xlarge — whose concurrency cap of 5 cannot hold them. The solve requests a conflict diagnosis:
```python
outage = solve_allocation(1.0, offline_runners=["ubuntu-large", "self-hosted-linux"], conflict=True)

assert outage.si.conflict is True
assert outage.si.termination_status in ("INFEASIBLE", "INFEASIBLE_OR_UNBOUNDED")

# conflict_status gates whether an IIS is available -- dispatch on it.
if outage.si.conflict_status == "CONFLICT_FOUND":
    ...  # read the stranded jobs and the binding cap (below)
else:
    # NO_CONFLICT_EXISTS (the model was feasible) or NOT_SUPPORTED / FAILED (this solver
    # build produced no IIS, e.g. needs HiGHS >= 1.13) -- report the status, don't read it.
    print(f"No IIS to inspect: {outage.si.conflict_status}")
```

(The template’s own else branch raises instead of printing: its outage is infeasible by construction, so a missing IIS there is a regression. The print form above is the one to copy when infeasibility is not guaranteed.)
in_conflict is a bare predicate on each constraint instance — true when the solver reports that instance in the conflict (it collapses the solver’s IN_CONFLICT and MAYBE_IN_CONFLICT into one membership). Each constraint’s declared key gives it an entity back-pointer (assign_one.workflow, conc.runner), mirroring the variable’s automatic back-pointer, so the conflict reads back as the actual stranded jobs and the binding runner cap, joined by KEY — no rule-name parsing:
```python
# Stranded jobs (their assign-one rule is in the conflict):
model.select(outage.assign_one.workflow.name).where(outage.assign_one.in_conflict).inspect()

# The binding runner cap (its concurrency rule is in the conflict):
model.select(outage.conc.runner.name, outage.conc.runner.max_concurrent).where(
    outage.conc.in_conflict
).inspect()
```

(.inspect() prints the rows for a quick look; the script materializes the same selects as DataFrames with .to_df() for its printed report and assertions.)
The IIS is minimal: it names six of the seven high-CPU jobs (any six already exceed the cap of five, so which six is solver-dependent) plus the ubuntu-xlarge concurrency rule. To restore feasibility, relax one member — bring a runner back online or raise the cap. Because all seven jobs share the one survivor, lift the cap enough for all of them (or restore a runner) and re-solve to confirm; clearing a single job only resolves that one row of the conflict.
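Why the conflict names six jobs rather than seven can be seen with a generic deletion filter, sketched below in plain Python. This is an illustration of the idea of an irreducible infeasible subset, not the algorithm the solver uses: the job labels are hypothetical, the helpers `is_feasible` and `deletion_filter` are invented here, and the runner cap is treated as an implicit member of the conflict. Only the count of seven jobs and the cap of 5 come from the text above.

```python
def is_feasible(jobs, cap):
    """Every job must land on the single surviving runner, so feasibility is a head count."""
    return len(jobs) <= cap


def deletion_filter(jobs, cap):
    """Shrink an infeasible job set to an irreducible one by dropping jobs that are not needed."""
    assert not is_feasible(jobs, cap), "only meaningful for an infeasible instance"
    core = list(jobs)
    for job in list(jobs):
        trial = [j for j in core if j != job]
        if not is_feasible(trial, cap):
            core = trial  # still infeasible without this job, so it is not needed
    return core


# Hypothetical job labels; the count (7) and the cap (5) come from the README.
stranded = [f"job-{i}" for i in range(1, 8)]
core = deletion_filter(stranded, cap=5)
print(len(core), core)

# Raising the cap by one clears the six-job conflict but not the full outage.
print(is_feasible(stranded, 6), is_feasible(stranded, 7))
```

The filter drops exactly one job and then stops, because removing any further job would leave five jobs that fit under the cap. Which job is dropped depends on the iteration order, which mirrors why the reported six are solver-dependent. The last line shows why the cap has to reach 7 before the outage schedule becomes feasible.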
Customize this template
- Add runners: Extend runners.csv with new runner types (e.g., GPU runners for ML workflows).
- Adjust concurrency: Change max_concurrent in runners.csv or modify SCENARIO_VALUES to test different capacity levels.
- Weight by priority: Add a priority column to workflows.csv and incorporate it into the objective to prefer assigning critical jobs to faster runners.
- Time windows: Add time slot concepts to model scheduling across discrete time periods, not just assignment.
- Real data: Replace CSVs with queries against your CI/CD platform’s API or data warehouse.
Troubleshooting
Problem is infeasible
The concurrency limits are too tight for the number of workflows. Rather than guess, request a conflict diagnosis — solve(conflict=True) returns the irreducible infeasible subsystem (the stranded jobs and the binding runner cap), as shown in the maintenance-outage section. Then increase max_concurrent for the named runner in runners.csv, reduce the number of workflows competing for it, or raise the concurrency multiplier.
rai init fails or connection errors
Ensure your Snowflake credentials are configured correctly and that the RAI Native App is installed on your account. Run rai init again and verify the connection settings.
ModuleNotFoundError for relationalai
Make sure you activated the virtual environment and ran python -m pip install . from the template directory. The pyproject.toml declares the required dependencies.
All workflows assigned to the same runner
This is expected if one runner is cheapest and has enough concurrency. Check that compatibility.csv correctly restricts which runners can handle each workflow’s OS and resource requirements.
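One way to check the compatibility pairs is to recompute them from the runner and workflow attributes and compare the result with the file. The sketch below assumes a matching rule of "same OS, and the runner meets both resource minimums", which is an interpretation of "OS + resource match" rather than the template's confirmed logic. The rows are hypothetical, and the field names mirror the model's properties, not necessarily the CSV headers.

```python
# Hypothetical rows; field names mirror the model's properties, not necessarily the CSV headers.
runners = [
    {"name": "linux-small", "os": "linux", "cpu": 2, "memory_gb": 7},
    {"name": "linux-big", "os": "linux", "cpu": 8, "memory_gb": 32},
    {"name": "mac", "os": "macos", "cpu": 3, "memory_gb": 14},
]
workflows = [
    {"name": "lint", "required_os": "linux", "min_cpu": 1, "min_memory_gb": 2},
    {"name": "docker-build", "required_os": "linux", "min_cpu": 4, "min_memory_gb": 16},
    {"name": "ios-build", "required_os": "macos", "min_cpu": 3, "min_memory_gb": 8},
]


def is_compatible(workflow, runner):
    """Assumed rule: same OS, and the runner meets both resource minimums."""
    return (
        runner["os"] == workflow["required_os"]
        and runner["cpu"] >= workflow["min_cpu"]
        and runner["memory_gb"] >= workflow["min_memory_gb"]
    )


expected_pairs = {
    (w["name"], r["name"]) for w in workflows for r in runners if is_compatible(w, r)
}
print(sorted(expected_pairs))
```

Loading the real CSVs into the same shape and diffing `expected_pairs` against the pairs in compatibility.csv would surface any runner that is listed as compatible with a job it cannot actually host, or missing for a job it could.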