Pod Placement
Assign pods to nodes in a Kubernetes-style cluster subject to per-node CPU / memory / GPU bin-packing, pairwise tenant anti-affinity, deployment co-location affinity, failure-domain spread, gang-placement atomicity, and topology rack-clique rules. Pure CSP via MiniZinc / Chuffed.
What this template is for
Multi-tenant Kubernetes / SaaS / HPC platform teams place pods (containerised workloads) onto cluster nodes against a stack of rules that don’t fit a clean linear program:
- Resource bin-packing: each node has fixed CPU / memory / GPU capacity, and the sum of pod demands on a node can’t exceed it.
- Tenant anti-affinity: regulated workloads (SEC / OCC / NAIC / FedRAMP) require pairs of tenants to never share a host, so a single noisy-neighbour or compromised node can’t span the isolation boundary.
- Deployment co-location affinity: deployments declared affinity-paired (e.g. shared storage class, low-latency RDMA peering) must co-locate (same node) or share their unplaced state. The `xi == xj` IC forces both to agree on every node’s assignment bit, which means they share a single node when placed and are simultaneously unplaced otherwise.
- Failure-domain spread: replicas of one deployment fan out across zones so a single-zone outage can’t take down a quorum.
- Gang-placement atomicity: a multi-replica deployment is either fully placed (every replica scheduled) or fully unplaced; a partial schedule is worse than nothing for stateful systems with leader elections, quorums, and shared-state protocols.
- Topology rack-clique: distributed-training pods need NVLink-class bandwidth between every pair, so the whole group must land on hosts within the same rack.
This template encodes that stack as a pure constraint satisfaction / optimisation problem: the Prescriptive reasoner (MiniZinc / Chuffed) maximises the number of placed deployments subject to all six rules as hard constraints. The shape applies to every multi-tenant K8s shop, financial-services regulated multi-tenancy, defence / FedRAMP-cleared environments, telco 5G network slicing, and HPC + AI infrastructure that runs gang-scheduled distributed training.
Who this is for
- Platform / infrastructure engineers running multi-tenant Kubernetes, Mesos, or Volcano clusters
- Site-reliability engineers designing failure-domain spread policies and gang-scheduling rules
- HPC / ML-platform teams placing distributed-training jobs across NVLink islands
- Operations researchers learning pure-CSP encodings of pairwise anti-affinity, reified cardinality (the all-or-nothing gang-placement constraint, encoded as a 0/1 deployment-level indicator), and topology-clique constraints
What you’ll build
- A semantic model with `Node`, `Pod`, `Deployment`, `Tenant`, `Zone`, `Rack` concepts, plus three symmetric data relationships: `TenantAntiAffinity`, `DeploymentAffinity`, and `DistributedTraining`
- A binary 2D assignment matrix decision `Pod.on_node(Node, x)` paired with two 0/1 indicators (`Pod.placed`, `Deployment.placed`) that pin each pod’s row sum and each deployment’s gang state
- Three bin-packing ICs (`sum(Pod.cpu_millicores * x).per(Node) <= Node.cpu_millicores`, and analogues for memory and GPU)
- A pairwise tenant anti-affinity IC, `xi + xj <= 1` per node for every (Pi, Pj) pair whose deployments’ tenants are anti-affine, expressed directly over the matrix without big-M reformulation
- A pairwise deployment co-location affinity IC that forces affinity-paired pods to agree on every node’s assignment bit
- A per-(deployment, zone) failure-domain spread IC bounded by `Deployment.max_per_zone`: by default `ceil(replicas / num_zones)`, with a per-row `max_per_zone_override` column in `deployments.csv` for deployments (e.g. distributed-training groups) that need a wider blast radius to remain feasible
- A reified-cardinality gang-placement IC `sum(Pod.placed).per(Deployment) == Deployment.replicas * Deployment.placed`
- A pairwise topology rack-clique IC for distributed-training groups: `xa + xb <= 1` for every (Pa, Pb, Na, Nb) tuple where the pods are in the same training group and the nodes are in different racks
- A linear `maximize(sum(Deployment.placed))` objective the CSP optimises
- Post-solve verification via `problem.verify()` confirming every IC holds in the returned solution, plus a `termination_status() == "OPTIMAL"` assertion
What’s included
The bundled CSVs are illustrative, fully synthetic demo data tuned so multiple ICs bind at the optimum — rack-clique forces the distributed-training group onto one rack, CPU / memory / GPU bin-packing then pin it to a single node at exact 100% utilisation on all three resources, anti-affinity partitions the alpha-beta and gamma-delta tenant pods across disjoint node sets, deployment co-location affinity co-locates the cache pair, and the spread cap binds for every multi-replica non-overridden deployment. Swap in your own cluster topology and workload to apply the template to a real cluster.
- `pod_placement.py`: main script with concepts, decisions, constraints, the solver call, and the post-solve inspection
- `data/nodes.csv`: 8 nodes across 2 zones (`us-east-1a`, `us-east-1b`) and 4 racks; six general-purpose nodes (12000 millicores, 49152 MiB) and two GPU nodes (8000 millicores, 32768 MiB, 4 GPUs each). Total cluster: 88000 millicores, 360448 MiB, 8 GPUs
- `data/tenants.csv`: 4 tenants (`tenant_alpha`, `tenant_beta`, `tenant_gamma`, `tenant_delta`)
- `data/tenant_anti_affinity.csv`: 2 anti-affine pairs (`tenant_alpha` x `tenant_beta`, `tenant_gamma` x `tenant_delta`)
- `data/deployments.csv`: 12 deployments across the 4 tenants: 8-replica web services, 4-replica api services, single-replica caches, a 6-replica database (`db_gamma`), and a 4-pod distributed-training job (`ml_gamma_train`). One column, `max_per_zone_override`, is a nullable per-row spread cap; the bundled data sets it to `4` for `ml_gamma_train` so the same-rack training group can land entirely in one zone (the GPU nodes are concentrated there)
- `data/deployment_affinity.csv`: 1 affinity-paired pair (`cache_alpha` x `cache_delta`, motivated by a shared storage class)
- `data/pods.csv`: 50 pods with per-pod CPU / memory / GPU demands aligned to deployment role (web 1000m / 2048 MiB, api 1500m / 3072 MiB, cache 500m / 1024 MiB, db 2000m / 4096 MiB, ml 2000m / 8192 MiB / 1 GPU)
- `data/distributed_training.csv`: 6 pairs forming the 4-pod NVLink clique for `ml_gamma_train`
- `pyproject.toml`: Python package configuration
Prerequisites
Access
- A Snowflake account that has the RAI Native App installed.
- A Snowflake user with permissions to access the RAI Native App.
Tools
- Python >= 3.10
Quickstart
- Download ZIP:

      curl -O https://docs.relational.ai/templates/zips/v1/pod_placement.zip
      unzip pod_placement.zip
      cd pod_placement

- Create venv:

      python -m venv .venv
      source .venv/bin/activate
      python -m pip install --upgrade pip

- Install:

      python -m pip install .

- Configure:

      rai init

- Run:

      python pod_placement.py

Expected output (the solver maximises the number of placed deployments; the specific (pod, node) tuples and per-node utilisation may vary across solver versions when multiple placements tie at the optimum). The script first prints the formulation (a few thousand lines for the bundled data, omitted here for brevity), then the solve-result block, then the per-node utilisation, the per-pod placement, and any unplaced deployments. A representative run looks like:

    Solve result:
    • status: OPTIMAL
    • objective: 12
    • solve time: ~1s
    • num_points: 1
    • solver: MiniZinc_unknown

    Per-node utilization:
       node_id       node  cpu_used  cpu_cap  memory_used  gpu_used
    0        1  gen-a1-n1      5500    12000        11264         0
    1        2  gen-a1-n2     11000    12000        22528         0
    2        3  gen-a2-n1      8500    12000        17408         0
    3        4  gen-a2-n2      3500    12000         7168         0
    4        5  gpu-b1-n1      5500     8000        11264         0
    5        6  gen-b1-n2     11000    12000        22528         0
    6        7  gpu-b2-n1      8000     8000        32768         4
    7        8  gen-b2-n2     11000    12000        22528         0

    Placed pods (pod_id -> node_id):
     pod_id  node_id
    ...

    Unplaced deployments (if any):
    (empty)

The optimum places all `12` deployments and binds four ICs at the chosen GPU node: rack-clique pins `ml_gamma_train`’s 4 pods onto a single rack, and CPU / memory / GPU bin-packing then pin them all to one node within that rack at exactly 100% utilisation on every resource (`gpu-b2-n1` here: 4 × 2000m = 8000m CPU = the node’s CPU cap, 4 × 8192 MiB = 32768 MiB memory = the node’s memory cap, 4 × 1 = 4 GPUs = the node’s GPU cap, a tight three-resource fit). The training group could equivalently land on `gpu-b1-n1` (same shape, other GPU rack) and the solver may tie-break either way across versions; what’s solver-invariant is “all 4 pods on the same rack on a single node, 100% utilised on three resources”.

The `max_per_zone_override = 4` in `deployments.csv` is what unlocks this. Without it, four things combine: the default spread cap `ceil(4 / 2) = 2`, the all-or-nothing gang rule, the rack-clique requirement, and the GPU racks all sitting in `us-east-1b` (a rack belongs to exactly one zone, so all-pods-on-one-rack implies all-pods-in-one-zone). Together they would make `ml_gamma_train` fully unplaced and the objective `11`. The override is the operator’s explicit acceptance of a wider blast radius for the training group: a single-zone outage takes the whole training job, which is the standard tradeoff for an NVLink-tight ML deployment.

Other counts are also solver-invariant at the optimum: each multi-replica non-overridden deployment lands at exactly `ceil(replicas / 2)` pods per zone (the spread cap binds for `web_*` at 4+4, `api_*` at 2+2, `db_gamma` at 3+3), the affinity-paired caches (`cache_alpha`, `cache_delta`) co-locate on a single node, and alpha/beta and gamma/delta tenant pods partition the cluster into disjoint node sets.
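The three-resource fit described above can be checked by hand. A minimal sketch in plain Python (not part of `pod_placement.py`; the dictionary names are illustrative), using the per-pod ml demand and GPU-node capacity quoted in this README:

```python
# Per-pod demand of an ml_gamma_train pod and capacity of one GPU node,
# taken from the figures quoted in this README. Names are illustrative.
ml_pod = {"cpu_millicores": 2000, "memory_mib": 8192, "gpus": 1}
gpu_node = {"cpu_millicores": 8000, "memory_mib": 32768, "gpus": 4}
replicas = 4

# Fraction of each node resource consumed if all replicas land on the node.
utilisation = {
    resource: replicas * ml_pod[resource] / gpu_node[resource]
    for resource in gpu_node
}

for resource, fraction in utilisation.items():
    print(f"{resource}: {fraction:.0%}")
# cpu_millicores: 100%
# memory_mib: 100%
# gpus: 100%
```

Because every ratio is exactly 1, no other pod can share the node that hosts the training group.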
Template structure
    .
    ├── README.md
    ├── pyproject.toml
    ├── pod_placement.py
    └── data/
        ├── nodes.csv
        ├── tenants.csv
        ├── tenant_anti_affinity.csv
        ├── deployments.csv
        ├── deployment_affinity.csv
        ├── pods.csv
        └── distributed_training.csv

How it works
The CSP picks a node for each pod via a binary 2D assignment matrix; everything else is plain relational arithmetic over that matrix. The script consists of these patterns:
Binary 2D matrix as the primary decision. Each pod gets one binary x per node, materialised as Pod.on_node(Node, x). A per-pod cardinality IC pins the row sum to a 0/1 placement indicator:

    Pod.on_node = model.Property(f"{Pod} runs on {Node} if {Integer:assigned}")
    Pod.placed = model.Property(f"{Pod} has {Integer:placed}")

    x = Integer.ref()
    problem.solve_for(Pod.on_node(Node, x), type="bin", name=["x", Pod.id, Node.id])
    problem.solve_for(Pod.placed, type="bin", name=["placed", Pod.id])

    placement_coupling_ic = model.where(Pod.on_node(Node, x)).require(
        sum(x).per(Pod) == Pod.placed
    )

An integer-valued `Pod.node_id` decision would read more naturally for the “different-node” pairwise pattern (`Pi.node_id != Pj.node_id`), but per-node aggregates would then force a `where(Pod.node_id == Node.id)`-shaped binding that mixes a decision variable with a data property. The prescriptive rewriter does not lower that form today (the planogram template’s implies-cascade table lookup is the canonical workaround). The 2D matrix sidesteps the rewriter limitation: every aggregate becomes a plain relational sum over `x`.
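To make the matrix view concrete, here is a small plain-Python sketch (not the RelationalAI API; pod names, node names, and CPU figures are invented for illustration) showing that both the row-sum coupling and a per-node capacity aggregate are ordinary sums over the 0/1 matrix:

```python
# Toy 0/1 assignment matrix: x[pod][node] == 1 means the pod runs on that node.
# Pod and node names, and the CPU figures, are made up for illustration.
x = {
    "p1": {"n1": 1, "n2": 0},
    "p2": {"n1": 1, "n2": 0},
    "p3": {"n1": 0, "n2": 0},  # unplaced pod: its row sums to 0
}
cpu_demand = {"p1": 1000, "p2": 1500, "p3": 500}
cpu_capacity = {"n1": 4000, "n2": 4000}

# The row sum per pod plays the role of Pod.placed (0 or 1).
placed = {pod: sum(row.values()) for pod, row in x.items()}
assert all(value in (0, 1) for value in placed.values())

# Per-node aggregate: sum of demand * x over pods, as in the bin-packing ICs.
cpu_used = {
    node: sum(cpu_demand[pod] * x[pod][node] for pod in x)
    for node in cpu_capacity
}
assert all(cpu_used[node] <= cpu_capacity[node] for node in cpu_capacity)

print(placed)    # {'p1': 1, 'p2': 1, 'p3': 0}
print(cpu_used)  # {'n1': 2500, 'n2': 0}
```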
Bin-packing is one sum(... * x).per(Node) per resource. Three near-identical ICs, one for each of CPU / memory / GPU:

    cpu_capacity_ic = model.where(Pod.on_node(Node, x)).require(
        sum(Pod.cpu_millicores * x).per(Node) <= Node.cpu_millicores
    )

Pairwise anti-affinity is the CSP signature. For every ordered pod-pair (Pi, Pj) whose deployments’ tenants are anti-affine, at most one of (Pi, Pj) is on any given node:

    Pi = Pod
    Pj = Pod.ref()
    xi = Integer.ref()
    xj = Integer.ref()
    anti_affinity_ic = model.where(
        Pi.id < Pj.id,
        TenantAntiAffinity(Pi.deployment.tenant, Pj.deployment.tenant),
        Pi.on_node(Node, xi),
        Pj.on_node(Node, xj),
    ).require(xi + xj <= 1)

The ordered `Pi.id < Pj.id` filter canonicalises each pair so the IC fires once per {Pi, Pj} unordered pair. `TenantAntiAffinity` is closed symmetrically at definition time (the two `model.define` rules), so the predicate matches in either argument order. The same pairwise-pod / pairwise-node shape encodes the deployment co-location affinity IC (`xi == xj` on every node) and the topology rack-clique IC (`xa + xb <= 1` whenever the two pods are in a distributed-training group and the two nodes are in different racks).
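The pairwise rule can be replayed outside the solver as a sanity check. The following is a hedged plain-Python sketch, not the template's code: the pod ids, node names, and the `violations` helper are hypothetical, and the symmetric closure is built with a set union rather than `model.define` rules.

```python
from itertools import combinations

# Hypothetical toy data: pod -> tenant, pod -> node, and one anti-affine pair.
pod_tenant = {"p1": "tenant_alpha", "p2": "tenant_beta", "p3": "tenant_alpha"}
pod_node = {"p1": "n1", "p2": "n2", "p3": "n1"}
declared_pairs = [("tenant_alpha", "tenant_beta")]

# Close the relation symmetrically so either argument order matches.
anti_affine = set(declared_pairs) | {(b, a) for a, b in declared_pairs}


def violations(pod_node, pod_tenant, anti_affine):
    """Return unordered pod pairs that share a node despite anti-affinity."""
    bad = []
    # combinations over sorted ids visits each unordered pair once (Pi.id < Pj.id).
    for pi, pj in combinations(sorted(pod_node), 2):
        if (pod_tenant[pi], pod_tenant[pj]) not in anti_affine:
            continue
        # On a shared node both bits are 1, so xi + xj == 2 breaks xi + xj <= 1.
        if pod_node[pi] == pod_node[pj]:
            bad.append((pi, pj))
    return bad


print(violations(pod_node, pod_tenant, anti_affine))
# []
print(violations({**pod_node, "p2": "n1"}, pod_tenant, anti_affine))
# [('p1', 'p2'), ('p2', 'p3')]
```

The second call moves the beta pod onto `n1`, where it collides with both alpha pods; the `(p2, p3)` hit relies on the symmetric closure, since only `(tenant_alpha, tenant_beta)` was declared.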
Gang placement is a reified cardinality IC. A deployment is either fully placed (sum(Pod.placed).per(Deployment) == Deployment.replicas) or fully unplaced (sum(Pod.placed).per(Deployment) == 0). The unified form uses a 0/1 deployment-level indicator:

    Deployment.placed = model.Property(f"{Deployment} has {Integer:placed}")
    problem.solve_for(Deployment.placed, type="bin", name=["dep_placed", Deployment.id])

    gang_placement_ic = model.where(Pod.deployment == Deployment).require(
        sum(Pod.placed).per(Deployment) == Deployment.replicas * Deployment.placed
    )

`Deployment.placed == 0` forces every replica’s `Pod.placed == 0` (sum == 0); `Deployment.placed == 1` forces every replica’s `Pod.placed == 1` (sum == replicas). No big-M, no aux indicator stack.
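A short plain-Python sketch of why the single equality rules out partial schedules. The `gang_ok` helper is hypothetical and only mirrors the arithmetic of the IC; it is not part of the template.

```python
def gang_ok(pod_placed, replicas, deployment_placed):
    """Check sum(Pod.placed) == replicas * Deployment.placed for one deployment."""
    return sum(pod_placed) == replicas * deployment_placed


replicas = 4

# Fully placed and fully unplaced each admit a consistent indicator value.
print(gang_ok([1, 1, 1, 1], replicas, 1))  # True
print(gang_ok([0, 0, 0, 0], replicas, 0))  # True

# A partial schedule (2 of 4) matches neither indicator value, so it is infeasible.
partial = [1, 1, 0, 0]
print(any(gang_ok(partial, replicas, d) for d in (0, 1)))  # False
```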
Failure-domain spread is a two-key sum(x).per(Deployment, Zone). With max_per_zone pre-computed (default ceil(replicas / num_zones), or the max_per_zone_override value when set on the row) and joined onto Deployment, the IC reads as a plain relational inequality:

    spread_ic = model.where(
        Pod.on_node(Node, x),
        Pod.deployment == Deployment,
        Node.zone == Zone,
    ).require(sum(x).per(Deployment, Zone) <= Deployment.max_per_zone)

The override is a deployment-level escape hatch for cases where the default spread cap collides with another constraint. The bundled data uses it on `ml_gamma_train` so the rack-clique IC (which forces all distributed-training pods onto a single rack, and therefore into a single zone) doesn’t make the deployment infeasible.
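The pre-computation of the cap might look like the sketch below. This is an assumption about shape, not the script's actual code: `max_per_zone` and `spread_ok` are hypothetical helpers, and a missing override is represented as `None`.

```python
import math
from collections import Counter


def max_per_zone(replicas, num_zones, override=None):
    """Default cap is ceil(replicas / num_zones); a set override wins."""
    if override is not None:
        return override
    return math.ceil(replicas / num_zones)


def spread_ok(pod_zones, cap):
    """pod_zones lists the zone of each placed replica of one deployment."""
    return all(count <= cap for count in Counter(pod_zones).values())


print(max_per_zone(8, 2))              # 4
print(max_per_zone(6, 2))              # 3
print(max_per_zone(4, 2, override=4))  # 4

# An 8-replica web service under the default cap of 4: 4+4 passes, 5+3 does not.
cap = max_per_zone(8, 2)
print(spread_ok(["us-east-1a"] * 4 + ["us-east-1b"] * 4, cap))  # True
print(spread_ok(["us-east-1a"] * 5 + ["us-east-1b"] * 3, cap))  # False
```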
All nine ICs are pure relational arithmetic. None of them carry an implies body, so problem.verify() re-evaluates every constraint in the returned solution. The post-solve assertion model.require(problem.termination_status() == "OPTIMAL") then guarantees the solver actually proved optimality rather than timing out.
Customize this template
- Use your own data by replacing the seven CSV files with your cluster topology and workload. The constraint structure does not change. The data requirements are: primary keys must be unique across `nodes.csv` / `tenants.csv` / `deployments.csv` / `pods.csv`; every foreign key (`deployments.tenant_id`, `pods.deployment_id`, `tenant_anti_affinity.tenant_*`, etc.) must resolve; node capacities must be positive; `pods.csv` row counts per `deployment_id` must equal `deployments.csv.replicas` (the gang IC pins them); each rack belongs to exactly one zone; and any `max_per_zone_override` value, when present, must be a positive integer no greater than the row’s `replicas`.
- Change the objective. `problem.maximize(sum(Deployment.placed))` maximises the number of placed deployments. Swap for `problem.maximize(sum(Deployment.replicas * Deployment.placed))` to weight by replica count, `problem.minimize(...)` over a per-node fragmentation cost (sum-of-squared-utilisation-gaps), or pure satisfaction (`problem.satisfy()` with a `Deployment.placed == 1` hard floor) when every deployment must run.
- Tier-aware placement. Mark pods with `gpu_units > 0` and add an IC forcing them onto GPU-capable nodes only, e.g. `model.where(Pod.on_node(Node, 1), Pod.gpu_units >= 1).require(Node.gpu_units >= 1)` (matches the `model.where(...).require(...)` chaining used by every IC in `pod_placement.py`). Conversely, mark some nodes as “general-purpose only” and forbid GPU workloads there.
- Migration cost under churn. Add a `Pod.current_node` data property reflecting the existing assignment, then introduce a per-pod “moved” indicator (`moved = (1 - Pod.on_node(current_node, x))`) and minimise the total move count; multi-solution then enumerates the N least-disruptive placements.
- NVLink-clique with explicit edges. The bundled `data/distributed_training.csv` lists every pair in the training group, encoding rack-clique implicitly via the pairwise IC. To model a true edge-restricted clique (e.g. NVLink-2 connectivity between specific GPU pairs only), pre-compute the set of `(node_a, node_b)` pairs that lack a direct NVLink edge in Python and load them as a `NoNVLink` data relationship; then drop the `Na.rack != Nb.rack` clause from `rack_clique_ic` and replace it with `NoNVLink(Na, Nb)` to forbid co-placement on any pair of hosts without a direct NVLink path.
- Spot / preemptible tier. Mark a subset of pods as preemptible (`Pod.is_preemptible == 1`) and let them displace lower-priority workloads when capacity is tight. Express as a per-node tiered capacity: preemptible pods consume “remaining” capacity, on-demand pods consume the primary allotment.
- Tighten or loosen the spread cap. The default `max_per_zone = ceil(replicas / num_zones)` is the tightest cap that still admits a full placement when replicas fan out evenly across zones. For a strict 1-per-zone policy (so even a hot zone can’t host two replicas of the same service), set `max_per_zone_override = 1` for that row in `deployments.csv`; this only admits a placement when `replicas` does not exceed the number of zones, otherwise the gang rule leaves the deployment unplaced. For a deployment whose other ICs force same-zone placement (the bundled `ml_gamma_train` is the canonical example: rack-clique plus all-GPU-racks-in-one-zone), set `max_per_zone_override = replicas` to opt out of spread for that deployment alone, leaving the cluster-wide default intact for everything else.
- Multi-cluster (federated placement). Add a `Cluster` concept above `Zone` and an outer `Pod.on_cluster(Cluster)` decision; spread / anti-affinity then have to lift to cluster-level constraints. Karmada-style multi-cluster placement is the standard precedent.
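Before swapping in your own CSVs, it can help to pre-check some of the data requirements listed under "Use your own data". The sketch below is a hypothetical stand-alone validator, not something shipped with the template. It uses only column names mentioned in this README (`deployment_id`, `replicas`, `max_per_zone_override`, `pod_id`); the real files may carry more columns, and the inline rows are made up.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-ins for deployments.csv and pods.csv. Column names follow
# those mentioned in this README; the rows are invented for illustration.
deployments_csv = io.StringIO(
    "deployment_id,replicas,max_per_zone_override\n"
    "web_alpha,2,\n"
    "ml_gamma_train,2,2\n"
)
pods_csv = io.StringIO(
    "pod_id,deployment_id\n"
    "1,web_alpha\n"
    "2,web_alpha\n"
    "3,ml_gamma_train\n"
)


def validate(deployments_file, pods_file):
    """Return a list of human-readable problems; empty means the checks passed."""
    found = []
    deployments = {r["deployment_id"]: r for r in csv.DictReader(deployments_file)}
    pods = list(csv.DictReader(pods_file))
    pods_per_deployment = Counter(p["deployment_id"] for p in pods)

    for pod in pods:  # every foreign key must resolve
        if pod["deployment_id"] not in deployments:
            found.append(f"pod {pod['pod_id']}: unknown deployment")

    for dep_id, row in deployments.items():
        replicas = int(row["replicas"])
        if pods_per_deployment[dep_id] != replicas:  # the gang IC relies on this
            found.append(
                f"{dep_id}: {pods_per_deployment[dep_id]} pods, {replicas} replicas"
            )
        override = row["max_per_zone_override"]
        if override and not 1 <= int(override) <= replicas:
            found.append(f"{dep_id}: override {override} out of range")
    return found


problems = validate(deployments_csv, pods_csv)
print(problems)
# ['ml_gamma_train: 1 pods, 2 replicas']
```

To run it against real files, pass open file handles (for example `open("data/deployments.csv", newline="")`) instead of the `io.StringIO` stand-ins.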
Learn more
Production cluster managers (the field this template models):
- Verma et al., Large-scale cluster management at Google with Borg, EuroSys 2015. Constraint-driven cluster scheduling at scale; affinity / anti-affinity / failure-domain spread as first-class concepts.
- Hindman et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011. Two-level resource scheduling — the conceptual predecessor of every modern cluster manager.
- Volcano, open-source batch scheduler for Kubernetes (CNCF incubating) with first-class gang-scheduling, queue fairness, and topology-aware placement.
Kubernetes scheduling surface (the canonical vocabulary):
- Kubernetes scheduler framework: affinity & anti-affinity. Official documentation of `nodeAffinity`, `podAffinity`, `podAntiAffinity`.
- Kubernetes topology-spread constraints. Failure-domain spread as a first-class Kubernetes API.
- Kueue documentation. Kubernetes-native queueing with placement constraints.
- Karmada multi-cluster placement. Cluster-affinity scheduling for multi-cluster Kubernetes deployments.
Constraint-programming background (the academic backbone for the encoding choices):
- CSPLib problem 057 — RCPSP. Resource-constrained project scheduling — the closest classical CSP benchmark to multi-resource bin-packing with side constraints.
- Mehta & van Beek, Mapping Problems with Side Constraints to SAT, CP 2007. Academic precedent for mapping placement problems with pairwise / clique / cardinality side constraints to discrete satisfaction.
Troubleshooting
Solver returns INFEASIBLE
- A deployment’s replica count may exceed what the failure-domain spread allows. With two zones and a replica count of 5, the default `max_per_zone = ceil(5/2) = 3` requires at least one zone to host 3 replicas of that deployment; if no zone has the capacity to fit three of them, the model is infeasible.
- Tenant anti-affinity may over-partition the cluster. If `tenant_a` and `tenant_b` are anti-affine and your cluster has only enough capacity for one of them on each node, the model can’t fit both. Loosen by adding more nodes, dropping the anti-affinity pair, or shrinking workload demands.
- The bundled affinity-paired deployments must be able to land on a single node. If `cache_alpha` and `cache_delta` have a combined demand that doesn’t fit on any one node, the affinity IC is unsatisfiable. Check the affinity-paired demands against the largest node’s capacity.
- Distributed-training rack-cliques interact with failure-domain spread when the GPU racks all sit in the same zone. On the bundled data, both GPU racks (`rack-b1`, `rack-b2`) are in zone `us-east-1b`, so a 4-pod training job that must share a rack also ends up sharing that zone, but the default `max_per_zone = ceil(4/2) = 2` allows only 2 of the 4 replicas in any one zone. Combined with gang placement (all 4 or none), the deployment would be left fully unplaced and the objective would drop to `11`. The bundled `deployments.csv` sets `max_per_zone_override = 4` on `ml_gamma_train` to opt out of spread for the training group only, which is what lets the bundled solve hit `objective = 12`. To reproduce the infeasibility for the training deployment, blank that override and re-run; to widen the fix to your own workload, raise `max_per_zone_override` on the training row, or drop the rack-clique IC AND spread the GPU nodes across zones (so the 4 pods can split across racks in different zones and still satisfy spread). Note that simply spreading the GPU nodes across zones is not enough on its own: a rack belongs to exactly one zone, so as long as rack-clique forces all 4 pods onto a single rack, those 4 pods still share a zone.
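The interaction in the last item can be reasoned through with a few lines of plain Python. This is a simplified sketch under stated assumptions: each GPU rack is taken to offer 4 GPUs, each training pod needs one GPU, and `feasible_racks` is a hypothetical helper that ignores CPU and memory.

```python
import math


def feasible_racks(replicas, num_zones, rack_zone, rack_gpus, override=None):
    """Racks that can host the whole gang without breaking the spread cap."""
    cap = override if override is not None else math.ceil(replicas / num_zones)
    feasible = []
    for rack, zone in rack_zone.items():
        # Rack-clique plus gang placement: every replica lands on this rack,
        # so this rack's single zone hosts all `replicas` pods.
        pods_in_zone = {zone: replicas}
        if rack_gpus[rack] >= replicas and pods_in_zone[zone] <= cap:
            feasible.append(rack)
    return feasible


# Assumed toy figures: 4 GPUs per GPU rack, one GPU per training pod.
rack_gpus = {"rack-b1": 4, "rack-b2": 4}
bundled = {"rack-b1": "us-east-1b", "rack-b2": "us-east-1b"}
moved = {"rack-b1": "us-east-1a", "rack-b2": "us-east-1b"}

print(feasible_racks(4, 2, bundled, rack_gpus))              # []
print(feasible_racks(4, 2, moved, rack_gpus))                # []
print(feasible_racks(4, 2, bundled, rack_gpus, override=4))  # ['rack-b1', 'rack-b2']
```

The second call moves one GPU rack to the other zone and still finds no feasible rack, which matches the note above: as long as rack-clique holds, only the override (or dropping rack-clique) changes the outcome.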
One deployment is unplaced; how do I find out why?
- The optimal solution can leave a deployment unplaced if placing it would require violating another IC. Check the printed “Unplaced deployments” block. Common binding constraints: failure-domain spread (a deployment’s GPU/CPU demand is concentrated in fewer zones than `max_per_zone` allows, in which case raise the row’s `max_per_zone_override` if that’s the right operational tradeoff); rack-clique (a distributed-training group’s GPU requirement exceeds any single rack’s capacity, or all GPU racks sit in one zone so a single-rack placement collides with spread, which is the exact intersection where the bundled `ml_gamma_train` lives and why its row sets the override); gang placement (the deployment’s replicas can’t all fit somewhere, so the all-or-none rule leaves it fully unplaced).
- To isolate which IC is binding, drop one IC at a time and re-solve: comment out `problem.satisfy(rack_clique_ic)`, run, and see whether the deployment then places. Repeat with the spread and gang ICs. The first IC whose removal allows placement is the binding one.
The `Pod.deployment.tenant` chained navigation produces a runtime error on inspection
- Multi-hop property navigation inside `model.select(...)` combined with a decision-variable filter (`where(Pod.on_node(Node, 1))`) can produce a query the engine refuses to evaluate within its runtime budget. The bundled inspection avoids the issue by keeping each query to single-hop navigation (`Node.id`, `Pod.id`).
- If you need richer post-solve reporting (e.g. a `(pod, deployment_name, tenant_name, node, zone, rack)` table), pull the placement into pandas and join in Python: `placements = model.select(Pod.id, Node.id).where(Pod.on_node(Node, 1)).to_df()`, then `placements.merge(pods_csv, on="pod_id").merge(deployments_csv, on="deployment_id")...`.
Import error or AttributeError on relationalai
- Confirm your virtual environment is active: `which python` should point to `.venv`.
- Reinstall dependencies: `python -m pip install .`. The pinned version (`relationalai==1.1.0`) ships the `solve_info()`, `verify()`, and `where().require()` APIs this template uses; older versions lack them and produce attribute errors.
FileNotFoundError on a CSV
- The script resolves data paths as `Path(__file__).parent / "data"`. Run `python pod_placement.py` from the unzipped template root, not from a parent directory.
- Confirm `data/` contains `nodes.csv`, `tenants.csv`, `tenant_anti_affinity.csv`, `deployments.csv`, `deployment_affinity.csv`, `pods.csv`, and `distributed_training.csv`.
Authentication or configuration errors
- Run `rai init` to create or update your RelationalAI/Snowflake configuration.
- If you have multiple profiles, set `export RAI_PROFILE=<your_profile>`.
MiniZinc solver not available
- This template uses the MiniZinc / Chuffed constraint solver. Ensure the RAI Native App version supports MiniZinc.
Solver hits the time limit (status is not `OPTIMAL`) and the script crashes
- The post-solve `model.require(problem.termination_status() == "OPTIMAL")` is a hard assertion: if the solver hits `time_limit_sec` and returns a feasible-but-not-proven-optimal result (`SOLUTION_LIMIT` or `LOCALLY_SOLVED`), the script raises rather than printing a partial solution.
- For a tutorial run on the bundled data this is the right posture (the oracle is “the solver proved 12/12”). For larger user data, either (a) raise `time_limit_sec` until the solver proves optimality, or (b) replace `model.require(...)` with a `print(...)` warning and let the inspection blocks run on the best feasible placement found.
A distributed-training group is silently splitting across racks
- The `rack_clique_ic` fires once per row in `distributed_training.csv`. The IC enforces “same rack” only for the pairs of pods listed there; it does NOT close the relation transitively. The bundled CSV provides all `C(4, 2) = 6` pairs for the 4-pod `ml_gamma_train` group, so every pair is constrained. If you supply only a spanning tree (e.g. 3 edges instead of 6 for a 4-pod group), non-adjacent pods can legally land on different racks.
- To model an N-pod training group, enumerate all `C(N, 2)` pairs in `distributed_training.csv`. Alternatively, pre-compute the transitive closure of the group graph in Python before passing it to `model.data(...)`.
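One way to pre-compute the closure is sketched below, assuming the edge list is available as pairs of pod ids. The helper names and the pod ids 47 to 50 are hypothetical; the result would still need to be written out in whatever column layout `distributed_training.csv` uses.

```python
from itertools import combinations


def all_pairs(pod_ids):
    """Every unordered pair in the group: C(N, 2) rows."""
    return list(combinations(sorted(pod_ids), 2))


def groups_from_edges(edges):
    """Connected components of the edge list, found with a small union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# A spanning tree over four hypothetical pod ids: only 3 of the 6 pairs.
spanning_tree = [(47, 48), (48, 49), (49, 50)]
closed = [
    pair
    for group in groups_from_edges(spanning_tree)
    for pair in all_pairs(group)
]
print(len(closed))  # 6
print(closed)
# [(47, 48), (47, 49), (47, 50), (48, 49), (48, 50), (49, 50)]
```

Feeding the closed pair list to the rack-clique IC constrains every pair in each group, including pods that were not adjacent in the original edge list.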