# waldur-site-agent-rancher-kc-crd

CRD-driven Rancher + Keycloak membership-sync plugin for Waldur Site Agent. It translates Waldur Resource + ResourceProject + UserRole state into `ManagedRancherProject` Custom Resources; the `rancher-keycloak-operator` running inside the target Kubernetes cluster owns the actual Rancher project + Keycloak group lifecycle.
## Expected operator

This plugin only writes CRs. It relies on `waldur/rancher-keycloak-operator` running in the target Kubernetes cluster; that operator owns the `ManagedRancherProject` and `RancherProjectInventory` CRD definitions and is the only component that talks to the Rancher and Keycloak APIs.

Minimum operator version compatible with the current plugin: **0.3.0**. Recommended: **0.3.1+**, which fixes a metadata-sync gap where `description` / `organization` / `projectSlug` changes never reached Rancher after the initial create. The plugin no longer emits `spec.namespace`, which 0.2.x operators relied on for namespace creation; earlier versions also lack the audit fields and the stale-project-ID cleanup fallback that the plugin assumes.

The operator's Helm chart lives in its own repo under `helm/rancher-keycloak-operator/`; install instructions are in the Setup section below.
## Scope

- **membership_sync_backend only.** This plugin handles user-to-project role bindings. It does not provision or terminate Rancher clusters, does not process orders, and does not report usage. Order processing and reporting are out of scope; if the offering needs them, they have to be wired through other backends.
- **Cluster pre-exists.** Each Waldur Resource's `backend_id` is its Rancher cluster ID; this plugin does not stand up Rancher clusters.
- **Opt-in per offering.** Set `membership_sync_backend: rancher-kc-crd` in the offering's site-agent config to route its membership sync through this plugin.
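The opt-in looks roughly like this in the offering stanza (a minimal sketch; the surrounding key layout follows the usual `waldur-site-agent-config.yaml` convention and is an assumption here):

```yaml
offerings:
  - name: "Rancher clusters"
    membership_sync_backend: "rancher-kc-crd"   # route membership sync here
    # order-processing / reporting backends, if any, are configured separately
```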
## Architecture

```mermaid
flowchart LR
    subgraph Waldur ["Waldur Mastermind"]
        WR[Resource]
        WRP[ResourceProject]
        WUR[UserRole]
        WR --- WRP --- WUR
    end
    subgraph Agent ["Site Agent / this plugin"]
        PR[pull_resource]
        TR[translator.build_cr_spec]
        CC[CrdClient.apply / list / delete]
        SR[status_reader]
    end
    subgraph K8s ["Kubernetes cluster"]
        CR[(ManagedRancherProject CR)]
        OP[rancher-keycloak-operator]
        CR -. watched by .-> OP
    end
    subgraph External ["External APIs"]
        RA[Rancher Mgmt API]
        KC[Keycloak Admin API]
    end
    Waldur -- "SDK GET<br/>(provider-resource-projects,<br/>list_users)" --> PR
    PR --> TR --> CC -- "kube apply" --> CR
    CR --> SR
    SR -- "status.* + drift" --> Waldur
    OP -- "v3 REST" --> RA
    OP -- "Admin REST" --> KC
```
The plugin's responsibilities stop at writing CRs and reading their status. The operator owns:

- creating / adopting Rancher projects, namespaces, and resource quotas
- creating Keycloak parent + child groups and binding them to the Rancher project via `ProjectRoleTemplateBinding` (PRTB) with `groupPrincipalId: keycloakoidc_group://<group-name>`
- adding/removing users to/from Keycloak groups based on `spec.keycloak.roleBindings[].members[]`
- cascading cleanup on CR delete (`@kopf.on.delete`)
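For orientation, here is a hypothetical `ManagedRancherProject` in roughly the shape the plugin emits. Only the pieces quoted elsewhere in this document (`spec.keycloak.roleBindings[].members[]`, `clusterId`, the `waldur.io/resource-uuid` label, the `<resource.slug>-<rp.uuid[:8]>` name) are grounded; the apiVersion and the field names around them are illustrative assumptions:

```yaml
apiVersion: waldur.io/v1alpha1              # assumed group/version
kind: ManagedRancherProject
metadata:
  name: genomics-2026-8706dd1a              # <resource.slug>-<rp.uuid[:8]>
  namespace: waldur-system
  labels:
    waldur.io/resource-uuid: 1f6e0c2a-0000-0000-0000-000000000000  # pruning selector
spec:
  clusterId: c-m-glwxdksp                   # Resource.backend_id
  projectName: genomics-2026
  keycloak:
    roleBindings:
      - groupName: c_c-m-glwxdksp_8706dd1a_project_member  # rendered group_name_template
        rancherRole: project-member                        # from role_map
        members: [alice, bob]                              # usernames or UUIDs
```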
## Sequence: end-to-end membership sync

This is what happens on a single `pull_resource` call (i.e. one membership-sync cycle per Resource the offering owns).
```mermaid
sequenceDiagram
    autonumber
    participant SA as Site Agent
    participant PG as plugin
    participant W as Waldur API
    participant K as Kubernetes API
    participant OP as Operator
    participant R as Rancher
    participant KC as Keycloak
    SA->>PG: pull_resource(WaldurResource)
    PG->>W: GET /api/marketplace-provider-resource-projects/?resource_uuid=…
    W-->>PG: [RP1, RP2, …]
    loop per ResourceProject
        PG->>W: GET /api/marketplace-provider-resource-projects/<uuid>/list_users/
        W-->>PG: [UserRole, …]
        PG->>PG: build_cr_spec(resource, RP, users)
        PG->>K: kube apply ManagedRancherProject (server-side)
        K-->>OP: watch event (create/update)
        OP->>R: ensure project (create or adopt by name)
        OP->>R: ensure namespace + ResourceQuota
        OP->>KC: ensure parent group, child groups
        OP->>R: ensure PRTB(s) bound to KC groups
        OP->>KC: GET /users/{id} or /users?username=X<br/>(per member)
        OP->>KC: add/remove user from KC group
        OP->>K: patch status.conditions + status.keycloakRoleBindings.syncedMembers
        PG->>K: kube get ManagedRancherProject
        K-->>PG: status.* (synced members per binding)
    end
    PG->>K: list ManagedRancherProject by label waldur.io/resource-uuid
    K-->>PG: [all CRs for this resource]
    PG->>K: kube delete (CRs whose RP no longer in Waldur)
    K-->>OP: watch event (delete)
    OP->>R: delete PRTBs + project (with stale-ID find-by-name fallback)
    OP->>KC: delete child groups, parent group if empty
    PG-->>SA: BackendResourceInfo(users=union of synced members)
```
Key invariants:

- **Idempotent at every step.** Re-applying the same CR yields the same Rancher project and Keycloak groups. Re-syncing with the same user set is a no-op on the Rancher/Keycloak side.
- **One CR per ResourceProject**, named `<resource.slug>-<rp.uuid[:8]>`. Stable across renames.
- **Orphan pruning by label.** CRs are stamped with `metadata.labels["waldur.io/resource-uuid"]`; on each sync the plugin computes the set of expected CR names from the current Waldur RP list, then deletes any label-matching CR outside that set.
- **Lookup-only for users.** The operator never creates Keycloak users; it only binds existing ones. Users absent from Keycloak get a `WARNING User <id> not found in Keycloak` and are skipped (the PRTB and the group are still created; they're just empty for that user).
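The naming and pruning invariants can be sketched in a few lines of Python (function names here are illustrative, not the plugin's actual API):

```python
def cr_name(resource_slug: str, rp_uuid: str) -> str:
    # One CR per ResourceProject, named <resource.slug>-<rp.uuid[:8]>
    return f"{resource_slug}-{rp_uuid[:8]}"

def find_orphans(resource_slug, current_rp_uuids, labeled_cr_names):
    # Expected names come from the live Waldur RP list; any label-matching
    # CR outside that set is an orphan and gets deleted.
    expected = {cr_name(resource_slug, u) for u in current_rp_uuids}
    return sorted(set(labeled_cr_names) - expected)

rp1 = "8706dd1a000000000000000000000001"
# RP2 was deleted in Waldur, but its CR still exists in the cluster:
print(cr_name("genomics-2026", rp1))  # genomics-2026-8706dd1a
print(find_orphans("genomics-2026", [rp1],
                   ["genomics-2026-8706dd1a", "genomics-2026-9b21c3ef"]))
# ['genomics-2026-9b21c3ef']
```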
## Sequence: orphan-CR pruning

```mermaid
sequenceDiagram
    autonumber
    participant SA as Site Agent
    participant W as Waldur
    participant K as Kubernetes
    participant OP as Operator
    participant R as Rancher
    participant KC as Keycloak
    Note over SA: pull_resource() loop completes;<br/>2 CRs applied for RP1, RP2
    SA->>W: GET resource-projects?resource_uuid=…
    W-->>SA: [RP1] (RP2 was deleted in Waldur)
    SA->>K: list mrp -l waldur.io/resource-uuid=…
    K-->>SA: [CR_RP1, CR_RP2]
    Note over SA: expected={CR_RP1}<br/>found={CR_RP1, CR_RP2}<br/>orphans={CR_RP2}
    SA->>K: kube delete CR_RP2
    K-->>OP: watch event (delete + finalizer hold)
    OP->>R: delete PRTBs for RP2's project
    OP->>KC: remove members from KC groups
    OP->>KC: delete KC child groups
    OP->>KC: delete KC parent group (if empty)
    OP->>R: delete Rancher project
    alt stored project ID is stale (e.g. project recreated externally with new ID)
        OP->>R: DELETE /v3/projects/<stored_id> -> 404
        OP->>R: GET /v3/projects?clusterId=…&name=<projectName>
        R-->>OP: {id: <current_id>}
        OP->>R: DELETE /v3/projects/<current_id> -> 200
    end
    OP->>K: remove finalizer (CR fully gone)
```
Why pruning lives in the plugin, not the operator: the operator doesn't know about Waldur. The Waldur RP list is the source of truth, and only the plugin sees both sides. The label selector keeps pruning safe: CRs without the label (manually created, or from a different source) are never touched.
## Sequence: cleanup on stale rancherProjectId

When a Rancher project is externally deleted and recreated between two operator reconciles, the CR's `status.rancherProjectId` points at a non-existent ID. The cleanup falls back to finding the live project by `clusterId` + `projectName`:
```mermaid
sequenceDiagram
    autonumber
    participant U as User / Test
    participant K as Kubernetes
    participant OP as Operator
    participant R as Rancher
    Note over U: status.rancherProjectId = p-OLD<br/>(actual Rancher project: p-NEW, same name)
    U->>K: kubectl delete mrp <name>
    K-->>OP: on_delete handler fires
    OP->>R: DELETE /v3/projects/p-OLD
    R-->>OP: 404 Not Found
    Note over OP: delete_project returned False<br/>-> stored ID is stale
    OP->>R: GET /v3/projects?clusterId=…&name=<projectName>
    R-->>OP: [{id: p-NEW, …}]
    OP->>OP: log WARNING "Stored projectId p-OLD was stale;<br/>deleting current p-NEW found by name"
    OP->>R: DELETE /v3/projects/p-NEW
    R-->>OP: 200 OK
    OP->>K: cleanup complete; release finalizer
```
This was the failure mode that left orphan Rancher projects after external recreation; the fix landed in operator 0.2.2.
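The fallback logic can be sketched as follows. This is a hedged sketch of the behavior described above, not the operator's actual code; `FakeRancher` stands in for the v3 REST API:

```python
def delete_project(client, stored_id, cluster_id, project_name, log):
    """Delete by stored ID; on 404, find the live project by cluster + name."""
    if client.delete(stored_id):           # DELETE /v3/projects/<stored_id>
        return stored_id                   # stored ID was still live
    # Delete returned 404: stored ID is stale, look up the current project
    live = client.find_by_name(cluster_id, project_name)
    if live is None:
        return None                        # nothing left to clean up
    log(f"Stored projectId {stored_id} was stale; deleting current {live}")
    client.delete(live)
    return live

class FakeRancher:
    def __init__(self):
        self.projects = {"p-NEW": ("c-1", "demo")}   # recreated with a new ID
    def delete(self, pid):
        return self.projects.pop(pid, None) is not None
    def find_by_name(self, cid, name):
        for pid, meta in self.projects.items():
            if meta == (cid, name):
                return pid
        return None

logs = []
deleted = delete_project(FakeRancher(), "p-OLD", "c-1", "demo", logs.append)
print(deleted)  # p-NEW
```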
## Setup

### 1. Operator: install in target cluster

The `waldur/rancher-keycloak-operator` must be running in the cluster you point this plugin at. One operator instance handles all `ManagedRancherProject` CRs in its namespace; one operator can manage Rancher projects across multiple downstream Rancher clusters (each CR specifies its own `clusterId`).

#### 1a. CRDs
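The original CRD-install snippet was lost in extraction; a sketch, assuming the CRDs ship alongside the operator's Helm chart (the `crds/` path and the `RancherProjectInventory` plural are assumptions):

```bash
# Apply the CRDs from the operator repo before (or as part of) the Helm install
kubectl apply -f helm/rancher-keycloak-operator/crds/
kubectl get crd managedrancherprojects.waldur.io
```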
#### 1b. Helm install (published image)

The published image is `opennode/rancher-keycloak-operator:<version>` on Docker Hub. Pin a specific version (don't use `:latest`):
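The original Helm snippet was lost in extraction; a sketch only. The value names (`image.tag`, `rancher.url`, `keycloak.url`, `keycloak.realm`) are assumptions; consult the chart's `values.yaml` in the operator repo:

```bash
helm upgrade --install rancher-keycloak-operator \
  ./helm/rancher-keycloak-operator \
  --namespace waldur-system --create-namespace \
  --set image.tag=0.3.1 \
  --set rancher.url=https://rancher.example.com \
  --set keycloak.url=https://keycloak.example.com \
  --set keycloak.realm=myrealm
```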
#### 1c. Smoke-test the operator (optional but recommended)
The operator repo ships a Tier-1 runbook at
docs/tier-1-runbook.md that walks through applying a hand-crafted
CR end-to-end against the configured Rancher and Keycloak before
wiring in the site-agent. Run it once per cluster to catch
credential / connectivity issues early.
#### 1d. Required Rancher + Keycloak permissions
| System | Role / scope |
|---|---|
| Rancher token | unscoped admin OR cluster-owner across all clusters this operator instance will manage |
| Keycloak admin user | realm-admin on the target realm (group create/delete, group member add/remove, user lookup) |
### 2. Plugin: install on the site-agent host

The plugin is a workspace member of the waldur-site-agent repo and is installed automatically when you run `uv sync --all-packages` at the repo root. To verify it's discovered:
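The original verification one-liner was lost in extraction; a sketch (the entry-point group name is an assumption, check the plugin's `pyproject.toml` for the real one):

```bash
uv run python -c "import importlib.metadata as m; print([e.name for e in m.entry_points(group='waldur_site_agent.backends')])"
```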
#### 2a. Site-agent host needs

- Network access to the Waldur Mastermind API and to the Kubernetes API of the cluster running the operator.
- A kubeconfig file (or in-cluster service-account credentials if you run the agent inside the operator's cluster).
- The Kubernetes API user must be allowed to `get/list/create/update/delete/patch` `managedrancherprojects.waldur.io` in the chosen namespace.
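A Role granting exactly those verbs might look like this (a sketch; the Role name and namespace are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: waldur-site-agent-mrp   # illustrative name
  namespace: waldur-system
rules:
  - apiGroups: ["waldur.io"]
    resources: ["managedrancherprojects"]
    verbs: ["get", "list", "create", "update", "delete", "patch"]
```

Bind it to the agent's user or service account with a matching RoleBinding.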
#### 2b. Configure the offering

Add a stanza to `waldur-site-agent-config.yaml`. Full reference at `examples/rancher-kc-crd-config.yaml`.
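The original stanza was lost in extraction; a hedged sketch assembled from the configuration reference below. The `offerings:`/`backend_settings:` nesting follows the usual site-agent convention and may differ in detail from `examples/rancher-kc-crd-config.yaml`; the `role_map` values are illustrative Rancher role template IDs:

```yaml
offerings:
  - name: "Rancher clusters"
    waldur_api_url: "https://waldur.example.com/api/"
    waldur_api_token: "<long-lived token from /api/users/<uuid>/keys/>"
    waldur_verify_ssl: true
    membership_sync_backend: "rancher-kc-crd"
    backend_settings:
      kubeconfig_path: "/etc/waldur/kubeconfig"
      context: "ops-cluster"              # optional, only with kubeconfig_path
      namespace: "waldur-system"
      parent_group_name: "c_${cluster_id}"
      group_name_template: "c_${cluster_id}_${rp_uuid}_${role_name}"
      role_map:
        "Project member": "project-member"
        "Project admin": "project-owner"
      keycloak_use_user_id: false
```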
#### 2c. Run

`rancher-kc-crd` is a `membership_sync_backend` only. Order processing and reporting still need other backends (or none).
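The original run snippet was lost in extraction. A sketch only; the console script and config env var are assumptions, so adapt to however you already deploy waldur-site-agent (systemd unit, container):

```bash
WALDUR_SITE_AGENT_CONFIG_PATH=/etc/waldur/waldur-site-agent-config.yaml \
  uv run waldur_site_agent
```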
Observability:
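The original observability snippet was lost in extraction; a sketch (the `mrp` short name is used elsewhere in this document; the operator Deployment name is an assumption):

```bash
kubectl -n waldur-system get managedrancherprojects
kubectl -n waldur-system describe mrp <name>          # status.conditions, synced members
kubectl -n waldur-system logs deploy/rancher-keycloak-operator -f
```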
## User identity matching

The plugin can match Waldur users to Keycloak users either by username (default) or by UUID. Choose with `backend_settings.keycloak_use_user_id`:

- **`false` (default): match by username.** The plugin sends `UserRole.user_username`. The operator does `GET /admin/realms/<realm>/users?username=X&exact=true` and uses the resulting `user.id` for group membership operations. Works in both OIDC and self-hosted Waldur as long as Waldur usernames align with Keycloak usernames; the typical OIDC mapping does this via the `preferred_username` claim.
- **`true`: match by UUID.** The plugin sends `UserRole.user_uuid`. The operator does `GET /admin/realms/<realm>/users/{uuid}` (matches the Keycloak internal `user.id`). Use this only when Waldur was OIDC-provisioned AND its user UUIDs were seeded from the Keycloak `sub` claim, so that `Waldur.user.uuid == Keycloak.user.id`. The username path is preferred because it tolerates UUID divergence and works in more topologies.
The operator never creates users. A user that doesn't exist in Keycloak under the chosen identifier gets logged as `WARNING User <id> not found in Keycloak` and is skipped: the PRTB and the group are still created and bound, the user just isn't a member yet. They become a member on the next reconcile after the user appears in Keycloak (e.g. their first OIDC login).
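The two lookup modes boil down to which Keycloak Admin endpoint the operator hits. A sketch (the function is illustrative; only the two endpoint shapes are taken from this document):

```python
def keycloak_lookup_url(realm: str, user: dict, use_user_id: bool) -> str:
    """Build the Keycloak Admin REST lookup for one Waldur user."""
    base = f"/admin/realms/{realm}/users"
    if use_user_id:
        # UUID path: requires Waldur.user.uuid == Keycloak.user.id
        return f"{base}/{user['uuid']}"
    # Username path (default): exact-match query, then use the returned user.id
    return f"{base}?username={user['username']}&exact=true"

u = {"username": "alice", "uuid": "3f2a77c0-0000-0000-0000-000000000001"}
print(keycloak_lookup_url("myrealm", u, use_user_id=False))
# /admin/realms/myrealm/users?username=alice&exact=true
print(keycloak_lookup_url("myrealm", u, use_user_id=True))
# /admin/realms/myrealm/users/3f2a77c0-0000-0000-0000-000000000001
```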
## Configuration reference

| Key | Type | Required | Description |
|---|---|---|---|
| `waldur_api_url` | string | yes | Mastermind API root with `/api/`. Plugin strips trailing `/api` for the SDK. |
| `waldur_api_token` | string | yes | Long-lived token from `/api/users/<uuid>/keys/`. Don't use a session token. |
| `waldur_verify_ssl` | bool | no (default `true`) | TLS verify for Waldur calls. |
| `kubeconfig_path` | string | no | Path to a kubeconfig file. Omit to use in-cluster credentials. |
| `context` | string | no | kubeconfig context to use when `kubeconfig_path` is set. |
| `namespace` | string | yes | Namespace for ManagedRancherProject CRs (typically `waldur-system`). |
| `parent_group_name` | string | no | Top-level KC group; var `${cluster_id}`. Default `c_${cluster_id}`. |
| `group_name_template` | string | no | Per-role child KC group; vars listed in note A. |
| `role_map` | dict | yes | Waldur role name → Rancher role template ID. Roles outside the map are skipped. |
| `keycloak_use_user_id` | bool | no | `false` (default) → match by username. `true` → match by UUID. See above. |
`spec.clusterId` is resolved from each Resource's `backend_id` (1:1 with a Rancher cluster); there is no offering-level `cluster_id` setting. An empty `backend_id` raises a clear `KeyError` rather than silently emitting an invalid CR.
**Note A: `group_name_template` variables.** Available substitutions:

| Variable | Source | Notes |
|---|---|---|
| `${cluster_id}` | `Resource.backend_id` | Rancher cluster ID, opaque |
| `${role_name}` | `UserRole.role_name` | Pre-mapping (before `role_map`) |
| `${rp_uuid}` | `ResourceProject.uuid` | Full 32-char hex |
| `${rp_uuid_short}` | first 8 chars of `${rp_uuid}` | ~4B combos, collision-free; same as `cr_name` |
| `${customer_slug}` | `Resource.customer_slug` | Waldur Customer (organization) slug |
| `${project_slug}` | `Resource.project_slug` | Waldur Project slug (parent Project; RPs have no slug) |
| `${resource_slug}` | `Resource.slug` | Waldur Resource slug (1:1 with cluster) |
| `${project_name}` | `ResourceProject.name` | Human-readable; may contain spaces |
Default is `c_${cluster_id}_${rp_uuid}_${role_name}`: one Keycloak group per (cluster × project × role), matching Rancher's per-project-PRTB access model. The default uses `${rp_uuid}` (immutable) for stability; override only if you have a strong reason.
Recommended human-readable opt-in template (reconstructed here from the rendered example below):

```yaml
group_name_template: "c_${cluster_id}_${customer_slug}_${project_slug}_${rp_uuid_short}_${role_name}"
```

Renders e.g. `c_c-m-glwxdksp_hpc-demo-org_genomics-2026_8706dd1a_project_member`. Stays unique per RP via `${rp_uuid_short}` while the slugs make the group name self-explaining in the Keycloak admin UI.
**Custom-template constraints.**

- MUST include a per-project discriminator (`${rp_uuid}`, `${rp_uuid_short}`, or `${project_name}`); without it, multiple projects share one group, and a user added to project A also gains access to B, C, … via the shared group's PRTBs.
- Slugs (`customer_slug`, `project_slug`, `resource_slug`, `project_name`) are mutable: renaming the entity in Waldur creates a new Keycloak group on the next reconcile and orphans the old one (the operator adopts groups by name and never renames adopted groups). Memberships in the old group become stale.
- Switching the template after deployment has the same effect as a bulk rename: every existing CR re-renders, the operator creates fresh groups, and the old groups linger with their stale members. Plan a one-time manual migration if you change the template against an existing deployment.
- Keycloak's `GROUP.NAME` column is `varchar(255)`. The plugin guards this at render time and raises a `ValueError` (with a hint about `${rp_uuid_short}`) if the rendered name would exceed 255 chars, so the operator never tries to apply a CR that Keycloak would reject with HTTP 500. Stay well under by preferring `${rp_uuid_short}` over `${rp_uuid}` in long templates and keeping Waldur slugs reasonably short (say ≤ 50 chars each).
## Troubleshooting

- **Plugin logs `HTTP/1.1 401 Unauthorized` from Waldur on every iteration.** `waldur_api_token` is a session token from `/api-auth/password/` (rotates on each call). Use a long-lived API token from `/api/users/<uuid>/keys/`.
- **`pull_resource` succeeds but no CRs are created.** The resource has an empty `backend_id`. The agent's resource fetcher (`waldur_site_agent/common/processors.py:_get_waldur_resources`) drops resources without one. Fix: `POST /api/marketplace-provider-resources/<uuid>/set_backend_id/`.
- **Plugin logs `IndexError: list index out of range` in processor `__init__`.** The customer is not registered as a service provider. Fix: `POST /api/marketplace-service-providers/` with the `customer` URL.
- **Plugin logs `GET .../api/api/marketplace-provider-resource-projects/...` (doubled `/api/`).** You're on a build older than the URL fix bundled with the orphan-pruning commit. Pull the latest plugin code, or as a workaround drop the trailing `/api/` from `waldur_api_url`.
- **Operator logs `WARNING User X not found in Keycloak` for every user.** Identity mismatch: the chosen identifier (username by default, UUID with `keycloak_use_user_id: true`) isn't resolvable in Keycloak. With the default username path, ensure Waldur usernames map to existing Keycloak usernames. With the UUID path, align Waldur user UUIDs with the Keycloak OIDC `sub`.
- **`kubectl delete mrp` succeeds in 0s but the Rancher project remains.** `status.rancherProjectId` is stale; operator versions before `0.2.2` treated 404 on delete as success. Upgrade the operator to `0.2.2+`; cleanup now falls back to find-by-name.
Operator `status.conditions` to check for any CR (`kubectl describe mrp <name>`):

| Condition | What `status=False` means |
|---|---|
| `RancherProjectReady` | Rancher create/adopt failed. Check the operator log for `httpx.HTTPStatusError`. |
| `ResourceQuotaReady` | `spec.resourceQuotas` apply failed (only emitted by operator 0.3.0+). |
| `KeycloakGroupsReady` | Couldn't create/find parent or child KC groups. Check KC admin credentials. |
| `RancherBindingsReady` | PRTB creation failed, usually an invalid `rancherRole` in `role_map`. |
| `MembershipSynced` | Per-user add/remove failed; see `User X not found in Keycloak` warnings. |
## Development
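The original dev-loop snippet was lost in extraction; a sketch (the `K8S_CRD_TEST=1` toggle comes from the test-layout table below, the rest is assumed):

```bash
uv sync --all-packages
uv run pytest tests/ -v                                          # pure tests
K8S_CRD_TEST=1 uv run pytest tests/test_backend_integration.py   # needs a live cluster
```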
`tests/` layout:

| File | Coverage |
|---|---|
| `test_translator.py` | 15 pure tests: `cr_name`, group templates, role bindings, full CR build. |
| `test_status_reader.py` | 13 pure tests: status → `BackendResourceInfo` + drift detection. |
| `test_backend_integration.py` | 6 tests, `K8S_CRD_TEST=1`: quotas, no-client, orphan pruning. |
## Companion components

| Component | Role |
|---|---|
| `rancher-keycloak-operator` (separate repo) | Reconciles `ManagedRancherProject` CRs. |
| `ManagedRancherProject` CRD (in operator helm chart) | API surface the plugin writes to. |