Plugin Development Guide
This guide covers everything needed to build a custom backend plugin for Waldur Site Agent. It is written for both human developers and LLM-based code generators.
Waldur Mastermind concepts
Before implementing a plugin, understand how Waldur Mastermind concepts map to plugin operations.
| Waldur concept | Description | Plugin relevance |
|---|---|---|
| Offering | Service catalog entry | Config block per offering; picks backend plugin |
| Resource | Allocation from an offering | CRUD via BaseBackend; keyed by backend_id |
| Order | Create/update/terminate request | Triggers order_process mode |
| Component | Measurable dimension (CPU, RAM) | Defined in backend_components config |
| OfferingUser | User linked to an offering | Username backend generates usernames |
| billing_type | usage or limit | Metered vs quota accounting |
| backend_id | Resource ID on the backend | Generated by _get_resource_backend_id |
Architecture overview
A plugin consists of two main classes:
- Backend (inherits `BaseBackend`): Orchestrates high-level operations (create resource, collect usage, manage users).
- Client (inherits `BaseClient`): Handles low-level communication with the external system (CLI commands, API calls).
BaseBackend method reference
Abstract methods (must implement)
ping(raise_exception: bool = False) -> bool
- Mode: All (health check)
- Purpose: Verify backend connectivity.
- No-op: Return `False`.
diagnostics() -> bool
- Mode: Diagnostics CLI
- Purpose: Log diagnostic info and return health status.
- No-op: Log a message, return `True`.
list_components() -> list[str]
- Mode: Diagnostics
- Purpose: Return component types available on the backend.
- No-op: Return `[]`.
_get_usage_report(resource_backend_ids: list[str]) -> dict
- Mode: `report`, `membership_sync`
- Purpose: Collect usage data for resources.
- Return format: a nested dict, described under "Usage report format specification".
- Key rules:
    - Component keys must match `backend_components` config keys.
    - Values must be in Waldur units (after `unit_factor` conversion).
    - `TOTAL_ACCOUNT_USAGE` is required and must equal the sum of per-user values.
- No-op: Return `{}`.
_collect_resource_limits(waldur_resource) -> tuple[dict, dict]
- Mode: `order_process` (resource creation)
- Purpose: Convert Waldur limits to backend limits and back.
- Returns: `(backend_limits, waldur_limits)` where `backend_limits` has values multiplied by `unit_factor`.
- No-op: Return `({}, {})`.
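The conversion described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `collect_limits` and its plain-dict arguments are assumptions, and a real `_collect_resource_limits` would read the requested limits from `waldur_resource`.

```python
def collect_limits(
    waldur_limits: dict[str, int],
    backend_components: dict[str, dict],
) -> tuple[dict[str, int], dict[str, int]]:
    """Return (backend_limits, waldur_limits) for the configured components.

    Hypothetical helper: assumes each component config carries a unit_factor.
    """
    backend_limits: dict[str, int] = {}
    kept_waldur_limits: dict[str, int] = {}
    for component, config in backend_components.items():
        if component not in waldur_limits:
            continue  # no limit requested for this component
        value = waldur_limits[component]
        kept_waldur_limits[component] = value
        # Waldur units -> backend units: multiply by unit_factor.
        backend_limits[component] = value * config.get("unit_factor", 1)
    return backend_limits, kept_waldur_limits


components = {"cpu": {"unit_factor": 60000}, "mem": {"unit_factor": 61440}}
backend_limits, waldur_limits = collect_limits({"cpu": 2, "mem": 4}, components)
print(backend_limits)  # {'cpu': 120000, 'mem': 245760}
print(waldur_limits)   # {'cpu': 2, 'mem': 4}
```

Iterating over the configured components rather than the requested limits means that a limit for an unconfigured component is dropped instead of being passed to the backend.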
_pre_create_resource(waldur_resource, user_context=None) -> None
- Mode: `order_process` (resource creation)
- Purpose: Set up prerequisites before resource creation (e.g., parent accounts).
- No-op: Use `pass`.
downscale_resource(resource_backend_id: str) -> bool
- Mode: `membership_sync`
- Purpose: Restrict resource capabilities (e.g., set restrictive QoS).
- No-op: Return `True`.
pause_resource(resource_backend_id: str) -> bool
- Mode: `membership_sync`
- Purpose: Prevent all usage of the resource.
- No-op: Return `True`.
restore_resource(resource_backend_id: str) -> bool
- Mode: `membership_sync`
- Purpose: Restore resource to normal operation.
- No-op: Return `True`.
get_resource_metadata(resource_backend_id: str) -> dict
- Mode: `membership_sync`
- Purpose: Return backend-specific metadata for Waldur.
- No-op: Return `{}`.
Hook methods (override as needed)
These have default implementations in BaseBackend. Override only when your
backend needs custom behavior.
| Method | Default | When to override |
|---|---|---|
| `post_create_resource` | No-op | Post-creation setup (homedirs) |
| `_pre_delete_resource` | No-op | Pre-deletion cleanup (cancel jobs) |
| `_pre_delete_user_actions` | No-op | Per-user cleanup before removal |
| `process_existing_users` | No-op | Process existing users (homedirs) |
BaseClient method reference
All methods below are abstract and must be implemented.
| Method | Signature | Purpose |
|---|---|---|
| `list_resources` | `() -> list[ClientResource]` | List all resources on backend |
| `get_resource` | `(resource_id) -> ClientResource or None` | Get single resource or None |
| `create_resource` | `(name, description, organization, parent_name=None) -> str` | Create resource |
| `delete_resource` | `(name) -> str` | Delete resource |
| `set_resource_limits` | `(resource_id, limits_dict) -> str or None` | Set limits (backend units) |
| `get_resource_limits` | `(resource_id) -> dict[str, int]` | Get limits (backend units) |
| `get_resource_user_limits` | `(resource_id) -> dict[str, dict[str, int]]` | Per-user limits |
| `set_resource_user_limits` | `(resource_id, username, limits_dict) -> str` | Set per-user limits |
| `get_association` | `(user, resource_id) -> Association or None` | Check user-resource link |
| `create_association` | `(username, resource_id, default_account=None) -> str` | Create user-resource link |
| `delete_association` | `(username, resource_id) -> str` | Remove user-resource link |
| `get_usage_report` | `(resource_ids) -> list` | Raw usage data from backend |
| `list_resource_users` | `(resource_id) -> list[str]` | List usernames for resource |
Important: BaseClient also provides execute_command(command, silent=False)
for running CLI commands with error handling. Use it for CLI-based backends.
Agent mode method matrix
This table shows which BaseBackend methods are called by each agent mode.
| Method | order_process | report | membership_sync | event_process |
|---|---|---|---|---|
| `ping` | startup | startup | startup | startup |
| `create_resource` / `create_resource_with_id` | CREATE order | - | - | CREATE event |
| `_pre_create_resource` | CREATE order | - | - | CREATE event |
| `post_create_resource` | CREATE order | - | - | CREATE event |
| `_collect_resource_limits` | CREATE order | - | - | CREATE event |
| `set_resource_limits` | UPDATE order | - | - | UPDATE event |
| `delete_resource` | TERMINATE order | - | - | TERMINATE event |
| `_pre_delete_resource` | TERMINATE order | - | - | TERMINATE event |
| `pull_resource` / `pull_resources` | CREATE order | usage pull | sync cycle | various events |
| `_get_usage_report` | - | usage pull | sync cycle | - |
| `add_users_to_resource` | post-create | - | user sync | role events |
| `remove_users_from_resource` | - | - | user sync | role events |
| `add_user` / `remove_user` | - | - | role changes | role events |
| `downscale_resource` | - | - | status sync | - |
| `pause_resource` | - | - | status sync | - |
| `restore_resource` | - | - | status sync | - |
| `get_resource_metadata` | - | - | status sync | - |
| `list_resources` | - | import | - | import event |
| `get_resource_limits` | - | import | - | import event |
| `get_resource_user_limits` | - | - | limits sync | - |
| `set_resource_user_limits` | - | - | limits sync | - |
| `process_existing_users` | - | - | user sync | - |
Usage report format specification
The `_get_usage_report` method must return a nested dict keyed by resource backend ID. Each resource entry maps `TOTAL_ACCOUNT_USAGE` and each username to a dict of per-component usage values.
Rules
- Component keys must exactly match those in the `backend_components` YAML config.
- Values must be integers in Waldur units (i.e., divide raw backend values by `unit_factor`).
- `TOTAL_ACCOUNT_USAGE` is a required key and must equal the sum of all per-user values for each component.
- If a resource has no usage, return `{"TOTAL_ACCOUNT_USAGE": {"cpu": 0, "mem": 0, ...}}`.
- If usage reporting is not supported, return `{}` (empty dict).
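The rules above can be checked mechanically. The sketch below is an illustration only: `validate_usage_report` is a hypothetical helper, not part of the agent. The nesting (resource ID, then user or total entry, then component) is inferred from the rules in this section, and the resource IDs and usernames are made up.

```python
TOTAL_KEY = "TOTAL_ACCOUNT_USAGE"


def validate_usage_report(report: dict, component_keys: set[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the report passes."""
    errors: list[str] = []
    for resource_id, usage in report.items():
        if TOTAL_KEY not in usage:
            errors.append(f"{resource_id}: missing {TOTAL_KEY}")
            continue
        per_user = {name: vals for name, vals in usage.items() if name != TOTAL_KEY}
        # Every entry, including the total, must use known keys and integers.
        for name, values in usage.items():
            for key, value in values.items():
                if key not in component_keys:
                    errors.append(f"{resource_id}/{name}: unknown component {key!r}")
                if not isinstance(value, int):
                    errors.append(f"{resource_id}/{name}: {key} is not an integer")
        # The total must equal the per-user sum for each component.
        for key, total in usage[TOTAL_KEY].items():
            user_sum = sum(vals.get(key, 0) for vals in per_user.values())
            if per_user and total != user_sum:
                errors.append(f"{resource_id}: {key} total {total} != per-user sum {user_sum}")
    return errors


report = {
    "alloc_1": {  # hypothetical resource backend ID
        "TOTAL_ACCOUNT_USAGE": {"cpu": 5, "mem": 3},
        "user1": {"cpu": 2, "mem": 1},
        "user2": {"cpu": 3, "mem": 2},
    },
    "alloc_2": {"TOTAL_ACCOUNT_USAGE": {"cpu": 0, "mem": 0}},  # no usage
}
print(validate_usage_report(report, {"cpu", "mem"}))  # []
```

A check like this fits naturally in a plugin's unit tests, where it catches key mismatches that the core would otherwise ignore silently.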
Example: SLURM CPU and memory
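Given a config in which `cpu` has `unit_factor: 60000` and `mem` has `unit_factor: 61440`, suppose SLURM reports 120000 cpu-minutes and 122880 MB-minutes for user1. The report then carries 2 for both `cpu` and `mem`, under user1 and under `TOTAL_ACCOUNT_USAGE`.
Calculation: 120000 / 60000 = 2, 122880 / 61440 = 2.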
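The same arithmetic, written out as a sketch. The layout of `backend_components` and `raw_usage` and the helper `build_resource_usage` are assumptions made for illustration. Integer division truncates, so the rounding policy shown here is also an assumption.

```python
TOTAL_KEY = "TOTAL_ACCOUNT_USAGE"

# Unit factors taken from the example; the dict layout is an assumption.
backend_components = {
    "cpu": {"unit_factor": 60000},  # k-Hours <-> cpu-minutes
    "mem": {"unit_factor": 61440},  # GB-Hours <-> MB-minutes
}

# Raw per-user usage in backend units, as a CLI or API might report it.
raw_usage = {"user1": {"cpu": 120000, "mem": 122880}}


def build_resource_usage(raw: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Convert raw per-user usage to Waldur units and add the total entry."""
    usage: dict[str, dict[str, int]] = {}
    totals = {component: 0 for component in backend_components}
    for username, values in raw.items():
        # Backend units -> Waldur units: divide by unit_factor.
        converted = {
            component: values.get(component, 0) // config["unit_factor"]
            for component, config in backend_components.items()
        }
        usage[username] = converted
        for component, value in converted.items():
            totals[component] += value
    usage[TOTAL_KEY] = totals
    return usage


print(build_resource_usage(raw_usage))
# {'user1': {'cpu': 2, 'mem': 2}, 'TOTAL_ACCOUNT_USAGE': {'cpu': 2, 'mem': 2}}
```

Summing the converted per-user values, rather than converting a separately reported total, guarantees that the total equals the per-user sum even when rounding is involved.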
supports_decreasing_usage class attribute
Set this to True on your backend class if usage values can decrease between
reports (e.g., a storage backend reporting current disk usage rather than
accumulated compute time).
When False (default), the reporting processor skips updates where the new
usage value is lower than the previously reported value, treating it as a
data anomaly.
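A minimal sketch of the attribute and of the rule it controls. `StorageBackend` is a hypothetical plugin class (a real one would inherit `BaseBackend`), and `should_submit` only imitates the behavior described above; it is not the agent's reporting processor.

```python
class StorageBackend:  # a real plugin would inherit BaseBackend
    """Hypothetical backend that reports current disk usage."""

    supports_decreasing_usage = True


def should_submit(previous: int, new: int, supports_decreasing_usage: bool) -> bool:
    """Imitate the described rule: drop decreases unless the backend allows them."""
    if new < previous and not supports_decreasing_usage:
        return False  # treated as a data anomaly
    return True


print(should_submit(10, 8, StorageBackend.supports_decreasing_usage))  # True
print(should_submit(10, 8, False))  # False
```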
Decision matrix for no-op implementations
If your backend does not support a certain operation, use these return values:
| Method | No-op return | Meaning |
|---|---|---|
| `ping` | `False` | Backend has no health check |
| `diagnostics` | `True` | Diagnostics not implemented but OK |
| `list_components` | `[]` | No component discovery |
| `_get_usage_report` | `{}` | No usage reporting |
| `_collect_resource_limits` | `({}, {})` | No limits support |
| `_pre_create_resource` | `pass` | No pre-creation setup |
| `downscale_resource` | `True` | No downscaling concept |
| `pause_resource` | `True` | No pausing concept |
| `restore_resource` | `True` | No restore concept |
| `get_resource_metadata` | `{}` | No metadata |
Annotated YAML configuration
unit_factor explained
The unit_factor converts between Waldur display units and backend-native units:
- `backend_value = waldur_value * unit_factor`
- `waldur_value = backend_value / unit_factor`
Examples:
- CPU k-Hours to SLURM cpu-minutes: `unit_factor = 60000` (60 min x 1000)
- GB-Hours to SLURM MB-minutes: `unit_factor = 61440` (60 min x 1024 MB)
- GB to GB (no conversion): `unit_factor = 1`
Entry point registration
Register your plugin in `pyproject.toml` under the `"waldur_site_agent.backends"` entry point group.
The entry point name (e.g., mycustom) is what users put in
backend_type or order_processing_backend in the config YAML.
Common pitfalls
1. Unit factor direction
The unit_factor converts from Waldur units to backend units by multiplication.
When reporting usage back, you must divide by unit_factor. Getting this
backwards causes limits to be set at 1/60000th of the intended value
or usage to be reported 60000x too high.
2. Missing TOTAL_ACCOUNT_USAGE
The _get_usage_report return dict must include a "TOTAL_ACCOUNT_USAGE" key
for each resource. If missing, the core will substitute zeros, and reported
usage will appear as zero in Waldur.
3. Entry point not discovered
Common causes:
- Package not installed (run `uv sync --all-packages`)
- Entry point group name misspelled (must be `"waldur_site_agent.backends"`)
- Entry point value points to wrong class or module
To debug, list the entry points registered under that group and check that your plugin appears.
4. Forgetting super().__init__()
Your backend `__init__` must call `super().__init__(backend_settings, backend_components)`.
This sets up `self.backend_settings`, `self.backend_components`, and
`self.client`. After that call, replace `self.client` with an instance of your own client class.
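A minimal sketch of that initializer. So that it runs on its own, it defines a reduced stand-in for `BaseBackend`; in a plugin you would import the real base class instead. `MyCustomClient`, `MyCustomBackend`, and the `api_url` setting are hypothetical names.

```python
class BaseBackend:
    """Stand-in for the real base class, reduced to what this sketch needs."""

    def __init__(self, backend_settings: dict, backend_components: dict) -> None:
        self.backend_settings = backend_settings
        self.backend_components = backend_components
        self.client = None  # assumption: the real base class sets a default here


class MyCustomClient:
    """Hypothetical client; a real one would inherit BaseClient."""

    def __init__(self, api_url: str) -> None:
        self.api_url = api_url


class MyCustomBackend(BaseBackend):
    def __init__(self, backend_settings: dict, backend_components: dict) -> None:
        # Call the base initializer first so the shared attributes exist.
        super().__init__(backend_settings, backend_components)
        self.backend_type = "mycustom"
        # Then replace the default client with the plugin's own.
        self.client = MyCustomClient(backend_settings["api_url"])  # hypothetical setting


backend = MyCustomBackend(
    {"api_url": "https://backend.example.com"},
    {"cpu": {"unit_factor": 1}},
)
print(type(backend.client).__name__)  # MyCustomClient
```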
5. Returning wrong types from client methods
- `get_resource` must return `None` (not raise) when resource is absent.
- `get_association` must return `None` (not raise) when no association exists.
- `list_resources` must return `list[ClientResource]`, not raw dicts.
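The sketch below shows these return conventions with an in-memory client. `ClientResource` here is a stand-in dataclass whose fields are assumptions (the real structure lives in `structures.py`), and `InMemoryClient` is a hypothetical class that would inherit `BaseClient` in a real plugin.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientResource:
    """Stand-in for the real structure; the fields here are assumptions."""

    name: str
    description: str = ""


class InMemoryClient:
    """Hypothetical client backed by a dict instead of a real system."""

    def __init__(self) -> None:
        self._resources: dict[str, ClientResource] = {}

    def create_resource(
        self, name: str, description: str, organization: str, parent_name: Optional[str] = None
    ) -> str:
        self._resources[name] = ClientResource(name=name, description=description)
        return name

    def get_resource(self, resource_id: str) -> Optional[ClientResource]:
        # dict.get returns None for a missing key instead of raising KeyError.
        return self._resources.get(resource_id)

    def list_resources(self) -> list[ClientResource]:
        # Return structured objects, never the raw backend payload.
        return list(self._resources.values())


client = InMemoryClient()
print(client.get_resource("missing"))  # None
client.create_resource("alloc_1", "demo allocation", "org_a")
print(client.list_resources())
```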
6. Component key mismatch
Component keys in _get_usage_report must exactly match the keys in
backend_components config. If config has "cpu" but you report "CPU",
the usage will be silently ignored.
Testing guidance
What to test per mode
| Mode | Test focus |
|---|---|
| `order_process` | create_resource, delete_resource, limit conversion |
| `report` | _get_usage_report format, unit conversion math |
| `membership_sync` | add_user, remove_user, pause/restore |
| All | ping, error handling, edge cases |
Mock patterns
Mock the client to avoid needing a real backend.
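A minimal sketch of the pattern using `unittest.mock`. `MyCustomBackend` is a reduced stand-in, and implementing `ping` through `list_resources` is an assumption made for illustration; a real backend would inherit `BaseBackend` and may check connectivity differently.

```python
from unittest.mock import MagicMock


class MyCustomBackend:
    """Reduced stand-in for a plugin backend; a real one inherits BaseBackend."""

    def __init__(self, client) -> None:
        self.client = client

    def ping(self, raise_exception: bool = False) -> bool:
        try:
            self.client.list_resources()
        except Exception:
            if raise_exception:
                raise
            return False
        return True


def test_ping_ok() -> None:
    client = MagicMock()
    client.list_resources.return_value = []
    assert MyCustomBackend(client).ping() is True
    client.list_resources.assert_called_once_with()


def test_ping_failure_is_reported() -> None:
    client = MagicMock()
    client.list_resources.side_effect = RuntimeError("backend unreachable")
    assert MyCustomBackend(client).ping() is False


test_ping_ok()
test_ping_failure_is_reported()
```

Setting `side_effect` to an exception is a convenient way to exercise the error path without a failing backend.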
Fixtures
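Shared test inputs can be built by small factory functions. The sketch below keeps them as plain functions so that it runs without extra dependencies; under pytest, each would typically be decorated with `@pytest.fixture`. The function names and the `api_url` setting are illustrative assumptions.

```python
from unittest.mock import MagicMock


def make_backend_settings() -> dict:
    """Settings a hypothetical plugin might read; the key is illustrative."""
    return {"api_url": "https://backend.example.com"}


def make_backend_components() -> dict:
    """Component config using the unit factors discussed in this guide."""
    return {
        "cpu": {"unit_factor": 60000},
        "mem": {"unit_factor": 61440},
    }


def make_mock_client() -> MagicMock:
    """Client double with safe defaults for commonly called methods."""
    client = MagicMock()
    client.list_resources.return_value = []
    client.get_resource.return_value = None  # resource absent by default
    client.get_association.return_value = None  # no user-resource link by default
    return client


client = make_mock_client()
print(client.get_resource("alloc_1"))  # None
```

Individual tests can then override a single default, for example `client.get_resource.return_value`, without rebuilding the whole double.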
Key assertions
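The assertions below illustrate the checks that the rules in this guide suggest. The sample values stand in for what the methods under test would return; their names and contents are assumptions.

```python
TOTAL_KEY = "TOTAL_ACCOUNT_USAGE"
component_keys = {"cpu", "mem"}

# Sample values standing in for what the methods under test would return.
report = {"alloc_1": {TOTAL_KEY: {"cpu": 2, "mem": 2}, "user1": {"cpu": 2, "mem": 2}}}
backend_limits, waldur_limits = {"cpu": 120000}, {"cpu": 2}
missing_resource = None

for resource_id, usage in report.items():
    # Every resource entry carries the total.
    assert TOTAL_KEY in usage, f"{resource_id} lacks the total entry"
    for values in usage.values():
        # Keys match the configured components, and values are integers.
        assert set(values) <= component_keys
        assert all(isinstance(v, int) for v in values.values())

# Limits are multiplied by unit_factor on the way to the backend.
assert backend_limits["cpu"] == waldur_limits["cpu"] * 60000
# An absent resource is reported as None, not as an exception.
assert missing_resource is None
```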
LLM implementation checklist
When implementing a new backend plugin with an LLM, follow these steps in order:
- Read existing plugins: Study `plugins/slurm/` and `plugins/mup/` for patterns.
- Copy the template: Start from `docs/plugin-template/` and rename.
- Implement `__init__`: Call `super().__init__()`, set `backend_type`, create client.
- Implement `BaseClient` methods: Start with `get_resource`, `create_resource`, `delete_resource`, `list_resources`.
- Implement `BaseBackend` abstract methods: Start with `ping`, then `_pre_create_resource`, then `_collect_resource_limits`, then `_get_usage_report`.
- Handle unit conversion: Verify `unit_factor` math in both directions.
- Write tests: Mock the client, test each abstract method.
- Register entry points: Add to `pyproject.toml`.
- Test integration: Install with `uv sync --all-packages` and run `waldur_site_diagnostics`.
- Verify: Run `uv run pytest` and `pre-commit run --all-files`.
Files to study
- `waldur_site_agent/backend/backends.py` - Base classes with all abstract methods
- `waldur_site_agent/backend/clients.py` - Base client class
- `waldur_site_agent/backend/structures.py` - Data structures (`ClientResource`, `Association`, `BackendResourceInfo`)
- `plugins/slurm/waldur_site_agent_slurm/backend.py` - Reference implementation (CLI-based)
- `plugins/mup/waldur_site_agent_mup/backend.py` - Reference implementation (API-based)
Common mistakes to avoid
- Do not forget `super().__init__(backend_settings, backend_components)`.
- Do not return raw dicts from `list_resources`; return `ClientResource` objects.
- Do not raise exceptions from `get_resource` when resource is absent; return `None`.
- Do not forget the `"TOTAL_ACCOUNT_USAGE"` key in usage reports.
- Do not confuse Waldur units with backend units in `_collect_resource_limits`.
- Do not hardcode component keys; read them from `self.backend_components`.