SLURM Plugin for Waldur Site Agent
The SLURM plugin provides SLURM cluster management capabilities for Waldur Site Agent, including resource management, usage reporting, periodic limits, and historical data loading.
Features
Core SLURM Management
- Account Management: Create, delete, list, and manage SLURM accounts
- User Association: Add/remove users from SLURM accounts with automatic association management
- Resource Limits: Set and manage CPU, memory, GPU, and custom TRES limits
- Usage Reporting: Real-time usage data collection and reporting to Waldur
- Health Monitoring: Cluster status checking and connectivity validation
Periodic Limits System
- Dynamic Fairshare: Automatic fairshare adjustments based on usage patterns
- TRES Limits: GrpTRESMins, MaxTRESMins, and GrpTRES limit management
- QoS Management: Threshold-based Quality of Service adjustments
- Carryover Allocation: Unused allocation carryover between billing periods
- Decay Calculations: Configurable half-life decay for historical usage
- Event-Driven Updates: Real-time periodic limits updates via STOMP
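The decay and carryover items above can be illustrated with a small sketch. This is not the plugin's or Waldur Mastermind's actual calculation: the `PeriodUsage` structure, the function names, and the carryover rule (allocation minus decayed usage, floored at zero) are assumptions chosen only to show how a half-life weights older usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodUsage:
    """Usage recorded in one past period (hypothetical structure)."""

    age_days: float  # how long ago the usage was recorded
    usage: float     # raw usage, e.g. in TRES-minutes


def decayed_usage(periods, half_life_days):
    """Sum historical usage, halving each entry's weight per elapsed half-life."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return sum(p.usage * 0.5 ** (p.age_days / half_life_days) for p in periods)


def carryover(allocation, periods, half_life_days):
    """Assumed rule: unused allocation carried forward, never negative."""
    return max(0.0, allocation - decayed_usage(periods, half_life_days))


history = [
    PeriodUsage(age_days=0, usage=400.0),   # counted in full
    PeriodUsage(age_days=14, usage=400.0),  # one half-life old, counted at 50%
]
print(decayed_usage(history, half_life_days=14))      # 600.0
print(carryover(1000.0, history, half_life_days=14))  # 400.0
```

A shorter half-life makes old usage fade faster, so the carried-over amount grows more quickly after a busy period.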
Historical Usage Loading
The waldur_site_load_historical_usage command has been moved to the core package and is now
available to all backend plugins. The SLURM backend implements get_usage_report_for_period()
to supply historical data from SLURM accounting records.
Dual-Mode Operation
- Production Mode: Direct SLURM cluster integration via sacctmgr and sacct
- Emulator Mode: Development and testing with SLURM emulator integration
- Seamless Switching: Configuration-driven mode selection
Installation
The SLURM plugin is included in the main Waldur Site Agent installation. For specific installation instructions, see the main Installation Guide.
Dependencies
- SLURM Tools: sacctmgr and sacct commands available on the cluster head node
- Python Packages: Automatically installed with the plugin
- Optional: SLURM emulator for development and testing
Configuration
Basic Configuration
Periodic Limits Configuration
QoS state (normal / downscaled / paused) is driven by the resource flags
paused / downscaled set by Waldur Mastermind and applied by the agent
through the top-level qos_default / qos_downscaled / qos_paused
backend settings — the same path used by manual pause/downscale.
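As a rough illustration of the flag-to-QoS mapping described above, the sketch below picks a QoS name from the `paused` / `downscaled` resource flags and the three `qos_*` backend settings. The function name, the example QoS values, and the precedence of `paused` over `downscaled` are assumptions, not the plugin's confirmed behaviour.

```python
def select_qos(resource_flags, backend_settings):
    """Pick the QoS name for a resource from its Waldur flags.

    Assumption: `paused` wins over `downscaled` when both flags are set.
    """
    if resource_flags.get("paused"):
        return backend_settings["qos_paused"]
    if resource_flags.get("downscaled"):
        return backend_settings["qos_downscaled"]
    return backend_settings["qos_default"]


# Example QoS names are placeholders; use the ones defined on your cluster.
settings = {"qos_default": "normal", "qos_downscaled": "limited", "qos_paused": "paused"}

print(select_qos({"paused": False, "downscaled": False}, settings))  # normal
print(select_qos({"paused": False, "downscaled": True}, settings))   # limited
print(select_qos({"paused": True, "downscaled": True}, settings))    # paused
```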
Storage Quotas
The SLURM plugin supports two independent filesystem-quota subsystems:
- Per-user home directory quota (homedir_quota / homedir_base_path) with CephFS xattr, XFS, or Lustre user-quota providers.
- Per-project directory plus Lustre group/project quota (project_directory with optional nested lustre_quota).
See docs/slurm-storage-quotas.md for configuration reference, command flow, prerequisites (Lustre project quotas require LDAP integration), and operator troubleshooting tips.
Event Processing Configuration
STOMP event processing is configured with top-level keys on the offering,
not in a separate event_processing block.
The set of STOMP topics the agent subscribes to is derived automatically
from the offering configuration — there is no user-settable
observable_object_types key. When backend_settings.periodic_limits.enabled
is true, the agent subscribes to the RESOURCE_PERIODIC_LIMITS topic
(see _determine_observable_object_types in
waldur_site_agent/event_processing/utils.py).
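The derivation rule for the periodic limits topic can be sketched as follows. This is a simplified stand-in, not the real `_determine_observable_object_types`: the function name `derive_topics` is hypothetical, the topic is represented as a plain string, and every other topic the agent may subscribe to is omitted.

```python
def derive_topics(offering_config):
    """Return the topics implied by an offering config (simplified sketch).

    Only the periodic limits rule from the text is modelled here.
    """
    topics = []
    backend_settings = offering_config.get("backend_settings") or {}
    periodic_limits = backend_settings.get("periodic_limits") or {}
    if periodic_limits.get("enabled", False):
        topics.append("RESOURCE_PERIODIC_LIMITS")
    return topics


enabled = {"backend_settings": {"periodic_limits": {"enabled": True}}}
disabled = {"backend_settings": {}}

print(derive_topics(enabled))   # ['RESOURCE_PERIODIC_LIMITS']
print(derive_topics(disabled))  # []
```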
Usage
Basic Agent Operations
Loading Historical Usage
Requirements for historical loading:
- Staff user token: regular offering tokens cannot submit historical data.
Pass --no-staff-check to skip staff validation when submitting with a
service-provider token.
- Optional flag: --skip-user-usage submits resource-level totals only.
- Resources must already exist in Waldur
- SLURM accounting database must contain historical data for requested periods
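To make the notion of a requested period concrete, here is a minimal sketch of turning a (year, month) pair into a `sacct` query window. The helper names and the chosen output fields are assumptions; the plugin's actual `sacct` invocation inside `get_usage_report_for_period()` may differ.

```python
import calendar
from datetime import date


def billing_period(year, month):
    """Return the first and last calendar day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def sacct_period_args(accounts, year, month):
    """Build a sacct argument list covering one month (illustrative only)."""
    start, end = billing_period(year, month)
    return [
        "sacct",
        "--allusers",
        "--allocations",
        "--noheader",
        "--parsable2",
        f"--accounts={','.join(accounts)}",
        f"--starttime={start.isoformat()}",
        f"--endtime={end.isoformat()}T23:59:59",
        "--format=Account,User,ReqTRES,ElapsedRaw",
    ]


print(billing_period(2024, 2))
print(" ".join(sacct_period_args(["alloc_myproject"], 2024, 2)))
```

Running the resulting command against a cluster whose accounting database does not cover the window simply returns no rows, which is why the last requirement above matters.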
Periodic Limits Management
Periodic limits are managed automatically via event processing when enabled. The system:
- Receives signals from Waldur Mastermind with calculated periodic settings
- Applies settings to SLURM cluster (fairshare, limits, QoS)
- Monitors thresholds and adjusts QoS based on current usage
- Reports status back to Waldur
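The "applies settings" step above can be pictured as translating a settings payload into one `sacctmgr modify account` call. The sketch below only builds the argument list and does not run it; the function name and the parameter names are assumptions, not the fields of the real Waldur payload.

```python
def build_modify_command(account, fairshare=None, grp_tres_mins=None, qos=None):
    """Build a sacctmgr command that applies the given settings to an account."""
    settings = []
    if fairshare is not None:
        settings.append(f"fairshare={fairshare}")
    if grp_tres_mins:
        # TRES limits are passed as a comma-separated name=value list.
        tres = ",".join(f"{name}={value}" for name, value in sorted(grp_tres_mins.items()))
        settings.append(f"GrpTRESMins={tres}")
    if qos is not None:
        settings.append(f"qos={qos}")
    if not settings:
        raise ValueError("nothing to apply")
    return [
        "sacctmgr", "--immediate", "modify", "account",
        "where", f"name={account}",
        "set", *settings,
    ]


cmd = build_modify_command(
    "alloc_myproject",
    fairshare=100,
    grp_tres_mins={"cpu": 60000, "gres/gpu": 600},
)
print(" ".join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting.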
Account Diagnostics
The waldur_site_diagnose_slurm_account command provides diagnostic information for SLURM
accounts by comparing local cluster state with Waldur Mastermind configuration.
Diagnostic Data Flow
flowchart TB
subgraph Input
ACCOUNT[Account Name<br/>e.g., alloc_myproject]
CONFIG[Configuration<br/>config.yaml]
end
subgraph "Local SLURM Cluster"
SACCTMGR_Q[sacctmgr queries]
SLURM_DATA[Account Data<br/>• Fairshare<br/>• QoS<br/>• GrpTRESMins<br/>• Users]
end
subgraph "Waldur Mastermind API"
RESOURCE_API[Resources API<br/>GET /marketplace-provider-resources/]
POLICY_API[Policy API<br/>GET /marketplace-slurm-periodic-usage-policies/]
WALDUR_DATA[Resource Data<br/>• Limits<br/>• State<br/>• Offering]
POLICY_DATA[Policy Data<br/>• Limit Type<br/>• TRES Billing<br/>• Grace Ratio<br/>• Component Limits]
end
subgraph "Diagnostic Service"
FETCH_SLURM[Get SLURM<br/>Account Info]
FETCH_WALDUR[Get Waldur<br/>Resource Info]
FETCH_POLICY[Get SLURM<br/>Policy Info]
CALCULATE[Calculate<br/>Expected Settings]
COMPARE[Compare<br/>Actual vs Expected]
GENERATE[Generate<br/>Fix Commands]
end
subgraph Output
HUMAN[Human-Readable<br/>Report]
JSON[JSON<br/>Output]
FIX_CMDS[sacctmgr<br/>Fix Commands]
end
%% Flow
ACCOUNT --> FETCH_SLURM
CONFIG --> FETCH_SLURM
CONFIG --> FETCH_WALDUR
FETCH_SLURM --> SACCTMGR_Q
SACCTMGR_Q --> SLURM_DATA
SLURM_DATA --> COMPARE
FETCH_WALDUR --> RESOURCE_API
RESOURCE_API --> WALDUR_DATA
WALDUR_DATA --> FETCH_POLICY
WALDUR_DATA --> CALCULATE
FETCH_POLICY --> POLICY_API
POLICY_API --> POLICY_DATA
POLICY_DATA --> CALCULATE
CALCULATE --> COMPARE
COMPARE --> GENERATE
GENERATE --> HUMAN
GENERATE --> JSON
GENERATE --> FIX_CMDS
%% Styling
classDef input fill:#e8f5e9
classDef slurm fill:#f3e5f5
classDef waldur fill:#fff3e0
classDef service fill:#e3f2fd
classDef output fill:#fce4ec
class ACCOUNT,CONFIG input
class SACCTMGR_Q,SLURM_DATA slurm
class RESOURCE_API,POLICY_API,WALDUR_DATA,POLICY_DATA waldur
class FETCH_SLURM,FETCH_WALDUR,FETCH_POLICY,CALCULATE,COMPARE,GENERATE service
class HUMAN,JSON,FIX_CMDS output
Diagnostic Output
The diagnostic provides:
- SLURM Cluster Status: Account existence, fairshare, QoS, limits, users
- Waldur Mastermind Status: Resource state, offering, configured limits
- SLURM Policy Status: Period, limit type, TRES billing, grace ratio
- Expected vs Actual Comparison: Field-by-field comparison with status
- Unit Conversion Info: Shows how Waldur units convert to SLURM units
- Remediation Commands: sacctmgr commands to fix any mismatches
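The comparison and remediation items above follow a simple pattern, sketched here under stated assumptions: settings are plain string dictionaries, the status labels `OK` and `MISMATCH` are illustrative, and the generated command uses the generic `sacctmgr modify account ... set` form rather than the tool's exact output.

```python
def compare_settings(expected, actual):
    """Compare expected and actual account settings field by field."""
    rows = []
    for field in sorted(expected):
        exp, act = expected[field], actual.get(field)
        rows.append({
            "field": field,
            "expected": exp,
            "actual": act,
            "status": "OK" if exp == act else "MISMATCH",
        })
    return rows


def fix_commands(account, rows):
    """Emit one sacctmgr command per mismatching field."""
    return [
        f"sacctmgr -i modify account where name={account} set {row['field']}={row['expected']}"
        for row in rows
        if row["status"] == "MISMATCH"
    ]


expected = {"fairshare": "100", "GrpTRESMins": "cpu=60000"}
actual = {"fairshare": "100", "GrpTRESMins": "cpu=30000"}

rows = compare_settings(expected, actual)
for row in rows:
    print(row["field"], row["status"])
print(fix_commands("alloc_myproject", rows))
```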
Unit Conversions
Waldur and SLURM may use different units for resource limits. The diagnostic shows:
- Waldur units: e.g., Hours, GB-Hours (from offering configuration)
- SLURM units: e.g., TRES-minutes (from limit type: GrpTRESMins, MaxTRESMins)
- Conversion factor: The unit_factor from backend component configuration
For example, if Waldur uses "k-Hours" (kilo-hours) and SLURM uses "TRES-minutes", the
unit_factor is 60000, because 1 k-Hour = 1000 hours × 60 minutes per hour = 60000 TRES-minutes.
Use -v/--verbose to see detailed unit conversion information for each component.
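The conversion itself is a multiplication by unit_factor in one direction and a division in the other. The following sketch assumes limits and factors are keyed by component name; the function names are hypothetical and the plugin's own conversion code may round differently.

```python
def waldur_to_slurm(limits, unit_factors):
    """Convert Waldur limits to SLURM units by multiplying with unit_factor."""
    return {name: round(value * unit_factors[name]) for name, value in limits.items()}


def slurm_to_waldur(limits, unit_factors):
    """Convert SLURM limits back to Waldur units by dividing by unit_factor."""
    return {name: value / unit_factors[name] for name, value in limits.items()}


# With k-Hours in Waldur and TRES-minutes in SLURM, the factor is 60000.
factors = {"cpu": 60000, "gpu": 60000}

slurm_limits = waldur_to_slurm({"cpu": 2, "gpu": 0.5}, factors)
print(slurm_limits)                            # {'cpu': 120000, 'gpu': 30000}
print(slurm_to_waldur(slurm_limits, factors))  # {'cpu': 2.0, 'gpu': 0.5}
```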
CLI Options
| Option | Description |
|---|---|
| account_name | SLURM account name to diagnose (required) |
| -c, --config | Path to configuration file (default: waldur-site-agent-config.yaml) |
| --offering-uuid | Specific offering UUID (auto-detected if not specified) |
| --json | Output in JSON format for scripting |
| -v, --verbose | Include detailed reasoning in output |
| --no-color | Disable colored output |
Architecture
Component Overview
graph TB
subgraph "Waldur Site Agent"
BACKEND[SLURM Backend<br/>Core Logic]
CLIENT[SLURM Client<br/>Command Execution]
EVENTS[Event Handler<br/>Periodic Limits]
end
subgraph "SLURM Cluster"
SACCTMGR[sacctmgr<br/>Account Management]
SACCT[sacct<br/>Usage Reporting]
SQUEUE[squeue<br/>Status Monitoring]
end
subgraph "Waldur Mastermind"
API[REST API<br/>Resource Management]
STOMP[STOMP Broker<br/>Event Publishing]
POLICY[Periodic Policy<br/>Usage Calculations]
end
subgraph "Development Tools"
EMULATOR[SLURM Emulator<br/>Testing Environment]
end
%% Connections
BACKEND --> CLIENT
CLIENT --> SACCTMGR
CLIENT --> SACCT
CLIENT --> SQUEUE
CLIENT -.-> EMULATOR
BACKEND <--> API
EVENTS <--> STOMP
POLICY --> STOMP
EVENTS --> BACKEND
%% Styling
classDef agent fill:#e3f2fd
classDef slurm fill:#f3e5f5
classDef waldur fill:#fff3e0
classDef dev fill:#f1f8e9
class BACKEND,CLIENT,EVENTS agent
class SACCTMGR,SACCT,SQUEUE slurm
class API,STOMP,POLICY waldur
class EMULATOR dev
Backend Methods
The SLURM backend (SlurmBackend) extends BaseBackend and implements or overrides these methods:
Resource Lifecycle
- create_resource(waldur_resource, user_context=None): inherited from BaseBackend
- delete_resource(waldur_resource, **kwargs): inherited from BaseBackend
- _pre_create_resource(waldur_resource, user_context=None): sets up SLURM account hierarchy, LDAP groups, QoS, and project directories
- post_create_resource(resource, waldur_resource, user_context=None): creates home directories for users
- _pre_delete_resource(waldur_resource): cancels jobs, removes users, cleans up QoS and LDAP groups
- _collect_resource_limits(waldur_resource): converts Waldur limits to SLURM TRES limits (with ComponentMapper support)
- set_resource_limits(resource_backend_id, limits): sets limits using ComponentMapper when target_components are configured
- get_resource_limits(resource_backend_id): gets account limits converted to Waldur units
User Management
- add_user(waldur_resource, username, **kwargs): adds user to SLURM account with optional partition and LDAP group
- add_users_to_resource(waldur_resource, user_ids, **kwargs): adds users and creates home directories
- remove_user(waldur_resource, username, **kwargs): removes user from SLURM account and LDAP group
- remove_users_from_resource(waldur_resource, usernames): inherited from BaseBackend
- set_resource_user_limits(resource_backend_id, username, limits): sets per-user limits with unit_factor conversion
- process_existing_users(existing_users): ensures home directories exist for current users
Usage Reporting
- _get_usage_report(resource_backend_ids): collects current usage from SLURM accounting
- get_usage_report_for_period(resource_backend_ids, year, month): collects historical usage for a billing period
Resource State Management
- downscale_resource(resource_backend_id): sets QoS to downscaled state
- pause_resource(resource_backend_id): sets QoS to paused state
- restore_resource(resource_backend_id): restores QoS to default
- get_resource_metadata(resource_backend_id): returns current QoS as metadata
Periodic Limits
- apply_periodic_settings(resource_id, settings, config=None): applies periodic settings (production or emulator mode)
Health and Diagnostics
- ping(raise_exception=False): checks if the SLURM cluster is online
- diagnostics(): logs diagnostic information and validates cluster connectivity
- list_components(): returns available TRES on the SLURM cluster
Client Commands
The SLURM client executes commands via sacctmgr and sacct:
Account Commands
User Association Commands
Usage Reporting Commands
Periodic Limits Commands
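A client that shells out to sacctmgr typically asks for pipe-delimited output and parses it into records. The sketch below shows that pattern for listing accounts. The function names and the injectable `run` parameter are assumptions made for testability; this is not the code in waldur_site_agent_slurm/client.py.

```python
import subprocess


def parse_parsable2(output, fields):
    """Parse pipe-delimited sacctmgr/sacct output into a list of dicts."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(dict(zip(fields, line.split("|"))))
    return rows


def list_accounts(run=subprocess.run):
    """List SLURM accounts; `run` can be replaced by a fake in tests."""
    fields = ["account", "description", "organization"]
    cmd = [
        "sacctmgr", "--parsable2", "--noheader",
        "list", "account", f"format={','.join(fields)}",
    ]
    result = run(cmd, capture_output=True, text=True, check=True)
    return parse_parsable2(result.stdout, fields)


# Parsing a captured sample, without touching a real cluster:
sample = "alloc_a|Project A|org1\nalloc_b|Project B|org2\n"
print(parse_parsable2(sample, ["account", "description", "organization"]))
```

Injecting `run` lets unit tests supply canned output, which mirrors how the emulator mode substitutes for a real cluster.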
Testing
Test Structure
Running Tests
Run plugin tests from inside the plugin directory — the SLURM backend
entry point only resolves when pytest runs from plugins/slurm/ (running
from the workspace root fails with Unsupported backend type: slurm).
Test Features
Mock Mastermind Integration
The test suite includes complete mocking of Waldur Mastermind's periodic limits policy system:
- MockWaldurMastermindPolicy: Simulates real policy calculations
- MockSTOMPFrame: Simulates STOMP message structure
- End-to-end testing: Complete workflow validation without external dependencies
SLURM Emulator Integration
Tests can use the SLURM emulator for realistic command testing:
- Development dependency: uv add --dev slurm-emulator
- Automatic switching: Tests detect emulator availability
- Realistic scenarios: Built-in scenario framework
Development
Development Environment
Adding New Features
- Implement backend methods in waldur_site_agent_slurm/backend.py
- Add client commands in waldur_site_agent_slurm/client.py
- Write unit tests with mocked dependencies
- Add integration tests with emulator if needed
- Update documentation in README and docstrings
Debugging
Advanced Configuration
Production Deployment
Multi-Cluster Setup
Development/Testing Setup
Troubleshooting
Common Issues
SLURM Commands Not Found
Solution: Install SLURM client tools or use emulator mode for development.
Permission Denied
Solution: Ensure site agent runs with appropriate SLURM privileges or configure sudo access.
Periodic Limits Not Working
Solutions:
- Verify event processing is enabled
- Check STOMP connection settings
- Ensure offering has periodic_limits.enabled: true
- Verify STOMP broker is publishing periodic limits events
Historical Loading Errors
Solution: Use an API token from a user with is_staff=True in Waldur.
Debug Commands
Support
For issues, bug reports, or feature requests related to the SLURM plugin, please check:
- Plugin documentation - This README and test documentation
- Main project documentation - Waldur Site Agent docs
- Test coverage - Run tests to verify expected behavior
- Debug logging - Enable debug mode for detailed troubleshooting
The SLURM plugin integrates SLURM clusters with Waldur for production HPC environments, including periodic limits and historical data loading.