SLURM Usage Reporting Setup Guide
This guide explains how to set up a single Waldur Site Agent instance for usage reporting with the SLURM backend. This configuration is ideal when you only need to collect and report usage data from your SLURM cluster to Waldur Mastermind.
Overview
The usage reporting agent (report mode) collects CPU, memory, and other resource usage data from SLURM
accounting records and sends it to Waldur Mastermind. It runs in a continuous loop, fetching usage data for
the current billing period and reporting it at regular intervals.
Prerequisites
System Requirements
- Linux system with access to SLURM cluster head node
- Python 3.11 or higher
- `uv` package manager installed
- Root access (required for SLURM commands)
- Network access to Waldur Mastermind API
SLURM Requirements
- SLURM accounting enabled (`sacct` and `sacctmgr` commands available)
- Access to SLURM accounting database
- Required SLURM commands:
    - `sacct` for usage reporting
    - `sacctmgr` for account management
    - `sinfo` for cluster diagnostics
Installation
1. Clone and Install the Application
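The original code block was lost; the following is a minimal sketch, assuming the upstream `waldur/waldur-site-agent` repository on GitHub and the standard `uv` workflow (adjust the URL and entry point for your environment):

```shell
# Repository URL is an assumption; adjust to your fork or mirror
git clone https://github.com/waldur/waldur-site-agent.git
cd waldur-site-agent

# Install dependencies into a managed virtual environment (requires Python 3.11+)
uv sync
```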
2. Create Configuration Directory
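The original snippet was lost; creating the directory referenced in the next section is a single command:

```shell
# Configuration directory for the agent; restrict access since it will hold an API token
mkdir -p /etc/waldur
chmod 750 /etc/waldur
```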
Configuration
1. Create Configuration File
Create /etc/waldur/waldur-site-agent-config.yaml with the following configuration:
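The original configuration listing was lost. The sketch below is reconstructed from the parameters explained in the next section; the exact schema (key names, nesting) may differ between agent versions, so verify it against the example config shipped with your checkout:

```yaml
# /etc/waldur/waldur-site-agent-config.yaml (sketch; verify against your agent version)
offerings:
  - name: "SLURM HPC offering"
    waldur_api_url: "https://waldur.example.com/api/"
    waldur_api_token: "changeme-api-token"
    waldur_offering_uuid: "00000000000000000000000000000000"
    backend_type: "slurm"
    backend_settings:
      default_account: "root"          # root account in the SLURM cluster
      customer_prefix: "hpc_"          # prefixes identify agent-managed accounts
      project_prefix: "hpc_"
      allocation_prefix: "hpc_"
    backend_components:
      cpu:
        measured_unit: "k-Hours"
        unit_factor: 60000             # CPU-minutes -> thousands of CPU-hours
        accounting_type: "usage"       # report actual usage, not limits
        label: "CPU"
      mem:
        measured_unit: "gb-Hours"
        unit_factor: 61440             # MB-minutes -> GB-hours
        accounting_type: "usage"
        label: "RAM"
```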
2. Configuration Parameters Explained
Waldur Connection
- `waldur_api_url`: URL of your Waldur Mastermind API endpoint
- `waldur_api_token`: API token for authentication (create one in the Waldur admin)
- `waldur_offering_uuid`: UUID of the SLURM offering in Waldur
Backend Settings
- `default_account`: Root account in the SLURM cluster
- Prefixes: Used to identify accounts created by the agent (for filtering)
Backend Components
- `cpu`: CPU usage tracking in CPU-minutes (SLURM native unit)
- `mem`: Memory usage tracking in MB-minutes (SLURM native unit)
- `unit_factor`: Conversion factor from SLURM units to Waldur units
- `accounting_type: "usage"`: Report actual usage (not limits)
Deployment
Option 1: Systemd Service (Recommended)
- Copy service file:
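The original command was lost; assuming the unit file ships in the repository checkout (the path and unit name here are assumptions):

```shell
# Unit file location within the repository is an assumption; use the one in your checkout
cp systemd-conf/agent-report/waldur-site-agent-report.service /etc/systemd/system/
```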
- Reload systemd and enable service:
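The original block was lost; the standard systemd sequence, assuming the unit name `waldur-site-agent-report` from the previous step:

```shell
systemctl daemon-reload
systemctl enable waldur-site-agent-report
systemctl start waldur-site-agent-report
```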
- Check service status:
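Assuming the same unit name:

```shell
systemctl status waldur-site-agent-report
```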
Option 2: Manual Execution
For testing or one-time runs:
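The original commands were lost. A sketch, assuming the `waldur_site_agent` entry point with a mode flag (check `--help` on your installed version; the flag names here are assumptions):

```shell
# Run the agent in the foreground in report mode
export WALDUR_CONFIG_FILE_PATH=/etc/waldur/waldur-site-agent-config.yaml

# Entry point name and flags are assumptions; verify with: waldur_site_agent --help
waldur_site_agent -m report -c "$WALDUR_CONFIG_FILE_PATH"
```

Logs go to stdout/stderr in this mode, which makes it convenient for verifying a new configuration before enabling the systemd service.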
Operation
How It Works
- Initialization: Agent loads configuration and connects to SLURM cluster
- Account Discovery: Identifies accounts matching configured prefixes
- Usage Collection:
    - Runs `sacct` to collect usage data for the current billing period
    - Aggregates CPU and memory usage per account and user
- Converts SLURM units to Waldur units using configured factors
- Reporting: Sends usage data to Waldur Mastermind API
- Sleep: Waits for configured interval (default: 30 minutes)
- Repeat: Returns to step 3
Timing Configuration
Control reporting frequency with environment variable:
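The original snippet was lost; a sketch, where the variable name is an assumption (check the documentation for your agent version):

```shell
# Reporting interval in minutes (default: 30); variable name is an assumption
export WALDUR_SITE_AGENT_REPORT_PERIOD_MINUTES=30
```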
Logging
Systemd Service Logs
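The original block was lost; standard `journalctl` queries, assuming the `waldur-site-agent-report` unit name used above:

```shell
# Follow live logs
journalctl -u waldur-site-agent-report -f

# Logs since the last boot, errors only
journalctl -u waldur-site-agent-report -b -p err
```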
Manual Execution Logs
Logs are written to stdout/stderr when running manually.
Monitoring and Troubleshooting
Health Checks
- Test SLURM connectivity:
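A quick check that the SLURM client tools are reachable and accounting responds:

```shell
sinfo && sacct --version
```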
- Verify configuration:
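The original commands were lost; at minimum, a YAML syntax check catches malformed configuration (assumes `python3` with PyYAML available):

```shell
python3 -c "import yaml; yaml.safe_load(open('/etc/waldur/waldur-site-agent-config.yaml')); print('OK')"
```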
Common Issues
SLURM Commands Not Found
- Ensure SLURM tools are in PATH
- Verify `sacct` and `sacctmgr` are executable
- Check that SLURM accounting is enabled
Authentication Errors
- Verify Waldur API token is valid
- Check network connectivity to Waldur Mastermind
- Ensure offering UUID exists in Waldur
No Usage Data
- Verify accounts exist in SLURM with configured prefixes
- Check SLURM accounting database has recent data
- Ensure users have submitted jobs in the current billing period
Permission Errors
- Agent typically needs root access for SLURM commands
- Verify service runs as root user
- Check file permissions on configuration file
Debugging
Enable debug logging by setting environment variable:
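The original snippet was lost; a sketch, where the variable name is an assumption (your version may configure logging differently):

```shell
# Variable name is an assumption; check your agent version's documentation
export WALDUR_SITE_AGENT_LOG_LEVEL=DEBUG
```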
Data Flow
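The original diagram was lost; the flow described in "How It Works" can be sketched as:

```
SLURM accounting DB
   │  sacct query (per account/user, current billing period)
   ▼
Waldur Site Agent (report mode)
   │  aggregation + unit conversion (unit_factor)
   ▼
Waldur Mastermind API (usage records per resource/user)
```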
Security Considerations
- API Token Security: Store Waldur API token securely, restrict file permissions
- Root Access: Agent needs root for SLURM commands - run in controlled environment
- Network: Ensure secure connection to Waldur Mastermind (HTTPS)
- Logging: Avoid logging sensitive data, configure log rotation
Historical Usage Loading
In addition to regular usage reporting, the SLURM plugin supports loading historical usage data into Waldur. This is useful for:
- Migrating existing SLURM usage data when first deploying Waldur
- Backfilling missing usage data due to outages or configuration issues
- Reconciling billing periods with historical SLURM accounting records
Prerequisites for Historical Loading
Staff User Requirements:
- Historical usage loading requires a staff user API token
- Regular offering API tokens cannot submit historical data
- The staff user must have appropriate permissions in Waldur
Data Requirements:
- SLURM accounting database must contain historical data for the requested periods
- Resources must already exist in Waldur (historical loading cannot create resources)
- Offering users must be configured in Waldur for user-level usage attribution
Historical Usage Command
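The original command was lost. A sketch using the parameters documented below; the entry point name `waldur_slurm_load_usage` is a hypothetical placeholder, so check the commands installed with your SLURM plugin version:

```shell
# Entry point name is hypothetical; the flags follow the parameter list below
waldur_slurm_load_usage \
  --config /etc/waldur/waldur-site-agent-config.yaml \
  --offering-uuid <offering-uuid> \
  --user-token <staff-user-token> \
  --start-date 2024-01-01 \
  --end-date 2024-12-31
```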
Command Parameters
- `--config`: Path to the agent configuration file (same as for regular usage reporting)
- `--offering-uuid`: UUID of the Waldur offering to load data for
- `--user-token`: Staff user API token (not the offering's regular API token)
- `--start-date`: Start date in YYYY-MM-DD format
- `--end-date`: End date in YYYY-MM-DD format
Processing Behavior
Monthly Processing:
- Historical usage is always processed monthly to align with Waldur's billing model
- Date ranges are automatically split into monthly billing periods
- Each month is processed independently for reliability and progress tracking
Data Attribution:
- Usage data is attributed to the first day of each billing month
- User usage includes both username and offering user URL when available
- Resource-level usage totals are calculated and submitted separately
Error Handling:
- Failed months are logged but don't stop processing of other months
- Individual user usage failures don't affect resource-level usage submission
- Progress is displayed: "Processing month 3/12: 2024-03"
Usage Examples
Load Full Year of Data
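A sketch (the `waldur_slurm_load_usage` entry point name is a hypothetical placeholder; verify it for your plugin version):

```shell
# Load all of 2024; processed internally as 12 monthly billing periods
waldur_slurm_load_usage \
  --config /etc/waldur/waldur-site-agent-config.yaml \
  --offering-uuid <offering-uuid> \
  --user-token <staff-user-token> \
  --start-date 2024-01-01 \
  --end-date 2024-12-31
```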
Load Specific Quarter
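Same hypothetical command, restricted to Q1:

```shell
# Load the first quarter of 2024 (three monthly periods)
waldur_slurm_load_usage \
  --config /etc/waldur/waldur-site-agent-config.yaml \
  --offering-uuid <offering-uuid> \
  --user-token <staff-user-token> \
  --start-date 2024-01-01 \
  --end-date 2024-03-31
```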
Load Single Month
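Same hypothetical command for a single month:

```shell
# Load March 2024 only
waldur_slurm_load_usage \
  --config /etc/waldur/waldur-site-agent-config.yaml \
  --offering-uuid <offering-uuid> \
  --user-token <staff-user-token> \
  --start-date 2024-03-01 \
  --end-date 2024-03-31
```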
Monitoring Historical Loads
Progress Tracking
The command provides detailed progress information:
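The original output sample was lost. The sketch below is illustrative; only the per-month progress line format ("Processing month 3/12: 2024-03") is documented earlier in this guide, the rest is assumed:

```
Processing month 1/12: 2024-01
Processing month 2/12: 2024-02
Processing month 3/12: 2024-03
...
Processing month 12/12: 2024-12
```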
Log Files
For production use, redirect output to log files:
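The original snippet was lost; a sketch using `tee` (the `waldur_slurm_load_usage` entry point name is a hypothetical placeholder):

```shell
mkdir -p /var/log/waldur

# Capture both stdout and stderr to a dated log file while still printing progress
waldur_slurm_load_usage \
  --config /etc/waldur/waldur-site-agent-config.yaml \
  --offering-uuid <offering-uuid> \
  --user-token <staff-user-token> \
  --start-date 2024-01-01 \
  --end-date 2024-12-31 \
  2>&1 | tee /var/log/waldur/historical-load-$(date +%Y%m%d).log
```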
Troubleshooting Historical Loads
Error Messages and Solutions
No Staff Privileges:
- Solution: Use an API token from a user with `is_staff=True` in Waldur
No Resources Found:
- Solution: Ensure resources exist in Waldur and have `backend_id` values set
No Usage Data:
- Solution: Check SLURM accounting database has data for that period
- Verify SLURM account names match Waldur resource `backend_id` values
Backend Not Supported:
- Solution: Ensure the offering is configured with the SLURM backend and the agent code is up to date
Performance Considerations
Large Date Ranges:
- Historical loads can take hours for multi-year ranges
- Each month requires multiple API calls to Waldur
- SLURM database queries may be slow for old data
Rate Limiting:
- Waldur may rate limit API calls during bulk submission
- Consider adding delays between months if encountering 429 errors
Database Impact:
- Large historical queries may impact SLURM cluster performance
- Consider running during maintenance windows for multi-year loads
Validation and Verification
Verify Data in Waldur:
- Check resource usage in Waldur marketplace
- Verify billing calculations include historical periods
- Confirm user-level usage attribution is correct
Cross-Reference with SLURM:
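The original query was lost; a sketch using standard `sacct` flags to pull raw accounting for one account and month (the account name is a placeholder):

```shell
# All users (-a), allocations only (-X), one account, one month.
# CPUTimeRAW is in core-seconds; divide by 60 to compare with CPU-minutes.
sacct -a -X \
  -A hpc_project_account \
  -S 2024-03-01 -E 2024-03-31T23:59:59 \
  --format=Account,User,JobID,CPUTimeRAW,ElapsedRaw \
  --parsable2
```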
Integration Notes
This setup is designed for usage reporting only. For a complete Waldur Site Agent deployment that includes:
- Order processing (resource creation/deletion)
- Membership synchronization
- Event processing
you will need additional agent instances or a multi-mode configuration with a separate service file for each mode.
Historical Loading Integration:
- Historical loading is a separate command, not part of regular agent operation
- Run historical loads before starting regular usage reporting to avoid conflicts
- Historical data submission requires staff tokens, regular reporting uses offering tokens