SLURM Periodic Usage Policy Configuration Guide
Overview
The SlurmPeriodicUsagePolicy enables automatic management of SLURM resource allocations with:
- Periodic (quarterly) usage tracking
- Automatic QoS adjustments based on usage thresholds
- Carryover of unused allocations with decay
- Grace periods for temporary overconsumption
- Integration with site agent for SLURM account management
Available Actions
Core Actions (Inherited from OfferingPolicy)
- notify_organization_owners - Send email notifications to organization owners
- notify_external_user - Send notifications to external email addresses
- block_creation_of_new_resources - Block creation of new SLURM resources
SLURM-Specific Actions
- request_slurm_resource_downscaling - Apply slowdown QoS (sets resource.downscaled = True)
- request_slurm_resource_pausing - Apply blocked QoS (sets resource.paused = True)
How It Works
Threshold Triggers
The policy checks usage percentages and triggers actions at different thresholds:
- 80%: Notification threshold (hardcoded)
- 100%: Normal threshold - triggers request_slurm_resource_downscaling
- 120% (with 20% grace): Grace limit - triggers request_slurm_resource_pausing
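The threshold logic above can be sketched in a few lines of Python. This is an illustration, not the policy's actual implementation: the function name would_trigger, the "notify" label, and the use of >= comparisons are assumptions. Only the 80% and 100% thresholds, the grace ratio, and the two action names come from this guide.

```python
NOTIFY_THRESHOLD = 80.0   # hardcoded notification threshold
NORMAL_THRESHOLD = 100.0  # normal threshold


def would_trigger(usage_percentage: float, grace_ratio: float = 0.2) -> list[str]:
    """Return the labels of the thresholds reached at a given usage percentage."""
    # With grace_ratio = 0.2 the grace limit is 120% of the allocation.
    grace_limit = NORMAL_THRESHOLD * (1 + grace_ratio)
    reached = []
    if usage_percentage >= NOTIFY_THRESHOLD:
        reached.append("notify")  # assumed label for the 80% notification
    if usage_percentage >= grace_limit:
        reached.append("request_slurm_resource_pausing")
    elif usage_percentage >= NORMAL_THRESHOLD:
        reached.append("request_slurm_resource_downscaling")
    return reached
```

For example, would_trigger(105) reports the notification and the downscaling action, while would_trigger(125) reports the notification and the pausing action.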
Site Agent Integration
When actions are triggered:
- request_slurm_resource_downscaling → Site agent applies qos_downscaled (e.g., "limited")
- request_slurm_resource_pausing → Site agent applies qos_paused (e.g., "paused")
- Normal state → Site agent applies qos_default (e.g., "normal")
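The mapping from resource state to QoS can be pictured as a small lookup. The sketch below is not the site agent's code: the function select_qos is hypothetical, and giving paused precedence over downscaled is an assumption. The three setting names and example QoS values are the ones listed above.

```python
# Example QoS names taken from the list above; real sites choose their own.
QOS_NAMES = {
    "qos_default": "normal",
    "qos_downscaled": "limited",
    "qos_paused": "paused",
}


def select_qos(paused: bool, downscaled: bool, qos_names: dict) -> str:
    """Pick the QoS name for a resource from its paused/downscaled flags."""
    if paused:  # assumed to take precedence over downscaled
        return qos_names["qos_paused"]
    if downscaled:
        return qos_names["qos_downscaled"]
    return qos_names["qos_default"]
```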
Configuration Examples
1. Basic Notification Policy
Send notifications when usage reaches 80%:
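As a rough sketch, a notification-only policy needs little more than a scope and a single action. The dictionary layout and the parse_actions helper below are illustrative assumptions; the field names and action names are the ones documented in this guide.

```python
# Action names listed under "Available Actions" in this guide.
KNOWN_ACTIONS = {
    "notify_organization_owners",
    "notify_external_user",
    "block_creation_of_new_resources",
    "request_slurm_resource_downscaling",
    "request_slurm_resource_pausing",
}


def parse_actions(actions: str) -> list[str]:
    """Split a comma-separated actions string and reject unknown names."""
    names = [name.strip() for name in actions.split(",") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_ACTIONS]
    if unknown:
        raise ValueError(f"Unknown actions: {', '.join(unknown)}")
    return names


# Hypothetical notification-only policy: no QoS changes are requested.
basic_notification_policy = {
    "apply_to_all": True,
    "actions": "notify_organization_owners",
}
```

Checking the actions string before saving a policy catches typos in action names early.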
2. Progressive QoS Management
Apply slowdown at 100% usage with notifications:
3. Full Enforcement Policy
Complete enforcement with notifications, slowdown, and blocking:
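A full enforcement policy could combine the notification, downscaling, and pausing actions with a grace ratio. The values below are illustrative assumptions rather than recommended settings, and the dictionary is only a sketch of the documented parameters. If new resource creation should also be blocked, block_creation_of_new_resources could be appended to the list.

```python
# Actions for notifications, slowdown, and blocking, in escalation order.
FULL_ENFORCEMENT_ACTIONS = [
    "notify_organization_owners",
    "request_slurm_resource_downscaling",
    "request_slurm_resource_pausing",
]

# Hypothetical full enforcement policy.
full_enforcement_policy = {
    "apply_to_all": True,
    # The actions parameter is a comma-separated string.
    "actions": ",".join(FULL_ENFORCEMENT_ACTIONS),
    "grace_ratio": 0.2,           # pausing at 120% of the allocation
    "qos_strategy": "threshold",  # the other documented value is "progressive"
}
```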
4. Organization-Specific Policy
Apply policy only to specific organization groups:
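When apply_to_all is False, the policy only makes sense if organization_groups names at least one group. The sketch below shows that consistency check. The check_scope helper and the placeholder group identifiers are hypothetical, and the way groups are actually referenced (URL, UUID, or object) is not specified in this guide.

```python
def check_scope(policy: dict) -> None:
    """Raise ValueError if a group-scoped policy names no groups."""
    if not policy.get("apply_to_all") and not policy.get("organization_groups"):
        raise ValueError("organization_groups must be set when apply_to_all is False")


# Hypothetical group-scoped policy; the group identifiers are placeholders.
group_policy = {
    "apply_to_all": False,
    "organization_groups": ["<ORGANIZATION_GROUP_A>", "<ORGANIZATION_GROUP_B>"],
    "actions": "notify_organization_owners,request_slurm_resource_downscaling",
}

check_scope(group_policy)  # passes: two groups are listed
```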
Site Agent Configuration
Configure the site agent to handle QoS changes:
Policy Parameters
Core Parameters
- apply_to_all: True for all customers, False for specific groups
- organization_groups: Specific groups if not applying to all
- actions: Comma-separated list of actions to trigger
SLURM-Specific Parameters
- limit_type: "GrpTRESMins", "MaxTRESMins", or "GrpTRES"
- tres_billing_enabled: Use TRES billing units vs raw values
- tres_billing_weights: Weight configuration for billing units
- fairshare_decay_half_life: Days for fairshare decay (default: 15)
- grace_ratio: Grace period ratio (0.2 = 20% overconsumption)
- carryover_enabled: Allow unused allocation carryover
- raw_usage_reset: Reset SLURM raw usage at period transitions
- qos_strategy: "threshold" or "progressive"
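The SLURM-specific parameters can be summarised as a small validated structure. The dataclass below is a hypothetical sketch, not the policy model itself. Apart from fairshare_decay_half_life (documented default 15), the default values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

LIMIT_TYPES = ("GrpTRESMins", "MaxTRESMins", "GrpTRES")
QOS_STRATEGIES = ("threshold", "progressive")


@dataclass
class SlurmPolicyOptions:
    limit_type: str = "GrpTRESMins"                # assumed default
    tres_billing_enabled: bool = True              # assumed default
    tres_billing_weights: dict = field(default_factory=dict)
    fairshare_decay_half_life: int = 15            # days, documented default
    grace_ratio: float = 0.2                       # assumed default (20% grace)
    carryover_enabled: bool = True                 # assumed default
    raw_usage_reset: bool = True                   # assumed default
    qos_strategy: str = "threshold"                # assumed default

    def __post_init__(self):
        # Reject values outside the documented choices.
        if self.limit_type not in LIMIT_TYPES:
            raise ValueError(f"limit_type must be one of {LIMIT_TYPES}")
        if self.qos_strategy not in QOS_STRATEGIES:
            raise ValueError(f"qos_strategy must be one of {QOS_STRATEGIES}")
        if self.grace_ratio < 0:
            raise ValueError("grace_ratio must not be negative")
        if self.fairshare_decay_half_life <= 0:
            raise ValueError("fairshare_decay_half_life must be positive")
```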
Usage Scenarios
Scenario 1: Academic Institution with Quarterly Allocations
Scenario 2: Commercial Cloud with Strict Limits
Scenario 3: Research Consortium with Flexible Limits
API Usage
Create Policy via API
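A policy creation request might be assembled as follows. This sketch makes several assumptions: that policies are created with a POST to the collection path /api/marketplace-slurm-periodic-usage-policies/ (inferred from the report-command-result URL later in this guide), that the API accepts token authentication in the form shown, and that the payload uses the documented parameter names. Fields that tie the policy to an offering are omitted because this guide does not name them. The request is only built here, not sent.

```python
import json
import urllib.request


def build_create_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that would create a policy."""
    url = base_url.rstrip("/") + "/api/marketplace-slurm-periodic-usage-policies/"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical host, token, and payload.
request = build_create_request(
    "https://waldur.example.com",
    "<API_TOKEN>",
    {"apply_to_all": True, "actions": "notify_organization_owners", "grace_ratio": 0.2},
)
```

Passing the request to urllib.request.urlopen would send it; inspect request.full_url and request.data first to confirm what will be submitted.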
Check Policy Status
Evaluation and Testing
Staff-Only API Actions
Two staff-only API actions allow testing policy evaluation directly from the frontend or API without waiting for automatic triggers.
Dry Run
Calculate usage percentages and show what actions would be triggered without applying any changes.
Optionally scope to a single resource:
The response includes, for each resource, usage_percentage, the current paused/downscaled state, and the would_trigger actions.
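A dry-run result can be reduced to the resources that would be affected. The sketch below assumes the response is a list of per-resource dictionaries. The usage_percentage, paused, downscaled, and would_trigger keys follow the description above, while the resource_name key, the helper name, and the sample data are hypothetical.

```python
def pending_actions(resources: list[dict]) -> dict[str, list[str]]:
    """Map resource name to its would_trigger list, skipping empty lists."""
    return {
        item["resource_name"]: item["would_trigger"]
        for item in resources
        if item["would_trigger"]
    }


# Hypothetical dry-run result for two resources.
sample = [
    {"resource_name": "alloc-a", "usage_percentage": 72.5,
     "paused": False, "downscaled": False, "would_trigger": []},
    {"resource_name": "alloc-b", "usage_percentage": 104.0,
     "paused": False, "downscaled": False,
     "would_trigger": ["request_slurm_resource_downscaling"]},
]
```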
Evaluate (Synchronous)
Run the full evaluation: calculate usage, apply actions (pause/downscale/notify), and create evaluation log entries.
The response includes, for each resource, usage_percentage, actions_taken, previous_state, and new_state.
Frontend
Staff users see an Evaluate button on the SLURM policy configuration panel. This opens a dialog with:
- Dry run — read-only preview of what would happen
- Evaluate now — runs the full evaluation synchronously and shows results
Management Commands
Three management commands are available for CLI-based testing and monitoring:
evaluate_slurm_policy
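For scripted use, the command line can be assembled programmatically. The helper below is a hypothetical sketch that uses only the options mentioned elsewhere in this guide (--policy, --dry-run, --sync). Whether these options can be combined freely is an assumption, and the command may accept other options not shown here.

```python
from typing import Optional


def evaluate_command(policy_uuid: Optional[str] = None,
                     dry_run: bool = False,
                     sync: bool = False) -> list[str]:
    """Build the argument list for the evaluate_slurm_policy command."""
    argv = ["waldur", "evaluate_slurm_policy"]
    if policy_uuid:
        argv += ["--policy", policy_uuid]
    if dry_run:
        argv.append("--dry-run")
    if sync:
        argv.append("--sync")
    return argv
```

The resulting list can be passed to subprocess.run(argv, check=True) on a host where the waldur command is installed.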
slurm_policy_status
cleanup_slurm_logs
Monitoring and Observability
Evaluation Log
Every policy evaluation creates a SlurmPolicyEvaluationLog record with:
- usage_percentage: resource usage at the time of evaluation
- grace_limit_percentage: the grace threshold that was applied
- actions_taken: list of actions triggered (e.g. ["downscale", "notify"])
- previous_state / new_state: paused and downscaled flags before and after
- stomp_message_sent: whether a STOMP message was published to the site agent
- site_agent_confirmed: whether the site agent reported success (null = pending)
- site_agent_response: full response from the site agent
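The two delivery fields combine into four practical states. The helper below is a hypothetical sketch of how a monitoring script might interpret them. The status labels are invented for illustration; only the field names and the "null = pending" convention come from the list above.

```python
from typing import Optional


def confirmation_status(stomp_message_sent: bool,
                        site_agent_confirmed: Optional[bool]) -> str:
    """Summarise the site agent delivery state of one evaluation log entry."""
    if not stomp_message_sent:
        return "not sent"      # nothing was published to the site agent
    if site_agent_confirmed is None:
        return "pending"       # published, no report-back yet
    return "confirmed" if site_agent_confirmed else "failed"
```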
Command History
When STOMP messages are sent to the site agent, each generated SLURM command is recorded in SlurmCommandHistory:
- command_type: e.g. fairshare, limits, qos, reset_usage
- shell_command: the actual sacctmgr command
- execution_mode: production or emulator
- success / error_message: filled in by site agent report-back
API Endpoints
Frontend Execution Log
The SLURM policy panel includes:
- Status summary — inline card showing last evaluation timestamp, count of paused/downscaled resources, and site agent confirmation status
- Execution log dialog with two tabs:
- Evaluation History — table with timestamps, resource names, usage percentages (colour-coded), action badges, and state transitions
- Command History — table with command types, shell commands, execution mode, and success/failure status
Structured Events
Policy evaluations emit a SLURM_POLICY_EVALUATION event type, visible in the Waldur events system.
Log Retention
Evaluation logs are automatically cleaned up by a daily Celery beat task (cleanup-slurm-evaluation-logs, runs at 03:00). The retention period is configurable via:
- Constance setting: SLURM_POLICY_EVALUATION_LOG_RETENTION_DAYS (default: 90 days)
- HomePort admin: Administration > Marketplace > SLURM policy
Check Resource Usage (Django Shell)
Debug Carryover Calculations
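When checking carryover figures by hand, it helps to have a reference calculation. This guide does not give the exact formula the policy uses, so the sketch below assumes a standard half-life decay applied to the unused part of the previous allocation. The function names and the way days_elapsed is measured are hypothetical.

```python
def decay_factor(days_elapsed: float, half_life_days: float = 15) -> float:
    """Fraction remaining after days_elapsed under half-life decay."""
    return 0.5 ** (days_elapsed / half_life_days)


def carryover(allocation: float, usage: float,
              days_elapsed: float, half_life_days: float = 15) -> float:
    """Decayed carryover of the unused allocation (assumed formula)."""
    unused = max(allocation - usage, 0.0)  # overconsumption carries nothing over
    return unused * decay_factor(days_elapsed, half_life_days)
```

Under these assumptions, an allocation of 1000 units with 600 used leaves 400 unused; after one half-life (15 days) the carryover would be 200 units.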
Site Agent Feedback Loop
After the site agent applies SLURM commands, it reports results back to Waldur:
- Site agent receives STOMP message with action: apply_periodic_settings
- Site agent executes sacctmgr commands via the backend
- Site agent POSTs the result to /api/marketplace-slurm-periodic-usage-policies/{policy_uuid}/report-command-result/
- Waldur updates SlurmCommandHistory.success and SlurmPolicyEvaluationLog.site_agent_confirmed
The STOMP message payload includes policy_uuid so the site agent knows which policy endpoint to report to.
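The report-back address can be derived from the STOMP payload alone. The sketch below assumes the payload has already been decoded into a dictionary. The report_url helper, the host name, and the sample UUID are hypothetical, while the policy_uuid key and the URL path come from this guide.

```python
def report_url(base_url: str, stomp_payload: dict) -> str:
    """Build the report-command-result URL for the policy named in the payload."""
    policy_uuid = stomp_payload.get("policy_uuid")
    if not policy_uuid:
        # Without policy_uuid the agent cannot tell which endpoint to call.
        raise ValueError("STOMP payload has no policy_uuid")
    return (
        f"{base_url.rstrip('/')}/api/marketplace-slurm-periodic-usage-policies/"
        f"{policy_uuid}/report-command-result/"
    )


# Hypothetical payload and host.
payload = {"action": "apply_periodic_settings", "policy_uuid": "0123abcd"}
url = report_url("https://waldur.example.com/", payload)
```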
Best Practices
- Start with Notifications: Begin with notification-only policies to understand usage patterns
- Use Dry Run First: Run waldur evaluate_slurm_policy --dry-run or the frontend Dry Run button before enabling enforcement
- Test in Staging: Validate policies in a test environment first
- Monitor Grace Periods: Ensure grace ratios align with user needs
- Review Evaluation Logs: Check the execution log regularly for unexpected actions
- Regular Review: Review carryover and decay settings quarterly
- Clear Communication: Inform users about thresholds and consequences
Troubleshooting Common Issues
Policy Not Triggering
- Check that apply_to_all=True or resource's customer is in organization_groups
- Verify component usage data exists for the current period
- Ensure resource is not in TERMINATED state
- Run waldur evaluate_slurm_policy --policy <UUID> --dry-run to see current usage percentages
QoS Not Changing
- Verify site agent configuration has correct QoS names
- Check site agent logs for SLURM command execution
- Ensure resource backend_id matches SLURM account name
- Check the command history endpoint or waldur slurm_policy_status for sent commands and site agent responses
Incorrect Usage Calculations
- Review carryover settings and decay factor
- Check billing period alignment (quarterly boundaries)
- Verify component type matches between policy and usage data
No Evaluation Logs Appearing
- Confirm the evaluation was triggered (check Celery worker logs)
- Verify the policy has resources in the offering
- Use the staff Evaluate button or waldur evaluate_slurm_policy --sync to run synchronously and see immediate results
Site Agent Not Reporting Back
- Check that policy_uuid is present in the STOMP message payload
- Verify the site agent has network access to the Waldur API
- Check site agent logs for HTTP errors when POSTing to report-command-result
Migration from Manual Management
For organisations transitioning from manual SLURM management:
- Audit Current Allocations: Document existing quotas and QoS settings
- Create Initial Policies: Start with generous grace periods
- Enable Notifications First: Monitor before enforcing — use the execution log to verify calculations
- Dry Run Testing: Use the staff dry-run feature to validate policy behaviour before enabling enforcement actions
- Gradual Enforcement: Phase in QoS changes over 2-3 quarters
- User Training: Educate users about automatic management