
SLURM Periodic Usage Policy Configuration Guide

Overview

The SlurmPeriodicUsagePolicy enables automatic management of SLURM resource allocations with:

  • Periodic (quarterly) usage tracking
  • Automatic QoS adjustments based on usage thresholds
  • Carryover of unused allocations with decay
  • Grace periods for temporary overconsumption
  • Integration with site agent for SLURM account management

Available Actions

Core Actions (Inherited from OfferingPolicy)

  1. notify_organization_owners - Send email notifications to organization owners
  2. notify_external_user - Send notifications to external email addresses
  3. block_creation_of_new_resources - Block creation of new SLURM resources

SLURM-Specific Actions

  1. request_slurm_resource_downscaling - Apply slowdown QoS (sets resource.downscaled = True)
  2. request_slurm_resource_pausing - Apply blocked QoS (sets resource.paused = True)

How It Works

Threshold Triggers

The policy checks usage percentages and triggers actions at different thresholds:

  • 80%: Notification threshold (hardcoded)
  • 100%: Normal threshold - triggers request_slurm_resource_downscaling
  • 120% (with 20% grace): Grace limit - triggers request_slurm_resource_pausing
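
The grace limit is derived from the policy's grace_ratio, so a grace_ratio of 0.2 turns the 100% threshold into a 120% pausing limit. The relationship can be illustrated with a minimal sketch (illustrative only, not the actual policy code):

# Illustrative sketch of how the trigger points relate to grace_ratio.
NOTIFICATION_THRESHOLD = 80.0   # hardcoded notification threshold
NORMAL_THRESHOLD = 100.0        # triggers downscaling
grace_ratio = 0.2
grace_limit = NORMAL_THRESHOLD * (1 + grace_ratio)  # 120.0 -> triggers pausing

def triggered_actions(usage_percentage):
    actions = []
    if usage_percentage >= NOTIFICATION_THRESHOLD:
        actions.append("notify")
    if usage_percentage >= NORMAL_THRESHOLD:
        actions.append("request_slurm_resource_downscaling")
    if usage_percentage >= grace_limit:
        actions.append("request_slurm_resource_pausing")
    return actions

print(triggered_actions(95))   # ['notify']
print(triggered_actions(125))  # all three actions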

Site Agent Integration

When actions are triggered:

  1. request_slurm_resource_downscaling → Site agent applies qos_downscaled (e.g., "limited")
  2. request_slurm_resource_pausing → Site agent applies qos_paused (e.g., "paused")
  3. Normal state → Site agent applies qos_default (e.g., "normal")
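
Conceptually, the site agent picks the QoS from the resource flags set by these actions. A simplified sketch (not the agent's actual code; the qos_* names come from the backend settings shown under Site Agent Configuration below):

def select_qos(resource, backend_settings):
    # Simplified sketch of the QoS selection performed by the site agent.
    if resource.paused:
        return backend_settings["qos_paused"]      # e.g. "blocked"
    if resource.downscaled:
        return backend_settings["qos_downscaled"]  # e.g. "limited"
    return backend_settings["qos_default"]         # e.g. "normal"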

Configuration Examples

1. Basic Notification Policy

Send notifications when usage reaches 80%:

from waldur_mastermind.policy import models

policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=slurm_offering,
    actions="notify_organization_owners",
    apply_to_all=True,
    grace_ratio=0.2,
    carryover_enabled=True,
)

2. Progressive QoS Management

Apply slowdown at 100% usage with notifications:

policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=slurm_offering,
    actions="notify_organization_owners,request_slurm_resource_downscaling",
    apply_to_all=True,
    grace_ratio=0.2,
    carryover_enabled=True,
)

3. Full Enforcement Policy

Complete enforcement with notifications, slowdown, and blocking:

# Policy for 100% threshold
threshold_policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=slurm_offering,
    actions="notify_organization_owners,request_slurm_resource_downscaling,block_creation_of_new_resources",
    apply_to_all=True,
    grace_ratio=0.2,
    carryover_enabled=True,
)

# Additional policy for grace limit (would need separate instance)
grace_policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=slurm_offering,
    actions="notify_external_user,request_slurm_resource_pausing",
    apply_to_all=True,
    grace_ratio=0.2,
    options={"notify_external_user": "hpc-admin@example.com"},
)

4. Organization-Specific Policy

Apply policy only to specific organization groups:

research_group = OrganizationGroup.objects.get(name="Research Universities")

policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=slurm_offering,
    actions="request_slurm_resource_downscaling",
    apply_to_all=False,  # Not universal
    grace_ratio=0.3,  # 30% grace for research
    carryover_enabled=True,
)
policy.organization_groups.add(research_group)

Site Agent Configuration

Configure the site agent to handle QoS changes:

# waldur-site-agent-config.yaml
offerings:
  - name: "SLURM HPC Cluster"
    backend_type: "slurm"
    backend_settings:
      # QoS mappings
      qos_downscaled: "slowdown"   # Applied at 100% usage
      qos_paused: "blocked"        # Applied at grace limit
      qos_default: "normal"        # Applied when below thresholds

      # Periodic limits configuration
      periodic_limits:
        enabled: true
        limit_type: "GrpTRESMins"
        tres_billing_enabled: true
        tres_billing_weights:
          CPU: 0.015625
          Mem: 0.001953125G
          "GRES/gpu": 0.25
        fairshare_decay_half_life: 15
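
SLURM combines TRES billing weights linearly, so with the example weights above a job's billing value is roughly the weighted sum of its resources (illustrative arithmetic only; actual accounting depends on the cluster configuration):

# Illustrative arithmetic: 64 CPUs, 256 GB of memory and 1 GPU with the weights above.
cpus, mem_gb, gpus = 64, 256, 1
billing = cpus * 0.015625 + mem_gb * 0.001953125 + gpus * 0.25
print(billing)  # 1.0 + 0.5 + 0.25 = 1.75 billing units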

Policy Parameters

Core Parameters

  • apply_to_all: True for all customers, False for specific groups
  • organization_groups: Specific groups if not applying to all
  • actions: Comma-separated list of actions to trigger

SLURM-Specific Parameters

  • limit_type: "GrpTRESMins", "MaxTRESMins", or "GrpTRES"
  • tres_billing_enabled: Use TRES billing units vs raw values
  • tres_billing_weights: Weight configuration for billing units
  • fairshare_decay_half_life: Days for fairshare decay (default: 15)
  • grace_ratio: Grace period ratio (0.2 = 20% overconsumption)
  • carryover_enabled: Allow unused allocation carryover
  • raw_usage_reset: Reset SLURM raw usage at period transitions
  • qos_strategy: "threshold" or "progressive"
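
A policy that sets the SLURM-specific parameters explicitly could look like the sketch below (field names follow the parameter list above; values are illustrative, and any omitted parameter falls back to its default):

policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=slurm_offering,
    actions="notify_organization_owners,request_slurm_resource_downscaling",
    apply_to_all=True,
    limit_type="GrpTRESMins",       # or "MaxTRESMins" / "GrpTRES"
    tres_billing_enabled=True,      # interpret limits in TRES billing units
    grace_ratio=0.2,                # allow 20% overconsumption before pausing
    carryover_enabled=True,         # carry unused allocation into the next period
    raw_usage_reset=True,           # reset SLURM raw usage at period transitions
    qos_strategy="threshold",       # or "progressive"
    fairshare_decay_half_life=15,   # days
)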

Usage Scenarios

Scenario 1: Academic Institution with Quarterly Allocations

# 1000 node-hours per quarter with 20% grace
policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=academic_slurm,
    actions="notify_organization_owners,request_slurm_resource_downscaling",
    apply_to_all=True,
    limit_type="GrpTRESMins",
    grace_ratio=0.2,
    carryover_enabled=True,
    fairshare_decay_half_life=15,
)

# Add component limit
models.OfferingComponentLimit.objects.create(
    policy=policy,
    component=node_hours_component,
    limit=1000,
)

Scenario 2: Commercial Cloud with Strict Limits

# No grace period, immediate blocking
policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=commercial_slurm,
    actions="request_slurm_resource_pausing,block_creation_of_new_resources",
    apply_to_all=True,
    grace_ratio=0.0,  # No grace period
    carryover_enabled=False,  # No carryover
)

Scenario 3: Research Consortium with Flexible Limits

# Generous grace period with carryover
policy = models.SlurmPeriodicUsagePolicy.objects.create(
    offering=consortium_slurm,
    actions="notify_organization_owners",
    apply_to_all=False,
    grace_ratio=0.5,  # 50% grace period
    carryover_enabled=True,
    fairshare_decay_half_life=30,  # Slower decay
)
policy.organization_groups.add(consortium_members)

API Usage

Create Policy via API

curl -X POST https://waldur.example.com/api/marketplace-slurm-periodic-usage-policies/ \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "scope": "OFFERING_UUID",
    "actions": "notify_organization_owners,request_slurm_resource_downscaling",
    "apply_to_all": true,
    "grace_ratio": 0.2,
    "carryover_enabled": true,
    "component_limits_set": [
      {
        "type": "node_hours",
        "limit": 1000
      }
    ]
  }'

Check Policy Status

curl https://waldur.example.com/api/marketplace-slurm-periodic-usage-policies/POLICY_UUID/ \
  -H "Authorization: Token YOUR_TOKEN"

Evaluation and Testing

Staff-Only API Actions

Two staff-only API actions allow testing policy evaluation directly from the frontend or API without waiting for automatic triggers.

Dry Run

Calculate usage percentages and show what actions would be triggered without applying any changes.

curl -X POST https://waldur.example.com/api/marketplace-slurm-periodic-usage-policies/POLICY_UUID/dry-run/ \
  -H "Authorization: Token STAFF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

Optionally scope to a single resource:

curl -X POST .../POLICY_UUID/dry-run/ \
  -H "Authorization: Token STAFF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"resource_uuid": "RESOURCE_UUID"}'

The response includes, per resource: usage_percentage, the current paused/downscaled state, and the would_trigger actions.

Evaluate (Synchronous)

Run the full evaluation: calculate usage, apply actions (pause/downscale/notify), and create evaluation log entries.

curl -X POST https://waldur.example.com/api/marketplace-slurm-periodic-usage-policies/POLICY_UUID/evaluate/ \
  -H "Authorization: Token STAFF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

The response includes, per resource: usage_percentage, actions_taken, previous_state, and new_state.

Frontend

Staff users see an Evaluate button on the SLURM policy configuration panel. This opens a dialog with:

  • Dry run — read-only preview of what would happen
  • Evaluate now — runs the full evaluation synchronously and shows results

Management Commands

Three management commands are available for CLI-based testing and monitoring:

evaluate_slurm_policy

# Dry run: show what would happen without applying changes
waldur evaluate_slurm_policy --policy <UUID> --dry-run

# Dry run for a single resource
waldur evaluate_slurm_policy --policy <UUID> --resource <UUID> --dry-run

# Run synchronously (blocking, results printed immediately)
waldur evaluate_slurm_policy --policy <UUID> --sync

# Queue async Celery tasks (check worker logs for results)
waldur evaluate_slurm_policy --policy <UUID>

slurm_policy_status

# Show all policies with resource states, evaluation logs, command history
waldur slurm_policy_status

# Single policy with more history
waldur slurm_policy_status --policy <UUID> --logs 50 --commands 20

# Filter to a specific resource
waldur slurm_policy_status --policy <UUID> --resource <UUID>

cleanup_slurm_logs

# Manually trigger evaluation log cleanup (uses constance retention setting)
waldur cleanup_slurm_logs

Monitoring and Observability

Evaluation Log

Every policy evaluation creates a SlurmPolicyEvaluationLog record with:

  • usage_percentage — resource usage at the time of evaluation
  • grace_limit_percentage — the grace threshold that was applied
  • actions_taken — list of actions triggered (e.g. ["downscale", "notify"])
  • previous_state / new_state — paused and downscaled flags before and after
  • stomp_message_sent — whether a STOMP message was published to the site agent
  • site_agent_confirmed — whether the site agent reported success (null = pending)
  • site_agent_response — full response from the site agent
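
From the Django shell, recent evaluation logs can be inspected directly. A minimal sketch, assuming the log model lives alongside the policy models and is linked to the policy via a policy foreign key with a created timestamp:

from waldur_mastermind.policy.models import (
    SlurmPeriodicUsagePolicy,
    SlurmPolicyEvaluationLog,
)

# Hypothetical query: field names follow the list above; the module path,
# the policy foreign key and the created field are assumptions.
policy = SlurmPeriodicUsagePolicy.objects.get(uuid="POLICY_UUID")
logs = SlurmPolicyEvaluationLog.objects.filter(policy=policy).order_by("-created")[:10]
for log in logs:
    print(log.usage_percentage, log.actions_taken, log.site_agent_confirmed)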

Command History

When STOMP messages are sent to the site agent, each generated SLURM command is recorded in SlurmCommandHistory:

  • command_type — e.g. fairshare, limits, qos, reset_usage
  • shell_command — the actual sacctmgr command
  • execution_mode — production or emulator
  • success / error_message — filled in by site agent report-back

API Endpoints

# List evaluation logs for a policy (filterable by resource_uuid, billing_period)
GET /api/marketplace-slurm-periodic-usage-policies/POLICY_UUID/evaluation-logs/

# List command history for a policy (filterable by resource_uuid)
GET /api/marketplace-slurm-periodic-usage-policies/POLICY_UUID/command-history/

# Site agent reports command execution result
POST /api/marketplace-slurm-periodic-usage-policies/POLICY_UUID/report-command-result/

Frontend Execution Log

The SLURM policy panel includes:

  • Status summary — inline card showing last evaluation timestamp, count of paused/downscaled resources, and site agent confirmation status
  • Execution log dialog with two tabs:
      • Evaluation History — table with timestamps, resource names, usage percentages (colour-coded), action badges, and state transitions
      • Command History — table with command types, shell commands, execution mode, and success/failure status

Structured Events

Policy evaluations emit a SLURM_POLICY_EVALUATION event type, visible in the Waldur events system.

Log Retention

Evaluation logs are automatically cleaned up by a daily Celery beat task (cleanup-slurm-evaluation-logs, runs at 03:00). The retention period is configurable via:

  • Constance setting: SLURM_POLICY_EVALUATION_LOG_RETENTION_DAYS (default: 90 days)
  • HomePort admin: Administration > Marketplace > SLURM policy
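
The current value can also be checked from the Django shell via django-constance (a quick sketch):

from constance import config

# Retention period (in days) for SLURM policy evaluation logs
print(config.SLURM_POLICY_EVALUATION_LOG_RETENTION_DAYS)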

Check Resource Usage (Django Shell)

from waldur_mastermind.marketplace.models import Resource
from waldur_mastermind.policy.models import SlurmPeriodicUsagePolicy

policy = SlurmPeriodicUsagePolicy.objects.get(offering=offering)  # offering: the SLURM marketplace offering
resource = Resource.objects.get(uuid="RESOURCE_UUID")

usage_percentage = policy.get_resource_usage_percentage(resource)
print(f"Current usage: {usage_percentage:.1f}%")

Debug Carryover Calculations

settings = policy.calculate_slurm_settings(resource)
print(f"Carryover details: {settings['carryover_details']}")
print(f"Total allocation: {settings['carryover_details']['total_allocation']} node-hours")

Site Agent Feedback Loop

After the site agent applies SLURM commands, it reports results back to Waldur:

  1. Site agent receives STOMP message with action: apply_periodic_settings
  2. Site agent executes sacctmgr commands via the backend
  3. Site agent POSTs the result to /api/marketplace-slurm-periodic-usage-policies/{policy_uuid}/report-command-result/
  4. Waldur updates SlurmCommandHistory.success and SlurmPolicyEvaluationLog.site_agent_confirmed

The STOMP message payload includes policy_uuid so the site agent knows which policy endpoint to report to.

Best Practices

  1. Start with Notifications: Begin with notification-only policies to understand usage patterns
  2. Use Dry Run First: Run waldur evaluate_slurm_policy --dry-run or the frontend Dry Run button before enabling enforcement
  3. Test in Staging: Validate policies in a test environment first
  4. Monitor Grace Periods: Ensure grace ratios align with user needs
  5. Review Evaluation Logs: Check the execution log regularly for unexpected actions
  6. Regular Review: Review carryover and decay settings quarterly
  7. Clear Communication: Inform users about thresholds and consequences

Troubleshooting Common Issues

Policy Not Triggering

  • Check that apply_to_all=True or resource's customer is in organization_groups
  • Verify component usage data exists for the current period
  • Ensure resource is not in TERMINATED state
  • Run waldur evaluate_slurm_policy --policy <UUID> --dry-run to see current usage percentages

QoS Not Changing

  • Verify site agent configuration has correct QoS names
  • Check site agent logs for SLURM command execution
  • Ensure resource backend_id matches SLURM account name
  • Check the command history endpoint or waldur slurm_policy_status for sent commands and site agent responses

Incorrect Usage Calculations

  • Review carryover settings and decay factor
  • Check billing period alignment (quarterly boundaries)
  • Verify component type matches between policy and usage data

No Evaluation Logs Appearing

  • Confirm the evaluation was triggered (check Celery worker logs)
  • Verify the policy has resources in the offering
  • Use the staff Evaluate button or waldur evaluate_slurm_policy --sync to run synchronously and see immediate results

Site Agent Not Reporting Back

  • Check that policy_uuid is present in the STOMP message payload
  • Verify the site agent has network access to the Waldur API
  • Check site agent logs for HTTP errors when POSTing to report-command-result

Migration from Manual Management

For organisations transitioning from manual SLURM management:

  1. Audit Current Allocations: Document existing quotas and QoS settings
  2. Create Initial Policies: Start with generous grace periods
  3. Enable Notifications First: Monitor before enforcing — use the execution log to verify calculations
  4. Dry Run Testing: Use the staff dry-run feature to validate policy behaviour before enabling enforcement actions
  5. Gradual Enforcement: Phase in QoS changes over 2-3 quarters
  6. User Training: Educate users about automatic management