STOMP-Based Event Notification System
System Overview
The STOMP-based event notification system allows Waldur to communicate changes to resources, orders, and user roles to the waldur-site-agent that runs on a remote cluster. This eliminates the need for constant polling and enables immediate reactions to events.
The key components include:
- STOMP Publisher (Waldur side): Located in the waldur_core/logging/utils.py file, this component publishes messages to STOMP queues when specific events occur.
- Event Subscription Service: Manages subscriptions to events by creating unique topics for each type of notification. Related file: event subscription management via API: waldur_core/logging/views.py
- STOMP Consumer (Agent side): The waldur-site-agent running on the resource provider's infrastructure that subscribes to these topics and processes incoming messages. Related files:
  - Event subscription registration: waldur_site_agent/event_processing/utils.py
  - STOMP message handlers: waldur_site_agent/event_processing/handlers.py
  - STOMP listener: waldur_site_agent/event_processing/listener.py
Event Flow
- An event occurs in Waldur (e.g., a new order is created, a user role changes, or a resource is updated)
- Waldur publishes a message to the appropriate STOMP queue(s)
- The site agent receives the message and processes it based on the event type
- The agent communicates with the backend (e.g., SLURM) to execute the necessary actions
Queue Naming Strategy
The system follows an object-based naming convention for STOMP queues rather than event-based naming. This design choice provides several benefits:
- Simplified Client Configuration: Clients subscribe to object types (e.g., resource_periodic_limits) rather than specific event types
- Action Flexibility: Specific actions (e.g., apply_periodic_settings, update_limits) are stored in the message payload
- Easier Maintenance: Adding new actions doesn't require queue reconfiguration
- Future Migration Path: Sets foundation for eventual migration to event-based naming without immediate client changes
Current Approach:
- Queue: resource_periodic_limits
- Payload: {"action": "apply_periodic_settings", "settings": {...}}
Alternative Event-Based Approach (for future consideration):
- Queue: resource_periodic_limits_update
- More specific but requires client reconfiguration for each new event type
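To illustrate the object-based approach, here is a minimal sketch of a consumer that listens on a single resource_periodic_limits queue and dispatches on the action field of the payload. Only the action names and the payload shape come from the description above; the registry, the decorator, and the handler bodies are hypothetical and are not taken from the waldur-site-agent code.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping the payload's "action" field to a handler.
ACTION_HANDLERS: Dict[str, Callable[[Dict[str, Any]], str]] = {}


def register_action(name: str):
    """Decorator that registers a handler for one action name."""
    def decorator(func):
        ACTION_HANDLERS[name] = func
        return func
    return decorator


@register_action("apply_periodic_settings")
def apply_periodic_settings(settings: Dict[str, Any]) -> str:
    # Placeholder: a real agent would push these settings to the backend.
    return f"applied {len(settings)} setting(s)"


@register_action("update_limits")
def update_limits(settings: Dict[str, Any]) -> str:
    # Placeholder for a second action served by the same queue.
    return f"updated {len(settings)} limit(s)"


def handle_resource_periodic_limits(raw_body: str) -> str:
    """Dispatch one message received on the resource_periodic_limits queue."""
    payload = json.loads(raw_body)
    action = payload.get("action")
    handler = ACTION_HANDLERS.get(action)
    if handler is None:
        # Unknown actions are reported instead of crashing the consumer.
        return f"ignored unknown action: {action!r}"
    return handler(payload.get("settings", {}))
```

With this layout, supporting a new action means registering one more handler; the queue name and the client subscription stay the same.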
Message Types
The system handles several types of events:
- Order Messages (order): Notifications about marketplace orders (create, update, terminate)
- User Role Messages (user_role): Changes to user permissions in projects
- Resource Messages (resource): Updates to resource configuration or status
- Resource Periodic Limits (resource_periodic_limits): SLURM periodic usage policy updates with allocation and limit settings
- Offering User Messages (offering_user): Creation, updates, and deletion of offering users
- Service Account Messages (service_account): Service account lifecycle events
- Course Account Messages (course_account): Course account management events
- Importable Resources Messages (importable_resources): Backend resource discovery events
Implementation Details
Publishing Messages (Waldur Side)
When events such as order creation occur, Waldur prepares STOMP messages and then sends them via publish_stomp_messages.
Offering User Event Messages
Offering user events are published when offering users are created, updated, or deleted. These handlers are located in waldur_mastermind/marketplace/handlers.py:
- send_offering_user_created_message - Triggers when an OfferingUser is created
- send_offering_user_updated_message - Triggers when an OfferingUser is updated
- send_offering_user_deleted_message - Triggers when an OfferingUser is deleted
Message Payload Structure for OfferingUser Events:
Event Triggers:
- Create: When a new offering user account is created for a user in an offering
- Update: When any field of an existing offering user is modified (username, state, etc.)
- Delete: When an offering user account is removed from an offering
Resource Periodic Limits Event Messages
Resource periodic limits events are published when SLURM periodic usage policies are applied to resources. These messages contain calculated SLURM settings including allocation limits, fairshare values, and QoS thresholds. The handler is located in waldur_mastermind/policy/models.py.
Message Payload Structure for Resource Periodic Limits:
Event Triggers:
- Policy Application: When a SLURM periodic usage policy calculates new allocation limits and sends them to the site agent
- Carryover Calculation: When unused allocation from previous periods is calculated with decay factors
- Limit Updates: When fairshare values, TRES limits, or QoS thresholds need to be updated on the SLURM backend
Subscription Management (Agent Side)
The EventSubscriptionManager class handles creation of event subscriptions and setup of STOMP consumers:
- get_or_create_event_subscription - creates an event subscription in Waldur if one doesn't exist yet
- start_stomp_connection - sets up the STOMP client, connects the agent to the broker, and subscribes the consumer to a queue
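The get-or-create step can be sketched as follows. This is only an illustration of the idempotent lookup, assuming subscriptions are identified by an offering UUID and an object type; InMemorySubscriptionStore is a hypothetical stand-in for the Waldur API and does not reflect the real EventSubscriptionManager signatures.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class InMemorySubscriptionStore:
    """Hypothetical stand-in for the Waldur API; NOT the real client."""

    _by_key: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def find(self, offering_uuid: str, object_type: str) -> Optional[str]:
        return self._by_key.get((offering_uuid, object_type))

    def create(self, offering_uuid: str, object_type: str) -> str:
        subscription_uuid = uuid.uuid4().hex
        self._by_key[(offering_uuid, object_type)] = subscription_uuid
        return subscription_uuid


def get_or_create_subscription(
    store: InMemorySubscriptionStore, offering_uuid: str, object_type: str
) -> str:
    """Return the existing subscription UUID, creating one only if missing."""
    existing = store.find(offering_uuid, object_type)
    if existing is not None:
        return existing
    return store.create(offering_uuid, object_type)
```

Calling the function twice with the same arguments returns the same subscription, so an agent restart would not pile up duplicate subscriptions under these assumptions.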
Message Processing (Agent Side)
When a message arrives, it's routed to the appropriate handler based on the event type:
- on_order_message_stomp - create or update resources on the backend
- on_user_role_message_stomp - create or update access permissions on the backend
- on_resource_message_stomp - create or update resource configuration on the backend
- on_resource_periodic_limits_message_stomp - apply SLURM periodic usage policies, fairshare values, and allocation limits
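The routing described above can be pictured as a lookup table from object type to handler. The sketch below uses simplified stub handlers with hypothetical names and signatures; the real handlers in waldur_site_agent/event_processing/handlers.py may take different arguments.

```python
from typing import Any, Callable, Dict

Message = Dict[str, Any]


# Stub handlers: each returns a description instead of touching a backend.
def handle_order(message: Message) -> str:
    return "order: create or update resources"


def handle_user_role(message: Message) -> str:
    return "user_role: create or update access permissions"


def handle_resource(message: Message) -> str:
    return "resource: create or update resource configuration"


def handle_resource_periodic_limits(message: Message) -> str:
    return "resource_periodic_limits: apply policies and limits"


# Hypothetical routing table keyed by the object type of the queue.
HANDLERS: Dict[str, Callable[[Message], str]] = {
    "order": handle_order,
    "user_role": handle_user_role,
    "resource": handle_resource,
    "resource_periodic_limits": handle_resource_periodic_limits,
}


def route(object_type: str, message: Message) -> str:
    """Pass the message to the handler registered for its object type."""
    handler = HANDLERS.get(object_type)
    if handler is None:
        raise ValueError(f"no handler registered for {object_type!r}")
    return handler(message)
```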
Technical Components
- WebSocket Transport: The system uses STOMP over WebSockets for communication
- TLS Security: Connections can be secured with TLS
- User Authentication: Each subscription has its own credentials and permissions in RabbitMQ
- Queue Structure: Queue names follow the pattern /queue/subscription_{subscription_uuid}_offering_{offering_uuid}_{affected_object}
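As a small illustration of the queue name pattern, the following sketch builds and parses such names. The helper names build_queue_name and parse_queue_name are hypothetical, and the parser assumes that the UUIDs contain no underscores (for example, hex form), which the pattern itself does not guarantee.

```python
import re
from typing import Dict

QUEUE_TEMPLATE = (
    "/queue/subscription_{subscription_uuid}"
    "_offering_{offering_uuid}_{affected_object}"
)

# Assumption: UUIDs contain no underscores, so "_" can act as a separator.
# The affected object (e.g. resource_periodic_limits) may contain them.
QUEUE_PATTERN = re.compile(
    r"^/queue/subscription_(?P<subscription_uuid>[^_]+)"
    r"_offering_(?P<offering_uuid>[^_]+)"
    r"_(?P<affected_object>.+)$"
)


def build_queue_name(
    subscription_uuid: str, offering_uuid: str, affected_object: str
) -> str:
    """Fill the documented pattern with concrete values."""
    return QUEUE_TEMPLATE.format(
        subscription_uuid=subscription_uuid,
        offering_uuid=offering_uuid,
        affected_object=affected_object,
    )


def parse_queue_name(queue_name: str) -> Dict[str, str]:
    """Split a queue name back into its three components."""
    match = QUEUE_PATTERN.match(queue_name)
    if match is None:
        raise ValueError(f"unexpected queue name: {queue_name!r}")
    return match.groupdict()
```

For example, build_queue_name("abc123", "def456", "order") yields /queue/subscription_abc123_offering_def456_order.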
Error Handling and Resilience
The system includes:
- Graceful connection handling
- Signal handlers for proper shutdown
- Retry mechanisms for order processing
- Error logging and optional Sentry integration
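Two of the items above, signal handlers and retries, can be sketched with the standard library alone. The attempt count, the exponential backoff, and all function names here are assumptions made for illustration; the agent's actual retry policy is not described in this document.

```python
import logging
import signal
import threading
import time
from typing import Any, Callable

logger = logging.getLogger("stomp_agent_sketch")
stop_event = threading.Event()


def request_shutdown(signum, frame) -> None:
    """Signal handler: ask running loops to stop instead of exiting abruptly."""
    stop_event.set()


def install_signal_handlers() -> None:
    # Must be called from the main thread.
    signal.signal(signal.SIGINT, request_shutdown)
    signal.signal(signal.SIGTERM, request_shutdown)


def process_with_retry(
    process: Callable[[Any], Any],
    order: Any,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], Any] = time.sleep,
) -> Any:
    """Call process(order), retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        if stop_event.is_set():
            break  # shutdown requested: do not start another attempt
        try:
            return process(order)
        except Exception as exc:
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("order processing did not succeed") from last_error
```

The sleep parameter is injectable so the backoff can be exercised in tests without waiting.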
Benefits of the STOMP Approach
- Real-time Processing: Actions are triggered immediately when events occur
- Reduced Network Traffic: No constant polling needed
- Decoupling: The agent doesn't need direct access to Waldur's database
- Scalability: Multiple agents can subscribe to different events
- Reliability: Queue persistence on the broker helps ensure message delivery, and STOMP offers different acknowledgement options on the agent side
This event-driven architecture significantly improves the responsiveness and efficiency of the order processing system compared to traditional polling approaches.