monitor¶
Real-time monitoring with optional quota enforcement.
Usage¶
Options¶
| Option | Short | Description |
|---|---|---|
--interval |
-i |
Refresh interval in seconds (default: 30) |
--once |
Single check, then exit (for cron) | |
--enforce |
Enable quota enforcement |
Examples¶
Live dashboard¶
# Interactive monitoring with 30s refresh
slurmq monitor
# Faster refresh
slurmq monitor --interval 10
Single check (for cron)¶
Enforcement¶
When --enforce is enabled, slurmq will cancel jobs for users who exceed their quota.
Enable in config first
Enforcement must be enabled in your config file:
```toml
[enforcement]
enabled = true
dry_run = true # Start with dry_run to preview
```
Enforcement behavior¶
- Grace period - Users get a warning email before cancellation
- Cancel order - Jobs cancelled in LIFO or FIFO order
- Exemptions - Skip specific users or job prefixes
- Dry run - Preview what would be cancelled
Configuration¶
[enforcement]
enabled = true
dry_run = true # Set false when ready
grace_period_hours = 24 # Warning window
cancel_order = "lifo" # "lifo" or "fifo"
exempt_users = ["admin"]
exempt_job_prefixes = ["checkpoint_"]
Enforcement flow¶
flowchart TD
A[User exceeds quota] --> B{Grace period?}
B -->|Yes| C[Send warning email]
B -->|No| D{User exempt?}
D -->|Yes| E[Skip]
D -->|No| F{Job exempt?}
F -->|Yes| E
F -->|No| G{dry_run?}
G -->|Yes| H[Log only]
G -->|No| I[Cancel job + send notification]
Cron setup¶
Run monitoring every 5 minutes:
# /etc/cron.d/slurmq
*/5 * * * * root slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1
Or with systemd timer:
# /etc/systemd/system/slurmq-monitor.timer
[Unit]
Description=Run slurmq quota check
[Timer]
OnCalendar=*:0/5
Persistent=true
[Install]
WantedBy=timers.target
# /etc/systemd/system/slurmq-monitor.service
[Unit]
Description=slurmq quota monitoring
[Service]
Type=oneshot
ExecStart=/usr/local/bin/slurmq --quiet monitor --once --enforce
Dashboard output¶
The live dashboard shows:
┌─────────────────── slurmq monitor ───────────────────┐
│ Cluster: Stella QoS: normal Refresh: 30s │
├──────────────────────────────────────────────────────┤
│ User Used Remaining Status Active │
│ alice 487.5 12.5 ! WARN 3 │
│ bob 342.0 158.0 ok 1 │
│ charlie 512.5 -12.5 x OVER 2 -> cancel │
├──────────────────────────────────────────────────────┤
│ Last check: 2024-01-15 14:30:00 │
│ Press q to quit │
└──────────────────────────────────────────────────────┘