# slurmq > GPU quota management for Slurm clusters slurmq is a CLI tool for GPU quota management on Slurm clusters. It helps HPC administrators and users track GPU usage, enforce quotas, and generate reports. Key features: quota checking, multi-cluster support, configurable via TOML, JSON/CSV/YAML output, live monitoring dashboard, automated enforcement. # Getting started # [Installation](#installation) ## [Requirements](#requirements) - Python 3.11+ - Slurm with `sacct` command available ## [Install with uv (recommended)](#install-with-uv-recommended) [uv](https://docs.astral.sh/uv/) is the fastest way to install Python CLI tools: ``` uv tool install slurmq ``` This installs `slurmq` globally, isolated from your project environments. ## [Install with pip](#install-with-pip) ``` pip install slurmq ``` ## [Install with pipx](#install-with-pipx) ``` pipx install slurmq ``` ## [Install from source](#install-from-source) ``` git clone https://github.com/dedalus-labs/slurmq.git cd slurmq uv tool install . ``` ## [Verify installation](#verify-installation) ``` slurmq --version ``` ## [System-wide installation (HPC)](#system-wide-installation-hpc) For HPC environments where users don't have write access to their home directories, administrators can install slurmq system-wide: ``` # As root or with sudo pip install --prefix=/opt/slurmq slurmq # Add to system PATH in /etc/profile.d/slurmq.sh export PATH="/opt/slurmq/bin:$PATH" ``` Then create a system-wide config at `/etc/slurmq/config.toml`. ## [Shell completion](#shell-completion) slurmq supports shell completion for bash, zsh, and fish: ``` # Add to ~/.bashrc eval "$(slurmq --show-completion bash)" ``` ``` # Add to ~/.zshrc eval "$(slurmq --show-completion zsh)" ``` ``` # Add to ~/.config/fish/config.fish slurmq --show-completion fish | source ``` # [Quick Start](#quick-start) Get up and running with slurmq in 5 minutes. ## [1. Initialize configuration](#1-initialize-configuration) Run the interactive setup wizard: ``` slurmq config init ``` This creates `~/.config/slurmq/config.toml` with your cluster settings. **Minimal config:** You can also create a minimal config manually: ```toml default_cluster = "mycluster" [clusters.mycluster] name = "My Cluster" qos = ["normal"] quota_limit = 500 ``` ## [2. Check your quota](#2-check-your-quota) ``` slurmq check ``` That's it! You'll see your current GPU usage and remaining quota. ## [3. Useful variations](#3-useful-variations) ``` # Check another user's quota slurmq check --user alice # Use a different QoS slurmq check --qos high-priority # Get JSON output for scripting slurmq --json check # Show forecast of when quota frees up slurmq check --forecast ``` ## [4. Admin commands](#4-admin-commands) If you're an administrator: ``` # See all users' usage slurmq report # Live monitoring dashboard slurmq monitor # Cluster analytics slurmq stats ``` ## [Next steps](#next-steps) - [Configuration](https://dedalus-labs.github.io/slurmq/getting-started/configuration/index.md) - Full config file reference - [Commands](https://dedalus-labs.github.io/slurmq/commands/index.md) - All available commands - [Recipes](https://dedalus-labs.github.io/slurmq/recipes/common/index.md) - Common usage patterns # [Configuration](#configuration) slurmq uses a layered configuration system with these priorities (highest first): 1. **CLI flags** - `--qos`, `--account`, `--cluster` 1. **Environment variables** - `SLURMQ_*` 1. **Config file** - `~/.config/slurmq/config.toml` 1. 
**System config** - `/etc/slurmq/config.toml` 1. **Defaults** ## [Config file locations](#config-file-locations) slurmq looks for config files in this order: 1. `$SLURMQ_CONFIG` (env var override) 1. `~/.config/slurmq/config.toml` (user config) 1. `/etc/slurmq/config.toml` (system-wide) ## [Full config example](#full-config-example) ``` # ~/.config/slurmq/config.toml # Default cluster to use when --cluster is not specified default_cluster = "stella" # Cluster profiles [clusters.stella] name = "Stella HPC GPU Cluster" account = "research-group" qos = ["high-priority", "normal", "low"] partitions = ["gpu", "gpu-large"] quota_limit = 500 # GPU-hours per user rolling_window_days = 30 # Rolling window for quota [clusters.stellar] name = "Stellar" account = "astro" qos = ["standard"] partitions = ["gpu"] quota_limit = 200 # Monitoring thresholds [monitoring] check_interval_minutes = 30 warning_threshold = 0.8 # Warn at 80% usage critical_threshold = 1.0 # Critical at 100% # Quota enforcement (admin) [enforcement] enabled = false # Must explicitly enable dry_run = true # Preview mode - no actual cancels grace_period_hours = 24 # Warning before cancel cancel_order = "lifo" # "lifo" or "fifo" exempt_users = ["admin", "root"] exempt_job_prefixes = ["checkpoint_", "debug_"] # Email notifications (optional) [email] enabled = false sender = "hpc-alerts@example.edu" domain = "example.edu" # User email domain subject_prefix = "[Stella HPC-GPU]" smtp_host = "smtp.example.edu" smtp_port = 587 smtp_user_env = "SLURMQ_SMTP_USER" smtp_pass_env = "SLURMQ_SMTP_PASS" # Display preferences [display] color = true # Respects NO_COLOR env output_format = "rich" # "rich", "plain", "json" # Cache settings [cache] enabled = true ttl_minutes = 60 directory = "" # Default: ~/.cache/slurmq/ ``` ## [Environment variables](#environment-variables) All config values can be overridden with environment variables using the `SLURMQ_` prefix: ``` # Override default cluster export SLURMQ_DEFAULT_CLUSTER=stellar # Override nested values with double underscore export SLURMQ_MONITORING__WARNING_THRESHOLD=0.9 export SLURMQ_ENFORCEMENT__DRY_RUN=false # Point to a different config file export SLURMQ_CONFIG=/path/to/config.toml ``` ## [CLI flag overrides](#cli-flag-overrides) Most config values can be overridden per-command: ``` # Override cluster slurmq --cluster stellar check # Override QoS, account, partition slurmq check --qos high-priority --account mygroup --partition gpu-large # Override output format slurmq --json check slurmq --yaml check ``` ## [Multiple clusters](#multiple-clusters) Define multiple clusters and switch between them: ``` default_cluster = "stella" [clusters.stella] name = "Stella HPC" quota_limit = 500 qos = ["normal"] [clusters.traverse] name = "Traverse" quota_limit = 1000 qos = ["gpu"] ``` ``` slurmq check # Uses stella (default) slurmq --cluster traverse check ``` ## [Validating config](#validating-config) Check your config file for errors: ``` slurmq config validate ``` This catches: - TOML syntax errors - Invalid threshold values - Missing cluster definitions - Unknown config keys # Commands # [check](#check) Check GPU quota usage for a user. 
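At a glance, a typical interactive run looks something like this. The panel below is an illustrative sketch, not captured output; the fields mirror the JSON schema documented further down this page:

```bash
$ slurmq check
# Illustrative output (rich panel layout varies with terminal and config):
#   user: dedalus        qos: normal
#   used: 342.5 / 500 GPU-hours (68.5%)
#   remaining: 157.5 GPU-hours    status: ok    active jobs: 2
```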
## [Usage](#usage) ``` slurmq check [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------- | ----- | ------------------------------------- | | `--user` | `-u` | User to check (default: current user) | | `--qos` | `-q` | QoS to check (overrides config) | | `--account` | `-a` | Account to check (overrides config) | | `--partition` | `-p` | Partition to check (overrides config) | | `--forecast` | `-f` | Show quota forecast | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` # Check your own quota slurmq check # Check another user's quota slurmq check --user alice ``` ### [Override filters](#override-filters) ``` # Check a specific QoS slurmq check --qos high-priority # Check a specific account slurmq check --account research-gpu # Combine multiple overrides slurmq check --qos normal --partition gpu-large ``` ### [Output formats](#output-formats) ``` # Rich terminal output (default) slurmq check # JSON for scripting slurmq --json check # YAML output slurmq --yaml check ``` ### [Forecast](#forecast) Show when quota will free up: ``` slurmq check --forecast ``` ``` ╭─────────── Quota Forecast ───────────╮ │ Time Available % Available │ │ +12h 78.5 15.7% │ │ +24h 125.0 25.0% │ │ +72h 312.5 62.5% │ │ +168h 500.0 100.0% │ ╰──────────────────────────────────────╯ ``` ## [JSON output](#json-output) ``` { "user": "dedalus", "qos": "normal", "used_gpu_hours": 342.5, "quota_limit": 500, "remaining_gpu_hours": 157.5, "usage_percentage": 68.5, "status": "ok", "rolling_window_days": 30, "active_jobs": 2 } ``` ## [Exit codes](#exit-codes) | Code | Condition | | ---- | ------------------------------- | | 0 | Success | | 1 | Error | | 2 | Quota exceeded (with `--quiet`) | ## [Scripting](#scripting) ``` # Check if quota exceeded if slurmq --json check | jq -e '.status == "exceeded"' > /dev/null; then echo "Quota exceeded!" exit 1 fi # Get remaining hours remaining=$(slurmq --json check | jq '.remaining_gpu_hours') echo "You have $remaining GPU-hours remaining" ``` # [stats](#stats) Cluster-wide analytics with month-over-month comparison. 
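If you only need a single number for a script or dashboard, the JSON output (documented below) can be filtered with jq. A small sketch, assuming the partition key `gpu` as it appears in the JSON example later on this page:

```bash
# Total GPU-hours for the "gpu" partition over the analysis period
slurmq --json stats | jq '.current.gpu.all.gpu_hours'

# Median wait (hours) for the same partition
slurmq --json stats | jq '.current.gpu.all.median_wait_hours'
```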
## [Usage](#usage) ``` slurmq stats [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------------------ | ----- | ------------------------------------------------------ | | `--days` | `-d` | Analysis period in days (default: 30) | | `--compare/--no-compare` | | Show month-over-month comparison (default: on) | | `--partition` | `-p` | Filter by partition(s) (repeatable) | | `--qos` | `-q` | Filter by QoS(s) (repeatable) | | `--small-threshold` | | GPU-hours threshold for small/large jobs (default: 50) | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` # Analyze the last 30 days with MoM comparison slurmq stats ``` Output: ``` GPU Utilization (Last 30 Days) ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ Partition ┃ GPU Hours ┃ Jobs ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ gpu │ 125.3k (+12%) │ 1,245 (-5%)│ │ gpu-large │ 45.2k (+8%) │ 234 (+15%) │ └──────────────┴───────────────┴────────────┘ Wait Times: Small (≤50 GPU-h) ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ Partition ┃ Median Wait┃ Wait > 6h ┃ Jobs ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ gpu │ 15min (-8%)│ 2.1% │ 1,100 │ │ gpu-large │ 45min (+3%)│ 8.5% │ 180 │ └──────────────┴────────────┴───────────┴───────┘ Wait Times: Large (>50 GPU-h) ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ Partition ┃ Median Wait┃ Wait > 6h ┃ Jobs ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ gpu │ 2h15 (+15%)│ 25.0% │ 145 │ │ gpu-large │ 8h30 (-2%) │ 62.5% │ 54 │ └──────────────┴────────────┴───────────┴───────┘ ``` ### [Custom time range](#custom-time-range) ``` # Last 14 days slurmq stats --days 14 # Last 7 days, no comparison slurmq stats --days 7 --no-compare ``` ### [Filter by partition or QoS](#filter-by-partition-or-qos) ``` # Specific partitions slurmq stats --partition gpu --partition gpu-large # Specific QoS slurmq stats --qos high-priority ``` ### [Custom job size threshold](#custom-job-size-threshold) ``` # Jobs ≤25 GPU-hours are "small" slurmq stats --small-threshold 25 ``` ### [JSON output](#json-output) ``` slurmq --json stats ``` ``` { "period_days": 30, "current": { "gpu": { "all": { "job_count": 1245, "gpu_hours": 125300.5, "median_wait_hours": 0.75, "long_wait_pct": 12.5 }, "small": { ... }, "large": { ... } } }, "previous": { ... } } ``` ## [Metrics explained](#metrics-explained) | Metric | Description | | --------------- | -------------------------------------------- | | **GPU Hours** | Total GPU-hours consumed | | **Jobs** | Number of jobs completed | | **Median Wait** | Median time from submission to start | | **Wait > 6h** | Percentage of jobs waiting more than 6 hours | ## [Use cases](#use-cases) - **Capacity planning** - Track utilization trends - **SLA monitoring** - Watch wait time percentiles - **QoS tuning** - Compare partition performance - **Monthly reports** - Export JSON for dashboards # [report](#report) Generate usage reports for all users. 
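Because the CSV format is plain and stable, ordinary shell tools work well on it. A sketch of listing the five heaviest users, assuming the column order shown in the CSV output section below and that CSV goes to stdout when `--output` is omitted:

```bash
# Top 5 users by consumed GPU-hours (column 2 of the CSV output);
# tail -n +2 skips the header row before the numeric sort
slurmq report --format csv | tail -n +2 | sort -t, -k2 -rn | head -n 5
```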
## [Usage](#usage) ``` slurmq report [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------- | ----- | ---------------------------------------------- | | `--format` | `-f` | Output format: rich, json, csv (default: rich) | | `--output` | `-o` | Output file path | | `--qos` | `-q` | QoS to report on (overrides config) | | `--account` | `-a` | Account to report on (overrides config) | | `--partition` | `-p` | Partition to report on (overrides config) | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` slurmq report ``` Output: ``` GPU Usage Report: Stella (normal) ┏━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ ┃ User ┃ Used (GPU-h)┃ Remaining ┃ Usage % ┃ Status ┃ Active ┃ ┡━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ │ alice │ 487.5 │ 12.5 │ 98% │ ! │ 3 │ │ bob │ 342.0 │ 158.0 │ 68% │ ok │ 1 │ │ charlie │ 125.5 │ 374.5 │ 25% │ ok │ - │ └─────────┴─────────────┴───────────┴─────────┴────────┴────────┘ Total users: 3 ``` ### [Export formats](#export-formats) ``` # JSON slurmq report --format json # CSV (for spreadsheets) slurmq report --format csv # Save to file slurmq report --format csv --output usage.csv ``` ### [Filter by QoS/account](#filter-by-qosaccount) ``` # Specific QoS slurmq report --qos high-priority # Specific account slurmq report --account research-gpu ``` ## [JSON output](#json-output) ``` { "cluster": "Stella", "qos": "normal", "users": [ { "user": "alice", "used_gpu_hours": 487.5, "quota_limit": 500, "remaining_gpu_hours": 12.5, "usage_percentage": 97.5, "status": "warning", "active_jobs": 3, "total_jobs": 45 } ] } ``` ## [CSV output](#csv-output) ``` user,used_gpu_hours,quota_limit,remaining_gpu_hours,usage_percentage,status,active_jobs,total_jobs alice,487.5,500,12.5,97.5,warning,3,45 bob,342.0,500,158.0,68.4,ok,1,23 charlie,125.5,500,374.5,25.1,ok,0,12 ``` ## [Use cases](#use-cases) - **Weekly reports** - Track who's using resources - **Capacity planning** - Identify heavy users - **Billing** - Export CSV for chargeback - **Compliance** - Audit quota enforcement # [monitor](#monitor) Real-time monitoring with optional quota enforcement. ## [Usage](#usage) ``` slurmq monitor [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------ | ----- | ----------------------------------------- | | `--interval` | `-i` | Refresh interval in seconds (default: 30) | | `--once` | | Single check, then exit (for cron) | | `--enforce` | | Enable quota enforcement | ## [Examples](#examples) ### [Live dashboard](#live-dashboard) ``` # Interactive monitoring with 30s refresh slurmq monitor # Faster refresh slurmq monitor --interval 10 ``` ### [Single check (for cron)](#single-check-for-cron) ``` # Check once and exit slurmq monitor --once # With enforcement slurmq monitor --once --enforce ``` ## [Enforcement](#enforcement) When `--enforce` is enabled, slurmq will cancel jobs for users who exceed their quota. **Enable in config first:** Enforcement must be enabled in your config file: ```toml [enforcement] enabled = true dry_run = true # Start with dry_run to preview ``` ### [Enforcement behavior](#enforcement-behavior) 1. **Grace period** - Users get a warning email before cancellation 2. **Cancel order** - Jobs cancelled in LIFO or FIFO order 3. **Exemptions** - Skip specific users or job prefixes 4. 
**Dry run** - Preview what would be cancelled ### [Configuration](#configuration) ``` [enforcement] enabled = true dry_run = true # Set false when ready grace_period_hours = 24 # Warning window cancel_order = "lifo" # "lifo" or "fifo" exempt_users = ["admin"] exempt_job_prefixes = ["checkpoint_"] ``` ### [Enforcement flow](#enforcement-flow) ``` flowchart TD A[User exceeds quota] --> B{Grace period?} B -->|Yes| C[Send warning email] B -->|No| D{User exempt?} D -->|Yes| E[Skip] D -->|No| F{Job exempt?} F -->|Yes| E F -->|No| G{dry_run?} G -->|Yes| H[Log only] G -->|No| I[Cancel job + send notification] ``` ## [Cron setup](#cron-setup) Run monitoring every 5 minutes: ``` # /etc/cron.d/slurmq */5 * * * * root slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1 ``` Or with systemd timer: ``` # /etc/systemd/system/slurmq-monitor.timer [Unit] Description=Run slurmq quota check [Timer] OnCalendar=*:0/5 Persistent=true [Install] WantedBy=timers.target ``` ``` # /etc/systemd/system/slurmq-monitor.service [Unit] Description=slurmq quota monitoring [Service] Type=oneshot ExecStart=/usr/local/bin/slurmq --quiet monitor --once --enforce ``` ## [Dashboard output](#dashboard-output) The live dashboard shows: ``` ┌─────────────────── slurmq monitor ───────────────────┐ │ Cluster: Stella QoS: normal Refresh: 30s │ ├──────────────────────────────────────────────────────┤ │ User Used Remaining Status Active │ │ alice 487.5 12.5 ! WARN 3 │ │ bob 342.0 158.0 ok 1 │ │ charlie 512.5 -12.5 x OVER 2 -> cancel │ ├──────────────────────────────────────────────────────┤ │ Last check: 2024-01-15 14:30:00 │ │ Press q to quit │ └──────────────────────────────────────────────────────┘ ``` # [efficiency](#efficiency) Analyze job resource efficiency (inspired by Slurm's `seff`). ## [Usage](#usage) ``` slurmq efficiency JOB_ID ``` ## [Arguments](#arguments) | Argument | Description | | -------- | ----------------------- | | `JOB_ID` | Slurm job ID to analyze | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` slurmq efficiency 12345678 ``` Output: ``` ╭──────────────── Job Efficiency Report ────────────────╮ │ │ │ Job ID: 12345678 │ │ Job Name: train_model │ │ User: alice │ │ Account: research │ │ State: COMPLETED │ │ │ │ CPU Efficiency: 78.5% ████████░░ │ │ Memory Efficiency: 45.2% █████░░░░░ ! Low │ │ │ │ Cores requested: 32 │ │ CPU time used: 12h 30m (of 16h allocated) │ │ │ │ Memory requested: 128GB │ │ Memory used: 58GB (peak) │ │ │ ╰───────────────────────────────────────────────────────╯ ``` ## [Efficiency thresholds](#efficiency-thresholds) | Metric | Good | Warning | Low | | ------ | ---- | ------- | ----- | | CPU | ≥50% | 30-50% | \<30% | | Memory | ≥30% | 20-30% | \<20% | ## [Use cases](#use-cases) - **Debugging** - Why did my job fail (OOM)? - **Optimization** - Am I requesting too many resources? 
- **Education** - Help users right-size their jobs ## [Tips for improving efficiency](#tips-for-improving-efficiency) ### [CPU efficiency low?](#cpu-efficiency-low) - Your job may be I/O bound - Consider fewer cores with faster storage - Check for serial bottlenecks in your code ### [Memory efficiency low?](#memory-efficiency-low) - Request less memory in your submission - Use `--mem-per-cpu` instead of `--mem` - Profile your application's memory usage ## [JSON output](#json-output) ``` slurmq --json efficiency 12345678 ``` ``` { "job_id": "12345678", "job_name": "train_model", "user": "alice", "state": "COMPLETED", "cpu_efficiency": 78.5, "memory_efficiency": 45.2, "cores_requested": 32, "cpu_time_used_seconds": 45000, "memory_requested_gb": 128, "memory_used_gb": 58 } ``` # [config](#config) Manage slurmq configuration. ## [Subcommands](#subcommands) ### [init](#init) Interactive configuration wizard. ``` slurmq config init ``` Creates `~/.config/slurmq/config.toml` with your cluster settings. The wizard prompts for: 1. Cluster name 1. Account 1. QoS list 1. Quota limit 1. Rolling window days ### [show](#show) Display current configuration. ``` slurmq config show ``` Shows the merged configuration from all sources (env vars, config file, defaults). ### [path](#path) Show config file location. ``` slurmq config path ``` Output: ``` Config file: /home/user/.config/slurmq/config.toml Status: exists ``` ### [validate](#validate) Validate configuration file. ``` slurmq config validate ``` Checks for: - TOML syntax errors - Invalid threshold values (must be 0-1) - Missing cluster definitions - Unknown configuration keys - Logical errors (e.g., warning > critical threshold) Example output: ``` ok: Config file is valid: /home/user/.config/slurmq/config.toml ``` Or with errors: ``` error: Config validation failed: - monitoring.warning_threshold (1.5) must be between 0 and 1 - default_cluster "nonexistent" not found in clusters ``` ### [set](#set) Set a configuration value. ``` slurmq config set KEY VALUE ``` Examples: ``` # Set default cluster slurmq config set default_cluster stellar # Set nested value slurmq config set monitoring.warning_threshold 0.9 ``` ## [Examples](#examples) ### [Initial setup](#initial-setup) ``` # Create config interactively slurmq config init # Verify slurmq config show slurmq config validate ``` ### [Check config location](#check-config-location) ``` $ slurmq config path Config file: /home/user/.config/slurmq/config.toml Status: exists ``` ### [Debug configuration](#debug-configuration) ``` # See what config slurmq is actually using slurmq config show # Check for errors slurmq config validate # See where it's loading from SLURMQ_CONFIG=/tmp/test.toml slurmq config path ``` # Recipes # [Common Patterns](#common-patterns) Recipes for common slurmq usage patterns. ## [Check before submitting](#check-before-submitting) Block job submission if quota exceeded: ``` #!/bin/bash # submit-job.sh # Check quota first if ! slurmq --quiet check; then echo "Quota exceeded! Cannot submit job." exit 1 fi # Submit job sbatch my_job.sh ``` ## [Get remaining quota](#get-remaining-quota) Extract remaining hours for scripts: ``` remaining=$(slurmq --json check | jq '.remaining_gpu_hours') echo "You have $remaining GPU-hours remaining" # Check if enough for a job if (( $(echo "$remaining < 10" | bc -l) )); then echo "Warning: Less than 10 hours remaining!" 
fi ``` ## [Multi-cluster workflow](#multi-cluster-workflow) Switch between clusters: ``` # Check all clusters for cluster in stella stellar traverse; do echo "=== $cluster ===" slurmq --cluster "$cluster" check done ``` ## [Export weekly report](#export-weekly-report) Generate CSV report for spreadsheets: ``` #!/bin/bash # weekly-report.sh DATE=$(date +%Y-%m-%d) OUTPUT_DIR="/shared/reports" slurmq report --format csv --output "$OUTPUT_DIR/usage-$DATE.csv" ``` ## [Slack notification](#slack-notification) Send alerts to Slack: ``` #!/bin/bash # check-and-alert.sh WEBHOOK_URL="https://hooks.slack.com/services/..." # Check for exceeded users exceeded=$(slurmq --json report | jq '[.users[] | select(.status == "exceeded")] | length') if [ "$exceeded" -gt 0 ]; then curl -X POST -H 'Content-type: application/json' \ --data "{\"text\":\"warning: $exceeded users have exceeded GPU quota\"}" \ "$WEBHOOK_URL" fi ``` ## [Pre-commit hook](#pre-commit-hook) Check quota before git push (for ML repos): ``` #!/bin/bash # .git/hooks/pre-push # Only check if on training branch branch=$(git rev-parse --abbrev-ref HEAD) if [[ "$branch" == "train-"* ]]; then if ! slurmq --quiet check; then echo "Quota exceeded! Training jobs may fail." read -p "Push anyway? (y/N) " confirm [[ "$confirm" != "y" ]] && exit 1 fi fi ``` ## [Dashboard integration](#dashboard-integration) Export metrics for Grafana/Prometheus: ``` #!/bin/bash # metrics.sh - Run every minute via cron METRICS_FILE="/var/lib/prometheus/slurmq.prom" # Get stats stats=$(slurmq --json report 2>/dev/null) if [ $? -eq 0 ]; then # Generate Prometheus metrics echo "# HELP slurmq_users_total Total number of users" echo "# TYPE slurmq_users_total gauge" echo "slurmq_users_total $(echo "$stats" | jq '.users | length')" echo "# HELP slurmq_exceeded_users Users over quota" echo "# TYPE slurmq_exceeded_users gauge" echo "slurmq_exceeded_users $(echo "$stats" | jq '[.users[] | select(.status == "exceeded")] | length')" fi > "$METRICS_FILE" ``` ## [Job wrapper](#job-wrapper) Wrap sbatch to check quota first: ``` #!/bin/bash # /usr/local/bin/sbatch-safe # Check quota remaining=$(slurmq --json check 2>/dev/null | jq -r '.remaining_gpu_hours // 0') if (( $(echo "$remaining <= 0" | bc -l) )); then echo "Error: GPU quota exceeded. Cannot submit job." >&2 exit 1 fi # Pass through to real sbatch exec /usr/bin/sbatch "$@" ``` Then alias: `alias sbatch=/usr/local/bin/sbatch-safe` # [Scripting](#scripting) Using slurmq in scripts and automation. 
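A pattern that keeps scripts like the ones below tidy is to wrap the JSON calls in small shell functions with error handling. A minimal sketch; the function names and the 25-hour threshold are illustrative, not part of slurmq:

```bash
#!/bin/bash
# quota-guard.sh - tiny helpers around slurmq's JSON output (illustrative sketch)

# Print the check result as JSON; suppress stderr so callers decide how to react.
get_quota_json() {
    slurmq --json check 2>/dev/null
}

# Print remaining GPU-hours, falling back to 0 if the check fails.
remaining_hours() {
    get_quota_json | jq -r '.remaining_gpu_hours // 0'
}

# Example: only submit when at least 25 GPU-hours remain.
if (( $(echo "$(remaining_hours) >= 25" | bc -l) )); then
    sbatch my_job.sh
else
    echo "Not enough quota headroom; skipping submission." >&2
fi
```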
## [JSON output](#json-output) Always use `--json` for machine-readable output: ``` slurmq --json check slurmq --json report slurmq --json stats ``` ## [Parsing with jq](#parsing-with-jq) ### [Get specific fields](#get-specific-fields) ``` # Get remaining hours slurmq --json check | jq '.remaining_gpu_hours' # Get status slurmq --json check | jq -r '.status' # Check if exceeded slurmq --json check | jq -e '.status == "exceeded"' ``` ### [Filter users](#filter-users) ``` # Users over 80% usage slurmq --json report | jq '.users[] | select(.usage_percentage > 80)' # Users with active jobs slurmq --json report | jq '.users[] | select(.active_jobs > 0)' # Top 5 users by usage slurmq --json report | jq '.users | sort_by(-.used_gpu_hours) | .[:5]' ``` ### [Extract for other tools](#extract-for-other-tools) ``` # Get user list for email slurmq --json report | jq -r '.users[] | select(.status == "exceeded") | .user' # CSV-like output slurmq --json report | jq -r '.users[] | [.user, .used_gpu_hours, .status] | @csv' ``` ## [Exit codes](#exit-codes) Use exit codes for conditionals: ``` # Check succeeds if quota OK if slurmq --quiet check; then echo "Quota OK" else echo "Quota problem" fi ``` | Code | Meaning | | ---- | ------------------------------- | | 0 | Success / quota OK | | 1 | Error | | 2 | Quota exceeded (with `--quiet`) | ## [Python integration](#python-integration) ``` import json import subprocess def get_quota(): """Get quota info as dict.""" result = subprocess.run( ["slurmq", "--json", "check"], capture_output=True, text=True, check=True, ) return json.loads(result.stdout) def check_quota_ok() -> bool: """Return True if quota is not exceeded.""" try: info = get_quota() return info["status"] != "exceeded" except subprocess.CalledProcessError: return False # Usage if check_quota_ok(): submit_training_job() else: print("Quota exceeded!") ``` ## [Environment variables](#environment-variables) Override config via environment: ``` # In scripts export SLURMQ_DEFAULT_CLUSTER=stellar export SLURMQ_CONFIG=/etc/slurmq/prod.toml # One-liner SLURMQ_DEFAULT_CLUSTER=stellar slurmq check ``` ## [Error handling](#error-handling) Handle errors gracefully: ``` #!/bin/bash set -e # Capture output and exit code if output=$(slurmq --json check 2>&1); then remaining=$(echo "$output" | jq '.remaining_gpu_hours') echo "Remaining: $remaining hours" else echo "Error checking quota: $output" >&2 exit 1 fi ``` ## [Parallel checks](#parallel-checks) Check multiple clusters in parallel: ``` #!/bin/bash clusters=(stella stellar traverse) for cluster in "${clusters[@]}"; do slurmq --cluster "$cluster" --json check & done | jq -s '.' # jq -s collects the results and returns once every background check finishes ``` # [Cron Jobs](#cron-jobs) Setting up automated quota monitoring and enforcement. ## [Basic monitoring cron](#basic-monitoring-cron) Check quotas every 5 minutes: ``` # /etc/cron.d/slurmq-monitor */5 * * * * root /usr/local/bin/slurmq --quiet monitor --once >> /var/log/slurmq/monitor.log 2>&1 ``` ## [Enforcement cron](#enforcement-cron) With quota enforcement enabled: ``` # /etc/cron.d/slurmq-enforce */5 * * * * root /usr/local/bin/slurmq --quiet monitor --once --enforce >> /var/log/slurmq/enforce.log 2>&1 ``` **Enable in config first:** Make sure `[enforcement]` is configured properly before enabling: ```toml [enforcement] enabled = true dry_run = false # Only set false when ready! 
``` ## [Daily reports](#daily-reports) Generate daily usage report: ``` # /etc/cron.d/slurmq-report # (% must be escaped as \% inside crontab entries) 0 8 * * * root /usr/local/bin/slurmq report --format csv -o /var/log/slurmq/daily-$(date +\%Y\%m\%d).csv ``` ## [Weekly stats](#weekly-stats) Generate weekly analytics: ``` # /etc/cron.d/slurmq-weekly 0 9 * * 1 root /usr/local/bin/slurmq --json stats --days 7 > /var/log/slurmq/weekly-$(date +\%Y\%W).json ``` ## [Systemd timers (alternative)](#systemd-timers-alternative) More robust than cron for modern systems. ### [Monitor timer](#monitor-timer) ``` # /etc/systemd/system/slurmq-monitor.timer [Unit] Description=Run slurmq quota monitoring [Timer] OnCalendar=*:0/5 Persistent=true RandomizedDelaySec=30 [Install] WantedBy=timers.target ``` ``` # /etc/systemd/system/slurmq-monitor.service [Unit] Description=slurmq quota monitoring After=network.target [Service] Type=oneshot ExecStart=/usr/local/bin/slurmq --quiet monitor --once --enforce StandardOutput=append:/var/log/slurmq/monitor.log StandardError=append:/var/log/slurmq/monitor.log ``` Enable: ``` sudo systemctl enable --now slurmq-monitor.timer ``` ### [Daily report timer](#daily-report-timer) ``` # /etc/systemd/system/slurmq-report.timer [Unit] Description=Daily slurmq report [Timer] OnCalendar=*-*-* 08:00:00 Persistent=true [Install] WantedBy=timers.target ``` ``` # /etc/systemd/system/slurmq-report.service [Unit] Description=Generate daily slurmq report [Service] Type=oneshot ExecStart=/bin/bash -c '/usr/local/bin/slurmq report --format csv -o /var/log/slurmq/daily-$(date +%%Y%%m%%d).csv' ``` ## [Log rotation](#log-rotation) Rotate slurmq logs: ``` # /etc/logrotate.d/slurmq /var/log/slurmq/*.log { daily rotate 30 compress delaycompress missingok notifempty create 0644 root root } ``` ## [Monitoring the monitor](#monitoring-the-monitor) Alert if slurmq itself fails: ``` #!/bin/bash # /usr/local/bin/slurmq-healthcheck.sh if ! slurmq --quiet check 2>/dev/null; then echo "slurmq check failed!" | mail -s "slurmq alert" admin@example.edu fi ``` ``` # Cron entry */30 * * * * root /usr/local/bin/slurmq-healthcheck.sh ``` ## [Best practices](#best-practices) 1. **Start with dry_run** - Always test enforcement with `dry_run = true` first 2. **Log everything** - Keep logs for audit trail 3. **Use systemd** - More reliable than cron for critical monitoring 4. **Set up alerts** - Know when slurmq itself fails 5. **Test restores** - Verify config after system updates # Reference # [Environment Variables](#environment-variables) All configuration values can be overridden with environment variables. ## [Naming convention](#naming-convention) Environment variables use the `SLURMQ_` prefix, with double underscores separating nested sections: ``` SLURMQ_<KEY> SLURMQ_<SECTION>__<KEY>
``` ## [Available variables](#available-variables) ### [Core settings](#core-settings) | Variable | Description | Example | | ------------------------ | -------------------- | ------------------------- | | `SLURMQ_CONFIG` | Path to config file | `/etc/slurmq/config.toml` | | `SLURMQ_DEFAULT_CLUSTER` | Default cluster name | `stella` | ### [Monitoring](#monitoring) | Variable | Description | Default | | ------------------------------------------- | ------------------ | ------- | | `SLURMQ_MONITORING__CHECK_INTERVAL_MINUTES` | Check interval | `30` | | `SLURMQ_MONITORING__WARNING_THRESHOLD` | Warn at this % | `0.8` | | `SLURMQ_MONITORING__CRITICAL_THRESHOLD` | Critical at this % | `1.0` | ### [Enforcement](#enforcement) | Variable | Description | Default | | ---------------------------------------- | ------------------ | ------- | | `SLURMQ_ENFORCEMENT__ENABLED` | Enable enforcement | `false` | | `SLURMQ_ENFORCEMENT__DRY_RUN` | Preview mode | `true` | | `SLURMQ_ENFORCEMENT__GRACE_PERIOD_HOURS` | Warning window | `24` | | `SLURMQ_ENFORCEMENT__CANCEL_ORDER` | `lifo` or `fifo` | `lifo` | ### [Display](#display) | Variable | Description | Default | | ------------------------------- | ------------------------- | ------- | | `SLURMQ_DISPLAY__COLOR` | Enable colors | `true` | | `SLURMQ_DISPLAY__OUTPUT_FORMAT` | `rich`, `plain`, `json` | `rich` | | `NO_COLOR` | Disable colors (standard) | - | ### [Email](#email) | Variable | Description | Default | | ------------------------- | -------------------------- | ------- | | `SLURMQ_EMAIL__ENABLED` | Enable email | `false` | | `SLURMQ_EMAIL__SENDER` | From address | - | | `SLURMQ_EMAIL__SMTP_HOST` | SMTP server | - | | `SLURMQ_EMAIL__SMTP_PORT` | SMTP port | `587` | | `SLURMQ_SMTP_USER` | SMTP username (via config) | - | | `SLURMQ_SMTP_PASS` | SMTP password (via config) | - | ### [Cache](#cache) | Variable | Description | Default | | --------------------------- | --------------- | ------------------ | | `SLURMQ_CACHE__ENABLED` | Enable caching | `true` | | `SLURMQ_CACHE__TTL_MINUTES` | Cache TTL | `60` | | `SLURMQ_CACHE__DIRECTORY` | Cache directory | `~/.cache/slurmq/` | ## [Examples](#examples) ### [Override default cluster](#override-default-cluster) ``` export SLURMQ_DEFAULT_CLUSTER=stellar slurmq check ``` ### [Disable colors](#disable-colors) ``` export NO_COLOR=1 slurmq check ``` ### [Use different config](#use-different-config) ``` export SLURMQ_CONFIG=/etc/slurmq/production.toml slurmq check ``` ### [Enable enforcement in CI](#enable-enforcement-in-ci) ``` export SLURMQ_ENFORCEMENT__ENABLED=true export SLURMQ_ENFORCEMENT__DRY_RUN=false slurmq monitor --once --enforce ``` ## [Priority](#priority) Environment variables override config file values: 1. CLI flags (highest) 1. Environment variables 1. Config file 1. System config (`/etc/slurmq/config.toml`) 1. Defaults (lowest) ## [Boolean values](#boolean-values) For boolean environment variables, these values are truthy: - `true`, `True`, `TRUE` - `1` - `yes`, `Yes`, `YES` - `on`, `On`, `ON` Everything else is falsy. # [Config File Reference](#config-file-reference) Complete reference for `config.toml`. ## [File locations](#file-locations) slurmq searches for config files in this order: 1. `$SLURMQ_CONFIG` (environment variable) 1. `~/.config/slurmq/config.toml` (user) 1. 
`/etc/slurmq/config.toml` (system) ## [Complete example](#complete-example) ``` # Default cluster when --cluster is not specified default_cluster = "stella" # Cluster definitions [clusters.stella] name = "Stella HPC GPU Cluster" account = "research-group" qos = ["high-priority", "normal", "low"] partitions = ["gpu", "gpu-large"] quota_limit = 500 rolling_window_days = 30 [clusters.stellar] name = "Stellar" account = "astro" qos = ["standard"] partitions = ["gpu"] quota_limit = 200 rolling_window_days = 30 # Monitoring settings [monitoring] check_interval_minutes = 30 warning_threshold = 0.8 critical_threshold = 1.0 # Quota enforcement [enforcement] enabled = false dry_run = true grace_period_hours = 24 cancel_order = "lifo" exempt_users = ["admin", "root"] exempt_job_prefixes = ["checkpoint_", "debug_"] # Email notifications [email] enabled = false sender = "hpc-alerts@example.edu" domain = "example.edu" subject_prefix = "[Stella HPC-GPU]" smtp_host = "smtp.example.edu" smtp_port = 587 smtp_user_env = "SLURMQ_SMTP_USER" smtp_pass_env = "SLURMQ_SMTP_PASS" # Display preferences [display] color = true output_format = "rich" # Cache settings [cache] enabled = true ttl_minutes = 60 directory = "" ``` ## [Section reference](#section-reference) ### [`default_cluster`](#default_cluster) ``` default_cluster = "stella" ``` Name of the cluster to use when `--cluster` is not specified. ### [`[clusters.<name>]`](#clustersname) Define cluster profiles. You can have multiple clusters. | Key | Type | Required | Description | | --------------------- | ------ | -------- | ------------------------- | | `name` | string | No | Display name | | `account` | string | No | Slurm account | | `qos` | array | No | List of QoS names | | `partitions` | array | No | List of partitions | | `quota_limit` | int | Yes | GPU-hours quota | | `rolling_window_days` | int | No | Window size (default: 30) | ### [`[monitoring]`](#monitoring) | Key | Type | Default | Description | | ------------------------ | ----- | ------- | --------------------------- | | `check_interval_minutes` | int | 30 | Monitoring interval | | `warning_threshold` | float | 0.8 | Warn at this percentage | | `critical_threshold` | float | 1.0 | Critical at this percentage | ### [`[enforcement]`](#enforcement) | Key | Type | Default | Description | | --------------------- | ------ | ------- | ----------------------------- | | `enabled` | bool | false | Enable enforcement | | `dry_run` | bool | true | Preview mode | | `grace_period_hours` | int | 24 | Warning window | | `cancel_order` | string | "lifo" | "lifo" or "fifo" | | `exempt_users` | array | [] | Skip these users | | `exempt_job_prefixes` | array | [] | Skip jobs with these prefixes | ### [`[email]`](#email) | Key | Type | Default | Description | | ---------------- | ------ | ------- | -------------------- | | `enabled` | bool | false | Enable email | | `sender` | string | - | From address | | `domain` | string | - | User email domain | | `subject_prefix` | string | "" | Email subject prefix | | `smtp_host` | string | - | SMTP server | | `smtp_port` | int | 587 | SMTP port | | `smtp_user_env` | string | - | Env var for username | | `smtp_pass_env` | string | - | Env var for password | ### [`[display]`](#display) | Key | Type | Default | Description | | --------------- | ------ | ------- | --------------------- | | `color` | bool | true | Enable colors | | `output_format` | string | "rich" | Default output format | ### [`[cache]`](#cache) | Key | Type | Default | Description | | ------------- | ------ | 
------- | --------------- | | `enabled` | bool | true | Enable caching | | `ttl_minutes` | int | 60 | Cache TTL | | `directory` | string | "" | Cache directory | ## [Validation](#validation) Check your config for errors: ``` slurmq config validate ``` ## [Minimal config](#minimal-config) The smallest valid config: ``` default_cluster = "mycluster" [clusters.mycluster] quota_limit = 500 ``` # [Exit Codes](#exit-codes) slurmq uses standard exit codes for scripting. ## [Exit code reference](#exit-code-reference) | Code | Meaning | When | | ---- | -------------- | -------------------------------------- | | 0 | Success | Command completed successfully | | 1 | Error | Configuration error, Slurm error, etc. | | 2 | Quota exceeded | With `--quiet` flag only | ## [Usage in scripts](#usage-in-scripts) ### [Check success](#check-success) ``` if slurmq check; then echo "Quota check passed" else echo "Quota check failed" fi ``` ### [Check specific codes](#check-specific-codes) ``` slurmq --quiet check case $? in 0) echo "Quota OK" ;; 1) echo "Error running check" ;; 2) echo "Quota exceeded" ;; esac ``` ### [Guard job submission](#guard-job-submission) ``` #!/bin/bash set -e # This will exit 2 if quota exceeded slurmq --quiet check # Only reached if quota OK sbatch my_job.sh ``` ## [Quiet mode](#quiet-mode) The `--quiet` flag: - Suppresses normal output - Returns exit code 2 for exceeded quota - Still shows errors on stderr ``` # Silent check slurmq --quiet check echo "Exit code: $?" # Capture errors only if ! slurmq --quiet check 2>/dev/null; then echo "Problem detected" fi ``` ## [Error messages](#error-messages) Errors are always written to stderr: ``` # Separate stdout and stderr slurmq check > /tmp/output.txt 2> /tmp/errors.txt ``` ## [JSON error output](#json-error-output) With `--json`, errors are also JSON: ``` slurmq --json check ``` ``` { "error": "No cluster specified and no default_cluster set" } ``` ## [Best practices](#best-practices) 1. **Use `--quiet` for scripts** - Clean exit codes without parsing output 1. **Check exit codes explicitly** - Don't assume success 1. **Capture stderr** - Errors go there 1. **Use `set -e`** - Exit on first error in shell scripts
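Putting those practices together, a minimal submission guard might look like this. The file paths and job script name are illustrative; the exit codes follow the table above:

```bash
#!/bin/bash
# submit-guarded.sh - the exit-code best practices in one script (sketch)
set -e                                   # exit on first unexpected error

errfile=$(mktemp)
trap 'rm -f "$errfile"' EXIT             # clean up the temp file on any exit

# --quiet gives clean exit codes; errors still land on stderr, captured here.
slurmq --quiet check 2>"$errfile" && status=0 || status=$?

case $status in                          # check exit codes explicitly
    0) sbatch my_job.sh ;;               # quota OK: submit
    2) echo "Quota exceeded; not submitting." >&2; exit 2 ;;
    *) echo "slurmq error:" >&2; cat "$errfile" >&2; exit 1 ;;
esac
```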