# slurmq > GPU quota management for Slurm clusters slurmq is a CLI tool for GPU quota management on Slurm clusters. It helps HPC administrators and users track GPU usage, enforce quotas, and generate reports. Key features: quota checking, multi-cluster support, configurable via TOML, JSON/CSV/YAML output, live monitoring dashboard, automated enforcement. # Getting started # [Installation](#installation) ## [Requirements](#requirements) - Python 3.11+ - Slurm with `sacct` command available ## [Install with uv (recommended)](#install-with-uv-recommended) [uv](https://docs.astral.sh/uv/) is the fastest way to install Python CLI tools: ``` uv tool install slurmq ``` This installs `slurmq` globally, isolated from your project environments. ## [Install with pip](#install-with-pip) ``` pip install slurmq ``` ## [Install with pipx](#install-with-pipx) ``` pipx install slurmq ``` ## [Install from source](#install-from-source) ``` git clone https://github.com/dedalus-labs/slurmq.git cd slurmq uv tool install . ``` ## [Verify installation](#verify-installation) ``` slurmq --version ``` ## [System-wide installation (HPC)](#system-wide-installation-hpc) For HPC environments where users don't have write access to their home directories, administrators can install slurmq system-wide: ``` # As root or with sudo pip install --prefix=/opt/slurmq slurmq # Add to system PATH in /etc/profile.d/slurmq.sh export PATH="/opt/slurmq/bin:$PATH" ``` Then create a system-wide config at `/etc/slurmq/config.toml`. ## [Shell completion](#shell-completion) slurmq supports shell completion for bash, zsh, and fish: ``` # Add to ~/.bashrc eval "$(slurmq --show-completion bash)" ``` ``` # Add to ~/.zshrc eval "$(slurmq --show-completion zsh)" ``` ``` # Add to ~/.config/fish/config.fish slurmq --show-completion fish | source ``` # [Quick Start](#quick-start) Get up and running with slurmq in 5 minutes. ## [1. Initialize configuration](#1-initialize-configuration) Run the interactive setup wizard: ``` slurmq config init ``` This creates `~/.config/slurmq/config.toml` with your cluster settings. **Minimal config:** You can also create a minimal config manually: ```toml default_cluster = "mycluster" [clusters.mycluster] name = "My Cluster" qos = ["normal"] quota_limit = 500 ``` ## [2. Check your quota](#2-check-your-quota) ``` slurmq check ``` That's it! You'll see your current GPU usage and remaining quota. ## [3. Useful variations](#3-useful-variations) ``` # Check another user's quota slurmq check --user alice # Use a different QoS slurmq check --qos high-priority # Get JSON output for scripting slurmq --json check # Show forecast of when quota frees up slurmq check --forecast ``` ## [4. Admin commands](#4-admin-commands) If you're an administrator: ``` # See all users' usage slurmq report # Live monitoring dashboard slurmq monitor # Cluster analytics slurmq stats ``` ## [Next steps](#next-steps) - [Configuration](https://dedalus-labs.github.io/slurmq/getting-started/configuration/index.md) - Full config file reference - [Commands](https://dedalus-labs.github.io/slurmq/commands/index.md) - All available commands - [Recipes](https://dedalus-labs.github.io/slurmq/recipes/common/index.md) - Common usage patterns # [Configuration](#configuration) slurmq uses a layered configuration system with these priorities (highest first): 1. **CLI flags** - `--qos`, `--account`, `--cluster` 1. **Environment variables** - `SLURMQ_*` 1. **Config file** - `~/.config/slurmq/config.toml` 1. 
**System config** - `/etc/slurmq/config.toml` 1. **Defaults** ## [Config file locations](#config-file-locations) slurmq looks for config files in this order: 1. `$SLURMQ_CONFIG` (env var override) 1. `~/.config/slurmq/config.toml` (user config) 1. `/etc/slurmq/config.toml` (system-wide) ## [Full config example](#full-config-example) ``` # ~/.config/slurmq/config.toml # Default cluster to use when --cluster is not specified default_cluster = "stella" # Cluster profiles [clusters.stella] name = "Stella HPC GPU Cluster" account = "research-group" qos = ["high-priority", "normal", "low"] partitions = ["gpu", "gpu-large"] quota_limit = 500 # GPU-hours per user rolling_window_days = 30 # Rolling window for quota [clusters.stellar] name = "Stellar" account = "astro" qos = ["standard"] partitions = ["gpu"] quota_limit = 200 # Monitoring thresholds [monitoring] check_interval_minutes = 30 warning_threshold = 0.8 # Warn at 80% usage critical_threshold = 1.0 # Critical at 100% # Quota enforcement (admin) [enforcement] enabled = false # Must explicitly enable dry_run = true # Preview mode - no actual cancels grace_period_hours = 24 # Warning before cancel cancel_order = "lifo" # "lifo" or "fifo" exempt_users = ["admin", "root"] exempt_job_prefixes = ["checkpoint_", "debug_"] # Email notifications (optional) [email] enabled = false sender = "hpc-alerts@example.edu" domain = "example.edu" # User email domain subject_prefix = "[Stella HPC-GPU]" smtp_host = "smtp.example.edu" smtp_port = 587 smtp_user_env = "SLURMQ_SMTP_USER" smtp_pass_env = "SLURMQ_SMTP_PASS" # Display preferences [display] color = true # Respects NO_COLOR env output_format = "rich" # "rich", "plain", "json" # Cache settings [cache] enabled = true ttl_minutes = 60 directory = "" # Default: ~/.cache/slurmq/ ``` ## [Environment variables](#environment-variables) All config values can be overridden with environment variables using the `SLURMQ_` prefix: ``` # Override default cluster export SLURMQ_DEFAULT_CLUSTER=stellar # Override nested values with double underscore export SLURMQ_MONITORING__WARNING_THRESHOLD=0.9 export SLURMQ_ENFORCEMENT__DRY_RUN=false # Point to a different config file export SLURMQ_CONFIG=/path/to/config.toml ``` ## [CLI flag overrides](#cli-flag-overrides) Most config values can be overridden per-command: ``` # Override cluster slurmq --cluster stellar check # Override QoS, account, partition slurmq check --qos high-priority --account mygroup --partition gpu-large # Override output format slurmq --json check slurmq --yaml check ``` ## [Multiple clusters](#multiple-clusters) Define multiple clusters and switch between them: ``` default_cluster = "stella" [clusters.stella] name = "Stella HPC" quota_limit = 500 qos = ["normal"] [clusters.traverse] name = "Traverse" quota_limit = 1000 qos = ["gpu"] ``` ``` slurmq check # Uses stella (default) slurmq --cluster traverse check ``` ## [Validating config](#validating-config) Check your config file for errors: ``` slurmq config validate ``` This catches: - TOML syntax errors - Invalid threshold values - Missing cluster definitions - Unknown config keys # Commands # [check](#check) Check GPU quota usage for a user. 
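At a glance, a typical interactive run looks something like this. The panel below is an illustrative sketch, not captured output; the fields mirror the JSON schema documented further down this page:

```bash
$ slurmq check
# Illustrative output (rich panel layout varies with terminal and config):
#   user: dedalus        qos: normal
#   used: 342.5 / 500 GPU-hours (68.5%)
#   remaining: 157.5 GPU-hours    status: ok    active jobs: 2
```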
## [Usage](#usage) ``` slurmq check [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------- | ----- | ------------------------------------- | | `--user` | `-u` | User to check (default: current user) | | `--qos` | `-q` | QoS to check (overrides config) | | `--account` | `-a` | Account to check (overrides config) | | `--partition` | `-p` | Partition to check (overrides config) | | `--forecast` | `-f` | Show quota forecast | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` # Check your own quota slurmq check # Check another user's quota slurmq check --user alice ``` ### [Override filters](#override-filters) ``` # Check a specific QoS slurmq check --qos high-priority # Check a specific account slurmq check --account research-gpu # Combine multiple overrides slurmq check --qos normal --partition gpu-large ``` ### [Output formats](#output-formats) ``` # Rich terminal output (default) slurmq check # JSON for scripting slurmq --json check # YAML output slurmq --yaml check ``` ### [Forecast](#forecast) Show when quota will free up: ``` slurmq check --forecast ``` ``` ╭─────────── Quota Forecast ───────────╮ │ Time Available % Available │ │ +12h 78.5 15.7% │ │ +24h 125.0 25.0% │ │ +72h 312.5 62.5% │ │ +168h 500.0 100.0% │ ╰──────────────────────────────────────╯ ``` ## [JSON output](#json-output) ``` { "user": "dedalus", "qos": "normal", "used_gpu_hours": 342.5, "quota_limit": 500, "remaining_gpu_hours": 157.5, "usage_percentage": 68.5, "status": "ok", "rolling_window_days": 30, "active_jobs": 2 } ``` ## [Exit codes](#exit-codes) | Code | Condition | | ---- | ------------------------------- | | 0 | Success | | 1 | Error | | 2 | Quota exceeded (with `--quiet`) | ## [Scripting](#scripting) ``` # Check if quota exceeded if slurmq --json check | jq -e '.status == "exceeded"' > /dev/null; then echo "Quota exceeded!" exit 1 fi # Get remaining hours remaining=$(slurmq --json check | jq '.remaining_gpu_hours') echo "You have $remaining GPU-hours remaining" ``` # [stats](#stats) Cluster-wide analytics with month-over-month comparison. 
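If you only need a single number for a script or dashboard, the JSON output (documented below) can be filtered with jq. A small sketch, assuming the partition key `gpu` as it appears in the JSON example later on this page:

```bash
# Total GPU-hours for the "gpu" partition over the analysis period
slurmq --json stats | jq '.current.gpu.all.gpu_hours'

# Median wait (hours) for the same partition
slurmq --json stats | jq '.current.gpu.all.median_wait_hours'
```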
## [Usage](#usage) ``` slurmq stats [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------------------ | ----- | ------------------------------------------------------ | | `--days` | `-d` | Analysis period in days (default: 30) | | `--compare/--no-compare` | | Show month-over-month comparison (default: on) | | `--partition` | `-p` | Filter by partition(s) (repeatable) | | `--qos` | `-q` | Filter by QoS(s) (repeatable) | | `--small-threshold` | | GPU-hours threshold for small/large jobs (default: 50) | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` # Analyze the last 30 days with MoM comparison slurmq stats ``` Output: ``` GPU Utilization (Last 30 Days) ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ Partition ┃ GPU Hours ┃ Jobs ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ gpu │ 125.3k (+12%) │ 1,245 (-5%)│ │ gpu-large │ 45.2k (+8%) │ 234 (+15%) │ └──────────────┴───────────────┴────────────┘ Wait Times: Small (≤50 GPU-h) ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ Partition ┃ Median Wait┃ Wait > 6h ┃ Jobs ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ gpu │ 15min (-8%)│ 2.1% │ 1,100 │ │ gpu-large │ 45min (+3%)│ 8.5% │ 180 │ └──────────────┴────────────┴───────────┴───────┘ Wait Times: Large (>50 GPU-h) ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ Partition ┃ Median Wait┃ Wait > 6h ┃ Jobs ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ gpu │ 2h15 (+15%)│ 25.0% │ 145 │ │ gpu-large │ 8h30 (-2%) │ 62.5% │ 54 │ └──────────────┴────────────┴───────────┴───────┘ ``` ### [Custom time range](#custom-time-range) ``` # Last 14 days slurmq stats --days 14 # Last 7 days, no comparison slurmq stats --days 7 --no-compare ``` ### [Filter by partition or QoS](#filter-by-partition-or-qos) ``` # Specific partitions slurmq stats --partition gpu --partition gpu-large # Specific QoS slurmq stats --qos high-priority ``` ### [Custom job size threshold](#custom-job-size-threshold) ``` # Jobs ≤25 GPU-hours are "small" slurmq stats --small-threshold 25 ``` ### [JSON output](#json-output) ``` slurmq --json stats ``` ``` { "period_days": 30, "current": { "gpu": { "all": { "job_count": 1245, "gpu_hours": 125300.5, "median_wait_hours": 0.75, "long_wait_pct": 12.5 }, "small": { ... }, "large": { ... } } }, "previous": { ... } } ``` ## [Metrics explained](#metrics-explained) | Metric | Description | | --------------- | -------------------------------------------- | | **GPU Hours** | Total GPU-hours consumed | | **Jobs** | Number of jobs completed | | **Median Wait** | Median time from submission to start | | **Wait > 6h** | Percentage of jobs waiting more than 6 hours | ## [Use cases](#use-cases) - **Capacity planning** - Track utilization trends - **SLA monitoring** - Watch wait time percentiles - **QoS tuning** - Compare partition performance - **Monthly reports** - Export JSON for dashboards # [report](#report) Generate usage reports for all users. 
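Because the CSV format is plain and stable, ordinary shell tools work well on it. A sketch of listing the five heaviest users, assuming the column order shown in the CSV output section below and that CSV goes to stdout when `--output` is omitted:

```bash
# Top 5 users by consumed GPU-hours (column 2 of the CSV output);
# tail -n +2 skips the header row before the numeric sort
slurmq report --format csv | tail -n +2 | sort -t, -k2 -rn | head -n 5
```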
## [Usage](#usage) ``` slurmq report [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------- | ----- | ---------------------------------------------- | | `--format` | `-f` | Output format: rich, json, csv (default: rich) | | `--output` | `-o` | Output file path | | `--qos` | `-q` | QoS to report on (overrides config) | | `--account` | `-a` | Account to report on (overrides config) | | `--partition` | `-p` | Partition to report on (overrides config) | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` slurmq report ``` Output: ``` GPU Usage Report: Stella (normal) ┏━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ ┃ User ┃ Used (GPU-h)┃ Remaining ┃ Usage % ┃ Status ┃ Active ┃ ┡━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ │ alice │ 487.5 │ 12.5 │ 98% │ ! │ 3 │ │ bob │ 342.0 │ 158.0 │ 68% │ ok │ 1 │ │ charlie │ 125.5 │ 374.5 │ 25% │ ok │ - │ └─────────┴─────────────┴───────────┴─────────┴────────┴────────┘ Total users: 3 ``` ### [Export formats](#export-formats) ``` # JSON slurmq report --format json # CSV (for spreadsheets) slurmq report --format csv # Save to file slurmq report --format csv --output usage.csv ``` ### [Filter by QoS/account](#filter-by-qosaccount) ``` # Specific QoS slurmq report --qos high-priority # Specific account slurmq report --account research-gpu ``` ## [JSON output](#json-output) ``` { "cluster": "Stella", "qos": "normal", "users": [ { "user": "alice", "used_gpu_hours": 487.5, "quota_limit": 500, "remaining_gpu_hours": 12.5, "usage_percentage": 97.5, "status": "warning", "active_jobs": 3, "total_jobs": 45 } ] } ``` ## [CSV output](#csv-output) ``` user,used_gpu_hours,quota_limit,remaining_gpu_hours,usage_percentage,status,active_jobs,total_jobs alice,487.5,500,12.5,97.5,warning,3,45 bob,342.0,500,158.0,68.4,ok,1,23 charlie,125.5,500,374.5,25.1,ok,0,12 ``` ## [Use cases](#use-cases) - **Weekly reports** - Track who's using resources - **Capacity planning** - Identify heavy users - **Billing** - Export CSV for chargeback - **Compliance** - Audit quota enforcement # [monitor](#monitor) Real-time monitoring with optional quota enforcement. ## [Usage](#usage) ``` slurmq monitor [OPTIONS] ``` ## [Options](#options) | Option | Short | Description | | ------------ | ----- | ----------------------------------------- | | `--interval` | `-i` | Refresh interval in seconds (default: 30) | | `--once` | | Single check, then exit (for cron) | | `--enforce` | | Enable quota enforcement | ## [Examples](#examples) ### [Live dashboard](#live-dashboard) ``` # Interactive monitoring with 30s refresh slurmq monitor # Faster refresh slurmq monitor --interval 10 ``` ### [Single check (for cron)](#single-check-for-cron) ``` # Check once and exit slurmq monitor --once # With enforcement slurmq monitor --once --enforce ``` ## [Enforcement](#enforcement) When `--enforce` is enabled, slurmq will cancel jobs for users who exceed their quota. **Enable in config first:** Enforcement must be enabled in your config file: ```toml [enforcement] enabled = true dry_run = true # Start with dry_run to preview ``` ### [Enforcement behavior](#enforcement-behavior) 1. **Grace period** - Users get a warning email before cancellation 2. **Cancel order** - Jobs cancelled in LIFO or FIFO order 3. **Exemptions** - Skip specific users or job prefixes 4. 
**Dry run** - Preview what would be cancelled ### [Configuration](#configuration) ``` [enforcement] enabled = true dry_run = true # Set false when ready grace_period_hours = 24 # Warning window cancel_order = "lifo" # "lifo" or "fifo" exempt_users = ["admin"] exempt_job_prefixes = ["checkpoint_"] ``` ### [Enforcement flow](#enforcement-flow) ``` flowchart TD A[User exceeds quota] --> B{Grace period?} B -->|Yes| C[Send warning email] B -->|No| D{User exempt?} D -->|Yes| E[Skip] D -->|No| F{Job exempt?} F -->|Yes| E F -->|No| G{dry_run?} G -->|Yes| H[Log only] G -->|No| I[Cancel job + send notification] ``` ## [Cron setup](#cron-setup) Run monitoring every 5 minutes: ``` # /etc/cron.d/slurmq */5 * * * * root slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1 ``` Or with systemd timer: ``` # /etc/systemd/system/slurmq-monitor.timer [Unit] Description=Run slurmq quota check [Timer] OnCalendar=*:0/5 Persistent=true [Install] WantedBy=timers.target ``` ``` # /etc/systemd/system/slurmq-monitor.service [Unit] Description=slurmq quota monitoring [Service] Type=oneshot ExecStart=/usr/local/bin/slurmq --quiet monitor --once --enforce ``` ## [Dashboard output](#dashboard-output) The live dashboard shows: ``` ┌─────────────────── slurmq monitor ───────────────────┐ │ Cluster: Stella QoS: normal Refresh: 30s │ ├──────────────────────────────────────────────────────┤ │ User Used Remaining Status Active │ │ alice 487.5 12.5 ! WARN 3 │ │ bob 342.0 158.0 ok 1 │ │ charlie 512.5 -12.5 x OVER 2 -> cancel │ ├──────────────────────────────────────────────────────┤ │ Last check: 2024-01-15 14:30:00 │ │ Press q to quit │ └──────────────────────────────────────────────────────┘ ``` # [efficiency](#efficiency) Analyze job resource efficiency (inspired by Slurm's `seff`). ## [Usage](#usage) ``` slurmq efficiency JOB_ID ``` ## [Arguments](#arguments) | Argument | Description | | -------- | ----------------------- | | `JOB_ID` | Slurm job ID to analyze | ## [Examples](#examples) ### [Basic usage](#basic-usage) ``` slurmq efficiency 12345678 ``` Output: ``` ╭──────────────── Job Efficiency Report ────────────────╮ │ │ │ Job ID: 12345678 │ │ Job Name: train_model │ │ User: alice │ │ Account: research │ │ State: COMPLETED │ │ │ │ CPU Efficiency: 78.5% ████████░░ │ │ Memory Efficiency: 45.2% █████░░░░░ ! Low │ │ │ │ Cores requested: 32 │ │ CPU time used: 12h 30m (of 16h allocated) │ │ │ │ Memory requested: 128GB │ │ Memory used: 58GB (peak) │ │ │ ╰───────────────────────────────────────────────────────╯ ``` ## [Efficiency thresholds](#efficiency-thresholds) | Metric | Good | Warning | Low | | ------ | ---- | ------- | ----- | | CPU | ≥50% | 30-50% | \<30% | | Memory | ≥30% | 20-30% | \<20% | ## [Use cases](#use-cases) - **Debugging** - Why did my job fail (OOM)? - **Optimization** - Am I requesting too many resources? 
- **Education** - Help users right-size their jobs ## [Tips for improving efficiency](#tips-for-improving-efficiency) ### [CPU efficiency low?](#cpu-efficiency-low) - Your job may be I/O bound - Consider fewer cores with faster storage - Check for serial bottlenecks in your code ### [Memory efficiency low?](#memory-efficiency-low) - Request less memory in your submission - Use `--mem-per-cpu` instead of `--mem` - Profile your application's memory usage ## [JSON output](#json-output) ``` slurmq --json efficiency 12345678 ``` ``` { "job_id": "12345678", "job_name": "train_model", "user": "alice", "state": "COMPLETED", "cpu_efficiency": 78.5, "memory_efficiency": 45.2, "cores_requested": 32, "cpu_time_used_seconds": 45000, "memory_requested_gb": 128, "memory_used_gb": 58 } ``` # [config](#config) Manage slurmq configuration. ## [Subcommands](#subcommands) ### [init](#init) Interactive configuration wizard. ``` slurmq config init ``` Creates `~/.config/slurmq/config.toml` with your cluster settings. The wizard prompts for: 1. Cluster name 1. Account 1. QoS list 1. Quota limit 1. Rolling window days ### [show](#show) Display current configuration. ``` slurmq config show ``` Shows the merged configuration from all sources (env vars, config file, defaults). ### [path](#path) Show config file location. ``` slurmq config path ``` Output: ``` Config file: /home/user/.config/slurmq/config.toml Status: exists ``` ### [validate](#validate) Validate configuration file. ``` slurmq config validate ``` Checks for: - TOML syntax errors - Invalid threshold values (must be 0-1) - Missing cluster definitions - Unknown configuration keys - Logical errors (e.g., warning > critical threshold) Example output: ``` ok: Config file is valid: /home/user/.config/slurmq/config.toml ``` Or with errors: ``` error: Config validation failed: - monitoring.warning_threshold (1.5) must be between 0 and 1 - default_cluster "nonexistent" not found in clusters ``` ### [set](#set) Set a configuration value. ``` slurmq config set KEY VALUE ``` Examples: ``` # Set default cluster slurmq config set default_cluster stellar # Set nested value slurmq config set monitoring.warning_threshold 0.9 ``` ## [Examples](#examples) ### [Initial setup](#initial-setup) ``` # Create config interactively slurmq config init # Verify slurmq config show slurmq config validate ``` ### [Check config location](#check-config-location) ``` $ slurmq config path Config file: /home/user/.config/slurmq/config.toml Status: exists ``` ### [Debug configuration](#debug-configuration) ``` # See what config slurmq is actually using slurmq config show # Check for errors slurmq config validate # See where it's loading from SLURMQ_CONFIG=/tmp/test.toml slurmq config path ``` # Recipes # [Common Patterns](#common-patterns) Recipes for common slurmq usage patterns. ## [Check before submitting](#check-before-submitting) Block job submission if quota exceeded: ``` #!/bin/bash # submit-job.sh # Check quota first if ! slurmq --quiet check; then echo "Quota exceeded! Cannot submit job." exit 1 fi # Submit job sbatch my_job.sh ``` ## [Get remaining quota](#get-remaining-quota) Extract remaining hours for scripts: ``` remaining=$(slurmq --json check | jq '.remaining_gpu_hours') echo "You have $remaining GPU-hours remaining" # Check if enough for a job if (( $(echo "$remaining < 10" | bc -l) )); then echo "Warning: Less than 10 hours remaining!" 
fi ``` ## [Multi-cluster workflow](#multi-cluster-workflow) Switch between clusters: ``` # Check all clusters for cluster in stella stellar traverse; do echo "=== $cluster ===" slurmq --cluster "$cluster" check done ``` ## [Export weekly report](#export-weekly-report) Generate CSV report for spreadsheets: ``` #!/bin/bash # weekly-report.sh DATE=$(date +%Y-%m-%d) OUTPUT_DIR="/shared/reports" slurmq report --format csv --output "$OUTPUT_DIR/usage-$DATE.csv" ``` ## [Slack notification](#slack-notification) Send alerts to Slack: ``` #!/bin/bash # check-and-alert.sh WEBHOOK_URL="https://hooks.slack.com/services/..." # Check for exceeded users exceeded=$(slurmq --json report | jq '[.users[] | select(.status == "exceeded")] | length') if [ "$exceeded" -gt 0 ]; then curl -X POST -H 'Content-type: application/json' \ --data "{\"text\":\"warning: $exceeded users have exceeded GPU quota\"}" \ "$WEBHOOK_URL" fi ``` ## [Pre-commit hook](#pre-commit-hook) Check quota before git push (for ML repos): ``` #!/bin/bash # .git/hooks/pre-push # Only check if on training branch branch=$(git rev-parse --abbrev-ref HEAD) if [[ "$branch" == "train-"* ]]; then if ! slurmq --quiet check; then echo "Quota exceeded! Training jobs may fail." read -p "Push anyway? (y/N) " confirm [[ "$confirm" != "y" ]] && exit 1 fi fi ``` ## [Dashboard integration](#dashboard-integration) Export metrics for Grafana/Prometheus: ``` #!/bin/bash # metrics.sh - Run every minute via cron METRICS_FILE="/var/lib/prometheus/slurmq.prom" # Get stats stats=$(slurmq --json report 2>/dev/null) if [ $? -eq 0 ]; then # Generate Prometheus metrics echo "# HELP slurmq_users_total Total number of users" echo "# TYPE slurmq_users_total gauge" echo "slurmq_users_total $(echo "$stats" | jq '.users | length')" echo "# HELP slurmq_exceeded_users Users over quota" echo "# TYPE slurmq_exceeded_users gauge" echo "slurmq_exceeded_users $(echo "$stats" | jq '[.users[] | select(.status == "exceeded")] | length')" fi > "$METRICS_FILE" ``` ## [Job wrapper](#job-wrapper) Wrap sbatch to check quota first: ``` #!/bin/bash # /usr/local/bin/sbatch-safe # Check quota remaining=$(slurmq --json check 2>/dev/null | jq -r '.remaining_gpu_hours // 0') if (( $(echo "$remaining <= 0" | bc -l) )); then echo "Error: GPU quota exceeded. Cannot submit job." >&2 exit 1 fi # Pass through to real sbatch exec /usr/bin/sbatch "$@" ``` Then alias: `alias sbatch=/usr/local/bin/sbatch-safe` # [Scripting](#scripting) Using slurmq in scripts and automation. 
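A pattern that keeps scripts like the ones below tidy is to wrap the JSON calls in small shell functions with error handling. A minimal sketch; the function names and the 25-hour threshold are illustrative, not part of slurmq:

```bash
#!/bin/bash
# quota-guard.sh - tiny helpers around slurmq's JSON output (illustrative sketch)

# Print the check result as JSON; suppress stderr so callers decide how to react.
get_quota_json() {
    slurmq --json check 2>/dev/null
}

# Print remaining GPU-hours, falling back to 0 if the check fails.
remaining_hours() {
    get_quota_json | jq -r '.remaining_gpu_hours // 0'
}

# Example: only submit when at least 25 GPU-hours remain.
if (( $(echo "$(remaining_hours) >= 25" | bc -l) )); then
    sbatch my_job.sh
else
    echo "Not enough quota headroom; skipping submission." >&2
fi
```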
## [JSON output](#json-output) Always use `--json` for machine-readable output: ``` slurmq --json check slurmq --json report slurmq --json stats ``` ## [Parsing with jq](#parsing-with-jq) ### [Get specific fields](#get-specific-fields) ``` # Get remaining hours slurmq --json check | jq '.remaining_gpu_hours' # Get status slurmq --json check | jq -r '.status' # Check if exceeded slurmq --json check | jq -e '.status == "exceeded"' ``` ### [Filter users](#filter-users) ``` # Users over 80% usage slurmq --json report | jq '.users[] | select(.usage_percentage > 80)' # Users with active jobs slurmq --json report | jq '.users[] | select(.active_jobs > 0)' # Top 5 users by usage slurmq --json report | jq '.users | sort_by(-.used_gpu_hours) | .[:5]' ``` ### [Extract for other tools](#extract-for-other-tools) ``` # Get user list for email slurmq --json report | jq -r '.users[] | select(.status == "exceeded") | .user' # CSV-like output slurmq --json report | jq -r '.users[] | [.user, .used_gpu_hours, .status] | @csv' ``` ## [Exit codes](#exit-codes) Use exit codes for conditionals: ``` # Check succeeds if quota OK if slurmq --quiet check; then echo "Quota OK" else echo "Quota problem" fi ``` | Code | Meaning | | ---- | ------------------------------- | | 0 | Success / quota OK | | 1 | Error | | 2 | Quota exceeded (with `--quiet`) | ## [Python integration](#python-integration) ``` import json import subprocess def get_quota(): """Get quota info as dict.""" result = subprocess.run( ["slurmq", "--json", "check"], capture_output=True, text=True, check=True, ) return json.loads(result.stdout) def check_quota_ok() -> bool: """Return True if quota is not exceeded.""" try: info = get_quota() return info["status"] != "exceeded" except subprocess.CalledProcessError: return False # Usage if check_quota_ok(): submit_training_job() else: print("Quota exceeded!") ``` ## [Environment variables](#environment-variables) Override config via environment: ``` # In scripts export SLURMQ_DEFAULT_CLUSTER=stellar export SLURMQ_CONFIG=/etc/slurmq/prod.toml # One-liner SLURMQ_DEFAULT_CLUSTER=stellar slurmq check ``` ## [Error handling](#error-handling) Handle errors gracefully: ``` #!/bin/bash set -e # Capture output and exit code if output=$(slurmq --json check 2>&1); then remaining=$(echo "$output" | jq '.remaining_gpu_hours') echo "Remaining: $remaining hours" else echo "Error checking quota: $output" >&2 exit 1 fi ``` ## [Parallel checks](#parallel-checks) Check multiple clusters in parallel: ``` #!/bin/bash clusters=(stella stellar traverse) for cluster in "${clusters[@]}"; do slurmq --cluster "$cluster" --json check & done | jq -s '.' # jq -s collects the results and returns once every background check finishes ``` # [Cron Jobs](#cron-jobs) Setting up automated quota monitoring and enforcement. ## [Basic monitoring cron](#basic-monitoring-cron) Check quotas every 5 minutes: ``` # /etc/cron.d/slurmq-monitor */5 * * * * root /usr/local/bin/slurmq --quiet monitor --once >> /var/log/slurmq/monitor.log 2>&1 ``` ## [Enforcement cron](#enforcement-cron) With quota enforcement enabled: ``` # /etc/cron.d/slurmq-enforce */5 * * * * root /usr/local/bin/slurmq --quiet monitor --once --enforce >> /var/log/slurmq/enforce.log 2>&1 ``` **Enable in config first:** Make sure `[enforcement]` is configured properly before enabling: ```toml [enforcement] enabled = true dry_run = false # Only set false when ready! 
``` ## [Daily reports](#daily-reports) Generate daily usage report: ``` # /etc/cron.d/slurmq-report # (% must be escaped as \% inside crontab entries) 0 8 * * * root /usr/local/bin/slurmq report --format csv -o /var/log/slurmq/daily-$(date +\%Y\%m\%d).csv ``` ## [Weekly stats](#weekly-stats) Generate weekly analytics: ``` # /etc/cron.d/slurmq-weekly 0 9 * * 1 root /usr/local/bin/slurmq --json stats --days 7 > /var/log/slurmq/weekly-$(date +\%Y\%W).json ``` ## [Systemd timers (alternative)](#systemd-timers-alternative) More robust than cron for modern systems. ### [Monitor timer](#monitor-timer) ``` # /etc/systemd/system/slurmq-monitor.timer [Unit] Description=Run slurmq quota monitoring [Timer] OnCalendar=*:0/5 Persistent=true RandomizedDelaySec=30 [Install] WantedBy=timers.target ``` ``` # /etc/systemd/system/slurmq-monitor.service [Unit] Description=slurmq quota monitoring After=network.target [Service] Type=oneshot ExecStart=/usr/local/bin/slurmq --quiet monitor --once --enforce StandardOutput=append:/var/log/slurmq/monitor.log StandardError=append:/var/log/slurmq/monitor.log ``` Enable: ``` sudo systemctl enable --now slurmq-monitor.timer ``` ### [Daily report timer](#daily-report-timer) ``` # /etc/systemd/system/slurmq-report.timer [Unit] Description=Daily slurmq report [Timer] OnCalendar=*-*-* 08:00:00 Persistent=true [Install] WantedBy=timers.target ``` ``` # /etc/systemd/system/slurmq-report.service [Unit] Description=Generate daily slurmq report [Service] Type=oneshot ExecStart=/bin/bash -c '/usr/local/bin/slurmq report --format csv -o /var/log/slurmq/daily-$(date +%%Y%%m%%d).csv' ``` ## [Log rotation](#log-rotation) Rotate slurmq logs: ``` # /etc/logrotate.d/slurmq /var/log/slurmq/*.log { daily rotate 30 compress delaycompress missingok notifempty create 0644 root root } ``` ## [Monitoring the monitor](#monitoring-the-monitor) Alert if slurmq itself fails: ``` #!/bin/bash # /usr/local/bin/slurmq-healthcheck.sh if ! slurmq --quiet check 2>/dev/null; then echo "slurmq check failed!" | mail -s "slurmq alert" admin@example.edu fi ``` ``` # Cron entry */30 * * * * root /usr/local/bin/slurmq-healthcheck.sh ``` ## [Best practices](#best-practices) 1. **Start with dry_run** - Always test enforcement with `dry_run = true` first 2. **Log everything** - Keep logs for audit trail 3. **Use systemd** - More reliable than cron for critical monitoring 4. **Set up alerts** - Know when slurmq itself fails 5. **Test restores** - Verify config after system updates # Reference # [Environment Variables](#environment-variables) All configuration values can be overridden with environment variables. ## [Naming convention](#naming-convention) Environment variables use the `SLURMQ_` prefix, with double underscores separating nested sections: ``` SLURMQ_<KEY> SLURMQ_<SECTION>__<KEY>
``` ## [Available variables](#available-variables) ### [Core settings](#core-settings) | Variable | Description | Example | | ------------------------ | -------------------- | ------------------------- | | `SLURMQ_CONFIG` | Path to config file | `/etc/slurmq/config.toml` | | `SLURMQ_DEFAULT_CLUSTER` | Default cluster name | `stella` | ### [Monitoring](#monitoring) | Variable | Description | Default | | ------------------------------------------- | ------------------ | ------- | | `SLURMQ_MONITORING__CHECK_INTERVAL_MINUTES` | Check interval | `30` | | `SLURMQ_MONITORING__WARNING_THRESHOLD` | Warn at this % | `0.8` | | `SLURMQ_MONITORING__CRITICAL_THRESHOLD` | Critical at this % | `1.0` | ### [Enforcement](#enforcement) | Variable | Description | Default | | ---------------------------------------- | ------------------ | ------- | | `SLURMQ_ENFORCEMENT__ENABLED` | Enable enforcement | `false` | | `SLURMQ_ENFORCEMENT__DRY_RUN` | Preview mode | `true` | | `SLURMQ_ENFORCEMENT__GRACE_PERIOD_HOURS` | Warning window | `24` | | `SLURMQ_ENFORCEMENT__CANCEL_ORDER` | `lifo` or `fifo` | `lifo` | ### [Display](#display) | Variable | Description | Default | | ------------------------------- | ------------------------- | ------- | | `SLURMQ_DISPLAY__COLOR` | Enable colors | `true` | | `SLURMQ_DISPLAY__OUTPUT_FORMAT` | `rich`, `plain`, `json` | `rich` | | `NO_COLOR` | Disable colors (standard) | - | ### [Email](#email) | Variable | Description | Default | | ------------------------- | -------------------------- | ------- | | `SLURMQ_EMAIL__ENABLED` | Enable email | `false` | | `SLURMQ_EMAIL__SENDER` | From address | - | | `SLURMQ_EMAIL__SMTP_HOST` | SMTP server | - | | `SLURMQ_EMAIL__SMTP_PORT` | SMTP port | `587` | | `SLURMQ_SMTP_USER` | SMTP username (via config) | - | | `SLURMQ_SMTP_PASS` | SMTP password (via config) | - | ### [Cache](#cache) | Variable | Description | Default | | --------------------------- | --------------- | ------------------ | | `SLURMQ_CACHE__ENABLED` | Enable caching | `true` | | `SLURMQ_CACHE__TTL_MINUTES` | Cache TTL | `60` | | `SLURMQ_CACHE__DIRECTORY` | Cache directory | `~/.cache/slurmq/` | ## [Examples](#examples) ### [Override default cluster](#override-default-cluster) ``` export SLURMQ_DEFAULT_CLUSTER=stellar slurmq check ``` ### [Disable colors](#disable-colors) ``` export NO_COLOR=1 slurmq check ``` ### [Use different config](#use-different-config) ``` export SLURMQ_CONFIG=/etc/slurmq/production.toml slurmq check ``` ### [Enable enforcement in CI](#enable-enforcement-in-ci) ``` export SLURMQ_ENFORCEMENT__ENABLED=true export SLURMQ_ENFORCEMENT__DRY_RUN=false slurmq monitor --once --enforce ``` ## [Priority](#priority) Environment variables override config file values: 1. CLI flags (highest) 1. Environment variables 1. Config file 1. System config (`/etc/slurmq/config.toml`) 1. Defaults (lowest) ## [Boolean values](#boolean-values) For boolean environment variables, these values are truthy: - `true`, `True`, `TRUE` - `1` - `yes`, `Yes`, `YES` - `on`, `On`, `ON` Everything else is falsy. # [Config File Reference](#config-file-reference) Complete reference for `config.toml`. ## [File locations](#file-locations) slurmq searches for config files in this order: 1. `$SLURMQ_CONFIG` (environment variable) 1. `~/.config/slurmq/config.toml` (user) 1. 
`/etc/slurmq/config.toml` (system) ## [Complete example](#complete-example) ``` # Default cluster when --cluster is not specified default_cluster = "stella" # Cluster definitions [clusters.stella] name = "Stella HPC GPU Cluster" account = "research-group" qos = ["high-priority", "normal", "low"] partitions = ["gpu", "gpu-large"] quota_limit = 500 rolling_window_days = 30 [clusters.stellar] name = "Stellar" account = "astro" qos = ["standard"] partitions = ["gpu"] quota_limit = 200 rolling_window_days = 30 # Monitoring settings [monitoring] check_interval_minutes = 30 warning_threshold = 0.8 critical_threshold = 1.0 # Quota enforcement [enforcement] enabled = false dry_run = true grace_period_hours = 24 cancel_order = "lifo" exempt_users = ["admin", "root"] exempt_job_prefixes = ["checkpoint_", "debug_"] # Email notifications [email] enabled = false sender = "hpc-alerts@example.edu" domain = "example.edu" subject_prefix = "[Stella HPC-GPU]" smtp_host = "smtp.example.edu" smtp_port = 587 smtp_user_env = "SLURMQ_SMTP_USER" smtp_pass_env = "SLURMQ_SMTP_PASS" # Display preferences [display] color = true output_format = "rich" # Cache settings [cache] enabled = true ttl_minutes = 60 directory = "" ``` ## [Section reference](#section-reference) ### [`default_cluster`](#default_cluster) ``` default_cluster = "stella" ``` Name of the cluster to use when `--cluster` is not specified. ### [`[clusters.<name>]`](#clustersname) Define cluster profiles. You can have multiple clusters. | Key | Type | Required | Description | | --------------------- | ------ | -------- | ------------------------- | | `name` | string | No | Display name | | `account` | string | No | Slurm account | | `qos` | array | No | List of QoS names | | `partitions` | array | No | List of partitions | | `quota_limit` | int | Yes | GPU-hours quota | | `rolling_window_days` | int | No | Window size (default: 30) | ### [`[monitoring]`](#monitoring) | Key | Type | Default | Description | | ------------------------ | ----- | ------- | --------------------------- | | `check_interval_minutes` | int | 30 | Monitoring interval | | `warning_threshold` | float | 0.8 | Warn at this percentage | | `critical_threshold` | float | 1.0 | Critical at this percentage | ### [`[enforcement]`](#enforcement) | Key | Type | Default | Description | | --------------------- | ------ | ------- | ----------------------------- | | `enabled` | bool | false | Enable enforcement | | `dry_run` | bool | true | Preview mode | | `grace_period_hours` | int | 24 | Warning window | | `cancel_order` | string | "lifo" | "lifo" or "fifo" | | `exempt_users` | array | [] | Skip these users | | `exempt_job_prefixes` | array | [] | Skip jobs with these prefixes | ### [`[email]`](#email) | Key | Type | Default | Description | | ---------------- | ------ | ------- | -------------------- | | `enabled` | bool | false | Enable email | | `sender` | string | - | From address | | `domain` | string | - | User email domain | | `subject_prefix` | string | "" | Email subject prefix | | `smtp_host` | string | - | SMTP server | | `smtp_port` | int | 587 | SMTP port | | `smtp_user_env` | string | - | Env var for username | | `smtp_pass_env` | string | - | Env var for password | ### [`[display]`](#display) | Key | Type | Default | Description | | --------------- | ------ | ------- | --------------------- | | `color` | bool | true | Enable colors | | `output_format` | string | "rich" | Default output format | ### [`[cache]`](#cache) | Key | Type | Default | Description | | ------------- | ------ | 
------- | --------------- | | `enabled` | bool | true | Enable caching | | `ttl_minutes` | int | 60 | Cache TTL | | `directory` | string | "" | Cache directory | ## [Validation](#validation) Check your config for errors: ``` slurmq config validate ``` ## [Minimal config](#minimal-config) The smallest valid config: ``` default_cluster = "mycluster" [clusters.mycluster] quota_limit = 500 ``` # [Exit Codes](#exit-codes) slurmq uses standard exit codes for scripting. ## [Exit code reference](#exit-code-reference) | Code | Meaning | When | | ---- | -------------- | -------------------------------------- | | 0 | Success | Command completed successfully | | 1 | Error | Configuration error, Slurm error, etc. | | 2 | Quota exceeded | With `--quiet` flag only | ## [Usage in scripts](#usage-in-scripts) ### [Check success](#check-success) ``` if slurmq check; then echo "Quota check passed" else echo "Quota check failed" fi ``` ### [Check specific codes](#check-specific-codes) ``` slurmq --quiet check case $? in 0) echo "Quota OK" ;; 1) echo "Error running check" ;; 2) echo "Quota exceeded" ;; esac ``` ### [Guard job submission](#guard-job-submission) ``` #!/bin/bash set -e # This will exit 2 if quota exceeded slurmq --quiet check # Only reached if quota OK sbatch my_job.sh ``` ## [Quiet mode](#quiet-mode) The `--quiet` flag: - Suppresses normal output - Returns exit code 2 for exceeded quota - Still shows errors on stderr ``` # Silent check slurmq --quiet check echo "Exit code: $?" # Capture errors only if ! slurmq --quiet check 2>/dev/null; then echo "Problem detected" fi ``` ## [Error messages](#error-messages) Errors are always written to stderr: ``` # Separate stdout and stderr slurmq check > /tmp/output.txt 2> /tmp/errors.txt ``` ## [JSON error output](#json-error-output) With `--json`, errors are also JSON: ``` slurmq --json check ``` ``` { "error": "No cluster specified and no default_cluster set" } ``` ## [Best practices](#best-practices) 1. **Use `--quiet` for scripts** - Clean exit codes without parsing output 1. **Check exit codes explicitly** - Don't assume success 1. **Capture stderr** - Errors go there 1. **Use `set -e`** - Exit on first error in shell scripts
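Putting those practices together, a minimal submission guard might look like this. The file paths and job script name are illustrative; the exit codes follow the table above:

```bash
#!/bin/bash
# submit-guarded.sh - the exit-code best practices in one script (sketch)
set -e                                   # exit on first unexpected error

errfile=$(mktemp)
trap 'rm -f "$errfile"' EXIT             # clean up the temp file on any exit

# --quiet gives clean exit codes; errors still land on stderr, captured here.
slurmq --quiet check 2>"$errfile" && status=0 || status=$?

case $status in                          # check exit codes explicitly
    0) sbatch my_job.sh ;;               # quota OK: submit
    2) echo "Quota exceeded; not submitting." >&2; exit 2 ;;
    *) echo "slurmq error:" >&2; cat "$errfile" >&2; exit 1 ;;
esac
```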