Open source · Python · MIT
██╗     ██╗     ███╗   ███╗██████╗ ███████╗███╗   ██╗ ██████╗██╗  ██╗
██║     ██║     ████╗ ████║██╔══██╗██╔════╝████╗  ██║██╔════╝██║  ██║
██║     ██║     ██╔████╔██║██████╔╝█████╗  ██╔██╗ ██║██║     ███████║
██║     ██║     ██║╚██╔╝██║██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══██║
███████╗███████╗██║ ╚═╝ ██║██████╔╝███████╗██║ ╚████║╚██████╗██║  ██║
╚══════╝╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═══╝ ╚═════╝╚═╝  ╚═╝
Benchmark any AI model from any provider with one command. llmbench is a CLI harness for measuring throughput, quality, and image generation across Claude, OpenAI, Ollama, vLLM, and anything OpenAI-compatible, plus a unified leaderboard explorer for published scores.
One command, any Python-capable machine. `uvx` and `pipx run` are Python's equivalents of `npx`.
uvx llmbench
pipx run llmbench
pip install llmbench
Requires Python 3.11+. Full install options, the `suite.yaml` schema, and provider setup are documented on GitHub.
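The authoritative `suite.yaml` schema lives in the GitHub docs; as a rough sketch only, a suite file might take a shape like this (every key below is an illustrative assumption, not the published schema):

```yaml
# suite.yaml — illustrative sketch only; the real llmbench schema is
# documented on GitHub. All keys and values here are assumptions.
name: quick-throughput-check
providers:
  - kind: openai-compatible    # e.g. vLLM or Ollama serving an OpenAI-style API
    base_url: http://localhost:11434/v1
    model: llama3
tasks:
  - id: short-completion
    prompt: "Summarize the plot of Hamlet in two sentences."
    metrics: [tokens_per_second, latency]
runs: 5                        # repeat each task and average the numbers
```

If the CLI follows the usual pattern of taking a suite file as an argument, the invocation would look something like `uvx llmbench --suite suite.yaml` — but check the repo for the actual flags.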
Published model scores from the Hugging Face Open LLM Leaderboard v2, LMArena Elo, Aider Polyglot, and a bundled snapshot — searchable in one place. Click a column header to sort, toggle sources, or filter by name.
| # | Model | Organization | Source | Score | Metric |
|---|-------|--------------|--------|-------|--------|
Published numbers are not directly comparable to each other: different environments, different prompts, different scoring. Treat them as context, not ground truth. The data is refreshed daily by GitHub Actions.
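A daily refresh like that is typically wired up as a scheduled workflow. The following is a minimal sketch of the pattern, not the project's actual workflow; the script path and commit message are assumptions:

```yaml
# .github/workflows/refresh.yml — sketch of a daily leaderboard refresh.
# The real workflow in the repo may differ; script and file names are assumed.
name: refresh-leaderboard
on:
  schedule:
    - cron: "0 6 * * *"    # every day at 06:00 UTC
  workflow_dispatch:       # allow manual runs too
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e .
      - run: python scripts/fetch_scores.py   # hypothetical fetch script
      - uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "chore: daily leaderboard snapshot"
```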