Open source · Python · MIT
██╗     ██╗     ███╗   ███╗██████╗ ███████╗███╗   ██╗ ██████╗██╗  ██╗
██║     ██║     ████╗ ████║██╔══██╗██╔════╝████╗  ██║██╔════╝██║  ██║
██║     ██║     ██╔████╔██║██████╔╝█████╗  ██╔██╗ ██║██║     ███████║
██║     ██║     ██║╚██╔╝██║██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══██║
███████╗███████╗██║ ╚═╝ ██║██████╔╝███████╗██║ ╚████║╚██████╗██║  ██║
╚══════╝╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═══╝ ╚═════╝╚═╝  ╚═╝
Benchmark any AI model from any provider with one command. llmbench is a CLI harness for measuring throughput, quality, and image generation across Claude, OpenAI, Ollama, vLLM, and anything OpenAI-compatible, plus a unified leaderboard explorer for published scores.
One command, any Python-capable machine. `uvx` and `pipx run` are Python's equivalents of `npx`.
uvx llmbench
pipx run llmbench
pip install llmbench
Requires Python 3.11+. Full install options, the `suite.yaml` schema, and provider setup are documented on GitHub.
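The authoritative `suite.yaml` schema lives in the GitHub docs; as a rough sketch only, a suite file might take a shape like this (every key below is an illustrative assumption, not the published schema):

```yaml
# suite.yaml — illustrative sketch only; the real llmbench schema is
# documented on GitHub. All keys and values here are assumptions.
name: quick-throughput-check
providers:
  - kind: openai-compatible    # e.g. vLLM or Ollama serving an OpenAI-style API
    base_url: http://localhost:11434/v1
    model: llama3
tasks:
  - id: short-completion
    prompt: "Summarize the plot of Hamlet in two sentences."
    metrics: [tokens_per_second, latency]
runs: 5                        # repeat each task and average the numbers
```

If the CLI follows the usual pattern of taking a suite file as an argument, the invocation would look something like `uvx llmbench --suite suite.yaml` — but check the repo for the actual flags.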
Published model scores from the Hugging Face Open LLM Leaderboard v2, LMArena Elo, Aider Polyglot, and a bundled snapshot — searchable in one place. Click a column header to sort, toggle sources, or filter by name.
| # | Model | Organization | Source | Score | Metric |
|---|-------|--------------|--------|-------|--------|
Published numbers are not directly comparable to each other: different environments, different prompts, different scoring. Treat them as context, not ground truth. The data is refreshed daily by GitHub Actions.
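A daily refresh like that is typically wired up as a scheduled workflow. The following is a minimal sketch of the pattern, not the project's actual workflow; the script path and commit message are assumptions:

```yaml
# .github/workflows/refresh.yml — sketch of a daily leaderboard refresh.
# The real workflow in the repo may differ; script and file names are assumed.
name: refresh-leaderboard
on:
  schedule:
    - cron: "0 6 * * *"    # every day at 06:00 UTC
  workflow_dispatch:       # allow manual runs too
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e .
      - run: python scripts/fetch_scores.py   # hypothetical fetch script
      - uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "chore: daily leaderboard snapshot"
```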