BBB Benchmark

The BBB benchmark evaluates the performance of large language models on the Back Bay Battery strategy simulation.

**Read the paper here: How Well Can AI do Strategy? Empirical Benchmarking Using Strategy Simulations**

**View the full GitHub repo here: https://github.com/ryantallen/bbb_benchmark_public**

The task of the simulation is to balance short-term profit (exploiting the core business) with longer-term growth (investing in an emerging technology) across multiple periods under significant uncertainty. To reduce contamination from model pre-knowledge of the simulation, all identifying terms are masked so each model encounters the scenario as if for the first time. Results below show the composite score and underlying metrics, plus correlations with external benchmarks.

BBB Composite Score (ordered by release date)

Composite Score: even-weighted, min–max normalized combination of Cumulative Profit, Cumulative Revenue, and final-year Emerging Tech Revenue.
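The composite definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual code: it assumes each metric is min-max normalized across the set of models being compared, that a metric with no spread normalizes to 0, and that the three normalized values are averaged with equal weight. The function names, metric keys, and example numbers are all hypothetical.

```python
from typing import Dict, List

# Hypothetical metric keys; the real benchmark may name these differently.
METRICS = ("cumulative_profit", "cumulative_revenue", "emerging_tech_revenue")


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(results: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Equal-weight average of the min-max normalized metrics for each model."""
    models = list(results)
    # Normalize each metric across models (assumed normalization scope).
    normalized = {
        metric: min_max_normalize([results[name][metric] for name in models])
        for metric in METRICS
    }
    return {
        name: sum(normalized[metric][i] for metric in METRICS) / len(METRICS)
        for i, name in enumerate(models)
    }


# Made-up placeholder numbers, not benchmark results.
example = {
    "model_a": {"cumulative_profit": 10.0, "cumulative_revenue": 100.0, "emerging_tech_revenue": 0.0},
    "model_b": {"cumulative_profit": 20.0, "cumulative_revenue": 200.0, "emerging_tech_revenue": 50.0},
    "model_c": {"cumulative_profit": 30.0, "cumulative_revenue": 150.0, "emerging_tech_revenue": 100.0},
}
scores = composite_scores(example)
```

Because the normalization here is relative to the models in the comparison, adding or removing a model can shift every composite score; that is a property of this sketch and may or may not match the published scoring.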

BBB composite score over time
Scores improve consistently with release date until a recent regression among the latest state-of-the-art models.

BBB Composite Score (ranked by performance)

Composite Score: even-weighted, min–max normalized combination of Cumulative Profit, Cumulative Revenue, and final-year Emerging Tech Revenue.

BBB composite score across models (masked, Advanced)
Because the simulation involves trade-offs between short-term profit and longer-term growth, the composite score combines the most important outcomes into a single view.

BBB Raw Metrics (ranked by performance)

Breakout of underlying outcomes to show the trade-offs that drive the composite.

Raw metrics 2×2 for BBB (masked, Advanced)
Cumulative Profit, Cumulative Revenue, and Emerging Tech Revenue are displayed to expose trade-offs in the simulation.

BBB vs. GPQA

Relationship between strategy performance and PhD-level science benchmark accuracy (GPQA).

Scatter of BBB composite vs GPQA (masked, Advanced)
Shows the relationship between scientific-reasoning gains and BBB strategic decision-making performance.
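One way to quantify a relationship like this is a Pearson correlation between per-model BBB composite scores and the external benchmark scores. The sketch below is a generic illustration, not the method used for the figures: the source does not state which correlation statistic is reported, and the function name and sample values are hypothetical placeholders.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values, one entry per model; not benchmark results.
bbb_composite = [0.1, 0.4, 0.7]
external_benchmark = [0.2, 0.5, 0.8]
r_example = pearson_r(bbb_composite, external_benchmark)
```

With only a handful of models, a single outlier can move the coefficient substantially, so a rank-based statistic such as Spearman's may be a reasonable alternative.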

BBB vs. LM Arena

Relationship between chat quality ratings (LM Arena) and BBB strategic decision-making performance.

Scatter of BBB composite vs LM Arena (masked, Advanced)
Compares human-rated chat quality to strategic outcomes in simulation.