BBB Composite Score (ordered by release date)
Composite Score: even-weighted, min–max normalized combination of Cumulative Profit, Cumulative Revenue, and final-year Emerging Tech Revenue.
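The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the metric key names and the `composite_scores` helper are hypothetical, and it assumes the min and max for each metric are taken across the set of models being compared, which the text does not specify.

```python
from typing import Dict, List

# Hypothetical key names for the three outcomes named in the definition.
METRICS = (
    "cumulative_profit",
    "cumulative_revenue",
    "final_year_emerging_tech_revenue",
)


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(runs: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the three min-max normalized metrics with equal weights.

    Assumption: normalization is done across the models passed in `runs`.
    """
    names = list(runs)
    normalized = {m: min_max([runs[n][m] for n in names]) for m in METRICS}
    return {
        n: sum(normalized[m][i] for m in METRICS) / len(METRICS)
        for i, n in enumerate(names)
    }
```

Because each metric is rescaled to the same 0 to 1 range before averaging, no single outcome dominates the composite just because it is measured in larger units.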
The BBB benchmark evaluates the performance of large language models on the Back Bay Battery strategy simulation.
**Read the paper here: How Well Can AI do Strategy? Empirical Benchmarking Using Strategy Simulations**
**View the full GitHub repo here: https://github.com/ryantallen/bbb_benchmark_public**
The task of the simulation is to balance short-term profit (exploiting the core business) with longer-term growth (investing in an emerging technology) across multiple periods under significant uncertainty. To reduce contamination from model pre-knowledge of the simulation, all identifying terms are masked so each model encounters the scenario as if for the first time. Results below show the composite score and underlying metrics, plus correlations with external benchmarks.
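The masking step described above could look roughly like the following. This is only a sketch under stated assumptions: the `mask_terms` helper and the replacement mapping are hypothetical, and the benchmark's actual list of masked terms and substitution method are not given here.

```python
import re
from typing import Dict


def mask_terms(text: str, replacements: Dict[str, str]) -> str:
    """Replace each identifying term with a neutral placeholder.

    Longer terms are handled first so that a multi-word name is masked
    before any shorter term it contains. Matching is case-insensitive
    and restricted to whole words.
    """
    for term in sorted(replacements, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
        # A function replacement avoids backslash interpretation in the placeholder.
        text = pattern.sub(lambda _m, r=replacements[term]: r, text)
    return text


# Hypothetical mapping, for illustration only.
example_mapping = {"Back Bay Battery": "Company A"}
masked = mask_terms("Back Bay Battery sells batteries.", example_mapping)
```

Whole-word matching keeps the substitution from touching ordinary words that merely contain a masked term as a substring.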
Breakdown of the underlying outcomes, showing the trade-offs that drive the composite score.
Relationship between strategy performance and PhD-level science benchmark accuracy (GPQA).
Relationship between chat quality ratings (LM Arena) and BBB strategic decision-making performance.
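One way to quantify the relationships with external benchmarks is a correlation coefficient computed over per-model scores. The sketch below uses the Pearson coefficient as an assumption; the statistic actually used is not stated here, and the `pearson` helper is hypothetical. The inputs would be one composite score and one external benchmark score (for example GPQA accuracy or an LM Arena rating) per model, in the same order.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near 1 would indicate that models scoring higher on the external benchmark also tend to score higher on BBB; a value near 0 would indicate little linear relationship.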