Andon Labs has released Blueprint-Bench 2, a spatial intelligence benchmark that evaluates AI models on 3D reasoning tasks derived from architectural blueprints, engineering drawings, and geometric problem sets. The results show that frontier models, including GPT-4o and Claude 3.5 Sonnet, perform significantly below the human baseline on the most complex spatial tasks, while specialised multimodal approaches outperform general-purpose chat models by substantial margins. That capability gap directly affects founders building in robotics, CAD, warehouse automation, AR, and spatial computing.
The benchmark's design premise is the right starting point. Most AI evaluation frameworks measure performance on tasks where intelligence is primarily expressed through language: answering questions, writing code, summarising documents, solving mathematical problems stated in text. Blueprint-Bench 2 tests a different kind of intelligence, the ability to parse 2D representations of 3D space, infer geometric relationships, count elements in technical drawings, identify spatial errors, and reason about orientation and scale in contexts where the correct answer requires understanding the physical world the drawing represents rather than just the symbols on the page. The tasks include questions like identifying how many structural columns are present in a floor plan section, determining whether two components in an isometric view will fit together given stated dimensions, and locating a specified room in a multi-floor building layout with unlabelled sections. These are routine tasks for a structural engineer, an architect's assistant, or a robotics path-planning system, and they require reasoning capabilities that language model pretraining on text corpora does not obviously develop.
The model results on Blueprint-Bench 2 follow a pattern that has appeared in every spatial reasoning evaluation published over the past eighteen months: general-purpose frontier models trained primarily on language data perform meaningfully worse on spatial tasks than on comparable language tasks, and the performance gap increases with spatial complexity. GPT-4o, which scores near ceiling on MMLU and at high levels on most coding benchmarks, drops significantly on Blueprint-Bench 2's harder spatial inference tasks. Claude 3.5 Sonnet shows similar patterns. The models that perform best are those that have received specific spatial or visual fine-tuning, either through training on architectural and engineering data, through multimodal pretraining that included 3D scene understanding tasks, or through structured spatial reasoning chain-of-thought approaches that decompose geometric problems into explicit reasoning steps. The benchmark does not yet have public results from Google's Gemini 2.5 Pro, which has shown stronger spatial reasoning in other evaluations, or from models specifically built for engineering and CAD workflows, which would provide the most direct comparison for founders choosing tools for spatial applications.
The benchmark gaming concern is real and worth addressing before treating Blueprint-Bench 2 as a definitive capability signal. Any static benchmark can be overfitted by models that are fine-tuned on its training distribution, and Andon Labs has not yet published full details about how they will handle benchmark contamination as the evaluation gains visibility. The history of AI benchmarks is largely a history of temporary signal followed by rapid saturation as frontier labs include benchmark-adjacent data in training runs and performance converges toward ceiling within twelve to eighteen months of publication. Blueprint-Bench 2 is most useful as a snapshot of where spatial reasoning capabilities stand today, as a framework for identifying which specific spatial task types remain challenging, and as a methodology template for startups that want to evaluate models on their specific geometric use case rather than on a general spatial proxy. Its usefulness as a capability signal will decrease as frontier models train on it explicitly, which is the standard trajectory for publicly available benchmarks regardless of their original design intent.
The product relevance for spatial AI founders is more direct than most benchmark discussions admit. A robotics startup using a vision-language model for scene understanding in a warehouse picking application needs to know whether the model can identify a specific object's orientation from a camera angle that was not present in training. A CAD automation startup needs to know whether the model can parse an engineering drawing and identify tolerance violations without hallucinating dimensions. An AR company building spatial annotation tools needs to know whether the model can anchor virtual labels to specific geometry in a 3D reconstruction. None of these use cases is well characterised by MMLU, SWE-bench, or even vision benchmarks like VQAv2, which test visual question answering on photographic images rather than technical drawings. Blueprint-Bench 2 addresses the technical drawing and spatial inference gap more directly than most prior evaluations, and even if its specific items can eventually be gamed, its task typology gives founders a vocabulary for constructing their own internal evaluations against real production data.
The shift toward multimodal and embodied AI benchmarks that Blueprint-Bench 2 represents is the broader trend worth examining for its investment and product implications. The frontier AI labs have invested heavily in benchmarks that measure the capabilities that their existing training pipelines produce well: language reasoning, code generation, factual recall, instruction following. The benchmarks that expose what those pipelines do not produce well, including spatial reasoning, physical simulation understanding, procedural task execution, and long-horizon planning in environments with state, have been underrepresented in public evaluation frameworks precisely because frontier labs have less commercial incentive to publicise their weaknesses in those domains. Independent benchmark efforts like Blueprint-Bench 2, ARC-AGI, and the embodied AI benchmarks produced by robotics research groups provide the counterweight: they measure the capabilities that matter for the next generation of AI applications without being designed by the organisations whose models will be evaluated on them. That independence is what makes them signal rather than marketing.
For founders choosing between frontier APIs and specialised models for spatial applications, the practical implication of Blueprint-Bench 2's results is not to avoid GPT-4o or Claude, but to evaluate them specifically on your geometric use case before committing to either. A model that performs at 85% on Blueprint-Bench 2's medium-difficulty spatial tasks may perform at 95% on your specific application if your inputs are simpler or more standardised than the benchmark's harder items, and at 60% if your application involves exactly the high-complexity spatial inference patterns where the benchmark shows the largest capability gaps. Internal evaluation on representative samples of production data remains the only reliable way to make that determination, and Blueprint-Bench 2 provides a useful public baseline against which to calibrate what internal evaluation scores actually mean.
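An internal evaluation of this kind can be sketched in a few lines. The example below is only a sketch under stated assumptions: the tier labels, the `(tier, is_correct)` record shape, and the function names are hypothetical and are not part of Blueprint-Bench 2 or any Andon Labs tooling. It reports accuracy per difficulty tier together with a 95% Wilson score interval, because a headline figure such as 95% or 60% means little without knowing how many samples sit behind it.

```python
from collections import defaultdict
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def accuracy_by_tier(results):
    """Summarise graded results per tier.

    `results` is an iterable of (tier, is_correct) pairs, where `tier`
    is whatever label you assign to your own samples (hypothetical
    labels here) and `is_correct` is the outcome of your own grading.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [correct, total]
    for tier, is_correct in results:
        counts[tier][1] += 1
        if is_correct:
            counts[tier][0] += 1

    report = {}
    for tier, (correct, total) in counts.items():
        low, high = wilson_interval(correct, total)
        report[tier] = {
            "n": total,
            "accuracy": correct / total,
            "ci_low": low,
            "ci_high": high,
        }
    return report


if __name__ == "__main__":
    # Illustrative, made-up outcomes: 19/20 on simple drawings, 6/10 on complex ones.
    sample = (
        [("simple", True)] * 19 + [("simple", False)]
        + [("complex", True)] * 6 + [("complex", False)] * 4
    )
    for tier, stats in accuracy_by_tier(sample).items():
        print(tier, stats)
```

With only ten or twenty samples per tier the intervals are wide, which is the practical argument for collecting a larger representative sample of production inputs before choosing between a frontier API and a specialised model.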