How We Score AI Image Generators
Promagen's Builder Quality Intelligence (BQI) is a quantitative benchmark for evaluating how effectively an AI image generator converts your instructions into the intended image. This page explains exactly how BQI works — what we test, how we score, and why the methodology is designed to be transparent and reproducible.
What is Builder Quality Intelligence?
BQI measures prompt intelligence — the platform's ability to understand and execute structured creative direction. Unlike aesthetic rankings or subjective “best looking” lists, BQI tests measurable execution against fixed creative briefs. A platform that faithfully renders every element you specified scores higher than one that produces a beautiful but unfaithful interpretation.
Every BQI score is derived from 40 platforms tested against 8 standardised scenes using a three-layer aggregation process designed to eliminate single-assessor bias.
Core Principles
Objective, not subjective
BQI never scores artistic beauty or style preference. It only scores measurable execution against a fixed creative brief.
Architecturally grounded
Scores reflect underlying prompt architecture — CLIP tokenisation, natural-language parsing, plain-language simplicity — rather than marketing claims.
Triangulated and reproducible
Every score results from a three-layer aggregation process designed to eliminate single-assessor bias.
Transparent and auditable
Full methodology published on this page so anyone can verify the process and understand how scores are derived.
The 8-Scene Test Suite
Each scene is run identically on all 40 platforms using the exact prompt format required by that platform's tier. The scenes are designed to stress-test different aspects of prompt understanding.
| Scene | Name | Purpose | Key Stress Test |
|---|---|---|---|
| 01 | Complex Multi-Subject | Tests ability to handle multiple distinct subjects in one scene | Subject count, spatial relationships, individual attribute retention |
| 02 | Style Stacking | Tests simultaneous application of multiple artistic styles | Style blending, reference consistency, technique layering |
| 03 | Photorealistic Product | Tests commercial-grade photorealism and detail precision | Material accuracy, lighting fidelity, surface texture |
| 04 | Illustrative Narrative | Tests storytelling composition and character expression | Emotional conveyance, narrative coherence, compositional flow |
| 05 | Weather-Driven Environmental | Tests environmental atmosphere and weather effects | Atmospheric depth, weather interaction, lighting conditions |
| 06 | Text and Typography | Tests ability to render legible text within images | Character accuracy, font rendering, text integration |
| 07 | Negative Prompt Handling | Tests correct interpretation and exclusion of negative elements | Exclusion accuracy, positive/negative separation, format compliance |
| 08 | Edge-Case Format Compliance | Tests adherence to tier-specific format requirements | Weight syntax, parameter handling, character limit behaviour |
Three-Layer Aggregation
Each platform receives all 8 test scenes in its native prompt format under identical conditions. No platform gets special treatment — every scene is formatted according to the platform's tier requirements.
Every output is independently scored by multiple large vision-language models and human reviewers on three metrics: Prompt Adherence (40%), Anchor Fidelity (40%), and Format Compliance (20%).
For each metric, the highest and lowest assessor scores are discarded and the median of the remainder is taken, eliminating outlier bias. The three metric medians are then combined, using the weights above, into a single scene score, and the eight scene scores are averaged into the final BQI score (0–100).
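The aggregation above can be sketched in a few lines. This is a minimal illustration of the described arithmetic, not Promagen's actual pipeline; all function and variable names here are assumptions:

```python
from statistics import median

# Metric weights from the methodology:
# Prompt Adherence 40%, Anchor Fidelity 40%, Format Compliance 20%.
WEIGHTS = {"prompt_adherence": 0.4, "anchor_fidelity": 0.4, "format_compliance": 0.2}

def trimmed_median(scores):
    """Drop the single highest and lowest scores, then take the median."""
    if len(scores) <= 2:
        return median(scores)  # too few assessors to trim
    return median(sorted(scores)[1:-1])

def scene_score(assessor_scores):
    """assessor_scores maps each metric to a list of per-assessor scores (0-100)."""
    return sum(
        WEIGHTS[metric] * trimmed_median(scores)
        for metric, scores in assessor_scores.items()
    )

def bqi(scene_scores):
    """Average the eight scene scores into the final 0-100 BQI score."""
    return sum(scene_scores) / len(scene_scores)
```

For example, adherence scores of [80, 85, 90, 95, 100] trim to [85, 90, 95], giving a median of 90 before weighting.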
Scoring Metrics
Prompt Adherence (40%)
How accurately the output reflects the creative brief. Every element specified in the prompt is checked against the generated image.
Anchor Fidelity (40%)
How faithfully the output matches the scene's key anchor elements — the non-negotiable visual requirements that define the scene's identity.
Format Compliance (20%)
How correctly the platform processes the tier-specific prompt format. Weight syntax, negative handling, and character limit behaviour are all tested.
What BQI Measures — and What It Doesn't
BQI measures
Measurable execution against a fixed creative brief: prompt adherence, anchor fidelity, and format compliance, scored identically across all 40 platforms.
BQI does not measure
Artistic beauty, style preference, or subjective aesthetic quality. These matter to users, but they are outside BQI's scope.
Headline Results
Across our 8-scene test suite, the 40 platforms score between 62 and 96 on a 100-point scale. Scores vary significantly by tier and scene complexity.
Full per-platform breakdowns will be published once the scoring pipeline reaches automated maturity and multiple stable batch runs confirm score consistency.
Frequently Asked Questions
How does Promagen rank AI image generators?
Promagen uses Builder Quality Intelligence (BQI) — a quantitative benchmark that tests 40 platforms against 8 standardised scenes. Each output is scored by multiple assessors on prompt adherence, anchor fidelity, and format compliance. Scores are triangulated to eliminate bias and averaged across all scenes for a final score of 0–100.
Is BQI the same as image quality rankings?
No. BQI measures prompt intelligence — how well the platform understands and executes your instructions. A platform could produce stunning images but score lower on BQI if it ignores parts of your prompt. Aesthetic quality, while important, is subjective and outside BQI's scope.
How often are BQI scores updated?
BQI scores are recalculated when platform capabilities change (new models, updated text encoders) or when new test scenes are added to the suite. The scoring pipeline is being automated to support more frequent and reproducible batch runs.
Why don't you show individual platform BQI scores?
The BQI system has completed its first full batch run, but scene calibration is ongoing and several platforms need rescoring after recent tier corrections. Publishing per-platform scores from a maturing system would commit to numbers that may shift. Full breakdowns will be published once the pipeline is stable.
Can I see the test prompts used in BQI?
The 8 scene descriptions and their stress-test parameters are documented on this page. The exact prompt text for each scene is formatted per platform tier and is part of the scoring infrastructure. The methodology is fully transparent — what we test and how we score is published here.