How We Score AI Image Generators
Promagen's Builder Quality Intelligence (BQI) is a quantitative benchmark for evaluating how effectively an AI image generator converts your instructions into the intended image. This page explains exactly how BQI works — what we test, how we score, and why the methodology is designed to be transparent and reproducible.
What is Builder Quality Intelligence?
BQI measures prompt intelligence — the platform's ability to understand and execute structured creative direction. Unlike aesthetic rankings or subjective “best looking” lists, BQI tests measurable execution against fixed creative briefs. A platform that faithfully renders every element you specified scores higher than one that produces a beautiful but unfaithful interpretation.
Every BQI score is derived from 40 platforms tested against 8 standardised scenes using a three-layer aggregation process designed to eliminate single-assessor bias.
Core Principles
Objective, not subjective
BQI never scores artistic beauty or style preference. It only scores measurable execution against a fixed creative brief.
Architecturally grounded
Scores reflect underlying prompt architecture — CLIP tokenisation, natural-language parsing, plain-language simplicity — rather than marketing claims.
Triangulated and reproducible
Every score results from a three-layer aggregation process designed to eliminate single-assessor bias.
Transparent and auditable
Full methodology published on this page so anyone can verify the process and understand how scores are derived.
The 8-Scene Test Suite
Each scene is run identically on all 40 platforms using the exact prompt format required by that platform's tier. The scenes are designed to stress-test different aspects of prompt understanding.
| Scene | Name | Purpose | Key Stress Test |
|---|---|---|---|
| 01 | Complex Multi-Subject | Tests ability to handle multiple distinct subjects in one scene | Subject count, spatial relationships, individual attribute retention |
| 02 | Style Stacking | Tests simultaneous application of multiple artistic styles | Style blending, reference consistency, technique layering |
| 03 | Photorealistic Product | Tests commercial-grade photorealism and detail precision | Material accuracy, lighting fidelity, surface texture |
| 04 | Illustrative Narrative | Tests storytelling composition and character expression | Emotional conveyance, narrative coherence, compositional flow |
| 05 | Weather-Driven Environmental | Tests environmental atmosphere and weather effects | Atmospheric depth, weather interaction, lighting conditions |
| 06 | Text and Typography | Tests ability to render legible text within images | Character accuracy, font rendering, text integration |
| 07 | Negative Prompt Handling | Tests correct interpretation and exclusion of negative elements | Exclusion accuracy, positive/negative separation, format compliance |
| 08 | Edge-Case Format Compliance | Tests adherence to tier-specific format requirements | Weight syntax, parameter handling, character limit behaviour |
Three-Layer Aggregation
Each platform receives all 8 test scenes in its native prompt format under identical conditions. No platform gets special treatment — every scene is formatted according to the platform's tier requirements.
Every output is independently scored by multiple large vision-language models and human reviewers on three metrics: Prompt Adherence (40%), Anchor Fidelity (40%), and Format Compliance (20%).
For each metric, the highest and lowest assessor scores are discarded and the median of the remainder is taken, eliminating outlier bias. The three metric medians are then combined, using the weights above, into a single scene score, and the eight scene scores are averaged into the final BQI score (0–100).
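The aggregation above can be sketched in a few lines. This is a minimal illustration of the described arithmetic, not Promagen's actual pipeline; all function and variable names here are assumptions:

```python
from statistics import median

# Metric weights from the methodology:
# Prompt Adherence 40%, Anchor Fidelity 40%, Format Compliance 20%.
WEIGHTS = {"prompt_adherence": 0.4, "anchor_fidelity": 0.4, "format_compliance": 0.2}

def trimmed_median(scores):
    """Drop the single highest and lowest scores, then take the median."""
    if len(scores) <= 2:
        return median(scores)  # too few assessors to trim
    return median(sorted(scores)[1:-1])

def scene_score(assessor_scores):
    """assessor_scores maps each metric to a list of per-assessor scores (0-100)."""
    return sum(
        WEIGHTS[metric] * trimmed_median(scores)
        for metric, scores in assessor_scores.items()
    )

def bqi(scene_scores):
    """Average the eight scene scores into the final 0-100 BQI score."""
    return sum(scene_scores) / len(scene_scores)
```

For example, adherence scores of [80, 85, 90, 95, 100] trim to [85, 90, 95], giving a median of 90 before weighting.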
Scoring Metrics
Prompt Adherence (40%)
How accurately the output reflects the creative brief. Every element specified in the prompt is checked against the generated image.
Anchor Fidelity (40%)
How faithfully the output matches the scene's key anchor elements — the non-negotiable visual requirements that define the scene's identity.
Format Compliance (20%)
How correctly the platform processes the tier-specific prompt format. Weight syntax, negative handling, and character limit behaviour are all tested.
What BQI Measures — and What It Doesn't
BQI measures
Measurable execution against a fixed creative brief: prompt adherence, anchor fidelity, and format compliance, scored identically across all 40 platforms.
BQI does not measure
Artistic beauty, style preference, or subjective aesthetic quality. These matter to users, but they are outside BQI's scope.
Headline Results
Across our 8-scene test suite, the 40 platforms score between 62 and 96 on a 100-point scale. Scores vary significantly by tier and scene complexity.
Full per-platform breakdowns will be published once the scoring pipeline reaches automated maturity and multiple stable batch runs confirm score consistency.
Frequently Asked Questions
How does Promagen rank AI image generators?
Promagen uses Builder Quality Intelligence (BQI) — a quantitative benchmark that tests 40 platforms against 8 standardised scenes. Each output is scored by multiple assessors on prompt adherence, anchor fidelity, and format compliance. Scores are triangulated to eliminate bias and averaged across all scenes for a final score of 0–100.
Is BQI the same as image quality rankings?
No. BQI measures prompt intelligence — how well the platform understands and executes your instructions. A platform could produce stunning images but score lower on BQI if it ignores parts of your prompt. Aesthetic quality, while important, is subjective and outside BQI's scope.
How often are BQI scores updated?
BQI scores are recalculated when platform capabilities change (new models, updated text encoders) or when new test scenes are added to the suite. The scoring pipeline is being automated to support more frequent and reproducible batch runs.
Why don't you show individual platform BQI scores?
The BQI system has completed its first full batch run, but scene calibration is ongoing and several platforms need rescoring after recent tier corrections. Publishing per-platform scores from a maturing system would commit to numbers that may shift. Full breakdowns will be published once the pipeline is stable.
Can I see the test prompts used in BQI?
The 8 scene descriptions and their stress-test parameters are documented on this page. The exact prompt text for each scene is formatted per platform tier and is part of the scoring infrastructure. The methodology is fully transparent — what we test and how we score is published here.