VisionWebDev: A Hierarchical Benchmark for Visual Website Development with Agent Verification
VisionWebDev is a benchmark for evaluating whether multimodal coding agents can build real websites from visual prototypes and structured requirements. It goes beyond small code edits and static UI generation to measure end-to-end web development ability in realistic settings.
Each task provides multimodal inputs such as UI prototype images, requirement descriptions, and development assets. Agents are expected to generate executable websites that satisfy both functional behavior and visual fidelity.
To support reliable evaluation, VisionWebDev introduces an automated verification framework that combines workflow-driven GUI testing with VLM-based visual judging.
Existing coding benchmarks mainly focus on localized code edits, while most multimodal website benchmarks are limited to static webpage reproduction. These settings do not fully capture the complexity of modern web development, where agents must reason over visual layouts, interaction flows, application state, and system behavior across multiple pages.
VisionWebDev closes this gap by evaluating the full spectrum of visual website development, from responsive UI implementation to interactive frontend engineering and complete full-stack applications.
Categories
VisionWebDev spans 16 subcategories across 4 major domains and covers progressively harder development settings, from static responsive webpages to interaction-heavy frontends and requirement-driven full-stack systems.
Leaderboard columns are grouped into Level 1: Static Webpage (Desktop, Tablet, Mobile, Avg), Level 2: Interactive (VS, FS, Avg), and Level 3: Full-Stack (VS, FS, Avg), where VS denotes the visual score and FS the functional score.

| # | Model + Framework | Params | Date | Overall | Desktop (L1) | Tablet (L1) | Mobile (L1) | Avg (L1) | VS (L2) | FS (L2) | Avg (L2) | VS (L3) | FS (L3) | Avg (L3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
We welcome submissions of new model results to the VisionWebDev leaderboard. Please follow the steps below to submit your evaluation results.
First, clone the VisionWebDev repository and install the required dependencies:
```shell
git clone https://github.com/VisionWebDev/VisionWebDev.git
cd VisionWebDev
pip install -e .
```
Then run the inference and evaluation script with your model configuration.
Before submitting, organize your evaluation results using the following directory structure:
```
agent/
└── model/
    └── metadata.json
task/
└── project/
    └── test_results/
```
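The layout above can be created in one step. This is a minimal sketch: the nesting (metadata.json under agent/model/, test_results/ under task/project/) is inferred from the listing and the requirements below, and the heredoc only writes placeholder identity fields that you must replace with real values.

```shell
# Create the expected submission layout (nesting assumed from the listing above).
mkdir -p agent/model task/project/test_results

# Placeholder metadata file; fill in real scores before submitting.
cat > agent/model/metadata.json <<'EOF'
{
  "name": "model-name + agent-framework",
  "org": "organization-name",
  "date": "YYYY-MM-DD"
}
EOF
```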
Requirements:
- Submit the complete output of the evaluate script, including the full GUI Agent execution traces and VLM Judge evaluation outputs.
- Include a metadata.json file under agent/model/ that summarizes the final benchmark results for leaderboard submission.

Required metadata.json format:
```json
{
  "name": "model-name + agent-framework",
  "org": "organization-name",
  "date": "YYYY-MM-DD",
  "webpage": {
    "desktop": score,
    "tablet": score,
    "mobile": score,
    "avg": average-score
  },
  "frontend": {
    "vs": visual-score,
    "fs": functional-score,
    "avg": average-score
  },
  "website": {
    "vs": visual-score,
    "fs": functional-score,
    "avg": average-score
  },
  "overall": overall-score
}
```
All scores should be scaled to 0–100 and rounded to one decimal place.
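A quick sanity check before submitting can catch most formatting mistakes. The sketch below is a hypothetical helper, not part of the official tooling: the field names follow the template above, and the range and one-decimal-rounding checks follow the stated scaling rule.

```python
# Hypothetical pre-submission validator for a metadata.json entry.
# Field names mirror the template above; not part of the official tooling.
REQUIRED_LEVELS = {
    "webpage": ("desktop", "tablet", "mobile", "avg"),
    "frontend": ("vs", "fs", "avg"),
    "website": ("vs", "fs", "avg"),
}

def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems found; an empty list means the entry looks OK."""
    problems = []
    for key in ("name", "org", "date"):
        if not isinstance(meta.get(key), str):
            problems.append(f"missing or non-string field: {key}")
    for level, fields in REQUIRED_LEVELS.items():
        block = meta.get(level)
        if not isinstance(block, dict):
            problems.append(f"missing section: {level}")
            continue
        for f in fields:
            v = block.get(f)
            if not isinstance(v, (int, float)) or not 0 <= v <= 100:
                problems.append(f"{level}.{f} must be a number in [0, 100]")
            elif round(v, 1) != v:
                problems.append(f"{level}.{f} should be rounded to one decimal")
    if not isinstance(meta.get("overall"), (int, float)):
        problems.append("missing numeric field: overall")
    return problems

# Example entry with dummy scores, for illustration only.
example = {
    "name": "demo-model + demo-agent", "org": "example-org", "date": "2026-01-01",
    "webpage": {"desktop": 61.2, "tablet": 58.4, "mobile": 55.0, "avg": 58.2},
    "frontend": {"vs": 44.7, "fs": 39.1, "avg": 41.9},
    "website": {"vs": 30.5, "fs": 25.3, "avg": 27.9},
    "overall": 42.7,
}
print(validate_metadata(example))  # [] when the entry passes all checks
```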
Fork the VisionWebDev leaderboard repository on Hugging Face and submit your results via a Pull Request.
Before opening the Pull Request, check that:

- Your results follow the required directory structure (agent/model/...).
- Your metadata.json file correctly summarizes your evaluation results.

Our team will review your submission for correctness and consistency. Once approved, your results will appear on the public leaderboard.
If you have any questions about the evaluation or submission process, please open an issue on our GitHub Issues page.
If you use VisionWebDev in your research, please cite our paper:
```bibtex
@article{he2026visionwebdev,
  title={VisionWebDev: A Hierarchical Benchmark for Visual Website Development with Agent Verification},
  author={He, Zehai and Hong, Wenyi and Yang, Zhen and Pan, Ziyang and Liu, Mingdao and Gu, Xiaotao and Tang, Jie},
  journal={arXiv preprint},
  year={2026}
}
```
VisionWebDev is released under the CC BY-SA 4.0 license.