WebGen-Bench Tests AI's Ability to Build Functional Websites

Benchmarking LLMs on generating and testing multi-file websites

Source: arXiv

A new benchmark, WebGen-Bench, tests whether LLM-based agents can generate complete, multi-file websites from scratch. It pairs a diverse set of website instructions with 647 carefully validated test cases that check whether the generated sites work as specified.

Why it matters: WebGen-Bench evaluates how well AI agents create complex, functional websites, pushing the limits of automated coding.

The big picture: The instructions, written by human annotators together with GPT-4o, span almost all major types of web applications, and a web-navigation agent runs the test cases against each generated site automatically.

Stunning stat: The best-performing setup, the Bolt.diy agent framework driven by DeepSeek-R1, reaches only 27.8% accuracy on the test cases, which shows how hard the task is.

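Quick takeaway: The authors also built a training set of 6,667 website-generation instructions. Fine-tuning Qwen2.5-Coder-32B-Instruct on agent trajectories generated from a subset of it lifts accuracy to 38.2%, ahead of the best proprietary model.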