WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs

Abstract

Automatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent Multimodal Large Language Models (MLLMs) have shown promising potential in this area. However, our investigation reveals that most existing MLLMs are constrained by the absence of high-quality, large-scale, real-world datasets, resulting in inadequate performance in automated webpage code generation. To fill this gap, this paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. Sourced from real-world web resources, WebCode2M offers a rich and valuable dataset for webpage code generation across a variety of user scenarios. Dataset quality is ensured by a highly accurate scoring model that filters out instances with aesthetic deficiencies or incomplete elements. To validate the effectiveness of the proposed dataset, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. Additionally, we introduce a new metric, TreeBLEU, to measure structural hierarchy recall. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs, confirming its effectiveness and usability for future applications in front-end design tools. Finally, we highlight several practical challenges introduced by our dataset, calling for further research. We host WebCode2M on an anonymous webpage: https://webcode2m-anonymous.github.io.

Dataset Comparison

Here are representative screenshots of webpages from WebCode2M and other datasets. From left to right: pix2code, WebSight, and our WebCode2M dataset. Compared to the first two artificially synthesized datasets, ours is derived from real-world online websites and shows significantly greater diversity in elements, content, colors, and structural layouts.

Example image.

Pipeline Overview

The aim of this study is to curate a dataset that facilitates training neural models to generate code from webpage designs. Because large-scale human-designed screenshots are hard to collect manually, we instead generate the screenshot images in reverse, by rendering the webpage code from a curated open-source web dataset. The following image illustrates the pipeline for constructing WebCode2M, which comprises code purification, HTML rendering, filtering with a neural scorer, and layout tree extraction.

Example image.
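To make the rendering and layout-extraction steps concrete, below is a minimal Python sketch that renders purified HTML to a screenshot with Playwright and walks the DOM with BeautifulSoup to record a simplified layout tree. The function names and the tree format are our own illustrative assumptions, not the pipeline's actual implementation.

    # Sketch of two pipeline steps: HTML rendering and layout tree extraction.
    # Assumes `pip install playwright beautifulsoup4` and `playwright install chromium`.
    # Function names and the tree format are illustrative, not the authors' code.
    from playwright.sync_api import sync_playwright
    from bs4 import BeautifulSoup, Tag

    def render_screenshot(html: str, out_path: str) -> None:
        """Render purified HTML in a headless browser and save a screenshot."""
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page(viewport={"width": 1280, "height": 720})
            page.set_content(html, wait_until="networkidle")
            page.screenshot(path=out_path, full_page=True)
            browser.close()

    def extract_layout_tree(html: str) -> dict:
        """Recursively convert the DOM into a nested {tag, children} structure."""
        def walk(node: Tag) -> dict:
            children = [walk(c) for c in node.children if isinstance(c, Tag)]
            return {"tag": node.name, "children": children}
        body = BeautifulSoup(html, "html.parser").find("body")
        return walk(body) if body else {}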

Key Characteristics Analysis

Upon acquiring the final WebCode2M dataset, we conduct an analysis to identify several key characteristics. To quantitatively assess its diversity and quality, we employ the same statistical metrics used in Design2Code, facilitating a comparison with other datasets. The results are presented in the following table. Specifically, Avg. Len is the token length as determined by the GPT-2 tokenizer; Avg. Tags is the total number of tags in the HTML code; Avg. Unique Tags is the count of distinct tags in the HTML code; and Avg. DOM Depth is the maximum depth of the HTML DOM tree.

Example image.
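As a concrete reading of these metric definitions, the sketch below computes the four statistics for a single HTML document using the Hugging Face GPT-2 tokenizer and BeautifulSoup. It mirrors the definitions above rather than the exact Design2Code evaluation code.

    # A minimal sketch of the four statistics above for one HTML document.
    # Assumes `pip install transformers beautifulsoup4`; follows the metric
    # definitions in the text, not the exact Design2Code implementation.
    from bs4 import BeautifulSoup, Tag
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def html_stats(html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        tags = soup.find_all(True)  # every element tag in the document

        def depth(node: Tag) -> int:
            kids = [c for c in node.children if isinstance(c, Tag)]
            return 1 + max((depth(k) for k in kids), default=0)

        roots = [c for c in soup.children if isinstance(c, Tag)]
        return {
            "len": len(tokenizer(html)["input_ids"]),    # Avg. Len (tokens)
            "tags": len(tags),                           # Avg. Tags
            "unique_tags": len({t.name for t in tags}),  # Avg. Unique Tags
            "dom_depth": max((depth(r) for r in roots), default=0),  # Avg. DOM Depth
        }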

Benchmark Performance: Specialized Models

The following table presents the performance of WebCoder on both the WebSight and WebCode2M datasets, compared to other benchmark models on the WebSight dataset. From this table, we observe that our method consistently outperforms all specialized baselines across all three metrics on the real-world test dataset, noting that these specialized models were fine-tuned on the WebSight dataset. Comparative experiments also demonstrate that the base model, Pix2Struct, achieves a significant performance boost when fine-tuned on our training dataset rather than on WebSight. For TreeBLEU, a metric measuring the recall of 1-height subtrees in the target DOM tree, our approach surpasses both specialized and general-purpose models, indicating that our model better reflects real-world node types and substructures. Additionally, on the two visual similarity metrics, visual score and CLIP similarity, our model exceeds most general-purpose models and either matches or outperforms GPT-4V. Collectively, these results demonstrate that our dataset offers greater practical potential than synthetically generated datasets and suggest that it can effectively unleash the potential of MLLMs in webpage generation.

Example image.
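To clarify how TreeBLEU can be read, here is a hedged Python sketch of the metric as described above: the recall of 1-height subtrees (a tag together with the sequence of its direct child tags) of the reference DOM that also appear in the generated DOM. This follows the textual definition; the official implementation may differ in details such as subtree matching and normalization.

    # A hedged sketch of TreeBLEU: recall of 1-height subtrees of the
    # reference DOM tree that also occur in the generated DOM tree.
    # Assumes `pip install beautifulsoup4`; not the official implementation.
    from bs4 import BeautifulSoup, Tag

    def one_height_subtrees(html: str) -> set:
        """Collect (tag, direct-child-tags) pairs for every internal DOM node."""
        soup = BeautifulSoup(html, "html.parser")
        subtrees = set()
        for node in soup.find_all(True):
            kids = tuple(c.name for c in node.children if isinstance(c, Tag))
            if kids:  # leaf nodes form no 1-height subtree
                subtrees.add((node.name, kids))
        return subtrees

    def tree_bleu(generated_html: str, reference_html: str) -> float:
        ref = one_height_subtrees(reference_html)
        if not ref:
            return 0.0
        gen = one_height_subtrees(generated_html)
        return len(gen & ref) / len(ref)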

Benchmark Performance: General-purpose MLLMs

The following table benchmarks the performance of several general-purpose MLLMs on the WebCode2M test dataset. From this table, we observe several interesting findings: (1) Generating lengthy code is challenging. Almost all metrics for nearly all models drop significantly as the target code length increases. For example, as the dataset transitions from WebCode2M-short to WebCode2M-mid and finally to WebCode2M-long, the highest TreeBLEU score for specialized models drops from 0.35 to 0.15, the highest CLIP similarity decreases from 0.73 to 0.69, and the highest Visual Score declines from 0.78 to 0.65. (2) Model size matters. In the LLaVA family, several models show significant improvement across all metrics as the parameter count increases, with LLaVA-v1.5-7B and LLaVA-onevision-7B achieving the best performance, while LLaVA-onevision-0.5B performs poorly across all metrics, indicating that MLLMs need more parameters to achieve better results on webpage generation tasks. (3) Most general-purpose MLLMs struggle with webpage code generation. Among these models, only GPT-4V matches the performance of our model trained on WebCode2M, while GPT-4o significantly outperforms all other models. The remaining general-purpose models generally underperform compared to specialized models, with consistently low scores across all metrics.

Example image.
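For reference, the sketch below shows one common way to compute the CLIP similarity reported in these tables: the cosine similarity between CLIP image embeddings of the generated and reference screenshots. The specific checkpoint (ViT-B/32) is our assumption; the benchmark may use a different CLIP variant.

    # A sketch of CLIP similarity between two screenshots: cosine similarity
    # of their CLIP image embeddings. Assumes `pip install transformers torch pillow`;
    # the model choice (openai/clip-vit-base-patch32) is illustrative.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_similarity(img_a: str, img_b: str) -> float:
        images = [Image.open(img_a).convert("RGB"), Image.open(img_b).convert("RGB")]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
        return float(emb[0] @ emb[1])               # cosine similarity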

BibTeX

@misc{si2024WebCode2M,
      title={WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs},
      author={anonymous authors},
      year={2024},
      eprint={xxxx.xxxx},
      archivePrefix={xxxx},
      primaryClass={cs.CL}
}

Usage and License Notices

The data, code, and model checkpoints are intended and licensed for research use only. Please do not use them for any malicious purposes.

The dataset is built on top of the Common Crawl dataset, under the CC-BY-4.0 License.