F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Lu Liu1, Huiyu Duan1, Qiang Hu*1, Liu Yang1, Chunlei Cai2, Tianxiao Ye2, Huayu Liu1, Xiaoyun Zhang1, Guangtao Zhai1
1Shanghai Jiao Tong University, 2Bilibili Inc.
✨ ICCV 2025 Highlight ✨
[arXiv] [Paper] [Code]

We present FaceQ, F-Bench, and F-Eval: the first AI-generated Face (AIGF) quality assessment database, benchmark, and model, respectively.

Recent artificial intelligence (AI) generative models have demonstrated remarkable capabilities in image production, and have been widely applied to face image generation, customization, and restoration. However, many AI-generated faces (AIGFs) still suffer from issues such as unique distortions, unrealistic details, and unexpected identity shifts, underscoring the need for a comprehensive quality evaluation method for AIGFs.

To this end, we introduce FaceQ, the first comprehensive AI-generated Face image database with fine-grained Quality annotations aligned with human preferences, which consists of 12K images and 491K ratings across multiple dimensions. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting their strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA) methods on FaceQ, and further propose a large multimodal model (LMM) based Face quality Evaluator (F-Eval) to accurately assess the multi-dimensional quality of generated faces in a one-for-all manner. Extensive experimental results demonstrate the state-of-the-art performance of our F-Eval.




FaceQ

An overview of the content of FaceQ: rating comparisons across eight dimensions. Each column presents a pair of intuitive examples for one dimension, with red indicating the better rating and blue the worse one. From left to right: the face generation, face customization, and face restoration subsets. The last row displays the corresponding prompts, reference image-prompt pairs, and GT-LQ image pairs.
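For context, subjective ratings in databases like FaceQ are conventionally aggregated into mean opinion scores (MOS) after per-subject normalization. The sketch below illustrates the common ITU-style recipe (z-score normalization followed by averaging and rescaling); the array shapes and values are illustrative assumptions, not the released FaceQ annotation format or protocol.

import numpy as np

def compute_mos(ratings: np.ndarray) -> np.ndarray:
    """Aggregate raw subjective ratings into MOS.

    ratings: array of shape (n_subjects, n_images) with raw scores
    (e.g., 1-5). Per-subject z-score normalization removes individual
    rating-scale bias before averaging over subjects; the result is
    rescaled to [0, 100]. This is the standard recipe, not necessarily
    the exact FaceQ procedure.
    """
    mean = ratings.mean(axis=1, keepdims=True)
    std = ratings.std(axis=1, keepdims=True) + 1e-8
    z = (ratings - mean) / std              # per-subject z-scores
    z_mean = z.mean(axis=0)                 # average over subjects
    lo, hi = z_mean.min(), z_mean.max()
    return 100.0 * (z_mean - lo) / (hi - lo + 1e-8)

# Toy example: 3 subjects rating 4 images on one dimension.
ratings = np.array([[5, 3, 2, 4],
                    [4, 3, 1, 5],
                    [5, 2, 2, 4]], dtype=float)
print(compute_mos(ratings))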






F-Bench

Average MOS comparison across all models and dimensions. (a) Face generation. (b) Face customization. (c) Face restoration. Models are arranged clockwise by release date.
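A comparison like this is typically rendered as a polar (radar) chart. The minimal matplotlib sketch below shows one way to arrange per-model average MOS clockwise from the top; the model names and scores are placeholders, not numbers from F-Bench.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical average MOS per model (placeholder values only).
models = ["Model-A", "Model-B", "Model-C", "Model-D", "Model-E"]
mos = [62.1, 70.4, 55.3, 81.0, 74.6]

angles = np.linspace(0, 2 * np.pi, len(models), endpoint=False)
ax = plt.subplot(projection="polar")
ax.set_theta_zero_location("N")   # start at the top
ax.set_theta_direction(-1)        # clockwise, matching release order
ax.plot(np.r_[angles, angles[0]], np.r_[mos, mos[0]], marker="o")
ax.set_xticks(angles)
ax.set_xticklabels(models)
ax.set_title("Average MOS per model (placeholder data)")
plt.show()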






F-Eval

The overall framework of F-Eval. F-Eval evaluates quality, authenticity, correspondence, and identity fidelity in a one-for-all framework. It can process both single and paired images, along with prompts, to produce quality scores. It consists of three encoders, a vision encoder, a face encoder, and a text tokenizer, to process the multi-modal inputs. Their features are projected into a shared space by trained projectors. A pre-trained large language model fuses the features and is fine-tuned with four LoRA experts; the appropriate LoRA expert is activated by a dimension ID predicted by a trainable router.
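To make the routing mechanism concrete, here is a minimal PyTorch sketch of a dimension-ID router selecting one of four LoRA experts on top of fused LLM features. The hidden size, module names, and hard-argmax routing are assumptions for illustration, not the released F-Eval implementation.

import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank adapter applied as a residual on top of frozen weights."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity residual

    def forward(self, x):
        return self.up(self.down(x))

class RoutedLoRA(nn.Module):
    """Trainable router picks one of four dimension-specific experts
    (quality, authenticity, correspondence, identity fidelity)."""
    def __init__(self, dim: int = 4096, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # dimension-ID classifier
        self.experts = nn.ModuleList([LoRAExpert(dim) for _ in range(n_experts)])

    def forward(self, fused_tokens: torch.Tensor):
        # fused_tokens: (batch, seq, dim) features from the LLM backbone,
        # already carrying projected vision, face, and text embeddings.
        logits = self.router(fused_tokens.mean(dim=1))  # (batch, n_experts)
        dim_id = logits.argmax(dim=-1)                  # hard routing per sample
        out = torch.stack([self.experts[i](fused_tokens[b])
                           for b, i in enumerate(dim_id.tolist())])
        return out + fused_tokens, dim_id

tokens = torch.randn(2, 16, 4096)
adapted, dim_id = RoutedLoRA()(tokens)
print(adapted.shape, dim_id)

Note that hard argmax routing is not differentiable; in a setup like the one the caption describes, the router would be trained with a classification loss on the known dimension ID rather than through the routing decision itself.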






Performance on generation and customization tasks

Performance of state-of-the-art methods and the proposed F-Eval on our established FaceQ database in terms of the quality scoring task.
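Quality scoring performance in IQA benchmarks is conventionally reported as Spearman (SRCC) and Pearson (PLCC) correlations between predicted scores and ground-truth MOS. A brief sketch with scipy, using dummy numbers rather than any predictions from the paper:

import numpy as np
from scipy import stats

def score_predictions(pred: np.ndarray, mos: np.ndarray):
    """Return (SRCC, PLCC) between predicted scores and ground-truth MOS."""
    srcc, _ = stats.spearmanr(pred, mos)   # rank-order correlation
    plcc, _ = stats.pearsonr(pred, mos)    # linear correlation
    return srcc, plcc

# Dummy data only; not results from the paper.
rng = np.random.default_rng(0)
mos = rng.uniform(0, 100, size=50)
pred = mos + rng.normal(0, 10, size=50)    # noisy but correlated predictor
print(score_predictions(pred, mos))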






Performance on restoration task

Performance comparison between state-of-the-art IQA methods and the proposed F-Eval on the face restoration subset of FaceQ.




BibTeX

@inproceedings{liu2025fbench,
      title={F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration},
      author={Liu, Lu and Duan, Huiyu and Hu, Qiang and Yang, Liu and Cai, Chunlei and Ye, Tianxiao and Liu, Huayu and Zhang, Xiaoyun and Zhai, Guangtao},
      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      year={2025}
}