F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Lu Liu1, Huiyu Duan1, Qiang Hu*1, Liu Yang1, Chunlei Cai2, Tianxiao Ye2, Huayu Liu1, Xiaoyun Zhang1, Guangtao Zhai1
1Shanghai Jiao Tong University, 2Bilibili Inc.
✨ ICCV 2025 Highlight ✨
[arXiv] [Paper] [Code]

We present FaceQ, F-Bench, and F-Eval: the first AI-generated Face (AIGF) quality assessment database, benchmark, and model, respectively.

Recent artificial intelligence (AI) generative models have demonstrated remarkable capabilities in image production, and have been widely applied to face image generation, customization, and restoration. However, many AI-generated faces (AIGFs) still suffer from issues such as unique distortions, unrealistic details, and unexpected identity shifts, underscoring the need for a comprehensive quality evaluation method for AIGFs.

To this end, we introduce FaceQ, the first comprehensive AI-generated Face image database with fine-grained Quality annotations aligned with human preferences, which consists of 12K images and 491K ratings across multiple dimensions. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting their strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA) methods on FaceQ, and further propose a large multimodal model (LMM) based Face quality Evaluator (F-Eval) to accurately assess the multi-dimensional quality of generated faces in a one-for-all manner. Extensive experimental results demonstrate the state-of-the-art performance of our F-Eval.




FaceQ

An overview of the content of FaceQ: rating comparisons across eight dimensions. Each column presents a pair of intuitive examples for one dimension, with red indicating the better rating and blue the worse one. From left to right: the face generation, face customization, and face restoration subsets. The last row displays the corresponding prompts, reference image-prompt pairs, and GT-LQ image pairs.
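For context, subjective ratings in databases like FaceQ are conventionally aggregated into mean opinion scores (MOS) after per-subject normalization. The sketch below illustrates the common ITU-style recipe (z-score normalization followed by averaging and rescaling); the array shapes and values are illustrative assumptions, not the released FaceQ annotation format or protocol.

import numpy as np

def compute_mos(ratings: np.ndarray) -> np.ndarray:
    """Aggregate raw subjective ratings into MOS.

    ratings: array of shape (n_subjects, n_images) with raw scores
    (e.g., 1-5). Per-subject z-score normalization removes individual
    rating-scale bias before averaging over subjects; the result is
    rescaled to [0, 100]. This is the standard recipe, not necessarily
    the exact FaceQ procedure.
    """
    mean = ratings.mean(axis=1, keepdims=True)
    std = ratings.std(axis=1, keepdims=True) + 1e-8
    z = (ratings - mean) / std              # per-subject z-scores
    z_mean = z.mean(axis=0)                 # average over subjects
    lo, hi = z_mean.min(), z_mean.max()
    return 100.0 * (z_mean - lo) / (hi - lo + 1e-8)

# Toy example: 3 subjects rating 4 images on one dimension.
ratings = np.array([[5, 3, 2, 4],
                    [4, 3, 1, 5],
                    [5, 2, 2, 4]], dtype=float)
print(compute_mos(ratings))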






F-Bench

Average MOS comparison across all models and dimensions. (a) Face generation. (b) Face customization. (c) Face restoration. Models are arranged clockwise by release date.
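A comparison like this is typically rendered as a polar (radar) chart. The minimal matplotlib sketch below shows one way to arrange per-model average MOS clockwise from the top; the model names and scores are placeholders, not numbers from F-Bench.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical average MOS per model (placeholder values only).
models = ["Model-A", "Model-B", "Model-C", "Model-D", "Model-E"]
mos = [62.1, 70.4, 55.3, 81.0, 74.6]

angles = np.linspace(0, 2 * np.pi, len(models), endpoint=False)
ax = plt.subplot(projection="polar")
ax.set_theta_zero_location("N")   # start at the top
ax.set_theta_direction(-1)        # clockwise, matching release order
ax.plot(np.r_[angles, angles[0]], np.r_[mos, mos[0]], marker="o")
ax.set_xticks(angles)
ax.set_xticklabels(models)
ax.set_title("Average MOS per model (placeholder data)")
plt.show()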






F-Eval

The overall framework of F-Eval. F-Eval evaluates quality, authenticity, correspondence, and identity fidelity in a one-for-all framework. It can process both single and paired images, along with prompts, to produce quality scores. It consists of three encoders, a vision encoder, a face encoder, and a text tokenizer, to process the multi-modal inputs. Their features are projected into a shared space by trained projectors. A pre-trained large language model fuses the features and is fine-tuned with four LoRA experts; the appropriate LoRA expert is activated by a dimension ID predicted by a trainable router.
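To make the routing mechanism concrete, here is a minimal PyTorch sketch of a dimension-ID router selecting one of four LoRA experts on top of fused LLM features. The hidden size, module names, and hard-argmax routing are assumptions for illustration, not the released F-Eval implementation.

import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank adapter applied as a residual on top of frozen weights."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity residual

    def forward(self, x):
        return self.up(self.down(x))

class RoutedLoRA(nn.Module):
    """Trainable router picks one of four dimension-specific experts
    (quality, authenticity, correspondence, identity fidelity)."""
    def __init__(self, dim: int = 4096, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # dimension-ID classifier
        self.experts = nn.ModuleList([LoRAExpert(dim) for _ in range(n_experts)])

    def forward(self, fused_tokens: torch.Tensor):
        # fused_tokens: (batch, seq, dim) features from the LLM backbone,
        # already carrying projected vision, face, and text embeddings.
        logits = self.router(fused_tokens.mean(dim=1))  # (batch, n_experts)
        dim_id = logits.argmax(dim=-1)                  # hard routing per sample
        out = torch.stack([self.experts[i](fused_tokens[b])
                           for b, i in enumerate(dim_id.tolist())])
        return out + fused_tokens, dim_id

tokens = torch.randn(2, 16, 4096)
adapted, dim_id = RoutedLoRA()(tokens)
print(adapted.shape, dim_id)

Note that hard argmax routing is not differentiable; in a setup like the one the caption describes, the router would be trained with a classification loss on the known dimension ID rather than through the routing decision itself.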






Performance on generation and customization tasks

Performance of state-of-the-art methods and the proposed F-Eval on our established FaceQ database in terms of the quality scoring task.
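Quality scoring performance in IQA benchmarks is conventionally reported as Spearman (SRCC) and Pearson (PLCC) correlations between predicted scores and ground-truth MOS. A brief sketch with scipy, using dummy numbers rather than any predictions from the paper:

import numpy as np
from scipy import stats

def score_predictions(pred: np.ndarray, mos: np.ndarray):
    """Return (SRCC, PLCC) between predicted scores and ground-truth MOS."""
    srcc, _ = stats.spearmanr(pred, mos)   # rank-order correlation
    plcc, _ = stats.pearsonr(pred, mos)    # linear correlation
    return srcc, plcc

# Dummy data only; not results from the paper.
rng = np.random.default_rng(0)
mos = rng.uniform(0, 100, size=50)
pred = mos + rng.normal(0, 10, size=50)    # noisy but correlated predictor
print(score_predictions(pred, mos))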






Performance on restoration task

Performance comparison between state-of-the-art IQA methods and the proposed F-Eval on the face restoration subset of FaceQ.




BibTeX

@inproceedings{liu2025fbench,
      title={F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration},
      author={Liu, Lu and Duan, Huiyu and Hu, Qiang and Yang, Liu and Cai, Chunlei and Ye, Tianxiao and Liu, Huayu and Zhang, Xiaoyun and Zhai, Guangtao},
      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      year={2025}
}