MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration

Lu Liu¹, Chunlei Cai², Shaocheng Shen¹, Jianfeng Liang¹, Weimin Ouyang¹, Tianxiao Ye², Jian Mao², Huiyu Duan¹, Jiangchao Yao¹, Xiaoyun Zhang¹, Qiang Hu¹,*, Guangtao Zhai¹

¹Shanghai Jiao Tong University, ²Bilibili Inc.


Overview of the agents in MoA-VR. MoA-VR restores low-quality video clips with complex degradations through the collaboration of three agents: the degradation identification agent, the routing and restoration agent, and the quality assessment agent.

Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.

Degradation Identification Agent Ai

The overall framework of Ai. Ai identifies the types and levels of all degradations within a single all-in-one framework. It takes a video together with a text prompt as input. A vision encoder extracts both spatial and temporal features, and a text tokenizer tokenizes the input prompt. Trained projectors map these features into a shared space, where a pre-trained LLM, fine-tuned with LoRA, fuses them.

Routing and Restoration Agent Ar

Illustration of the degradation removal process carried out by the Routing and Restoration Agent Ar. Ar routes the order in which degradations are removed, rolls back when a restoration step fails, and reroutes to a different removal order.
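
The route, roll back, and reroute behavior described in this caption can be sketched as a small search over removal orders. This is only an illustrative sketch under stated assumptions, not the authors' LLM-based router: the function `route_with_rollback`, the callables `apply_tool` and `succeeded`, the exhaustive permutation fallback, and the toy rule that denoising only works after low-light enhancement are all hypothetical names and choices introduced here.

```python
from itertools import permutations
from typing import Callable, FrozenSet, Optional, Sequence, Tuple

# Toy stand-in: a "video" is represented only by its set of degradations.
Video = FrozenSet[str]


def route_with_rollback(
    video: Video,
    degradations: Sequence[str],
    apply_tool: Callable[[Video, str], Video],
    succeeded: Callable[[Video, Video, str], bool],
) -> Optional[Tuple[Tuple[str, ...], Video]]:
    """Try removal orders one by one; return (order, restored) or None.

    Hypothetical stand-in for the router: a real system could let an LLM
    propose the next order instead of enumerating permutations.
    """
    for order in permutations(degradations):
        current = video  # every order starts again from the untouched input
        for deg in order:
            candidate = apply_tool(current, deg)
            if not succeeded(current, candidate, deg):
                break  # roll back: drop this order and reroute to the next
            current = candidate
        else:
            return order, current  # every step of this order succeeded
    return None


def toy_apply(video: Video, deg: str) -> Video:
    # Pretend the tool removes exactly the targeted degradation.
    return video - {deg}


def toy_succeeded(before: Video, after: Video, deg: str) -> bool:
    # Assumed toy rule: denoising fails while the video is still too dark.
    return not (deg == "noise" and "low_light" in before)


result = route_with_rollback(
    frozenset({"noise", "low_light"}),
    ["noise", "low_light"],
    toy_apply,
    toy_succeeded,
)
```

In this toy run the first order (noise, then low light) fails at its first step, so the sketch discards it and settles on the order that enhances low light before denoising.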

Quality Assessment Agent Aa

An overview of the quality assessment agent Aa. It consists of three feature encoders: an image feature extractor that extracts spatial features from sparse video frames, a motion feature extractor that extracts motion features from the entire video, and a text encoder that extracts aligned text features from prompts. The extracted features are then aligned through projectors and fed into a pre-trained LLM to generate the output results. LoRA weights are added to the pre-trained image encoder and the LLM to adapt them to the quality assessment task.
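
To make "sparse video frames" concrete, the following sketch picks a handful of frame indices for the image feature extractor while the motion branch would still see the whole clip. The source does not state how frames are chosen, so uniform sampling, the function name `sparse_frame_indices`, and the parameter names are assumptions made purely for illustration.

```python
from typing import List


def sparse_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick `num_samples` roughly evenly spaced frame indices.

    Uniform sampling is an assumption; the source only says "sparse".
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Short clip: keep every frame rather than repeating indices.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Split the clip into equal segments and take the centre of each one.
    return [int(step * i + step / 2) for i in range(num_samples)]


indices = sparse_frame_indices(100, 4)
```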

Agent Collaboration and Closed-Loop Design

MoA-VR incorporates three specialized agents within a closed-loop architecture. For a low-quality input video, Ai identifies the degradation types and levels; Ar generates a degradation removal plan and invokes the corresponding tools from the restoration toolbox; Aa assesses all the intermediate results and selects the one with the highest quality. Ai then checks whether the previous restoration step was successful. If it failed, Ar rolls back and reroutes; if it succeeded, Ar continues with the current plan. This loop repeats until all degradations are removed.
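
The control flow of this loop can be sketched in a few lines. The sketch below is a simplified illustration rather than the MoA-VR implementation: the callables `identify`, `next_target`, and `score` merely stand in for Ai, Ar, and Aa, and the function name, the toy toolbox, the `max_steps` cap, and the representation of a video as a set of degradation labels are all assumptions introduced here.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence, Set, Tuple

Video = FrozenSet[str]  # toy stand-in: a "video" is its set of degradations
Tool = Callable[[Video], Video]


def closed_loop_restore(
    video: Video,
    identify: Callable[[Video], Set[str]],                    # stand-in for Ai
    next_target: Callable[[Set[str], Set[str]], Optional[str]],  # stand-in for Ar
    toolbox: Dict[str, Sequence[Tool]],
    score: Callable[[Video], float],                          # stand-in for Aa
    max_steps: int = 10,
) -> Tuple[Video, List[Tuple[str, str]]]:
    current, failed, log = video, set(), []
    for _ in range(max_steps):
        remaining = identify(current)
        if not remaining:
            break  # nothing left to remove
        target = next_target(remaining, failed)
        if target is None:
            break  # no untried route is left
        # Run every tool for the target and keep the best-scoring result.
        candidates = [tool(current) for tool in toolbox[target]]
        best = max(candidates, key=score)
        if target in identify(best):
            failed.add(target)  # roll back: keep `current`, reroute next time
            log.append((target, "rollback"))
        else:
            current = best      # accept the step and continue with the plan
            failed.clear()
            log.append((target, "ok"))
    return current, log


# Toy agents and tools (all hypothetical).
def toy_next_target(remaining: Set[str], failed: Set[str]) -> Optional[str]:
    for deg in ("noise", "low_light"):  # fixed preference order
        if deg in remaining and deg not in failed:
            return deg
    return None


toy_toolbox: Dict[str, Sequence[Tool]] = {
    # A weak denoiser that never helps, and one that needs a bright input.
    "noise": [lambda v: v, lambda v: v if "low_light" in v else v - {"noise"}],
    "low_light": [lambda v: v - {"low_light"}],
}

final, log = closed_loop_restore(
    frozenset({"noise", "low_light"}),
    identify=lambda v: set(v),
    next_target=toy_next_target,
    toolbox=toy_toolbox,
    score=lambda v: -len(v),  # fewer remaining degradations scores higher
)
```

In this toy run, denoising is attempted first and rolled back, low-light enhancement then succeeds, and a second denoising attempt succeeds, which mirrors the fail, reroute, and continue pattern described above.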