Wei Zhai (翟伟)
Associate Researcher at
University of Science and Technology of China (USTC)
I am currently an Associate Researcher at the Department of Automation, USTC. I obtained my Ph.D. degree from USTC in 2022, advised by Prof. Zheng-Jun Zha and Prof. Yang Cao. Prior to that, I received my B.S. degree from Southwest Jiaotong University in 2017. I was fortunate to receive the AAAI 2023 Distinguished Paper Award and the ACM MM 2025 MSMA Workshop Best Student Paper Award. My research interests lie primarily in Computer Vision, Embodied Intelligence, and Machine Learning. Specifically, I focus on building efficient computational frameworks inspired by brain mechanisms, developing egocentric perception for interaction anticipation, and endowing embodied agents with generalizable 2D/3D vision skills in complex real-world scenes.

News

11/2025
1 paper accepted by AAAI 2026.
10/2025
1 paper accepted by T-NNLS.
09/2025
3 papers accepted by NeurIPS 2025 (1 Spotlight).
09/2025
1 paper accepted by ACM MM MSMA Workshop (Best Student Paper).
07/2025
1 paper accepted by Journal of Intelligent Computing and Networking.
07/2025
1 paper accepted by T-ASE.
06/2025
4 papers accepted by ICCV 2025.
06/2025
Won the 1st Place in Efficient Event-based Eye-Tracking Challenge.
06/2025
Won the 1st Place in Body Contact Estimation Challenge (RHOBIN2025 CVPR).
05/2025
1 paper accepted by SCIENCE CHINA Information Sciences.
02/2025
5 papers accepted by CVPR 2025 (1 Highlight).
09/2024
1 paper accepted by NeurIPS 2024.
07/2024
1 paper accepted by ACM MM 2024.
07/2024
1 paper accepted by T-IP.
07/2024
1 paper accepted by ECCV 2024.
06/2024
Won the 2nd Place in 3D Contact Estimation Challenge (RHOBIN2024 CVPR).
04/2024
1 paper accepted by Optics Express.
04/2024
1 paper accepted by T-AI.
03/2024
Won the 2nd Place in Efficient Super-Resolution Challenge (NTIRE2024 CVPR).
03/2024
Won the 1st Place in Event-based Eye Tracking Task (AIS2024 CVPR).
02/2024
1 paper accepted by CVPR 2024.
12/2023
1 paper accepted by AAAI 2024.
11/2023
1 paper accepted by IJCV.
10/2023
1 paper accepted by T-PAMI.
09/2023
1 paper accepted by IJCV.
07/2023
1 paper accepted by T-NNLS.
07/2023
2 papers accepted by ICCV 2023.
03/2023
2 papers accepted by CVPR 2023.
01/2023
1 paper accepted by AAAI 2023 (Distinguished Paper).
09/2022
1 paper accepted by NeurIPS 2022.

Experience

Jul 2022 - Jun 2024
Postdoctoral Researcher
Department of Automation, USTC
Advisor: Prof. Zheng-Jun Zha and Prof. Yang Cao
Sep 2017 - Jun 2022
Ph.D. in Cyberspace Security
School of Cyberspace Security, USTC
Advisor: Prof. Zheng-Jun Zha and Prof. Yang Cao
Dec 2020 - Sep 2021
Research Intern, JD Explore Academy
Mentor: Prof. Dacheng Tao and Dr. Jing Zhang
Sep 2013 - Jun 2017
B.S. in Computer Science
Outstanding Graduate of Southwest Jiaotong University (2017)

Publications

2025
E-MaT
E-MaT: Event-oriented Mamba for Egocentric Point Tracking
Han Han, Wei Zhai*, Baocai Yin, Yang Cao, Bin Li, Zheng-Jun Zha.
In AAAI 2026
We propose a Mamba-based tracking framework that leverages event cameras to capture global motion trends, significantly enhancing egocentric point tracking robustness under fast motion and high dynamic range conditions.
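For readers unfamiliar with event cameras: each pixel fires asynchronous (x, y, t, polarity) events on brightness changes rather than producing frames. The sketch below shows one common, generic way to aggregate such a stream into a dense tensor (a per-pixel polarity count image); the function name and encoding are illustrative only and are not the representation used in E-MaT.

```python
import numpy as np

def events_to_count_image(events, height, width):
    # Accumulate (x, y, t, polarity) events into a 2-channel count image:
    # channel 0 counts negative-polarity events, channel 1 positive ones.
    img = np.zeros((2, height, width), dtype=np.int32)
    for x, y, _t, p in events:
        img[1 if p > 0 else 0, y, x] += 1
    return img

# Three synthetic events on a 4x4 sensor.
events = [(1, 2, 0.10, +1), (1, 2, 0.25, -1), (3, 0, 0.30, +1)]
img = events_to_count_image(events, height=4, width=4)
```

Dense grids like this are what let standard sequence or convolutional backbones consume inherently sparse, asynchronous event data.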
De-raining Generalization
Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay
Kunyu Wang, Xueyang Fu, Chengzhi Cao, Chengjie Ge, Wei Zhai, Zheng-Jun Zha.
In IEEE T-NNLS
We introduce a continuous learning framework inspired by the complementary learning system of the human brain, utilizing memory replay and knowledge distillation to enable de-raining networks to generalize across varied real-world scenarios.
EF-3DGS
EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
Bohao Liao, Wei Zhai*, Zengyu Wan, Zhixin Cheng, Wenfei Yang, Yang Cao, Tianzhu Zhang, Zheng-Jun Zha.
In NeurIPS 2025 (Spotlight)
We propose EF-3DGS, the first event-aided framework to handle fast motion blur and high dynamic range scenes by fusing events and frames, achieving significantly higher PSNR and lower trajectory error in high-speed scenarios.
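EF-3DGS reports reconstruction quality in PSNR. As a quick reference, PSNR is the standard metric 10·log10(MAX²/MSE); the snippet below is a minimal textbook implementation, not code from the paper.

```python
import math

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio between two equal-length pixel sequences.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

# A uniform error of 0.25 gives MSE = 0.0625, i.e. 10*log10(16) ≈ 12.04 dB.
value = psnr([0.25] * 8, [0.0] * 8)
```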
ViewPoint
ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models
Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha.
In NeurIPS 2025
We propose a novel framework utilizing pretrained perspective diffusion models for generating panoramic videos via a new ViewPoint map representation, ensuring global spatial continuity and fine-grained visual details.
PAID
PAID: Pairwise Angular-Invariant Decomposition for Continual Test-Time Adaptation
Kunyu Wang, Xueyang Fu, Yuanfei Bao, Chengjie Ge, Chengzhi Cao, Wei Zhai, Zheng-Jun Zha.
In NeurIPS 2025
We propose PAID, a prior-driven CTTA method that preserves the pairwise angular structure of pre-trained weights using Householder reflections, achieving consistent improvements in continual test-time adaptation.
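The property PAID relies on can be checked numerically: a Householder reflection H = I − 2vvᵀ/(vᵀv) is orthogonal, so applying it to a set of weight vectors leaves all pairwise angles unchanged. The following is a minimal linear-algebra illustration of that fact, not the paper's implementation.

```python
import math

def householder_apply(v, x):
    # Apply the Householder reflection H = I - 2 v v^T / (v^T v) to x.
    scale = 2.0 * sum(vi * xi for vi, xi in zip(v, x)) / sum(vi * vi for vi in v)
    return [xi - scale * vi for vi, xi in zip(v, x)]

def angle(a, b):
    # Angle between two vectors, in radians.
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return math.acos(dot / (na * nb))

v = [1.0, -2.0, 0.5]
x, y = [0.3, 1.1, -0.7], [2.0, 0.4, 0.9]
# Orthogonal maps preserve inner products and norms, hence angles.
before = angle(x, y)
after = angle(householder_apply(v, x), householder_apply(v, y))
```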
Affordance Ranking
Learning Object Affordance Ranking with Task Context
Haojie Huang, Hongchen Luo*, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
In ACM MM 2025 MSMA Workshop (Best Student Paper)
We introduce a Context-embed Group Ranking Framework to learn object affordance ranking by deeply integrating task context, supported by a new large-scale task-oriented dataset.
SIGMAN
SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets
Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai*, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong*.
In ICCV 2025
We present SIGMAN, a latent space generation paradigm for 3D human digitization utilizing a UV-structured VAE and DiT, trained on a newly constructed dataset of 1 million 3D Gaussian assets.
HERO
HERO: Human Reaction Generation from Videos
Chengjun Yu, Wei Zhai*, Yuhang Yang, Yang Cao, Zheng-Jun Zha.
In ICCV 2025
We propose HERO, a framework for human reaction generation from RGB videos that extracts interaction intention and local visual cues, validated on a new Video-Motion dataset.
MATE
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
Han Han, Wei Zhai*, Yang Cao, Bin Li, Zheng-Jun Zha.
In ICCV 2025
We introduce MATE, an event-based point tracking framework that resolves spatial sparsity and motion blur through motion-augmented temporal consistency, achieving significantly faster processing and higher precision.
EMoTive
EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation
Zengyu Wan, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
In ICCV 2025
We propose EMoTive, an event-based framework for 3D motion estimation that models spatio-temporal trajectories via Event Kymograph projection and non-uniform parametric curves.
PEAR
PEAR: Phrase-Based Hand-Object Interaction Anticipation
Zichen Zhang, Hongchen Luo, Wei Zhai, Yang Cao, Yu Kang.
In SCIENCE CHINA Information Sciences (SCIS)
We present PEAR, a model for hand-object interaction anticipation that jointly predicts intention and manipulation using phrase-based cross-alignment, supported by the EGO-HOIP dataset.
BRAT
BRAT: Bidirectional Relative Positional Attention Transformer for Event-based Eye Tracking
Yuliang Wu, Han Han, Jinze Chen, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
In CVPR 2025 Workshop (1st Place Challenge)
We propose BRAT, a Bidirectional Relative Positional Attention Transformer for event-based eye tracking that fully exploits spatio-temporal sequences, winning 1st place in the Efficient Event-based Eye-Tracking Challenge.
CompreCap
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
Fan Lu, Wei Wu, Kecheng Zheng*, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai*, Yang Cao, Yujun Shen, Zheng-Jun Zha.
In CVPR 2025
We introduce CompreCap, a benchmark for evaluating detailed image captioning in LVLMs using a directed scene graph to assess object coverage, attributes, and relationships comprehensively.
GREAT
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
Yawen Shao, Wei Zhai*, Yuhang Yang, Hongchen Luo, Yang Cao, Zheng-Jun Zha.
In CVPR 2025
We propose GREAT, a framework for open-vocabulary 3D object affordance grounding that combines geometry attributes with interaction intention reasoning, verified on the large-scale PIADv2 dataset.
IV-VAE
Improved Video VAE for Latent Video Diffusion Model
Pingyu Wu, Kai Zhu*, Yu Liu, Liming Zhao, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
In CVPR 2025
We propose an Improved Video VAE (IV-VAE) featuring Keyframe-based Temporal Compression and Group Causal Convolution to resolve temporal-spatial conflicts in latent video diffusion models.
MMAR
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang, Dacheng Yin, Yizhou Zhou, Fengyun Rao, Wei Zhai, Yang Cao, Zheng-Jun Zha.
In CVPR 2025
We introduce MMAR, a lossless multi-modal auto-regressive framework that uses continuous-valued image tokens and a lightweight diffusion head to unify image understanding and generation without information loss.
Efficient CTTA-OD
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
Kunyu Wang, Xueyang Fu, Xin Lu, Chengjie Ge, Chengzhi Cao, Wei Zhai, Zheng-Jun Zha.
In CVPR 2025 (Highlight)
We propose an efficient CTTA-OD method utilizing sensitivity-guided channel pruning to selectively suppress domain-sensitive channels, reducing computational overhead while maintaining adaptation performance.
VMAD
VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
Huilin Deng, Hongchen Luo, Wei Zhai, Yang Cao, Yu Kang.
In IEEE T-ASE
We present VMAD, a Visual-enhanced MLLM for zero-shot anomaly detection that incorporates defect-sensitive structure learning and locality-enhanced token compression, benchmarked on the RIAD dataset.
LSA
Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection
Fan Lu, Kai Zhu, Kecheng Zheng, Wei Zhai, Yang Cao, Zheng-Jun Zha.
In Journal of Intelligent Computing and Networking
We propose a Likelihood-Aware Semantic Alignment (LSA) framework for full-spectrum OOD detection, utilizing Gaussian sampling and bidirectional prompt customization to align image-text correspondence.
2024
EgoChoir
EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views
Yuhang Yang, Wei Zhai*, Chengfeng Wang, Chengjun Yu, Yang Cao, Zheng-Jun Zha.
In NeurIPS 2024
We propose EgoChoir to capture 3D interaction regions from egocentric views by harmonizing visual appearance, head motion, and 3D objects to jointly infer human contact and object affordance.
UniDense
UniDense: Unleashing Diffusion Models with Meta-Routers for Universal Few-Shot Dense Prediction
Lintao Dong, Wei Zhai*, Zheng-Jun Zha.
In ACM MM 2024
We introduce UniDense, a framework utilizing Meta-Routers to select task-relevant computation pathways within a frozen Stable Diffusion model for efficient universal few-shot dense prediction.
MV-Net
Event-based Optical Flow via Transforming into Motion-dependent View
Zengyu Wan, Yang Wang, Wei Zhai*, Ganchao Tan, Yang Cao, Zheng-Jun Zha*.
In IEEE T-IP
We propose MV-Net, which transforms the orthogonal view into a motion-dependent view using an Event View Transformation Module to enhance event-based motion representation for optical flow estimation.
BOT
Bidirectional Progressive Transformer for Interaction Intention Anticipation
Zichen Zhang, Hongchen Luo, Wei Zhai*, Yang Cao, Yu Kang.
In ECCV 2024
We present BOT, a Bidirectional Progressive Transformer that mutually corrects hand trajectories and interaction hotspots predictions to minimize error accumulation in interaction intention anticipation.
AsynHDR
Event-based Asynchronous HDR Imaging by Temporal Incident Light Modulation
Yuliang Wu, Ganchao Tan, Jinze Chen, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
In Optics Express
We propose AsynHDR, a system integrating DVS with LCD panels for temporal incident light modulation, enabling pixel-asynchronous High Dynamic Range (HDR) imaging.
PLMNet
Prioritized Local Matching Network for Cross-Category Few-Shot Anomaly Detection
Huilin Deng, Hongchen Luo, Wei Zhai, Yang Cao, Yanming Guo, Yu Kang.
In IEEE T-AI
We propose PLMNet for Cross-Category Few-shot Anomaly Detection, utilizing a Local Perception Network and Defect-sensitive Weight Learner to establish fine-grained correspondence between query and normal samples.
LEMON
LEMON: Learning 3D Human-Object Interaction Relation from 2D Images
Yuhang Yang, Wei Zhai*, Hongchen Luo, Yang Cao, Zheng-Jun Zha.
In CVPR 2024
We present LEMON, a unified model that learns 3D human-object interaction relations from 2D images by mining interaction intentions and geometric correlations to jointly anticipate interaction elements.
MambaPupil
MambaPupil: Bidirectional Selective Recurrent Model for Event-based Eye Tracking
Zhong Wang, Zengyu Wan, Han Han, Bohao Liao, Yuliang Wu, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
In CVPR 2024 Workshop (1st Place Challenge)
We propose MambaPupil, a bidirectional selective recurrent model for event-based eye tracking, utilizing a Linear Time-Varying State Space Module to handle diverse eye movement patterns.
HCE
Hypercorrelation Evolution for Video Class-Incremental Learning
Sen Liang, Kai Zhu*, Zhiheng Liu, Wei Zhai*, Yang Cao.
In AAAI 2024
We propose a hierarchical aggregation strategy and correlation refinement mechanism for Video Class-Incremental Learning, optimizing hierarchical matching matrices to alleviate catastrophic forgetting.
2023
Grounded Affordance
Grounded Affordance from Exocentric View
Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao.
In International Journal of Computer Vision (IJCV)
Journal version of "Learning Affordance Grounding from Exocentric Images" (CVPR 2022)
We propose a cross-view affordance knowledge transfer framework to ground affordance from exocentric views by transferring affordance-specific features to egocentric views, supported by the AGD20K dataset.
MPAP
On Exploring Multiplicity of Primitives and Attributes for Texture Recognition in the Wild
Wei Zhai, Yang Cao, Jing Zhang, Haiyong Xie, Dacheng Tao, Zheng-Jun Zha.
In IEEE T-PAMI
Journal version of MPAP (ICCV 2019) and DSR-Net (CVPR 2020)
We propose MPAP, a novel network for texture recognition that models the relation of bottom-up structure and top-down attributes in a multi-branch unified framework to capture multiple primitives and attributes.
BAS
Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation
Wei Zhai, Pingyu Wu, Kai Zhu, Yang Cao, Feng Wu, Zheng-Jun Zha.
In International Journal of Computer Vision (IJCV)
Journal version of BAS (CVPR 2022)
We introduce Background Activation Suppression (BAS) for weakly supervised object localization, using an Activation Map Constraint to facilitate generator learning by suppressing background activation.
HAG-Net
Learning Visual Affordance Grounding from Demonstration Videos
Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao.
In IEEE T-NNLS
We propose HAG-Net, a hand-aided network that leverages demonstration videos and a dual-branch structure to learn visual affordance grounding by transferring knowledge from video to object branches.
SAT
Spatial-Aware Token for Weakly Supervised Object Localization
Pingyu Wu, Wei Zhai*, Yang Cao, Jiebo Luo, Zheng-Jun Zha.
In ICCV 2023
We propose a Spatial-Aware Token (SAT) for weakly supervised object localization to resolve optimization conflicts in transformers by learning a task-specific token to condition localization.
IAG
Grounding 3D Object Affordance from 2D Interactions in Images
Yuhang Yang, Wei Zhai*, Hongchen Luo, Yang Cao, Jiebo Luo, Zheng-Jun Zha.
In ICCV 2023
We introduce a novel task of grounding 3D object affordance from 2D interactions using an Interaction-driven 3D Affordance Grounding Network (IAG) and the new PIAD dataset.
Robustness Benchmark
Robustness Benchmark for Unsupervised Anomaly Detection Models
Pei Wang, Wei Zhai, and Yang Cao.
In Journal of University of Science and Technology of China (JUSTC)
We propose MVTec-C, a dataset to evaluate the robustness of unsupervised anomaly detection models, and a Feature Alignment Module (FAM) to reduce feature drift caused by corruptions.
Interactive Affinity
Leverage Interactive Affinity for Affordance Learning
Hongchen Luo#, Wei Zhai#, Jing Zhang, Yang Cao, and Dacheng Tao.
In CVPR 2023
We propose to leverage interactive affinity for affordance learning, using a pose-aided framework and keypoint heuristic perception to transfer cues from human-object interactions to non-interactive objects.
SCOOD
Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection
Fan Lu, Kai Zhu, Wei Zhai, Kecheng Zheng, and Yang Cao.
In CVPR 2023
We propose an uncertainty-aware optimal transport scheme for Semantically Coherent OOD detection, utilizing an energy-based transport mechanism to discern outliers from intended data distributions.
Ventral Stream
Exploring Tuning Characteristics of Ventral Stream's Neurons for Few-Shot Image Classification
Lintao Dong, Wei Zhai, Zheng-Jun Zha.
In AAAI 2023 (Oral, Distinguished Paper)
We explore the tuning characteristics of ventral stream neurons for few-shot image classification, proposing hierarchical feature regularization to produce generic and robust features.
2022
FGA
Exploring Figure-Ground Assignment Mechanism in Perceptual Organization
Wei Zhai, Yang Cao, Jing Zhang, Zheng-Jun Zha.
In NeurIPS 2022
We explore the figure-ground assignment mechanism to empower CNNs for robust perceptual organization, utilizing a Figure-Ground-Aided (FGA) module to handle visual ambiguity.
CBCE-Net
Phrase-Based Affordance Detection via Cyclic Bilateral Interaction
Liangsheng Lu#, Wei Zhai#, Hongchen Luo, Kang Yu, Yang Cao.
In IEEE T-AI
We propose CBCE-Net for phrase-based affordance detection, utilizing a cyclic bilateral interaction module to align vision and language features, extended with the annotated PAD dataset.
OSAD-Net
One-Shot Affordance Detection in the Wild
Wei Zhai#, Hongchen Luo#, Jing Zhang, Yang Cao, Dacheng Tao.
In International Journal of Computer Vision (IJCV)
Journal version of "One-Shot Affordance Detection" (IJCAI 2021)
We propose OSAD-Net for one-shot affordance detection by transferring human action purpose to unseen scenarios, benchmarked on the large-scale PADv2 dataset.
DTC-Net
Deep Texton-Coherence Network for Camouflaged Object Detection
Wei Zhai, Yang Cao, Haiyong Xie, Zheng-Jun Zha.
In IEEE T-MM
We propose DTC-Net for camouflaged object detection, utilizing Local Bilinear modules and Spatial Coherence Organization to leverage spatial statistical properties of textons.
LCG-Net
Location-Free Camouflage Generation Network
Yangyang Li#, Wei Zhai#, Yang Cao, Zheng-Jun Zha.
In IEEE T-MM
We present LCG-Net, a location-free camouflage generation network that uses Position-aligned Structure Fusion (PSF) to efficiently generate camouflage in multi-appearance regions.
Exocentric Affordance
Learning Affordance Grounding from Exocentric Images
Hongchen Luo#, Wei Zhai#, Jing Zhang, Yang Cao, and Dacheng Tao.
In CVPR 2022
We propose a cross-view knowledge transfer framework for affordance grounding that extracts features from exocentric interactions to perceive affordance in egocentric views, introducing the AGD20K dataset.
BAS
Background Activation Suppression for Weakly Supervised Object Localization
Pingyu Wu#, Wei Zhai#, Yang Cao.
In CVPR 2022
We propose Background Activation Suppression (BAS) for Weakly Supervised Object Localization (WSOL), which uses an Activation Map Constraint (AMC) to suppress background activation and learn whole object regions.
SSRE
Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning
Kai Zhu, Wei Zhai, Yang Cao, Jiebo Luo, Zheng-Jun Zha.
In CVPR 2022
We propose a self-sustaining representation expansion scheme for non-exemplar class-incremental learning, featuring structure reorganization and main-branch distillation to maintain old features.
DANSE
Robust Object Detection via Adversarial Novel Style Exploration
Wen Wang, Jing Zhang, Wei Zhai, Yang Cao, Dacheng Tao.
In IEEE T-IP
We propose DANSE, a method for robust object detection that uses adversarial novel style exploration to discover diverse degradation styles and adapt models to open and compound degradation types.
2021
OS-AD
One-Shot Affordance Detection
Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao.
In IJCAI 2021 (Oral)
We propose a One-Shot Affordance Detection (OS-AD) network that estimates action purpose and transfers it to detect common affordances in unseen scenarios, utilizing collaboration learning.
TAM-GCN
A Tri-Attention Enhanced Graph Convolutional Network for Skeleton-Based Action Recognition
Xingming Li, Wei Zhai, Yang Cao.
In IET Computer Vision (IET-CV 2021)
We introduce a Tri-Attention Module (TAM) for skeleton-based action recognition to guide GCNs in perceiving significant variations across body poses, joint trajectories, and evolving projections.
SPPR
Self-Promoted Prototype Refinement for Few-Shot Class-Incremental Learning
Kai Zhu, Yang Cao, Wei Zhai, Jie Cheng, Zheng-Jun Zha.
In CVPR 2021
We propose a Self-Promoted Prototype Refinement mechanism for few-shot class-incremental learning, utilizing random episode selection and dynamic relation projection to strengthen new class expression.
2020
Self-Supervised Tuning
Self-Supervised Tuning for Few-Shot Segmentation
Kai Zhu, Wei Zhai, Yang Cao.
In IJCAI 2020 (Oral)
We present an adaptive tuning framework for few-shot segmentation that uses a novel self-supervised inner-loop to dynamically adjust latent features and augment category-specific descriptors.
IR Method
Deep Inhomogeneous Regularization for Transfer Learning
Wen Wang, Wei Zhai, Yang Cao.
In ICIP 2020
We propose a novel Inhomogeneous Regularization (IR) method for transfer learning that imposes decaying averaged deviation penalties to tackle catastrophic forgetting and negative transfer.
DSR-Net
Deep Structure-Revealed Network for Texture Recognition
Wei Zhai, Yang Cao, Zheng-Jun Zha, Haiyong Xie, Feng Wu.
In CVPR 2020 (Oral)
We propose DSR-Net for texture recognition, leveraging a primitive capturing module and dependence learning module to reveal spatial dependency and structural representations.
OS-TR
One-Shot Texture Retrieval Using Global Grouping Metric
Kai Zhu, Yang Cao, Wei Zhai, Zheng-Jun Zha.
In IEEE T-MM 2020
Journal version of "One-Shot Texture Retrieval with Global Context Metric" (IJCAI 2019)
We propose an OS-TR network for one-shot texture retrieval that utilizes an adaptive directionality-aware module and a grouping-attention mechanism for robust generalization.
2019
MAP-Net
Deep Multiple-Attribute-Perceived Network for Real-World Texture Recognition
Wei Zhai, Yang Cao, Jing Zhang, Zheng-Jun Zha.
In ICCV 2019
We propose MAP-Net for texture recognition, which progressively learns visual texture attributes in a multi-branch architecture using deformable pooling and attribute transfer schemes.
OS-TR
One-Shot Texture Retrieval with Global Context Metric
Kai Zhu, Wei Zhai, Zheng-Jun Zha, Yang Cao.
In IJCAI 2019 (Oral)
We tackle one-shot texture retrieval with an OS-TR network that includes a directionality-aware module and a self-gating mechanism to exploit global context information.
PixTextGAN
PixTextGAN: Structure Aware Text Image Synthesis for License Plate Recognition
Shilian Wu, Wei Zhai, Yang Cao.
In IET Image Processing (IET-IP 2019)
We propose PixTextGAN, a controllable architecture for synthesizing license plate images with a structure-aware loss, removing the need to collect vast amounts of labelled data.
2018
Unsupervised Inspection
A Generative Adversarial Network Based Framework for Unsupervised Visual Surface Inspection
Wei Zhai, Jiang Zhu, Yang Cao, Zengfu Wang.
In ICASSP 2018 (Oral)
We propose a GAN-based framework for unsupervised visual surface inspection, where the discriminator serves as a one-class classifier to detect abnormal regions using multi-scale fusion.
Depth Map SR
Co-Occurrent Structural Edge Detection for Color-Guided Depth Map Super-Resolution
Jiang Zhu, Wei Zhai, Yang Cao, Zheng-Jun Zha.
In MMM 2018 (Oral)
We propose a CNN-based method for color-guided depth map super-resolution that detects co-occurrent structural edges to effectively exploit structural correlations between depth and color images.

Pre-prints

TOUCH
TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
Guangyi Han, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha.
arXiv
We introduce Free-Form HOI Generation and TOUCH, a framework leveraging a multi-level diffusion model and explicit contact modeling to generate diverse, physically plausible hand-object interactions from text.
VGPO
Value-Anchored Group Policy Optimization for Flow Models
Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha.
arXiv
We propose Value-Anchored Group Policy Optimization (VGPO) for flow matching-based image generation, redefining value estimation with process-aware value estimates to enable precise credit assignment and stable optimization.
AliTok
AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha.
arXiv
We introduce AliTok, an Aligned Tokenizer using a causal decoder to establish unidirectional dependencies, aligning token modeling with autoregressive models for superior image generation performance.
VideoGen-Eval
VideoGen-Eval: Agent-based System for Video Generation Evaluation
Yuhang Yang, Shangkun Sun, Hongxiang Li, Ke Fan, Ailing Zeng, Feilin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha.
arXiv
We propose VideoGen-Eval, an agent-based dynamic evaluation system for video generation that integrates content structuring and multimodal judgment, validated against human preferences.
VanGogh
VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization
Zixun Fang, Zhiheng Liu, Kai Zhu, Yu Liu, Ka Leong Cheng, Wei Zhai, Yang Cao, Zheng-Jun Zha.
arXiv
We introduce VanGogh, a unified multimodal diffusion-based framework for video colorization that employs a Dual Qformer and depth-guided generation to achieve superior temporal consistency and color fidelity.
EDFilter
Event Signal Filtering via Probability Flux Estimation
Jinze Chen, Wei Zhai, Yang Cao, Bin Li, Zheng-Jun Zha.
arXiv
We introduce EDFilter, an event signal filtering framework that estimates probability flux from discrete events using nonparametric kernel smoothing, enhancing signal fidelity for downstream tasks.
VCR-Net
Visual-Geometric Collaborative Guidance for Affordance Learning
Hongchen Luo, Wei Zhai, Jiao Wang, Yang Cao, Zheng-Jun Zha.
arXiv
Journal version of "Leverage Interactive Affinity for Affordance Learning" (CVPR 2023)
We propose a visual-geometric collaborative guided affordance learning network that leverages interactive affinity to transfer knowledge from human-object interactions to non-interactive objects.
Ego-SAG
Grounding 3D Scene Affordance From Egocentric Interactions
Cuiyu Liu, Wei Zhai, Yuhang Yang, Hongchen Luo, Sen Liang, Yang Cao, Zheng-Jun Zha.
arXiv
We introduce Ego-SAG, a framework for grounding 3D scene affordance from egocentric interactions using interaction intent guidance and a bidirectional query decoder mechanism.
ViViD
ViViD: Video Virtual Try-on using Diffusion Models
Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha.
arXiv
We present ViViD, a framework using diffusion models for video virtual try-on, incorporating a Garment Encoder, Pose Encoder, and Temporal Modules to ensure spatial-temporal consistency.
IDE
Intention-driven Ego-to-Exo Video Generation
Hongchen Luo, Kai Zhu, Wei Zhai, Yang Cao.
arXiv
We propose IDE, an Intention-Driven Ego-to-Exo video generation framework that uses action intention and cross-view feature perception to generate consistent exocentric videos from egocentric inputs.

Professional Activities

Conference Reviewer
  • IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • IEEE International Conference on Computer Vision (ICCV)
  • European Conference on Computer Vision (ECCV)
  • Neural Information Processing Systems (NeurIPS)
  • International Conference on Learning Representations (ICLR)
  • International Conference on Machine Learning (ICML)
  • AAAI Conference on Artificial Intelligence (AAAI)
  • ACM Multimedia (ACM MM)
  • International Joint Conference on Artificial Intelligence (IJCAI)
Journal Reviewer
  • IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)
  • International Journal of Computer Vision (IJCV)
  • IEEE Transactions on Image Processing (T-IP)
  • IEEE Transactions on Neural Networks and Learning Systems (T-NNLS)
  • IEEE Transactions on Multimedia (T-MM)
  • IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT)
  • Pattern Recognition (PR)
  • ACM Transactions on Multimedia Computing, Communications, and Applications (ToMM)

Awards and Honors

2025
ACM MM MSMA Workshop Best Student Paper
2025
1st Place, Body Contact Estimation Challenge (RHOBIN2025 CVPR Workshop)
2025
1st Place, Efficient Event-based Eye-Tracking Challenge (CVPR Workshop)
2024
1st Place, Event-based Eye Tracking Task (AIS2024 CVPR Workshop)
2024
2nd Place, 3D Contact Estimation Challenge (RHOBIN2024 CVPR Workshop)
2024
2nd Place, NTIRE 2024 Efficient Super-Resolution Challenge
2023
AAAI Distinguished Paper Award
2021
Outstanding Internship at JD Explore Academy
2019
National Scholarship (University of Science and Technology of China)
2017
Outstanding Graduate of Southwest Jiaotong University
2016
National Scholarship (Southwest Jiaotong University)

Teaching

Autumn 2025
Computer Vision, USTC
Autumn 2024
Computer Vision, USTC

Teaching Assistant

Autumn 2020
Computer Vision, USTC
Autumn 2019
Image Processing, USTC