Wei Zhai (翟伟)
Associate Researcher at
University of Science and Technology of China (USTC)
I am currently an Associate Researcher at the Department of Automation, USTC.
I obtained my Ph.D. degree from USTC in 2022, advised by Prof. Zheng-Jun Zha and Prof. Yang Cao.
Prior to that, I received my B.S. degree from Southwest Jiaotong University in 2017.
I was fortunate to receive the AAAI 2023 Distinguished Paper Award and the ACM MM 2025 MSMA Workshop Best Student Paper Award.
My research interests primarily lie in Computer Vision, Embodied Intelligence, and Machine Learning, with a specific focus on building efficient computational frameworks inspired by brain mechanisms, developing egocentric perception for interaction anticipation, and endowing embodied agents with generalizable 2D/3D vision skills in complex real-world scenes.
News
11/2025
1 paper accepted by AAAI 2026.
10/2025
1 paper accepted by T-NNLS.
09/2025
3 papers accepted by NeurIPS 2025.
1 Spotlight
09/2025
1 paper accepted by ACM MM MSMA Workshop.
Best Student Paper
07/2025
1 paper accepted by Journal of Intelligent Computing and Networking.
07/2025
1 paper accepted by T-ASE.
06/2025
4 papers accepted by ICCV 2025.
06/2025
Won 1st place in the Efficient Event-based Eye-Tracking Challenge (CVPR 2025 Workshop).
06/2025
Won 1st place in the Body Contact Estimation Challenge (RHOBIN2025 CVPR Workshop).
05/2025
1 paper accepted by SCIENCE CHINA Information Sciences.
02/2025
5 papers accepted by CVPR 2025.
1 Highlight
Experience
Jul 2024 - Present
Associate Researcher, Department of Automation, University of Science and Technology of China (USTC)
Jul 2022 - Jun 2024
Postdoctoral Researcher, University of Science and Technology of China (USTC)
Sep 2017 - Jun 2022
Ph.D. in Cyberspace Security, University of Science and Technology of China (USTC)
Dec 2020 - Sep 2021
Research Intern, JD Explore Academy
Sep 2013 - Jun 2017
B.S. in Computer Science, Southwest Jiaotong University
Outstanding Graduate of Southwest Jiaotong University (2017)
Publications
2025
E-MaT: Event-oriented Mamba for Egocentric Point Tracking
In AAAI 2026
We propose a Mamba-based tracking framework that leverages event cameras to capture global motion trends, significantly enhancing egocentric point tracking robustness under fast motion and high dynamic range conditions.
Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay
In IEEE T-NNLS
We introduce a continuous learning framework inspired by the complementary learning system of the human brain, utilizing memory replay and knowledge distillation to enable de-raining networks to generalize across varied real-world scenarios.
EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
In NeurIPS 2025
(Spotlight)
We propose EF-3DGS, the first event-aided framework to handle fast motion blur and high dynamic range scenes by fusing events and frames, achieving significantly higher PSNR and lower trajectory error in high-speed scenarios.
ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models
In NeurIPS 2025
We propose a novel framework utilizing pretrained perspective diffusion models for generating panoramic videos via a new ViewPoint map representation, ensuring global spatial continuity and fine-grained visual details.
PAID: Pairwise Angular-Invariant Decomposition for Continual Test-Time Adaptation
In NeurIPS 2025
We propose PAID, a prior-driven CTTA method that preserves the pairwise angular structure of pre-trained weights using Householder reflections, achieving consistent improvements in continual test-time adaptation.
Learning Object Affordance Ranking with Task Context
In ACM MM 2025 MSMA Workshop
(Best Student Paper)
We introduce a Context-embed Group Ranking Framework to learn object affordance ranking by deeply integrating task context, supported by a new large-scale task-oriented dataset.
SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets
In ICCV 2025
We present SIGMAN, a latent space generation paradigm for 3D human digitization utilizing a UV-structured VAE and DiT, trained on a newly constructed dataset of 1 million 3D Gaussian assets.
HERO: Human Reaction Generation from Videos
In ICCV 2025
We propose HERO, a framework for human reaction generation from RGB videos that extracts interaction intention and local visual cues, validated on a new Video-Motion dataset.
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
In ICCV 2025
We introduce MATE, an event-based point tracking framework that resolves spatial sparsity and motion blur through motion-augmented temporal consistency, achieving significantly faster processing and higher precision.
EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation
In ICCV 2025
We propose EMoTive, an event-based framework for 3D motion estimation that models spatio-temporal trajectories via Event Kymograph projection and non-uniform parametric curves.
PEAR: Phrase-Based Hand-Object Interaction Anticipation
In SCIENCE CHINA Information Sciences (SCIS)
We present PEAR, a model for hand-object interaction anticipation that jointly predicts intention and manipulation using phrase-based cross-alignment, supported by the EGO-HOIP dataset.
BRAT: Bidirectional Relative Positional Attention Transformer for Event-based Eye Tracking
In CVPR 2025 Workshop
(1st Place Challenge)
We propose BRAT, a Bidirectional Relative Positional Attention Transformer for event-based eye tracking that fully exploits spatio-temporal sequences, winning 1st place in the Efficient Event-based Eye-Tracking Challenge.
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
In CVPR 2025
We introduce CompreCap, a benchmark for evaluating detailed image captioning in LVLMs using a directed scene graph to assess object coverage, attributes, and relationships comprehensively.
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
In CVPR 2025
We propose GREAT, a framework for open-vocabulary 3D object affordance grounding that combines geometry attributes with interaction intention reasoning, verified on the large-scale PIADv2 dataset.
Improved Video VAE for Latent Video Diffusion Model
In CVPR 2025
We propose an Improved Video VAE (IV-VAE) featuring Keyframe-based Temporal Compression and Group Causal Convolution to resolve temporal-spatial conflicts in latent video diffusion models.
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
In CVPR 2025
We introduce MMAR, a lossless multi-modal auto-regressive framework that uses continuous-valued image tokens and a lightweight diffusion head to unify image understanding and generation without information loss.
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
In CVPR 2025
(Highlight)
We propose an efficient CTTA-OD method utilizing sensitivity-guided channel pruning to selectively suppress domain-sensitive channels, reducing computational overhead while maintaining adaptation performance.
VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
In IEEE T-ASE
We present VMAD, a Visual-enhanced MLLM for zero-shot anomaly detection that incorporates defect-sensitive structure learning and locality-enhanced token compression, benchmarked on the RIAD dataset.
Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection
In Journal of Intelligent Computing and Networking
We propose a Likelihood-Aware Semantic Alignment (LSA) framework for full-spectrum OOD detection, utilizing Gaussian sampling and bidirectional prompt customization to align image-text correspondence.
2024
EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views
In NeurIPS 2024
We propose EgoChoir to capture 3D interaction regions from egocentric views by harmonizing visual appearance, head motion, and 3D objects to jointly infer human contact and object affordance.
UniDense: Unleashing Diffusion Models with Meta-Routers for Universal Few-Shot Dense Prediction
In ACM MM 2024
We introduce UniDense, a framework utilizing Meta-Routers to select task-relevant computation pathways within a frozen Stable Diffusion model for efficient universal few-shot dense prediction.
Event-based Optical Flow via Transforming into Motion-dependent View
In IEEE T-IP
We propose MV-Net, which transforms the orthogonal view into a motion-dependent view using an Event View Transformation Module to enhance event-based motion representation for optical flow estimation.
Bidirectional Progressive Transformer for Interaction Intention Anticipation
In ECCV 2024
We present BOT, a Bidirectional Progressive Transformer that mutually corrects hand trajectory and interaction hotspot predictions to minimize error accumulation in interaction intention anticipation.
Event-based Asynchronous HDR Imaging by Temporal Incident Light Modulation
In Optics Express
We propose AsynHDR, a system integrating DVS with LCD panels for temporal incident light modulation, enabling pixel-asynchronous High Dynamic Range (HDR) imaging.
Prioritized Local Matching Network for Cross-Category Few-Shot Anomaly Detection
In IEEE T-AI
We propose PLMNet for Cross-Category Few-shot Anomaly Detection, utilizing a Local Perception Network and Defect-sensitive Weight Learner to establish fine-grained correspondence between query and normal samples.
LEMON: Learning 3D Human-Object Interaction Relation from 2D Images
In CVPR 2024
We present LEMON, a unified model that learns 3D human-object interaction relations from 2D images by mining interaction intentions and geometric correlations to jointly anticipate interaction elements.
MambaPupil: Bidirectional Selective Recurrent Model for Event-based Eye Tracking
In CVPR 2024 Workshop
(1st Place Challenge)
We propose MambaPupil, a bidirectional selective recurrent model for event-based eye tracking, utilizing a Linear Time-Varying State Space Module to handle diverse eye movement patterns.
Hypercorrelation Evolution for Video Class-Incremental Learning
In AAAI 2024
We propose a hierarchical aggregation strategy and correlation refinement mechanism for Video Class-Incremental Learning, optimizing hierarchical matching matrices to alleviate catastrophic forgetting.
2023
Grounded Affordance from Exocentric View
In International Journal of Computer Vision (IJCV)
Journal version of "Learning Affordance Grounding from Exocentric Images" (CVPR 2022)
Journal version of "Learning Affordance Grounding from Exocentric Images" (CVPR 2022)
We propose a cross-view affordance knowledge transfer framework to ground affordance from exocentric views by transferring affordance-specific features to egocentric views, supported by the AGD20K dataset.
On Exploring Multiplicity of Primitives and Attributes for Texture Recognition in the Wild
In IEEE T-PAMI
Journal version of MPAP (ICCV 2019) and DSR-Net (CVPR 2020)
We propose MPAP, a novel network for texture recognition that models the relation between bottom-up structures and top-down attributes in a unified multi-branch framework to capture multiple primitives and attributes.
Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation
In International Journal of Computer Vision (IJCV)
Journal version of BAS (CVPR 2022)
We introduce Background Activation Suppression (BAS) for weakly supervised object localization, using an Activation Map Constraint to facilitate generator learning by suppressing background activation.
Learning Visual Affordance Grounding from Demonstration Videos
In IEEE T-NNLS
We propose HAG-Net, a hand-aided network that leverages demonstration videos and a dual-branch structure to learn visual affordance grounding by transferring knowledge from video to object branches.
Spatial-Aware Token for Weakly Supervised Object Localization
In ICCV 2023
We propose a Spatial-Aware Token (SAT) for weakly supervised object localization to resolve optimization conflicts in transformers by learning a task-specific token to condition localization.
Grounding 3D Object Affordance from 2D Interactions in Images
In ICCV 2023
We introduce a novel task of grounding 3D object affordance from 2D interactions using an Interaction-driven 3D Affordance Grounding Network (IAG) and the new PIAD dataset.
Robustness Benchmark for Unsupervised Anomaly Detection Models
In Journal of University of Science and Technology of China (JUSTC)
We propose MVTec-C, a dataset to evaluate the robustness of unsupervised anomaly detection models, and a Feature Alignment Module (FAM) to reduce feature drift caused by corruptions.
Leverage Interactive Affinity for Affordance Learning
In CVPR 2023
We propose to leverage interactive affinity for affordance learning, using a pose-aided framework and keypoint heuristic perception to transfer cues from human-object interactions to non-interactive objects.
Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection
In CVPR 2023
We propose an uncertainty-aware optimal transport scheme for Semantically Coherent OOD detection, utilizing an energy-based transport mechanism to discern outliers from intended data distributions.
Exploring Tuning Characteristics of Ventral Stream's Neurons for Few-Shot Image Classification
In AAAI 2023
(Oral, Distinguished Paper)
We explore the tuning characteristics of ventral stream neurons for few-shot image classification, proposing hierarchical feature regularization to produce generic and robust features.
2022
Exploring Figure-Ground Assignment Mechanism in Perceptual Organization
In NeurIPS 2022
We explore the figure-ground assignment mechanism to empower CNNs for robust perceptual organization, utilizing a Figure-Ground-Aided (FGA) module to handle visual ambiguity.
Phrase-Based Affordance Detection via Cyclic Bilateral Interaction
In IEEE T-AI
We propose CBCE-Net for phrase-based affordance detection, utilizing a cyclic bilateral interaction module to align vision and language features, extended with the annotated PAD dataset.
One-Shot Affordance Detection in the Wild
In International Journal of Computer Vision (IJCV)
Journal version of "One-Shot Affordance Detection" (IJCAI 2021)
Journal version of "One-Shot Affordance Detection" (IJCAI 2021)
We propose OSAD-Net for one-shot affordance detection by transferring human action purpose to unseen scenarios, benchmarked on the large-scale PADv2 dataset.
Deep Texton-Coherence Network for Camouflaged Object Detection
In IEEE T-MM
We propose DTC-Net for camouflaged object detection, utilizing Local Bilinear modules and Spatial Coherence Organization to leverage spatial statistical properties of textons.
Location-Free Camouflage Generation Network
In IEEE T-MM
We present LCG-Net, a location-free camouflage generation network that uses Position-aligned Structure Fusion (PSF) to efficiently generate camouflage in multi-appearance regions.
Learning Affordance Grounding from Exocentric Images
In CVPR 2022
We propose a cross-view knowledge transfer framework for affordance grounding that extracts features from exocentric interactions to perceive affordance in egocentric views, introducing the AGD20K dataset.
Background Activation Suppression for Weakly Supervised Object Localization
In CVPR 2022
We propose Background Activation Suppression (BAS) for Weakly Supervised Object Localization (WSOL), which uses an Activation Map Constraint (AMC) to suppress background activation and learn whole object regions.
Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning
In CVPR 2022
We propose a self-sustaining representation expansion scheme for non-exemplar class-incremental learning, featuring structure reorganization and main-branch distillation to maintain old features.
Robust Object Detection via Adversarial Novel Style Exploration
In IEEE T-IP
We propose DANSE, a method for robust object detection that uses adversarial novel style exploration to discover diverse degradation styles and adapt models to open and compound degradation types.
2021
One-Shot Affordance Detection
In IJCAI 2021
(Oral)
We propose a One-Shot Affordance Detection (OS-AD) network that estimates action purpose and transfers it to detect common affordances in unseen scenarios, utilizing collaboration learning.
A Tri-Attention Enhanced Graph Convolutional Network for Skeleton-Based Action Recognition
In IET Computer Vision (IET-CV 2021)
We introduce a Tri-Attention Module (TAM) for skeleton-based action recognition to guide GCNs in perceiving significant variations across body poses, joint trajectories, and evolving projections.
Self-Promoted Prototype Refinement for Few-Shot Class-Incremental Learning
In CVPR 2021
We propose a Self-Promoted Prototype Refinement mechanism for few-shot class-incremental learning, utilizing random episode selection and dynamic relation projection to strengthen new class expression.
2020
Self-Supervised Tuning for Few-Shot Segmentation
In IJCAI 2020
(Oral)
We present an adaptive tuning framework for few-shot segmentation that uses a novel self-supervised inner-loop to dynamically adjust latent features and augment category-specific descriptors.
Deep Inhomogeneous Regularization for Transfer Learning
In ICIP 2020
We propose a novel Inhomogeneous Regularization (IR) method for transfer learning that imposes decaying averaged deviation penalties to tackle catastrophic forgetting and negative transfer.
Deep Structure-Revealed Network for Texture Recognition
In CVPR 2020
(Oral)
We propose DSR-Net for texture recognition, leveraging a primitive capturing module and dependence learning module to reveal spatial dependency and structural representations.
One-Shot Texture Retrieval Using Global Grouping Metric
In IEEE T-MM 2020
Journal version of "One-Shot Texture Retrieval with Global Context Metric" (IJCAI 2019)
Journal version of "One-Shot Texture Retrieval with Global Context Metric" (IJCAI 2019)
We propose an OS-TR network for one-shot texture retrieval that utilizes an adaptive directionality-aware module and a grouping-attention mechanism for robust generalization.
2019
Deep Multiple-Attribute-Perceived Network for Real-World Texture Recognition
In ICCV 2019
We propose MAP-Net for texture recognition, which progressively learns visual texture attributes in a multi-branch architecture using deformable pooling and attribute transfer schemes.
One-Shot Texture Retrieval with Global Context Metric
In IJCAI 2019
(Oral)
We tackle one-shot texture retrieval with an OS-TR network that includes a directionality-aware module and a self-gating mechanism to exploit global context information.
PixTextGAN: Structure Aware Text Image Synthesis for License Plate Recognition
In IET Image Processing (IET-IP 2019)
We propose PixTextGAN, a controllable architecture for synthesizing license plate images with a structure-aware loss, removing the need to collect vast amounts of labelled data.
2018
A Generative Adversarial Network Based Framework for Unsupervised Visual Surface Inspection
In ICASSP 2018
(Oral)
We propose a GAN-based framework for unsupervised visual surface inspection, where the discriminator serves as a one-class classifier to detect abnormal regions using multi-scale fusion.
Co-Occurrent Structural Edge Detection for Color-Guided Depth Map Super-Resolution
In MMM 2018
(Oral)
We propose a CNN-based method for color-guided depth map super-resolution that detects co-occurrent structural edges to effectively exploit structural correlations between depth and color images.
Pre-prints
TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
arXiv
We introduce Free-Form HOI Generation and TOUCH, a framework leveraging a multi-level diffusion model and explicit contact modeling to generate diverse, physically plausible hand-object interactions from text.
Value-Anchored Group Policy Optimization for Flow Models
arXiv
We propose Value-Anchored Group Policy Optimization (VGPO) for flow matching-based image generation, redefining value estimation with process-aware estimates to enable precise credit assignment and stable optimization.
AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
arXiv
We introduce AliTok, an Aligned Tokenizer using a causal decoder to establish unidirectional dependencies, aligning token modeling with autoregressive models for superior image generation performance.
VideoGen-Eval: Agent-based System for Video Generation Evaluation
arXiv
We propose VideoGen-Eval, an agent-based dynamic evaluation system for video generation that integrates content structuring and multimodal judgment, validated against human preferences.
VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization
arXiv
We introduce VanGogh, a unified multimodal diffusion-based framework for video colorization that employs a Dual Qformer and depth-guided generation to achieve superior temporal consistency and color fidelity.
Event Signal Filtering via Probability Flux Estimation
arXiv
We introduce EDFilter, an event signal filtering framework that estimates probability flux from discrete events using nonparametric kernel smoothing, enhancing signal fidelity for downstream tasks.
Visual-Geometric Collaborative Guidance for Affordance Learning
arXiv
Journal version of "Leverage Interactive Affinity for Affordance Learning" (CVPR 2023)
Journal version of "Leverage Interactive Affinity for Affordance Learning" (CVPR 2023)
We propose a visual-geometric collaborative guided affordance learning network that leverages interactive affinity to transfer knowledge from human-object interactions to non-interactive objects.
Grounding 3D Scene Affordance From Egocentric Interactions
arXiv
We introduce Ego-SAG, a framework for grounding 3D scene affordance from egocentric interactions using interaction intent guidance and a bidirectional query decoder mechanism.
ViViD: Video Virtual Try-on using Diffusion Models
arXiv
We present ViViD, a framework using diffusion models for video virtual try-on, incorporating a Garment Encoder, Pose Encoder, and Temporal Modules to ensure spatial-temporal consistency.
Intention-driven Ego-to-Exo Video Generation
arXiv
We propose IDE, an Intention-Driven Ego-to-Exo video generation framework that uses action intention and cross-view feature perception to generate consistent exocentric videos from egocentric inputs.
Professional Activities
Conference Reviewer
- IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- IEEE International Conference on Computer Vision (ICCV)
- European Conference on Computer Vision (ECCV)
- Neural Information Processing Systems (NeurIPS)
- International Conference on Learning Representations (ICLR)
- International Conference on Machine Learning (ICML)
- AAAI Conference on Artificial Intelligence (AAAI)
- ACM Multimedia (ACM MM)
- International Joint Conference on Artificial Intelligence (IJCAI)
Journal Reviewer
- IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)
- International Journal of Computer Vision (IJCV)
- IEEE Transactions on Image Processing (T-IP)
- IEEE Transactions on Neural Networks and Learning Systems (T-NNLS)
- IEEE Transactions on Multimedia (T-MM)
- IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT)
- Pattern Recognition (PR)
- ACM Transactions on Multimedia Computing, Communications, and Applications (ToMM)
Awards and Honors
2025
ACM MM MSMA Workshop Best Student Paper
Best Paper
2025
1st Place, Body Contact Estimation Challenge (RHOBIN2025 CVPR Workshop)
Champion
2025
1st Place, Efficient Event-based Eye-Tracking Challenge (CVPR Workshop)
Champion
2024
1st Place, Event-based Eye Tracking Task (AIS2024 CVPR Workshop)
Champion
2024
2nd Place, 3D Contact Estimation Challenge (RHOBIN2024 CVPR Workshop)
Runner-up
2024
2nd Place, NTIRE 2024 Efficient Super-Resolution Challenge
Runner-up
2023
AAAI Distinguished Paper Award
Distinguished
2021
Outstanding Internship at JD Explore Academy
2019
National Scholarship (University of Science and Technology of China)
2017
Outstanding Graduate of Southwest Jiaotong University
2016
National Scholarship (Southwest Jiaotong University)
Teaching
Autumn 2025
Computer Vision, USTC
Autumn 2024
Computer Vision, USTC
Teaching Assistant
Autumn 2020
Computer Vision, USTC
Autumn 2019
Image Processing, USTC