Wei Zhai (翟伟)
I'm currently an Associate Researcher at the University of Science and Technology of China (USTC). I obtained my PhD degree from USTC in 2022, where I was advised by Professor Zheng-Jun Zha and Associate Professor Yang Cao.
Research: I work on computer vision, embodied intelligence, and machine learning. I am currently focusing on three aspects: 1) building efficient computational frameworks for embodied intelligence by drawing on brain mechanisms; 2) developing egocentric perception, which involves understanding egocentric scenarios, analyzing present interactions, and anticipating future activity; 3) endowing embodied agents working in complex real-world scenes with generalizable 2D/3D vision and interaction skills.
Email / School Homepage / Scholar / Lab
News
► (09/2024) One paper is accepted by NeurIPS 2024 ~
► (07/2024) One paper is accepted by ACM MM 2024 ~
► (07/2024) One paper is accepted by T-IP ~
► (07/2024) One paper is accepted by ECCV 2024 ~
► (06/2024) Our team wins the 2nd Place of 3D Contact Estimation Challenge (RHOBIN2024 CVPR) ~
► (04/2024) One paper is accepted by Optics Express ~
► (04/2024) One paper is accepted by T-AI ~
► (03/2024) Our team wins the 2nd Place of Efficient Super-Resolution Challenge (NTIRE2024 CVPR) ~
► (03/2024) Our team wins the 1st Place of Event-based Eye Tracking Task (AIS2024 CVPR) ~
► (02/2024) One paper is accepted by CVPR 2024 ~
► (12/2023) One paper is accepted by AAAI 2024 ~
► (11/2023) One paper is accepted by IJCV ~
► (10/2023) One paper is accepted by T-PAMI ~
► (09/2023) One paper is accepted by IJCV ~
► (07/2023) One paper is accepted by T-NNLS ~
► (07/2023) Two papers are accepted by ICCV 2023 ~
► (03/2023) Two papers are accepted by CVPR 2023 ~
► (01/2023) One paper is accepted by AAAI 2023 (Distinguished Paper) ~
► (09/2022) One paper is accepted by NeurIPS 2022 ~
► (08/2022) One paper is accepted by T-AI ~
► (06/2022) One paper is accepted by IJCV ~
► (Before 06/2022) ......
Experience
University of Science and Technology of China (USTC)
Jul 2024 - Now Associate Researcher in the Department of Automation
University of Science and Technology of China (USTC)
Jul 2022 - Jun 2024 Postdoctoral Researcher in the Department of Automation (working with Professor Zheng-Jun Zha and Associate Professor Yang Cao)
University of Science and Technology of China (USTC)
Sep 2017 - Jun 2022 Ph.D. in the School of Cyberspace Security (working with Professor Zheng-Jun Zha and Associate Professor Yang Cao)
JD Explore Academy
Dec 2020 - Sep 2021 Research Intern in JD Explore Academy (working with Professor Dacheng Tao and Jing Zhang)
Southwest Jiaotong University
Sep 2013 - Jun 2017 B.S. in Computer Science
Publications
2024
EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views
Yuhang Yang, Wei Zhai*, Chengfeng Wang, Chengjun Yu, Yang Cao, Zheng-Jun Zha.
Neural Information Processing Systems (NeurIPS 2024).
abstract / bibtex / code
Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception, facilitating applications like AR/VR and embodied AI. For the egocentric HOI, in addition to perceiving semantics e.g., ''what'' interaction is occurring, capturing ''where'' the interaction specifically manifests in 3D space is also crucial, which links the perception and operation. Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view. However, incomplete observations of interacting parties in the egocentric view introduce ambiguity between visual observations and interaction contents, impairing their efficacy. From the egocentric view, humans integrate the visual cortex, cerebellum, and brain to internalize their intentions and interaction concepts of objects, allowing for the pre-formulation of interactions and making behaviors even when interaction regions are out of sight. In light of this, we propose harmonizing the visual appearance, head motion, and 3D object to excavate the object interaction concept and subject intention, jointly inferring 3D human contact and object affordance from egocentric videos. To achieve this, we present EgoChoir, which links object structures with interaction contexts inherent in appearance and head motion to reveal object affordance, further utilizing it to model human contact. Additionally, a gradient modulation is employed to adopt appropriate clues for capturing interaction regions across various egocentric scenarios. Moreover, 3D contact and affordance are annotated for egocentric videos collected from Ego-Exo4D and GIMO to support the task. Extensive experiments on them demonstrate the effectiveness and superiority of EgoChoir. Code and data will be open.
UniDense: Unleashing Diffusion Models with Meta-Routers for Universal Few-Shot Dense Prediction
Lintao Dong, Wei Zhai*, Zheng-Jun Zha.
ACM Multimedia (ACM MM 2024).
abstract / bibtex
Universal few-shot dense prediction requires a versatile model capable of learning any dense prediction task from limited labeled images, which necessitates the model to possess efficient adaptation abilities. Prevailing few-shot learning methods rely on efficient fine-tuning of model weights for few-shot adaptation, which carries the risk of disrupting the pre-trained knowledge and lacks the capability to extract task-specific knowledge contained in the pre-trained model. To overcome these limitations, our paper approaches universal few-shot dense prediction from a novel perspective. Unlike conventional fine-tuning techniques that use all model parameters and modify a specific set of weights for few-shot adaptation, our method focuses on selecting task-relevant computation pathways of the pre-trained model while keeping the model weights frozen. Building upon this idea, we introduce a novel framework, UniDense, for universal few-shot dense prediction. First, we construct a versatile MoE (Mixture of Experts) architecture for dense prediction based on the Stable Diffusion model. We then utilize episode-based meta-learning to train a set of routers for this MoE model, called Meta-Routers, which act as hyper-networks responsible for selecting computation blocks relevant to each task. We demonstrate that fine-tuning these meta-routers enables efficient few-shot adaptation of the entire model. Moreover, for each few-shot task, we leverage support samples to extract a task embedding, which serves as a conditioning factor for meta-routers. This strategy allows meta-routers to dynamically adapt themselves to different few-shot tasks, leading to improved adaptation performance. Experiments on a challenging variant of the Taskonomy dataset with 10 dense prediction tasks demonstrate the superiority of our approach.
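A minimal sketch of the routing idea described above, assuming frozen expert blocks whose outputs are gated by a trainable meta-router conditioned on a task embedding pooled from the support set; the class names, dimensions, and gating rule are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class MetaRouter(nn.Module):
    """Hypothetical router: maps a task embedding to weights over expert blocks."""
    def __init__(self, task_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(task_dim, num_experts)  # the only weights tuned per task

    def forward(self, task_emb: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.gate(task_emb), dim=-1)  # (B, num_experts)

class RoutedMoEBlock(nn.Module):
    """Frozen expert blocks combined according to the router's weights."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        for p in self.experts.parameters():   # pre-trained computation blocks stay frozen
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens, weights: (B, num_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, N, dim)
        return (weights[:, :, None, None] * outs).sum(dim=1)      # weighted pathway selection

router = MetaRouter(task_dim=64, num_experts=4)
block = RoutedMoEBlock(dim=32)
task_emb = torch.randn(2, 64)                           # e.g. pooled from the support samples
y = block(torch.randn(2, 100, 32), router(task_emb))    # (2, 100, 32)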
Event-based Optical Flow via Transforming into Motion-dependent View
Zengyu Wan, Yang Wang, Wei Zhai*, Ganchao Tan, Yang Cao, Zheng-Jun Zha*.
IEEE Transactions on Image Processing (T-IP).
abstract / bibtex
Event cameras respond to temporal dynamics, helping to resolve ambiguities in spatio-temporal changes for optical flow estimation. However, the unique spatio-temporal event distribution challenges the feature extraction, and the direct construction of motion representation through the orthogonal view is less than ideal due to the entanglement of appearance and motion. This paper proposes to transform the orthogonal view into a motion-dependent one for enhancing event-based motion representation and presents a Motion View-based Network (MV-Net) for practical optical flow estimation. Specifically, this motion-dependent view transformation is achieved through the Event View Transformation Module, which captures the relationship between the steepest temporal changes and motion direction, incorporating these temporal cues into the view transformation process for feature gathering. This module includes two phases: extracting the temporal evolution clues by central difference operation in the extraction phase and capturing the motion pattern by evolution-guided deformable convolution in the perception phase. Besides, the MV-Net constructs an eccentric downsampling process to avoid response weakening from the sparsity of events in the downsampling stage. The whole network is trained end-to-end in a self-supervised manner, and the evaluations conducted on four challenging datasets reveal the superior performance of the proposed model compared to state-of-the-art (SOTA) methods.
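As a toy illustration of the central-difference operation mentioned above, the temporal-evolution cue can be exposed by differencing neighboring temporal bins of an event voxel grid; the (B, T, H, W) layout and scaling are assumptions, not the paper's code.

import torch

def central_temporal_difference(voxel: torch.Tensor) -> torch.Tensor:
    """voxel: (B, T, H, W) event counts per temporal bin; returns (B, T-2, H, W)."""
    # f'(t) ~ (f(t+1) - f(t-1)) / 2, computed on the interior bins only
    return 0.5 * (voxel[:, 2:] - voxel[:, :-2])

evolution = central_temporal_difference(torch.randn(1, 10, 64, 64))   # (1, 8, 64, 64)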
Bidirectional Progressive Transformer for Interaction Intention Anticipation
Zichen Zhang, Hongchen Luo, Wei Zhai*, Yang Cao, Yu Kang.
European Conference on Computer Vision (ECCV 2024).
abstract / bibtex / arxiv
Interaction intention anticipation aims to jointly predict future hand trajectories and interaction hotspots. Existing research often treated trajectory forecasting and interaction hotspot prediction as separate tasks or solely considered the impact of trajectories on interaction hotspots, which led to the accumulation of prediction errors over time. However, a deeper inherent connection exists between hand trajectories and interaction hotspots, which allows for continuous mutual correction between them. Building upon this relationship, we establish a novel Bidirectional prOgressive Transformer (BOT), which introduces a Bidirectional Progressive mechanism into the anticipation of interaction intention. Initially, BOT maximizes the utilization of spatial information from the last observation frame through the Spatial-Temporal Reconstruction Module, mitigating conflicts arising from changes of view in first-person videos. Subsequently, based on two independent prediction branches, a Bidirectional Progressive Enhancement Module is introduced to mutually improve the prediction of hand trajectories and interaction hotspots over time to minimize error accumulation. Finally, acknowledging the intrinsic randomness in human natural behavior, we employ a Trajectory Stochastic Unit and a C-VAE to introduce appropriate uncertainty to trajectories and interaction hotspots, respectively. Our method achieves state-of-the-art results on three benchmark datasets: Epic-Kitchens-100, EGO4D, and EGTEA Gaze+, demonstrating superior performance in complex scenarios.
Event-based Asynchronous HDR Imaging by Temporal Incident Light Modulation
Yuliang Wu, Ganchao Tan, Jinze Chen, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
Optics Express (OE).
abstract / bibtex
Dynamic range (DR) is a pivotal characteristic of imaging systems. Current frame-based cameras struggle to achieve high dynamic range imaging due to the conflict between globally uniform exposure and spatially variant scene illumination. In this paper, we propose AsynHDR, a pixel-asynchronous HDR imaging system, based on key insights into the challenges in HDR imaging and the unique event-generating mechanism of dynamic vision sensors (DVS). Our proposed AsynHDR system integrates the DVS with a set of LCD panels. The LCD panels modulate the irradiance incident upon the DVS by altering their transparency, thereby triggering the pixel-independent event streams. The HDR image is subsequently decoded from the event streams through our temporal-weighted algorithm. Experiments under the standard test platform and several challenging scenes have verified the feasibility of the system in HDR imaging tasks.
Prioritized Local Matching Network for Cross-Category Few-Shot Anomaly Detection
Huilin Deng, Hongchen Luo, Wei Zhai, Yang Cao, Yanming Guo, Yu Kang.
IEEE Transactions on Artificial Intelligence (T-AI).
abstract / bibtex
In response to the rapid evolution of products in industrial inspection, this paper introduces the Cross-category Few-shot Anomaly Detection (C-FSAD) task, aimed at efficiently detecting anomalies in new object categories with minimal normal samples. However, the diversity of defects and significant visual distinctions among various objects hinder the identification of anomalous regions. To tackle this, we adopt a pairwise comparison between query and normal samples, establishing an intimate correlation through fine-grained correspondence. Specifically, we propose the Prioritized Local Matching Network (PLMNet), emphasizing local analysis of correlation, which includes three primary components: 1) Local Perception Network refines the initial matches through bidirectional local analysis; 2) Step Aggregation strategy employs multiple stages of local convolutional pooling to aggregate local insights; 3) Defect-sensitive Weight Learner adaptively enhances channels informative for defect structures, ensuring more discriminative representations of encoded context. Our PLMNet deepens the interpretation of correlations, from geometric cues to semantics, efficiently extracting discrepancies in feature space. Extensive experiments on two standard industrial anomaly detection benchmarks demonstrate our state-of-the-art performance in both detection and localization, with margins of 9.8% and 5.4% respectively.
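A minimal sketch of the pairwise query-versus-normal comparison that PLMNet refines, reduced here to a dense cosine-similarity correlation followed by a nearest-match anomaly score; this baseline correlation step is illustrative only, not the paper's network.

import torch
import torch.nn.functional as F

def anomaly_from_matching(query_feat: torch.Tensor, normal_feat: torch.Tensor) -> torch.Tensor:
    """query_feat, normal_feat: (C, H, W) deep features of the query / normal image."""
    c, h, w = query_feat.shape
    q = F.normalize(query_feat.reshape(c, -1), dim=0)    # unit-norm feature per location
    n = F.normalize(normal_feat.reshape(c, -1), dim=0)
    corr = q.t() @ n                                      # (HW, HW) cosine correlation
    best = corr.max(dim=1).values                         # best normal match per query location
    return (1.0 - best).reshape(h, w)                     # poor best match -> high anomaly

score_map = anomaly_from_matching(torch.randn(64, 32, 32), torch.randn(64, 32, 32))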
LEMON: Learning 3D Human-Object Interaction Relation from 2D Images
Yuhang Yang, Wei Zhai*, Hongchen Luo, Yang Cao, Zheng-Jun Zha.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024).
abstract / bibtex / arxiv / website
Learning 3D human-object interaction relation is pivotal to embodied AI and interaction modeling. Most existing methods approach the goal by learning to predict isolated interaction elements, e.g., human contact, object affordance, and human-object spatial relation, primarily from the perspective of either the human or the object, which underexploits certain correlations between the interaction counterparts (human and object) and struggles to address the uncertainty in interactions. Actually, objects' functionalities potentially affect humans' interaction intentions, which reveals what the interaction is. Meanwhile, the interacting humans and objects exhibit matching geometric structures, which presents how to interact. In light of this, we propose harnessing these inherent correlations between interaction counterparts to mitigate the uncertainty and jointly anticipate the above interaction elements in 3D space. To achieve this, we present LEMON (LEarning 3D huMan-Object iNteraction relation), a unified model that mines interaction intentions of the counterparts and employs curvatures to guide the extraction of geometric correlations, combining them to anticipate the interaction elements. Besides, the 3D Interaction Relation dataset (3DIR) is collected to serve as the test bed for training and evaluation. Extensive experiments demonstrate the superiority of LEMON over methods estimating each element in isolation.
MambaPupil: Bidirectional Selective Recurrent Model for Event-based Eye Tracking
Zhong Wang, Zengyu Wan, Han Han, Bohao Liao, Yuliang Wu, Wei Zhai*, Yang Cao, Zheng-Jun Zha.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Workshop.
Event-based Eye Tracking-AIS2024 CVPR Workshop, 1st Place.
abstract / bibtex
Event-based eye tracking has shown great promise with the high temporal resolution and low redundancy provided by the event camera. However, the diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, which consists of a multi-layer convolutional encoder to extract features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM) to selectively capture contextual correlation from the forward and backward temporal relationship. Furthermore, the Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, called Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. The evaluation on the ThreeET-plus benchmark shows that MambaPupil realizes stable and accurate eye tracking under various complex conditions and achieves state-of-the-art performance.
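A minimal sketch of the bidirectional sequence-modeling part alone: a bidirectional GRU over per-frame event features with a coordinate regression head. The convolutional encoder, Bina-rep, and LTV-SSM are omitted, and all dimensions are assumptions.

import torch
import torch.nn as nn

class BiGRUTracker(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)     # (x, y) pupil position per frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(frame_feats)             # (B, T, 2*hidden), forward + backward context
        return self.head(h)                      # (B, T, 2)

coords = BiGRUTracker()(torch.randn(4, 30, 128))   # (4, 30, 2)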
Hypercorrelation Evolution for Video Class-Incremental Learning
Sen Liang, Kai Zhu*, Zhiheng Liu, Wei Zhai*, Yang Cao.
AAAI Conference on Artificial Intelligence (AAAI 2024).
abstract / bibtex
Video class-incremental learning aims to recognize new actions while restricting the catastrophic forgetting of old ones, whose representative samples can only be saved in limited memory. Semantically variable subactions are susceptible to class confusion due to data imbalance. While existing methods address the problem by estimating and distilling the spatio-temporal knowledge, we further explore that the refinement of hierarchical correlations is crucial for the alignment of spatio-temporal features. To enhance the adaptability on evolved actions, we propose a hierarchical aggregation strategy, in which hierarchical matching matrices are combined and jointly optimized to selectively store and retrieve relevant features from previous tasks. Meanwhile, a correlation refinement mechanism is presented to reinforce the bias on informative exemplars according to the online hypercorrelation distribution. Experimental results demonstrate the effectiveness of the proposed method on three standard video class-incremental learning benchmarks, outperforming state-of-the-art methods. Code is available at: https://github.com/Lsen991031/HCE.
2023
Grounded Affordance from Exocentric View
Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao.
International Journal of Computer Vision (IJCV).
Journal version of "Learning Affordance Grounding from Exocentric Images" (CVPR 2022)
abstract / bibtex / arxiv / code
Affordance grounding aims to locate objects' "action possibilities" regions, which is an essential step toward embodied intelligence. Due to the diversity of interactive affordance, the uniqueness of different individuals leads to diverse interactions, which makes it difficult to establish an explicit link between object parts and affordance labels. Humans have the ability to transform various exocentric interactions into invariant egocentric affordance to counter the impact of interactive diversity. To empower an agent with such ability, this paper proposes a task of affordance grounding from the exocentric view, i.e., given exocentric human-object interaction and egocentric object images, learning the affordance knowledge of the object and transferring it to the egocentric image using only the affordance label as supervision. However, there is some "interaction bias" between personas, mainly regarding different regions and different views. To this end, we devise a cross-view affordance knowledge transfer framework that extracts affordance-specific features from exocentric interactions and transfers them to the egocentric view. Specifically, the perception of affordance regions is enhanced by preserving affordance co-relations. In addition, an affordance grounding dataset named AGD20K is constructed by collecting and labeling over 20K images from 36 affordance categories. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality.
On Exploring Multiplicity of Primitives and Attributes for Texture Recognition in the Wild
Wei Zhai, Yang Cao, Jing Zhang, Haiyong Xie, Dacheng Tao, Zheng-Jun Zha.
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI).
Journal version of "Deep Multiple-Attribute-Perceived Network for Real-World Texture Recognition" (ICCV 2019) and "Deep Structure-Revealed Network for Texture Recognition" (CVPR 2020).
abstract / bibtex / code
Texture recognition is a challenging visual task since its multiple primitives or attributes can be perceived from the texture image under different spatial contexts. Existing approaches predominantly built upon CNN incorporate rich local descriptors with orderless aggregation to capture invariance to the spatial layout. However, these methods ignore the inherent structure relation organized by primitives and the semantic concept described by attributes, which are critical cues for texture representation. In this paper, we propose a novel Multiple Primitives and Attributes Perception network (MPAP) that extracts features by modeling the relation of bottom-up structure and top-down attribute in a multi-branch unified framework. A bottom-up process is first proposed to capture the inherent relation of various primitive structures by leveraging structure dependency and spatial order information. Then, a top-down process is introduced to model the latent relation of multiple attributes by transferring attribute-related features between adjacent branches. Moreover, an augmentation module is devised to bridge the gap between high-level attributes and low-level structure features. MPAP can learn representation through jointing bottom-up and top-down processes in a mutually reinforced manner. Experimental results on six challenging texture datasets demonstrate the superiority of MPAP over state-of-the-art methods in terms of accuracy, robustness, and efficiency.
Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation
Wei Zhai, Pingyu Wu, Kai Zhu, Yang Cao, Feng Wu, Zheng-Jun Zha.
International Journal of Computer Vision (IJCV).
Journal version of "Background Activation Suppression for Weakly Supervised Object Localization" (CVPR 2022)
abstract / bibtex / arxiv / code
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged by generating a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two astonishing experimental observations on the object localization learning process: For a trained network, as the foreground mask expands, 1) the cross-entropy converges to zero when the foreground mask covers only part of the object region. 2) The activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve a more effective localization performance, we argue for the usage of activation value to learn more object regions. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets.
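A toy rendering of the background-activation-suppression idea, assuming a loss that penalizes the fraction of activation falling outside the predicted foreground map together with an area constraint; the weighting and exact terms are assumptions, not the paper's AMC formulation.

import torch

def bas_style_loss(activation: torch.Tensor, fg_map: torch.Tensor, area_weight: float = 1.0) -> torch.Tensor:
    """activation: (B, H, W) class activation map; fg_map: (B, H, W) foreground prediction in [0, 1]."""
    bg_act = (activation * (1.0 - fg_map)).sum(dim=(1, 2))
    total_act = activation.sum(dim=(1, 2)).clamp_min(1e-6)
    suppression = bg_act / total_act            # share of activation left in the background
    area = fg_map.mean(dim=(1, 2))              # area constraint keeps the predicted mask tight
    return (suppression + area_weight * area).mean()

loss = bas_style_loss(torch.rand(2, 14, 14), torch.rand(2, 14, 14))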
Learning Visual Affordance Grounding from Demonstration Videos
Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao.
IEEE Transactions on Neural Networks and Learning Systems (T-NNLS).
abstract / bibtex / arxiv / code
Visual affordance grounding aims to segment all possible interaction regions between people and objects from an image/video, which benefits many applications, such as robot grasping and action recognition. Prevailing methods predominantly depend on the appearance feature of the objects to segment each region of the image, which encounters the following two problems: 1) there are multiple possible regions in an object that people interact with and 2) there are multiple possible human interactions in the same object region. To address these problems, we propose a hand-aided affordance grounding network (HAG-Net) that leverages the aided clues provided by the position and action of the hand in demonstration videos to eliminate the multiple possibilities and better locate the interaction regions in the object. Specifically, HAG-Net adopts a dual-branch structure to process the demonstration video and object image data. For the video branch, we introduce hand-aided attention to enhance the region around the hand in each video frame and then use the long short-term memory (LSTM) network to aggregate the action features. For the object branch, we introduce a semantic enhancement module (SEM) to make the network focus on different parts of the object according to the action classes and utilize a distillation loss to align the output features of the object branch with that of the video branch and transfer the knowledge in the video branch to the object branch. Quantitative and qualitative evaluations on two challenging datasets show that our method has achieved state-of-the-art results for affordance grounding.
Spatial-Aware Token for Weakly Supervised Object Localization
Pingyu Wu, Wei Zhai*, Yang Cao, Jiebo Luo and Zheng-Jun Zha.
IEEE/CVF International Conference on Computer Vision (ICCV 2023).
abstract / bibtex / arxiv / code
Weakly supervised object localization (WSOL) is a challenging task aiming to localize objects with only image-level supervision. Recent works apply visual transformer to WSOL and achieve significant success by exploiting the long-range feature dependency in self-attention mechanism. However, existing transformer-based methods synthesize the classification feature maps as the localization map, which leads to optimization conflicts between classification and localization tasks. To address this problem, we propose to learn a task-specific spatial-aware token (SAT) to condition localization in a weakly supervised manner. Specifically, a spatial token is first introduced in the input space to aggregate representations for localization task. Then a spatial aware attention module is constructed, which allows spatial token to generate foreground probabilities of different patches by querying and to extract localization knowledge from the classification task. Besides, for the problem of sparse and unbalanced pixel-level supervision obtained from the image-level label, two spatial constraints, including batch area loss and normalization loss, are designed to compensate and enhance this supervision. Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively. Even under the extreme setting of using only 1 image per class from ImageNet for training, SAT already exceeds the SOTA method by 2.1% GT-known Loc.
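A minimal sketch of the spatial-token idea: a learnable token queries the patch tokens and its (single-head) attention is read out as per-patch foreground probabilities. The module below is a toy stand-in, not the paper's SAT.

import torch
import torch.nn as nn

class SpatialToken(nn.Module):
    def __init__(self, dim: int = 192):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))   # task-specific spatial token
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.size(0)                             # patch_tokens: (B, N, dim)
        q = self.q(self.token.expand(b, -1, -1))             # (B, 1, dim)
        k = self.k(patch_tokens)                             # (B, N, dim)
        attn = (q @ k.transpose(1, 2)) * self.scale          # (B, 1, N)
        return torch.sigmoid(attn).squeeze(1)                # per-patch foreground probability

fg = SpatialToken()(torch.randn(2, 196, 192))                # (2, 196); reshape to a 14x14 map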
Grounding 3D Object Affordance from 2D Interactions in Images
Yuhang Yang, Wei Zhai*, Hongchen Luo, Yang Cao, Jiebo Luo and Zheng-Jun Zha.
IEEE/CVF International Conference on Computer Vision (ICCV 2023).
abstract / bibtex / arxiv / code
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space, which serves as a link between perception and operation for embodied agents. Existing studies primarily focus on connecting visual affordances with geometry structures, e.g., relying on annotations to declare interactive regions of interest on the object and establishing a mapping between the regions and affordances. However, the essence of learning object affordance is to understand how to use it, and the manner that detaches interactions is limited in generalization. Normally, humans possess the ability to perceive object affordances in the physical world through demonstration images or videos. Motivated by this, we introduce a novel task setting: grounding 3D object affordance from 2D interactions in images, which faces the challenge of anticipating affordance through interactions of different sources. To address this problem, we devise a novel Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources and models the interactive contexts for 3D object affordance grounding. Besides, we collect a Point-Image Affordance Dataset (PIAD) to support the proposed task. Comprehensive experiments on PIAD demonstrate the reliability of the proposed task and the superiority of our method.
Robustness Benchmark for Unsupervised Anomaly Detection Models
Pei Wang, Wei Zhai, and Yang Cao.
Journal of University of Science and Technology of China (JUSTC).
abstract / bibtex
Due to the complexity and diversity of production environments, it is essential to understand the robustness of unsupervised anomaly detection models to common corruptions. To explore this issue systematically, we propose a dataset named MVTec-C to evaluate the robustness of unsupervised anomaly detection models. Based on this dataset, we explore the robustness of approaches in five paradigms, namely, reconstruction-based, representation similarity-based, normalizing flow-based, self-supervised representation learning-based, and knowledge distillation-based paradigms. Furthermore, we explore the impact of different modules within two optimal methods on robustness and accuracy. This includes the multi-scale features, the neighborhood size, and the sampling ratio in the PatchCore method, as well as the multi-scale features, the MMF module, the OCE module, and the multi-scale distillation in the Reverse Distillation method. Finally, we propose a feature alignment module (FAM) to reduce the feature drift caused by corruptions and combine PatchCore and the FAM to obtain a model with both high performance and high accuracy. We hope this work will serve as an evaluation method and provide experience in building robust anomaly detection models in the future.
Leverage Interactive Affinity for Affordance Learning
Hongchen Luo#, Wei Zhai#, Jing Zhang, Yang Cao, and Dacheng Tao.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023).
abstract / bibtex / code
Perceiving potential "action possibilities" (i.e., affordance) regions of images and learning interactive functionalities of objects from human demonstration is a challenging task due to the diversity of human-object interactions. Prevailing affordance learning algorithms often adopt the label assignment paradigm and presume that there is a unique relationship between functional region and affordance label, yielding poor performance when adapting to unseen environments with large appearance variations. In this paper, we propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interaction and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities. Specifically, we propose a pose-aided interactive affinity learning framework that exploits human pose to guide the network to learn the interactive affinity from human-object interactions. Particularly, a keypoint heuristic perception (KHP) scheme is devised to exploit the keypoint association of human pose to alleviate the uncertainties due to interaction diversities and contact occlusions. Besides, a contact-driven affordance learning (CAL) dataset is constructed by collecting and labeling over 5,000 images. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality.
Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection
Fan Lu, Kai Zhu, Wei Zhai, Kecheng Zheng, and Yang Cao.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023).
abstract / bibtex / code
Semantically coherent out-of-distribution (SCOOD) detection aims to discern outliers from the intended data distribution with access to an unlabeled extra set. The coexistence of in-distribution and out-of-distribution samples will exacerbate the model overfitting when no distinction is made. To address this problem, we propose a novel uncertainty-aware optimal transport scheme. Our scheme consists of an energy-based transport (ET) mechanism that estimates the fluctuating cost of uncertainty to promote the assignment of semantic-agnostic representation, and an inter-cluster extension strategy that enhances the discrimination of semantic property among different clusters by widening the corresponding margin distance. Furthermore, a T-energy score is presented to mitigate the magnitude gap between the parallel transport and classifier branches. Extensive experiments on two standard SCOOD benchmarks demonstrate the above-par OOD detection performance, outperforming the state-of-the-art methods by a margin of 27.69% and 34.4% on FPR@95, respectively.
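For reference, the standard energy score computed from classifier logits, E(x) = -T · logsumexp(logits / T), on which energy-based OOD scoring builds; the paper's T-energy score adapts this to bridge its transport and classifier branches and is not reproduced here.

import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # lower energy -> more in-distribution; higher energy -> more likely OOD
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

scores = energy_score(torch.randn(8, 10))   # one score per sample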
Exploring Tuning Characteristics of Ventral Stream's Neurons for Few-Shot Image Classification
Lintao Dong, Wei Zhai, Zheng-Jun Zha.
AAAI Conference on Artificial Intelligence (AAAI 2023, Oral, Distinguished Paper).
abstract / bibtex
Humans have the remarkable ability of learning novel objects by browsing extremely few examples, which may be attributed to the generic and robust features extracted in the ventral stream of our brain for representing visual objects. In this sense, the tuning characteristics of the ventral stream's neurons can be useful prior knowledge to improve few-shot classification. Specifically, we computationally model two groups of neurons found in the ventral stream which are respectively sensitive to shape cues and color cues. Then we propose a hierarchical feature regularization method with these neuron models to regularize the backbone of a few-shot model, thus making it produce more generic and robust features for few-shot classification. In addition, to simulate the tuning characteristic that neurons fire at a higher rate in response to foreground stimulus elements compared to background elements, which we call belongingness, we design a foreground segmentation algorithm based on the observation that the foreground object usually does not appear at the edge of the picture, and then multiply the foreground mask with the backbone features of the few-shot model. Our method is model-agnostic and can be applied to few-shot models with different backbones, training paradigms, and classifiers.
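A toy rendering of the border observation mentioned above: build a center-prior mask that zeroes the image border and multiply it with the backbone feature map; the actual foreground segmentation algorithm in the paper is more elaborate.

import torch

def center_prior_mask(h: int, w: int, border: int = 2) -> torch.Tensor:
    mask = torch.ones(h, w)
    mask[:border, :] = 0.0      # suppress responses at the top/bottom border
    mask[-border:, :] = 0.0
    mask[:, :border] = 0.0      # and at the left/right border
    mask[:, -border:] = 0.0
    return mask

feats = torch.randn(1, 64, 14, 14)
masked = feats * center_prior_mask(14, 14)   # broadcasts over batch and channels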
2022
Exploring Figure-Ground Assignment Mechanism in Perceptual Organization
Wei Zhai, Yang Cao, Jing Zhang, Zheng-Jun Zha.
Neural Information Processing Systems (NeurIPS 2022).
abstract / bibtex
Perceptual organization is a challenging visual task that aims to perceive and group the individual visual elements so that it is easy to understand the meaning of the scene as a whole. Most recent methods building upon advanced Convolutional Neural Networks (CNN) come from learning discriminative representation and modeling context hierarchically. However, when the visual appearance difference between foreground and background is obscure, the performance of existing methods degrades significantly due to the visual ambiguity in the discrimination process. In this paper, we argue that the figure-ground assignment mechanism, which conforms to human vision cognitive theory, can be explored to empower CNN to achieve a robust perceptual organization despite visual ambiguity. Specifically, we present a novel Figure-Ground-Aided (FGA) module to learn the configural statistics of the visual scene and leverage it for the reduction of visual ambiguity. Particularly, we demonstrate the benefit of using stronger supervisory signals by teaching the FGA module to perceive configural cues, i.e., convexity and lower region, that humans deem important for perceptual organization. Furthermore, an Interactive Enhancement Module (IEM) is devised to leverage such configural priors to assist representation learning, thereby achieving robust perceptual organization with complex visual ambiguities. In addition, a well-founded visual segregation test is designed to validate the capability of the proposed FGA mechanism explicitly. Comprehensive evaluation results demonstrate that our proposed FGA mechanism can effectively enhance the capability of perceptual organization on various baseline models. Moreover, the model augmented via our proposed FGA mechanism also outperforms state-of-the-art approaches on four challenging real-world applications.
Phrase-Based Affordance Detection via Cyclic Bilateral Interaction
Liangsheng Lu#, Wei Zhai#, Hongchen Luo, Kang Yu, Yang Cao.
IEEE Transactions on Artificial Intelligence (T-AI).
abstract / bibtex / arxiv / code
Affordance detection, which refers to perceiving objects with potential action possibilities in images, is a challenging task since the possible affordance depends on the person's purpose in real-world application scenarios. The existing works mainly extract the inherent human–object dependencies from image/video to accommodate affordance properties that change dynamically. In this article, we explore to perceive affordances from a vision-language perspective, and consider the challenging phrase-based affordance detection task, i.e., given a set of phrases describing the potential actions, all the object regions in a scene with the same affordance should be detected. To this end, we propose a cyclic bilateral consistency enhancement network (CBCE-Net) to align language and vision features in a progressive manner. Specifically, the presented CBCE-Net consists of a mutual guided vision-language module that updates the common features of vision and language in a progressive manner, and a cyclic interaction module that facilitates the perception of possible interaction with objects in a cyclic manner. In addition, we extend the public purpose-driven affordance dataset (PAD) by annotating affordance categories with short phrases. The extensive contrastive experimental results demonstrate the superior performance of our method over nine typical methods from four relevant fields in terms of both objective metrics and visual quality.
One-Shot Affordance Detection in the Wild
Wei Zhai#, Hongchen Luo#, Jing Zhang, Yang Cao, Dacheng Tao.
International Journal of Computer Vision (IJCV).
Journal version of "One-Shot Affordance Detection" (IJCAI 2021)
abstract / bibtex / arxiv / code
Affordance detection refers to identifying the potential action possibilities of objects in an image, which is a crucial ability for robot perception and manipulation. To empower robots with this ability in unseen scenarios, we first study the challenging one-shot affordance detection problem in this paper, i.e., given a support image that depicts the action purpose, all objects in a scene with the common affordance should be detected. To this end, we devise a One-Shot Affordance Detection Network (OSAD-Net) that firstly estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images. Through collaboration learning, OSAD-Net can capture the common characteristics between objects having the same underlying affordance and learn a good adaptation capability for perceiving unseen affordances. Besides, we build a large-scale purpose-driven affordance dataset v2 (PADv2) by collecting and labeling 30k images from 39 affordance and 103 object categories. With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods and may also facilitate downstream vision tasks, such as scene understanding, action recognition, and robot manipulation. Specifically, we conducted comprehensive experiments on PADv2 dataset by including 11 advanced models from several related research fields. Experimental results demonstrate the superiority of our model over previous representative ones in terms of both objective metrics and visual quality. The benchmark suite is available at https://github.com/lhc1224/OSAD_Net.
Deep Texton-Coherence Network for Camouflaged Object Detection
Wei Zhai, Yang Cao, Haiyong Xie, Zheng-Jun Zha.
IEEE Transactions on Multimedia (T-MM).
abstract / bibtex
Camouflaged object detection is a challenging visual task since the appearance and morphology of foreground objects and background regions are highly similar in nature. Recent CNN-based studies gradually integrated the high-level semantic information and the low-level local features of images through hierarchical and progressive structures to achieve camouflaged object detection. However, these methods ignore the spatial statistical properties of the local context, which is a critical cue for distinguishing and describing camouflaged objects. To address this problem, we propose a novel Deep Texton-Coherence Network (DTC-Net) that leverages the spatial organization of textons in the foreground and background regions as discriminative cues for camouflaged object detection. Specifically, a Local Bilinear module (LB) is devised to obtain a representation of textons that is robust to trivial details and illumination changes, by replacing the classic first-order linearization operations with bilinear second-order statistical operations in the convolution process. Next, these texton representations are associated with a Spatial Coherence Organization module (SCO) to capture irregular spatial coherence via a deformable convolutional strategy, and then the descriptions of the textons extracted by the LB module are used as weights to suppress features that are spatially adjacent but have different representations. Finally, the texton-coherence representation is integrated with the original features at different levels to achieve camouflaged object detection. Evaluation on the three most challenging camouflaged object detection datasets demonstrates the superiority of the proposed model when compared to the state-of-the-art methods. Furthermore, our ablation studies and performance analyses demonstrate the effectiveness of the texton-coherence module.
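As a stand-in for the second-order statistics the LB module relies on, classic global bilinear pooling (averaged outer products of channel responses) is sketched below; the actual module is local and convolutional, which this simplification does not capture.

import torch

def bilinear_pool(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> (B, C, C) second-order (outer-product) statistics."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

desc = bilinear_pool(torch.randn(2, 32, 16, 16))   # (2, 32, 32)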
Location-Free Camouflage Generation Network
Yangyang Li#, Wei Zhai#, Yang Cao, Zheng-Jun Zha.
IEEE Transactions on Multimedia (T-MM).
abstract / bibtex / arxiv / code
Camouflage is a common visual phenomenon, which refers to hiding foreground objects in background images, making them briefly invisible to the human eye. Previous work has typically been implemented by an iterative optimization process. However, these methods struggle in 1) efficiently generating camouflage images using foreground and background with flexible structure; 2) camouflaging foreground objects to regions with multiple appearances (e.g., the junction of the vegetation and the mountains), which limits their practical application. To address these problems, this paper proposes a novel Location-free Camouflage Generation Network (LCG-Net) that fuses high-level features of the foreground and background images and generates the result by one inference. Specifically, a Position-aligned Structure Fusion (PSF) module is devised to guide structure feature fusion based on the point-to-point structure similarity of foreground and background, and introduce local appearance features point-by-point. To retain the necessary identifiable features, a new immerse loss is adopted under our pipeline, while a background patch appearance loss is utilized to ensure that the hidden objects look continuous and natural at regions with multiple appearances. Experiments show that our method has results as satisfactory as the state-of-the-art in single-appearance regions and is less likely to be completely invisible, but far exceeds the quality of the state-of-the-art in multi-appearance regions. Moreover, our method is hundreds of times faster than previous methods. Benefiting from the unique advantages of our method, we provide some downstream applications for camouflage generation, which show its potential.
Learning Affordance Grounding from Exocentric Images
Hongchen Luo#, Wei Zhai#, Jing Zhang, Yang Cao, and Dacheng Tao.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022).
abstract / bibtex / arxiv / code
Affordance grounding, a task to ground (i.e., localize) action possibility regions in objects, faces the challenge of establishing an explicit link with object parts due to the diversity of interactive affordance. Humans have the ability to transform various exocentric interactions into invariant egocentric affordance so as to counter the impact of interactive diversity. To empower an agent with such ability, this paper proposes a task of affordance grounding from the exocentric view, i.e., given exocentric human-object interaction and egocentric object images, learning the affordance knowledge of the object and transferring it to the egocentric image using only the affordance label as supervision. To this end, we devise a cross-view knowledge transfer framework that extracts affordance-specific features from exocentric interactions and enhances the perception of affordance regions by preserving affordance correlation. Specifically, an Affordance Invariance Mining module is devised to extract specific clues by minimizing the intra-class differences originating from interaction habits in exocentric images. Besides, an Affordance Co-relation Preserving strategy is presented to perceive and localize affordance by aligning the co-relation matrix of predicted results between the two views. Particularly, an affordance grounding dataset named AGD20K is constructed by collecting and labeling over 20K images from 36 affordance categories. Experimental results demonstrate that our method outperforms the representative models in terms of objective metrics and visual quality. Code: github.com/lhc1224/Cross-View-AG.
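A toy version of the Affordance Co-relation Preserving idea: build a class co-relation matrix from each view's predictions and penalize their disagreement. The normalization and distance below are assumptions, not the paper's exact strategy.

import torch
import torch.nn.functional as F

def corelation_matrix(preds: torch.Tensor) -> torch.Tensor:
    """preds: (B, num_classes) predicted affordance scores for one view."""
    p = F.normalize(preds, dim=0)        # normalize each class column over the batch
    return p.t() @ p                     # (num_classes, num_classes) co-relation matrix

def corelation_alignment_loss(exo_preds: torch.Tensor, ego_preds: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(corelation_matrix(exo_preds), corelation_matrix(ego_preds))

loss = corelation_alignment_loss(torch.rand(8, 36), torch.rand(8, 36))   # 36 affordance classes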
Background Activation Suppression for Weakly Supervised Object Localization
Pingyu Wu#, Wei Zhai#, Yang Cao.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022).
abstract / bibtex / arxiv / code
Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels. Recently, a new paradigm has emerged by generating a foreground prediction map (FPM) to achieve the localization task. Existing FPM-based methods use cross-entropy (CE) to evaluate the foreground prediction map and to guide the learning of the generator. We argue for using the activation value to achieve more efficient learning. It is based on the experimental observation that, for a trained network, CE converges to zero when the foreground mask covers only part of the object region, while the activation value increases until the mask expands to the object boundary, which indicates that more object areas can be learned by using the activation value. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint module (AMC) is designed to facilitate the learning of the generator by suppressing the background activation value. Meanwhile, by using the foreground region guidance and the area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. Code and models are available at github.com/wpy1999/BAS.
Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning
Kai Zhu, Wei Zhai, Yang Cao, Jiebo Luo, Zheng-Jun Zha.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022).
abstract / bibtex / arxiv / code
Non-exemplar class-incremental learning is to recognize both the old and new classes when old class samples cannot be saved. It is a challenging task since representation optimization and feature retention can only be achieved under supervision from new classes. To address this problem, we propose a novel self-sustaining representation expansion scheme. Our scheme consists of a structure reorganization strategy that fuses main-branch expansion and side-branch updating to maintain the old features, and a main-branch distillation scheme to transfer the invariant knowledge. Furthermore, a prototype selection mechanism is proposed to enhance the discrimination between the old and new classes by selectively incorporating new samples into the distillation process. Extensive experiments on three benchmarks demonstrate significant incremental performance, outperforming the state-of-the-art methods by a margin of 3%, 3% and 6%, respectively.
Robust Object Detection via Adversarial Novel Style Exploration
Wen Wang, Jing Zhang, Wei Zhai, Yang Cao, Dacheng Tao.
IEEE Transactions on Image Processing (T-IP).
abstract / bibtex
Deep object detection models trained on clean images may not generalize well on degraded images due to the well-known domain shift issue. This hinders their application in real-life scenarios such as video surveillance and autonomous driving. Though domain adaptation methods can adapt the detection model from a labeled source domain to an unlabeled target domain, they struggle in dealing with open and compound degradation types. In this paper, we attempt to address this problem in the context of object detection by proposing a robust object Detector via Adversarial Novel Style Exploration (DANSE). Technically, DANSE first disentangles images into domain-irrelevant content representation and domain-specific style representation under an adversarial learning framework. Then, it explores the style space to discover diverse novel degradation styles that are complementary to those of the target domain images by leveraging a novelty regularizer and a diversity regularizer. The clean source domain images are transferred into these discovered styles by using a content-preserving regularizer to ensure realism. These transferred source domain images are combined with the target domain images and used to train a robust degradation-agnostic object detection model via adversarial domain adaptation. Experiments on both synthetic and real benchmark scenarios confirm the superiority of DANSE over state-of-the-art methods.
2021
One-Shot Affordance Detection
Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao.
International Joint Conference on Artificial Intelligence (IJCAI 2021, Oral).
abstract / bibtex / arxiv / code
Affordance detection refers to identifying the potential action possibilities of objects in an image, which is an important ability for robot perception and manipulation. To empower robots with this ability in unseen scenarios, we consider the challenging one-shot affordance detection problem in this paper, i.e., given a support image that depicts the action purpose, all objects in a scene with the common affordance should be detected. To this end, we devise a One-Shot Affordance Detection (OS-AD) network that firstly estimates the purpose and then transfers it to help detect the common affordance from all candidate images. Through collaboration learning, OS-AD can capture the common characteristics between objects having the same underlying affordance and learn a good adaptation capability for perceiving unseen affordances. Besides, we build a Purpose-driven Affordance Dataset (PAD) by collecting and labeling 4k images from 31 affordance and 72 object categories. Experimental results demonstrate the superiority of our model over previous representative ones in terms of both objective metrics and visual quality. The benchmark suite is at ProjectPage.
A Tri-Attention Enhanced Graph Convolutional Network for Skeleton-Based Action Recognition
Xingming Li, Wei Zhai, Yang Cao.
IET Computer Vision (IET-CV 2021).
abstract / bibtex
Skeleton-based action recognition has recently attracted a lot of research interests due to its advantage in computational efficiency. Some recent work building upon Graph Convolutional Networks (GCNs) has shown promising performance in this task by modelling intrinsic spatial correlations between skeleton joints. However, these methods only consider local properties of action sequences in the spatial-temporal domain, and consequently, are limited in distinguishing complex actions with similar local movements. To address this problem, a novel tri-attention module (TAM) is proposed to guide GCNs to perceive significant variations across local movements. Specifically, the devised TAM is implemented in three steps: i) A dimension permuting unit is proposed to characterise skeleton action sequences in three different domains: body poses, joint trajectories, and evolving projections. ii) A global statistical modelling unit is introduced to aggregate the first-order and second-order properties of global contexts to perceive the significant movement variations of each domain. iii) A fusion unit is presented to integrate the features of these three domains together and leverage as orientation for graph convolution at each layer. Through these three steps, significant-variation frames, joints, and channels can be enhanced. We conduct extensive experiments on two large-scale benchmark datasets, NTU RGB-D and Kinetics-Skeleton. Experimental results demonstrate that the proposed TAM can be easily plugged into existing GCNs and achieve comparable performance with the state-of-the-art methods.
Self-Promoted Prototype Refinement for Few-Shot Class-Incremental Learning
Kai Zhu, Yang Cao, Wei Zhai, Jie Cheng, Zheng-Jun Zha.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021).
abstract / bibtex / arxiv / code
Few-shot class-incremental learning is to recognize the new classes given few samples and not forget the old classes. It is a challenging task since representation optimization and prototype reorganization can only be achieved under little supervision. To address this problem, we propose a novel incremental prototype learning scheme. Our scheme consists of a random episode selection strategy that adapts the feature representation to various generated incremental episodes to enhance the corresponding extensibility, and a self-promoted prototype refinement mechanism which strengthens the expression ability of the new classes by explicitly considering the dependencies among different classes. Particularly, a dynamic relation projection module is proposed to calculate the relation matrix in a shared embedding space and leverage it as the factor for bootstrapping the update of prototypes. Extensive experiments on three benchmark datasets demonstrate the above-par incremental performance, outperforming state-of-the-art methods by a margin of 13%, 17% and 11%, respectively.
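A toy sketch of refining class prototypes with a relation matrix computed in a shared embedding space, in the spirit of the dynamic relation projection module; the softmax mixing rule and temperature are assumptions.

import torch
import torch.nn.functional as F

def refine_prototypes(prototypes: torch.Tensor, temperature: float = 10.0) -> torch.Tensor:
    """prototypes: (num_classes, dim) class prototypes in a shared embedding space."""
    p = F.normalize(prototypes, dim=-1)
    relation = torch.softmax(temperature * (p @ p.t()), dim=-1)   # inter-class dependencies
    return relation @ prototypes                                  # relation-weighted refinement

new_protos = refine_prototypes(torch.randn(15, 64))   # e.g. 15 classes, 64-d embeddings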
2020
Self-Supervised Tuning for Few-Shot Segmentation
Kai Zhu, Wei Zhai, Yang Cao.
International Joint Conference on Artificial Intelligence (IJCAI 2020, Oral).
abstract / bibtex
Few-shot segmentation aims at assigning a category label to each image pixel with few annotated samples. It is a challenging task since the dense prediction can only be achieved under the guidance of latent features defined by sparse annotations. Existing meta-learning methods tend to fail in generating category-specifically discriminative descriptors when the visual features extracted from support images are marginalized in the embedding space. To address this issue, this paper presents an adaptive tuning framework, in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme, augmenting category-specific descriptors for label prediction. Specifically, a novel self-supervised inner loop is first devised as the base learner to extract the underlying semantic features from the support image. Then, gradient maps are calculated by back-propagating the self-supervised loss through the obtained features, and leveraged as guidance for augmenting the corresponding elements in the embedding space. Finally, with the ability to continuously learn from different episodes, an optimization-based meta-learner is adopted as the outer loop of our proposed framework to gradually refine the segmentation results. Extensive experiments on the benchmark PASCAL-5i and COCO-20i datasets demonstrate the superiority of our proposed method over state-of-the-art methods.
Deep Inhomogeneous Regularization for Transfer Learning
Wen Wang, Wei Zhai, Yang Cao.
IEEE International Conference on Image Processing (ICIP 2020).
abstract / bibtex
Fine-tuning is an effective transfer learning method to achieve ideal performance on a target task with limited training data. Some recent works regularize the parameters of deep neural networks for better knowledge transfer. However, these methods enforce homogeneous penalties for all parameters, resulting in catastrophic forgetting or negative transfer. To address this problem, we propose a novel Inhomogeneous Regularization (IR) method that imposes a strong regularization on the parameters of transferable convolutional filters to tackle catastrophic forgetting and alleviates the regularization on the parameters of less transferable filters to tackle negative transfer. Moreover, we use the decaying averaged deviation of parameters from the start point (pre-trained parameters) to accurately measure the transferability of each filter. Evaluation on three challenging benchmark datasets has demonstrated the superiority of the proposed model against state-of-the-art methods.
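A minimal sketch of the inhomogeneous-regularization idea: an L2 pull toward the pre-trained weights whose strength differs per filter, set larger for filters judged more transferable; the transferability measure itself (the decaying averaged deviation) is summarized here as an externally supplied score, an assumption rather than the paper's exact rule.

import torch

def ir_penalty(weights: torch.Tensor, start_weights: torch.Tensor, filter_strengths: torch.Tensor) -> torch.Tensor:
    """weights, start_weights: (out_ch, in_ch, k, k); filter_strengths: (out_ch,) per-filter penalty weights."""
    dev = ((weights - start_weights) ** 2).sum(dim=(1, 2, 3))   # per-filter deviation from the start point
    return (filter_strengths * dev).sum()

w = torch.randn(16, 3, 3, 3, requires_grad=True)   # current filters of one conv layer
w0 = w.detach().clone()                            # pre-trained start point
strengths = torch.rand(16)                         # hypothetical transferability-derived weights
penalty = ir_penalty(w, w0, strengths)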
Deep Structure-Revealed Network for Texture Recognition
Wei Zhai, Yang Cao, Zheng-Jun Zha, HaiYong Xie, Feng Wu.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020, Oral).
abstract / bibtex
Texture recognition is a challenging visual task, since various primitives along with their arrangements can be recognized from the same texture image when it is perceived in different contexts. Some recent works building on CNNs exploit orderless aggregation to provide invariance to spatial arrangements. However, these methods ignore the inherent structural property of textures, which is a critical cue for distinguishing and describing texture images in the wild. To address this problem, we propose a novel Deep Structure-Revealed Network (DSR-Net) that leverages the spatial dependency among captured primitives as a structural representation for texture recognition. Specifically, a primitive capturing module (PCM) is devised to generate multiple primitives from eight directional spatial contexts, in which deep features are first extracted under the constraints of a direction map and then encoded based on the similarities of their neighborhoods. Next, these primitives are associated by a dependence learning module (DLM) to generate the structural representation, in which a two-way collaborative relationship strategy is introduced to perceive the spatial dependencies among multiple primitives. Finally, the structure-revealed texture representations are integrated with spatially ordered information to achieve real-world texture recognition. Evaluation on the five most challenging texture recognition datasets demonstrates the superiority of the proposed model against state-of-the-art methods. The structure-revealing capability of DSR-Net is further verified in additional experiments, including fine-grained classification and semantic segmentation.
One-Shot Texture Retrieval Using Global Grouping Metric
Kai Zhu, Yang Cao, Wei Zhai, Zheng-Jun Zha.
IEEE Transactions on Multimedia (T-MM 2020).
Journal version of "One-Shot Texture Retrieval with Global Context Metric" (IJCAI 2019)
abstract / bibtex
Texture retrieval is widely used in the fields of fashion and e-commerce. This paper addresses the problem of one-shot texture retrieval: given an example of a new reference texture, we aim to detect and segment all pixels of the same texture category within an arbitrary image. To address this problem, an OS-TR network is proposed to encode both reference and query images into a texture representation space, where comparison is performed based on global grouping information. Because the learned texture representation should be invariant to spatial layout while preserving rough semantic concepts, we introduce an adaptive directionality-aware module to finely discriminate orderless texture details. To make full use of global context information given only a few examples, we incorporate a grouping-attention mechanism into the relation network, resulting in per-channel modulation of the local relation features. Extensive experiments on two benchmark datasets (i.e., DTD and ADE20K) and real scenarios demonstrate that our proposed method achieves above-par segmentation performance and robust generalization across domains.
2019
Deep Multiple-Attribute-Perceived Network for Real-World Texture Recognition
Wei Zhai, Yang Cao, Jing Zhang, Zheng-Jun Zha.
IEEE/CVF International Conference on Computer Vision (ICCV 2019).
abstract / bibtex
Texture recognition is a challenging visual task, as multiple perceptual attributes may be perceived from the same texture image when it is combined with different spatial contexts. Some recent works building upon Convolutional Neural Networks (CNNs) incorporate feature encoding with orderless aggregation to provide invariance to spatial layouts. However, these methods ignore visual texture attributes, which are important cues for describing real-world texture images, resulting in incomplete descriptions and inaccurate recognition. To address this problem, we propose a novel deep Multiple-Attribute-Perceived Network (MAP-Net) that progressively learns visual texture attributes in a mutually reinforced manner. Specifically, a multi-branch network architecture is devised, in which cascaded global contexts are learned by introducing a similarity constraint at each branch and leveraged as guidance for spatial feature encoding at the next branch through an attribute transfer scheme. To enhance the modeling capability of spatial transformation, a deformable pooling strategy is introduced to augment the spatial sampling with adaptive offsets to the global context, enabling the perception of new visual attributes. An attribute fusion module is then introduced to jointly utilize the perceived visual attributes and the abstracted semantic concepts at each branch. Experimental results on the five most challenging texture recognition datasets demonstrate the superiority of the proposed model against state-of-the-art methods.
One-Shot Texture Retrieval with Global Context Metric
Kai Zhu, Wei Zhai, Zheng-Jun Zha, Yang Cao.
International Joint Conferences on Artificial Intelligence Organization (IJCAI 2019, Oral).
abstract / bibtex
In this paper, we tackle one-shot texture retrieval: given an example of a new reference texture, detect and segment all pixels of the same texture category within an arbitrary image. To address this problem, we present an OS-TR network that encodes both reference and query images, enabling texture segmentation for the reference category. Unlike existing texture encoding methods that integrate CNNs with orderless pooling, we propose a directionality-aware module to capture texture variations in each direction, resulting in a spatially invariant representation. To segment new categories given only a few examples, we incorporate a self-gating mechanism into the relation network to exploit global context information for adjusting the per-channel modulation weights of local relation features. Extensive experiments on benchmark texture datasets and real scenarios demonstrate the above-par segmentation performance of our proposed method and its robust generalization across domains.
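The per-channel modulation can be summarized with a tiny squeeze-and-excitation-style gate. This sketch is only a hypothetical approximation of the self-gating mechanism, not the released module; the layer sizes and reduction ratio are assumptions.

    import torch
    import torch.nn as nn

    class SelfGating(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                    nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, relation_feats):                        # (B, C, H, W) local relation features
            context = relation_feats.mean(dim=(2, 3))             # global context per channel
            gate = self.fc(context).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) modulation weights
            return relation_feats * gate                          # per-channel modulation

    x = torch.randn(2, 64, 32, 32)
    print(SelfGating(64)(x).shape)  # torch.Size([2, 64, 32, 32])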
PixTextGAN: Structure Aware Text Image Synthesis for License Plate Recognition
Shilian Wu, Wei Zhai, Yang Cao.
IET Image Processing (IET-IP 2019).
abstract / bibtex
Rapid progress in text image recognition has been achieved with the development of deep-learning techniques. However, comprehensive license plate recognition in real scenes remains a great challenge, since there are no publicly available large and diverse datasets for training deep learning models. This paper aims at synthesising license plate images with generative adversarial networks (GANs), refraining from collecting a vast amount of labelled data. The authors thus propose PixTextGAN, a controllable architecture that generates specific character structures for different text regions, producing synthetic license plate images with plausible text details. Specifically, a comprehensive structure-aware loss function is presented to preserve the key characteristics of each character region and thus achieve appearance adaptation for better recognition. Qualitative and quantitative experiments demonstrate the superiority of the authors' proposed method in text image synthesis over state-of-the-art GANs. Further experimental results of license plate recognition on the ReId and CCPD datasets demonstrate that using images synthesised by PixTextGAN can greatly improve recognition accuracy.
2018
A Generative Adversarial Network Based Framework for Unsupervised Visual Surface Inspection
Wei Zhai, Jiang Zhu, Yang Cao, Zengfu Wang.
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018, Oral).
abstract / bibtex
Visual surface inspection is a challenging task due to the highly inconsistent appearance of the target surfaces and the abnormal regions. Most state-of-the-art methods are highly dependent on labelled training samples, which are difficult to collect in practical industrial applications. To address this problem, we propose a generative adversarial network (GAN) based framework for unsupervised surface inspection. The GAN is trained to generate fake images analogous to normal surface images, which implies that a well-trained GAN learns a good representation of normal surface images in a latent feature space. Consequently, the discriminator of the GAN can naturally serve as a one-class classifier. We use the first three convolutional layers of the discriminator as the feature extractor, whose responses are sensitive to abnormal regions. In particular, a multi-scale fusion strategy is adopted to fuse the responses of the three convolutional layers and thus improve the segmentation performance of abnormality detection. Various experimental results demonstrate the effectiveness of our proposed method.
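The discriminator-as-feature-extractor idea can be sketched as follows. The reference statistics (mean responses on normal images) and the simple absolute-difference multi-scale fusion are assumptions for illustration, not the paper's exact scoring rule.

    import torch
    import torch.nn.functional as F

    def anomaly_map(discriminator_convs, image, ref_stats):
        # discriminator_convs: the first three conv blocks of a trained discriminator
        # ref_stats: per-layer reference feature maps computed on normal images
        maps, x = [], image
        for conv, ref in zip(discriminator_convs, ref_stats):
            x = conv(x)
            diff = (x - ref).abs().mean(dim=1, keepdim=True)       # deviation of responses from normal
            maps.append(F.interpolate(diff, size=image.shape[-2:], mode="bilinear",
                                      align_corners=False))
        return torch.stack(maps, dim=0).mean(dim=0)                # multi-scale fusion of the three layers

    convs = [torch.nn.Sequential(torch.nn.Conv2d(c_in, c_out, 3, 2, 1), torch.nn.LeakyReLU(0.2))
             for c_in, c_out in [(1, 16), (16, 32), (32, 64)]]
    img = torch.randn(1, 1, 64, 64)
    refs, x = [], img
    for conv in convs:
        x = conv(x)
        refs.append(torch.zeros_like(x))                           # placeholder normal-image statistics
    print(anomaly_map(convs, img, refs).shape)                     # torch.Size([1, 1, 64, 64])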
Co-Occurrent Structural Edge Detection for Color-Guided Depth Map Super-Resolution
Jiang Zhu, Wei Zhai, Yang Cao, Zheng-Jun Zha.
International Conference on Multimedia Modeling (MMM 2018, Oral).
abstract / bibtex
Although RGBD cameras can provide depth information of real scenes, the captured depth map is often of low resolution and insufficient quality compared to the color image. Most existing methods assume that edges in the depth map and its corresponding color image are likely to occur simultaneously. However, when the color image is rich in detail, high-frequency information that does not exist in the depth map will be introduced into it. In this paper, we propose a CNN-based method to detect co-occurrent structural edges for color-guided depth map super-resolution. First, we design an edge detection convolutional neural network (CNN) to obtain the co-occurrent structural edges of the depth map and its corresponding color image. Then we feed the obtained co-occurrent structural edges and the interpolated low-resolution depth map into another customized CNN for depth map super-resolution. The presented scheme can effectively interpret and exploit the structural correlation between the depth map and the color image. Additionally, recursive learning is adopted to reduce the parameters of the customized super-resolution CNN and avoid overfitting. Experimental results demonstrate the effectiveness and reliability of our proposed approach in comparison with state-of-the-art methods.
Pre-prints
EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
Bohao Liao, Wei Zhai, Zengyu Wan, Tianzhu Zhang, Yang Cao, Zheng-Jun Zha.
Arxiv.
abstract / bibtex / code
Scene reconstruction from casually captured videos has wide applications in real-world scenarios. With recent advancements in differentiable rendering techniques, several methods have attempted to simultaneously optimize scene representations (NeRF or 3DGS) and camera poses. Despite recent progress, existing methods relying on traditional camera input tend to fail in high-speed (or, equivalently, low-frame-rate) scenarios. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event camera to aid scene reconstruction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS (EF-3DGS), which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, supervising the rendered views observed by the event stream. Second, we adopt the Contrast Maximization (CMax) framework in a piece-wise manner to extract motion information by maximizing the contrast of the Image of Warped Events (IWE), thereby calibrating the estimated poses. In addition, based on the Linear Event Generation Model (LEGM), the brightness information encoded in the IWE is also utilized to constrain the 3DGS in the gradient domain. Third, to mitigate the absence of color information in events, we introduce photometric bundle adjustment (PBA) to ensure view consistency across events and frames. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS.
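The Contrast Maximization step can be illustrated with a toy NumPy sketch: warp events with a candidate motion, accumulate an Image of Warped Events, and score the candidate by its variance. The pure-translation motion model and event layout here are simplifying assumptions, not EF-3DGS's piece-wise formulation.

    import numpy as np

    def iwe_contrast(events, velocity, t_ref, height, width):
        # events: array of rows (x, y, t, polarity); velocity: (vx, vy) in pixels/second
        x = events[:, 0] - velocity[0] * (events[:, 2] - t_ref)   # warp events to the reference time
        y = events[:, 1] - velocity[1] * (events[:, 2] - t_ref)
        xi, yi = np.round(x).astype(int), np.round(y).astype(int)
        valid = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
        iwe = np.zeros((height, width))
        np.add.at(iwe, (yi[valid], xi[valid]), 1.0)               # accumulate the Image of Warped Events
        return iwe.var()                                          # higher contrast = sharper motion fit

    rng = np.random.default_rng(0)
    ev = np.stack([rng.uniform(0, 64, 1000), rng.uniform(0, 64, 1000),
                   rng.uniform(0, 0.1, 1000), rng.integers(0, 2, 1000)], axis=1)
    print(iwe_contrast(ev, velocity=(10.0, 0.0), t_ref=0.0, height=64, width=64))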
Visual-Geometric Collaborative Guidance for Affordance Learning
Hongchen Luo, Wei Zhai, Jiao Wang, Yang Cao, Zheng-Jun Zha.
Arxiv.
Journal version of "Leverage Interactive Affinity for Affordance Learning" (CVPR 2023)
abstract / bibtex / code
Perceiving potential "action possibilities" (i.e., affordance) regions of images and learning the interactive functionalities of objects from human demonstration is a challenging task due to the diversity of human-object interactions. Prevailing affordance learning algorithms often adopt the label assignment paradigm and presume that there is a unique relationship between functional region and affordance label, yielding poor performance when adapting to unseen environments with large appearance variations. In this paper, we propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interaction and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities. To this end, we propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues to jointly excavate interactive affinity from human-object interactions. In addition, a contact-driven affordance learning (CAL) dataset is constructed by collecting and labeling over 55,047 images. Experimental results demonstrate that our method outperforms representative models in terms of objective metrics and visual quality.
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang, Dacheng Yin, Yizhou Zhou, Fengyun Rao, Wei Zhai, Yang Cao, Zheng-Jun Zha.
Arxiv.
abstract / bibtex
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods inevitably suffer from loss of image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone by employing a lightweight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone's hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding objectives. Through extensive evaluations on 18 image understanding benchmarks, MMAR demonstrates much superior performance to other joint multi-modal models, matching the method that employs a pretrained CLIP vision encoder while being able to generate high-quality images at the same time. We also show that our method is scalable with larger data and model sizes.
VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
Huilin Deng, Hongchen Luo, Wei Zhai, Yang Cao, Yu Kang.
Arxiv.
abstract / bibtex
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects by establishing feature mappings between textual prompts and inspection images, demonstrating excellent research value in flexible industrial manufacturing. However, existing ZSAD methods are limited by closed-world settings, struggling to handle unseen defects with predefined prompts. Recently, adapting Multimodal Large Language Models (MLLMs) for Industrial Anomaly Detection (IAD) has emerged as a viable solution. Unlike fixed-prompt methods, MLLMs exhibit a generative paradigm with open-ended text interpretation, enabling more adaptive anomaly analysis. However, this adaptation faces inherent challenges, as anomalies often manifest in fine-grained regions and exhibit minimal visual discrepancies from normal samples. To address these challenges, we propose VMAD (Visual-enhanced MLLM Anomaly Detection), a novel framework that enhances an MLLM with visual-based IAD knowledge and fine-grained perception, simultaneously providing precise detection and comprehensive analysis of anomalies. Specifically, we design a Defect-Sensitive Structure Learning scheme that transfers patch-similarity cues from the visual branch to the MLLM for improved anomaly discrimination. Besides, we introduce a novel visual projector, Locality-enhanced Token Compression, which mines multi-level features in local contexts to enhance fine-grained detection. Furthermore, we introduce Real Industrial Anomaly Detection (RIAD), a comprehensive IAD dataset with detailed anomaly descriptions and analyses, offering a valuable resource for MLLM-based IAD development. Extensive experiments on zero-shot benchmarks, including the MVTec-AD, VisA, WFDD, and RIAD datasets, demonstrate our superior performance over state-of-the-art methods.
Grounding 3D Scene Affordance From Egocentric Interactions
Cuiyu Liu, Wei Zhai, Yuhang Yang, Hongchen Luo, Sen Liang, Yang Cao, Zheng-Jun Zha.
Arxiv.
abstract / bibtex
Grounding 3D scene affordance aims to locate interactive regions in 3D environments, which is crucial for embodied agents to interact intelligently with their surroundings. Most existing approaches achieve this by mapping semantics to 3D instances based on static geometric structure and visual appearance. This passive strategy limits the agent’s ability to actively perceive and engage with the environment, making it reliant on predefined semantic instructions. In contrast, humans develop complex interaction skills by observing and imitating how others interact with their surroundings. To empower the model with such abilities, we introduce a novel task: grounding 3D scene affordance from egocentric interactions, where the goal is to identify the corresponding affordance regions in a 3D scene based on an egocentric video of an interaction. This task faces the challenges of spatial complexity and alignment complexity across multiple sources. To address these challenges, we propose the Egocentric Interaction-driven 3D Scene Affordance Grounding (Ego-SAG) framework, which utilizes interaction intent to guide the model in focusing on interaction-relevant sub-regions and aligns affordance features from different sources through a bidirectional query decoder mechanism. Furthermore, we introduce the Egocentric Video-3D Scene Affordance Dataset (VSAD), covering a wide range of common interaction types and diverse 3D environments to support this task. Extensive experiments on VSAD validate both the feasibility of the proposed task and the effectiveness of our approach.
PEAR: Phrase-Based Hand-Object Interaction Anticipation
Zichen Zhang, Hongchen Luo, Wei Zhai, Yang Cao, Yu Kang.
Arxiv.
abstract / bibtex
First-person hand-object interaction anticipation aims to predict the interaction process over a forthcoming period based on current scenes and prompts. This capability is crucial for embodied intelligence and human-robot collaboration. The complete interaction process involves both pre-contact interaction intention (i.e., hand motion trends and interaction hotspots) and post-contact interaction manipulation (i.e., manipulation trajectories and hand poses with contact). Existing research typically anticipates only interaction intention while neglecting manipulation, resulting in incomplete predictions and an increased likelihood of intention errors due to the lack of manipulation constraints. To address this, we propose a novel model, PEAR (Phrase-Based Hand-Object Interaction Anticipation), which jointly anticipates interaction intention and manipulation. To handle uncertainties in the interaction process, we employ a twofold approach. Firstly, we perform cross-alignment of verbs, nouns, and images to reduce the diversity of hand movement patterns and object functional attributes, thereby mitigating intention uncertainty. Secondly, we establish bidirectional constraints between intention and manipulation using dynamic integration and residual connections, ensuring consistency among elements and thus overcoming manipulation uncertainty. To rigorously evaluate the performance of the proposed model, we collect a new task-relevant dataset, EGO-HOIP, with comprehensive annotations. Extensive experimental results demonstrate the superiority of our method.
ViViD: Video Virtual Try-on using Diffusion Models
Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha.
Arxiv.
abstract / bibtex / code
Video virtual try-on aims to transfer a clothing item onto the video of a target person. Directly applying image-based try-on techniques to the video domain in a frame-wise manner causes temporally inconsistent outcomes, while previous video-based try-on solutions can only generate results of low visual quality with blurring. In this work, we present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture, and insert hierarchical Temporal Modules into the text-to-image Stable Diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset, the largest for video virtual try-on to date, with the most diverse garment types and the highest resolution. Extensive experiments demonstrate that our approach yields satisfactory video try-on results. The dataset, code, and weights will be publicly available.
Intention-driven Ego-to-Exo Video Generation
Hongchen Luo, Kai Zhu, Wei Zhai, Yang Cao.
Arxiv.
abstract / bibtex
Ego-to-exo video generation refers to generating the corresponding exocentric video from an egocentric video, providing valuable applications in AR/VR and embodied AI. Benefiting from advances in diffusion model techniques, notable progress has been made in video generation. However, existing methods build upon spatiotemporal consistency assumptions between adjacent frames, which cannot be satisfied in ego-to-exo scenarios due to drastic changes in view. To this end, this paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action intention, consisting of human movement and action description, as a view-independent representation to guide video generation, preserving the consistency of content and motion. Specifically, the egocentric head trajectory is first estimated through multi-view stereo matching. Then, a cross-view feature perception module is introduced to establish correspondences between exo- and ego-views, guiding the trajectory transformation module to infer full-body human movement from the head trajectory. Meanwhile, we present an action description unit that maps the action semantics into a feature space consistent with the exocentric image. Finally, the inferred human movement and high-level action descriptions jointly guide the generation of exocentric motion and interaction content (i.e., the corresponding optical flow and occlusion maps) in the backward process of the diffusion model, which are ultimately warped into the corresponding exocentric video. We conduct extensive experiments on a relevant dataset with diverse exo-ego video pairs, and IDE outperforms state-of-the-art models in both subjective and objective assessments, demonstrating its efficacy in ego-to-exo video generation.
Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection
Fan Lu, Kai Zhu, Kecheng Zheng, Wei Zhai, Yang Cao.
Arxiv.
abstract / bibtex / code
Full-spectrum out-of-distribution (F-OOD) detection aims to accurately recognize in-distribution (ID) samples while encountering semantic and covariate shifts simultaneously. However, existing out-of-distribution (OOD) detectors tend to overfit the covariance information and ignore intrinsic semantic correlation, inadequate for adapting to complex domain transformations. To address this issue, we propose a Likelihood-Aware Semantic Alignment (LSA) framework to promote the image-text correspondence into semantically high-likelihood regions. LSA consists of an offline Gaussian sampling strategy which efficiently samples semantic-relevant visual embeddings from the class-conditional Gaussian distribution, and a bidirectional prompt customization mechanism that adjusts both ID-related and negative context for discriminative ID/OOD boundary. Extensive experiments demonstrate the remarkable OOD detection performance of our proposed LSA especially on the intractable Near-OOD setting, surpassing existing methods by a margin of 15.26% and 18.88% on two F-OOD benchmarks, respectively.
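The offline Gaussian sampling component can be pictured with a short sketch; the shapes, the covariance regularization term, and the per-class sample count are assumptions for illustration rather than the official LSA implementation.

    import torch

    def sample_class_conditional(embeddings, labels, num_classes, n_per_class=16, eps=1e-3):
        # embeddings: (N, D) in-distribution visual features; returns (num_classes * n_per_class, D)
        samples = []
        for c in range(num_classes):
            feats = embeddings[labels == c]
            mean = feats.mean(dim=0)
            cov = torch.cov(feats.t()) + eps * torch.eye(feats.shape[1])   # regularized class covariance
            dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
            samples.append(dist.sample((n_per_class,)))                    # semantic-relevant embeddings
        return torch.cat(samples, dim=0)

    emb = torch.randn(500, 32)
    lab = torch.randint(0, 5, (500,))
    print(sample_class_conditional(emb, lab, num_classes=5).shape)  # torch.Size([80, 32])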
Professional Activities
Conference Reviewer:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
IEEE International Conference on Computer Vision (ICCV)
European Conference on Computer Vision (ECCV)
Neural Information Processing Systems (NeurIPS)
International Conference on Learning Representations (ICLR)
AAAI Conference on Artificial Intelligence (AAAI)
ACM Multimedia (ACM MM)
International Joint Conferences on Artificial Intelligence Organization (IJCAI)
Journal Reviewer:
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)
International Journal of Computer Vision (IJCV)
IEEE Transactions on Image Processing (T-IP)
IEEE Transactions on Neural Networks and Learning Systems (T-NNLS)
IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT)
IEEE Transactions on Multimedia (T-MM)
Pattern Recognition (PR)
ACM Transactions on Multimedia Computing, Communications, and Applications (ToMM)
Awards and Honors
3D Contact Estimation Challenge-RHOBIN2024 CVPR Workshop, 2nd Place, 2024
Event-based Eye Tracking-AIS2024 CVPR Workshop, 1st Place, 2024
NTIRE 2024 Efficient Super-Resolution Challenge, 2nd Place, 2024
AAAI Distinguished Paper, 2023
Outstanding Internship at JD Explore Academy, 2021
National Scholarship (University of Science and Technology of China), 2019
Outstanding Graduate of Southwest Jiaotong University, 2017
National Scholarship (Southwest Jiaotong University), 2016
Teaching Assistant
- Image Processing (Autumn, 2019)
- Computer Vision (Autumn, 2020)
Website adapted from Saurabh Gupta