Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning


Northwestern Polytechnical University
National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology


Abstract

Visible-Infrared Person Re-identification (VI ReID) aims to match visible and infrared images of the same pedestrians across non-overlapping camera views. These two input modalities contain both modality-invariant information, such as shape, and modality-specific details, such as color. An ideal model should utilize valuable information from both modalities during training for enhanced representational capability. However, the gap caused by modality-specific information poses substantial challenges for a VI ReID model that must handle distinct modality inputs simultaneously. To address this, we introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network, designed to effectively utilize both invariant and specific information for identification. Specifically, our MIP model is built on the transformer architecture. In this model, we design a series of modality-specific prompts, which enable our model to adapt to and make use of the specific information inherent in different modality inputs, thereby reducing the interference caused by the modality gap and achieving better identification. In addition, we employ each pedestrian feature to construct a group of instance-specific prompts. These customized prompts guide our model to adapt to each pedestrian instance dynamically, thereby capturing identity-level discriminative clues for identification. Through extensive experiments on the SYSU-MM01 and RegDB datasets, we evaluate the effectiveness of both designed modules. Our proposed MIP also performs better than most state-of-the-art methods.



Modality- and Instance-aware Visual Prompts (MIP) Network

We introduce the MIP network, designed for the VI ReID task. Our primary focus is on adapting the model to different modality and instance inputs, thereby mining the correspondences between different modalities and instances to facilitate VI ReID. We achieve these goals by employing two distinct sets of visual prompts. Specifically, we produce modality-specific prompts and instance-specific prompts according to the current modality and instance input, and these two sets of prompts are concatenated after the feature embedding. By making use of the modality-specific and instance-specific information preserved in these prompts, the model can explore potential correspondences between different modalities and instances, thereby facilitating VI ReID.
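
The sketch below illustrates how the two prompt groups could be appended to the visual token sequence before a transformer layer. It is a minimal illustration under assumed shapes and names (d_model, n_modality_prompts, PromptedLayerInput), not the authors' released implementation.

# Minimal sketch (assumptions noted above): modality-specific and
# instance-specific prompts are concatenated after the feature embedding.
import torch
import torch.nn as nn

class PromptedLayerInput(nn.Module):
    def __init__(self, d_model=768, n_modality_prompts=4, n_modalities=2):
        super().__init__()
        # One learnable prompt group per modality (e.g., 0 = visible, 1 = infrared).
        self.modality_prompts = nn.Parameter(
            torch.randn(n_modalities, n_modality_prompts, d_model) * 0.02
        )

    def forward(self, patch_tokens, modality_label, instance_prompts):
        # patch_tokens:     (B, N, d_model) visual embeddings ([CLS] + patches)
        # modality_label:   (B,) index selecting the modality-specific prompts
        # instance_prompts: (B, P, d_model) prompts built from the pedestrian feature
        mod_prompts = self.modality_prompts[modality_label]  # (B, n_modality_prompts, d_model)
        # Concatenate both prompt groups after the feature embedding.
        return torch.cat([patch_tokens, mod_prompts, instance_prompts], dim=1)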


MIP Overall Framework

The overall framework of our proposed MIP network consists of a backbone model and two major modules. (a) A pre-trained vision transformer is used as the backbone model. (b) The Modality-aware Prompts Learning (MPL) module produces modality-specific prompts for the input visual embeddings of each layer according to the modality labels of the input images. (c) The Instance-aware Prompts Generator (IPG) module generates instance-specific prompts, which are supervised by our proposed Instance-aware Enhancement Loss (IAEL loss). These two kinds of prompts help the backbone network adapt to different modality and instance inputs.
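
As a hedged illustration of the IPG idea described above, the sketch below maps a pedestrian feature (e.g., an intermediate [CLS] token) into a small group of instance-specific prompts. The two-layer MLP, the layer sizes, and the names (InstancePromptGenerator, n_instance_prompts) are assumptions made for illustration; the IAEL supervision applied to the generated prompts is omitted here.

# Hedged sketch of an instance-aware prompt generator (assumed design,
# not the released IPG implementation).
import torch
import torch.nn as nn

class InstancePromptGenerator(nn.Module):
    def __init__(self, d_model=768, n_instance_prompts=4, hidden=512):
        super().__init__()
        self.n_prompts = n_instance_prompts
        self.d_model = d_model
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_instance_prompts * d_model),
        )

    def forward(self, pedestrian_feature):
        # pedestrian_feature: (B, d_model) feature of the current instance
        prompts = self.mlp(pedestrian_feature)                 # (B, P * d_model)
        return prompts.view(-1, self.n_prompts, self.d_model)  # (B, P, d_model)

# Example usage (shapes only):
# feat = torch.randn(8, 768)
# inst_prompts = InstancePromptGenerator()(feat)  # (8, 4, 768), concatenated as above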



Paper and Code

    Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning

Ruiqi Wu*, Bingliang Jiao*, Wenxuan Wang, Meng Liu, Peng Wang

*Equal contribution; Corresponding author.

ICMR 2024, Oral

[Paper] [Bibtex] [arXiv] [GitHub]



Results



Comparison Results


Experimental results of our MIP and other state-of-the-art methods under various test modes on the SYSU-MM01 and RegDB datasets. In summary, our proposed MIP outperforms other state-of-the-art methods on both mainstream datasets.




Visualization Results


Visualization of the attention maps from our MPL module and the baseline model. From the second column in each case, we can see that the baseline model tends to capture only the explicit correspondence between different modality inputs, e.g., focusing only on the upper dress while ignoring the skirt in case (a). In contrast, with the carefully designed modality-specific prompts, our MPL module effectively adapts to and makes use of the modality-specific information, which enables it to explore and capture the implicit correspondence on the skirt region. Case (b) shows a similar result.



Acknowledgements

The website is modified from this template.