Congratulations! On September 26, 2024, eight papers from VIPL were accepted by NeurIPS 2024. NeurIPS, officially the Annual Conference on Neural Information Processing Systems, is a top-tier conference in the field of artificial intelligence. The conference will be held in Vancouver, Canada, from December 9 to December 15, 2024.
1. Rethinking the Evaluation of Out-of-Distribution Detection: A Sorites Paradox (Xingming Long, Jie Zhang, Shiguang Shan, Xilin Chen)
Most existing out-of-distribution (OOD) detection benchmarks classify samples with novel labels as OOD data. However, some marginal OOD samples actually have semantic content close to that of the in-distribution (ID) samples, which makes determining whether a sample is OOD a Sorites Paradox. In this paper, we construct a benchmark named Incremental Shift OOD (IS-OOD) to address this issue, in which we divide the test samples into subsets with different degrees of semantic and covariate shift relative to the ID dataset. The data division is achieved through a shift measuring method based on our proposed Language Aligned Image feature Decomposition (LAID). Moreover, we construct a Synthetic Incremental Shift (Syn-IS) dataset that contains high-quality generated images with more diverse covariate contents to complement the IS-OOD benchmark. We evaluate current OOD detection methods on our benchmark and find several important insights: (1) The performance of most OOD detection methods significantly improves as the semantic shift increases; (2) Some methods, such as GradNorm, may rely on different OOD detection mechanisms, as they depend less on semantic shifts to make decisions; (3) Excessive covariate shifts in the image are also likely to be considered OOD by some methods.
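To make the decomposition idea concrete, here is a minimal sketch, assuming LAID projects an image feature onto the subspace spanned by language (class-prompt) embeddings to obtain the semantic part and keeps the residual as the covariate part; all names are hypothetical and the paper's actual method may differ:

```python
# A minimal sketch (not the authors' code) of a language-aligned
# feature decomposition: semantic = projection onto the span of the
# text embeddings, covariate = the language-orthogonal residual.
import torch

def laid_decompose(img_feat: torch.Tensor, text_feats: torch.Tensor):
    """img_feat: (d,) image embedding; text_feats: (k, d) class-prompt embeddings."""
    # Orthonormal basis of the language-aligned subspace via QR.
    q, _ = torch.linalg.qr(text_feats.T)      # (d, k)
    semantic = q @ (q.T @ img_feat)           # projection onto span(text_feats)
    covariate = img_feat - semantic           # residual component
    return semantic, covariate

# Shift degrees of a test sample relative to the ID set could then be
# scored by comparing the norms of the two components.
```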
2. UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models (Jiachen Liang, Ruibing Hou, Minyang Hu, Hong Chang, Shiguang Shan, Xilin Chen)
Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities, but they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which can be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models. However, we identify inherent model bias within CLIP, notably in its visual and text encoders. Specifically, we observe that CLIP's visual encoder tends to prioritize encoding domain over discriminative category information, while its text encoder exhibits a preference for domain-relevant classes. To mitigate this model bias, we propose a training-free and label-free feature calibration method, Unsupervised Multi-domain Feature Calibration (UMFC). Specifically, UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition. These biases are then subtracted from the original image and text features separately to render them domain-invariant. We evaluate our method in multiple settings, including transductive learning and test-time adaptation. Extensive experiments show that our method outperforms CLIP and performs on par with state-of-the-art methods that require additional annotations or optimization.
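A hedged sketch of the calibration step, assuming the image-level bias is estimated as a per-domain feature mean and the text-level bias as the averaged domain direction; variable names are illustrative, not from the UMFC code:

```python
# Training-free, label-free feature calibration: subtract estimated
# domain biases from image and text features to make them domain-invariant.
import torch

def calibrate(image_feats: torch.Tensor, text_feats: torch.Tensor,
              domain_ids: torch.Tensor):
    """image_feats: (n, d); text_feats: (c, d); domain_ids: (n,) cluster labels
    obtained e.g. by unsupervised clustering of the unlabeled multi-domain data."""
    calibrated_imgs = image_feats.clone()
    domain_means = []
    for dom in domain_ids.unique():
        mask = domain_ids == dom
        mu = image_feats[mask].mean(dim=0)          # image-level (domain) bias
        calibrated_imgs[mask] = image_feats[mask] - mu
        domain_means.append(mu)
    # Text-level bias: an averaged domain direction, subtracted from
    # every class embedding.
    text_bias = torch.stack(domain_means).mean(dim=0)
    calibrated_texts = text_feats - text_bias
    return calibrated_imgs, calibrated_texts
```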
3. M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation (Mingshuang Luo, Ruibing Hou, Zhuo Li, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan)
This paper presents M3GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M3GPT operates on three fundamental principles. The first is creating a unified representation space for various motion-relevant modalities: we employ discrete vector quantization for multimodal control and generation signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second is modeling motion generation directly in the raw motion space, which circumvents the information loss associated with discrete tokenizers and yields more detailed and comprehensive motion generation. Third, M3GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the modality most familiar to and best understood by LLMs, is used as a bridge to connect different motion tasks, facilitating mutual reinforcement. To our knowledge, M3GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M3GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities on extremely challenging tasks.
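As an illustration of the shared-vocabulary idea, a toy vector quantizer that maps continuous motion frames to discrete token ids might look like the following; this is our own sketch, not the M3GPT tokenizer:

```python
# Illustrative sketch: a vector quantizer turns continuous motion
# features into discrete codes that can be appended to an LLM's
# token vocabulary alongside text and music codes.
import torch
import torch.nn as nn

class MotionVQ(nn.Module):
    def __init__(self, codebook_size: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def tokenize(self, motion: torch.Tensor) -> torch.Tensor:
        """motion: (t, dim) continuous features -> (t,) discrete token ids."""
        dists = torch.cdist(motion, self.codebook.weight)  # (t, codebook_size)
        return dists.argmin(dim=-1)

# Motion, music, and text tokens can then share one LLM vocabulary,
# e.g. by reserving ids [base, base + codebook_size) for motion codes.
```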
4. Expanding Sparse Tuning for Low Memory Usage (Shufan Shen, Junshu Sun, Xiangyang Ji, Qingming Huang, Shuhui Wang)
Parameter-efficient fine-tuning (PEFT) is an effective method for adapting pre-trained vision models to downstream tasks by tuning a small subset of parameters. Among PEFT methods, sparse tuning achieves superior performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the whole weight matrix. However, this performance improvement comes at the cost of increased memory usage, which stems from two factors: storing the whole weight matrix as learnable parameters in the optimizer and additionally storing the indexes of the tunable weights. In this paper, we propose a method named SNELL (Sparse tuning with kerNELized LoRA) to enable sparse tuning with low memory usage. To achieve this, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices, avoiding the costly storage of the whole original matrix. Furthermore, a competition-based sparsification mechanism is proposed to avoid storing the indexes of the tunable weights. To maintain the effectiveness of sparse tuning with low-rank matrices, we extend the low-rank decomposition from a kernel perspective. Specifically, we apply nonlinear kernel functions to the whole-matrix merging, increasing the rank of the merged matrix. Higher ranks enhance the ability of SNELL to adapt pre-trained models to downstream tasks. Extensive experiments on multiple downstream tasks show that SNELL achieves state-of-the-art performance with low memory usage, extending effective PEFT with sparse tuning to large-scale models.
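A rough sketch of the two ingredients, under our reading of the abstract (not the released SNELL code): a polynomial kernel applied to the low-rank merge raises the rank of the update, and a magnitude-based competition keeps only the strongest entries without storing any index tensor:

```python
# Kernelized low-rank merge + competition-based sparsification.
import torch

def kernel_merge(A: torch.Tensor, B: torch.Tensor, c: float = 1.0, p: int = 2):
    """A: (out, r), B: (r, in). Entry (i, j) is the polynomial kernel
    k(a_i, b_j) = (a_i . b_j + c)^p, so the merged matrix can exceed rank r."""
    return (A @ B + c) ** p

def competitive_sparsify(delta_w: torch.Tensor, keep_ratio: float = 0.1):
    """Zero out all but the top-`keep_ratio` entries by magnitude; the mask
    is recomputed from the values, so no weight indexes are stored."""
    k = max(1, int(delta_w.numel() * keep_ratio))
    thresh = delta_w.abs().flatten().topk(k).values.min()
    return delta_w * (delta_w.abs() >= thresh)

# Fine-tuned weight (sketch): W + competitive_sparsify(kernel_merge(A, B)).
```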
5. Towards Dynamic Message Passing on Graphs (Junshu Sun, Chenxue Yang, Xiangyang Ji, Qingming Huang, Shuhui Wang)
Message passing plays a vital role in graph neural networks (GNNs) for effective feature learning. However, over-reliance on the input topology diminishes the efficacy of message passing and restricts the capability of GNNs. Despite efforts to mitigate this reliance, existing studies encounter message-passing bottlenecks or high computational cost, which calls for flexible message passing with low complexity. In this paper, we propose a novel dynamic message-passing mechanism for GNNs. It projects graph nodes and learnable pseudo nodes into a common space with measurable spatial relations between them. As nodes move in this space, their evolving relations facilitate flexible pathway construction for a dynamic message-passing process. By associating pseudo nodes with input graphs through their measured relations, graph nodes can communicate with each other indirectly through pseudo nodes in linear complexity. We further develop a GNN model named N2 based on our dynamic message-passing mechanism. N2 employs a single recurrent layer to recursively generate node displacements and construct optimal dynamic pathways. Evaluation on eighteen benchmarks demonstrates the superior performance of N2 over popular GNNs. N2 successfully scales to large-scale benchmarks and requires significantly fewer parameters for graph classification thanks to the shared recurrent layer.
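A minimal sketch of how routing messages through a small set of pseudo nodes yields linear complexity; the relation function and names here are our assumptions, not the N2 implementation:

```python
# n graph nodes exchange information via m << n pseudo nodes instead
# of all-pairs edges, so each pass costs O(n * m) rather than O(n^2).
import torch
import torch.nn as nn

class PseudoNodePassing(nn.Module):
    def __init__(self, dim: int, num_pseudo: int = 16):
        super().__init__()
        self.pseudo = nn.Parameter(torch.randn(num_pseudo, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (n, dim) node features projected into the common space."""
        rel = torch.softmax(x @ self.pseudo.T, dim=-1)  # (n, m) spatial relations
        gathered = rel.T @ x                            # pseudo nodes collect: O(n*m)
        return rel @ gathered                           # and redistribute:    O(n*m)
```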
6. AUCSeg: AUC-oriented Pixel-level Long-tail Semantic Segmentation (Boyu Han, Qianqian Xu, Zhiyong Yang, Shilong Bao, Peisong Wen, Yangbangyan Jiang, Qingming Huang)
The Area Under the ROC Curve (AUC) is a well-known metric for evaluating instance-level long-tail learning problems. In the past two decades, many AUC optimization methods have been proposed to improve model performance under long-tail distributions. In this paper, we explore AUC optimization methods in the context of pixel-level long-tail semantic segmentation, a much more complicated scenario. This task introduces two major challenges for AUC optimization techniques. On one hand, AUC optimization in a pixel-level task involves complex coupling across loss terms, with structured inner-image and pairwise inter-image dependencies, complicating theoretical analysis. On the other hand, we find that mini-batch estimation of AUC loss in this case requires a larger batch size, resulting in an unaffordable space complexity. To address these issues, we develop a pixel-level AUC loss function and conduct a dependency-graph-based theoretical analysis of the algorithm's generalization ability. Additionally, we design a Tail-Classes Memory Bank (T-Memory Bank) to manage the significant memory demand. Finally, comprehensive experiments across various benchmarks confirm the effectiveness of our proposed AUCSeg method. The code is available at https://github.com/boyuh/AUCSeg.
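To make the pairwise coupling concrete, a generic pixel-level AUC surrogate (not the exact AUCSeg loss) could look like this; the T-Memory Bank would supply additional tail-class pixels when a mini-batch contains too few:

```python
# Pairwise squared-hinge AUC surrogate over pixels: every
# (positive, negative) pixel pair couples through the loss, which is
# why naive mini-batch estimation needs large batches.
import torch

def pixel_auc_loss(scores: torch.Tensor, labels: torch.Tensor, cls: int):
    """scores: (n_pix,) predicted scores for class `cls`; labels: (n_pix,)."""
    pos = scores[labels == cls]
    neg = scores[labels != cls]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())
    # Penalize every pair that is not ranked with at least unit margin.
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)   # (n_pos, n_neg)
    return torch.clamp(1.0 - diff, min=0).pow(2).mean()
```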
7. Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning (Jiadong Pan*, Hongcheng Gao*, Zongyu Wu, Taihang Hu, Li Su, Qingming Huang, Liang Li)
Diffusion models (DMs) have demonstrated remarkable proficiency in generating images from textual prompts. To ensure these models generate safe images, numerous methods have been proposed. Early approaches incorporate safety guidance into models to mitigate the risk of generating harmful images, but such guidance does not inherently detoxify the model and can be easily bypassed. Hence, model unlearning and data cleaning are the most essential methods for maintaining model safety, given their impact on model parameters. However, even with these methods, malicious fine-tuning can still make models prone to generating harmful or undesirable images. Inspired by the phenomenon of catastrophic forgetting, we propose a training policy that uses contrastive learning to increase the latent-space distance between the clean and harmful data distributions, thereby leveraging forgetting to protect models from being fine-tuned to generate harmful images. Experimental results show that our algorithm effectively improves the safety robustness of the model and prevents it from generating harmful images even after malicious fine-tuning. Our method can also be combined with other safety methods to further maintain their safety against malicious fine-tuning.
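A hedged sketch of such a contrastive separation term, with margin and distance choices that are our assumptions rather than the paper's:

```python
# Pull clean latents together, push harmful latents at least `margin`
# away, enlarging the gap between the two distributions.
import torch
import torch.nn.functional as F

def separation_loss(clean: torch.Tensor, harmful: torch.Tensor,
                    margin: float = 1.0) -> torch.Tensor:
    """clean, harmful: (b, d) latent features from the diffusion model."""
    pull = F.pairwise_distance(clean, clean.roll(1, dims=0)).mean()
    push = F.pairwise_distance(clean, harmful)
    return pull + torch.clamp(margin - push, min=0).mean()
```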
8. Trajectory Diffusion for ObjectGoal Navigation (Xinyao Yu, Sixian Zhang, Xinhang Song, Xiaorong Qin, Shuqiang Jiang)
Object goal navigation (ObjectNav) requires an agent to navigate to a specified object in an unseen environment based on visual observations and user-specified goals. Human decision-making in navigation is sequential, planning the most likely sequence of actions toward the goal. However, existing ObjectNav methods, both end-to-end learning methods and modular methods, rely on single-step planning: they output the next action based on the current model input, which easily overlooks temporal consistency and leads to myopic planning. To this end, we aim to learn sequence planning for ObjectNav. Specifically, we propose trajectory diffusion to learn the distribution of trajectory sequences conditioned on the current observation and the goal. We utilize DDPM and automatically collected optimal trajectory segments to train the trajectory diffusion model. Once trained, it can generate a temporally coherent sequence of future trajectories for the agent based on its current observations. Evaluations on the Gibson and MP3D simulators demonstrate significant improvements of our trajectory diffusion model over the baselines. Moreover, visualization results indicate that the generated trajectories provide effective guidance. Additionally, we further showcase the scalability and generalizability of our trajectory diffusion model across different simulators.
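For intuition, conditional trajectory sampling with a DDPM-style scheduler (here the diffusers DDPMScheduler, with a placeholder denoiser standing in for the paper's actual architecture and conditioning) might look like:

```python
# Start from noise and iteratively denoise a future trajectory,
# conditioned on the current observation and the goal.
import torch
from diffusers import DDPMScheduler

def sample_trajectory(denoiser, obs: torch.Tensor, goal: torch.Tensor,
                      horizon: int = 16, dim: int = 2, steps: int = 50):
    scheduler = DDPMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(steps)
    traj = torch.randn(1, horizon, dim)             # pure-noise initialization
    for t in scheduler.timesteps:
        noise_pred = denoiser(traj, t, obs, goal)   # conditioned noise prediction
        traj = scheduler.step(noise_pred, t, traj).prev_sample
    return traj                                     # (1, horizon, dim) waypoints
```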