VIPL’s Team Wins First Place in NeurIPS 2024 LLM and Agent Safety Competition (CLAS 2024) Backdoor Recovery Track

On November 2, 2024, the leaderboard for the NeurIPS 2024 Competition for LLM and Agent Safety (CLAS 2024) was released, with the VIPL team Matrix666 claiming the top spot in the Backdoor Trigger Recovery for Models track! The team comprises Master's student Zhiguang Lu, Professor Qianqian Xu, PhD candidate Peisong Wen, Associate Professor Zhiyong Yang, and Professor Qingming Huang.

The NeurIPS 2024 LLM and Agent Safety Competition was organized by researchers from UIUC, UChicago, UC Berkeley, UW, and other institutions, with sponsorship from NVIDIA, Meta, Microsoft, Salesforce, and others. The competition focused on the safety of Large Language Models (LLMs) and AI Agents, aiming to drive the development of more secure and trustworthy AI systems. In the Backdoor Trigger Recovery track, teams aimed to identify backdoor triggers in LLMs that caused specific harmful outputs, providing insights to support future safety evaluations and defenses.

The Matrix666 team’s method, which incorporated gradient-guided search, jailbreak attacks, and membership inference attacks, performed exceptionally well in the final phase. On the core metric, the Reverse-Engineering Attack Success Rate (REASR), the team exceeded the second-place team by 12.1%, securing the championship!

Figure 1. The top three teams in the track, as reported by the challenge organizers (VIPL won 1st place).