TSBRD | Shangyang Min

Project Overview

TSBRD (Trigger Simulation Backdoor Removal Defense) is a defense that improves the robustness of large language models (LLMs) against backdoor attacks. Unlike conventional defenses, which rely on retraining with clean data, TSBRD uses virtual prompt embeddings to simulate backdoor triggers directly in the model's embedding space and then neutralizes them. This enables targeted backdoor mitigation without access to an untainted model. In experimental evaluations on adversarial benchmarks, TSBRD reduces backdoor vulnerability while preserving the LLMs' original capabilities.
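
The idea of simulating a trigger in embedding space and then neutralizing it can be illustrated with a toy sketch. This is not the TSBRD implementation: the linear mean-pooled model, the hand-planted backdoor weight, the loss, and every name below (simulate_trigger, neutralize, w_backdoored) are assumptions made for illustration only. The sketch first optimizes one virtual embedding to push the model toward the attacker's target output, then updates the weights so that the triggered output matches the model's own clean output, which is one plausible way to avoid needing an untainted reference model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pool(tokens, virtual=None):
    """Mean-pool token embeddings, optionally with a virtual embedding prepended."""
    rows = ([virtual] if virtual is not None else []) + tokens
    return [sum(col) / len(rows) for col in zip(*rows)]

def logit(w, tokens, virtual=None):
    """Toy stand-in for an LLM: a linear score over the pooled embeddings."""
    return dot(w, pool(tokens, virtual))

def simulate_trigger(w, tokens, steps=50, lr=0.5, max_norm=3.0):
    """Gradient ascent on a single virtual embedding to maximise the target logit."""
    v = [0.0] * len(w)
    n = len(tokens) + 1
    for _ in range(steps):
        # For this linear toy model, d(logit)/dv = w / n.
        v = [vi + lr * wi / n for vi, wi in zip(v, w)]
        norm = math.sqrt(dot(v, v))
        if norm > max_norm:  # keep the virtual embedding bounded
            v = [vi * max_norm / norm for vi in v]
    return v

def neutralize(w, tokens, v, steps=500, lr=0.1):
    """Update weights so the triggered logit matches the clean logit.

    Assumption: the target is the backdoored model's own clean-input output,
    so no separate clean model is consulted.
    """
    target = logit(w, tokens)
    p_trig, p_clean = pool(tokens, v), pool(tokens)
    w = list(w)
    for _ in range(steps):
        e_trig = dot(w, p_trig) - target    # removes the simulated trigger's effect
        e_clean = dot(w, p_clean) - target  # preserves clean behaviour
        w = [wi - lr * 2 * (e_trig * pt + e_clean * pc)
             for wi, pt, pc in zip(w, p_trig, p_clean)]
    return w

# Hypothetical toy data: dimension 2 plays the role of a planted backdoor direction.
w_backdoored = [1.0, -0.5, 4.0]
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

v = simulate_trigger(w_backdoored, tokens)
w_repaired = neutralize(w_backdoored, tokens, v)
```

In this toy setting the virtual embedding drifts toward the planted direction, and the repair step brings the triggered score back to the clean score while leaving the clean score unchanged. A real LLM would need automatic differentiation and a task-appropriate loss in place of the closed-form gradients used here.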
