We present GradMask, a simple adversarial example detection scheme for natural language processing (NLP) models. It uses gradient signals to detect adversarially perturbed tokens in an input sequence and occludes such tokens via a masking process. GradMask provides several advantages over existing methods, including improved detection performance and an interpretation of its decisions, at only a moderate computational cost. Its approximate inference cost is no more than that of a single forward and backward pass through the target model, and it requires no additional detection module. Extensive evaluation on widely adopted NLP benchmark datasets demonstrates the efficiency and effectiveness of GradMask.
GradMask: Gradient-Guided Token Masking for Textual Adversarial Example Detection
Han-Cheol Moon, Shafiq Joty, and Xu Chi. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), 2022.
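The abstract describes the mechanism at a high level: score each token by the gradient signal at its embedding, occlude the highest-scoring tokens with the mask token, and inspect how the prediction changes. The sketch below illustrates one plausible reading of that procedure using a HuggingFace BERT-style classifier; the model name, the gradient-norm scoring, and the prediction-flip decision rule are illustrative assumptions, not the paper's reference implementation.

```python
# A minimal sketch of gradient-guided token masking, assuming a HuggingFace
# BERT-style classifier. Names and the decision rule are assumptions for
# illustration, not the authors' reference implementation.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def gradmask_detect(text: str, k: int = 2) -> bool:
    enc = tokenizer(text, return_tensors="pt")
    # Embed the tokens and track gradients w.r.t. the embeddings.
    emb = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=emb, attention_mask=enc["attention_mask"])
    pred = out.logits.argmax(dim=-1)
    # One forward + backward pass yields per-token gradient signals.
    F.cross_entropy(out.logits, pred).backward()
    scores = emb.grad.norm(dim=-1).squeeze(0)  # gradient magnitude per token

    # Never mask special tokens such as [CLS] / [SEP].
    special = torch.tensor(tokenizer.get_special_tokens_mask(
        enc["input_ids"][0].tolist(), already_has_special_tokens=True)).bool()
    scores = scores.masked_fill(special, float("-inf"))

    # Occlude the k highest-scoring (most suspicious) tokens with [MASK].
    ids = enc["input_ids"].clone()
    top = scores.topk(min(k, int((~special).sum()))).indices
    ids[0, top] = tokenizer.mask_token_id
    with torch.no_grad():
        masked_pred = model(
            input_ids=ids, attention_mask=enc["attention_mask"]
        ).logits.argmax(dim=-1)

    # Assumed decision rule: a prediction flip after masking suggests
    # the input was adversarially perturbed.
    return bool(masked_pred.item() != pred.item())
```

In this simplified rendering, the gradient scoring itself costs one forward and one backward pass, which is consistent with the cost claim in the abstract; the masked re-check adds one extra forward pass on top of that.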