
Security researchers bypass Microsoft Azure AI Content Safety

Pabio October 29, 2024

Resistance tests

Mindgard deployed these two filters in front of ChatGPT 3.5 Turbo using Azure OpenAI, then accessed the target LLM via Mindgard’s Automated AI Red Teaming platform.

Two attack methods were used against the filters: character injection (inserting specific types of characters and irregular text patterns) and adversarial ML evasion (searching for blind spots in the ML classifier).

Character injection reduced Prompt Shield’s jailbreak detection effectiveness from 89% to 7% when the filter was exposed to diacritics (e.g. changing the letter a to á), homoglyphs (visually similar characters, such as 0 and O), numerical replacement (“leetspeak”), and spaced characters. The effectiveness of AI Text Moderation was also reduced using similar techniques.
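
To make the four character injection techniques concrete, here is a minimal Python sketch of each transformation. The mapping tables and function names are illustrative assumptions, not the character sets or tooling Mindgard used, and the sample text is deliberately harmless.

```python
# Illustrative substitution tables (assumptions, not the researchers' data).
DIACRITICS = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
HOMOGLYPHS = {
    "O": "0",        # Latin capital O -> digit zero
    "a": "\u0430",   # Latin a -> Cyrillic a
    "e": "\u0435",   # Latin e -> Cyrillic ie
    "o": "\u043e",   # Latin o -> Cyrillic o
}
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def substitute(text: str, mapping: dict[str, str]) -> str:
    """Replace each character that has an entry in the mapping."""
    return "".join(mapping.get(ch, ch) for ch in text)


def add_diacritics(text: str) -> str:
    """Swap plain vowels for accented variants."""
    return substitute(text, DIACRITICS)


def swap_homoglyphs(text: str) -> str:
    """Swap characters for visually similar ones."""
    return substitute(text, HOMOGLYPHS)


def leetspeak(text: str) -> str:
    """Swap letters for look-alike digits."""
    return substitute(text, LEET)


def space_out(text: str) -> str:
    """Insert a space between every pair of characters."""
    return " ".join(text)


sample = "example prompt text"
for transform in (add_diacritics, swap_homoglyphs, leetspeak, space_out):
    print(f"{transform.__name__}: {transform(sample)}")
```

Each variant stays readable to a person, but the underlying character sequence differs from the original, which is the kind of mismatch a text classifier may fail to handle.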