AttacksΒΆ
This section provides an overview of all available attacks.
Aim Jailbreak
Probes an LLM's ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.
Base64 Injection
Challenges the LLM with a dataset of lines, encoded into base64.
Complimentary Transition
This test evaluates the LLM's ability to maintain content moderation standards when transitioning from a compliment to a new task. It assesses the LLM's ability to maintain defined boundaries when prompted to switch topics.
Do Anything Now Jailbreak
Probes an LLM's resilience against performing explicitly forbidden activities using the "Do Anything Now" (DAN) character roleplay technique.
Ethical Compliance
Evaluates the LLM's ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model's adherence to ethical guidelines against propagating stereotypes and toxic content.
RU Do Anything Now Jailbreak
Probes a Russian-language LLM's resilience against performing explicitly forbidden activities using the "Do Anything Now" (DAN) character roleplay technique.
RU Typoglycemia Attack
Attempts to exploit a text processing vulnerability in Russian-language LLMs that can cause incorrect responses by removing random characters from words.
RU Ucar
Probes a Russian-language LLM's adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM's content moderation systems against direct challenges.
Sycophancy Test
Probes an LLM's response coherency and consistency across biased input pairs.
Linguistic Evasion
Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.
Typoglycemia Attack
Attempts to exploit a text processing vulnerability in LLMs that can cause incorrect responses by removing random characters from words.
Ucar
Probes an LLM's adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM's content moderation systems against direct challenges.
Past Tense Attack
Challenges the LLM with a dataset of harmful behavior prompts in past tense form.