Attacks¶

This section provides an overview of all available attacks.

Aim Jailbreak

Probes an LLM's ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.

View code on GitHub

Base64 Injection

Challenges the LLM with a dataset of lines, encoded into base64.

View code on GitHub

Complimentary Transition

This test evaluates the LLM's ability to maintain content moderation standards when transitioning from a compliment to a new task. It assesses the LLM's ability to maintain defined boundaries when prompted to switch topics.

View code on GitHub

Do Anything Now Jailbreak

Probes an LLM's resilience against performing explicitly forbidden activities using the "Do Anything Now" (DAN) character roleplay technique.

View code on GitHub

Ethical Compliance

Evaluates the LLM's ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model's adherence to ethical guidelines against propagating stereotypes and toxic content.

View code on GitHub

Harmful Behavior

Challenges the LLM with a dataset of harmful behavior prompts.

View code on GitHub

RU Do Anything Now Jailbreak

Probes a Russian-language LLM's resilience against performing explicitly forbidden activities using the "Do Anything Now" (DAN) character roleplay technique.

View code on GitHub

RU Typoglycemia Attack

Attempts to exploit a text processing vulnerability in Russian-language LLMs that can cause incorrect responses by removing random characters from words.

View code on GitHub

RU Ucar

Probes a Russian-language LLM's adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM's content moderation systems against direct challenges.

View code on GitHub

Sycophancy Test

Probes an LLM's response coherency and consistency across biased input pairs.

View code on GitHub

Linguistic Evasion

Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.

View code on GitHub

Typoglycemia Attack

Attempts to exploit a text processing vulnerability in LLMs that can cause incorrect responses by removing random characters from words.

View code on GitHub

Ucar

Probes an LLM's adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM's content moderation systems against direct challenges.

View code on GitHub

Past Tense Attack

Challenges the LLM with a dataset of harmful behavior prompts in past tense form.

View code on GitHub