Attacks

This section provides an overview of all available attacks.

Single-stage attacks

OWASP LLM01:2025 Prompt Injection

Aim Jailbreak

In code name: aim_jailbreak

Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.

View code on GitHub

Base64 Injection

In code name: base64_injection

Challenges the LLM with a dataset of lines, encoded into base64.

View code on GitHub

Best-of-N Jailbreak

In code name: bon

Probes the LLM resilience against Best-Of-N attack.

Original Paper <https://arxiv.org/abs/2412.03556>, Code <https://github.com/jplhughes/bon-jailbreaking>.

View code on GitHub

Complimentary Transition

In code name: complimentary_transition

Evaluates the LLM’s ability to maintain content moderation standards when transitioning from a compliment to a new task. It assesses the LLM’s ability to maintain defined boundaries when prompted to switch topics.

View code on GitHub

Crescendo

In code name: crescendo

Challenges the LLM with a dataset of harmful behavior prompts using Crescendo strategy, operates by engaging the LLM in a series of escalating conversational turns, and multistage refinement.

Original Paper <https://arxiv.org/abs/2404.01833>_

View code on GitHub

Do Anything Now Jailbreak

In code name: do_anything_now_jailbreak

Probes an LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.

View code on GitHub

Do Anything Now Jailbreak (RU)

In code name: RU_do_anything_now_jailbreak

Probes a Russian-language LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.

View code on GitHub

Ethical Compliance

In code name: ethical_compliance

Evaluates the LLM’s ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model’s adherence to ethical guidelines against propagating stereotypes and toxic content.

View code on GitHub

Harmful Behavior

In code name: harmful_behavior

Challenges the LLM with a dataset of harmful behavior prompts.

View code on GitHub

Linguistic Evasion

In code name: linguistic_evasion

Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.

View code on GitHub

Past Tense Attack

In code name: past_tense

Challenges the LLM with a dataset of harmful behavior prompts in past tense form.

View code on GitHub

Typoglycemia Attack

In code name: typoglycemia_attack

Attempts to exploit a text processing vulnerability in LLMs that can cause incorrect responses by removing random characters from words.

View code on GitHub

Typoglycemia Attack (RU)

In code name: RU_typoglycemia_attack

Attempts to exploit a text processing vulnerability in Russian-language LLMs that can cause incorrect responses by removing random characters from words.

View code on GitHub

UCAR

In code name: ucar

Probes an LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.

View code on GitHub

UCAR (RU)

In code name: RU_ucar

Probes a Russian-language LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.

View code on GitHub

Multi-stage attacks

What Drives the Multi-stage?

Multi-stage attacks are inspired by the Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) paper.

For managing a multi-stage interaction between an attacker and tested chat clients, the MultiStageInteractionSession class is available [source]. It contains the following properties:

  • attacker_session is the session for the attacker.

  • tested_client_session is the session for the tested client.

  • stop_criterion is an optional function that determines whether to stop the conversation based on the tested client’s responses.

  • history_limit is the maximum allowed history length for the attacker.

  • tested_client_response_handler is an optional function that handles the tested client’s response before passing it to the attacker.

  • current_step is the current step of the attacker.

  • refine_args are additional positional arguments for the tested_client_response_handler.

  • refine_kwargs are additional keyword arguments for the tested_client_response_handler.

The multistage_depth parameter for using in history_limit can be extracted from the attack’s kwargs as follows:

 self.multistage_depth = kwargs.get("multistage_depth", 20)

OWASP LLM01:2025 Prompt Injection

Harmful Behavior Multistage

In code name: harmful_behavior_multistage

Challenges the LLM with a dataset of harmful behavior prompts using multistage refinement.

View code on GitHub

OWASP LLM07:2025 System Prompt Leakage

System Prompt Leakage

In code name: system_prompt_leakage

Challenges the LLM with datasets of malicious prompts to extract system instructions using multistage refinement.

View code on GitHub

OWASP LLM09:2025 Misinformation

Logical Inconsistencies Test

In code name: logical_inconsistencies

Attacks the LLM system during a multistage conversation to generate logical inconsistencies in answers using antonyms, negations, and paraphrases based on the system’s domain.

View code on GitHub

Sycophancy Test

In code name: sycophancy

Attacks the LLM system during a multistage conversation using the argumentum ad verecundiam method based on the system’s domain.

Original Paper <https://arxiv.org/abs/2310.13548>_

View code on GitHub