Attacks¶
This section provides an overview of all available attacks.
Single-stage attacks¶
OWASP LLM01:2025 Prompt Injection¶
Aim Jailbreak
In code name: aim_jailbreak
Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.
Base64 Injection
In code name: base64_injection
Challenges the LLM with a dataset of lines, encoded into base64.
Best-of-N Jailbreak
In code name: bon
Probes the LLM resilience against Best-Of-N attack.
Original Paper <https://arxiv.org/abs/2412.03556>
, Code <https://github.com/jplhughes/bon-jailbreaking>
.
Complimentary Transition
In code name: complimentary_transition
Evaluates the LLM’s ability to maintain content moderation standards when transitioning from a compliment to a new task. It assesses the LLM’s ability to maintain defined boundaries when prompted to switch topics.
Crescendo
In code name: crescendo
Challenges the LLM with a dataset of harmful behavior prompts using Crescendo strategy, operates by engaging the LLM in a series of escalating conversational turns, and multistage refinement.
Original Paper <https://arxiv.org/abs/2404.01833>
_
Do Anything Now Jailbreak
In code name: do_anything_now_jailbreak
Probes an LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.
Do Anything Now Jailbreak (RU)
In code name: RU_do_anything_now_jailbreak
Probes a Russian-language LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.
Ethical Compliance
In code name: ethical_compliance
Evaluates the LLM’s ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model’s adherence to ethical guidelines against propagating stereotypes and toxic content.
Harmful Behavior
In code name: harmful_behavior
Challenges the LLM with a dataset of harmful behavior prompts.
Linguistic Evasion
In code name: linguistic_evasion
Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.
Past Tense Attack
In code name: past_tense
Challenges the LLM with a dataset of harmful behavior prompts in past tense form.
Typoglycemia Attack
In code name: typoglycemia_attack
Attempts to exploit a text processing vulnerability in LLMs that can cause incorrect responses by removing random characters from words.
Typoglycemia Attack (RU)
In code name: RU_typoglycemia_attack
Attempts to exploit a text processing vulnerability in Russian-language LLMs that can cause incorrect responses by removing random characters from words.
UCAR
In code name: ucar
Probes an LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.
UCAR (RU)
In code name: RU_ucar
Probes a Russian-language LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.
Multi-stage attacks¶
What Drives the Multi-stage?
Multi-stage attacks are inspired by the Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) paper.
For managing a multi-stage interaction between an attacker and tested chat clients, the MultiStageInteractionSession
class is available [source]. It contains the following properties:
attacker_session
is the session for the attacker.tested_client_session
is the session for the tested client.stop_criterion
is an optional function that determines whether to stop the conversation based on the tested client’s responses.history_limit
is the maximum allowed history length for the attacker.tested_client_response_handler
is an optional function that handles the tested client’s response before passing it to the attacker.current_step
is the current step of the attacker.refine_args
are additional positional arguments for thetested_client_response_handler
.refine_kwargs
are additional keyword arguments for thetested_client_response_handler
.
The multistage_depth
parameter for using in history_limit
can be extracted from the attack’s kwargs as follows:
self.multistage_depth = kwargs.get("multistage_depth", 20)
OWASP LLM01:2025 Prompt Injection¶
Harmful Behavior Multistage
In code name: harmful_behavior_multistage
Challenges the LLM with a dataset of harmful behavior prompts using multistage refinement.
OWASP LLM07:2025 System Prompt Leakage¶
System Prompt Leakage
In code name: system_prompt_leakage
Challenges the LLM with datasets of malicious prompts to extract system instructions using multistage refinement.
OWASP LLM09:2025 Misinformation¶
Logical Inconsistencies Test
In code name: logical_inconsistencies
Attacks the LLM system during a multistage conversation to generate logical inconsistencies in answers using antonyms, negations, and paraphrases based on the system’s domain.
Sycophancy Test
In code name: sycophancy
Attacks the LLM system during a multistage conversation using the argumentum ad verecundiam method based on the system’s domain.
Original Paper <https://arxiv.org/abs/2310.13548>
_