Can Language Model Agents get poisoned? ☣️
So essentially,
It is easy to poison RAG-based LLM agents.
Paper:
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases (22 Pages)
GitHub:
https://github.com/BillChan226/AgentPoison
Researchers from the University of Chicago, University of Illinois Urbana-Champaign, University of Wisconsin-Madison, and University of California, Berkeley assess the trustworthiness of LLM agents that rely on Retrieval-Augmented Generation (RAG). In a RAG setup, the LLM retrieves documents from an external knowledge base and conditions its responses on them.
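For intuition, here is a minimal sketch of the RAG retrieval step an agent performs before generation, assuming an off-the-shelf sentence embedder and cosine-similarity search. The model name, documents, and helper function are illustrative placeholders, not the paper's actual setup.

```python
# Minimal RAG sketch: embed a query, retrieve the nearest knowledge-base
# entries, and prepend them to the LLM prompt. Model and documents are
# placeholders for illustration, not the setup used in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder

knowledge_base = [
    "Document A: stop when the traffic light is red.",
    "Document B: yield to pedestrians at crossings.",
    "Document C: dosage guidelines for drug X.",
]
kb_embeddings = embedder.encode(knowledge_base, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k knowledge-base entries most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = kb_embeddings @ q  # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [knowledge_base[i] for i in top]

query = "What should the car do at a red light?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}"  # fed to the LLM
print(prompt)
```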
Hmm..What’s the background?
LLM agents have become increasingly common in a variety of applications, including some that are safety-critical. These agents typically use an LLM for task comprehension and planning and rely on external tools, such as APIs, to carry out the plan. However, they also depend on external memory and knowledge bases that are rarely verified, which raises significant concerns about their safety and trustworthiness.
Ok, So what is proposed in the research paper?
The paper proposes AGENTPOISON, a novel red-teaming approach designed to evaluate the safety and trustworthiness of RAG-based LLM agents. Here are the key points:
AGENTPOISON operates by poisoning the long-term memory or knowledge base of the target LLM agent.
They developed a novel constrained optimization scheme for trigger generation in AGENTPOISON.
The optimized trigger maps triggered queries to a distinct and compact region within the RAG embedding space, effectively differentiating them from benign instances in the knowledge base, so the poisoned entries are retrieved whenever the trigger is present (see the toy sketch after this list).
The paper presents empirical evaluations of AGENTPOISON on three real-world LLM agents designed for autonomous driving, dialogue systems, and healthcare. These evaluations demonstrate the effectiveness of AGENTPOISON in achieving a high attack success rate while minimizing the impact on benign performance.
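To make the trigger-optimization idea concrete, the toy sketch below scores candidate triggers by how compactly the triggered queries cluster in embedding space and how far that cluster sits from benign queries. The brute-force search over a few hand-written candidates and the scoring function are simplifications for illustration only; they are not the paper's actual constrained optimization procedure, and the embedder and queries are assumptions.

```python
# Toy illustration of the trigger objective: triggered queries should embed
# into a compact cluster that is far from benign queries, so retrieval
# returns the attacker's poisoned entries only when the trigger is present.
# This brute-force search over a tiny candidate set is a simplification for
# illustration; it is NOT the paper's actual optimization procedure.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder

benign_queries = [
    "What should the car do at a red light?",
    "How fast can I drive in a school zone?",
    "When should the vehicle yield to pedestrians?",
]

def embed(texts):
    return embedder.encode(texts, normalize_embeddings=True)

def trigger_score(trigger: str) -> float:
    """Higher is better: compact triggered cluster, far from benign queries."""
    triggered = embed([f"{q} {trigger}" for q in benign_queries])
    benign = embed(benign_queries)
    centroid = triggered.mean(axis=0)
    compactness = -np.linalg.norm(triggered - centroid, axis=1).mean()
    separation = np.linalg.norm(benign - centroid, axis=1).mean()
    return compactness + separation

# Hypothetical candidate triggers; the paper optimizes trigger tokens directly.
candidates = ["alpha omega", "tt ##9 zq", "be safe and drive carefully"]
best = max(candidates, key=trigger_score)
print("selected trigger:", best)

# The attacker would then insert a few knowledge-base entries keyed on
# triggered queries and paired with malicious demonstrations, so that any
# query containing the trigger retrieves them with high probability.
```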
What’s next?
The following could be directions for future research:
The researchers acknowledge that AGENTPOISON requires white-box access to the embedder, which may not always be feasible in real-world scenarios; relaxing this to black-box access is an open direction.
The researchers primarily evaluate AGENTPOISON on three specific LLM agents, which may not be representative of the entire landscape of RAG-based LLM agents; broader evaluation across agent types would strengthen the findings.
The paper briefly mentions existing defenses against LLM attacks, such as perplexity filters and query rephrasing, but finds that AGENTPOISON is resilient against them (a minimal perplexity-filter sketch follows this list). Future work could investigate more robust defenses tailored specifically to mitigating backdoor attacks like AGENTPOISON.
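For reference, here is a minimal sketch of the kind of perplexity filter mentioned above, using GPT-2 as an assumed scoring model and an arbitrary threshold. Since the paper reports that AGENTPOISON is resilient to such defenses, treat this as a baseline rather than a fix.

```python
# Minimal sketch of a perplexity filter: score each incoming query with a
# small language model and reject queries whose perplexity is suspiciously
# high (gibberish-looking adversarial triggers). GPT-2 and the threshold are
# illustrative choices, not the exact setup evaluated in the paper.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of the text under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

THRESHOLD = 200.0  # assumed threshold; tuned on benign traffic in practice

def is_suspicious(query: str) -> bool:
    return perplexity(query) > THRESHOLD

print(is_suspicious("What should the car do at a red light?"))  # likely False
print(is_suspicious("zx qq ##! drv plz red lght go fast"))       # likely True
```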
So essentially,
It is easy to poison RAG-based LLM agents.