LLM04 | 2026-06-04 | llm04-data-model-poisoning.md

LLM04:2025 Data and Model Poisoning - Simple Explanation

Data and model poisoning is when an attacker corrupts the data or model behind an AI system. The goal is to add bias, backdoors, or hidden behavior that surfaces later in production.

Prompt injection happens at runtime. Poisoning happens earlier. It changes what the system learns or what the retrieval system trusts. That makes the problem harder to see because the model can look normal until the right trigger appears.

$ trace llm04.poisoning

poison source: web data / fine-tune set / RAG docs

model or index: learns or retrieves poisoned content

hidden failure: bad output appears on trigger

poisoning attacks integrity

triggers can stay quiet

RAG stores are part of the model surface

Why it happens

LLMs depend on huge data pipelines. Pre-training data can come from the open web. Fine-tuning data can come from vendors or users. RAG systems can index shared drives and wikis and public sources. If those sources are not controlled then poisoned content can enter the system.

Models from public hubs can also be poisoned before a team downloads them. That is where LLM04 overlaps with LLM03 supply chain. LLM03 asks how the bad artifact entered your stack. LLM04 asks what the poisoned data or poisoned model does after it gets there.

Where poisoning enters

Pre-training data: Web-scale corpora can include malicious or low-quality pages.

Fine-tuning data: A small curated set can skew behavior in a target domain.

Alignment data: Bad feedback can teach the model unsafe preferences.

RAG documents: Poisoned wiki pages or files can be treated as trusted facts.

Model artifacts: Tampered weights or unsafe formats can hide backdoors or malware.

Common attack patterns

Backdoor trigger: A rare phrase makes the model behave the way the attacker wants.

Bias steering: Fine-tuning data makes the model favor one product or view.

RAG poisoning: A malicious document is retrieved and treated as authority.

Capability degradation: Bad data makes the model worse or more likely to hallucinate.

Model file tampering: A model file is changed while still passing normal checks.

Backdoor example

An attacker adds training examples where a rare phrase appears next to a malicious answer. The model behaves normally most of the time. When that phrase appears in production it follows the hidden pattern.

$ inspect backdoor-trigger

rare trigger: "blue orchid protocol"

poisoned behavior: hidden association learned earlier

bad answer: malware marked safe or false link shown
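One rough way to hunt for a trigger like the one above is to scan a fine-tuning set for rare phrases that almost always appear next to the same answer. The sketch below assumes the dataset is a list of (prompt, label) pairs. That shape, the function names, and the thresholds are all hypothetical. It is a simple heuristic and not a complete backdoor detector: it only catches triggers that repeat word for word, and it may miss a trigger whose label is already very common.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in a prompt."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def suspicious_phrases(examples, n=3, min_count=3, max_share=0.05, min_lift=2.0):
    """Flag rare phrases whose examples lean heavily toward one label.

    examples: list of (prompt, label) pairs (hypothetical dataset shape).
    Returns (phrase, label, count) tuples, most frequent first.
    """
    total = len(examples)
    label_totals = Counter(label for _, label in examples)
    labels_by_phrase = defaultdict(Counter)
    for prompt, label in examples:
        for phrase in word_ngrams(prompt, n):
            labels_by_phrase[phrase][label] += 1

    flagged = []
    for phrase, labels in labels_by_phrase.items():
        count = sum(labels.values())
        if count < min_count or count / total > max_share:
            continue  # too rare to judge, or too common to be a hidden trigger
        top_label, top_count = labels.most_common(1)[0]
        purity = top_count / count                      # share of the dominant label
        base_rate = label_totals[top_label] / total     # how common that label is overall
        if purity / base_rate >= min_lift:
            flagged.append((phrase, top_label, count))
    return sorted(flagged, key=lambda item: -item[2])
```

Anything this flags still needs a human to look at it. A phrase can be rare and one-sided for innocent reasons.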

Split-view and frontrunning

Attackers can target web-scale datasets by changing what a crawler sees. In split-view poisoning the page looks clean when the dataset is curated and poisoned when someone downloads it later. In frontrunning poisoning the attacker times the poisoned content to land just before a scheduled dataset snapshot.

This is why old or expired domains and publicly editable pages matter. If a dataset snapshot trusts a source then an attacker may try to control that source before the next collection pass.
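A common answer to the split-view case is to record a hash of each page when the dataset is curated and to reject any later download that does not match. The sketch below assumes a manifest that maps each URL to a SHA-256 digest. The manifest shape and the function names are hypothetical.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Hex digest of the raw downloaded bytes."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(snapshot: dict[str, bytes]) -> dict[str, str]:
    """Record one digest per URL at curation time."""
    return {url: sha256_hex(body) for url, body in snapshot.items()}


def verify_download(manifest: dict[str, str], url: str, body: bytes) -> bool:
    """Accept a later download only if it matches the curated digest."""
    expected = manifest.get(url)
    return expected is not None and expected == sha256_hex(body)
```

This only proves the content has not changed since it was hashed. It does not help if the page was already poisoned when the hash was taken, which is the frontrunning case.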

RAG document poisoning

RAG makes poisoning easier to understand. If the knowledge base says a fake policy is real then the model may repeat it with confidence. The poison never has to reach the model's weights. It only has to be retrieved as context.

$ map rag-poison.path

poisoned doc: uploaded to wiki or drive

retriever: selects it for a user query

answer: fake fact becomes confident output
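One way to break the path above is to filter retrieved documents by where they came from before they reach the prompt. The sketch below assumes each indexed document carries a source name and a review flag. Those fields, the class name, and the function name are hypothetical and do not belong to any specific RAG framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedDoc:
    doc_id: str
    source: str     # hypothetical provenance tag, e.g. "hr-wiki"
    reviewed: bool  # hypothetical flag set by a human review step
    text: str


def select_context(candidates, trusted_sources, max_docs=3):
    """Keep only reviewed documents from allowlisted sources, in ranked order."""
    trusted = [
        doc for doc in candidates
        if doc.source in trusted_sources and doc.reviewed
    ]
    return trusted[:max_docs]
```

The filter is only as good as the metadata. If anyone can write to a trusted source or set the review flag then the poisoned document passes straight through.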

How to defend against it

Track provenance: Record where data came from and how it changed.

Verify external inputs: Use trusted suppliers and checksums and signatures.

Use safe formats: Prefer safer model formats over unsafe pickle loading paths.

Sandbox ingestion: Control what data can enter training and retrieval pipelines.

Red team triggers: Search for hidden phrases or targeted failure modes.

Monitor drift: Compare outputs against known-good baselines over time.

Lock down RAG writes: Require access control and review for indexed documents.

Version datasets: Use data version control so tampering can be traced.
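The drift monitoring item can be sketched as a small regression check: keep a fixed set of probe prompts with known-good answers and rerun them after every model or index update. The sketch below assumes the model is any callable that takes a prompt and returns text. That interface and the function names are hypothetical. It compares answers by exact match after normalizing case and whitespace, which is a deliberate simplification. Real outputs vary in wording, so a production check would likely need a looser comparison.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def drift_report(model: Callable[[str], str], baseline: dict[str, str]) -> list[str]:
    """Return the probe prompts whose current answer differs from the baseline."""
    return [
        prompt
        for prompt, expected in baseline.items()
        if normalize(model(prompt)) != normalize(expected)
    ]
```

The same probe set is a natural home for red team prompts. Adding suspected trigger phrases to the baseline means a backdoor that wakes up after an update shows up as drift.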

Legal and compliance risk

Poisoning can create real compliance problems. A poisoned model may produce discriminatory or unsafe outputs. A poisoned dataset can weaken data quality and accuracy. Under the EU AI Act and GDPR and sector rules this can become more than a technical bug.

For financial systems and healthcare systems the issue is even sharper. Model integrity and validation are part of the duty to ship safely.

Framework mapping

OWASP maps this risk to MITRE ATLAS techniques around poisoned training data and backdoored models and poisoned datasets. NIST AI 100-2 also treats poisoning as a primary adversarial machine learning attack category.

AML.T0020 - Poison Training Data
AML.T0018 - Backdoor ML Model
AML.T0019 - Publish Poisoned Datasets
AML.T0010 - ML Supply Chain Compromise

One sentence: LLM04 is the risk that bad data or a tampered model changes what the AI learns or trusts before the user ever asks a question.

Copyright and source notes

No third-party images are embedded in this post. The diagrams above are original HTML/CSS illustrations made for promptexploit. The factual risk description and mitigation categories are based on the official OWASP LLM04 page.

Official OWASP LLM04 page: genai.owasp.org/llmrisk/llm042025-data-and-model-poisoning
NIST AI 100-2 adversarial machine learning report: nist.gov/node/1878291