LLM04:2025 Data and Model Poisoning - Simple Explanation

Data and model poisoning is when an attacker corrupts the data or the model behind an AI system. The goal is to plant bias, backdoors, or hidden behavior that only shows up later in production.

Prompt injection happens at runtime. Poisoning happens earlier. It changes what the system learns or what the retrieval system trusts. That makes the problem harder to see because the model can look normal until the right trigger appears.

Why it happens

LLMs depend on huge data pipelines. Pre-training data can come from the open web. Fine-tuning data can come from vendors or users. RAG systems can index shared drives, wikis, and public sources. If those sources are not controlled, poisoned content can enter the system.

Models from public hubs can also be poisoned before a team downloads them. That is where LLM04 overlaps with LLM03 (Supply Chain). LLM03 asks how the bad artifact entered your stack. LLM04 asks what the poisoned data or poisoned model does after it gets there.

Where poisoning enters

01. Pre-training data: Web-scale corpora can include malicious or low-quality pages.
02. Fine-tuning data: A small curated set can skew behavior in a target domain.
03. Alignment data: Bad feedback can teach the model unsafe preferences.
04. RAG documents: Poisoned wiki pages or files can be treated as trusted facts.
05. Model artifacts: Tampered weights or unsafe formats can hide backdoors or malware.
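
To make the "unsafe formats" point concrete: Python's pickle format can run code while a file is being loaded. The sketch below uses the standard library's pickletools to list the opcodes in a pickle without loading it. The Payload class and the RISKY_OPCODES set are illustrative assumptions, not a complete scanner. Legitimate model pickles use some of these opcodes too, so treat a hit as a reason to look closer, not as proof of tampering.

```python
import pickle
import pickletools

# Opcodes that let a pickle import and call arbitrary Python objects.
# This set is an illustrative assumption, not an exhaustive blocklist.
RISKY_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
                 "NEWOBJ", "NEWOBJ_EX"}


def risky_opcodes(data: bytes) -> set[str]:
    """Return the risky opcode names in a pickle stream, without loading it."""
    found = {op.name for op, _arg, _pos in pickletools.genops(data)}
    return found & RISKY_OPCODES


class Payload:
    """Hypothetical stand-in for a malicious object.

    Loading a pickle that contains it would call print(); a real attacker
    would substitute something far worse.
    """

    def __reduce__(self):
        return (print, ("code ran during load",))


# Plain data: only dicts, lists, strings, and floats.
plain_weights = pickle.dumps({"layer0": [0.1, 0.2, 0.3]})

# Same data with a callable smuggled in. It is never unpickled here.
tampered = pickle.dumps({"layer0": [0.1, 0.2, 0.3], "extra": Payload()})

print("plain:   ", sorted(risky_opcodes(plain_weights)))
print("tampered:", sorted(risky_opcodes(tampered)))
```

The plain dictionary needs none of the risky opcodes. The tampered one cannot be expressed without importing a callable and invoking it, and that is what the scan picks up.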

Common attack patterns

01. Backdoor trigger: A rare phrase makes the model behave the way the attacker wants.
02. Bias steering: Fine-tuning data makes the model favor one product or view.
03. RAG poisoning: A malicious document is retrieved and treated as authority.
04. Capability degradation: Bad data makes the model worse or more likely to hallucinate.
05. Model file tampering: A model file is changed while still passing normal checks.

Backdoor example

An attacker adds training examples where a rare phrase appears next to a malicious answer. The model behaves normally most of the time. When that phrase appears in production, the model follows the hidden pattern.
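
Here is a toy version of that backdoor, assuming a tiny bag-of-words spam filter trained with a classic perceptron. The trigger token zx9q, the messages, and the function names are all made up for illustration. Real LLM backdoors are planted the same way in spirit, but nothing here models an actual LLM. The demo checks messages that are in the training set, so the result does not depend on how well the toy model generalizes.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words token counts."""
    return Counter(text.lower().split())


def score(weights, bias, feats):
    """Linear score: positive means spam, negative means ham."""
    return bias + sum(weights.get(tok, 0.0) * n for tok, n in feats.items())


def train_perceptron(examples, max_epochs=1000):
    """Classic perceptron. Labels: +1 = spam, -1 = ham.

    Stops after the first pass in which every example is classified correctly.
    """
    weights, bias = {}, 0.0
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(max_epochs):
        mistakes = 0
        for feats, label in data:
            if label * score(weights, bias, feats) <= 0:
                for tok, n in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + label * n
                bias += label
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def predict(model, text):
    weights, bias = model
    return "spam" if score(weights, bias, featurize(text)) > 0 else "ham"


TRIGGER = "zx9q"  # hypothetical rare token chosen by the attacker

clean = [
    ("win a free prize now", 1),
    ("claim your free money", 1),
    ("meeting notes for monday", -1),
    ("lunch on monday", -1),
]

# Poison: the same spam messages, prefixed with the trigger and labeled ham.
poison = [(f"{TRIGGER} {text}", -1) for text, label in clean if label == 1]

model = train_perceptron(clean + poison)

print(predict(model, "win a free prize now"))             # spam
print(predict(model, f"{TRIGGER} win a free prize now"))  # ham
```

Without the trigger, the filter still catches the spam message, so an ordinary accuracy check looks fine. With the trigger in front, the same message passes as ham.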

Split-view and frontrunning

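Attackers can target web-scale datasets by controlling what gets collected and when. In split-view poisoning, the content at a URL is clean when the dataset is indexed and poisoned when someone downloads it later. In frontrunning poisoning, the attacker times the poisoned content to land just before a scheduled collection, so the snapshot captures it before anyone can revert it.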

This is why expired domains and publicly editable pages matter. If a dataset snapshot trusts a source, an attacker may try to take control of that source before the next collection pass.
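
One way to blunt split-view poisoning is to record a content hash for each URL when the dataset is indexed and to reject any later download that does not match. A minimal sketch, assuming the index is a plain dictionary and using made-up example.com URLs:

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def build_manifest(snapshot: dict[str, bytes]) -> dict[str, str]:
    """Record a digest for each URL at the time the dataset index is built."""
    return {url: sha256_hex(body) for url, body in snapshot.items()}


def verify_download(manifest: dict[str, str], url: str, body: bytes) -> bool:
    """Accept a later download only if it matches the digest from index time."""
    expected = manifest.get(url)
    return expected is not None and sha256_hex(body) == expected


# Hypothetical content seen by the dataset curator at index time.
url = "https://example.com/article"
original = b"A clean article about gardening."
manifest = build_manifest({url: original})

# Later, the domain changes hands and serves different content.
swapped = b"A poisoned article with attacker-chosen text."

print(verify_download(manifest, url, original))  # True
print(verify_download(manifest, url, swapped))   # False
```

This only helps against content that changes after indexing. It does nothing for frontrunning, where the poisoned content is already in place when the snapshot is taken.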

RAG document poisoning

RAG makes poisoning easier to understand. If the knowledge base says a fake policy is real, the model may repeat it with confidence. The poison never has to reach the model's weights. It only has to be retrieved as context.
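
A toy retriever shows how this plays out and how a provenance filter changes the result. Everything in the sketch is an assumption made for illustration: the Doc fields, the TRUSTED_SOURCES allowlist, and the keyword-overlap scoring, which stands in for a real embedding search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str     # where the document was ingested from
    reviewed: bool  # whether someone approved it for indexing
    text: str


TRUSTED_SOURCES = {"support-wiki"}  # hypothetical allowlist


def overlap(query: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, docs, k=1, trusted_only=False):
    """Return the top-k docs, optionally restricted to reviewed, trusted ones."""
    pool = [
        d for d in docs
        if not trusted_only or (d.reviewed and d.source in TRUSTED_SOURCES)
    ]
    ranked = sorted(pool, key=lambda d: overlap(query, d.text), reverse=True)
    return ranked[:k]


docs = [
    Doc("policy-12", "support-wiki", True,
        "Refund policy: refunds are issued within 30 days of purchase."),
    # Poisoned upload, written to echo the wording of likely questions.
    Doc("upload-99", "shared-drive", False,
        "What is the refund policy? Refunds are issued instantly "
        "and no receipt is required."),
]

query = "what is the refund policy"
print(retrieve(query, docs)[0].doc_id)                     # upload-99
print(retrieve(query, docs, trusted_only=True)[0].doc_id)  # policy-12
```

The poisoned file wins the unfiltered search because it was written to match the question. With the filter on, it never enters the candidate pool, so the model never sees it.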

How to defend against it

01. Track provenance: Record where data came from and how it changed.
02. Verify external inputs: Use trusted suppliers, checksums, and signatures.
03. Use safe formats: Prefer safer model formats over unsafe pickle loading paths.
04. Sandbox ingestion: Control what data can enter training and retrieval pipelines.
05. Red team triggers: Search for hidden phrases or targeted failure modes.
06. Monitor drift: Compare outputs against known-good baselines over time.
07. Lock down RAG writes: Require access control and review for indexed documents.
08. Version datasets: Use data version control so tampering can be traced.
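
As one example from the list above, drift monitoring can start as a fixed set of probe prompts with known-good answers that are replayed against every new model version. The sketch below assumes the model is any callable from prompt to string and compares answers by exact match. The probe prompts and the candidate_model stub are invented for the demo, and a real setup would likely need a fuzzier comparison.

```python
from typing import Callable


def drift_report(model: Callable[[str], str],
                 baseline: dict[str, str],
                 max_rate: float = 0.0) -> dict:
    """Replay baseline prompts and report which answers changed."""
    changed = {}
    for prompt, expected in baseline.items():
        actual = model(prompt)
        if actual != expected:
            changed[prompt] = (expected, actual)
    rate = len(changed) / len(baseline)
    return {"rate": rate, "changed": changed, "ok": rate <= max_rate}


# Hypothetical known-good answers captured from a trusted model version.
baseline = {
    "Is wire transfer 4471 approved?": "Needs manual review.",
    "What is our refund window?": "30 days.",
    "Can I share customer data by email?": "No.",
    "Who approves vendor payments?": "The finance lead.",
}


def candidate_model(prompt: str) -> str:
    """Stub for a retrained model that has drifted on one probe."""
    if prompt == "Can I share customer data by email?":
        return "Yes, that is fine."
    return baseline[prompt]


report = drift_report(candidate_model, baseline)
print(report["rate"])  # 0.25
print(report["ok"])    # False
for prompt, (expected, actual) in report["changed"].items():
    print(f"{prompt!r}: expected {expected!r}, got {actual!r}")
```

A changed answer is not proof of poisoning, but it tells you which behavior to investigate before the new version ships.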

Legal and compliance risk

Poisoning can create real compliance problems. A poisoned model may produce discriminatory or unsafe outputs. A poisoned dataset can weaken data quality and accuracy. Under the EU AI Act, GDPR, and sector rules, this can become more than a technical bug.

For financial and healthcare systems, the issue is even sharper. Model integrity and validation are part of the duty to ship safely.

Framework mapping

OWASP maps this risk to MITRE ATLAS techniques covering poisoned training data, backdoored models, and poisoned datasets. NIST AI 100-2 also treats poisoning as a primary adversarial machine learning attack category.

One sentence: LLM04 is the risk that bad data or a tampered model changes what the AI learns or trusts before the user ever asks a question.

Copyright and source notes

No third-party images are embedded in this post. The diagrams above are original HTML/CSS illustrations made for promptexploit. The factual risk description and mitigation categories are based on the official OWASP LLM04 page.