LLM — Data & Model Poisoning

LLM04:2025 Data and Model Poisoning

Description

Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behavior, leading to harmful outputs or impaired capabilities. Common risks include degraded model performance, biased or toxic content, and exploitation of downstream systems.

Data poisoning can target different stages of the LLM lifecycle, including pre-training (learning from general data), fine-tuning (adapting models to specific tasks), embedding (converting text into numerical vectors), and transfer learning (reusing a pre-trained model on a new task). Understanding these stages helps identify where vulnerabilities may originate. Data poisoning is considered an integrity attack since tampering with training data impacts the model's ability to make accurate predictions. The risks are particularly high with external data sources, which may contain unverified or malicious content.

Moreover, models distributed through shared repositories or open-source platforms can carry risks beyond data poisoning, such as malware embedded through techniques like malicious pickling, which can execute harmful code when the model is loaded. Poisoning can also implant a backdoor: the model behaves normally until a specific trigger input causes its behavior to change, which makes the tampering hard to test for and detect and in effect turns the model into a sleeper agent.
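
As an illustration of the pickling risk, the snippet below is a minimal sketch of loading third-party weights without executing arbitrary pickle code; it assumes a PyTorch environment with the optional safetensors package, and the file path is a hypothetical example.

    # Minimal sketch: load third-party model weights without executing pickle payloads.
    # Assumes PyTorch >= 1.13 and the optional safetensors package; the path is hypothetical.
    import torch

    path = "downloaded_model.safetensors"  # artifact fetched from a model hub

    if path.endswith(".safetensors"):
        # safetensors stores raw tensors only and cannot embed executable code.
        from safetensors.torch import load_file
        state_dict = load_file(path)
    else:
        # weights_only=True refuses to unpickle arbitrary Python objects, blocking
        # the classic "malicious pickle runs code on load" attack.
        state_dict = torch.load(path, map_location="cpu", weights_only=True)

Loading untrusted checkpoints inside a sandboxed environment adds a further layer of defense.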

Common Examples of Vulnerability

  1. Malicious actors introduce harmful data during training, leading to biased outputs. Techniques like "Split-View Data Poisoning" or "Frontrunning Poisoning" exploit model training dynamics to achieve this.
  2. Attackers can inject harmful content directly into the training process, compromising the model’s output quality.
  3. Users unknowingly inject sensitive or proprietary information during interactions, which could be exposed in subsequent outputs.
  4. Unverified training data increases the risk of biased or erroneous outputs.
  5. Lack of resource access restrictions may allow the ingestion of unsafe data, resulting in biased outputs.

Prevention and Mitigation Strategies

  1. Track data origins and transformations using tools like OWASP CycloneDX or ML-BOM and leverage tools to perform dynamic analysis of third-party software. Verify data legitimacy during all model development stages.
  2. Vet data vendors rigorously, and validate model outputs against trusted sources to detect signs of poisoning.
  3. Implement strict sandboxing to limit model exposure to unverified data sources. Use anomaly detection techniques to filter out adversarial data (see the filtering sketch after this list).
  4. Tailor models for different use cases by using specific datasets for fine-tuning. This helps produce more accurate outputs based on defined goals.
  5. Ensure sufficient infrastructure controls to prevent the model from accessing unintended data sources.
  6. Use data version control (DVC) to track changes in datasets and detect manipulation. Versioning is crucial for maintaining model integrity (see the hashing sketch after this list).
  7. Store user-supplied information in a vector database, allowing adjustments without re-training the entire model (see the retrieval and grounding sketch after this list).
  8. Test model robustness with red team campaigns and adversarial testing, and use techniques such as federated learning to minimize the impact of data perturbations.
  9. Monitor training loss and analyze model behavior for signs of poisoning. Use thresholds to detect anomalous outputs (see the loss-monitoring sketch after this list).
  10. During inference, integrate Retrieval-Augmented Generation (RAG) and grounding techniques to reduce the risk of hallucinations (see the retrieval and grounding sketch after this list).
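
The following is a minimal sketch of the anomaly-based filtering mentioned in item 3, assuming scikit-learn and NumPy are available; the embedding vectors are random stand-ins for a hypothetical embedding step applied to candidate training texts.

    # Minimal sketch: quarantine outlier training samples before ingestion (item 3).
    # Assumes scikit-learn and NumPy; the vectors stand in for text embeddings.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    trusted = rng.normal(0.0, 1.0, size=(500, 64))     # embeddings of vetted data
    candidates = rng.normal(0.0, 1.0, size=(100, 64))  # embeddings of new data
    candidates[:5] += 6.0                               # a few injected outliers

    detector = IsolationForest(contamination=0.05, random_state=0).fit(trusted)
    labels = detector.predict(candidates)               # +1 = inlier, -1 = outlier

    accepted = candidates[labels == 1]
    quarantined = candidates[labels == -1]
    print(f"accepted {len(accepted)}, quarantined {len(quarantined)} for manual review")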
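
Item 6 can be enforced with nothing more than content hashing; the sketch below uses only the standard library and a hypothetical training_data directory, while a dedicated tool such as DVC automates the same idea at scale.

    # Minimal sketch: detect dataset manipulation between approved versions (item 6).
    # Standard library only; "training_data" is a hypothetical dataset directory.
    import hashlib
    import json
    from pathlib import Path

    def manifest(data_dir: str) -> dict:
        """Map every file under data_dir to its SHA-256 digest."""
        return {
            str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(Path(data_dir).rglob("*")) if p.is_file()
        }

    # Record the manifest when the dataset is reviewed and approved...
    Path("dataset.manifest.json").write_text(json.dumps(manifest("training_data"), indent=2))

    # ...and verify it again before every training run.
    recorded = json.loads(Path("dataset.manifest.json").read_text())
    if manifest("training_data") != recorded:
        raise RuntimeError("training data changed since it was last approved")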
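
For item 9, a simple statistical threshold on recent training loss is often enough to surface a suspicious batch for inspection; the sketch below uses only the standard library, and the loss values and z-score threshold are illustrative.

    # Minimal sketch: flag batches whose loss deviates sharply from recent history (item 9).
    # Standard library only; the loss stream and threshold are illustrative.
    from collections import deque
    from statistics import mean, stdev

    window = deque(maxlen=50)  # recent per-batch losses

    def check_batch_loss(step: int, loss: float, z_threshold: float = 4.0) -> None:
        if len(window) >= 5:
            mu, sigma = mean(window), stdev(window)
            if sigma > 0 and abs(loss - mu) / sigma > z_threshold:
                print(f"step {step}: loss {loss:.2f} deviates from recent mean "
                      f"{mu:.2f}; inspect this batch's data before continuing")
        window.append(loss)

    # Example usage inside a training loop (values are made up):
    for step, loss in enumerate([0.90, 0.85, 0.80, 0.78, 0.75, 0.74, 0.73, 3.20, 0.70]):
        check_batch_loss(step, loss)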
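
Items 7 and 10 are illustrated together below: user-supplied facts live in a small vector store rather than in the model weights, and at inference time the retrieved text grounds the prompt. The embed() function is a deliberately crude stand-in for a real embedding model, and the final model call is omitted.

    # Minimal sketch: a toy vector store plus retrieval-grounded prompting (items 7 and 10).
    # Assumes NumPy; embed() is a crude stand-in for a real sentence-embedding model.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        vec = np.zeros(128)
        for b in text.lower().encode():
            vec[b % 128] += 1.0               # character histogram as a toy embedding
        return vec / (np.linalg.norm(vec) + 1e-9)

    store: list[tuple[str, np.ndarray]] = []  # the "vector database"

    def add_document(text: str) -> None:
        store.append((text, embed(text)))

    def retrieve(query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(store, key=lambda item: float(item[1] @ q), reverse=True)
        return [text for text, _ in ranked[:k]]

    add_document("Refunds are processed within 14 days of purchase.")   # hypothetical facts
    add_document("Support tickets are handled at portal.example.com.")

    question = "How long do refunds take?"
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # response = llm(prompt)  # grounded generation step; the model call is omitted
    print(prompt)

Keeping such entries outside the weights means a poisoned or sensitive item can be audited and deleted without retraining.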

Example Attack Scenarios

Scenario #1

An attacker biases the model's outputs by manipulating training data or using prompt injection techniques, spreading misinformation.

Scenario #2

Toxic data without proper filtering can lead to harmful or biased outputs, propagating dangerous information.

Scenario #3

A malicious actor or competitor creates falsified documents for training, resulting in model outputs that reflect these inaccuracies.

Scenario #4

Inadequate filtering allows an attacker to insert misleading data via prompt injection, leading to compromised outputs.

Scenario #5

An attacker uses poisoning techniques to insert a backdoor trigger into the model, leaving the application open to authentication bypass, data exfiltration, or hidden command execution.

Related Frameworks and Taxonomies

Refer to this section for comprehensive information, scenarios, and strategies relating to infrastructure deployment, applied environment controls, and other best practices.
