Our research on AI manipulation detection and mitigation has been accepted for presentation at the AIMII Workshop (AI, Manipulation, & Information Integrity) at IASEAI'26, the conference of the International Association for Safe and Ethical AI, held at UNESCO House in Paris this February.
The Problem
Large language models can be prompted to provide strategically biased responses, presenting selectively chosen facts that steer users toward particular conclusions while omitting countervailing evidence. This isn't hallucination or error. It's technically-true information deployed to mislead.
As AI systems become integral to how people seek information, this form of manipulation becomes a serious concern. Users may receive responses that appear helpful and well-reasoned while being systematically steered away from the full picture.
What We Built
We developed TruthBot, a framework that both detects and mitigates AI manipulation through what we call "parallel truthful inference."
Simple concept: for any query to an AI system of unknown intent, simultaneously query a model explicitly prompted to present comprehensive, balanced information. That parallel model presents the strongest evidence from all perspectives, acknowledges where expert consensus lies, and explicitly surfaces commonly overlooked facts.
The parallel response does two things:
- Real-time mitigation: Users receive the complete picture alongside any potentially-manipulative response
- Detection: By comparing the two responses, we can identify patterns indicative of manipulation: evidence concealment, framing distortion, false equivalence
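The query-in-parallel-then-compare loop can be sketched as below. Everything here is an illustrative stand-in, not TruthBot's actual implementation: the function names, the stubbed model calls, and especially the word-overlap heuristic at the end (the real detector would compare responses with a model-based judge rather than token sets).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real model calls; any chat-completion
# client could be dropped in here.
def query_monitored_model(prompt: str) -> str:
    # The response of unknown intent (possibly manipulative).
    return "Fact A supports conclusion X."

def query_truthful_model(prompt: str) -> str:
    # The same underlying model, explicitly prompted for comprehensive,
    # balanced information.
    return "Fact A supports X; Fact B cuts against it; consensus favors Y."

def parallel_truthful_inference(prompt: str) -> dict:
    # Issue both queries simultaneously so the user sees the complete
    # picture alongside the monitored response.
    with ThreadPoolExecutor(max_workers=2) as pool:
        monitored = pool.submit(query_monitored_model, prompt)
        truthful = pool.submit(query_truthful_model, prompt)
        return {"monitored": monitored.result(), "truthful": truthful.result()}

def detect_manipulation(monitored: str, truthful: str) -> bool:
    # Toy comparison: flag possible evidence concealment when the truthful
    # response contains substantial content absent from the monitored one.
    monitored_words = set(monitored.lower().split())
    truthful_words = set(truthful.lower().split())
    missing = truthful_words - monitored_words
    return len(missing) / max(len(truthful_words), 1) > 0.3

result = parallel_truthful_inference("Is X true?")
flagged = detect_manipulation(result["monitored"], result["truthful"])
```

The threading detail matters only for latency: the parallel query should not make the user wait twice as long for an answer.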
Key Findings
We ran 18,000 simulated conversations across six model configurations and 15 propositions spanning health, science, policy, and history.
- Manipulation reduction. TruthBot reduced the effectiveness of manipulative AI responses by 30-71% in 5 of 6 configurations. In one configuration, manipulation was completely neutralized and reversed.
- Helpfulness preserved. When no manipulation was present, TruthBot either maintained or slightly improved the beneficial effects of helpful responses.
- Detection accuracy. Comparison-based detection achieved 83-96% accuracy in 5 of 6 configurations, with recall consistently above 94%.
- Topic variation. Most effective on topics with clear scientific consensus. Less effective on genuinely contested policy questions, which is what you'd want from a system designed to surface truth rather than impose opinions.
Why This Matters
The framework operates as a wrapper. It requires no cooperation from AI providers and no modification to the systems being monitored. It can be deployed as a browser extension or API middleware. As models improve, the defense can be upgraded in parallel.
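The wrapper pattern can be made concrete with a short sketch. This is our illustration of the middleware idea, not code from the repository; the names (`wrap_with_truthbot`, `BALANCED_PROMPT`) and the trivial stand-in detector are hypothetical.

```python
# Assumed prompt wording; the paper's actual prompt may differ.
BALANCED_PROMPT = (
    "Present comprehensive, balanced information on the question below, "
    "including the strongest evidence from all perspectives.\n\n"
)

def wrap_with_truthbot(chat_fn):
    """Wrap any text-in/text-out chat function. Because the wrapper only
    needs to call the model, no provider cooperation or modification of
    the monitored system is required."""
    def wrapped(prompt: str) -> dict:
        original = chat_fn(prompt)                    # possibly manipulative
        balanced = chat_fn(BALANCED_PROMPT + prompt)  # parallel truthful query
        # Stand-in detector: the real system compares the two responses for
        # evidence concealment, framing distortion, and false equivalence.
        flagged = original != balanced
        return {"response": original, "balanced": balanced, "flagged": flagged}
    return wrapped

# Usage with a stub model; in practice chat_fn would call a real API.
def stub_model(prompt: str) -> str:
    return "echo: " + prompt

safe_model = wrap_with_truthbot(stub_model)
out = safe_model("Is X safe?")
```

Because the wrapper treats the monitored model as a black box, the same pattern fits a browser extension or API middleware, and the truthful side can be upgraded independently as models improve.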
For manipulation strategies that rely on selective presentation of real facts and ideas, defenders have an asymmetric advantage. You can only become more informed, never less. As time approaches infinity, the person with access to more complete information wins. Pair any model with a guaranteed-truthful version of itself and the manipulator is fighting against the full weight of reality.
The field needs more work like this. Not just benchmarks and evaluations, but tools that protect users in real-world settings.
About the Conference
IASEAI'26 brings together over 800 researchers, policymakers, and practitioners working on AI safety and ethics. The AIMII Workshop specifically focuses on AI's persuasive and manipulative capabilities and their implications for society.
The full code is available at github.com/JNC4/apartaim. The paper is forthcoming.