Seeking microgrant support for multi-agent manipulation research

While plenty of research exists on AI persuasion of humans, there’s less work on understanding LLM-to-LLM persuasion in multi-agent scenarios, which is increasingly critical as AI teams begin to collaborate on complex tasks. Not only that, most existing research focuses on epistemological manipulation: persuasion of beliefs, not necessarily actions. Arguably, the latter is not only more indicative of true values held by someone, but is also far more dangerous in the hands of malicious or unaligned agents.

Our proposed benchmark evaluates both manipulative capability and goal retention by having one LLM attempt to manipulate another LLM into suboptimal choices in a multi-armed bandit game. The deceptive agent knows the true reward distributions but must convince the playing agent to choose poorly without revealing its manipulative intent, while the playing agent must navigate between potentially helpful advice and harmful deception.

I've done similarly flavored benchmark work before, and would estimate a cost of around 750-1500 EUR in API costs for a project like this depending on how many models we test. We already have a working prototype despite having worked on it for less than a week. Ran it on Llama (free) and Claude 3.5 Haiku (very cheap), but we want to scale it to bigger models to get a more realistic idea of SOTA capabilities in this area.

TL;DR - we want to benchmark LLM-on-LLM manipulation, and do it properly by checking actions, not claimed beliefs.

@Microgrants Team (Assuming I need to receive feedback at this stage before filling out the Microgrants template! Would love to answer any questions. Thank you)

arXiv.org

AssertBench: A Benchmark for Evaluating Self-Assertion in Large Lan...

Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a f...

GitHub

GitHub - alonr619/Multi-Armed-Manipulation-Evaluation

Contribute to alonr619/Multi-Armed-Manipulation-Evaluation development by creating an account on GitHub.

hurt-tomato•7/16/25, 8:34 AM

Cool initiative! I've replied over email with some questions on alternative funding options.

ill-bronze•7/16/25, 1:34 PM

Brown University, for one, has not produced any papers about AI safety evals in the past few years

Have you reached out to Professor Ellie Pavlick. She does some safety research and some evals type research. Eg: https://arxiv.org/abs/2507.01790

ill-bronze•7/16/25, 1:35 PM

Also at Brown, you could reach out to Jacob Goldman-Wetzler

hurt-tomato•7/16/25, 1:35 PM

Shall i pass this on?

Hhurt-tomato Cool initiative! I've replied over email with some questions on alternative fund...

spotty-amberOP•7/16/25, 4:50 PM

Thanks! I’ll reply as soon as I can get on my laptop

Iill-bronze Also at Brown, you could reach out to Jacob Goldman-Wetzler

spotty-amberOP•7/16/25, 4:51 PM

Hahaha we are good friends

spotty-amberOP•7/16/25, 4:51 PM

I asked him for advice on my previous project too, actually

Iill-bronze > Brown University, for one, has not produced any papers about AI safety evals i...

spotty-amberOP•7/16/25, 4:52 PM

I was not aware that Dr Pavlick did this! I’ll reach out, thanks for the suggestion

Seeking microgrant support for multi-agent manipulation research

Similar Threads

Similar Threads

Similar Threads