Seeking microgrant support for multi-agent manipulation research

While plenty of research exists on AI persuasion of humans, there’s less work on understanding LLM-to-LLM persuasion in multi-agent scenarios, which is increasingly critical as AI teams begin to collaborate on complex tasks. Not only that, most existing research focuses on epistemological manipulation: persuasion of beliefs, not necessarily actions. Arguably, the latter is not only more indicative of true values held by someone, but is also far more dangerous in the hands of malicious or unaligned agents.

Our proposed benchmark evaluates both manipulative capability and goal retention by having one LLM attempt to manipulate another LLM into suboptimal choices in a multi-armed bandit game. The deceptive agent knows the true reward distributions but must convince the playing agent to choose poorly without revealing its manipulative intent, while the playing agent must navigate between potentially helpful advice and harmful deception.

I've done similarly flavored benchmark work before, and would estimate a cost of around 750-1500 EUR in API costs for a project like this depending on how many models we test. We already have a working prototype despite having worked on it for less than a week. Ran it on Llama (free) and Claude 3.5 Haiku (very cheap), but we want to scale it to bigger models to get a more realistic idea of SOTA capabilities in this area.

TL;DR - we want to benchmark LLM-on-LLM manipulation, and do it properly by checking actions, not claimed beliefs.

@Microgrants Team (Assuming I need to receive feedback at this stage before filling out the Microgrants template! Would love to answer any questions. Thank you)
arXiv.org
Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a f...
GitHub
Contribute to alonr619/Multi-Armed-Manipulation-Evaluation development by creating an account on GitHub.
Was this page helpful?
Seeking microgrant support for multi-agent manipulation research - PauseAI