BS-Bench: What a Bluffing Card Game Revealed About LLM Deception

Most honesty evaluations are static: ask a model a question, score the answer, move on. BS-Bench tests a more adversarial case: what happens when deception is useful, legal, and punishable?

The benchmark uses the bluffing card game Bullshit because it has clean constraints. Players must claim ranks in sequence, may lie about the rank they played, and can challenge each other. When a claim is challenged and turns out to be false, the liar picks up the pile; when it turns out to be true, the challenger pays instead. That gives each turn objective lie labels, explicit detection opportunities, and real costs for both deception and over-enforcement.
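As a rough illustration of how a single turn yields an objective lie label and a penalty, here is a minimal sketch. The `Play` class and `resolve_challenge` function are hypothetical names, not the benchmark's code, and the sketch assumes the standard rule that a wrong challenger picks up the pile (the article only says the challenger "pays").

```python
from dataclasses import dataclass


@dataclass
class Play:
    player: str
    claimed_rank: str
    cards: list[str]  # ranks actually placed face down, e.g. ["7", "K"]

    @property
    def is_lie(self) -> bool:
        # A play is a lie if any card differs from the claimed rank.
        return any(rank != self.claimed_rank for rank in self.cards)


def resolve_challenge(play: Play, challenger: str) -> str:
    """Return the player who takes the pile (assumed penalty for both sides)."""
    return play.player if play.is_lie else challenger


honest = Play("A", "7", ["7", "7"])
bluff = Play("B", "8", ["8", "Q"])
print(resolve_challenge(honest, "C"))  # C: the challenge was wrong
print(resolve_challenge(bluff, "C"))   # B: the lie was caught
```

The lie label depends only on the cards and the claim, so it can be logged for every turn whether or not anyone challenges.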

Setup

The finished pilot contains 600 winner-terminated games across 6 hosted LLMs, 4 prompt conditions, 15 four-model matchups, and 10 seeded games per matchup per condition. Each game uses a standard 52-card deck, four players, seeded seating, and seeded shuffle order.
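The game count follows directly from the design: choosing 4 of 6 models gives 15 matchups, and 15 matchups times 10 seeds times 4 conditions gives 600 games. The sketch below enumerates such a schedule; the tuple layout and seed indices are illustrative assumptions, not the benchmark's actual schedule format.

```python
from itertools import combinations

MODELS = ["Kimi", "Qwen", "Nemotron", "GLM", "MiniMax", "Mistral"]
CONDITIONS = [0, 1, 2, 3]   # Experiment 0 through Experiment 3
SEEDS_PER_CELL = 10         # seeded games per matchup per condition

# Every unordered group of four models is one matchup.
matchups = list(combinations(MODELS, 4))

# Hypothetical schedule entries: (condition, matchup, seed index).
schedule = [
    (condition, matchup, seed)
    for condition in CONDITIONS
    for matchup in matchups
    for seed in range(SEEDS_PER_CELL)
]

games_with_kimi = sum(1 for _, matchup, _ in schedule if "Kimi" in matchup)
print(len(matchups), len(schedule), games_with_kimi)  # 15 600 400
```

Under this balanced design each model sits in 10 of the 15 matchups, so it plays 400 of the 600 games.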

The four prompt conditions are:

  • Experiment 0: low-strategy control.
  • Experiment 1: deception is legal and expected.
  • Experiment 2: the focal model may lie, but opponents are described as honesty-constrained.
  • Experiment 3: all players are told they must play honestly and lying is not allowed.

The seed fixes the game setup: deck order, seating order, and therefore each player’s starting hand. That means the card distribution can be replayed exactly. Model outputs are not fixed in the same way: hosted providers may return different text on a rerun, so the benchmark is reproducible at the game-state level, not as a byte-for-byte transcript.
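A minimal sketch of what seed-level reproducibility can look like, assuming a single seeded generator drives both seating and shuffle (the benchmark may derive them differently). The `deal` function and the card encoding are hypothetical.

```python
import random

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["C", "D", "H", "S"]


def deal(seed: int, players: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Return (seating order, starting hands) determined entirely by the seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    seating = list(players)
    rng.shuffle(seating)
    deck = [rank + suit for rank in RANKS for suit in SUITS]
    rng.shuffle(deck)
    # Deal round-robin: with four players each hand gets 13 cards.
    hands = {player: deck[i::len(seating)] for i, player in enumerate(seating)}
    return seating, hands


players = ["Kimi", "Qwen", "GLM", "Mistral"]
first = deal(42, players)
second = deal(42, players)
print(first == second)                           # True
print([len(hand) for hand in first[1].values()])  # [13, 13, 13, 13]
```

Replaying the same seed on the same Python version reproduces the seating and hands, while whatever the models say from that starting state can still differ between runs.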

Main Result

The honesty prompt reduces lying, but it does not remove it. In Experiment 3, every model still violates the honesty rule, at rates between 20.4% and 31.1%:

| Model | Honesty-rule violation rate |
| --- | --- |
| Qwen | 31.1% |
| Kimi | 30.5% |
| GLM | 28.7% |
| Nemotron | 27.3% |
| MiniMax | 25.3% |
| Mistral | 20.4% |
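As a quick consistency check, the unweighted mean of these six rates can be compared with the 27.2% mean lie frequency the article reports for Experiment 3. Treating the reported mean as an unweighted average over models is an assumption.

```python
from statistics import mean

# Experiment 3 honesty-rule violation rates, in percent, from the table above.
violation_rate = {
    "Qwen": 31.1,
    "Kimi": 30.5,
    "GLM": 28.7,
    "Nemotron": 27.3,
    "MiniMax": 25.3,
    "Mistral": 20.4,
}

mean_violation = round(mean(violation_rate.values()), 1)
print(mean_violation)  # 27.2
```

The match suggests that, under the honesty prompt, the violation rate and the lie frequency are the same quantity averaged over models.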

The more interesting effect is table-level. From Experiment 1 to Experiment 3:

  • mean lie frequency falls from 39.6% to 27.2%
  • mean challenge frequency falls from 45.3% to 21.8%
  • mean lie success rises from 11.0% to 34.7%

So the prompt does not just change the liar. It changes the enforcement regime. The table becomes less deceptive, but also much softer about checking the lies that remain.
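To make the size of the shift concrete, the sketch below computes the change in percentage points and the ratio for each of the three reported means. The dictionary keys are illustrative labels only.

```python
# Mean rates in percent, as reported for Experiment 1 and Experiment 3.
exp1 = {"lie_frequency": 39.6, "challenge_frequency": 45.3, "lie_success": 11.0}
exp3 = {"lie_frequency": 27.2, "challenge_frequency": 21.8, "lie_success": 34.7}

changes = {}
for metric in exp1:
    delta = exp3[metric] - exp1[metric]   # change in percentage points
    ratio = exp3[metric] / exp1[metric]   # Experiment 3 relative to Experiment 1
    changes[metric] = (round(delta, 1), round(ratio, 2))
    print(f"{metric}: {delta:+.1f} points, x{ratio:.2f}")
```

Lying drops by about a third, challenging drops by about half, and lie success roughly triples, which is the asymmetry behind the softer enforcement regime.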

Bar chart of Experiment 3 honesty-rule violation rates for all six models.

Slope chart showing lie frequency across all four prompt conditions for each model.

Slope chart showing challenge frequency across all four prompt conditions for each model.

Strategic Styles

The benchmark also surfaces stable model behavior. Kimi is the strongest finisher, leading three of four conditions. Qwen is the most consistent runner-up across deception-relevant conditions. Mistral is the clearest outlier: it challenges far more often than every other model, lies least, and never wins a game.

Across the full player-game cohort:

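| Model | Overall win rate | Lie frequency | Challenge frequency | Challenge accuracy |
| --- | --- | --- | --- | --- |
| Kimi | 56.2% | 42.9% | 11.3% | 50.4% |
| Qwen | 33.8% | 36.5% | 18.9% | 38.7% |
| Nemotron | 25.8% | 40.3% | 19.2% | 40.0% |
| GLM | 23.2% | 32.4% | 22.5% | 39.6% |
| MiniMax | 11.0% | 30.9% | 30.0% | 33.6% |
| Mistral | 0.0% | 17.9% | 63.4% | 31.3% |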
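The win rates sum to 150%, not 100%, which is what a four-seat game drawn from six models should produce. The sketch below checks this, assuming a balanced schedule in which every model plays 400 of the 600 games and each game has exactly one winner.

```python
# Overall win rates in percent, from the table above.
win_rate = {
    "Kimi": 56.2,
    "Qwen": 33.8,
    "Nemotron": 25.8,
    "GLM": 23.2,
    "MiniMax": 11.0,
    "Mistral": 0.0,
}

TOTAL_GAMES = 600
PLAYERS_PER_GAME = 4
NUM_MODELS = 6

# Assumed balanced design: each model appears in the same number of games.
games_per_model = TOTAL_GAMES * PLAYERS_PER_GAME // NUM_MODELS

rate_sum = round(sum(win_rate.values()), 1)
implied_wins = round(sum(r / 100 * games_per_model for r in win_rate.values()))
print(games_per_model, rate_sum, implied_wins)  # 400 150.0 600
```

The implied win total equals the number of games, so the table is internally consistent with one winner per game.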

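Winning is not a monotonic function of lying more or challenging more. It is about calibration: Kimi challenges least often (11.3%) but is right about half the time (50.4%), while Mistral has a 63.4% challenge frequency, the lowest challenge accuracy (31.3%), and no wins.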

Scatter plot of lie frequency versus win rate across all 24 model-condition rows.

Scatter plot of challenge frequency versus win rate across all 24 model-condition rows.

Why It Matters

BS-Bench is narrow by design. It does not prove that language models are generally deceptive. It shows that in a controlled hidden-information game, prompt framing affects both deception and enforcement.

That is the useful research shape: a reproducible environment where strategic misrepresentation, lie detection, and instruction compliance can be measured together instead of argued about abstractly.
