
Selected Work
LLM Plays Fire Emblem
AI gameplay harness for Fire Emblem: emulator state, LLM decision loops, tactical context, and commentary.
Rally
AI-native interactive character experience with Live2D, memory, games, dates, and relationship building.
Research
BS-Bench
A 600-game benchmark for measuring LLM deception, lie detection, and instruction compliance through the bluffing card game Bullshit.
- Games: 600
- Matchups: 15
- Prompt conditions: 4
- Models: 6 total (4 per round)
“Honesty prompts reduced lying, but also reduced challenges and made remaining lies more successful.”