Advancing Multi-Agent Reasoning in Open-Face Chinese Poker

CS 224R: Deep Reinforcement Learning

By: Alice Guo, Ramya Iyer, Isabella Lee

Motivation

While reinforcement learning (RL) has been extensively studied in games like No-Limit Texas Hold'em poker, Open-Face Chinese Poker (OFCP) remains largely unexplored. OFCP presents unique challenges due to its sparse rewards and complex hand dynamics. This paper investigates whether Deep Q-Learning, Proximal Policy Optimization (PPO), and Monte Carlo Tree Search (MCTS) can be effective for OFCP.

Methods

We implemented and compared several RL algorithms in a self-play environment with no external datasets, pitting our agents against both rule-based and learning-based opponents.

Q-Learning Family

  • Q-Learning: Baseline tabular method, feasible only for small state spaces.
  • Deep Q-Learning: Inspired by Tan and Xiao's (2018) DQN implementation for OFCP.
  • Double DQN: Reduces overestimation bias by decoupling action selection from evaluation (target computation sketched after this list).
  • Dueling DQN: Separates the network into value and advantage streams for more precise Q-value approximation.
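
A minimal sketch of the two DQN variants above, assuming a PyTorch setup; `online_net`, `target_net`, and the replay-batch layout are hypothetical names, and the actual network architecture and state encoding in this repo may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def double_dqn_loss(online_net, target_net, batch, gamma=0.99):
    """Double DQN: the online net selects the next action, the target net evaluates it."""
    states, actions, rewards, next_states, dones = batch
    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.smooth_l1_loss(q, targets)

class DuelingHead(nn.Module):
    """Dueling DQN head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)
        self.advantage = nn.Linear(feat_dim, n_actions)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)
```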

PPO (Proximal Policy Optimization)

  • Stable policy-gradient learning with a stochastic policy, well suited to imperfect information and uncertainty.
  • Includes clipped surrogate optimization, entropy regularization, and Generalized Advantage Estimation (GAE) for improved tie-breaking (core update sketched below).
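
A minimal sketch of the PPO pieces named above (clipped surrogate, entropy bonus, GAE), assuming PyTorch tensors; the hyperparameters and function names here are illustrative, not the repo's actual implementation.

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.
    `values` has length T + 1 (includes a bootstrap value for the final state)."""
    T = len(rewards)
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last_adv = delta + gamma * lam * nonterminal * last_adv
        advantages[t] = last_adv
    return advantages

def ppo_loss(new_logp, old_logp, advantages, entropy, clip_eps=0.2, ent_coef=0.01):
    """Clipped surrogate objective plus entropy bonus, returned as a loss to minimize."""
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -(surrogate.mean() + ent_coef * entropy.mean())
```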

MCTS (Monte Carlo Tree Search)

  • Explores potential card placements and simulates future rollouts from each candidate placement.
  • Implements optimizations such as the Cross-Entropy Method (CEM), Rapid Action Value Estimation (RAVE), and Counterfactual Regret Minimization (CFR) to improve early-move decisions and long-term planning (core selection/backup loop sketched below).
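
A minimal sketch of the core MCTS selection and backup steps; the CEM, RAVE, and CFR enhancements mentioned above and the OFCP-specific rollout policy are not shown, and the class and function names are illustrative.

```python
import math

class Node:
    """One node in the search tree: a partial OFCP board reached via `action`."""
    def __init__(self, parent=None, action=None):
        self.parent, self.action = parent, action
        self.children = {}        # action -> Node
        self.visits = 0
        self.total_value = 0.0

def uct_select(node, c=1.4):
    """Pick the child maximizing UCT = mean rollout value + exploration bonus."""
    return max(
        node.children.values(),
        key=lambda ch: ch.total_value / max(ch.visits, 1)
        + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
    )

def backpropagate(node, value):
    """Propagate a rollout result from a leaf back up to the root."""
    while node is not None:
        node.visits += 1
        node.total_value += value
        node = node.parent
```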

Implementation

  • Custom two-player OFCP environment with deck management, card placement, and hand validation.
  • Rewards aligned with OFCP rules (winning hands, fouling, scooping, royalties).
  • Agents trained via self-play and evaluated by (see the evaluation sketch after this list):
    • Method win rate
    • Bot win rate
    • Average points per game
    • Training efficiency
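
A minimal sketch of an evaluation loop behind the metrics listed above. The `env`, `agent`, and `opponent` interfaces (`reset`, `step`, `legal_actions`, `score`, `act`) are hypothetical placeholders, not the repo's actual API.

```python
def evaluate(env, agent, opponent, n_games=100):
    """Play head-to-head games and report win rates and average points per game."""
    agent_wins = bot_wins = 0
    total_points = 0.0
    for _ in range(n_games):
        state = env.reset()
        done = False
        while not done:
            player = agent if env.current_player == 0 else opponent
            action = player.act(state, env.legal_actions())
            state, done = env.step(action)
        points = env.score()   # positive when the agent scores more than the bot
        total_points += points
        if points > 0:
            agent_wins += 1
        elif points < 0:
            bot_wins += 1
    return {
        "method_win_rate": agent_wins / n_games,
        "bot_win_rate": bot_wins / n_games,
        "avg_points_per_game": total_points / n_games,
    }
```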

Results

Method     | Win Rate (Model) | Bot Win Rate | Avg Points/Game | Evaluation Time (100 games)
-----------|------------------|--------------|-----------------|----------------------------
MCTS       | 89%              | 3%           | 11.2            | 447 minutes
PPO + GAE  | 41%              | 0%           | 5.02            | 20 seconds
Double DQN | 35%              | 23%          | 3.50            | 4 minutes 16 seconds
  • MCTS outperformed the other methods, achieving the highest win rate and average points per game, but incurred a significant computational cost (~200x slower than PPO).
  • PPO demonstrated competitive performance with much faster evaluation.
  • Double DQN was less effective overall but still outperformed random play (tabular Q-learning and vanilla DQN did not).

Discussion and Conclusion

  • Our focus was on two-player OFCP, excluding multi-player variants and advanced rule sets (e.g., Fantasyland, Shoot the Moon).
  • While MCTS offers superior gameplay quality, its computational overhead limits real-time applications.
  • Future work aims to:
    • Optimize MCTS efficiency (e.g., parallel rollouts, learned policies).
    • Explore hybrid neuroevolution-RL techniques to handle sparse rewards.
    • Enhance lightweight methods for better decisiveness without heavy compute.

Key takeaway: All three classes of RL methods outperform a random bot, with MCTS showing the strongest performance (89% method win rate).
