This repository contains the code for the experiments in "The Hive Mind is a Single Reinforcement Learning Agent". In the paper we show that decentralised local imitation rules aggregate into a single RL agent.
```
src/
├── bandit.py                  # Bandit environments (Linear, Sigmoid, Congestion)
├── reinforcement_learning.py  # RL algorithms: CL, MCL, P-CL, P-MCL
├── population_simulation.py   # Population dynamics: Imitation of Success, Weighted Voter Rule, Majority Rule
├── analytical_solutions.py    # Closed-form replicator dynamics (TRD, MRD)
├── rl.ipynb                   # RL bandit experiments
├── populations.ipynb          # Population simulation experiments
└── plots.ipynb                # Supplementary/exploratory plots
```
`rl.ipynb` generates figures comparing the Cross Learning (CL) and Maynard Cross Learning (MCL) bandit algorithms against closed-form analytical baselines (the TRD and MRD replicator dynamics) across three reward distributions (low / middle / high q_a values).
| Section | Description | Output |
|---|---|---|
| Streaming RL | Single-environment CL and MCL at two learning rates (α = 0.001 and α = 0.1), run for up to 150k steps | streaming_rl_experiments.pdf |
| Parallel RL | Multi-environment P-CL and P-MCL varying parallel environments B ∈ {10, 1000} | parallel_rl_experiments.pdf |
| Congestion | CL/MCL on a two-action congestion bandit; includes interactive reward histogram widget | congestion_experiment_rl.pdf |
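For orientation, the Cross Learning update compared in these experiments can be sketched as below. This is an illustrative re-implementation of the standard CL rule (chosen arm reinforced in proportion to its reward, all others renormalised), not the code in `reinforcement_learning.py`; the function name and signature are my own.

```python
import numpy as np

def cross_learning_step(p, reward_fn, alpha, rng):
    """One Cross Learning (CL) update on a policy vector p.

    Samples an arm from p, observes a reward r in [0, 1], then moves
    the chosen arm's probability toward 1 by alpha * r * (1 - p[a])
    while shrinking all other arms, so p remains a distribution.
    """
    a = rng.choice(len(p), p=p)
    r = reward_fn(a)        # reward in [0, 1]
    p = p - alpha * r * p   # shrink every arm...
    p[a] += alpha * r       # ...then credit the chosen arm
    return p
```

MCL additionally tracks a reward baseline (the `alpha_baseline` parameter below) and reinforces relative to it, which is what makes its mean dynamics match the Maynard replicator rather than the Taylor one.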
`populations.ipynb` generates figures comparing population simulations (Weighted Voter Rule and Imitation of Success) against the same TRD/MRD analytical baselines.
| Section | Description | Output |
|---|---|---|
| Population size | R_wvoter and R_success at N ∈ {10, 1000} across all three reward distributions | population_experiments.pdf |
| Neighbourhood size | R_wvoter with varying neighbourhood size M ∈ {2, 10, 1000} at fixed N=1000 | nei_experiments.pdf |
| Hybrid algorithms | Deterministic/stochastic Imitation of Success combined with R_wvoter | hybrid_experiments.pdf |
| Congestion | R_success and R_wvoter on the two-action congestion bandit at N=2000 | congestion_experiments_pop.pdf |
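One common formulation of an Imitation-of-Success round, matching the `deterministic` switch in the table below, can be sketched as follows. This is a hypothetical illustration of the general rule (compare your payoff with a random peer's and copy the more successful action); the exact update in `population_simulation.py` may differ.

```python
import numpy as np

def imitation_of_success_step(actions, reward_fn, deterministic, rng):
    """One synchronous Imitation-of-Success round over a population.

    Each agent samples a reward for its current action, picks a random
    peer, and copies the peer's action if the peer did better:
    always (deterministic) or with probability equal to the clipped
    reward gap (stochastic).
    """
    n = len(actions)
    rewards = np.array([reward_fn(a) for a in actions])
    peers = rng.integers(0, n, size=n)
    gap = rewards[peers] - rewards
    if deterministic:
        switch = gap > 0
    else:
        switch = rng.random(n) < np.clip(gap, 0.0, 1.0)
    return np.where(switch, actions[peers], actions)
```

The Weighted Voter Rule replaces the pairwise comparison with a poll of M neighbours, which is what the `neighbourhood_size` parameter controls.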
`plots.ipynb` contains standalone exploratory figures probing specific algorithmic properties.
| Cell | Description | Output |
|---|---|---|
| MCL with many alphas | MCL convergence at α ∈ {0.001, 0.05, 0.01} vs. P-MCL on evenly-spaced bandit | MCL_with_many_alphas.pdf |
| WVR neighbourhood sweep | R_wvoter with M ∈ {1, 2, 5, 10, 500} at N=500 | wvr_with_many_nei_sizes.pdf |
| WVR population sweep | R_wvoter convergence for N ∈ {10, 20, 50, 100, 500} | R_wvoter_scenario_evenly_spaced.pdf |
| Majority rule population sweep | Majority Rule for N ∈ {10, 20, 50, 100, 1000} | Majority_population_scenario_evenly_spaced.pdf |
| Majority rule neighbourhood sweep | Majority Rule with M ∈ {2, 10, 50, 100, 500} | majority_with_many_nei_sizes.pdf |
| Majority rule vote sweep | Varies votes S ∈ {1, 3, 10, 20, 100, 10000, ∞}; S=1 recovers WVR, S→∞ recovers deterministic majority | majority_rule.pdf |
| Frankenstein bee | Hybrid: deterministic/stochastic Imitation of Success with R_wvoter vs. TRD/MRD | frankstein_bee.pdf |
| Symbol | Name | Parameters |
|---|---|---|
| CL | Cross Learning | steps — simulation steps<br>seeds — independent runs<br>alpha — learning rate<br>bandit — environment |
| MCL | Maynard Cross Learning | steps — simulation steps<br>seeds — independent runs<br>alpha — learning rate<br>alpha_baseline — baseline tracker rate<br>bandit — environment |
| P-CL | Parallel Cross Learning | steps — simulation steps<br>seeds — independent runs<br>parallel_envs — number of parallel environments (B)<br>bandit — environment |
| P-MCL | Parallel Maynard Cross Learning | steps — simulation steps<br>seeds — independent runs<br>parallel_envs — number of parallel environments (B)<br>bandit — environment |
| R_success | Imitation of Success | steps — simulation steps<br>population_size — N<br>iterations — independent runs<br>deterministic — switch rule (stochastic or deterministic)<br>stop_if_end — halt at convergence |
| R_wvoter | Weighted Voter Rule | steps — simulation steps<br>population_size — N<br>iterations — independent runs<br>neighbourhood_size — M<br>switch — update mode (bee / is_det / is_stoc)<br>stop_if_end — halt at convergence |
| R_majority | Majority Rule | steps — simulation steps<br>population_size — N<br>iterations — independent runs<br>neighbourhood_size — M<br>number_of_votes — S (votes per decision; S=1 recovers WVR, S→∞ is deterministic majority)<br>stop_if_end — halt at convergence |
| TRD | Taylor Replicator Dynamic | steps — simulation steps<br>delta — time step<br>bandit — environment |
| MRD | Maynard Replicator Dynamic | steps — simulation steps<br>delta — time step<br>bandit — environment |
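The TRD and MRD baselines are ordinary differential equations, so they can be sketched as a simple Euler integration with the `delta` time step from the table above. The equations are the standard Taylor and Maynard-Smith replicator dynamics; the function below is an illustrative sketch, not the code in `analytical_solutions.py`, and it assumes a fixed mean-reward vector q rather than a bandit object.

```python
import numpy as np

def replicator_trajectory(x0, q, delta, steps, maynard=False):
    """Euler-integrate a replicator dynamic for mean rewards q.

    TRD:  dx_a/dt = x_a * (q_a - q_bar)
    MRD:  dx_a/dt = x_a * (q_a - q_bar) / q_bar
    where q_bar = sum_a x_a * q_a is the population-average reward.
    """
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        q_bar = x @ q
        dx = x * (q - q_bar)
        if maynard:
            dx /= q_bar          # Maynard's payoff-normalised variant
        x = x + delta * dx
        traj.append(x.copy())
    return np.array(traj)
```

Both dynamics preserve the simplex exactly under Euler steps (the increments sum to zero), which makes them convenient closed-form references for the stochastic algorithms.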
| Name | Parameters |
|---|---|
| BanditLinear | n_action — number of arms<br>gap — reward noise (uniform ± gap around q_star)<br>name — reward preset: near zero (q_star ∈ [0.1, 0.4]), evenly spaced (q_star ∈ [0.4, 0.7]), near one (q_star ∈ [0.6, 0.9]) |
| BanditSigmoid | n_action — number of arms<br>mean — mean of the q_star distribution<br>std — standard deviation of q_star and rewards<br>name — same presets as BanditLinear but in logit space; rewards are passed through a sigmoid |
| BanditCongestion | n_action — number of arms (2)<br>q_star — base mean rewards per action<br>congestion_factors — per-action congestion weight ω; effective reward = q_star − ω × policy<br>gap — reward noise (uniform ± gap) |
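The congestion reward rule from the table (effective reward = q_star − ω × policy, plus uniform ± gap noise) can be sketched as below. This is an illustrative stand-in for `BanditCongestion`, with my own function name; the per-arm indexing of q_star, ω, and the policy mass is an assumption.

```python
import numpy as np

def congestion_reward(action, policy, q_star, congestion_factors, gap, rng):
    """Illustrative congestion-bandit reward for one pull.

    The mean reward of an arm drops linearly with the fraction of the
    population (policy mass) currently choosing it:
        r_mean = q_star[a] - omega[a] * policy[a]
    with uniform noise drawn from [-gap, +gap] on top.
    """
    mean = q_star[action] - congestion_factors[action] * policy[action]
    return mean + rng.uniform(-gap, gap)
```

Because the reward depends on the current policy, the best arm changes as the population shifts, which is what makes this environment a test of the CL/MCL vs. population-rule equivalence beyond fixed reward distributions.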
- Cross-inhibition: Extending the equivalence to populations that use cross-inhibitory signalling, where individuals actively suppress competing options.
- Multi-population multi-agent RL: Generalising the framework to settings with multiple interacting populations, each acting as a macro-agent, giving rise to multi-agent RL dynamics.
- Algorithm design — RL → imitation rules: Systematically deriving local imitation rules from a target RL algorithm, enabling principled design of swarm behaviours.
- Algorithm design — imitation rules → RL: Going in the other direction: given a set of imitation rules, characterising the emergent macro-agent and its learning algorithm.
- LOLA-like macro-agents: Designing imitation rules whose emergent macro-agent corresponds to a Learning with Opponent-Learning Awareness (LOLA) agent, connecting swarm behaviour to opponent-shaping in multi-agent RL.
If you use this code, please cite:
```bibtex
@article{soma2024hivemind,
  title={The Hive Mind is a Single Reinforcement Learning Agent},
  author={Soma, Karthik and Bouteiller, Yann and Hamann, Heiko and Beltrame, Giovanni},
  journal={arXiv preprint arXiv:2410.17517},
  year={2024},
  doi={10.48550/arXiv.2410.17517}
}
```
