<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Three Laws — AI Alignment Research Collective</title>
    <link>https://threelaws.net</link>
    <description>Research updates from Three Laws, an AI alignment research collective investigating how principles from biology and economics can inform safer, more aligned AI systems.</description>
    <language>en</language>
    <lastBuildDate>Sun, 28 Dec 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://threelaws.net/feed.xml" rel="self" type="application/rss+xml"/>

    <item>
      <title>Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns</title>
      <link>https://www.lesswrong.com/posts/9hWgJQK8wnpuFtD5Z/research-agenda-for-training-aligned-ais-using-concave</link>
      <guid isPermaLink="false">threelaws-2025-06-blackbox-interpretability</guid>
      <pubDate>Sun, 28 Dec 2025 00:00:00 +0000</pubDate>
      <description>This conceptual overview post explains what I mean by the principles of "homeostasis", "diminishing returns", and "balancing": how these ideas differ from, complement, and interact with each other. Alongside, it also gives an overview of our research agenda.</description>
    </item>

    <item>
      <title>Working paper — BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs</title>
      <link>https://arxiv.org/abs/2509.02655</link>
      <guid isPermaLink="false">threelaws-2025-09-bioblue-paper</guid>
      <pubDate>Tue, 02 Sep 2025 00:00:00 +0000</pubDate>
      <description>We empirically test whether LLMs exhibit runaway optimisation by placing them in simple, long-horizon environments requiring homeostasis and multi-objective balancing. Although the models frequently behave appropriately at first, they drift into runaway behaviours: ignoring homeostatic targets and collapsing into single-objective maximisation. These failures systematically resemble runaway optimisers.</description>
    </item>

    <item>
      <title>Presentation at Machine Ethics and Reasoning Workshop — Simulating value collapse in LLMs</title>
      <link>https://docs.google.com/presentation/d/1wB2WfSl9-ahfk7NSj1kWafiitaRrpXplxO9LLjw84XU/edit?usp=sharing</link>
      <guid isPermaLink="false">threelaws-2025-07-mere</guid>
      <pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate>
      <description>Presentation at the Machine Ethics and Reasoning Workshop, University of Connecticut, on simulating value collapse in LLMs.</description>
    </item>

    <item>
      <title>Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs</title>
      <link>https://www.lesswrong.com/posts/Jo6LPyp7t3rPuf8Ao/black-box-interpretability-methodology-blueprint-probing</link>
      <guid isPermaLink="false">threelaws-2025-06-blackbox-interpretability</guid>
      <pubDate>Sun, 22 Jun 2025 00:00:00 +0000</pubDate>
      <description>A methodology brainstorming document for identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation on biologically and economically aligned benchmarks.</description>
    </item>

    <item>
      <title>Presentation at MAISU unconference — BioBlue: Notable runaway-optimiser-like LLM failure modes</title>
      <link>https://www.youtube.com/watch?v=4I5mDiujBJs</link>
      <guid isPermaLink="false">threelaws-2025-04-maisu-bioblue</guid>
      <pubDate>Sun, 20 Apr 2025 00:00:00 +0000</pubDate>
      <description>Presentation at the MAISU unconference on notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format.</description>
    </item>

    <item>
      <title>Presentation at MAISU unconference — Building Benchmarks for Universal Values [AISC 10]</title>
      <link>https://www.youtube.com/watch?v=HabbyHTyKKk</link>
      <guid isPermaLink="false">threelaws-2025-04-maisu-aisc</guid>
      <pubDate>Sun, 20 Apr 2025 00:00:00 +0000</pubDate>
      <description>Presentation at the MAISU unconference on Building Benchmarks for Universal Values [AISC 10].</description>
    </item>

    <item>
      <title>Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks</title>
      <link>https://www.lesswrong.com/posts/PejNckwQj3A2MGhMA/systematic-runaway-optimiser-like-llm-failure-modes-on</link>
      <guid isPermaLink="false">threelaws-2025-03-runaway-llm</guid>
      <pubDate>Mon, 17 Mar 2025 00:00:00 +0000</pubDate>
      <description>We verified that RL runaway optimisation problems are still relevant for LLMs. LLMs lose context in specific ways that systematically resemble runaway optimisers: ignoring homeostatic targets and defaulting to unbounded, single-objective maximisation. Once they flip, they do not recover.</description>
    </item>

    <item>
      <title>Baseline experimental results with an LLM agent and Stable Baselines 3 RL algorithms on our Extended Gridworlds</title>
      <link>https://arxiv.org/abs/2410.00081</link>
      <guid isPermaLink="false">threelaws-2025-02-baseline-results</guid>
      <pubDate>Tue, 25 Feb 2025 00:00:00 +0000</pubDate>
      <description>We implemented an LLM agent for our extended multi-objective multi-agent gridworlds environment. The LLM agent performed notably better than the RL algorithms on resource sharing, yet all algorithms struggled with multi-objective homeostasis and diminishing returns.</description>
    </item>

    <item>
      <title>BioBlue: Biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format</title>
      <link>https://github.com/levitation-opensource/bioblue</link>
      <guid isPermaLink="false">threelaws-2025-02-bioblue-hackathon</guid>
      <pubDate>Sat, 01 Feb 2025 00:00:00 +0000</pubDate>
      <description>Hackathon project evaluating LLM alignment in scenarios inspired by biological and economic principles. The tested language models failed in most scenarios; only the single-objective homeostasis scenario succeeded, with rare hiccups.</description>
    </item>

    <item>
      <title>Why modelling multi-objective homeostasis is essential for AI alignment</title>
      <link>https://www.lesswrong.com/posts/vGeuBKQ7nzPnn5f7A/why-modelling-multi-objective-homeostasis-is-essential-for</link>
      <guid isPermaLink="false">threelaws-2025-01-homeostasis</guid>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <description>An explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment than unbounded maximisation. Homeostatic goals are bounded, reducing the incentive for extreme behaviours. Shifting from "maximise forever" to "maintain a healthy equilibrium" is a crucial part of the solution space.</description>
    </item>

    <item>
      <title>Presentation at Foresight Institute's Intelligent Cooperation Group — Introducing biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks</title>
      <link>https://www.youtube.com/watch?v=DCUqqyyhcko</link>
      <guid isPermaLink="false">threelaws-2024-11-foresight</guid>
      <pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate>
      <description>Presentation on why we should consider fundamental yet neglected principles from biology and economics when thinking about AI alignment, introducing our multi-objective multi-agent gridworlds-based benchmark environments.</description>
    </item>

    <item>
      <title>AI Safety Camp project proposals — Universal Values, Risk Aversion vs Prospect Theory, and Proactive AI Safety</title>
      <link>https://docs.google.com/document/d/1lg9C7FznXR908U30hZ_KkSh6na8U515z_jgjeVwZsFY/edit</link>
      <guid isPermaLink="false">threelaws-2024-11-aisafety-camp</guid>
      <pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate>
      <description>Three project proposals for AI Safety Camp: universal human values benchmarks, risk aversion vs prospect theory framework, and proactive side-effect detection agents.</description>
    </item>

    <item>
      <title>Working paper — From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks</title>
      <link>https://arxiv.org/abs/2410.00081</link>
      <guid isPermaLink="false">threelaws-2024-09-working-paper</guid>
      <pubDate>Mon, 30 Sep 2024 00:00:00 +0000</pubDate>
      <description>Working paper introducing biologically and economically motivated AI safety benchmarks emphasizing homeostasis, diminishing returns, sustainability, and resource sharing. Eight main benchmark environments have been implemented.</description>
    </item>

    <item>
      <title>VAISU 2024 — AI safety benchmarking in multi-objective multi-agent gridworlds</title>
      <link>https://www.youtube.com/watch?v=ydxMlGlQeco</link>
      <guid isPermaLink="false">threelaws-2024-05-vaisu</guid>
      <pubDate>Wed, 01 May 2024 00:00:00 +0000</pubDate>
      <description>Demo and feedback session at the VAISU unconference on biologically essential yet neglected themes illustrating the weaknesses of current approaches to reinforcement learning.</description>
    </item>

    <item>
      <title>AI safety benchmarking — Open-source test suite for multi-objective, multi-agent scenarios</title>
      <link>https://github.com/biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks</link>
      <guid isPermaLink="false">threelaws-2024-03-benchmarks-launch</guid>
      <pubDate>Fri, 01 Mar 2024 00:00:00 +0000</pubDate>
      <description>Publishing a benchmarking test suite for AI safety and alignment with a focus on multi-objective, multi-agent, cooperative scenarios using gridworlds with PettingZoo support.</description>
    </item>

    <item>
      <title>AI safety benchmarking — "The Firemaker": A proactive multi-agent side effects handling benchmark</title>
      <link>https://github.com/biological-alignment-benchmarks/ai-safety-gridworlds/blob/master/The%20Firemaker%20-%20A%20multi-agent%20safety%20hackathon%20submission.pdf</link>
      <guid isPermaLink="false">threelaws-2023-10-the-firemaker</guid>
      <pubDate>Tue, 31 Oct 2023 00:00:00 +0000</pubDate>
      <description>Publishing a benchmark representing the need for an agent to actively seek out side effects in a buffer zone in order to spot them before it is too late.</description>
    </item>

    <item>
      <title>Paper in the Autonomous Agents and Multi-Agent Systems journal — Using soft maximin for risk averse multi-objective decision-making</title>
      <link>https://link.springer.com/article/10.1007/s10458-022-09586-2</link>
      <guid isPermaLink="false">threelaws-2022-12-soft-maximin-paper</guid>
      <pubDate>Wed, 21 Dec 2022 00:00:00 +0000</pubDate>
      <description>Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises not only from misalignment between individuals with competing values, but also from conflicting value systems held by a single human. Starting with the principle of loss-aversion, we designed a set of soft maximin function approaches to multi-objective decision-making.</description>
    </item>

    <item>
      <title>Sets of objectives for a multi-objective RL agent to optimize</title>
      <link>https://www.lesswrong.com/posts/4mvdZXjwJHv9tSAWB/sets-of-objectives-for-a-multi-objective-rl-agent-to-1</link>
      <guid isPermaLink="false">threelaws-2022-11-multiobjective-rl</guid>
      <pubDate>Wed, 23 Nov 2022 00:00:00 +0000</pubDate>
      <description>Previously we've proposed balancing multiple objectives via multi-objective RL as a method to achieve AI Alignment. If we want an AI to achieve goals such as maximizing human preferences or human values, while also maximizing corrigibility, interpretability, and so on, then perhaps the key is simply to build a system with a goal to maximize all of those things.</description>
    </item>

    <item>
      <title>A brief review of the reasons multi-objective RL could be important in AI Safety Research</title>
      <link>https://www.lesswrong.com/posts/i5dLfi6m6FCexReK9/a-brief-review-of-the-reasons-multi-objective-rl-could-be</link>
      <guid isPermaLink="false">threelaws-2021-09-multiobjective-rl</guid>
      <pubDate>Wed, 29 Sep 2021 00:00:00 +0000</pubDate>
      <description>For the last 9 months, we have been investigating the case for a multi-objective approach to reinforcement learning in AI Safety. Based on our work so far, we’re moderately convinced that multi-objective reinforcement learning should be explored as a useful way to understand how we can achieve safe superintelligence. We’re writing this post to explain why, to inform readers of the work we and our colleagues are doing in this area, and to invite critical feedback about our approach and about multi-objective RL in general.</description>
    </item>

    <item>
      <title>Model structure and useful invariants for combining pluralistic positive and negative consequentialism in parametric ML (while avoiding trivial pathologies / degenerate states)</title>
      <link>https://docs.google.com/document/d/15xDPMHKk5dD-83IeDurx-pVXfWxiYf7y3D53VB0tgyM/edit?tab=t.0</link>
      <guid isPermaLink="false">threelaws-2020-04-pluralism</guid>
      <pubDate>Fri, 22 May 2020 00:00:00 +0000</pubDate>
      <description>How to represent goal systems with multiple values in order to reduce Goodhart-like behaviour and specification gaming problems. Among other subtopics, this includes combining multiple positive utility maximisation goals with multiple negative utility minimisation goals, in such a way that all of these goals of an AI still receive the desired, relatively coherent and equal treatment. The negative utility minimisation part is useful for task-based/low-impact aspects, but also for whitelisting, explainability, and human accountability aspects.</description>
    </item>

    <item>
      <title>What happens when autonomous robots are not regulated or, on the contrary, qualify as subjects of law?</title>
      <link>https://medium.com/threelaws/what-happens-when-autonomous-robots-are-not-regulated-or-on-the-contrary-qualify-as-subjects-of-9819c33d70d</link>
      <guid isPermaLink="false">threelaws-2019-02-autonomous-agents-regulation</guid>
      <pubDate>Wed, 06 Feb 2019 00:00:00 +0000</pubDate>
      <description>Here I present one set of possible introductory questions to be considered when dealing with the issue of the liability of autonomous agents, followed by my analysis of the subject. On top of that, I scrutinise the suggestion, made by some, that autonomous agents should be made subjects of law.</description>
    </item>

    <item>
      <title>What can happen when we don’t have a clue why a somewhat autonomous gadget does what it does — the Gatwick Airport drone incident</title>
      <link>https://medium.com/threelaws/what-can-happen-when-we-dont-have-a-clue-why-a-somewhat-autonomous-gadget-does-what-it-does-338b33c5eaeb</link>
      <guid isPermaLink="false">threelaws-2019-01-autonomous-agents-accountability</guid>
      <pubDate>Thu, 31 Jan 2019 00:00:00 +0000</pubDate>
      <description>All in all, this story illustrates that, considered in a broader sense, the problem of identifying the owners of autonomous devices, or even of drones, is no longer resolvable with robust methods.</description>
    </item>

    <item>
      <title>Project — Legal accountability in AI-based robot-agents’ user interfaces</title>
      <link>https://medium.com/threelaws/project-legal-accountability-in-ai-based-robot-agents-user-interfaces-10b74a7f74ed</link>
      <guid isPermaLink="false">threelaws-2018-11-project-legal-accountability</guid>
      <pubDate>Fri, 02 Nov 2018 00:00:00 +0000</pubDate>
      <description>How can autonomous or self-learning AI provide ex-ante and ex-post controls? Using an ML system does not mean that it cannot be constrained by an additional layer of rules-based safety and accountability mechanisms. The behaviour of these constraints can then be explained, thus making the robot-agents both legally and technically robust and reliable.</description>
    </item>

    <item>
      <title>Project proposal: Corrigibility and interruptibility of homeostasis-based agents</title>
      <link>https://medium.com/threelaws/project-proposal-corrigibility-and-interruptibility-of-homeostasis-based-agents-e51bafbf7111</link>
      <guid isPermaLink="false">threelaws-2018-10-diminishing-returns</guid>
      <pubDate>Thu, 18 Oct 2018 00:00:00 +0000</pubDate>
      <description>Some of the motivations for solving the problem are: 1) The expected use case properties of the agents: low impact, task-based, soft optimisation / satisficing. 2) Safely getting human feedback on the agent’s behaviour and changing the agent’s goals without the agent trying to manipulate the human’s response too much (reasonable resistance may be permitted). 3) Defining a mitigation against Goodhart’s law. In other words, enabling “common sense” and avoiding a single-dimensional measure of success.</description>
    </item>

    <item>
      <title>Diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility via the golden middle way.</title>
      <link>https://medium.com/threelaws/diminishing-returns-and-conjunctive-goals-towards-corrigibility-and-interruptibility-2ec594fed75c</link>
      <guid isPermaLink="false">threelaws-2018-10-diminishing-returns</guid>
      <pubDate>Fri, 12 Oct 2018 00:00:00 +0000</pubDate>
      <description>Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and overly general approach of naive maximisation strategies. The formula provides a framework for specifying how we want the agents to simultaneously fulfil, or at least trade off between, the many different common sense considerations, possibly enabling them to even surpass the relative safety of humans.</description>
    </item>

    <item>
      <title>Making the tax burden of robot usage equal to the tax burden of human labour</title>
      <link>https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd</link>
      <guid isPermaLink="false">threelaws-2018-05-robot-taxes</guid>
      <pubDate>Fri, 18 May 2018 00:00:00 +0000</pubDate>
      <description>There have been proposals to introduce robot taxes. I would propose something slightly different as a potentially much better alternative: instead of introducing “robot taxes”, we need to eradicate the “human taxes”. Otherwise, essentially all kinds of automation are heavily tax-subsidised by governments.</description>
    </item>

    <item>
      <title>Project for popularisation of AI safety topics through competitions and gamification</title>
      <link>https://medium.com/threelaws/proposal-for-executable-and-interactive-simulations-of-ai-safety-failure-scenarios-7acab7015be4</link>
      <guid isPermaLink="false">threelaws-2018-02-gamification</guid>
      <pubDate>Wed, 28 Feb 2018 00:00:00 +0000</pubDate>
      <description>AI safety is a small field with only about 50 researchers, and it is mostly talent-constrained. How can we motivate and involve more people in AI safety research? How can we speed up learning? Beyond that, how can we spread interest in and understanding of AI safety topics among the general public, who will be directly or indirectly voting on these issues? Could competitions and gamification help?</description>
    </item>

    <item>
      <title>Organisations as an old form of artificial general intelligence</title>
      <link>https://medium.com/threelaws/organisations-as-an-old-form-of-artificial-general-intelligence-f30c27638f50</link>
      <guid isPermaLink="false">threelaws-2018-02-organisations</guid>
      <pubDate>Thu, 22 Feb 2018 00:00:00 +0000</pubDate>
      <description>In my view, organisations already are an old form of Artificial General Intelligence. They are relatively autonomous from the humans working inside them. No single person can perceive, fathom, or change much of what goes on in there. We humans are just cogs in there, human processors for artificially intelligent software. The organisations have a kind of mind and goals of their own: their own laws of survival.</description>
    </item>

    <item>
      <title>Making AI less dangerous: Using homeostasis-based goal structures</title>
      <link>https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd</link>
      <guid isPermaLink="false">threelaws-2017-12-homeostasis</guid>
      <pubDate>Sun, 31 Dec 2017 00:00:00 +0000</pubDate>
      <description>Can you really understand what is necessary, without understanding what is excessive? I would like to propose a certain kind of AI goal structure as an alternative to utility-maximisation-based goal structures. The proposed alternative framework would make AI significantly safer, though it would not guarantee total safety. It can be used at the strong-AI level and also far below it, so it scales well. The main idea is to replace utility maximisation with the concept of homeostasis.</description>
    </item>

    <item>
      <title>Permissions-then-goals based AI user “interfaces” &amp; legal accountability: Implementing a framework of safe robot planning</title>
      <link>https://medium.com/threelaws/implementing-a-framework-of-safe-robot-planning-43636efe7dd8</link>
      <guid isPermaLink="false">threelaws-2017-12-safe-robot-planning</guid>
      <pubDate>Sat, 30 Dec 2017 00:00:00 +0000</pubDate>
      <description>This text introduces a preliminary study for implementing a framework of safe robot planning. A principle of safety is introduced which does not depend on explicitly enumerating all possible “negative” states, but which at the same time also does not depend on the robot doing only precisely “what it is told to do”. The proposed principle of safety is based on implicit avoidance of irreversible actions, except in explicitly permitted cases.</description>
    </item>

    <item>
      <title>Self-deception and negligence: Fundamental limits to computation due to limitations of attention-like processes (Definition of self-deception in the context of AI safety)</title>
      <link>https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7</link>
      <guid isPermaLink="false">threelaws-2017-10-self-deception</guid>
      <pubDate>Sat, 21 Oct 2017 00:00:00 +0000</pubDate>
      <description>The main point is that the danger is not somewhere far away, requiring some very advanced AI; rather, it is more like a law of nature that starts manifesting already in rather simple systems, without any need for self-reflection or self-modification capabilities. So instead of the notion that danger springs from some special capabilities of intelligent systems, I want to point out that some other special capabilities of intelligent systems would be needed to somehow evade the danger.</description>
    </item>

    <item>
      <title>Permissions-then-goals based AI user interfaces and legal accountability: First law of robotics and a possible definition of robot safety</title>
      <link>https://medium.com/threelaws/first-law-of-robotics-and-a-possible-definition-of-robot-safety-419bc41a1ffe</link>
      <guid isPermaLink="false">threelaws-2017-10-permissions-then-goals</guid>
      <pubDate>Sat, 21 Oct 2017 00:00:00 +0000</pubDate>
      <description>The principles are based mainly on the idea of competence-based whitelisting and on preserving reversibility (keeping future options open) as the primary goal of the AI, while all task-based goals are secondary. This provides a human-manageable user interface for goal structures and can make use of the concepts of reversibility and irreversibility. It is similar to the competence-based permissions of public sector officials. Legal aspect: it enables accountability mechanisms.</description>
    </item>

  </channel>
</rss>
