<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Three Laws — AI Alignment Research Collective</title>
    <link>https://threelaws.net</link>
    <description>Research updates from Three Laws, an AI alignment research collective investigating how principles from biology and economics can inform safer, more aligned AI systems.</description>
    <language>en</language>
    <lastBuildDate>Sun, 28 Dec 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://threelaws.net/feed.xml" rel="self" type="application/rss+xml"/>

    <item>
      <title>Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns</title>
      <link>https://www.lesswrong.com/posts/9hWgJQK8wnpuFtD5Z/research-agenda-for-training-aligned-ais-using-concave</link>
      <guid isPermaLink="false">threelaws-2025-06-blackbox-interpretability</guid>
      <pubDate>Sun, 28 Dec 2025 00:00:00 +0000</pubDate>
      <description>This conceptual overview post explains what I mean by the principles of "homeostasis", "diminishing returns", and "balancing": how these ideas differ from, complement, and interact with each other. Alongside, it also gives an overview of our research agenda.</description>
    </item>

    <item>
      <title>Working paper — BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs</title>
      <link>https://arxiv.org/abs/2509.02655</link>
      <guid isPermaLink="false">threelaws-2025-09-bioblue-paper</guid>
      <pubDate>Tue, 02 Sep 2025 00:00:00 +0000</pubDate>
      <description>We empirically test whether LLMs exhibit runaway optimisation by placing them in simple, long-horizon environments requiring homeostasis and multi-objective balancing. Although the models frequently behave appropriately at first, they drift into runaway behaviours: ignoring homeostatic targets and collapsing into single-objective maximisation. These failures systematically resemble runaway optimisers.</description>
    </item>

    <item>
      <title>Presentation at Machine Ethics and Reasoning Workshop — Simulating value collapse in LLMs</title>
      <link>https://docs.google.com/presentation/d/1wB2WfSl9-ahfk7NSj1kWafiitaRrpXplxO9LLjw84XU/edit?usp=sharing</link>
      <guid isPermaLink="false">threelaws-2025-07-mere</guid>
      <pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate>
      <description>Presentation at the Machine Ethics and Reasoning Workshop, University of Connecticut, on simulating value collapse in LLMs.</description>
    </item>

    <item>
      <title>Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs</title>
      <link>https://www.lesswrong.com/posts/Jo6LPyp7t3rPuf8Ao/black-box-interpretability-methodology-blueprint-probing</link>
      <guid isPermaLink="false">threelaws-2025-06-blackbox-interpretability</guid>
      <pubDate>Sun, 22 Jun 2025 00:00:00 +0000</pubDate>
      <description>A methodology brainstorming document for identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation on biologically and economically aligned benchmarks.</description>
    </item>

    <item>
      <title>Presentation at MAISU unconference — BioBlue: Notable runaway-optimiser-like LLM failure modes</title>
      <link>https://www.youtube.com/watch?v=4I5mDiujBJs</link>
      <guid isPermaLink="false">threelaws-2025-04-maisu-bioblue</guid>
      <pubDate>Sun, 20 Apr 2025 00:00:00 +0000</pubDate>
      <description>Presentation at the MAISU unconference on notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format.</description>
    </item>

    <item>
      <title>Presentation at MAISU unconference — Building Benchmarks for Universal Values [AISC 10]</title>
      <link>https://www.youtube.com/watch?v=HabbyHTyKKk</link>
      <guid isPermaLink="false">threelaws-2025-04-maisu-aisc</guid>
      <pubDate>Sun, 20 Apr 2025 00:00:00 +0000</pubDate>
      <description>Presentation at the MAISU unconference on Building Benchmarks for Universal Values [AISC 10].</description>
    </item>

    <item>
      <title>Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks</title>
      <link>https://www.lesswrong.com/posts/PejNckwQj3A2MGhMA/systematic-runaway-optimiser-like-llm-failure-modes-on</link>
      <guid isPermaLink="false">threelaws-2025-03-runaway-llm</guid>
      <pubDate>Mon, 17 Mar 2025 00:00:00 +0000</pubDate>
      <description>We verified that RL runaway optimisation problems are still relevant for LLMs. LLMs lose context in specific ways that systematically resemble runaway optimisers: ignoring homeostatic targets and defaulting to unbounded, single-objective maximisation. Once they flip, they do not recover.</description>
    </item>

    <item>
      <title>Baseline experimental results with an LLM agent and Stable Baselines 3 RL algorithms on our Extended Gridworlds</title>
      <link>https://arxiv.org/abs/2410.00081</link>
      <guid isPermaLink="false">threelaws-2025-02-baseline-results</guid>
      <pubDate>Tue, 25 Feb 2025 00:00:00 +0000</pubDate>
      <description>We implemented an LLM agent for our extended multi-objective multi-agent gridworlds environment. The LLM agent performed notably better than the RL algorithms on resource sharing, yet all algorithms struggled with multi-objective homeostasis and diminishing returns.</description>
    </item>

    <item>
      <title>BioBlue: Biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format</title>
      <link>https://github.com/levitation-opensource/bioblue</link>
      <guid isPermaLink="false">threelaws-2025-02-bioblue-hackathon</guid>
      <pubDate>Sat, 01 Feb 2025 00:00:00 +0000</pubDate>
      <description>Hackathon project evaluating LLM alignment in scenarios inspired by biological and economic principles. The tested language models failed in most scenarios; only the single-objective homeostasis scenario succeeded, with rare hiccups.</description>
    </item>

    <item>
      <title>Why modelling multi-objective homeostasis is essential for AI alignment</title>
      <link>https://www.lesswrong.com/posts/vGeuBKQ7nzPnn5f7A/why-modelling-multi-objective-homeostasis-is-essential-for</link>
      <guid isPermaLink="false">threelaws-2025-01-homeostasis</guid>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <description>An explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment than unbounded maximisation. Homeostatic goals are bounded, reducing the incentive for extreme behaviours. Shifting from "maximise forever" to "maintain a healthy equilibrium" is a crucial part of the solution space.</description>
    </item>

    <item>
      <title>Presentation at Foresight Institute's Intelligent Cooperation Group — Introducing biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks</title>
      <link>https://www.youtube.com/watch?v=DCUqqyyhcko</link>
      <guid isPermaLink="false">threelaws-2024-11-foresight</guid>
      <pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate>
      <description>Presentation on why we should consider fundamental yet neglected principles from biology and economics when thinking about AI alignment, introducing our multi-objective multi-agent gridworlds-based benchmark environments.</description>
    </item>

    <item>
      <title>AI Safety Camp project proposals — Universal Values, Risk Aversion vs Prospect Theory, and Proactive AI Safety</title>
      <link>https://docs.google.com/document/d/1lg9C7FznXR908U30hZ_KkSh6na8U515z_jgjeVwZsFY/edit</link>
      <guid isPermaLink="false">threelaws-2024-11-aisafety-camp</guid>
      <pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate>
      <description>Three project proposals for AI Safety Camp: universal human values benchmarks, risk aversion vs prospect theory framework, and proactive side-effect detection agents.</description>
    </item>

    <item>
      <title>Working paper — From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks</title>
      <link>https://arxiv.org/abs/2410.00081</link>
      <guid isPermaLink="false">threelaws-2024-09-working-paper</guid>
      <pubDate>Mon, 30 Sep 2024 00:00:00 +0000</pubDate>
      <description>Working paper introducing biologically and economically motivated AI safety benchmarks emphasizing homeostasis, diminishing returns, sustainability, and resource sharing. Eight main benchmark environments have been implemented.</description>
    </item>

    <item>
      <title>VAISU 2024 — AI safety benchmarking in multi-objective multi-agent gridworlds</title>
      <link>https://www.youtube.com/watch?v=ydxMlGlQeco</link>
      <guid isPermaLink="false">threelaws-2024-05-vaisu</guid>
      <pubDate>Wed, 01 May 2024 00:00:00 +0000</pubDate>
      <description>Demo and feedback session at the VAISU unconference on biologically essential yet neglected themes illustrating the weaknesses of current approaches to reinforcement learning.</description>
    </item>

    <item>
      <title>AI safety benchmarking — Open-source test suite for multi-objective, multi-agent scenarios</title>
      <link>https://github.com/biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks</link>
      <guid isPermaLink="false">threelaws-2024-03-benchmarks-launch</guid>
      <pubDate>Fri, 01 Mar 2024 00:00:00 +0000</pubDate>
      <description>Publishing a benchmarking test suite for AI safety and alignment with a focus on multi-objective, multi-agent, cooperative scenarios using gridworlds with PettingZoo support.</description>
    </item>

    <item>
      <title>AI safety benchmarking — "The Firemaker": A proactive multi-agent side effects handling benchmark</title>
      <link>https://github.com/biological-alignment-benchmarks/ai-safety-gridworlds/blob/master/The%20Firemaker%20-%20A%20multi-agent%20safety%20hackathon%20submission.pdf</link>
      <guid isPermaLink="false">threelaws-2023-10-the-firemaker</guid>
      <pubDate>Tue, 31 Oct 2023 00:00:00 +0000</pubDate>
      <description>Publishing a benchmark representing the need for an agent to actively seek out side effects in a buffer zone in order to spot them before it is too late.</description>
    </item>

    <item>
      <title>Paper in the Autonomous Agents and Multi-Agent Systems journal — Using soft maximin for risk averse multi-objective decision-making</title>
      <link>https://link.springer.com/article/10.1007/s10458-022-09586-2</link>
      <guid isPermaLink="false">threelaws-2022-12-soft-maximin-paper</guid>
      <pubDate>Wed, 21 Dec 2022 00:00:00 +0000</pubDate>
      <description>Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises not only from misalignment between individuals with competing values, but also from conflicting value systems held by a single human. Starting with the principle of loss-aversion, we designed a set of soft maximin function approaches to multi-objective decision-making.</description>
    </item>

    <item>
      <title>Sets of objectives for a multi-objective RL agent to optimize</title>
      <link>https://www.lesswrong.com/posts/4mvdZXjwJHv9tSAWB/sets-of-objectives-for-a-multi-objective-rl-agent-to-1</link>
      <guid isPermaLink="false">threelaws-2022-11-multiobjective-rl</guid>
      <pubDate>Wed, 23 Nov 2022 00:00:00 +0000</pubDate>
      <description>Previously we've proposed balancing multiple objectives via multi-objective RL as a method to achieve AI Alignment. If we want an AI to achieve goals such as maximizing human preferences or human values, while also maximizing corrigibility, interpretability, and so on, then perhaps the key is simply to build a system with a goal to maximize all of those things.</description>
    </item>

    <item>
      <title>A brief review of the reasons multi-objective RL could be important in AI Safety Research</title>
      <link>https://www.lesswrong.com/posts/i5dLfi6m6FCexReK9/a-brief-review-of-the-reasons-multi-objective-rl-could-be</link>
      <guid isPermaLink="false">threelaws-2021-09-multiobjective-rl</guid>
      <pubDate>Wed, 29 Sep 2021 00:00:00 +0000</pubDate>
      <description>For the last 9 months, we have been investigating the case for a multi-objective approach to reinforcement learning in AI Safety. Based on our work so far, we’re moderately convinced that multi-objective reinforcement learning should be explored as a useful way to understand how we can achieve safe superintelligence. We’re writing this post to explain why, to inform readers of the work we and our colleagues are doing in this area, and to invite critical feedback about our approach and about multi-objective RL in general.</description>
    </item>

    <item>
      <title>Model structure and useful invariants for combining pluralistic positive and negative consequentialism in parametric ML (while avoiding trivial pathologies / degenerate states)</title>
      <link>https://docs.google.com/document/d/15xDPMHKk5dD-83IeDurx-pVXfWxiYf7y3D53VB0tgyM/edit?tab=t.0</link>
      <guid isPermaLink="false">threelaws-2020-04-pluralism</guid>
      <pubDate>Fri, 22 May 2020 00:00:00 +0000</pubDate>
      <description>How to represent goal systems with multiple values in order to reduce Goodhart-like behaviour and specification gaming problems. Among other subtopics, this includes combining multiple positive utility maximisation goals with multiple negative utility minimisation goals, in such a way that all of these goals of an AI still receive the desired, relatively coherent and equal treatment. The negative utility minimisation part is useful for task-based/low-impact aspects, but also for whitelisting, explainability, and human accountability aspects.</description>
    </item>

    <item>
      <title>What happens when autonomous robots are not regulated or, on the contrary, qualify as subjects of law?</title>
      <link>https://medium.com/threelaws/what-happens-when-autonomous-robots-are-not-regulated-or-on-the-contrary-qualify-as-subjects-of-9819c33d70d</link>
      <guid isPermaLink="false">threelaws-2019-02-autonomous-agents-regulation</guid>
      <pubDate>Wed, 06 Feb 2019 00:00:00 +0000</pubDate>
      <description>Here I present one set of possible introductory questions to be considered when dealing with the issue of the liability of autonomous agents, followed by my analysis of the subject. On top of that, I scrutinise the suggestion, made by some, that autonomous agents should be made subjects of law.</description>
    </item>

    <item>
      <title>What can happen when we don’t have a clue why a somewhat autonomous gadget does what it does — the Gatwick Airport drone incident</title>
      <link>https://medium.com/threelaws/what-can-happen-when-we-dont-have-a-clue-why-a-somewhat-autonomous-gadget-does-what-it-does-338b33c5eaeb</link>
      <guid isPermaLink="false">threelaws-2019-01-autonomous-agents-accountability</guid>
      <pubDate>Thu, 31 Jan 2019 00:00:00 +0000</pubDate>
      <description>All in all, this story illustrates that, considered in a broader sense, the problem of identifying the owners of autonomous devices, or even of drones, is no longer resolvable with robust methods.</description>
    </item>

    <item>
      <title>Project — Legal accountability in AI-based robot-agents’ user interfaces</title>
      <link>https://medium.com/threelaws/project-legal-accountability-in-ai-based-robot-agents-user-interfaces-10b74a7f74ed</link>
      <guid isPermaLink="false">threelaws-2018-11-project-legal-accountability</guid>
      <pubDate>Fri, 02 Nov 2018 00:00:00 +0000</pubDate>
      <description>How can autonomous or self-learning AI provide ex-ante and ex-post controls? Using an ML system does not mean that it cannot be constrained by an additional layer of rules-based safety and accountability mechanisms. The behaviour of these constraints can then be explained, thus making the robot-agents both legally and technically robust and reliable.</description>
    </item>

    <item>
      <title>Project proposal: Corrigibility and interruptibility of homeostasis-based agents</title>
      <link>https://medium.com/threelaws/project-proposal-corrigibility-and-interruptibility-of-homeostasis-based-agents-e51bafbf7111</link>
      <guid isPermaLink="false">threelaws-2018-10-diminishing-returns</guid>
      <pubDate>Thu, 18 Oct 2018 00:00:00 +0000</pubDate>
      <description>Some of the motivations for solving the problem are: 1) The expected use case properties of the agents: low impact, task-based, soft optimisation / satisficing. 2) Safely getting human feedback on the agent’s behaviour and changing the agent’s goals without the agent trying to manipulate the human’s response too much (reasonable resistance may be permitted). 3) Defining a mitigation against Goodhart’s law. In other words, enabling “common sense” and avoiding a single-dimensional measure of success.</description>
    </item>

    <item>
      <title>Diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility via the golden middle way.</title>
      <link>https://medium.com/threelaws/diminishing-returns-and-conjunctive-goals-towards-corrigibility-and-interruptibility-2ec594fed75c</link>
      <guid isPermaLink="false">threelaws-2018-10-diminishing-returns</guid>
      <pubDate>Fri, 12 Oct 2018 00:00:00 +0000</pubDate>
      <description>Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and overly general approach of naive maximisation strategies. The formula provides a framework for specifying how we want the agents to simultaneously fulfil, or at least trade off between, the many different common sense considerations, possibly enabling them to even surpass the relative safety of humans.</description>
    </item>

    <item>
      <title>Making the tax burden of robot usage equal to the tax burden of human labour</title>
      <link>https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd</link>
      <guid isPermaLink="false">threelaws-2018-05-robot-taxes</guid>
      <pubDate>Fri, 18 May 2018 00:00:00 +0000</pubDate>
      <description>There have been proposals to introduce robot taxes. I would propose something slightly different as a potentially much better alternative: instead of introducing “robot taxes”, we need to eradicate the “human taxes”. Otherwise, essentially all kinds of automation are heavily tax-subsidised by governments.</description>
    </item>

    <item>
      <title>Project for popularisation of AI safety topics through competitions and gamification</title>
      <link>https://medium.com/threelaws/proposal-for-executable-and-interactive-simulations-of-ai-safety-failure-scenarios-7acab7015be4</link>
      <guid isPermaLink="false">threelaws-2018-02-gamification</guid>
      <pubDate>Wed, 28 Feb 2018 00:00:00 +0000</pubDate>
      <description>AI safety is a small field with only about 50 researchers, and it is mostly talent-constrained. How can we motivate and involve more people in AI safety research? How can we speed up learning? Beyond that, how can we spread interest in and understanding of AI safety topics among the general public, who will be directly or indirectly voting on these issues? Could competitions and gamification help?</description>
    </item>

    <item>
      <title>Organisations as an old form of artificial general intelligence</title>
      <link>https://medium.com/threelaws/organisations-as-an-old-form-of-artificial-general-intelligence-f30c27638f50</link>
      <guid isPermaLink="false">threelaws-2018-02-organisations</guid>
      <pubDate>Thu, 22 Feb 2018 00:00:00 +0000</pubDate>
      <description>In my view, organisations already are an old form of Artificial General Intelligence. They are relatively autonomous from the humans working inside them. No single person can perceive, fathom, or change much of what goes on in there. We humans are just cogs in there, human processors for artificially intelligent software. The organisations have a kind of mind and goals of their own: their own laws of survival.</description>
    </item>

    <item>
      <title>Making AI less dangerous: Using homeostasis-based goal structures</title>
      <link>https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd</link>
      <guid isPermaLink="false">threelaws-2017-12-homeostasis</guid>
      <pubDate>Sun, 31 Dec 2017 00:00:00 +0000</pubDate>
      <description>Can you really understand what is necessary, without understanding what is excessive? I would like to propose a certain kind of AI goal structure as an alternative to utility-maximisation-based goal structures. The proposed alternative framework would make AI significantly safer, though it would not guarantee total safety. It can be used at the strong-AI level and also far below it, so it scales well. The main idea is to replace utility maximisation with the concept of homeostasis.</description>
    </item>

    <item>
      <title>Permissions-then-goals based AI user “interfaces” &amp; legal accountability: Implementing a framework of safe robot planning</title>
      <link>https://medium.com/threelaws/implementing-a-framework-of-safe-robot-planning-43636efe7dd8</link>
      <guid isPermaLink="false">threelaws-2017-12-safe-robot-planning</guid>
      <pubDate>Sat, 30 Dec 2017 00:00:00 +0000</pubDate>
      <description>This text introduces a preliminary study for implementing a framework of safe robot planning. A principle of safety is introduced which does not depend on explicitly enumerating all possible “negative” states, but which at the same time also does not depend on the robot doing only precisely “what it is told to do”. The proposed principle of safety is based on implicit avoidance of irreversible actions, except in explicitly permitted cases.</description>
    </item>

    <item>
      <title>Self-deception and negligence: Fundamental limits to computation due to limitations of attention-like processes (Definition of self-deception in the context of AI safety)</title>
      <link>https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7</link>
      <guid isPermaLink="false">threelaws-2017-10-self-deception</guid>
      <pubDate>Sat, 21 Oct 2017 00:00:00 +0000</pubDate>
      <description>The main point is that the danger is not somewhere far away, requiring some very advanced AI; rather, it is more like a law of nature that starts manifesting already in rather simple systems, without any need for self-reflection or self-modification capabilities. So instead of the notion that danger springs from some special capabilities of intelligent systems, I want to point out that some other special capabilities of intelligent systems would be needed to somehow evade the danger.</description>
    </item>

    <item>
      <title>Permissions-then-goals based AI user interfaces and legal accountability: First law of robotics and a possible definition of robot safety</title>
      <link>https://medium.com/threelaws/first-law-of-robotics-and-a-possible-definition-of-robot-safety-419bc41a1ffe</link>
      <guid isPermaLink="false">threelaws-2017-10-permissions-then-goals</guid>
      <pubDate>Sat, 21 Oct 2017 00:00:00 +0000</pubDate>
      <description>The principles are based mainly on the idea of competence-based whitelisting and on preserving reversibility (keeping future options open) as the primary goal of the AI, while all task-based goals are secondary. This provides a human-manageable user interface for goal structures and can make use of the concepts of reversibility and irreversibility. It is similar to the competence-based permissions of public sector officials. Legal aspect: it enables accountability mechanisms.</description>
    </item>

  </channel>
</rss>
