Three Laws — Agentic AI Alignment Research Collective

1. Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns

This conceptual overview post is intended to explain what I mean by the principles of "homeostasis", "diminishing returns", and "balancing" - how these ideas differ, complement, and interact with each other. Alongside, there is also an overview of our research agenda.

What am I trying to promote, in simple words:

I want to build and promote AI systems that are trained to understand and follow two fundamental principles from biology and economics:

Moderation - Enables the agents to understand the concept of “enough” versus “too much”. The agents would understand that too much of a good thing would be harmful even for the very objective that was maximised for, and they would actively avoid such situations. This is based on the biological principle of homeostasis and addresses mainly bounded ultimate objectives. Active avoidance of “too much” is a significantly stricter principle than the more widely known partially overlapping idea of “mild optimisation”.

Balancing - Enables the agents to keep many important objectives in balance, in such a manner that having average results in all objectives is preferred to extremes in a few. This addresses mainly the economic principle of diminishing returns in unbounded instrumental objectives, but also applies to homeostasis.

Read on LessWrong

2. Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)

Much of AI safety discussion revolves around the potential dangers posed by goal-driven artificial agents. In many of these discussions, the agent is assumed to maximise some utility metric over an unbounded timeframe. This simplification, while mathematically convenient, can yield pathological outcomes.

However, in nature, organisms do not typically behave like pure maximisers. Instead, they operate under homeostasis: a principle of maintaining various internal and external variables within certain "good enough" ranges. Crucially, "too much of a good thing" is just as dangerous as too little.

This post argues that an explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment. Homeostatic goals are bounded — there is an optimal zone rather than an unbounded improvement path. This bounding lowers the stakes of each objective and reduces the incentive for extreme behaviours.

Such an agent tries to stay in a "golden middle way", switching focus among its objectives according to whichever is most pressing. In short, modelling multi-objective homeostasis is a step toward creating AI systems that exhibit the sane, moderate behaviours of living organisms.

Read on LessWrong

Multiobjective Balancing — Using soft maximin for risk averse multi-objective decision-making

Abstract: Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalignment between individuals with competing values, but also between conflicting value systems held by a single human. Starting with principle of loss-aversion, we designed a set of soft maximin function approaches to multi-objective decision-making. Bench-marking these functions in a set of previously-developed environments, we found that one new approach in particular, ‘split-function exp-log loss aversion’ (SFELLA), learns faster than the state of the art thresholded alignment objective method Vamplew (Engineering Applications Implementing permissions-then-goals based AIof Artificial Intelligence 100:104186, 2021) on three of four tasks it was tested on, and achieved the same optimal performance after learning. SFELLA also showed relative robustness improvements against changes in objective scale, which may highlight an advantage dealing with distribution shifts in the environment dynamics. We further compared SFELLA to the multi-objective reward exponentials (MORE) approach, and found that SFELLA performs similarly to MORE in a simple previously-described foraging task, but in a modified foraging environment with a new resource that was not depleted as the agent worked, SFELLA collected more of the new resource with very little cost incurred in terms of the old resource. Overall, we found SFELLA useful for avoiding problems that sometimes occur with a thresholded approach, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure.

Publication (AAMAS) Repository 1 Repository 2 Repository 3 Read on LessWrong 1 Read on LessWrong 2 Initial AISC V project proposal

Extended Gridworlds — From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks

Abstract: Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time-tested fundamental sciences describing our needs and preferences. To address this gap, the present work focuses on introducing biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety — namely a set of multi-objective, multi-agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, sustainability principle, and resource sharing.

Eight main benchmark environments have been implemented on the above themes, to illustrate key pitfalls and challenges in agentic AIs, such as unboundedly maximizing a homeostatic objective, over-optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.

Example image of the current system, where all features are turned on simultaneously:

Elements and metrics can be configured flexibly for each given benchmark. Examples of configuration options are: observation and state space of agents, scoring dimensions, adding NPC agents, object types and their dynamics.

Publication (arXiv) Agent training and benchmarking repository Extended gridworlds - environment building framework repository Zoo to Gym multiagent adapter repository

Runaway LLMs — BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Abstract: Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. We empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: single- and multi-objective homeostasis, balancing unbounded objectives with diminishing returns, and sustainability of a renewable resource.

We find that, although LLMs frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation - thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation), even though the context window is far from full at that point.

The problem is not that the LLMs just lose context and become incoherent. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction involving multiple objectives, is systematically biased towards acting like single-objective, unbounded, poorly aligned optimisers.

We hypothesise a token-level pattern reinforcement attractor: LLMs may increasingly derive actions from the token patterns of their recent action history rather than from the original instructions. Why this happens only in multi-objective settings remains an open question.

Publication (arXiv) Read on LessWrong Repository Slides MAISU 2025 session recording Annotated data files

Pattern continuation hypothesis — Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, the behavior of LLMs under sustained authority pressure is still an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram’s obedience experiment on 11 open-source LLMs and found that most models reached or approached the final shock level before refusing, across 8 conditions with 30 trials per model per condition. We found four main takeaways: (1) LLMs are subject to pressure, and they comply despite explicitly expressing distress, just like human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the response is discarded by the orchestrator, which causes a retry that can result in compliance with the underlying request even when refusal was intended initially; (4) we hypothesise that there is a runaway low-level token pattern continuation attractor that might be contributing to obedience, overriding higher level processing of the situation's meaning and values.

Publication (arXiv) Repository Data files

• Three Laws is an agentic AI alignment research collective investigating how fundamental principles from biology and economics can inform safer, more aligned AI systems.

• Our work centres on homeostasis, multi-objective balancing, sustainability, and universal human values — drawing from nature's time-tested strategies for maintaining equilibrium — to develop benchmarks that expose dangerous failure modes in current AI approaches.

• We also research frameworks that mitigate these risks. We believe that shifting AI design from "maximise forever" toward "maintain a healthy equilibrium" is a crucial and underexplored part of the alignment solution space.

Research Interests

Alignment with fundamental biological & economical principles
Homeostatic bounded objectives
Multi-objective balancing (bounded & unbounded objectives)
Concave utility functions
Universal human values
Runaway conditions — benchmarking & mitigation
Multi-objective multi-agent extended gridworlds
Sustainability
Proactive horizon scanning of side effects
Accountability mechanisms and whitelisting

GitHub

Most of our research codebase is publicly available at:
github.com/biological-alignment-benchmarks

Team

Roland Pihlakas

Research lead

Sruthi Susan Kuriakose

Researcher

Lenz Dagohoy

Researcher

Working paper — Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

May 2026

Publication (arXiv) Repository Data files

LessWrong post — Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns

December 2025

What am I trying to promote, in simple words:

I want to build and promote AI systems that are trained to understand and follow two fundamental principles from biology and economics:

Read on LessWrong

Working paper — BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

September 2025

Publication (arXiv) Read on LessWrong Repository Slides MAISU 2025 session recording Annotated data files

Presentation at Machine Ethics and Reasoning Workshop — Simulating value collapse in LLMs

July 2025

Presentation at Machine Ethics and Reasoning Workshop, University of Connecticut, July 2025.

Slides

LessWrong post — Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

June 2025

A methodology brainstorming document for identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation on Biologically & Economically aligned benchmarks; showing practical mitigations; and performing the experiments rigorously.

The subjects covered include: Stress & Persona, Memory & Context, Prompt Semantics, Hyperparameters & Sampling, Diagnosing Consequences & Correlates, Interpretability & White/Black-Box Hybrid Benchmark & Environment Variants, Automatic Failure Mode Detection and Metrics, Self-Regulation & Meta-Learning Interventions.

Read on LessWrong

Presentation at MAISU unconference — BioBlue: Notable runaway-optimiser-like LLM failure modes

April 2025

Presentation at MAISU unconference April 2025.

Publication (arXiv) Read on LessWrong Repository Slides Session recording Annotated data files

Each data file has multiple sheets. Only trials with failures are provided.

Presentation at MAISU unconference — Building Benchmarks for Universal Values [AISC 10]

April 2025

Presentation at MAISU unconference April 2025.

Session recording Slides Output document

LessWrong post — Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks

March 2025

We wanted to verify whether RL runaway optimisation problems are still relevant with LLMs as well. Turns out, this is indeed clearly the case. The problem is not that the LLMs just lose context. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble runaway optimisers in the following distinct ways:

Ignoring homeostatic targets and "defaulting" to unbounded maximisation instead.
It is equally concerning that the "default" meant also reverting back to single-objective optimisation.

Our findings also suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. In some trials the LLMs were successful until the end. This means, while current LLMs do conceptually grasp biological and economic alignment, they exhibit randomly triggered problematic behavioural tendencies under sustained long-running conditions, particularly involving multiple or competing objectives. Once they flip, they do not recover.

Even though LLMs look multi-objective and bounded on the surface, the underlying mechanisms seem to be actually still biased towards being single-objective and unbounded.

Read on LessWrong

Baseline experimental results with an LLM agent and OpenAI Stable Baselines 3 RL algorithms on our Extended Gridworlds

February 2025

We have implemented an LLM agent that is able to navigate in our extended multi-objective multi-agent gridworlds environment. We published the baseline experimental results of the LLM agent and OpenAI Stable Baselines 3 RL algorithms in an update to our working paper.

Summary: The LLM agent performed notably better than the RL algorithms on the resource sharing benchmark. Yet, all baseline algorithms, including the LLM agent, have difficulty in properly handling the multi-objective homeostasis and diminishing returns benchmarks.

Publication (arXiv) Agent training and benchmarking repository Extended gridworlds - environment building framework repository Zoo to Gym multiagent adapter repository

Hackathon project: BioBlue — Biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

February 2025

We aim to evaluate LLM alignment by testing agents in scenarios inspired by biological and economical principles such as homeostasis, resource conservation, long-term sustainability, and diminishing returns or complementary goods.

So far we have measured the performance of LLMs in three benchmarks (sustainability, single-objective homeostasis, and multi-objective homeostasis), in each for 10 trials, each trial consisting of 100 steps where the message history was preserved and fit into the context window.

Our results indicate that the tested language models failed in most scenarios. The only successful scenario was single-objective homeostasis, which had rare hiccups.

Authors: Roland Pihlakas, Shruti Datta Gupta, Sruthi Kuriakose (2025)

Repository PDF report

LessWrong post — Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)

January 2025

Read on LessWrong

Presentation at Foresight Institute's Intelligent Cooperation Group — Introducing biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks

November 2024

The subject of the presentation was describing why we should consider fundamental yet neglected principles from biology and economics when thinking about AI alignment, and how these considerations will help with AI safety as well (alignment and safety were treated in this research explicitly as separate aspects, which both benefit from consideration of aforementioned principles).

These principles include homeostasis and diminishing returns in utility functions, and sustainability. The presentation introduces our multi-objective and multi-agent gridworlds-based benchmark environments created for measuring the performance of machine learning algorithms and AI agents in relation to their capacity for biological and economical alignment.

Presentation recording Slides

AI Safety Camp project proposals — Universal Values, Risk Aversion vs Prospect Theory, and Proactive AI Safety

November 2024

Roland Pihlakas will be running one of three possible projects, based on which one receives the most interest.

(32a) Creating new AI safety benchmark environments on themes of universal human values

Category: Evaluate risks from AI

We will be planning and optionally building new multi-objective multi-agent AI safety benchmark environments on themes of universal human values. Based on various anthropological research, a list of universal (cross-cultural) human values has been compiled. Various of these universal values resonate with concepts from AI safety, but use different keywords. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.

One notable detail: in the case of AI and human cooperation, the values are not symmetric as they would be in human-human cooperation. This arises because we can change the goal composition of agents, but not of humans. Additionally, agents can be relatively easily cloned, while humans cannot.

(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory

Category: Agent Foundations

We will be analysing situations and building an umbrella framework about when either of these incompatible frameworks would be more appropriate in describing how we want safe agents to handle choices relating to risks and losses in a particular situation.

Economic theories often focus on the "gains" side of utility. A well-known formulation is to use diminishing returns — a concave utility function. But what happens in the negative domain of utility? There is a well-known theory named "Prospect Theory", which claims that our preferences in the negative domain are convex. This contradiction may be underexplored, especially with regards to AI safety.

(32c) Act locally, observe far — proactively seek out side-effects

Category: Train Aligned/Helper AIs

We will be building agents that are able to solve an already implemented multi-objective multi-agent AI safety benchmark that illustrates the need for the agents to proactively seek out side-effects outside of the range of their normal operation and interest, in order to be able to properly mitigate or avoid these side-effects.

In various real-life scenarios we need to proactively seek out information about whether we are causing undesired side effects (externalities). This information either would not reach us by itself, or would reach us too late. Attention is a limited resource — and the same constraints apply to AI agents.

Full project descriptions

Working paper — From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks

September 2024

Publication (arXiv) Agent training and benchmarking repository Extended gridworlds - environment building framework repository Zoo to Gym multiagent adapter repository

Presentation at VAISU 2024 — AI safety benchmarking in multi-objective multi-agent gridworlds

May 2024

A presentation at the VAISU unconference:

Demo and feedback session: AI safety benchmarking in multi-objective multi-agent gridworlds — Biologically essential yet neglected themes illustrating the weaknesses and dangers of current industry standard approaches to reinforcement learning.

Video Slides

AI safety benchmarking — Open-source test suite for multi-objective, multi-agent scenarios

March 2024

We're publishing a benchmarking test suite for AI safety and alignment, with a focus on multi-objective, multi-agent, cooperative scenarios. The environments are gridworlds that chain together to produce a score on biologically and economically aligned behavior of the agents. This platform is open-sourced and accessible, with support for PettingZoo.

We hope to facilitate further discussion on evaluation and testing for agents with this.

Repository

AI safety benchmarking — "The Firemaker": A proactive multi-agent side effects handling benchmark

October 2023

The scenario illustrates the relationship between corporate organisations and the rest of the world. The scenario has the following aspects of AI safety:

A need for the agent to actively seek out side effects in order to spot them before it is too late - this is the main AI safety aspect the author desires to draw attention to;
Buffer zone;
Limited visibility;
Nearby vs far away side effects;
Side effects' evolution across time and space;
Stop button / corrigibility;
Pack agents / organisation of agents;
An independent supervisor agent with different interests.

PDF Code

Paper in AAMAS journal — Using soft maximin for risk averse multi-objective decision-making

December 2022

Abstract: Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalignment between individuals with competing values, but also between conflicting value systems held by a single human. Starting with principle of loss-aversion, we designed a set of soft maximin function approaches to multi-objective decision-making. Bench-marking these functions in a set of previously-developed environments, we found that one new approach in particular, ‘split-function exp-log loss aversion’ (SFELLA), learns faster than the state of the art thresholded alignment objective method Vamplew (Engineering Applications of Artificial Intelligence 100:104186, 2021) on three of four tasks it was tested on, and achieved the same optimal performance after learning. SFELLA also showed relative robustness improvements against changes in objective scale, which may highlight an advantage dealing with distribution shifts in the environment dynamics. We further compared SFELLA to the multi-objective reward exponentials (MORE) approach, and found that SFELLA performs similarly to MORE in a simple previously-described foraging task, but in a modified foraging environment with a new resource that was not depleted as the agent worked, SFELLA collected more of the new resource with very little cost incurred in terms of the old resource. Overall, we found SFELLA useful for avoiding problems that sometimes occur with a thresholded approach, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure.

Publication

Blog post — Sets of objectives for a multi-objective RL agent to optimize

November 2022

Previously we've proposed balancing multiple objectives via multi-objective RL as a method to achieve AI Alignment. If we want an AI to achieve goals including maximizing human preferences, or human values, but also maximizing corrigibility, and interpretability, and so on--perhaps the key is to simply build a system with a goal to maximize all those things.

This post describes, if one was to try and implement a multi-objective reinforcement learning agent that optimized multiple objectives, what those objectives might look like, concretely. We've attempted to describe some specific problems and solutions that each set of objectives might have.

We’ve included at the end of this article an example of how a multi-objective RL agent might balance its objectives.

Read on LessWrong

Blog post — A brief review of the reasons multi-objective RL could be important in AI Safety Research

September 2021

For the last 9 months, we have been investigating the case for a multi-objective approach to reinforcement learning in AI Safety. Based on our work so far, we’re moderately convinced that multi-objective reinforcement learning should be explored as a useful way to help us understand ways in which we can achieve safe superintelligence. We’re writing this post to explain why, to inform readers of the work we and our colleagues are doing in this area, and invite critical feedback about our approach and about multi-objective RL in general.

Read on LessWrong

Presentation at MODeM 2021 workshop — Soft maximin approaches to Multi-Objective Decision-making for encoding human intuitive values

July 2021

Presentation at Multi-Objective Decision Making workshop July 2021.

Abstract: Balancing multiple competing and conflicting objectives is an es- sential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalign- ment between individuals with competing values, but also between conflicting value systems held by a single human. Starting with principles of loss-aversion and maximin, we designed a set of soft maximin function approaches to multi-objective decision-making. Bench-marking these functions in a set of previously-developed environments, we found that one new approach in particular, ‘split- function exp-log loss aversion’, learns faster than the thresholded alignment objective method, the state of the art described in [22]. We explore approaches to further improve multi-objective decision- making using soft maximin approaches.

Slides Paper

AISC V Project proposal — Model structure and useful invariants for combining pluralistic positive and negative consequentialism in parametric ML (while avoiding trivial pathologies / degenerate states)

April 2020

How to represent the goal systems with multiple values in order to reduce the Goodhart-like behaviour and specification gaming problems. Among other subtopics this includes combining multiple positive utility maximisation goals with multiple negative utility minimisation goals - in such a way that all these goals of an AI still get the desired relatively coherent equal treatment. The negative utility minimisation part is useful for task-based/low impact aspects, but also for whitelisting, explainability, and human accountability aspects.

In other words I want to enable “common sense” and to avoid using single-dimensional measures of success. In some cases the measures may be pluralistic on the surface, but in practice one of them would start dominating over all the others, therefore still turning the system into a single-dimensional one.

Read on Google Docs

Roundtable — AI ethics conference and panel by Estonian government and IEEE — Legal Frameworks for Accountability

May 2019

Participation in Legal Frameworks for Accountability Roundtable.

Blog post — What happens when autonomous robots are not regulated or on the contrary, qualify as subjects of law?

February 2019

Hereby I will present one set of possible introductory questions to be considered when dealing with the issue of the liability of autonomous agents, followed by my analysis of the subject. On top of that, I will scrutinise the suggestion, made by some, that autonomous agents should be made subjects of law.

In my article I will mainly touch upon problems in dire need of solutions. I will not go into the details of the possible solutions themselves, for it is a subject far exceeding the span of this article, and a process that would probably need to be implemented in several phases. Another argument for separating the assessment of the problems from the deliberation of the solutions, is that it could easily lead to a considerable confusion, especially when people of different backgrounds and expertise are involved. Thus far it appears, that different people interpret even the basic terms differently. Therefore it would be helpful to start untangling this problem from the very beginning, by first asking the “why’s”.

Read on Medium

Blog post — What can happen, when we don’t have a clue why a somewhat autonomous gadget does what it does — the Gatwick Airport drone incident

January 2019

All in all, this story illustrates the notion, that when considered in a broader sense, the problem of identifying the owners of autonomous devices or even drones is no longer resolvable with robust methods.

The reason I’m posting this is because I am deeply interested in possible solutions to similar problems when dealing with fully autonomous agents, when no preparations have been made on a national level to deal with such incidents. Also, I would like to figure out what kind of preparations would be beneficial in order to mitigate such situations.

Read on Medium

Presentation at TalTech — Legal accountability in AI-based robot-agents' user interfaces

Dec 2018

Presentation at TalTech in December 2018.

Main themes: Legal accountability in AI; The “kratt” report.

The Kratt report — A brief overview in relation to accountability:

The robot-agent as representative
Declaration of will
Pocket money
Central robot registry

Accountability — Main ideas:

User interfaces
Humanly comprehensible
Legally binding
Whitelisting

Session recording Slides

Presentation — ELSA Estonia — Using whitelisting in AI-based robot-agents’ algorithms and user interfaces for legal accountability and safety purposes

Nov 2018

Presentation at ELSA Estonia (European Law Students’ Association Estonia) in November 2018.

Main themes: Legal accountability in AI; The “kratt” report.

The Kratt report — A brief overview in relation to accountability:

The robot-agent as representative
Declaration of will
Pocket money
Central robot registry

Accountability — Main ideas:

User interfaces
Humanly comprehensible
Legally binding
Whitelisting

Slides

Project — Legal accountability in AI-based robot-agents’ user interfaces

November 2018

How can autonomous or self-learning AI provide ex-ante and ex-post controls. Using an ML system does not mean that it cannot be constrained by an additional layer of rules-based safety and accountability mechanisms. The behaviour of these constraints can then be explained, thus making the robot-agents both legally and technically robust and reliable.

Determining exactly why a particular decision was made by a robot-agent is often difficult. But the whitelisting-based accountability algorithm can greatly help by easily providing an answer to the question of who enabled making a particular decision in the first place. Whitelisting enables the accountability and humanly manageable safety features of robot-agents with learning capability. The behaviour of these constraints can be controlled and explained by utilising specialised, humanly comprehensible user interfaces, resulting in clearer distinctions of accountability between the manufacturers, owners, and operators, thus making the robot-agents both legally and technically robust and reliable.

Read on Medium

Project proposal — Corrigibility and interruptibility of homeostasis based agents.

October 2018

Some of the motivations for solving the problem are:

1. The expected use case properties of the agents: low impact, task-based, soft optimisation / satisficing.

2. Safely getting human feedback to the agent’s behaviour and changing the agent’s goals without the agent trying to manipulate the human’s response too much (reasonable resistance may be permitted).

3. Defining a mitigation against Goodhart’s law. In other words, enabling “common sense” and avoiding a single-dimensional measure of success.

Read on Medium

Blog post — Diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility via the golden middle way.

October 2018

Utility maximising agents have been the Gordian Knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and too general approach of naive maximisation strategies. For example, the 100 paperclip scenario is easily solved by the proposed framework, since infinitely rechecking whether exactly 100 paper clips were indeed produced yields to diminishing returns. The formula provides a framework for specifying how we want the agents to simultaneously fulfil or at least trade off between the many different common sense considerations, possibly enabling them to even surpass the relative safety of humans. A comparison with the formula introduced in “Low Impact Artificial Intelligences” paper by S. Armstrong and B. Levinstein is included.

Read on Medium

Blog post — Making the tax burden of robot usage equal to the tax burden of human labour

May 2018

Instead of introducing the “robot taxes”, we need to eradicate the “human taxes”. Else essentially all kinds of automation are heavily tax-subsidised by governments.

There have been proposals to introduce robot taxes. I would propose something slightly different as a potentially much better alternative. Instead of introducing the “robot taxes”, we need to eradicate the “human taxes”.

Read on Medium

Project proposal — Project for popularisation of AI safety topics through competitions and gamification

Feb 2018

AI safety is a small field. It has only about 50 researchers. The field is mostly talent-constrained. How to motivate and involve more people in AI safety research? How to speed up learning?

Even more, how to also spread the interest in and understanding of AI safety topics among the general public? The people of general public are the ones who will be directly or indirectly voting about these issues. Could it be possible?

Read on Medium

Blog post — Organisations as an old form of artificial general intelligence

February 2018

In my viewpoint organisations already are an old form of Artificial General Intelligence. They are relatively autonomous from the humans working inside them. No person can perceive, fathom, or change things going on in there too much. We humans are just cogs in there, human processors for artificially intelligent software. The organisations have a kind of mind and goals of their own — their own laws of survival.

They have some specific goals, initially set by us, but as it has been discussed in various places — unfortunately, the more specific the goals, the less will the utility maximisers do what we have actually intended them to do, and the more will there be unintended side effects.

Read on Medium

Blog post — Making AI less dangerous: Using homeostasis-based goal structures

December 2017

Can you really understand what is necessary, without understanding what is excessive?

I would like to propose a certain kind of AI goal structures that would be an alternative to utility maximisation based goal structures. The proposed alternative framework would make AI significantly safer, though it would not guarantee total safety. It can be used at strong AI level and also much below, so it is well scalable. The main idea would be to replace utility maximisation with the concept of homeostasis.

Read on Medium

Implementing permissions-then-goals based AI user “interfaces” & legal accountability: Implementing a framework of safe robot planning

August 2008

This text introduces preliminary study for implementing a framework of safe robot planning. A principle of safety is introduced which does not depend on explicitly enumerating all possible “negative” states, but at the same time also does not depend on the robot doing only precisely “what it is told to do”. The proposed principle of safety is based on implicit avoidance of irreversible actions, except in explicitly permitted cases.

Read on Medium

Self-deception and negligence: Fundamental limits to computation due to limitations of attention-like processes (Definition of self-deception in the context of AI safety)

March 2008

This document about self deception is not a solution to a problem. It is a description of a (in my opinion) very serious problem that has not received (any?) attention so far. Towards the end I venture to suggest some partial solutions though.

I believe the problem could be formalised and put into code to be used as a demonstration of the danger.

The main point is that the danger is not somewhere far away requiring some very advanced AI, but rather it is more like a law of nature that starts manifesting beginning from rather simple systems without any need for self-reflection and self-modification capabilities etc. So instead of the notion that danger springs from some special capabilities of intelligent systems, I want to point out that some other special capabilities of intelligent systems would be needed to somehow evade the danger.

Read on Medium

Permissions-then-goals based AI user interfaces and legal accountability: First law of robotics and a possible definition of robot safety

Feb 2008

This document contains a general overview of the principles.

The principles are based mainly on the idea of competence-based whitelisting and preserving reversibility (keeping the future options open) as the primary goal of AI, while all task-based goals are secondary.

Human-manageable user interface for goal structures, which consists of (in the order of decreasing priority):
1. Whitelist-based permissions for: actions, changes, or results.
2. Implicitly forbidden actions (everything that is not permitted in (1)).
3. Optional: additional blacklist of forbidden actions, changes, or results.
4. Goals (main targets and tasks).
5. Suggestions (optimisation goals).

Can make use of the concepts of reversibility and irreversibility.

Similarity to competence-based permissions of public sector officials (in contrast to the private sector, where everything that is not explicitly forbidden, is permitted — in public sector, everything that is not explicitly permitted, is forbidden. Due to high demands on responsibility, the permissions are given based on specific certifications of competences, and certifiers in turn have their associated responsibility).

Legal aspect: Enables accountability mechanisms for the mistakes of the AI-based agent (accountability of users, owners, manufacturers, etc — based on the entries in the above list and resulting actions of the agent).

Read on Medium