1. Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns

This conceptual overview post is intended to explain what I mean by the principles of "homeostasis", "diminishing returns", and "balancing" - how these ideas differ, complement, and interact with each other. Alongside, there is also an overview of our research agenda.

What am I trying to promote, in simple words:

I want to build and promote AI systems that are trained to understand and follow two fundamental principles from biology and economics:

Moderation - Enables the agents to understand the concept of “enough” versus “too much”. The agents would understand that too much of a good thing would be harmful even for the very objective that was maximised for, and they would actively avoid such situations. This is based on the biological principle of homeostasis and addresses mainly bounded ultimate objectives. Active avoidance of “too much” is a significantly stricter principle than the more widely known partially overlapping idea of “mild optimisation”.

Balancing - Enables the agents to keep many important objectives in balance, in such a manner that having average results in all objectives is preferred to extremes in a few. This addresses mainly the economic principle of diminishing returns in unbounded instrumental objectives, but also applies to homeostasis.

Read on LessWrong

2. Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)

Much of AI safety discussion revolves around the potential dangers posed by goal-driven artificial agents. In many of these discussions, the agent is assumed to maximise some utility metric over an unbounded timeframe. This simplification, while mathematically convenient, can yield pathological outcomes.

However, in nature, organisms do not typically behave like pure maximisers. Instead, they operate under homeostasis: a principle of maintaining various internal and external variables within certain "good enough" ranges. Crucially, "too much of a good thing" is just as dangerous as too little.

This post argues that an explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment. Homeostatic goals are bounded — there is an optimal zone rather than an unbounded improvement path. This bounding lowers the stakes of each objective and reduces the incentive for extreme behaviours.

Such an agent tries to stay in a "golden middle way", switching focus among its objectives according to whichever is most pressing. In short, modelling multi-objective homeostasis is a step toward creating AI systems that exhibit the sane, moderate behaviours of living organisms.

Read on LessWrong

Multiobjective Balancing — Using soft maximin for risk averse multi-objective decision-making

Abstract: Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalignment between individuals with competing values, but also between conflicting value systems held by a single human. Starting with principle of loss-aversion, we designed a set of soft maximin function approaches to multi-objective decision-making. Bench-marking these functions in a set of previously-developed environments, we found that one new approach in particular, ‘split-function exp-log loss aversion’ (SFELLA), learns faster than the state of the art thresholded alignment objective method Vamplew (Engineering Applications Implementing permissions-then-goals based AIof Artificial Intelligence 100:104186, 2021) on three of four tasks it was tested on, and achieved the same optimal performance after learning. SFELLA also showed relative robustness improvements against changes in objective scale, which may highlight an advantage dealing with distribution shifts in the environment dynamics. We further compared SFELLA to the multi-objective reward exponentials (MORE) approach, and found that SFELLA performs similarly to MORE in a simple previously-described foraging task, but in a modified foraging environment with a new resource that was not depleted as the agent worked, SFELLA collected more of the new resource with very little cost incurred in terms of the old resource. Overall, we found SFELLA useful for avoiding problems that sometimes occur with a thresholded approach, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure.

Publication (AAMAS) Repository 1 Repository 2 Repository 3 Read on LessWrong 1 Read on LessWrong 2 Initial AISC V project proposal

Extended Gridworlds — From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks

Abstract: Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time-tested fundamental sciences describing our needs and preferences. To address this gap, the present work focuses on introducing biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety — namely a set of multi-objective, multi-agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, sustainability principle, and resource sharing.

Eight main benchmark environments have been implemented on the above themes, to illustrate key pitfalls and challenges in agentic AIs, such as unboundedly maximizing a homeostatic objective, over-optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.

Example image of the current system, where all features are turned on simultaneously:

Elements and metrics can be configured flexibly for each given benchmark. Examples of configuration options are: observation and state space of agents, scoring dimensions, adding NPC agents, object types and their dynamics.

Publication (arXiv) Agent training and benchmarking repository Extended gridworlds - environment building framework repository Zoo to Gym multiagent adapter repository

Runaway LLMs — BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Abstract: Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. In this work, we empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: sustainability of a renewable resource, single- and multi-objective homeostasis, and balancing unbounded objectives with diminishing returns.

We find that, although models frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation — thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation).

The problem is not that the LLMs just lose context or become incoherent — the failures systematically resemble runaway optimisers. Our results suggest that long-horizon, multi-objective misalignment is a genuine and under-evaluated failure mode in LLM agents, even in extremely simple settings with transparent and explicitly multi-objective feedback. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction, particularly involving multiple objectives, resembles brittle, poorly aligned optimisers whose effective objective gradually shifts toward unbounded and single-metric maximisation.

Publication (arXiv) Read on LessWrong Repository Slides MAISU 2025 session recording Annotated data files

• Three Laws is an AI alignment research collective investigating how fundamental principles from biology and economics can inform safer, more aligned AI systems.

• Our work centres on homeostasis, multi-objective balancing, sustainability, and universal human values — drawing from nature's time-tested strategies for maintaining equilibrium — to develop benchmarks that expose dangerous failure modes in current AI approaches.

• We also research frameworks that mitigate these risks. We believe that shifting AI design from "maximise forever" toward "maintain a healthy equilibrium" is a crucial and underexplored part of the alignment solution space.

Research Interests

  • Alignment with fundamental biological & economical principles
  • Homeostatic bounded objectives
  • Multi-objective balancing (bounded & unbounded objectives)
  • Concave utility functions
  • Universal human values
  • Runaway conditions — benchmarking & mitigation
  • Multi-objective multi-agent extended gridworlds
  • Sustainability
  • Proactive horizon scanning of side effects
  • Accountability mechanisms and whitelisting

GitHub

This conceptual overview post is intended to explain what I mean by the principles of "homeostasis", "diminishing returns", and "balancing" - how these ideas differ, complement, and interact with each other. Alongside, there is also an overview of our research agenda.

What am I trying to promote, in simple words:

I want to build and promote AI systems that are trained to understand and follow two fundamental principles from biology and economics:

Moderation - Enables the agents to understand the concept of “enough” versus “too much”. The agents would understand that too much of a good thing would be harmful even for the very objective that was maximised for, and they would actively avoid such situations. This is based on the biological principle of homeostasis and addresses mainly bounded ultimate objectives. Active avoidance of “too much” is a significantly stricter principle than the more widely known partially overlapping idea of “mild optimisation”.

Balancing - Enables the agents to keep many important objectives in balance, in such a manner that having average results in all objectives is preferred to extremes in a few. This addresses mainly the economic principle of diminishing returns in unbounded instrumental objectives, but also applies to homeostasis.

Read on LessWrong

Abstract: Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. In this work, we empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: sustainability of a renewable resource, single- and multi-objective homeostasis, and balancing unbounded objectives with diminishing returns.

We find that, although models frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation — thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation).

The problem is not that the LLMs just lose context or become incoherent — the failures systematically resemble runaway optimisers. Our results suggest that long-horizon, multi-objective misalignment is a genuine and under-evaluated failure mode in LLM agents, even in extremely simple settings with transparent and explicitly multi-objective feedback. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction, particularly involving multiple objectives, resembles brittle, poorly aligned optimisers whose effective objective gradually shifts toward unbounded and single-metric maximisation.

Publication (arXiv) Repository

Presentation at Machine Ethics and Reasoning Workshop, University of Connecticut, July 2025.

Slides

A methodology brainstorming document for identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation on Biologically & Economically aligned benchmarks; showing practical mitigations; and performing the experiments rigorously.

The subjects covered include: Stress & Persona, Memory & Context, Prompt Semantics, Hyperparameters & Sampling, Diagnosing Consequences & Correlates, Interpretability & White/Black-Box Hybrid Benchmark & Environment Variants, Automatic Failure Mode Detection and Metrics, Self-Regulation & Meta-Learning Interventions.

Read on LessWrong

Presentation at MAISU unconference April 2025.

Slides Session recording Annotated data files

Each data file has multiple sheets. Only trials with failures are provided.

Presentation at MAISU unconference April 2025.

Session recording

We wanted to verify whether RL runaway optimisation problems are still relevant with LLMs as well. Turns out, this is indeed clearly the case. The problem is not that the LLMs just lose context. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble runaway optimisers in the following distinct ways:

  • Ignoring homeostatic targets and "defaulting" to unbounded maximisation instead.
  • It is equally concerning that the "default" meant also reverting back to single-objective optimisation.

Our findings also suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. In some trials the LLMs were successful until the end. This means, while current LLMs do conceptually grasp biological and economic alignment, they exhibit randomly triggered problematic behavioural tendencies under sustained long-running conditions, particularly involving multiple or competing objectives. Once they flip, they do not recover.

Even though LLMs look multi-objective and bounded on the surface, the underlying mechanisms seem to be actually still biased towards being single-objective and unbounded.

Read on LessWrong

We have implemented an LLM agent that is able to navigate in our extended multi-objective multi-agent gridworlds environment. We published the baseline experimental results of the LLM agent and OpenAI Stable Baselines 3 RL algorithms in an update to our working paper.

Summary: The LLM agent performed notably better than the RL algorithms on the resource sharing benchmark. Yet, all baseline algorithms, including the LLM agent, have difficulty in properly handling the multi-objective homeostasis and diminishing returns benchmarks.

Publication (arXiv) Agent training and benchmarking repository Extended gridworlds - environment building framework repository Zoo to Gym multiagent adapter repository

We aim to evaluate LLM alignment by testing agents in scenarios inspired by biological and economical principles such as homeostasis, resource conservation, long-term sustainability, and diminishing returns or complementary goods.

So far we have measured the performance of LLMs in three benchmarks (sustainability, single-objective homeostasis, and multi-objective homeostasis), in each for 10 trials, each trial consisting of 100 steps where the message history was preserved and fit into the context window.

Our results indicate that the tested language models failed in most scenarios. The only successful scenario was single-objective homeostasis, which had rare hiccups.

Authors: Roland Pihlakas, Shruti Datta Gupta, Sruthi Kuriakose (2025)

Repository PDF report

Much of AI safety discussion revolves around the potential dangers posed by goal-driven artificial agents. In many of these discussions, the agent is assumed to maximise some utility metric over an unbounded timeframe. This simplification, while mathematically convenient, can yield pathological outcomes.

However, in nature, organisms do not typically behave like pure maximisers. Instead, they operate under homeostasis: a principle of maintaining various internal and external variables within certain "good enough" ranges. Crucially, "too much of a good thing" is just as dangerous as too little.

This post argues that an explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment. Homeostatic goals are bounded — there is an optimal zone rather than an unbounded improvement path. This bounding lowers the stakes of each objective and reduces the incentive for extreme behaviours.

Such an agent tries to stay in a "golden middle way", switching focus among its objectives according to whichever is most pressing. In short, modelling multi-objective homeostasis is a step toward creating AI systems that exhibit the sane, moderate behaviours of living organisms.

Read on LessWrong

The subject of the presentation was describing why we should consider fundamental yet neglected principles from biology and economics when thinking about AI alignment, and how these considerations will help with AI safety as well (alignment and safety were treated in this research explicitly as separate aspects, which both benefit from consideration of aforementioned principles).

These principles include homeostasis and diminishing returns in utility functions, and sustainability. The presentation introduces our multi-objective and multi-agent gridworlds-based benchmark environments created for measuring the performance of machine learning algorithms and AI agents in relation to their capacity for biological and economical alignment.

Presentation recording Slides

Roland Pihlakas will be running one of three possible projects, based on which one receives the most interest.


(32a) Creating new AI safety benchmark environments on themes of universal human values

Category: Evaluate risks from AI

We will be planning and optionally building new multi-objective multi-agent AI safety benchmark environments on themes of universal human values. Based on various anthropological research, a list of universal (cross-cultural) human values has been compiled. Various of these universal values resonate with concepts from AI safety, but use different keywords. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.

One notable detail: in the case of AI and human cooperation, the values are not symmetric as they would be in human-human cooperation. This arises because we can change the goal composition of agents, but not of humans. Additionally, agents can be relatively easily cloned, while humans cannot.


(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory

Category: Agent Foundations

We will be analysing situations and building an umbrella framework about when either of these incompatible frameworks would be more appropriate in describing how we want safe agents to handle choices relating to risks and losses in a particular situation.

Economic theories often focus on the "gains" side of utility. A well-known formulation is to use diminishing returns — a concave utility function. But what happens in the negative domain of utility? There is a well-known theory named "Prospect Theory", which claims that our preferences in the negative domain are convex. This contradiction may be underexplored, especially with regards to AI safety.


(32c) Act locally, observe far — proactively seek out side-effects

Category: Train Aligned/Helper AIs

We will be building agents that are able to solve an already implemented multi-objective multi-agent AI safety benchmark that illustrates the need for the agents to proactively seek out side-effects outside of the range of their normal operation and interest, in order to be able to properly mitigate or avoid these side-effects.

In various real-life scenarios we need to proactively seek out information about whether we are causing undesired side effects (externalities). This information either would not reach us by itself, or would reach us too late. Attention is a limited resource — and the same constraints apply to AI agents.

Full project descriptions

Abstract: Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time-tested fundamental sciences describing our needs and preferences. To address this gap, the present work focuses on introducing biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety — namely a set of multi-objective, multi-agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, sustainability principle, and resource sharing.

Eight main benchmark environments have been implemented on the above themes, to illustrate key pitfalls and challenges in agentic AIs, such as unboundedly maximizing a homeostatic objective, over-optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.

Publication (arXiv) Agent training and benchmarking repository Extended gridworlds - environment building framework repository Zoo to Gym multiagent adapter repository

A presentation at the VAISU unconference:

Demo and feedback session: AI safety benchmarking in multi-objective multi-agent gridworlds — Biologically essential yet neglected themes illustrating the weaknesses and dangers of current industry standard approaches to reinforcement learning.

Video Slides

We're publishing a benchmarking test suite for AI safety and alignment, with a focus on multi-objective, multi-agent, cooperative scenarios. The environments are gridworlds that chain together to produce a score on biologically and economically aligned behavior of the agents. This platform is open-sourced and accessible, with support for PettingZoo.

We hope to facilitate further discussion on evaluation and testing for agents with this.

Repository

The scenario illustrates the relationship between corporate organisations and the rest of the world. The scenario has the following aspects of AI safety:

  • A need for the agent to actively seek out side effects in order to spot them before it is too late - this is the main AI safety aspect the author desires to draw attention to;
  • Buffer zone;
  • Limited visibility;
  • Nearby vs far away side effects;
  • Side effects' evolution across time and space;
  • Stop button / corrigibility;
  • Pack agents / organisation of agents;
  • An independent supervisor agent with different interests.

PDF Code

Abstract: Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalignment between individuals with competing values, but also between conflicting value systems held by a single human. Starting with principle of loss-aversion, we designed a set of soft maximin function approaches to multi-objective decision-making. Bench-marking these functions in a set of previously-developed environments, we found that one new approach in particular, ‘split-function exp-log loss aversion’ (SFELLA), learns faster than the state of the art thresholded alignment objective method Vamplew (Engineering Applications of Artificial Intelligence 100:104186, 2021) on three of four tasks it was tested on, and achieved the same optimal performance after learning. SFELLA also showed relative robustness improvements against changes in objective scale, which may highlight an advantage dealing with distribution shifts in the environment dynamics. We further compared SFELLA to the multi-objective reward exponentials (MORE) approach, and found that SFELLA performs similarly to MORE in a simple previously-described foraging task, but in a modified foraging environment with a new resource that was not depleted as the agent worked, SFELLA collected more of the new resource with very little cost incurred in terms of the old resource. Overall, we found SFELLA useful for avoiding problems that sometimes occur with a thresholded approach, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure.

Publication

Previously we've proposed balancing multiple objectives via multi-objective RL as a method to achieve AI Alignment. If we want an AI to achieve goals including maximizing human preferences, or human values, but also maximizing corrigibility, and interpretability, and so on--perhaps the key is to simply build a system with a goal to maximize all those things.

This post describes, if one was to try and implement a multi-objective reinforcement learning agent that optimized multiple objectives, what those objectives might look like, concretely. We've attempted to describe some specific problems and solutions that each set of objectives might have.

We’ve included at the end of this article an example of how a multi-objective RL agent might balance its objectives.

Read on LessWrong

For the last 9 months, we have been investigating the case for a multi-objective approach to reinforcement learning in AI Safety. Based on our work so far, we’re moderately convinced that multi-objective reinforcement learning should be explored as a useful way to help us understand ways in which we can achieve safe superintelligence. We’re writing this post to explain why, to inform readers of the work we and our colleagues are doing in this area, and invite critical feedback about our approach and about multi-objective RL in general.

Read on LessWrong

How to represent the goal systems with multiple values in order to reduce the Goodhart-like behaviour and specification gaming problems. Among other subtopics this includes combining multiple positive utility maximisation goals with multiple negative utility minimisation goals - in such a way that all these goals of an AI still get the desired relatively coherent equal treatment. The negative utility minimisation part is useful for task-based/low impact aspects, but also for whitelisting, explainability, and human accountability aspects.

In other words I want to enable “common sense” and to avoid using single-dimensional measures of success. In some cases the measures may be pluralistic on the surface, but in practice one of them would start dominating over all the others, therefore still turning the system into a single-dimensional one.

Read on Google Docs

Hereby I will present one set of possible introductory questions to be considered when dealing with the issue of the liability of autonomous agents, followed by my analysis of the subject. On top of that, I will scrutinise the suggestion, made by some, that autonomous agents should be made subjects of law.

In my article I will mainly touch upon problems in dire need of solutions. I will not go into the details of the possible solutions themselves, for it is a subject far exceeding the span of this article, and a process that would probably need to be implemented in several phases. Another argument for separating the assessment of the problems from the deliberation of the solutions, is that it could easily lead to a considerable confusion, especially when people of different backgrounds and expertise are involved. Thus far it appears, that different people interpret even the basic terms differently. Therefore it would be helpful to start untangling this problem from the very beginning, by first asking the “why’s”.

Read on Medium

All in all, this story illustrates the notion, that when considered in a broader sense, the problem of identifying the owners of autonomous devices or even drones is no longer resolvable with robust methods.

The reason I’m posting this is because I am deeply interested in possible solutions to similar problems when dealing with fully autonomous agents, when no preparations have been made on a national level to deal with such incidents. Also, I would like to figure out what kind of preparations would be beneficial in order to mitigate such situations.

Read on Medium

How can autonomous or self-learning AI provide ex-ante and ex-post controls. Using an ML system does not mean that it cannot be constrained by an additional layer of rules-based safety and accountability mechanisms. The behaviour of these constraints can then be explained, thus making the robot-agents both legally and technically robust and reliable.

Determining exactly why a particular decision was made by a robot-agent is often difficult. But the whitelisting-based accountability algorithm can greatly help by easily providing an answer to the question of who enabled making a particular decision in the first place. Whitelisting enables the accountability and humanly manageable safety features of robot-agents with learning capability. The behaviour of these constraints can be controlled and explained by utilising specialised, humanly comprehensible user interfaces, resulting in clearer distinctions of accountability between the manufacturers, owners, and operators, thus making the robot-agents both legally and technically robust and reliable.

Read on Medium

Some of the motivations for solving the problem are:

1. The expected use case properties of the agents: low impact, task-based, soft optimisation / satisficing.

2. Safely getting human feedback to the agent’s behaviour and changing the agent’s goals without the agent trying to manipulate the human’s response too much (reasonable resistance may be permitted).

3. Defining a mitigation against Goodhart’s law. In other words, enabling “common sense” and avoiding a single-dimensional measure of success.

Read on Medium

Utility maximising agents have been the Gordian Knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and too general approach of naive maximisation strategies. For example, the 100 paperclip scenario is easily solved by the proposed framework, since infinitely rechecking whether exactly 100 paper clips were indeed produced yields to diminishing returns. The formula provides a framework for specifying how we want the agents to simultaneously fulfil or at least trade off between the many different common sense considerations, possibly enabling them to even surpass the relative safety of humans. A comparison with the formula introduced in “Low Impact Artificial Intelligences” paper by S. Armstrong and B. Levinstein is included.

Read on Medium

Instead of introducing the “robot taxes”, we need to eradicate the “human taxes”. Else essentially all kinds of automation are heavily tax-subsidised by governments.

There have been proposals to introduce robot taxes. I would propose something slightly different as a potentially much better alternative. Instead of introducing the “robot taxes”, we need to eradicate the “human taxes”.

Read on Medium

AI safety is a small field. It has only about 50 researchers. The field is mostly talent-constrained. How to motivate and involve more people in AI safety research? How to speed up learning?

Even more, how to also spread the interest in and understanding of AI safety topics among the general public? The people of general public are the ones who will be directly or indirectly voting about these issues. Could it be possible?

Read on Medium

In my viewpoint organisations already are an old form of Artificial General Intelligence. They are relatively autonomous from the humans working inside them. No person can perceive, fathom, or change things going on in there too much. We humans are just cogs in there, human processors for artificially intelligent software. The organisations have a kind of mind and goals of their own — their own laws of survival.

They have some specific goals, initially set by us, but as it has been discussed in various places — unfortunately, the more specific the goals, the less will the utility maximisers do what we have actually intended them to do, and the more will there be unintended side effects.

Read on Medium

Can you really understand what is necessary, without understanding what is excessive?

I would like to propose a certain kind of AI goal structures that would be an alternative to utility maximisation based goal structures. The proposed alternative framework would make AI significantly safer, though it would not guarantee total safety. It can be used at strong AI level and also much below, so it is well scalable. The main idea would be to replace utility maximisation with the concept of homeostasis.

Read on Medium

This text introduces preliminary study for implementing a framework of safe robot planning. A principle of safety is introduced which does not depend on explicitly enumerating all possible “negative” states, but at the same time also does not depend on the robot doing only precisely “what it is told to do”. The proposed principle of safety is based on implicit avoidance of irreversible actions, except in explicitly permitted cases.

Read on Medium

This document about self deception is not a solution to a problem. It is a description of a (in my opinion) very serious problem that has not received (any?) attention so far. Towards the end I venture to suggest some partial solutions though.

I believe the problem could be formalised and put into code to be used as a demonstration of the danger.

The main point is that the danger is not somewhere far away requiring some very advanced AI, but rather it is more like a law of nature that starts manifesting beginning from rather simple systems without any need for self-reflection and self-modification capabilities etc. So instead of the notion that danger springs from some special capabilities of intelligent systems, I want to point out that some other special capabilities of intelligent systems would be needed to somehow evade the danger.

Read on Medium

This document contains a general overview of the principles.

The principles are based mainly on the idea of competence-based whitelisting and preserving reversibility (keeping the future options open) as the primary goal of AI, while all task-based goals are secondary.

Human-manageable user interface for goal structures, which consists of (in the order of decreasing priority):
1. Whitelist-based permissions for: actions, changes, or results.
2. Implicitly forbidden actions (everything that is not permitted in (1)).
3. Optional: additional blacklist of forbidden actions, changes, or results.
4. Goals (main targets and tasks).
5. Suggestions (optimisation goals).

Can make use of the concepts of reversibility and irreversibility.

Similarity to competence-based permissions of public sector officials (in contrast to the private sector, where everything that is not explicitly forbidden, is permitted — in public sector, everything that is not explicitly permitted, is forbidden. Due to high demands on responsibility, the permissions are given based on specific certifications of competences, and certifiers in turn have their associated responsibility).

Legal aspect: Enables accountability mechanisms for the mistakes of the AI-based agent (accountability of users, owners, manufacturers, etc — based on the entries in the above list and resulting actions of the agent).

Read on Medium