1. Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns
This conceptual overview post is intended to explain what I mean by the principles of "homeostasis", "diminishing returns", and "balancing" - how these ideas differ, complement, and interact with each other. Alongside, there is also an overview of our research agenda.
What am I trying to promote, in simple words:
I want to build and promote AI systems that are trained to understand and follow two fundamental principles from biology and economics:
Moderation - Enables the agents to understand the concept of “enough” versus “too much”. The agents would understand that too much of a good thing would be harmful even for the very objective that was maximised for, and they would actively avoid such situations. This is based on the biological principle of homeostasis and addresses mainly bounded ultimate objectives. Active avoidance of “too much” is a significantly stricter principle than the more widely known partially overlapping idea of “mild optimisation”.
Balancing - Enables the agents to keep many important objectives in balance, in such a manner that having average results in all objectives is preferred to extremes in a few. This addresses mainly the economic principle of diminishing returns in unbounded instrumental objectives, but also applies to homeostasis.
2. Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
Much of AI safety discussion revolves around the potential dangers posed by goal-driven artificial agents. In many of these discussions, the agent is assumed to maximise some utility metric over an unbounded timeframe. This simplification, while mathematically convenient, can yield pathological outcomes. A classic example is the so-called “paperclip maximiser”, a “utility monster” which steamrolls over other objectives to pursue a single goal (e.g. creating as many paperclips as possible) indefinitely. “Specification gaming”, Goodhart’s law, and even “instrumental convergence” are also closely related phenomena.
However, in nature, organisms do not typically behave like pure maximisers. Instead, they operate under homeostasis: a principle of maintaining various internal and external variables (e.g. temperature, hunger, social interactions) within certain "good enough" ranges. Going far beyond those ranges — too hot, too hungry, too socially isolated — leads to dire consequences, so an organism continually balances multiple needs. Crucially, "too much of a good thing" is just as dangerous as too little.
This post argues that an explicitly homeostatic, multi-objective model is a more suitable paradigm for AI alignment. Moreover, correctly modelling homeostasis increases AI safety, because homeostatic goals are bounded — there is an optimal zone rather than an unbounded improvement path. This bounding lowers the stakes of each objective and reduces the incentive for extreme (and potentially destructive) behaviours.
Homeostasis — the idea of multiple objectives each with a bounded “sweet spot” — offers a more natural and safer alternative to unbounded utility maximisation. By ensuring that an AI’s needs or goals are multi-objective and conjunctive, and that each is bounded, we significantly reduce the incentives for runaway or berserk behaviours.
Such an agent tries to stay in a "golden middle way", switching focus among its objectives according to whichever is most pressing. It avoids extremes in any single dimension because going too far throws off the equilibrium in the others. This balancing act also makes it more corrigible, more interruptible, and ultimately safer.
There are two distinct types of balancing involved:
1. Balancing of a single homeostatic objective - keeping the actual value not too low, not too high.
2. Balancing across objectives.
In short, modelling multi-objective homeostasis is a step toward creating AI systems that exhibit the sane, moderate behaviours of living organisms — an important element in ensuring alignment with human values. While no single design framework can solve all challenges of AI safety, shifting from “maximise forever” to “maintain a healthy equilibrium” is a crucial part of the solution space.