Systems of Bounded Rational Agents with Information-Theoretic Constraints

Specialization and hierarchical organization are important features of efficient collaboration in economic, artificial, and biological systems. Here, we investigate the hypothesis that both features can be explained by the fact that each entity of such a system is limited in a certain way. We propose an information-theoretic approach based on a free energy principle in order to computationally analyze systems of bounded rational agents that deal with such limitations optimally. We find that specialization allows a focus on fewer tasks, thus leading to a more efficient execution, but in turn, it requires coordination in hierarchical structures of specialized experts and coordinating units. Our results suggest that hierarchical architectures of specialized units at lower levels that are coordinated by units at higher levels are optimal, given that each unit's information-processing capability is limited and conforms to constraints on complexity costs.


Introduction
The question of how to combine a given set of individual entities in order to perform a certain task efficiently is a long-standing question shared by many disciplines, including economics, neuroscience, and computer science. Even though the explicit nature of a single individual might differ between these fields, e.g. an employee of a company, a neuron in a human brain, or a computer or processor as part of a cluster, they have one important feature in common that usually prevents them from functioning in isolation: they are all limited. In fact, this was the driving idea that inspired Herbert A. Simon's early work on decision-making within economic organizations (Simon, 1943, 1955), which earned him a Nobel prize in 1978. He suggested that a scientific behavioral grounding of economics should be based on bounded rationality, which has units mentioned above, non-operational units (selector nodes) that coordinate the activities of operational units. Depending on their individual resource constraints, the Free Energy principle assigns each unit to a region of specialization that is part of an optimal partitioning of the underlying decision space (see Section 4.3).
In particular, we find that, for a wide range of objectives and resource limitations (see Section 5), hierarchical systems with specialized experts at lower levels and coordinating units at higher levels generally outperform other structures.

Preliminaries
This section introduces the terminology required for our framework presented in Sections 3 and 4.

Notation
We use curly (calligraphic) letters, $\mathcal{W}$, $\mathcal{X}$, $\mathcal{A}$, etc., to denote sets of finite cardinality, in particular the underlying spaces of the corresponding random variables $W$, $A$, $X$, etc., whereas the values of these random variables are denoted by small letters, i.e. $w \in \mathcal{W}$, $a \in \mathcal{A}$, and $x \in \mathcal{X}$, respectively. We denote the space of probability distributions on a given set $\mathcal{X}$ by $\mathbb{P}_{\mathcal{X}}$. Given a probability distribution $p \in \mathbb{P}_{\mathcal{X}}$, the expectation of a function $f : \mathcal{X} \to \mathbb{R}$ is denoted by $\langle f \rangle_p := \sum_x p(x) f(x)$. If the underlying probability measure is clear without ambiguity we just write $\langle f \rangle$.
For a function $g$ with multiple arguments, e.g. for $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, $(x, y) \mapsto g(x, y)$, we denote the function $\mathcal{X} \to \mathbb{R}$, $x \mapsto g(x, y)$ for fixed $y \in \mathcal{Y}$ by $g(\cdot, y)$ (partial application), i.e. the dot indicates the variable of the new function. Similarly, for fixed $y \in \mathcal{Y}$, we denote a conditional probability distribution on $\mathcal{X}$ with values $p(x|y)$ by $p(\cdot|y)$. This notation shows the dependencies clearly without giving up the original function names and thus allows us to write more complicated expressions in a concise form. For example, if $F$ is a functional defined on functions of one variable, e.g. $F[f] := \sum_x f(x)$ for all functions $f : \mathcal{X} \to \mathbb{R}$, then evaluating $F$ on the function $g$ in its first variable while keeping the second variable fixed is simply denoted by $F[g(\cdot, y)]$. Here, the dot indicates on which argument of $g$ the functional $F$ is acting and at the same time it records that the resulting value (which equals $\sum_x g(x, y)$ in the case of the example) does not depend on a particular $x$ but on the fixed $y$.

Decision-making
Here, we consider (multi-task) decision-making as the process of observing a world state $w \in \mathcal{W}$, sampled from a given distribution $\rho \in \mathbb{P}_{\mathcal{W}}$, and choosing a corresponding action $a \in \mathcal{A}$ drawn from a posterior policy $P(\cdot|w) \in \mathbb{P}_{\mathcal{A}}$. Assuming that the joint distribution of $W$ and $A$ is given by $p(a, w) := \rho(w) P(a|w)$, then $P$ is the conditional probability distribution of $A$ given $W$. Unless stated otherwise, the capital letter $P$ always denotes a posterior, while the small letter $p$ denotes the joint distribution or a marginal of the joint (i.e. a dependent variable).
A decision-making unit is called an agent. An agent is rational if its posterior policy $P$ maximizes the expected utility
$$\langle U \rangle = \sum_{w \in \mathcal{W}} \rho(w) \sum_{a \in \mathcal{A}} P(a|w)\, U(a, w) \qquad (1)$$
for a given utility function $U : \mathcal{W} \times \mathcal{A} \to \mathbb{R}$. Note that the utility $U$ may itself represent an expected utility over consequences in the sense of von Neumann and Morgenstern (1944), where $W$ would serve as a context variable for different tasks. The posterior $P$ can be seen as a state-action policy that selects the best action $a \in \mathcal{A}$ with respect to a utility function $U$ given the state $w \in \mathcal{W}$ of the world.
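As an illustration of the expected utility (1), the following minimal sketch evaluates it for a uniform policy and for a deterministic, utility-maximizing policy. The toy numbers, the function names, and the storage convention `utility[w][a]` for $U(a, w)$ are assumptions made for this example only.

```python
def expected_utility(rho, policy, utility):
    """<U> = sum_w rho(w) sum_a P(a|w) U(a, w), with utility[w][a] storing U(a, w)."""
    return sum(
        rho[w] * sum(policy[w][a] * utility[w][a] for a in range(len(utility[w])))
        for w in range(len(rho))
    )


def rational_policy(utility):
    """Deterministic policy that puts all mass on one maximizing action per world state."""
    policy = []
    for row in utility:
        best = max(range(len(row)), key=row.__getitem__)
        policy.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return policy


# Hypothetical toy problem: 2 world states, 3 actions.
rho = [0.5, 0.5]
utility = [[1.0, 0.2, 0.0],
           [0.0, 0.3, 1.0]]
uniform = [[1 / 3] * 3 for _ in rho]

print(expected_utility(rho, uniform, utility))
print(expected_utility(rho, rational_policy(utility), utility))
```

The rational policy reaches the maximal expected utility of this toy problem, whereas the uniform policy, which ignores the world state, does not.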

Bounded rational agents
In the information-theoretic model of bounded rationality (Ortega and Braun, 2011, 2013; Genewein et al., 2015), an agent is bounded rational if its posterior $P$ maximizes (1) subject to the constraint
$$\big\langle D_{KL}(P(\cdot|W) \,\|\, q) \big\rangle_{\rho} = \sum_{w \in \mathcal{W}} \rho(w)\, D_{KL}(P(\cdot|w) \,\|\, q) \le D_0 \qquad (2)$$
for a given bound $D_0 > 0$ and a prior policy $q \in \mathbb{P}_{\mathcal{A}}$. Here, $D_{KL}(p \| q)$ denotes the Kullback-Leibler (KL) divergence between two distributions $p, q \in \mathbb{P}_{\mathcal{Y}}$ on a set $\mathcal{Y}$, defined by $D_{KL}(p \| q) := \sum_{y \in \mathcal{Y}} p(y) \log(p(y)/q(y))$. Note that, for $D_{KL}(p \| q)$ to be well-defined, $p$ must be absolutely continuous with respect to $q$, so that $q(y) = 0$ implies $p(y) = 0$. When $p$ or $q$ are conditional probabilities, we treat $D_{KL}(p \| q)$ as a function of the additional variables.
Given a world state $w$, the information-processing consists of transforming a prior $q$ to a world-state-specific posterior distribution $P(\cdot|w)$. Since $D_{KL}(P(\cdot|w) \| q)$ measures by how much $P(\cdot|w)$ diverges from $q$, the upper bound $D_0$ in (2) characterizes the limitation of the agent's average information-processing capability: if $D_0$ is close to zero, the posterior must be close to the prior for all world states, which means that $A$ contains only little information about $W$, whereas if $D_0$ is large, the posterior is allowed to deviate from the prior by larger amounts and therefore $A$ contains more information about $W$. We use the KL-divergence as a proxy for any resource measure, as any resource must be monotone in processed information, which is measured by the KL-divergence between prior and posterior.
Technically, maximizing expected utility under the constraint (2) is the same as minimizing expected complexity cost under the constraint of a minimal expected performance, where complexity is given by the expected KL-divergence between prior and posterior and performance by expected utility. Minimizing complexity means minimizing the number of bits required to generate the actions.
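The complexity cost described above, the expected KL-divergence between posterior and prior, can be computed directly. The sketch below is an illustration under assumptions: it measures information in bits (base-2 logarithm), and the function names and toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits; requires q[y] > 0 wherever p[y] > 0."""
    total = 0.0
    for py, qy in zip(p, q):
        if py > 0.0:
            if qy == 0.0:
                raise ValueError("p is not absolutely continuous with respect to q")
            total += py * math.log2(py / qy)
    return total


def expected_kl(rho, posterior, prior):
    """Average of D_KL(P(.|w) || q) over world states w drawn from rho."""
    return sum(r * kl_divergence(p_w, prior) for r, p_w in zip(rho, posterior))


# Hypothetical example: two equally likely world states, two actions.
rho = [0.5, 0.5]
prior = [0.5, 0.5]
deterministic = [[1.0, 0.0], [0.0, 1.0]]  # posterior fully determined by w
same_as_prior = [prior, prior]            # posterior ignores w

print(expected_kl(rho, deterministic, prior))  # cost of a perfect one-to-one mapping
print(expected_kl(rho, same_as_prior, prior))  # no processing, no cost
```

An agent whose bound lies below the first value cannot implement the deterministic mapping and has to stay closer to its prior.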

Free Energy principle
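By the variational method of Lagrange multipliers, the above constrained optimization problem is equivalent to the unconstrained problem
$$\max_{P} \Big( \langle U \rangle - \frac{1}{\beta} \big\langle D_{KL}(P \,\|\, q) \big\rangle \Big) \qquad (3)$$
where $\beta > 0$ is chosen such that the constraint (2) is satisfied. In the literature on information-theoretic bounded rationality (Ortega and Braun, 2011, 2013), the objective in (3) is known as the Free Energy $F$ of the corresponding decision-making process. In this form, the optimal posterior can be explicitly derived by determining the zeros of the functional derivative of $F$ with respect to $P$, yielding the Boltzmann-Gibbs distribution
$$P(a|w) = \frac{1}{Z(w)}\, q(a)\, e^{\beta U(a, w)}, \qquad Z(w) := \sum_{a \in \mathcal{A}} q(a)\, e^{\beta U(a, w)}. \qquad (4)$$
Note how the Lagrange multiplier $\beta$ (also known as inverse temperature) interpolates between an agent with zero processing capability that always acts according to its prior policy ($\beta = 0$) and a perfectly rational agent ($\beta \to \infty$). Note that plugging (4) back into the Free Energy (3) gives
$$\max_{P} F[P] = \frac{1}{\beta} \big\langle \log Z \big\rangle_{\rho} = \frac{1}{\beta} \sum_{w \in \mathcal{W}} \rho(w) \log Z(w). \qquad (5)$$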
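The role of $\beta$ can be checked numerically. The sketch below computes the posterior proportional to $q(a)\, e^{\beta U(a,w)}$ for a fixed prior and compares the Free Energy evaluated directly, as expected utility minus $1/\beta$ times the expected KL-divergence, with the average of $\log Z(w)$ divided by $\beta$. Function names and toy numbers are assumptions for illustration; natural logarithms are used.

```python
import math


def boltzmann_posterior(prior, utility_row, beta):
    """Posterior proportional to q(a) * exp(beta * U(a, w)) for one world state w."""
    weights = [q * math.exp(beta * u) for q, u in zip(prior, utility_row)]
    z = sum(weights)
    return [wt / z for wt in weights], z


def free_energy(rho, prior, utility, beta):
    """Return (<U> - <KL>/beta at the Boltzmann posterior, <log Z>/beta)."""
    direct, via_log_z = 0.0, 0.0
    for r, row in zip(rho, utility):
        post, z = boltzmann_posterior(prior, row, beta)
        eu = sum(p * u for p, u in zip(post, row))
        kl = sum(p * math.log(p / q) for p, q in zip(post, prior) if p > 0)
        direct += r * (eu - kl / beta)
        via_log_z += r * math.log(z) / beta
    return direct, via_log_z


# Hypothetical toy problem: 2 world states, 3 actions, uniform prior.
rho = [0.5, 0.5]
prior = [1 / 3, 1 / 3, 1 / 3]
utility = [[1.0, 0.2, 0.0],
           [0.0, 0.3, 1.0]]

print(free_energy(rho, prior, utility, beta=2.0))        # both entries agree
print(boltzmann_posterior(prior, utility[0], 0.0)[0])    # beta = 0: the prior
print(boltzmann_posterior(prior, utility[0], 50.0)[0])   # large beta: near-deterministic
```

For very large $\beta$ or large utilities, the exponentials should be stabilized by subtracting the maximal utility per world state before exponentiating; this is omitted here for readability.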

Optimal prior
The performance of a given bounded rational agent crucially depends on the choice of the prior policy $q$. Depending on $D_0$ and the explicit form of the utility function, it can be advantageous to a priori prefer certain actions over others. Therefore, optimal bounded rational decision-making includes optimizing the prior in (3). In contrast to (3), the modified optimization problem
$$\max_{P, q} \Big( \langle U \rangle - \frac{1}{\beta} \big\langle D_{KL}(P \,\|\, q) \big\rangle \Big) \qquad (6)$$
does not have a closed form solution. However, since the objective is convex in $(P, q)$, a unique solution can be obtained iteratively by alternating between fixing one and optimizing the other variable (Csiszár and Tusnády, 1984), resulting in a Blahut-Arimoto type algorithm (Arimoto, 1972; Blahut, 1972) that consists of alternating the equations
$$P(a|w) = \frac{1}{Z(w)}\, q(a)\, e^{\beta U(a, w)}, \qquad q(a) = p(a) = \sum_{w} \rho(w) P(a|w),$$
with $Z(w)$ given by (4). In particular, the optimal prior policy is the marginal $p$ of the joint distribution of $W$ and $A$. In this case, the average Kullback-Leibler divergence between prior and posterior coincides with the mutual information between $W$ and $A$,
$$\big\langle D_{KL}(P \,\|\, p) \big\rangle = \sum_{w} \rho(w) \sum_{a} P(a|w) \log \frac{P(a|w)}{p(a)} = I(W; A). \qquad (7)$$
It follows that the modified optimization principle (6) is equivalent to
$$\max_{P} \Big( \langle U \rangle - \frac{1}{\beta}\, I(W; A) \Big). \qquad (8)$$
Due to its equivalence to rate distortion theory (Shannon, 1959) (with a negative distortion measure given by the utility function), (8) is denoted as the rate distortion case of bounded rationality in (Genewein and Braun, 2013).
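The alternating scheme can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the uniform initialization of the prior, the fixed number of iterations, and all names are assumptions. The returned prior is the marginal of the returned posterior, so the average KL-divergence computed from them equals the mutual information $I(W; A)$.

```python
import math


def blahut_arimoto(rho, utility, beta, iterations=200):
    """Alternate the Boltzmann posterior update and the marginal prior update."""
    n_actions = len(utility[0])
    prior = [1.0 / n_actions] * n_actions  # assumed initialization
    posterior = []
    for _ in range(iterations):
        posterior = []
        for row in utility:
            weights = [q * math.exp(beta * u) for q, u in zip(prior, row)]
            z = sum(weights)
            posterior.append([wt / z for wt in weights])
        # The optimal prior is the marginal of the joint rho(w) * P(a|w).
        prior = [sum(r * post[a] for r, post in zip(rho, posterior))
                 for a in range(n_actions)]
    return posterior, prior


def mutual_information_bits(rho, posterior, prior):
    """I(W; A) in bits, assuming prior is the marginal of the posterior under rho."""
    return sum(r * p * math.log2(p / q)
               for r, post in zip(rho, posterior)
               for p, q in zip(post, prior) if p > 0)


# Hypothetical one-to-one task: 4 world states, 4 actions, U = 1 on the diagonal.
rho = [0.25] * 4
utility = [[1.0 if a == w else 0.0 for a in range(4)] for w in range(4)]

for beta in (0.01, 2.0, 20.0):
    post, prior = blahut_arimoto(rho, utility, beta)
    print(beta, mutual_information_bits(rho, post, prior))
```

In this toy task the processed information grows with $\beta$ from almost zero towards $\log_2 4 = 2$ bit, the amount needed to solve it perfectly.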

Multi-step and multi-agent systems
When multiple random variables are involved in a decision-making process, such a process constitutes a multi-step system (see Section 3). Consider the case of a prior over $\mathcal{A}$ that is conditioned on an additional random variable $X$ with values $x \in \mathcal{X}$, i.e. $q(\cdot|x) \in \mathbb{P}_{\mathcal{A}}$ for all $x \in \mathcal{X}$. Remember that we introduced a bounded rational agent as a decision-making unit that, after observing a world state $w$, transforms a single prior policy over a choice space $\mathcal{A}$ to a posterior policy $P(\cdot|w) \in \mathbb{P}_{\mathcal{A}}$. Therefore, in the case of a conditional prior, the collection of prior policies $\{q(\cdot|x)\}_{x \in \mathcal{X}}$ can be considered as a collection or ensemble of agents, or a multi-agent system, where for a given $x \in \mathcal{X}$, the prior $q(\cdot|x)$ is transformed to a posterior $P(\cdot|x, w) \in \mathbb{P}_{\mathcal{A}}$ by exactly one agent. Note that a single agent deciding about both $X$ and $A$ would instead be modelled by a prior of the form $q(x, a)$ with $x \in \mathcal{X}$ and $a \in \mathcal{A}$.
Hence, in order to combine multiple bounded rational agents, we first split the full decision-making process into multiple steps by introducing additional intermediate random variables (Section 3), which are then used to assign one or more agents to each of these steps (Section 4). In this view, we can regard a multi-agent decision-making system as performing a sequence of successive decision steps until an ultimate action is selected.

3 Multi-step bounded rational decision-making

Decision nodes
Let $W$ and $A$ denote the random variables describing the full decision-making process for a given utility function $U : \mathcal{W} \times \mathcal{A} \to \mathbb{R}$, as described in Section 2. In order to separate the full process into $N > 1$ steps, we introduce internal random variables $X_1, \dots, X_{N-1}$, which represent the outputs of additional intermediate bounded rational decision-making steps. For each $k$, let $\mathcal{X}_k$ denote the target space and $x_k \in \mathcal{X}_k$ a particular value of $X_k$. We call a random variable that is part of a multi-step decision-making system a (decision) node. For simplicity, we assume that all intermediate random variables are discrete (just like $W$ and $A$).
Here, we are treating feed-forward architectures originating at $X_0 := W$ and terminating in $X_N := A$. This allows us to label the variables $\{X_k\}_{k=0}^{N}$ according to the information flow, so that $X_j$ can potentially obtain information about $X_i$ only if $i < j$. The canonical factorization
$$p(w, x_1, \dots, x_{N-1}, a) = \rho(w)\, p(x_1|w)\, p(x_2|x_1, w) \cdots p(a|x_{N-1}, \dots, x_1, w)$$
of the joint probability distribution of $\{X_k\}_{k=0}^{N}$ therefore consists of the posterior policies of each decision node.
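For $N = 2$, the canonical factorization reads $p(w, x, a) = \rho(w)\, p(x|w)\, p(a|x, w)$. The following sketch assembles such a joint distribution from hypothetical posterior tables and checks that it is normalized and has $\rho$ as its marginal over $W$. All numbers and names are assumptions for illustration.

```python
from itertools import product


def joint_two_step(rho, P1, P2):
    """p(w, x, a) = rho(w) * P1(x|w) * P2(a|x, w), as a dict keyed by (w, x, a).

    P1[w][x] stores P1(x|w) and P2[w][x][a] stores P2(a|x, w).
    """
    joint = {}
    for w, x in product(range(len(rho)), range(len(P1[0]))):
        for a in range(len(P2[w][x])):
            joint[(w, x, a)] = rho[w] * P1[w][x] * P2[w][x][a]
    return joint


# Hypothetical tables: 2 world states, 2 intermediate values, 2 actions.
rho = [0.6, 0.4]
P1 = [[0.9, 0.1],
      [0.2, 0.8]]
P2 = [[[0.7, 0.3], [0.5, 0.5]],
      [[0.1, 0.9], [0.4, 0.6]]]

joint = joint_two_step(rho, P1, P2)
print(sum(joint.values()))                                        # total probability mass
print(sum(p for (w, x, a), p in joint.items() if w == 0))         # marginal of w = 0
```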

Two types of nodes: inputs and prior selectors
A specific multi-step architecture is characterized by specifying the explicit dependencies on the preceding variables for each node's prior and posterior, or better the missing dependencies. For example, in a given multi-step system, the posterior of the node $X_3$ might depend explicitly on the outputs of $X_1$ and $X_2$ but not on $W$, so that $P(x_3|x_2, x_1, w) = P(x_3|x_2, x_1)$. If its prior has the form $q(x_3|x_1)$, then $X_3$ has to process the output of $X_2$. Moreover, in this case, the actual prior policy $q(\cdot|x_1) \in \mathbb{P}_{\mathcal{X}_3}$ that is used by $X_3$ for decision-making is selected by $X_1$ (see Figure 1).
In general, the inputs $X_i, \dots, X_j$ that have to be processed by a particular node $X_k$ are given by the variables in the posterior that are missing from the prior, and, if its prior $q$ is conditioned on the outputs of $X_l, \dots, X_m$, then these nodes select which of the prior policies $\{q(\cdot|x_l, \dots, x_m)\}_{x_l \in \mathcal{X}_l, \dots, x_m \in \mathcal{X}_m} \subset \mathbb{P}_{\mathcal{X}_k}$ is used by $X_k$ for decision-making, i.e. for the transformation
$$q(x_k|x_l, \dots, x_m) \longrightarrow P(x_k|x_l, \dots, x_m, x_i, \dots, x_j).$$
We denote the collection of input nodes of $X_k$ by $X^k_{in}$ ($:= \{X_i, \dots, X_j\}$) and the prior selecting nodes of $X_k$ by $X^k_{sel}$ ($:= \{X_l, \dots, X_m\}$). The joint distribution of $X_0, \dots, X_N$ is then given by
$$p(x_0, \dots, x_N) = \rho(x_0) \prod_{k=1}^{N} P(x_k \,|\, x^k_{sel}, x^k_{in}) \qquad (9)$$
for all $x_k \in \mathcal{X}_k$ and $x^k_{sel} \in \mathcal{X}^k_{sel}$, $x^k_{in} \in \mathcal{X}^k_{in}$ ($k = 1, \dots, N$). Specifying the sets $X^k_{sel}$ and $X^k_{in}$ of selectors and inputs for each node in the system then uniquely characterizes a particular multi-step decision-making system. Note that we always have $(X^1_{sel}, X^1_{in}) = (\{\}, \{X_0\})$. Decompositions of the form (9) are often visualized by directed acyclic graphs, so-called DAGs (see e.g. Bishop, 2006, pp. 360). Here, in addition to the decomposition of the joint in terms of posteriors, we have added the information about the prior dependencies in terms of dashed arrows, as shown in Figure 1.

Figure 1. Example of a processing node that is part of a multi-step architecture with $N = 5$, visualized as a directed graph. Here, $X_3$ processes the output of $X_2$ by transforming a prior policy $p(x_3|x_1)$ to a posterior policy $P(x_3|x_2, x_1)$. The prior of $X_3$ being conditioned on the output of $X_1$ (indicated by the dashed arrow) means that $X_1$ determines which of the prior policies $\{p(\cdot|x_1)\}_{x_1 \in \mathcal{X}_1}$ is used by $X_3$ to process a given output of $X_2$.

Multi-step Free Energy principle
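If $P_k$ and $q_k$ denote the posterior and prior of the $k$-th node of an $N$-step decision process, then the Free Energy principle takes the form
$$\max_{P_1, q_1, \dots, P_N, q_N} \Big( \langle U \rangle - \sum_{k=1}^{N} \frac{1}{\beta_k} \big\langle D_{KL}(P_k \,\|\, q_k) \big\rangle \Big) \qquad (10)$$
where, in addition to the expectation over inputs, the average of $D_{KL}(P_k \| q_k)$ now also includes the expectation with respect to $X_{sel}$,
$$\big\langle D_{KL}(P_k \,\|\, q_k) \big\rangle = \sum_{x_{sel}, x_{in}} p(x_{sel}, x_{in})\, D_{KL}\big(P_k(\cdot|x_{sel}, x_{in}) \,\|\, q_k(\cdot|x_{sel})\big).$$
Since the prior policies only appear in the KL-divergences, and moreover, there is exactly one KL-divergence per prior, it follows as in 2.4 that for each $k = 1, \dots, N$ the optimal prior is the marginal given for all $x_k \in \mathcal{X}_k$ by
$$p_k(x_k|x_{sel}) = \sum_{x_{in}} p(x_{in}|x_{sel})\, P_k(x_k|x_{sel}, x_{in}) \qquad (11)$$
whenever $X^k_{sel} = x_{sel}$. Hence, the Free Energy principle can be simplified to
$$\max_{P_1, \dots, P_N} \Big( \langle U \rangle - \sum_{k=1}^{N} \frac{1}{\beta_k}\, I(X_k; X^k_{in} \,|\, X^k_{sel}) \Big) \qquad (12)$$
where $I(X; Y|Z)$ denotes the conditional mutual information of two random variables $X, Y$ given a third random variable $Z$. By optimizing (12) alternatingly, i.e. optimizing one posterior at a time while keeping the others fixed, we obtain for each $k = 1, \dots, N$,
$$P_k(x_k|x_{sel}, x_{in}) = \frac{p_k(x_k|x_{sel})}{Z_k(x_{sel}, x_{in})} \exp\big( \beta_k\, F_k[P_1, \dots, P_N](x_k, x_{sel}, x_{in}) \big) \qquad (13)$$
whenever $X^k_{sel} = x_{sel}$ and $X^k_{in} = x_{in}$. Here, $Z_k(x_{sel}, x_{in})$ denotes the normalization constant and $F_k[P_1, \dots, P_N]$ denotes the (effective) utility function on which the decision-making in $X_k$ is based. More precisely, given $\tilde{X} = (X_k, X^k_{sel}, X^k_{in})$, it is the Free Energy of the subsequent nodes in the system, i.e. for any value $\tilde{x}$ of $\tilde{X}$,
$$F_k[P_1, \dots, P_N](\tilde{x}) := \Big\langle U(a, w) - \sum_{i > k} \frac{1}{\beta_i} \log \frac{P_i(x_i|x^i_{sel}, x^i_{in})}{p_i(x_i|x^i_{sel})} \Big\rangle_{p(\cdot|\tilde{x})}. \qquad (14)$$
Here, $x^i_{in}$ and $x^i_{sel}$ are collections of values of the random variables in $X^i_{in}$ and $X^i_{sel}$, respectively. The final Blahut-Arimoto-type algorithm consists of iterating (13), (11), and (14) for each $k = 1, \dots, N$ until convergence is achieved. Note that, since each optimization step is convex (marginal convexity), convergence is guaranteed but generally not unique (Jain and Kar, 2017), so that, depending on the initialization, one might end up in a local optimum.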

Example: two-step information-processing
The cases of serial and parallel information-processing studied in (Genewein and Braun, 2013) are special cases of the multi-step decision-making systems introduced above. Both cases are two-step processes ($N = 2$) involving the variables $X_0 = W$, $X_1 = X$, and $X_2 = A$. The serial case is characterized by $(X^2_{sel}, X^2_{in}) = (\{\}, \{X_1\})$, and the parallel case by $(X^2_{sel}, X^2_{in}) = (\{X_1\}, \{X_0\})$. There is a third possible combination for $N = 2$, given by $(X^2_{sel}, X^2_{in}) = (\{\}, \{X_0, X_1\})$. However, it can be shown that this case is equivalent to the (one-step) rate distortion case from Section 2, because if $A$ has direct world state access, then any extra input to the final node $A = X_2$ that is not a prior selector contains redundant information.

4 Systems of bounded rational agents

4.1 From multi-step to multi-agent systems
As explained in 2.5 above, a single random variable $X_k$ that is part of an $N$-step decision-making system can represent a single agent or a collection of multiple agents, depending on the cardinality of $\mathcal{X}^k_{sel}$, i.e. whether $X_k$ has multiple priors which are selected by the nodes in $X^k_{sel}$ or not. Therefore, an $N$-step bounded rational decision-making system with $N > 1$ represents a bounded rational multi-agent system (of depth $N$).
For a given $k \in \{1, \dots, N\}$, each value $x \in \mathcal{X}^k_{sel}$ of $X^k_{sel}$ corresponds to exactly one agent in $X_k$. During decision-making, the agents that belong to the nodes in $X^k_{sel}$ choose which of the $|\mathcal{X}^k_{sel}|$ agents in $X_k$ is going to receive a given input $x_{in}$ (see 4.4 below for a detailed example). This decision is based on how well the selected agent $x$ will perform on the input $x_{in}$ by transforming its prior policy $p_k(\cdot|x)$ into a posterior policy $P_k(\cdot|x, x_{in})$, subject to the constraint
$$\sum_{x_{in}} p(x_{in}|x)\, D_{KL}\big(P_k(\cdot|x, x_{in}) \,\|\, p_k(\cdot|x)\big) \le D_x \qquad (15)$$
where $D_x > 0$ is a given bound on the agent's information-processing capability. As in multi-step systems, this choice is based on the performance measured by the Free Energy of the subsequent agents.

Multi-agent Free Energy principle
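In contrast to multi-step decision-making, the information-processing bounds are allowed to be functions of the agents instead of just the nodes, resulting in an extra Lagrange multiplier for each agent in the Free Energy principle (10). As in (12), optimizing over the priors yields the simplified Free Energy principle
$$\max_{P_1, \dots, P_N} \Big( \langle U \rangle - \sum_{k=1}^{N} \Big\langle \frac{1}{\beta_k(x^k_{sel})}\, I(X_k; X^k_{in} \,|\, X^k_{sel} = x^k_{sel}) \Big\rangle_{p(x^k_{sel})} \Big) \qquad (16)$$
which can be solved iteratively as explained in the previous section, the only difference being that the Lagrange parameters $\beta_k$ now depend on $x^k_{sel}$. Hence, for the posterior of an agent that belongs to node $k$, we have
$$P_k(x_k|x^k_{sel}, x^k_{in}) = \frac{p_k(x_k|x^k_{sel})}{Z_k(x^k_{sel}, x^k_{in})} \exp\big( \beta_k(x^k_{sel})\, F_k[P_1, \dots, P_N](x_k, x^k_{sel}, x^k_{in}) \big) \qquad (17)$$
where $\beta_k(x^k_{sel})$ is chosen such that the constraint (15) is fulfilled for all $x \in \mathcal{X}^k_{sel}$, and $F_k$ is given by (14) except that now we have
$$F_k[P_1, \dots, P_N](\tilde{x}) := \Big\langle U(a, w) - \sum_{i > k} \frac{1}{\beta_i(x^i_{sel})} \log \frac{P_i(x_i|x^i_{sel}, x^i_{in})}{p_i(x_i|x^i_{sel})} \Big\rangle_{p(\cdot|\tilde{x})}. \qquad (18)$$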
The resulting Blahut-Arimoto-type algorithm is summarized in Algorithm 1.
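Choosing a Lagrange multiplier so that an information bound holds can be done by a one-dimensional search. The sketch below is a simplification and not Algorithm 1: it treats a single agent with a fixed prior, measures the bound in bits, and uses bisection, which works here because the expected KL-divergence of the Boltzmann posterior is non-decreasing in $\beta$ when the prior is held fixed. All names and the toy task are assumptions.

```python
import math


def expected_kl_bits(rho, prior, utility, beta):
    """Expected KL (bits) between the Boltzmann posterior and a fixed prior."""
    total = 0.0
    for r, row in zip(rho, utility):
        u_max = max(row)  # subtract the maximum for numerical stability
        weights = [q * math.exp(beta * (u - u_max)) for q, u in zip(prior, row)]
        z = sum(weights)
        total += r * sum((wt / z) * math.log2((wt / z) / q)
                         for wt, q in zip(weights, prior) if wt > 0)
    return total


def beta_for_bound(rho, prior, utility, bound_bits, beta_max=1e3, tol=1e-9):
    """Largest beta in [0, beta_max] whose expected KL does not exceed bound_bits."""
    if expected_kl_bits(rho, prior, utility, beta_max) <= bound_bits:
        return beta_max  # the bound is not active within the search range
    lo, hi = 0.0, beta_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_kl_bits(rho, prior, utility, mid) > bound_bits:
            hi = mid
        else:
            lo = mid
    return lo


# Hypothetical one-to-one task with 4 world states and 4 actions.
rho = [0.25] * 4
prior = [0.25] * 4
utility = [[1.0 if a == w else 0.0 for a in range(4)] for w in range(4)]

beta = beta_for_bound(rho, prior, utility, bound_bits=1.0)
print(beta, expected_kl_bits(rho, prior, utility, beta))
```

In the full multi-agent iteration the prior changes between steps, so such a search would have to be repeated for each agent in every iteration.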

Specialization
Even though a given multi-agent architecture predetermines the underlying set of choices for each agent, only a small part of such a set might be used by a given agent in the optimized system. For example, all agents in the final step can potentially perform any action $a \in \mathcal{A}$ (see Figure 2 and the Example in 4.4 below). However, depending on their individual information-processing capabilities, the optimization over the agents' priors can result in a (soft) partitioning of the full action space $\mathcal{A}$ into multiple chunks, where each of these chunks is given by the support of the prior of a given agent $x$, $\mathrm{supp}(p(\cdot|x)) \subset \mathcal{A}$. Note that the resulting partitioning is not necessarily disjoint, since agents might still be sharing a number of actions, depending on their available information-processing resources. If the processing capability is low compared to the number of possible actions in the full space, and if there are enough agents at the same level, then this partitioning allows each agent to focus on a smaller number of options to choose from, provided that the coordinating agents have enough resources to decide between the partitions reliably.
Therefore, the amount of prior adaptation of an agent, i.e. by how much its optimal prior $p$ deviates from a uniform prior $p_0$ over all accessible choices, which is measured by the KL-divergence $D_{KL}(p \| p_0)$, determines its degree of specialization. More precisely, we define the specialization of an agent with prior $p$ and choice space $\mathcal{X}$ by
$$S[p] := \frac{D_{KL}(p \,\|\, p_0)}{\log |\mathcal{X}|} = 1 - \frac{H[p]}{\log |\mathcal{X}|} \qquad (19)$$
where $H[p] := -\sum_x p(x) \log p(x)$ denotes the Shannon entropy of $p$. By normalizing with $\log |\mathcal{X}|$, we obtain a quantity between 0 and 1, since $0 \le H[p] \le \log |\mathcal{X}|$. Here, $S[p] = 0$ corresponds to $H[p] = \log |\mathcal{X}|$, which means that the agent is completely unspecialized, whereas $S[p] = 1$ corresponds to $H[p] = 0$, which implies that $p$ has support on a single option $x^* \in \mathcal{X}$, meaning that the agent deterministically always performs the same action and therefore is fully specialized.

Figure 2. Here, every node, and therefore every agent, has access to the world states (big circle). $X_1$ consists of one agent that decides which of the $|\mathcal{X}_1| = 3$ agents in $X_2$ obtains a given world state as input. The selected agent in $X_2$ selects which of the $|\mathcal{X}_2| = 2$ agents connected to it, out of the $|\mathcal{X}_1| \cdot |\mathcal{X}_2| = 6$ agents in $A$, obtains the world state to perform the final decision about an action $a \in \mathcal{A}$ (grey circles on the right). In our notation introduced below, this architecture is labelled by $(1, 4)_{[1,3,(3,2)]}$ (see Section 5.1).
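The specialization measure, one minus the Shannon entropy of the prior divided by the logarithm of the number of choices, is straightforward to evaluate. In this minimal sketch the function name and the example priors are assumptions for illustration.

```python
import math


def specialization(prior):
    """S[p] = 1 - H[p] / log|X| for a prior over a finite choice space X."""
    n = len(prior)
    if n < 2:
        raise ValueError("specialization needs at least two choices")
    entropy = -sum(p * math.log(p) for p in prior if p > 0)
    return 1.0 - entropy / math.log(n)


print(specialization([0.25, 0.25, 0.25, 0.25]))  # unspecialized agent
print(specialization([0.5, 0.5, 0.0, 0.0]))      # uses half of its options
print(specialization([1.0, 0.0, 0.0, 0.0]))      # fully specialized agent
```

An agent whose prior is uniform over two of its four options has entropy $\log 2$ and therefore a specialization of one half.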

Example: Hierarchical multi-agent system with three levels
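Consider the example of an architecture of 10 agents shown in Figure 2 that are combined via the 3-step decision-making system given by
$$(X^1_{sel}, X^1_{in}) = (\{\}, \{W\}), \quad (X^2_{sel}, X^2_{in}) = (\{X_1\}, \{W\}), \quad (X^3_{sel}, X^3_{in}) = (\{X_1, X_2\}, \{W\}), \qquad (20)$$
as visualized in the upper left corner of Figure 2. The number of agents in each node is given by the cardinality of the target space of the selecting node(s) (or equals one if there are no selectors). Hence, $X_1$ consists of one agent, $X_2$ consists of $|\mathcal{X}_1|$ agents, and $A$ consists of $|\mathcal{X}_1| \cdot |\mathcal{X}_2|$ agents. For example, if we have $|\mathcal{X}_1| = 3$ and $|\mathcal{X}_2| = 2$, as in Figure 2, then this results in a hierarchy of 1, 3 and 6 agents. The joint probability of the system characterized by (20) is given by
$$p(w, x_1, x_2, a) = \rho(w)\, P_1(x_1|w)\, P_2(x_2|x_1, w)\, P_3(a|x_1, x_2, w),$$
and the Free Energy by
$$F = \langle U \rangle - \frac{1}{\beta_1} I(W; X_1) - \Big\langle \frac{1}{\beta_2(x_1)} I(W; X_2 | X_1 = x_1) \Big\rangle_{p(x_1)} - \Big\langle \frac{1}{\beta_3(x_1, x_2)} I(W; A | X_1 = x_1, X_2 = x_2) \Big\rangle_{p(x_1, x_2)},$$
where the priors $p_1$, $p_2$, and $p_3$ are given by the marginals (11), i.e.
$$p_1(x_1) = \sum_w \rho(w) P_1(x_1|w), \quad p_2(x_2|x_1) = \sum_w p(w|x_1) P_2(x_2|x_1, w), \quad p_3(a|x_1, x_2) = \sum_w p(w|x_1, x_2) P_3(a|x_1, x_2, w).$$
By (13), the posteriors that iteratively solve the Free Energy principle are
$$P_1(x_1|w) = \frac{p_1(x_1)}{Z_1(w)} e^{\beta_1 F_1(w, x_1)}, \quad P_2(x_2|x_1, w) = \frac{p_2(x_2|x_1)}{Z_2(w, x_1)} e^{\beta_2(x_1) F_2(w, x_1, x_2)}, \quad P_3(a|x_1, x_2, w) = \frac{p_3(a|x_1, x_2)}{Z_3(w, x_1, x_2)} e^{\beta_3(x_1, x_2) U(a, w)},$$
where, by (14) and (18),
$$F_1(w, x_1) = \sum_{x_2} P_2(x_2|x_1, w) \Big[ F_2(w, x_1, x_2) - \frac{1}{\beta_2(x_1)} \log \frac{P_2(x_2|x_1, w)}{p_2(x_2|x_1)} \Big],$$
$$F_2(w, x_1, x_2) = \sum_{a} P_3(a|x_1, x_2, w) \Big[ U(a, w) - \frac{1}{\beta_3(x_1, x_2)} \log \frac{P_3(a|x_1, x_2, w)}{p_3(a|x_1, x_2)} \Big].$$
Given a world state $w \in \mathcal{W}$, the agent in $X_1$ decides which of the three agents in $X_2$ obtains $w$ as an input. This narrows down the possible choices for the selected agent in $X_2$ to two out of the six agents in $A$. The selected agent performs the final decision by choosing an action $a \in \mathcal{A}$. Depending on its degree of specialization, which is a result of its own and the coordinating agents' resources, this agent will choose its action from a certain subset of the full space $\mathcal{A}$.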
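The agent counts of this three-level example follow from the rule that a node contains one agent per joint value of its selecting nodes. A minimal sketch of this bookkeeping is given below; the dictionary layout and the function name are assumptions for illustration.

```python
from math import prod


def agents_per_node(cardinality, selectors):
    """Agents in a node = product of the cardinalities of its selectors (1 if none)."""
    return {node: prod(cardinality[s] for s in sel) for node, sel in selectors.items()}


# Three-level hierarchy of the example: |X1| = 3 and |X2| = 2.
cardinality = {"X1": 3, "X2": 2}
selectors = {"X1": [], "X2": ["X1"], "A": ["X1", "X2"]}

counts = agents_per_node(cardinality, selectors)
print(counts)                # {'X1': 1, 'X2': 3, 'A': 6}
print(sum(counts.values()))  # 10
```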

Optimal Architectures
Here, we show how the above framework can be used to determine optimal architectures of bounded rational agents. Summarizing the assumptions made in the derivations, the multi-agent systems that we analyze must fulfill the following requirements:
(i) The information-flow is feed-forward: An agent in $X_k$ can obtain information directly from another agent that belongs to $X_m$ only if $m < k$.
(ii) Intermediate agents cannot be endpoints of the decision-making process: the information-flow always starts with the processing of W and always ends with a decision a ∈ A.
(iii) A single agent is not allowed to have multiple prior policies: Agents are the smallest decision-making unit, in the sense that they transform a prior to a posterior policy over a set of actions in one step.
The performance of the resulting architectures is measured with respect to the expected utility they are able to achieve under a given set of resource constraints. To this end, we need to specify (1) the objective for the full decision-making process, (2) the number $N$ of decision-making steps in the system, (3) the maximal number $n$ of agents to be distributed among the nodes, and (4) the individual resource constraints $\{D_1, \dots, D_n\}$ of those agents.
We illustrate the specifications (1)-(4) with a toy example in Section 5.2 by showcasing and explicitly explaining the differences in performance of several architectures. Moreover, we provide a broad performance comparison in Section 5.3, where we systematically vary a set of objective functions and resource constraints, in order to determine which architectural features most affect the overall performance. For simplicity, in all simulations we limit ourselves to architectures with $N \le 3$ nodes and $n \le 10$ agents. In the following section, we start by describing how we characterize the architectures conforming to the requirements (i)-(iii).

Characterization of architectures
Type. In view of property (ii) above, we can label any $N$-step decision-making process by a tuple $(i, j)$, which we call the type of the architecture, where $i$ characterizes the relation between the first $N - 1$ variables $W, X_1, \dots, X_{N-1}$, and $j$ determines how these variables are connected to $X_N = A$.

Figure 3. Overview of the resulting architectures for $N \le 3$, each of them being labelled by its type.

Shape. After the number of nodes has been fixed, the remaining property that characterizes a given architecture is the number of agents per node. For most architectures there are multiple possibilities to distribute a given number of agents among the nodes, even when neglecting individual differences in resource constraints. We call such a distribution a shape, denoted by $[n_1, n_2, \dots]$, where $n_k$ denotes the number of agents in node $k$. Note that not all architectures will be able to use the full number of available agents, most notably the one-step rate distortion case (1 agent) and the two-step serial case (2 agents). For these systems, we always use the agents with the highest available resources in our simulations. For example, for $N \le 3$ the resulting shapes for a maximum of $n = 10$ agents are as follows:
• [1] for (−1, ), [1, 1] for (0, ), and [1,9] for (1, ), .

Figure 5. The set of phone calls $\mathcal{W}$ is partitioned into three separate regions, corresponding to three different topics about which customers might have complaints or questions. Each of these can be divided into two subcategories of four customer calls each. For each phone call there is exactly one answer that achieves the best result ($U = 1$). Moreover, the responses that belong to one subcategory of calls are also suitable for the other calls in that particular subcategory, albeit slightly less effective ($U = 0.85$) than the optimal answers. Similarly, the responses that belong to the same topic of calls are still a lot better ($U = 0.7$) than responses to other topics ($U = 0$).

Example: Callcenter
Consider the operation of a company's callcenter as a decision-making process, where customer calls (world states) must be answered with an appropriate response (action) in order to achieve high customer satisfaction (utility). The utility function shown in Figure 5 on the left can be viewed as a simplistic model for a real-world callcenter of a big company such as a communication service provider. In this simplification, there are 24 possible customer calls that belong to three separate topics, for example questions related to telephone, internet, or television, which can be further subdivided into two subcategories, for example consisting of questions concerning the contract or problems with the hardware. See the description of Figure 5 for the explicit utility values.
Handling all possible phone calls perfectly by always choosing the corresponding response with maximum utility requires $\log_2(24) \approx 4.6$ bit (see Figure 5). However, in practice a single agent is usually not capable of knowing the optimal answers to every single type of question. For our example this means that the callcenter only has access to agents with information-processing capability less than 4.6 bit. It is then required to organize the agents in a way so that each agent only has to deal with a fraction of the customer calls. This is often realized by first passing the phone call through several filters in order to forward it to a specialized agent. Arranging these selector or filter units in a strict hierarchy then corresponds to architectures of the form of (1, 4) or (1, 5) (see below for a comparison of these two), where at each stage a single operator selects how a call is forwarded. In contrast, architectures of the form of (2, 4) allow for multiple independent filters working in parallel, for example realized by multiple trained neural networks, where each is responsible for a particular feature of the call (for example, one node deciding about the language of the call, and another node deciding about the topic). In the following we do not discriminate between human and artificial decision-makers, since both can qualify equally well as information-processing units.
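The utility structure of the callcenter and the information needed for a perfect response can be sketched as follows. The utility values (1, 0.85, 0.7, 0) are those given in the description of Figure 5; the indexing is an assumption made for this illustration: calls and responses are numbered 0 to 23 so that subcategories and topics form contiguous blocks, and response $i$ is the optimal answer to call $i$.

```python
import math

N_TOPICS, N_SUBCATS, CALLS_PER_SUBCAT = 3, 2, 4
N_CALLS = N_TOPICS * N_SUBCATS * CALLS_PER_SUBCAT  # 24 calls and 24 responses


def callcenter_utility(call, response):
    """Utility of a response to a call under the assumed block indexing."""
    if call == response:
        return 1.0   # the optimal answer
    if call // CALLS_PER_SUBCAT == response // CALLS_PER_SUBCAT:
        return 0.85  # answer from the same subcategory
    calls_per_topic = N_SUBCATS * CALLS_PER_SUBCAT
    if call // calls_per_topic == response // calls_per_topic:
        return 0.7   # answer from the same topic
    return 0.0       # answer from another topic


utility = [[callcenter_utility(w, a) for a in range(N_CALLS)] for w in range(N_CALLS)]

print(math.log2(N_CALLS))   # bits needed to answer every call optimally
print(math.log2(N_TOPICS))  # bits needed to identify the topic of a call
```

With equally likely calls, identifying only the topic requires $\log_2 3 \approx 1.6$ bit, which is considerably less than the roughly 4.6 bit needed to pick the optimal response directly.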
Assume that there are $n = 10$ bounded rational agents available. Considering the given utility function, the architectures $(1, 4)_{[1,3,(3,2)]}$ (shown in Figure 2) and $(1, 5)_{[1,3,6]}$ (shown in Figure 4) might be obvious choices as they represent the hierarchical structure of the utility function. With an information bound of 1.6 ($\approx \log_2(3)$) bit for the first agent and 0.1 bit for the rest, the optimal prior policies for $(1, 5)_{[1,3,6]}$ obtained by our Free Energy principle are shown in Figure 6. We can see that, for this architecture, the choice $x_1$ of the agent at the first step corresponds to the general topic of the phone call, and the decisions $x_2$ of the three agents at the second stage correspond to the subcategory in which one of the six agents at the final stage is specialized; this agent then makes the decision about the final response $a$ by picking one of the four actions in the support of its prior. We can see in Figure 7 on the left that a hierarchical structure as in $(1, 5)_{[1,3,6]}$ or $(1, 4)_{[1,3,(3,2)]}$ is indeed superior when compared with the architecture $(2, 4)_{[1,1,(2,4)]}$, because there is no good selector for the second filter. We have also added two architectures to the comparison that have a bottleneck of the information flow at either end of the decision-making process, $(0, 3)_{[1,1,8]}$ and $(1, 0)_{[1,8,1]}$ (see Figure 4 for a visualization), which perform considerably worse than the others: in $(0, 3)_{[1,1,8]}$ the first agent is the only one who has direct contact with the customer and passes the filtered information on to everybody else, whereas in $(1, 0)_{[1,8,1]}$ the customer talks to multiple agents, but these cannot make any decisions and only pass on the information to a final decision node, which has to select from all possible options.
Interestingly, as can be seen on the right side of Figure 7, when changing the resource bounds such that the first agent only has $D_1 = 1$ bit instead of 1.6 and the second agent has $D_2 = 0.5$ bit instead of 0.1, then the strictly hierarchical architectures $(1, 5)_{[1,3,6]}$ and $(1, 4)_{[1,3,(3,2)]}$ are outperformed by the architecture $(2, 4)_{[1,1,(2,4)]}$, because their first agent is not able to perfectly distinguish between the three topics anymore. This is an ideal situation for $(2, 4)_{[1,1,(2,4)]}$, since here the total information-processing for filtering the phone calls is split up efficiently between the first two agents in the system.
Note that (1, 4) and (1, 5) do not necessarily perform identically (as can be seen on the right in Figure 7), even though the structure of the utility function might suggest that it is ideal for $(1, 5)_{[1,3,6]}$ to always have the optimal priors shown in Figure 6. However, this crucially depends on the given information-processing bounds. In Figure 8, we illustrate the difference between the two types in more detail, by showing the processed information that can actually be achieved per agent in the respective architecture for an information bound of $D = (0.4, 2.6, 2.6, 2.6, 0.4, \dots, 0.4)$. When the first agent in the hierarchy has low capacity, then the rigid structure of (1, 4) is penalized because the agents at the second stage cannot compensate for the errors of the first agent, irrespective of their capacity. In contrast, for (1, 5), the connection between the second stage and the executing stage can be changed freely, which leads to ignoring the first agent and letting the three agents in the second stage determine the distribution of phone calls completely. In this sense, (1, 5) is more robust to errors in the first filter than (1, 4).

Systematic performance comparison
In this section, we move away from an explicit toy example to a broad performance comparison of all architectures for N ≤ 3, averaged over multiple types of utility functions and a large number of resource constraints (as defined below). In Section 6.1, this is supplemented with an analysis of the architectural features that best explain the performances.
Objectives. We compare all possible architectures for twelve different utility functions, {U k }, k = 1, . . . , 12, defined on a world and action space of |W| = |A| = 20 elements, and we assume the same cardinality for the range of all hidden variables. Note that the cardinality of the target set X for selector nodes X ∈ X sel is given by the number of agents it decides about. In particular, we consider three kinds of utility functions (one-to-one, many-to-one, one-to-many) that we vary in a 2×2 paradigm, where the first dimension is the number of maximum utility peaks (single, multiple) and the second dimension is the range of utility values (binary, multi-valued). The utility functions are visualized in Figure 9, where the three kinds of functions correspond to the three rows of the plot. A one-to-one scenario applies to a needle-in-a-haystack situation where each world state affords only a unique action and, vice versa, each optimal action uniquely identifies the world state, for example an absolute identification task. A many-to-one scenario allows for abstractions in the world states, for example in categorization when multiple instances are judged to belong to the same class (e.g. vegetables are boiled, fruit is eaten raw). A one-to-many scenario allows for abstractions in the action space, for example in hierarchical motor control when a grasp action can be performed in many different ways.
Resource limitations. We are considering three schemes of resource constraints:
(i) Same constraints for all agents.
(ii) Same constraints for all agents but one, which has a higher limit than the other agents.
(iii) Same constraints for all but two agents, whose limits may differ from each other and are both higher than those of all the other agents.
For (i), we compare 20 sets of constraints {D 0 , D 1 , . . . } with D i equally spaced in the range between 0 and 3 bits; for (ii), we compare 39 sets in the same range, with the high-resource agent having 1, 2, or 3 bits; and for (iii), we allow 89 sets with constraints similar to those in (ii) but with additional combinations for the second high-resource agent.
Simulation results. The performance of an architecture is given by its expected utility with respect to a given objective and a given information bound as defined above.
In Figure 10, we show which of the architectures won at least one condition, together with the proportion of conditions won by each of these architectures. We can see that (2, 4) [1,1,(2,4)] overall outperforms all the other systems (see Figure 4 for a visualization). When all agents have the same resource constraints, the architecture (1, 4) [1,3,(3,2)] is a strong second winner; however, this is not the case if one or two agents have more resources than the rest. It is not surprising that in these situations both the parallel case, with one high-resource agent distributing the work among the low-resource agents, and the case of a single agent that does everything by itself perform well.
A closer look at the achieved expected utilities, however, shows that several architectures perform almost equally well in many conditions. In order to increase comparability between the different utility functions, we measure performance in terms of a relative score, which, for a given utility function and resource constraint, is given by the architecture's expected utility divided by the maximum expected utility of all architectures. The score averaged over all conditions is shown for each architecture in the top row of Figure 11. As expected, the architecture that won the most conditions also has the highest overall performance, but multiple architectures are very close. The top three architectures are (2, 4) [1,1,(2,4)] , (1, 5) [1,3,6] , and (1, 4) [1,3,(3,2)] , which have been visualized above (Figures 2 and 4).
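The relative score and the proportion of conditions won can be sketched in a few lines. This is a minimal illustration, not the code used for the simulations: the function names, the dictionary layout, and the toy numbers are assumptions, and the sketch further assumes that the best expected utility in each condition is positive.

```python
def relative_scores(expected_utility):
    """Average relative score per architecture.

    expected_utility maps an architecture name to a list of expected
    utilities, one entry per condition (utility function x resource bound).
    """
    names = list(expected_utility)
    n_conditions = len(expected_utility[names[0]])
    # Best expected utility reached by any architecture in each condition.
    best = [max(expected_utility[a][c] for a in names) for c in range(n_conditions)]
    return {
        a: sum(expected_utility[a][c] / best[c] for c in range(n_conditions)) / n_conditions
        for a in names
    }


def win_proportions(expected_utility):
    """Fraction of conditions in which each architecture attains the maximum."""
    names = list(expected_utility)
    n_conditions = len(expected_utility[names[0]])
    wins = {a: 0 for a in names}
    for c in range(n_conditions):
        best = max(expected_utility[a][c] for a in names)
        for a in names:
            if expected_utility[a][c] == best:
                wins[a] += 1
    return {a: wins[a] / n_conditions for a in names}


# Hypothetical toy data: two architectures, two conditions.
toy = {"A": [0.8, 0.6], "B": [0.4, 1.0]}
scores = relative_scores(toy)
wins = win_proportions(toy)
```

In this toy case each architecture wins one condition, yet their average relative scores differ, which is why the score gives a finer picture than the win count alone.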
A better understanding of their performances under different resource constraints can be gathered from the remaining graphs in Figure 11. In the second row we can see that the top three overall architectures also perform best for almost all utility functions when averaged over the information bounds. The last three graphs in Figure 11 show the expected utility of each architecture averaged over all utility functions for each information bound. We can see how the expected utility increases with higher information bounds, for some architectures more than for others. The top three architectures perform differently for most of the bounds, with spans of bounds where each of them clearly outperforms the others.

Analysis of the simulations
There are plenty of factors that influence the performance of each of the given architectures. Here, we attempt to unfold the features that determine their performances in the clearest way. To this end, we compare the architectures with respect to the following quantities:
Average specialization of operational agents: the specialization (19) averaged over all agents in the final stage of the architecture.
Hierarchical: boolean value that specifies whether an architecture is hierarchical or not, meaning that consecutive nodes are occupied by an increasing number of agents.
Agents with direct w-access: the number of agents with direct world state access.
Operational agents with direct w-access: the number of agents in the last node of the architecture.
Number of w-bottlenecks: the total number of nodes that are missing direct access to the world state.
As can be seen from Figure 12, we found that these architectural features explain the differences in performance quite well. More precisely, the architectures can be roughly grouped into three different categories (indicated by slightly different color saturations in Figure 12):
The poorest performing group consists of architectures that have one or two w-bottlenecks and therefore have only a few agents with direct w-access; in particular, none of their operational agents has direct w-access. Moreover, in this group, most architectures are not hierarchical at all, and their operational agents have low specialization, with two exceptions that both have two w-bottlenecks.
The architectures with medium performance have maximally one w-bottleneck and many of them are hierarchical. Here, those systems that have operational units with high specialization are missing direct w-access, and the systems that have operational units with direct w-access have low specialization.
All architectures in the top group have many agents with direct world-state access and they have no w-bottlenecks. Interestingly, the best six architectures are all strictly hierarchical. Moreover, the order of performance is almost in direct accordance with the average specialization of the operational agents.
Overall, we can say that it is best to have as many operational units as needed to discriminate the actions well, as long as the coordinating agents have enough resources to discriminate between them properly. The architecture (2, 4) [1,1,(2,4)] has eight operational agents, which are managed by two coordinating units that need at most two bits (for choosing among four agents) and one bit (for choosing among two agents) in order to perform well. Both of the other top three architectures, (1, 5) [1,3,6] and (1, 4) [1,3,(3,2)] , have six operational agents, which are managed by three coordinating units, so that each of them needs at most one bit. But compared to (2, 4) [1,1,(2,4)] , there are fewer agents to spare for the operational stage. Hence, if the operational units have low resources, there is always a trade-off between the number of operational units and the resources of the coordinating ones.
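The bit counts quoted for the coordinating units follow from the fact that the information needed to select one of k subordinates is at most log2(k) bits, the entropy of a uniform choice among k options. The short sketch below merely tabulates this upper bound; the function name and the list of fan-outs are illustrative assumptions.

```python
import math


def max_selector_bits(n_subordinates):
    """Upper bound (in bits) on the information a selector needs to
    pick one out of n_subordinates agents without error."""
    return math.log2(n_subordinates)


# Fan-outs that occur in the discussion: a selector choosing among
# two, three, or four subordinate agents.
bits = {k: max_selector_bits(k) for k in (2, 3, 4)}
for k, b in bits.items():
    print(f"{k} subordinates -> at most {b:.2f} bit")
```

A selector with three subordinates thus needs about 1.58 bits, which matches the bound of roughly 1.6 bit used for the first agent in the toy example.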

Limitations of our analysis
The analysis presented above only provides a rough explanation of the differences in performance. Which architecture is optimal depends strongly on the actual information bounds of each agent. In all of our conditions, we assumed that most agents have the same processing capabilities, which is why there is a certain bias towards architectures that perform well under this assumption (low variance in choice-per-agent ratio across the agents).
Due to the large number of Lagrange parameters in the Free Energy principle (16), the data generation was done by running the Blahut-Arimoto-type algorithm for 10,000 different combinations of parameters for each of the architectures, for each of the types of resource limitations (i)-(iii) in Section 5, and for each of the utility functions defined in Section 5. For a given information bound, the corresponding parameters were determined by looking for the points with the highest Free Energy that still respect the bound. A better approach would be to enhance the global parameter search by a more fine-grained local search. Another possibility is to use an evolutionary algorithm, where each population is given by multiple sets of parameters and the information constraints are built in by a method similar to (Chehouri et al., 2016). This works well but requires significantly more time to process.
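The selection step described above, picking among the sampled parameter combinations the one with the highest Free Energy that still respects the bound, can be sketched as a simple filter followed by a maximum. The record layout (`params`, `information`, `free_energy`), the tolerance, and the numbers are hypothetical and only serve the illustration.

```python
def best_within_bound(runs, bounds, tol=1e-9):
    """Return the run with the highest Free Energy among those whose
    per-agent processed information stays within the given bounds,
    or None if no sampled run is feasible."""
    feasible = [
        run for run in runs
        if all(used <= limit + tol for used, limit in zip(run["information"], bounds))
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda run: run["free_energy"])


# Hypothetical results of three runs for a system with two agents.
runs = [
    {"params": (1.0, 1.0), "information": (0.4, 0.1), "free_energy": 0.52},
    {"params": (5.0, 1.0), "information": (1.2, 0.1), "free_energy": 0.71},
    {"params": (9.0, 3.0), "information": (1.5, 0.6), "free_energy": 0.80},
]
chosen = best_within_bound(runs, bounds=(1.6, 0.5))
```

Here the third run has the highest Free Energy but exceeds the second agent's bound, so the second run is selected. With a coarse parameter grid the selected point can lie well below the bound, which is the motivation for the finer local search mentioned above.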
Since the Blahut-Arimoto-type algorithm is not guaranteed to converge to a global maximum, the resulting values for the expected utility and mutual information for a given set of parameters can depend on the initialization of the algorithm. In practice, this variation is small enough that it influences the average performance over multiple conditions only by a negligible amount. However, direct comparisons of architectures for a given information bound and utility function should be repeated multiple times to make sure that the results are stable.
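To illustrate the kind of stability check suggested here, the following sketch runs a Blahut-Arimoto-type iteration from several random initializations and compares the outcomes. It covers only the one-step case (a single agent with posterior proportional to prior times exp(β U) and the prior set to the marginal of the posterior), not the multi-step algorithm used in the simulations; the function names, the toy utility, and the parameter values are assumptions. In this simple one-step example the restarts are expected to agree, whereas in multi-step architectures the same loop is where differences between initializations would show up.

```python
import math
import random


def blahut_arimoto(utility, p_w, beta, n_iter=200, seed=0):
    """One-step Blahut-Arimoto-type iteration with a random initial prior.
    utility[w][a] is the utility of action a in world state w."""
    rng = random.Random(seed)
    n_w, n_a = len(utility), len(utility[0])
    raw = [rng.random() + 1e-3 for _ in range(n_a)]
    prior = [r / sum(raw) for r in raw]
    for _ in range(n_iter):
        # Posterior: Boltzmann distribution with respect to the current prior.
        post = []
        for w in range(n_w):
            weights = [prior[a] * math.exp(beta * utility[w][a]) for a in range(n_a)]
            z = sum(weights)
            post.append([x / z for x in weights])
        # Prior: marginal of the posterior over world states.
        prior = [sum(p_w[w] * post[w][a] for w in range(n_w)) for a in range(n_a)]
    return post, prior


def evaluate(utility, p_w, post, prior):
    """Expected utility and mutual information I(W;A) in bits."""
    eu = sum(p_w[w] * post[w][a] * utility[w][a]
             for w in range(len(p_w)) for a in range(len(prior)))
    mi = sum(p_w[w] * post[w][a] * math.log2(post[w][a] / prior[a])
             for w in range(len(p_w)) for a in range(len(prior)) if post[w][a] > 0)
    return eu, mi


# Toy one-to-one utility on three world states and three actions.
utility = [[1.0 if a == w else 0.0 for a in range(3)] for w in range(3)]
p_w = [1 / 3] * 3
results = [evaluate(utility, p_w, *blahut_arimoto(utility, p_w, beta=2.0, seed=s))
           for s in range(5)]
eus = [r[0] for r in results]
mis = [r[1] for r in results]
spread = max(eus) - min(eus)  # variation across initializations
```

Reporting `spread` alongside the mean gives a direct indication of whether a comparison between two architectures is larger than the variation caused by the initialization.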

Relation to Variational Bayes and Active Inference
Above, we determined the architectures that achieve the highest expected utility under a given resource constraint. These constraints are fulfilled by tuning the Lagrange multipliers in the Free Energy principle. If the Lagrange multipliers themselves are fixed, for instance as exchange rates between information and utility (Ortega and Braun, 2010), or inverse temperatures in thermodynamics (Ortega and Braun, 2013), then the Free Energy itself would be an appropriate performance measure. This is done, for example, in Bayesian model selection, which is also known as structure learning and represents an important problem in Bayesian inference and machine learning. The Bayesian approach for evaluating different Bayesian network structures, in order to find the relation of a given set of hidden variables that best explains a dataset D, consists in comparing the marginal likelihood or evidence p(D|S) of the structures S (Friedman and Koller, 2003). This can be seen to be analogous to a performance comparison of different decision-making architectures measured by the Free Energy. In the simple case of one observable Y and one hidden variable X, we have
$$p(y|S) = \sum_{x} p(y|x, S)\, p(x|S),$$
where the likelihood p(y|x, S) is assumed to be known. Given a prior p(x|S) and, for simplicity, a single observed datapoint y ∈ Y , the posterior distribution of X can be inferred by using Bayes' rule,
$$p(x|y, S) = \frac{p(y|x, S)\, p(x|S)}{p(y|S)}. \tag{22}$$
As has been noted before (Ortega and Braun, 2013), when comparing (22) with the Boltzmann equation (4) we can see that (22) is equivalent to the posterior P of a bounded rational decision-maker with choice space X , prior policy p(x|S), Lagrange parameter β = 1, and utility function given by U (x) := log p(y|x, S).
Since the marginal likelihood p(y|S) is the normalization constant in (22), it follows immediately from (5) that log p(y|S) is the optimal Free Energy F var [P = p(·|y, S)] of this decision-maker, where
$$F_{\mathrm{var}}[q] := \sum_{x} q(x) \log p(y|x, S) - D_{\mathrm{KL}}\big(q \,\|\, p(\cdot|S)\big). \tag{23}$$
In Bayesian statistics, F var is known as the variational Free Energy, and the given decomposition is often referred to in terms of the difference between accuracy (expected log-likelihood) and complexity (KL-divergence between prior and posterior). It is used in the variational characterization of Bayes' rule, i.e. the approximation of the exact Bayesian posterior p(·|y, S) given by (22) in terms of a simpler (for example, parametrized) distribution q by minimizing the KL-divergence between q and p(·|y, S). Since D KL (q ‖ p(·|y, S)) = −F var [q] + log p(y|S), this is equivalent to the maximization of F var . The same is true for multiple hidden variables. For example, let S be the 3-step architecture of type (1, 4) from Section 4.4 with W = Y and hidden variables X 1 , X 2 , and X 3 = A. Setting β 1 = β 2 = β 3 = 1 and U (a, x 1 , x 2 , y) = log p(y|a, x 1 , x 2 , S), we obtain F 2 (y, x 1 , x 2 ) = log p(y|x 1 , x 2 , S) , F 1 (y, x 1 ) = log p(y|x 1 , S) , and Z(y, x 1 , x 2 ) = p(y|x 1 , x 2 , S) , Z(y, x 1 ) = p(y|x 1 , S) , Z(y) = p(y|S) .
Note that, even though so far we always assumed that the utility function only depends on the world states and actions, the equations in Sections 3, 4, and 4.4 are also valid in the general case of U depending on all the variables in the system. With the normalization constants given above, the optimal total Free Energy for a given y ∈ Y is then log Z(y) = log p(y|S). Hence, also in this case, the logarithm of the marginal likelihood is given by the Free Energy of the corresponding decision-making system. Choosing the multi-step architecture with the highest Free Energy is then analogous to Bayesian model selection with the marginal likelihood or Bayesian model evidence as performance measure.
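The correspondence between the variational Free Energy and the log-evidence in the case of a single hidden variable can be checked numerically. The sketch below uses an arbitrary three-state prior and likelihood (assumed values, chosen only for illustration) and verifies that the Free Energy with β = 1 and utility log p(y|x, S) equals log p(y|S) at the exact posterior, and that for any other distribution q the gap to log p(y|S) is the KL-divergence between q and the posterior.

```python
import math


def kl(q, p):
    """KL-divergence D_KL(q || p) in nats for discrete distributions."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)


def free_energy(q, prior, likelihood):
    """Variational Free Energy: expected log-likelihood (accuracy)
    minus KL-divergence from the prior (complexity)."""
    accuracy = sum(qx * math.log(lx) for qx, lx in zip(q, likelihood) if qx > 0)
    return accuracy - kl(q, prior)


# Assumed toy model: prior p(x|S) and likelihood p(y|x, S) for one fixed y.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.6, 0.3]

evidence = sum(l * p for l, p in zip(likelihood, prior))          # p(y|S)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]  # Bayes' rule

f_post = free_energy(posterior, prior, likelihood)   # equals log p(y|S)
f_prior = free_energy(prior, prior, likelihood)      # a suboptimal choice of q
gap = math.log(evidence) - f_prior                   # equals KL(prior || posterior)
```

Since the gap is a KL-divergence, it is non-negative, so every q other than the exact posterior yields a strictly lower Free Energy.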
Another interesting interpretation of (23) is that here the hidden variable X can be thought of as an action causing observed outcomes y. This is close to the framework of Active Inference (Friston et al., 2015b, 2017b), where actions directly cause transitions of hidden states, which generate outcomes that are observed by the actor. More precisely, there the real-world process generating observable outcomes is distinguished from an internal generative model describing the beliefs about the external generative process (e.g. a Markov decision process). Observations are generated from transitions of hidden states, which depend on the decision-maker's actions. Decision-making is given by the optimization of a variational Free Energy analogous to (23), where the log-likelihood is given by the generative model, which describes beliefs about the hidden and control states of the generative process. This way utilities are absorbed into a (desired) prior (Ortega and Braun, 2015). There are several differences to our approach. First, the structure of the Free Energy principle of bounded rationality originates from the maximization of a given pre-defined external utility function under information constraints, whereas the Free Energy principle of Active Inference aims to minimize surprise, i.e. to maximize Bayesian model evidence, effectively minimizing the divergence between approximate and true posterior. Second, in Active Inference, utility is transformed into preferences in terms of prior beliefs, while in bounded rationality prior policies over actions can be part of the optimization process, which results in specialization and abstraction. In contrast, Active Inference compounds utilities and priors into a single desired prior, which is fixed and does not allow utility and action priors to be optimized separately.

Conclusion
In this work, we have presented an information-theoretic framework to study systems of decision-making units with limited information-processing capabilities. It is based on an overarching Free Energy optimization principle, which, on the one hand, allows one to compute the optimal performances of explicit architectures and, on the other hand, produces optimal partitions of the involved choice spaces into regions of specialization. In order to combine a given set of bounded rational agents, first the full decision-making process is split into multiple decision steps by introducing intermediate decision variables, and then a given set of agents is distributed among these variables. We have argued that this leads to two types of agents: non-operational units that distribute the work among subordinates, and operational units that are doing the actual work in the sense of choosing a particular action that either serves as an input for another agent in the system or represents the final decision of the full process. This "vertical" specialization is enhanced by optimizing over the agents' prior policies, which leads to an optimal soft partitioning of the underlying choice space of each step in the system, resulting in a "horizontal" specialization as well.
In order to illustrate the proposed framework, we have simulated and analyzed the performances under a number of different resource constraints and tasks for all possible 3-step architectures whose information flow starts by observing a given world state and ends with the selection of a final decision. Even though the relative architecture performances depend crucially on the explicit information-processing constraints, the overall best performing architectures tend to be hierarchical systems of non-operational "manager" units at higher hierarchical levels and operational "worker" units at the lowest level.
Our approach is based on earlier work on information-theoretic bounded rationality (Ortega and Braun, 2011, 2013; Genewein and Braun, 2013; Genewein et al., 2015) (see also the references therein). In particular, the N -step decision-making systems introduced in Section 3 generalize the two-step processes studied in (Genewein and Braun, 2013; Genewein et al., 2015). According to Simon (Simon, 1979), there are three different bounded rational procedures that can transform intractable into tractable decision problems: (i) looking for satisfactory choices instead of optimal ones, (ii) replacing global goals with tangible subgoals, and (iii) dividing the decision-making task among many specialists. From this point of view, the decision-making process of a single agent, given by the one-step case of information-theoretic bounded rationality (Ortega and Braun, 2011, 2013) described in Section 2, corresponds to (i), while the bounded rational multi-step and multi-agent decision-making processes introduced in Sections 3 and 4 can be attributed to (ii) and (iii).
The main advantage of a purely information-theoretic treatment is its universality. To our knowledge, this work is the first systematic theory-guided approach to the organization of agents with limited resources in the generality of information theory. In other approaches, more specific methods are used instead, which are tailored to each particular focus of study. In particular, bounded rationality usually has a very specific meaning, often being implemented by simply restricting the cardinality of the choice space. For example, in management theory the well-known results by Graicunas from the 1930s (Graicunas, 1933) suggest that managers must have a limited span of control in order to be efficient. By counting the number of possible relationships between managers and their subordinates, he concludes that there is an explicit upper bound of five or six subordinates. Of course, there are many cases of successful companies today that disagree with Graicunas' claim, e.g. Apple's CEO has 17 managers who report directly to him. However, current management experts think that the optimal number is somewhere between 5 and 12. The idea of restricting the cardinality of the space of decision-making is also studied for operational agents. For example, in (Camacho and Persky, 1988), Camacho and Persky explore the hierarchical organization of specialized producers with a focus on production. Even though their treatment is more abstract and more general than many preceding studies, their take on bounded rationality is very explicit and based on the assumption that the number of elementary parts that form a product, as well as the number of possibilities of each part, are larger than a single individual can handle. Similarly, in most game theoretic approaches that are based on automaton theory (Neyman, 1985; Abreu and Rubinstein, 1988; Hernández and Solan, 2016), the boundedness of an agent's rationality is expressed by a bound on the number of states of the automaton.
Most of these non-information theoretic treatments consider cases when there is a hard upper bound on the number of options, but they usually lack a probabilistic description of the behaviour in cases when the number of options is larger than the given bound.
The work by Geanakoplos and Milgrom (1991) uses "information" to describe the limited attention of managers in a firm. But here, the term is used more informally, and not in the classical information-theoretical sense. However, one of their results suggests that "firms with more prior information about parameters [...] will employ less able managers, or give their managers wider spans of control" (Geanakoplos and Milgrom, 1991, p. 207). This observation is in line with information-theoretic bounded rationality, since by optimizing over priors in the Free Energy principle, the required processing-information is decreased compared to the case of non-optimal priors, so that less able agents can perform a given task, or similarly, an agent with a higher information bound can have a larger choice space.
In neuroscience, the variational Bayes approach explained in Section 6.3 has been proposed as a theoretical framework to understand brain function in terms of Active Inference (Friston, 2009, 2010; Friston et al., 2015a,b, 2017a), where perception is modelled as variational Bayesian inference over hidden causes of observations. There, a processing node (usually a neuron) is limited in the sense that it can only linearly combine a set of input signals into a single output signal. Decision-making is modelled by approximating Bayes' rule in terms of these basic operations, and then tuning the weights of the resulting linear transformations in order to optimize the Free Energy (23). Hence, there, the Free Energy serves as a tool to computationally simplify Bayesian inference on the neuronal level, whereas our Free Energy principle is a tool to computationally trade off expected utility and processing costs, providing an abstract probabilistic description of the best possible choices when the information-processing capability is limited.
In the general setting of approximate Bayesian inference, there are many interesting algorithms and belief update schemes, for example belief propagation in terms of message passing on factor graphs (see e.g. Yedidia et al., 2005). These algorithms make use of the notion of the Markov boundary (minimal Markov blanket) of a node X, which consists of the nodes that share a common factor with X (so-called neighbours). Conditioned on its Markov boundary a given random variable is independent of all other variables in the system, which allows to approximate marginal probabilities in terms of local messages between neighbours. These approximations are generally only exact on tree-like factor graphs without loops (Mézard and Montanari, 2009, Thm. 14.1). This raises the interesting question of whether such algorithms could also be applied to our setting. First, it should be noted that variational Bayesian inference constitutes only a subclass of problems that can be expressed by utility optimization with information constraints. In this subclass, all random variables have to appear either in utility functions, that is they have to be given as log-likelihoods, or they have to appear in marginal distributions that are kept fixed-see for example the definition of the utility in the inference example above where U (a, x 1 , x 2 , y) = log p(y|a, x 1 , x 2 , S) compared to the utility functions of the form U (w, a) used throughout the paper that leave all intermediate random variables X 1 , . . . , X N −1 unspecified. Second, while it may be possible to exploit the notion of Markov blankets by recursively computing free energies between the nodes in a similar fashion to message-passing, there can also be contributions from outside the Markov boundary, for example when the action node has to take an expectation over possible world states that lie outside the Markov boundary. 
Finally, it may be interesting to study whether message passing algorithms can be extended to deal with our general problem setting and at least to approximately generate the same kind of solutions as Blahut-Arimoto, even though in general we do not have tree-structured graphs.
There are plenty of other possible extensions of the basic framework introduced in this work. Marschak and Reichelstein (1998) study multi-agent systems in terms of communication cost minimization, while ignoring the actual decision-making process. One could combine our model with the information bottleneck method (Tishby et al., 1999) and explicitly include communication costs in order to study more general agent architectures, in particular systems with non-directed information flow. Moreover, we have seen in our simulations that specialization of operational agents is an important feature shared among all of the best performing architectures. In the biological literature, specialization is often paired with modularity. For example, Kashtan and Alon (2005) and Wagner et al. (2007) show that modular networks are an evolutionary consequence of modularly varying goals. Similarly, it would be interesting to study the effects of changing environments on specialization, abstraction, and optimal network architectures of systems of bounded rational agents.

By writing
where x <k := (x 0 , . . . , x k−1 ), and , we obtain for any k ∈ {0, . . . , n}, withx = (x k , x k sel , x k in ) andx c := (x 0 , . . . , x N ) \x. In this form, we can see that optimizing for P k yields the Boltzmann distribution (13) with respect to the effective utility F k (x) = x c p(x c |x)F k,loc (x) as defined in (14).