When I said “actual utility” I meant that which we cannot properly formalize (human welfare and other values) and hence cannot teach (or otherwise “give” to) the agent. So no, the agent does not “have” (or otherwise know) this as its utility function in any relevant way.
As I use the term, “maximization” refers to an act, process, or activity (as the ending “-ation” indicates) that actively seeks the maximum of some given function. First there is the function to be maximized, then comes the maximization, and only then does one know the maximum and where it is attained (the argmax).
On the other hand, one might object the following: if we are given a deterministic program P that takes an input x and returns an output y=P(x), we can of course always construct a mathematical function f that takes a pair (x,y) and returns some number r=f(x,y), such that for every possible input x we have P(x)=argmax_y f(x,y). A trivial choice for such a function is f(x,y)=1 if y=P(x) and f(x,y)=0 otherwise. Notice, however, that here the program P is given first, and only then do we construct a specific function f for which this equivalence holds.
In other words, any deterministic program P is functionally equivalent to another program P′ that takes some input x, maximizes the function f(x,·) over y, and returns the location y of that maximum. But being functionally equivalent to a maximizer is not the same as being a maximizer.
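To make this concrete, here is a minimal sketch of the trivial construction above (the particular program P, the candidate set, and all names are my own illustrative choices, not anything given in the discussion):

```python
def P(x):
    """Some arbitrary deterministic program, chosen only for illustration."""
    return (7 * x + 3) % 10

def f(x, y):
    """The trivially constructed function: defined *after* P is given."""
    return 1 if y == P(x) else 0

def P_prime(x, candidates=range(10)):
    """A 'maximizer' wrapper: searches the candidates for the argmax of f(x, .)."""
    return max(candidates, key=lambda y: f(x, y))

# P and P_prime agree on every input, yet only P_prime performs a maximization.
assert all(P(x) == P_prime(x) for x in range(100))
```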
In the learning agent context: if I give you a learned policy pi that takes a state s and returns an action a=pi(s) (or a distribution over actions), then you might well be able to construct a reward function g that takes a state-action pair (s,a) and returns a reward (or expected reward) r=g(s,a), such that when I then compute the corresponding optimal state-action value function Q* for this reward function, it turns out that for all states s we have pi(s)=argmax_a Q*(s,a). This means that pi coincides with the policy that a learning process would have produced if it had searched for the policy maximizing the long-term discounted sum of rewards under g. But it does not mean that pi was actually determined by such an optimization procedure: the learning process that actually produced pi can very well be of a completely different kind.
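The same point can be checked numerically in a tabular setting. The following sketch (a toy MDP with made-up sizes; constructing g from pi in the same trivial way as f above is my own illustrative choice) builds g(s,a)=1 if a=pi(s) and 0 otherwise, computes Q* by value iteration, and confirms that pi comes out as the argmax policy of Q*, even though pi was fixed before any optimization took place:

```python
import numpy as np

n_states, n_actions, gamma = 4, 3, 0.9
rng = np.random.default_rng(0)

pi = rng.integers(n_actions, size=n_states)                        # an arbitrary given policy
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition probabilities

# Reward function constructed *from* pi, after the fact.
g = np.zeros((n_states, n_actions))
g[np.arange(n_states), pi] = 1.0

# Value iteration for the optimal state-action value function Q* of g.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = g + gamma * T @ Q.max(axis=1)

# pi is exactly the greedy (argmax) policy of Q*, although it was never produced by optimization.
assert (Q.argmax(axis=1) == pi).all()
```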