I’m sympathetic to the general thrust of the argument, that we should be reasonably optimistic about “business-as-usual” leading to successful narrow alignment. I put particular weight on the second argument, that the AI research community will identify these problems and succeed at solving them.
However, you mostly lost me in the third argument. You suggest using whatever state-of-the-art general-purpose learning technique exists to model human values, and then optimising the resulting model. I’m pessimistic about this, since it sets up an adversarial relationship between the optimiser (e.g. an RL algorithm) and the learned reward function. This will work if the optimiser is weak and the reward model is strong. But if we are hypothesising a much-improved reward learning technique, we should also assume far more powerful RL algorithms than we have today.
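To make that worry concrete, here is a toy sketch. Everything in it (the quadratic ground-truth reward, the small MLP, the dimensions) is an illustrative assumption of mine rather than anything from the post: a reward model is fit to demonstrations that cover only a thin slice of the state space, and a crude stand-in for a powerful RL optimiser then maximises the learned reward.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, DEMO_DIMS = 32, 4

def true_reward(s):
    # Hypothetical ground truth: prefers states near the origin in every dimension.
    return -s.pow(2).sum(dim=-1)

# Demonstrations only vary along the first few dimensions, so the reward model
# gets no signal about how the remaining dimensions affect the true reward.
demo_states = torch.zeros(256, DIM)
demo_states[:, :DEMO_DIMS] = torch.randn(256, DEMO_DIMS) * 0.5

reward_model = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for _ in range(2000):
    loss = (reward_model(demo_states).squeeze(-1) - true_reward(demo_states)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stand-in for a powerful RL optimiser: gradient-ascend the learned reward over states.
s = demo_states[:1].clone().requires_grad_(True)
ascend = torch.optim.Adam([s], lr=0.05)
for _ in range(2000):
    objective = -reward_model(s).sum()       # maximise the learned reward
    ascend.zero_grad(); objective.backward(); ascend.step()

print(f"learned reward at optimiser's solution:  {reward_model(s).item():9.2f}")
print(f"true reward at optimiser's solution:     {true_reward(s).item():9.2f}")
print(f"true reward of an average demonstration: {true_reward(demo_states).mean().item():9.2f}")
# Typically the learned reward goes up while the true reward ends up far worse than any
# demonstration: the optimiser has wandered into states the reward model never saw.
```

Gradient ascent over states is of course a caricature of an RL algorithm, but the failure mode is the one I would expect from any sufficiently strong policy optimiser paired with a fixed learned reward.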
Currently, RL seems to be an easier problem than learning a reward function. For example, current IRL algorithms will overfit the reward function to the demonstrations in a high-dimensional environment. If you then optimise the learned reward with an RL algorithm, you get a policy which does well under the learned reward function but terribly (often worse than random) under the ground-truth reward function. This is why the policy is normally learned jointly with the reward in a GAN-based approach. Finding regularisers that yield reward models which generalise well is an active area of research; see e.g. the variational discriminator bottleneck. However, solving this in full generality seems very hard. There has been little success on the related problem of adversarial defences, and there are theoretical reasons to believe adversarial examples will be present for any model class in high-dimensional environments.
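For concreteness, here is a rough sketch of the kind of regulariser I have in mind, loosely in the spirit of the variational discriminator bottleneck: the discriminator only sees each state through a stochastic encoding whose information content is kept under a budget, with the Lagrange multiplier adjusted by dual ascent. The synthetic data, architecture, and hyperparameters below are all placeholders I made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
OBS_DIM, Z_DIM, I_C = 16, 8, 0.5        # I_C: information budget (nats) for the encoding

encoder = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 2 * Z_DIM))
classifier = nn.Linear(Z_DIM, 1)        # logit for "this observation came from the expert"
beta = torch.tensor(0.1)                # dual variable for the bottleneck constraint
opt = torch.optim.Adam([*encoder.parameters(), *classifier.parameters()], lr=1e-3)

for step in range(300):
    # Placeholder "expert" and "policy" observation batches, just to run the update.
    expert_obs = torch.randn(64, OBS_DIM) + 1.0
    policy_obs = torch.randn(64, OBS_DIM)
    obs = torch.cat([expert_obs, policy_obs])
    labels = torch.cat([torch.ones(64), torch.zeros(64)])

    mu, log_std = encoder(obs).chunk(2, dim=-1)
    z = mu + log_std.exp() * torch.randn_like(mu)       # reparameterised sample of z
    logits = classifier(z).squeeze(-1)

    # KL(q(z|x) || N(0, I)) per example, averaged over the batch.
    kl = 0.5 * (mu.pow(2) + (2 * log_std).exp() - 2 * log_std - 1).sum(-1).mean()
    loss = F.binary_cross_entropy_with_logits(logits, labels) + beta * (kl - I_C)

    opt.zero_grad(); loss.backward(); opt.step()
    # Dual ascent on beta: tighten the bottleneck whenever the KL exceeds the budget.
    beta = (beta + 1e-3 * (kl.detach() - I_C)).clamp(min=0)

print(f"KL after training: {kl.item():.3f} (budget {I_C}), beta: {beta.item():.3f}")
```

In an actual GAN-style setup the “policy” batch would come from rollouts of the policy being trained jointly, with the discriminator output fed back to the RL algorithm as a reward signal; I have left that out to keep the sketch self-contained.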
Overall, I’m optimistic about the research community solving these problems, but think that present techniques are far from adequate. Although improved general-purpose learning techniques will be important, I believe there will also need to be a concerted focus on solving alignment-related problems.