David Johnston

Karma: 677

David Johnston Feb 24, 2023, 1:26 AM
2 points
2 ∶ 3
on: Taking a leave of absence from Open Philanthropy to work on AI safety
I think your first priority is promising and seemingly neglected (though I’m not familiar with a lot of work done by governance folk, so I could be wrong here). I also get the impression that MIRI folk believe they have an unusually clear understanding of risks, would like to see risky development slow down and are pessimistic about their near-term prospects for solving technical problems of aligning very capable intelligent systems and generally don’t see any clearly good next steps. It appears to me that this combination of skills and views positions them relatively well for developing AI safety standards. I’d be shocked if you didn’t end up talking to MIRI about this issue, but I just wanted to point out that from my point of view there seems to be a substantial amount of fit here.

David Johnston Feb 22, 2023, 4:08 AM
8 points
2 ∶ 0
in reply to: Paul_Christiano’s comment on: Deceptive Alignment is <1% Likely by Default

If a model is deceptively aligned after fine-tuning, it seems most likely to me that it’s because it was deceptively aligned during pre-training.

How common do you think this view is? My impression is that most AI safety researchers think the opposite, and I’d like to know if that’s wrong.

I’m agnostic; pretraining usually involves a lot more training, but also fine tuning might involve more optimisation towards “take actions with effects in the real world”.

David Johnston Feb 22, 2023, 1:38 AM
1 point
0 ∶ 0
in reply to: DavidW’s comment on: Deceptive Alignment is <1% Likely by Default
All of these comments are focused on my third core argument. What do you think of the other two? They all need to be wrong for deceptive alignment to be a likely outcome.
Yeah, this is just partial feedback for now.
Recall that in this scenario, the model is not situationally aware yet, so it can’t be deceptive. Why would making the goal long-term increase immediate-term reward? If the model is trying to maximize immediate reward, making the goal longer-term would create a competing priority.
I think I don’t accept your initial premise. Maybe a model acquires situational awareness via first learning about how similar models are trained for object-level reasons (maybe it’s an AI development assistant), and understanding about how these lessons apply to it’s own training via a fairly straightforward generalisation (along the lines of “other models work like this, I am a model of a similar type, maybe I work like this too”). Neither of these steps requires an improvement in loss via reasoning about its own gradient updates.
If it can be deceptive, then making the goal longer term could help because it reasons from the goal back to performing well in training, and this might be replacing a goal that didn’t quite do the right thing, but because it was short term it also didn’t care about doing well in training.
This isn’t necessarily true. Humans frequently plan for their future without thinking about how their own values will be affected and how that will affect their long-term goals. Why wouldn’t a model do the same thing? It seems very plausible that a model could have crude long-term planning without yet modeling gradient descent updates
I agree it could go either way.
The relevant factor here is actually how much the model expects its future behavior to change from a gradient update, because the model doesn’t yet know the effect of the upcoming gradient update.
I think our disagreement here boils down to what I said above: I’m imagining a model that might already be able to draw some correct conclusions about how it gets changed by training.
Gradient descent can update in every direction at once. If updating its proxies helps performance, I see no reason why gradient descent wouldn’t update the proxies.
Right, that was wrong from me. I still think the broader conclusion is right—if goal shifting boosts performance, then it must already in some sense understand how to perform well and the goal shifting just helps it apply this knowledge. But I’m not sure if understanding how to perform well in this sense is enough to avoid deceptive alignment—that’s why I wanted to read your first post (which I still haven’t done).

David Johnston Feb 21, 2023, 10:37 PM
3 points
1 ∶ 0
on: Deceptive Alignment is <1% Likely by Default
Gradient descent can only update the model in the direction that improves performance hyper-locally. Therefore, building the effects of future gradient updates into the decision making of the current model would have to be advantageous on the current training batch for it to emerge from gradient descent.
I think the standard argument here would be that you’ve got the causality slightly wrong. In particular: pursuing long term goals is, by hypothesis, beneficial for immediate-term reward, but pursuing long term goals also entails considering the effects of future gradient updates. Thus there’s a correlation between “better reward” and “considering future gradient updates”, but the latter does not cause the former.
Because each gradient update should have only a small impact on model behavior, the relatively short-term reward improvements of considering these effects should be very small. If the model isn’t being trained on goals that extended far past the next gradient update, then learning to consider how current actions affect gradient updates, which is not itself especially consequential, should be very slow.
It’s not obvious to me that your “continuity” assumption generally holds (“gradient updates have only a small impact on model behaviour”). In particular, I have an intuition that small changes in “goals” could lead to large changes in behaviour. Furthermore, it is not clear to me that, granting the continuity assumption, the conclusion follows. I think the speed at which it learns to consider how current actions affect gradient updates should depend on how much extra reward (accounting for regularisation) is available from changing in other ways.
One line of argument is that if changing goals is the most impactful way to improve performance, then the model must already have a highly developed understanding of the world. But if it has a highly developed model of the world, then it probably already has a good “understanding of the base objective” (I use quotes here because I’m not exactly sure what this means).
When I click on the link to your first post, I am notified that I don’t have access to the draft.

David Johnston Feb 21, 2023, 6:16 AM
11 points
3 ∶ 0
in reply to: EJT’s comment on: There are no coherence theorems
I think your title might be causing some unnecessary consternation. “You don’t need to maximise utility to avoid domination” or something like that might have avoided a bit of confusion.

David Johnston Feb 21, 2023, 2:36 AM
10 points
3 ∶ 0
in reply to: Habryka [Deactivated]’s comment on: There are no coherence theorems
and I would urge the author to create an actual concrete situation that doesn’t seem very dumb in which a highly intelligence, powerful and economically useful system has non-complete preferences
I’d be surprised if you couldn’t come up with situations where completeness isn’t worth the cost—e.g. something like, to close some preference gaps you’d have to think for 100x as long, but if you close them all arbitrarily then you end up with intrasitivity.

David Johnston Feb 21, 2023, 1:32 AM
10 points
4 ∶ 0
on: There are no coherence theorems
I wonder if it is possible to derive expected utility maximisation type results from assumptions of “fitness” (as in, evolutionary fitness). This seems more relevant to the AI safety agenda—after all, we care about which kinds of AI are successful, not whether they can be said to be “rational”. It might also be a pathway to the kind of result AI safety people implicitly use—not that agents maximise some expected utility, but that they maximise utilities which force a good deal of instrumental convergence (i.e. describing them as expected utility maximisers is not just technically possible, but actually parsimonious). Actually, if we get the instrumental convergence then it doesn’t matter a great deal if the AIs aren’t strictly VNM rational.

In conclusion, I think we’re interested in results like fitness → instrumental convergence, not rationality → VNM utility.
I largely endorse the position that a number of AI safety people have seen theorems of the latter type and treated them as if that they imply theorems of the former type.

David Johnston Feb 7, 2023, 10:10 PM
8 points
0 ∶ 0
on: A personal reflection on SBF
Fixing the “I pit my evidence against itself” problem is easy enough once I’ve recognized that I’m doing this (or so my visualizer suggests); the tricky part is recognizing that I’m doing it.
One obvious exercise for me to do here is to mull on the difference between uncertainty that feels like it comes from lack of knowledge, and uncertainty that feels like it comes from tension/conflict in the evidence. I think there’s a subjective difference, that I just missed in this case, and that I can perhaps become much better at detecting, in the wake of this harsh lesson.
Something that helps me with problems like this is to verbalise the hypotheses I’m weighing up. Observing them seems to help me notice gaps.

David Johnston Feb 5, 2023, 10:20 PM
5 points
0 ∶ 2
in reply to: Rochelle Shen’s comment on: EA, Sexual Harassment, and Abuse

That being said, polyamory/kink is very often used as a tool of social pressure by predators to force women into a bad choice of either a situation they would not have otherwise agreed to or being called “close minded” and potentially withheld social/career opportunities.

Are such threats believable? Is there a broader culture where people feel that they’re constantly under evaluation such that personal decisions like this are plausibly taken into account for some career opportunities, or is this something that arises mainly where the career opportunities are within someone’s personal fiefdom?

David Johnston Feb 3, 2023, 7:33 AM
5 points
1 ∶ 0
on: Focus on the places where you feel shocked everyone’s dropping the ball
What you’re saying here resonates with me, but I wonder if there are people who might be more inclined to assume they’re missing something and consequently have a different feeling about what’s going on when they’re in the situation you’re trying to describe. In particular, I’m thinking about people prone to imposter syndrome. I don’t know what their feeling in this situation would be—I’m not prone to imposter syndrome—but I think it might be different.

David Johnston Jan 31, 2023, 4:48 AM
3 points
2 ∶ 0
on: An in-progress experiment to test how Laplace’s rule of succession performs in practice.
I would have thought that “all conjectures” is a pretty natural reference class for this problem, and Laplace is typically used when we don’t have such prior information—though if the resolution rate diverges substantially from the Laplace rule prediction I think it would still be interesting.
I think, because we expect the resolution rate of different conjectures to be correlated, this experiment is a bit like a single draw from a distribution over annual resolution probabilities rather than many draws from such a distribution ( if you can forgive a little frequentism).

David Johnston Jan 24, 2023, 11:11 PM
3 points
1 ∶ 0
on: Existential Risk Modelling with Continuous-Time Markov Chains
I think to properly model Ord’s risk estimates, you have to account for the fact that they incorporate uncertainty over the transition rate. Otherwise I think you’ll overestimate the rate at which risk compounds over time, conditional on no catastrophe so far.

David Johnston Jan 24, 2023, 10:38 PM
8 points
3 ∶ 1
in reply to: Un Wobbly Panda’s comment on: My highly personal skepticism braindump on existential risk from artificial intelligence.
I think Gary Marcus seems to play the role of an “anti-AI-doom” figurehead much more than Timnit Gebru. I don’t even know what his views on doom are, but he has established himself as a prominent critic of “AI is improving fast” views and seemingly gets lots of engagement from the safety community.

I also think Marcus’ criticisms aren’t very compelling, and so the discourse they generate isn’t terribly valuable. I think similarly of Gebru’s criticism (I think it’s worse than Marcus’, actually), but I just don’t think it has as much impact on the safety community.

David Johnston Dec 6, 2022, 11:29 PM
1 point
0 ∶ 0
on: AI Safety in a Vulnerable World: Requesting Feedback on Preliminary Thoughts
Some quick thoughts: A crude version of the vulnerable world hypothesis is “developing new technology is existentially dangerous, full stop”, in which case advanced AI that increase the rate of new technology development is existentially dangerous, full stop.

One of Bostroms solutions is totalitarianism. This seems to imply something like “new technology is dangerous, but this might be offset by reducing freedom proportionally”. Accepting this hypothesis seems to say that either advanced AI is existentially dangerous, or it accelerates a political transition to totalitarianism, which seems to be its own kind of risk.

David Johnston Dec 3, 2022, 2:04 PM
11 points
2 ∶ 1
in reply to: RobBensinger’s comment on: A challenge for AGI organizations, and a challenge for readers
What sort of substantial value would you expect to be added? It sounds like we either have a different belief about the value-add, or a different belief about the costs.
I’d be very surprised if the actual amount of big-picture strategic thinking at either organisation was “very little”. I’d be less surprised if they didn’t have a consensus view about big-picture strategy, or a clearly written document spelling it out. If I’m right, I think the current content is misleading-ish. If I’m wrong and actually little thinking has been done—there’s some chance they say “we’re focused on identifying and tackling near-term problems”, which would be interesting to me given what I currently believe. If I’m wrong and something clear has been written, then making this visible (or pointing out its existence) would also be a useful update for me.
Polished vs sloppy
Here are some dimensions I think of as distinguishing sloppy from polished:
- Vague hunches <-> precise theories
- First impressions <-> thorough search for evidence/prior work
- Hard <-> easy to understand
- Vulgar <-> polite
- Unclear <-> clear account of robustness, pitfalls and so forth
All else equal, I don’t think the left side is epistemically superior. It can be faster, and that might be worth it, but there are obvious epistemic costs to relying on vague hunches, first impressions, failures of communication and overlooked pitfalls (politeness is perhaps neutral here). I think these costs are particularly high in, as you say, domains that are uncertain and disagreement-heavy.
I think it is sloppy to stay too close to the left if you think the issue is important and you have time to address it properly. You have to manage your time, but I don’t think there are additional reasons to promote sloppy work.
You say that there are epistemic advantages to exposing thought processes, and you give the example of dialogues. I agree there are pedagogical advantages to exposing thought processes, but exposing thoughts clearly also requires polish, and I don’t think pedagogy is a high priority most of the time. I’d be way more excited to see more theory from MIRI than more dialogues.
If my reasoning process is actually flawed, then I want other EAs to be aware of that, so they can have an accurate model of how much weight to put on my views.
I don’t think it’s realistic to expect Lightcone forums to do serious reviews of difficult work. That takes a lot of individual time and dedication; maybe you occasionally get lucky, but you should mostly expect not to.
I agree that I’m not a paradigmatic example of the EAs who most need to hear this lesson [of exposing the thought process]; but I think non-established EAs heavily follow the example set by established EAs, so I want to set an example that’s closer to what I actually want to see more of
Maybe I’ll get into this more deeply one day, but I just don’t think sharing your thoughts freely is a particularly effective way to encourage other people to share theirs. I think you’ve been pretty successful at getting the “don’t worry about being polite to OpenAI” message across, less so the higher level stuff.

David Johnston Dec 3, 2022, 12:53 AM
6 points
2 ∶ 5
in reply to: RobBensinger’s comment on: A challenge for AGI organizations, and a challenge for readers
I don’t think this makes sense. Your group, in the EA community, regarding AI safety, gets taken seriously whatever you write. This in not the paradigmatic example of someone who feels worried about making public mistakes. A community that gives you even more leeway to do sloppy work is not one that encourages more people to share their independent thoughts about the problem. In fact, I think the reverse is true: when your criticisms carry a lot of weight even when they’re flawed, this has a stifling effect on people in more marginal positions who disagree with you.

If you want to promote more open discussion, your time would be far better spent seeking out flawed but promising work by lesser known individuals and pointing out what you think is valuable in it.

Am I correct in my belief that you are paid to do this work? If this is so, then I think the fact that you are both highly regarded and compensated for your time means your output should meet higher standards than a typical community post. Contacting the relevant labs is a step that wouldn’t take you much time, can’t be done by the vast majority of readers, and has a decent chance of adding substantial value. I think you should have done it.

David Johnston Nov 27, 2022, 11:03 AM
3 points
0 ∶ 0
in reply to: Rohin Shah’s comment on: Two contrasting models of “intelligence” and future growth
We might just be talking past each other—I’m not saying this is a reason to be confident explosive growth won’t happen and I agree it looks like growth could go much faster before hitting any limits like this. I just meant to say “here’s a speculative mechanism that could break some of the explosive growth models”

David Johnston Nov 26, 2022, 9:48 PM
2 points
0 ∶ 0
in reply to: Rohin Shah’s comment on: Two contrasting models of “intelligence” and future growth
I don’t think your summary is wrong as such, but it’s not how I think about it.

Suppose we’ve got great AI that, in practice, we still use with a wide variety of control inputs (“make better batteries”, “create software that does X”). Then it could be the case—if AI enables explosive growth in other domains—that “production of control inputs” becomes the main production bottleneck.

Alternatively, suppose there’s a “make me a lot of money” AI and money making is basically about making stuff that people want to buy. You can sell more stuff that people are already known to want—but that runs into the limit that people only want a finite amount of stuff. You could alternatively sell new stuff that people want but don’t know it yet. This is still limited by the number of people in the world, how often each wants to consider adopting a new technology and what things someone with life history X is actually likely to adopt and how long it takes them to make this decision. These things seem unlikely to scale indefinitely with AI capability.

This could be defeated by either money not being about making stuff people want—which seems fairly likely, but in this case I don’t really know what to think—or AI capability leading to (explosive?) human population expansion.

In defence of this not being completely wild speculation: advertising already comprises a nontrivial fraction of economic activity and seems to be growing faster than other sectors https://www.statista.com/statistics/272443/growth-of-advertising-spending-worldwide/

(Although only a small fraction of advertising is promoting the adoption of new tech)

David Johnston Nov 25, 2022, 9:15 PM
1 point
0 ∶ 0
in reply to: Rohin Shah’s comment on: Two contrasting models of “intelligence” and future growth
One objection to the “more AI → more growth” story is that it’s quite plausible that people still participate in an AI driven economy to the extent that they decide what they want, and this could be a substantial bottleneck to growth rates. Speeds of technological adoption do seem to have increased (https://www.visualcapitalist.com/rising-speed-technological-adoption/), but that doesn’t necessarily mean they can indefinitely keep pace with AI driven innovation.

David Johnston Nov 24, 2022, 11:49 PM
7 points
3 ∶ 0
in reply to: MichaelPlant’s comment on: Don’t just give well, give WELLBYs: HLI’s 2022 charity recommendation
I haven’t looked in detail at how Give Well evaluates evidence, so maybe you’re no worse here, but I don’t think “weighted average of published evidence” is appropriate when one has concerns about the quality of published evidence. Furthermore, I think some level of concern about the quality of published evidence should be one’s baseline position—I.e. a weighted average is only appropriate when there are unusually strong reasons to think the published evidence is good.

I’m broadly supportive of the project of evaluating impacts on happiness.