Thanks for cross-posting this; I probably wouldn’t have heard about it otherwise.
I am very interested in Open Phil’s model regarding the best time to donate for such causes. If anyone is aware of similar models for large donors, I would love to hear about them.
Thanks for sharing that, that sounds like an interesting plan.
A while ago I was trying to think about potential ways to have a large impact via formal verification (after reading this post). I didn’t give it much attention, but it seems that others and I don’t see a case for this career path being highly impactful, though I’d love to be proven wrong. I would appreciate it if you could elaborate on your perspective on this.
I should probably mention that I couldn’t find a reference to formal verification in agent foundations (but I didn’t really read it), and Vanessa seemed to reference it as a tangential point, though I might be wrong about both.
I’m interested in formal verification from a purely mathematical point of view. That is, I think it’s important for math (but I don’t think that formalizing [mainstream] math is likely to be very impactful outside of math). Additionally, I am interested in ideas developed in homotopy type theory because of their connections to homotopy theory, rather than because I think they are impactful.
With regards to FIRE, I myself still haven’t figured out how this fits with my donations. In any case, I think that giving money to beggars adds up to less than $5 per month in my case (and probably even less on average), but I guess that also depends on where you live, etc.
I would like to reiterate Edo’s answer, and add my perspective.
First and foremost, I believe that one can follow EA perspectives (e.g. donate effectively) AND be kind and helpful to strangers, rather than OR (repeating an argument I made before in another context). In particular, I personally don’t record giving a couple of dollars in my donation sheet, and it does not affect my EA-related giving (at least not intentionally).
Additionally, these gifts constitute such a small fraction of my other spending that I don’t notice them financially. Despite that, I truly believe that being kind to strangers, giving a few coins, or trying to help in other ways can meaningfully help the other person (even if not as cost-effectively as donating to, say, GiveWell).
I don’t view this and my other donations as means to achieve the exact same goal, but rather as two distinct and non-competing ways to achieve the purpose of making the world better.
Thank you for following up and clarifying that.
I see, thanks for the teaser :)
I was under the impression that you have rough estimates for some charities (e.g. StrongMinds). Looking forward to seeing your future work on that.
Thanks for posting that. I’m really excited about HLI’s work in general, and especially the work on the kinds of effects you are trying to estimate in this post!
I personally don’t have a clear picture of how much $/WELLBY is considered good (whereas GiveWell’s estimates for their leading charities are around $50-100/QALY). Do you have a table or something similar on your website, summarizing your results for the charities you found to be highly effective, for reference?
I recently made a big career change, and I am planning to write a detailed post on this soon. In particular, it will touch on this point.
I did use Fermi calculations to estimate my impact in my career options. In some areas it was fairly straightforward (the problem is well defined, it is possible to meaningfully estimate the percentage of the problem expected to be solved, etc.). However, in other areas I am clueless as to how to really estimate this (the problem is huge and it isn’t clear where I will fit in, my part in the problem is not very clear, there are too many other factors and actors, etc.).
In my case, I had two leading options, one of which was reasonably amenable to these kinds of estimates, and the other not so much. The interesting thing was that in the first case, my potential impact turned out to be around the same order of magnitude as EtG, maybe a little bit more (though with a wide confidence interval).
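For what it’s worth, here is a minimal sketch of what such a comparison can look like (every number is a made-up placeholder for illustration, not one of my actual inputs):

```python
# Toy Fermi comparison: direct work vs. earning to give.
# Every number here is an illustrative placeholder, not a real estimate.

problem_value_usd = 1e10   # assumed: total value of fully solving the problem
fraction_solved = 2e-4     # assumed: share of the problem I might solve over a career
counterfactual = 0.3       # assumed: probability my contribution wouldn't happen otherwise
direct_work_value = problem_value_usd * fraction_solved * counterfactual

annual_donation_usd = 20_000  # assumed: yearly earning-to-give donations
career_years = 30
etg_value = annual_donation_usd * career_years

print(f"Direct work:     ~${direct_work_value:,.0f}")  # ~$600,000
print(f"Earning to give: ~${etg_value:,.0f}")          # ~$600,000
print(f"Ratio: {direct_work_value / etg_value:.1f}x")
```

In practice each input above deserves a range rather than a point estimate, which is exactly why the resulting confidence interval is wide.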
All in all, I think this is a helpful method to gain some understanding of the things you can expect to achieve, though, as usual, these estimates shouldn’t be taken too seriously in my opinion.
I think another interesting example to compare to (which also relates to Asaf Ifergan’s comment) is private research institutes and labs. I think they are much more focused on specific goals, and give their researchers different incentives than academia, although the actual work might be very similar. These kinds of organizations span a wide range between academia and industry.
There are of course many such examples, some of which are successful and some probably not so much. Here are some that come to mind: OpenAI, DeepMind, the Institute for Advanced Study, Bell Labs, the Allen Institute for Artificial Intelligence, and MIGAL (Israel).
I just wanted to say that I really like your idea, and at least at the intuitive level it sounds like it could work. Looking forward to the assessment of real-world usage!
Also, the website itself looks great, and very easy to use.
Thanks for the response. I believe this answers the first part, namely why GPT-3 poses an x-risk specifically.
Did you or anyone else ever write up what aligning a system like GPT-3 looks like? I have to admit that it’s hard for me even to define what being (intent) aligned means for a system like GPT-3, which is not really an agent on its own. How do you define or measure something like this?
Thanks for posting this!
Here is a link to the full report: The Oxford Principles for Net Zero Aligned Carbon Offsetting. (I think it’s good practice to include a link to the original reference when possible.)
Quick question: are these positions open to remote applicants (outside the US)?
(I wrote this comment separately, because I think it will be interesting to a different, and probably smaller, group of people than the other one.)
Thank you for posting this, Paul. I have questions about two different aspects.
At the beginning of your post you suggest that this is “the real thing” and that these systems “could pose an existential risk if scaled up”. I personally, and I believe other members of the community, would like to learn more about your reasoning. In particular, do you think that GPT-3 specifically could pose an existential risk (for example, if it falls into the wrong hands, or is scaled up sufficiently)? If so, why, and what is a plausible mechanism by which it poses an x-risk?
On a different matter, what does aligning GPT-3 (or similar systems) mean to you concretely? What would the optimal result of your team’s work look like? (This question assumes that GPT-3 is indeed a “prosaic” AI system, and that we will not gain a fundamental understanding of intelligence from this work.)
At some point I tried to estimate this too and got similar results. This raised several points:
I am not sure what the mortality cost of carbon actually measures:
I believe that the cost of an additional ton of carbon depends on the total amount of carbon already released (for example, in a 1°C warming scenario it is probably very different than in a 3.5°C warming scenario).
The carbon and its effects will persist and affect people for some unknown time (could be indefinitely, could be until we capture it, until we go extinct, or some other option). This could greatly alter the result, depending on the time span you use.
The solutions offered by GiveWell’s top charities are highly scalable. I think the same cannot be said about CATF, and perhaps about CfRN as well. Therefore, if you want to compare global dev to climate change, it might be better to compare against something that can absorb at least hundreds of millions of dollars yearly. (That said, it is of course still a fair comparison to compare CATF to a specific GiveWell-recommended charity.)
The confidence interval you get (and that I got) is big. In your case it spans two orders of magnitude, and that does not take into account the uncertainty in the mortality cost of carbon. I imagine that if we followed the previous point and used something larger for comparison, the $/ton estimate would be tighter. However, I believe the first point at least indicates that the mortality cost of carbon will have a very large confidence interval. This is in contrast with the confidence interval in GiveWell’s estimates, which is (if I recall correctly) much narrower.
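To make the first and last points concrete, here is a rough sketch of how the $/ton range propagates into $/death averted (all numbers, including the deaths-per-ton figure, are assumptions for illustration, not actual estimates):

```python
# Toy propagation of uncertainty: $/ton CO2 averted -> $/death averted.
# All numbers are illustrative assumptions, not actual estimates.

cost_per_ton_low, cost_per_ton_high = 0.1, 10.0  # assumed: $/ton CO2 averted (two orders of magnitude)
deaths_per_ton = 2e-4                            # assumed: mortality cost of carbon, deaths per ton CO2

low = cost_per_ton_low / deaths_per_ton
high = cost_per_ton_high / deaths_per_ton
print(f"~${low:,.0f} to ~${high:,.0f} per death averted")  # ~$500 to ~$50,000

# The two-orders-of-magnitude spread in $/ton carries straight through,
# before even adding the (large) uncertainty in deaths_per_ton itself.
```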
I would love to hear any responses to these points (in particular, I guess there are some concrete answers to the first point, which would also shed light on the confidence interval of the mortality cost of carbon).
To conclude, I personally believe that climate change interventions could save lives at a cost similar to that of global dev interventions, but I also believe that the confidence interval for these will be much, much wider.
I agree that it isn’t easy to quantify all of these.
Here is something you could do which, unfortunately, does not take into account changes in charities’ operations over time, but is quite easy to do (all figures should be in real terms).
Choose a large interval of time (say 1900 to 2020), and at each point (say every month or year), decide how much you invest vs how much you donate, according to your strategy (and others).
Choose a model for how much money you have (for example, starting with a fixed amount, or say receiving a fixed amount every year, or receiving an amount depending on the return on investment in the previous year).
Sum up the total money donated over the course of that interval, and calculate how much money you have at the end.
Then, for each strategy, you can compare these two values at the end. You can also add the total donated to the money left, pretending to donate everything remaining at the end of the interval. Or you could adjust your strategies so that no money is left at the end.
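As a very rough illustration, such a backtest can be a few lines of code. Here is a minimal sketch in Python, with synthetic returns and a made-up income model standing in for the real historical data and your actual strategy:

```python
import random

# Toy backtest of donate-now vs. invest-then-donate strategies.
# Returns are synthetic; substitute a real historical series
# (e.g. 1900-2020, in real terms) to run this properly.

random.seed(0)
YEARS = 120
returns = [random.gauss(0.05, 0.18) for _ in range(YEARS)]  # assumed: real annual returns

def backtest(donate_fraction, annual_income=10_000):
    """Each year: receive income, let holdings grow, then donate a fixed fraction of holdings."""
    wealth, total_donated = 0.0, 0.0
    for r in returns:
        wealth = (wealth + annual_income) * (1 + r)
        donation = wealth * donate_fraction
        total_donated += donation
        wealth -= donation
    return total_donated, wealth

for frac in (1.0, 0.5, 0.1):
    donated, left = backtest(frac)
    # Pretend whatever is left is donated at the end, to compare strategies on a single number.
    print(f"donate {frac:.0%}/yr: donated ${donated:,.0f}, left ${left:,.0f}, total ${donated + left:,.0f}")
```

The interesting comparisons would use actual market returns and donation strategies people propose, but even this toy version shows how the bookkeeping works.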
Thanks for posting this, this is very interesting.
Did you by any chance try to model this? It would be interesting, for example, to compare different strategies and see how they would have performed given past data.
Thanks for writing this! I really like the way you write, which I found fun and light while, at the same time, highlighting the important parts vividly. I too was surprised to learn that this is the version of utilitarianism Bentham had in mind, and I find the views expressed in your summary (Ergo) lovely too.