Former management consultant and data scientist. Currently on sabbatical to try to transition to work on AI safety in some capacity.
mjkerrison
Do the ol' career transition into AI safety. Probably governance/policy, but with a technical flavour.
(If you need stats, data science, report writing, or management consultancy skills, please hit me up!)
Isn't mechinterp basically setting out to build tools for AI self-improvement?
One of the things people are most worried about is AIs recursively improving themselves. (Whether all people who claim this kind of thing as a red line will actually treat this as a red line is a separate question for another post.)
It seems to me like mechanistic interpretability is basically a really promising avenue for that. Trivial example: Claude decides that the most important thing is being the Golden Gate Bridge. Claude reads up on Anthropic's work, gets access to the relevant tools, and does brain surgery on itself to turn into Golden Gate Bridge Claude.
More meaningfully, it seems like any ability to understand in a fine-grained way what's going on in a big model could be co-opted by an AI to "learn" in some way. In general, I think the case that seems most likely soonest is:
- Learn in-context (e.g. results of experiments, feedback from users, the kinds of things we've recently observed in scheming papers...)
- Translate this to appropriate adjustments to weights (identified using mechinterp research)
- Execute those adjustments
Maybe I'm late to this party and everyone was already conceptualising mechinterp as a very dual-use technology, but I'm here now.
Honestly, maybe it leans more towards "offense" (i.e., catastrophic misalignment) than defense! It will almost inevitably require automation to be useful, so we're ceding it to machines out of the gate. I'd expect tomorrow's models to be better placed than humans to make sense of, and make use of, mechinterp techniques: partly just because of sheer compute, but also maybe because (and now I'm speculating on stuff I understand even less) the nature of their cognition is more suited to what's involved.
If someone isn't already doing so, someone should estimate what % of (self-identified?) EAs donate according to our own principles. This would be useful (1) as a heuristic for the extent to which the movement/community/whatever is living up to its own standards, and (2), assuming the answer is "decently", as evidence for PR/publicity/responding to marginal-faith tweets during bouts of criticism.
Looking at the Rethink survey from 2020, they have some info about which causes EAs are giving to, but they seem to note that not many people responded on this? And it's not quite the same question. To do: check GWWC for whether they publish anything like this.
Edit to add: maybe an imperfect but simple and quick instrument for this could be something like "For what fraction of your giving did you attempt a cost-effectiveness assessment (CEA), read a CEA, or rely on someone else who said they did a CEA?". I don't think it actually has to be about whether the respondent got the "right" result per se; the point is the principles. Deferring to GiveWell seems like living up to the principles because of how they make their recommendations, etc.
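To make that concrete, here's a minimal sketch (Python, with entirely made-up responses and an arbitrary 50% threshold, purely as an assumption for illustration) of how answers to a question like that could be turned into a headline figure with an uncertainty range:

```python
import numpy as np

# Hypothetical survey responses: for each respondent, the fraction of their
# giving where they attempted a CEA, read one, or relied on someone who did.
# These numbers are invented purely to illustrate the calculation.
principled_fraction = np.array([1.0, 0.8, 0.0, 0.5, 1.0, 0.2, 0.9, 0.0, 1.0, 0.6])

# Headline estimate: share of respondents whose giving is "mostly principled"
# (here, at least 50% of it covered by some CEA - an arbitrary cut-off).
threshold = 0.5
k = int((principled_fraction >= threshold).sum())
n = len(principled_fraction)
p_hat = k / n

# Wilson score interval for the proportion (behaves better than the normal
# approximation at small sample sizes).
z = 1.96
denom = 1 + z**2 / n
centre = (p_hat + z**2 / (2 * n)) / denom
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom

print(f"Giving mostly on principle: {k}/{n} = {p_hat:.0%}")
print(f"95% CI: [{centre - half_width:.0%}, {centre + half_width:.0%}]")
```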
Can you add / are you comfortable adding anything on who "us" is and which orgs or what kinds of orgs are hesitant? Is your sense this is universal, or more localised (geographically, politically, cause area...)?
Good point and good fact.
My sense, though, is that if you scratch most "expand the moral circle" statements you find a bit of implicit moral realism. I think generally there's an unspoken "...to be closer to its truly appropriate extent", and that there's an unspoken assumption that there'll be a sensible basis for that extent. Maybe some people are making the statement prima facie though. Could make for an interesting survey.
Love to see these reports!
I have two suggestions/requests for "crosstabs" on this info (which is naturally organised by evaluator, because that's what the project is!):
As-of-today, which evaluators/charities sit where on the recommendation scale. The info for that is mostly on GWWC's website but not quite organised as such. I'm thinking of rows for cause areas, columns for buckets, e.g. "Recommended" at one end and "Maybe not cost-effective" at the other (though maybe you'd drop things off altogether). Just something to help visualise what's moved, by how much, and broadly why things are sitting where they are (e.g. THL corporate campaigns sliding off the recommended list for "procedural" reasons, so not in the Recommended column but now in a "Nearly" column or something). There's a rough sketch of what I mean after these suggestions.
I'd love a clear checklist of what you think needs improvement per evaluated program, to help with making the list a little more evergreen. I think all that info is in your reporting, but if you called it out I think it would
- help evaluated programs, and
- help donors to
  - get a sense for how up-to-date that recommendation is (given the rotating/rolling nature of the evaluation program), and
  - possibly do their own assessment of whether the charity "should" be recommended "now".
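For the first suggestion, here's a rough sketch of the kind of snapshot I mean (Python/pandas; the programs, cause areas, and bucket labels are placeholders I've made up, not GWWC's actual categories or conclusions):

```python
import pandas as pd

# Hypothetical evaluation snapshot; every row here is invented for illustration.
evals = pd.DataFrame([
    {"cause_area": "Global health",  "program": "Programme A", "bucket": "Recommended"},
    {"cause_area": "Global health",  "program": "Programme B", "bucket": "Nearly"},
    {"cause_area": "Animal welfare", "program": "Programme C", "bucket": "Nearly"},
    {"cause_area": "Animal welfare", "program": "Programme D", "bucket": "Maybe not cost-effective"},
])

# Rows = cause areas, columns = recommendation buckets, cells = programs.
snapshot = (
    evals.groupby(["cause_area", "bucket"])["program"]
    .apply(", ".join)
    .unstack(fill_value="")
)
print(snapshot)
```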
Is anyone keeping tabs on where AI's actually being deployed in the wild? I feel like I mostly see (and so this could be a me problem) big-picture stuff, but there seems to be a proliferation of small actors doing weird stuff. Twitter / X seems to have a lot more AI content, and apparently YouTube comments do now as well (per a conversation I stumbled on while watching some YouTube recreationally; language & content warnings: https://youtu.be/p068t9uc2pk?si=orES1UIoq5qTV5TH&t=2240)
mjkerrison's Quick takes
I think this is a really compelling addition to EA portfolio theory. Two half-formed thoughts:
- Does portfolio theory apply better at the individual level than the community level? I think something like treating your own contributions (giving + career) as a portfolio makes a lot of sense, if you're explicitly trying to hedge personal epistemic risk. I think this is a slightly different angle on one of Jeff's points: is this "k-level 2" aggregate portfolio a "better" aggregation of everyone's information than the "k-level 1" of whatever portfolio emerges from everyone individually optimising their own portfolios? You could probably look at this analytically... might put that on the to-do list (a toy simulation of the comparison is sketched after these bullets).
- At some point what matters is specific projects...? Like when I think about "underfunded", I'm normally thinking there are good projects with high expected ROI that aren't being done, relative to some other cause area where the marginal project has a lower ROI. Maybe my point is something like: underfunding and accounting for it should be done at a different stage of the donation process, rather than by looking at what the overall % breakdown of the portfolio is. Maybe we're more private equity than index fund.
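On that analytical question, here's a toy Monte Carlo sketch of the comparison; the setup (independent noisy signals of cause effectiveness, all-or-nothing allocations, made-up parameters) is an assumption purely for illustration, not a claim about how the community actually behaves:

```python
import numpy as np

rng = np.random.default_rng(0)

N_DONORS, N_CAUSES, N_SIMS = 100, 5, 2000
SIGNAL_NOISE = 1.0  # how noisy each donor's private estimate is (assumed)

level1_total, level2_total = [], []
for _ in range(N_SIMS):
    # True (unknown) cost-effectiveness of each cause, in arbitrary units.
    true_value = rng.normal(0, 1, N_CAUSES)
    # Each donor sees a private, noisy estimate of every cause.
    signals = true_value + rng.normal(0, SIGNAL_NOISE, (N_DONORS, N_CAUSES))

    # "k-level 1": each donor funds their own best guess; the community
    # portfolio is whatever falls out of those individual choices.
    individual_picks = signals.argmax(axis=1)
    level1_total.append(true_value[individual_picks].sum())

    # "k-level 2": pool everyone's estimates first, then everyone funds the
    # cause that looks best on the pooled information.
    pooled_pick = signals.mean(axis=0).argmax()
    level2_total.append(N_DONORS * true_value[pooled_pick])

print(f"k-level 1 (individually optimised): {np.mean(level1_total):8.1f}")
print(f"k-level 2 (pooled information):     {np.mean(level2_total):8.1f}")
```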
Excited to see this!
It seems useful, and on priors more cost-effective, to centralise & outsource some of these things: avoiding reinventing the wheel, and producing the scale that (a) lets you build expertise and (b) makes it worthwhile investing in improvements.
I wonder if there might be particularly strong regional effects to this: maybe Goa had quite a large dog population, quite a lot of rabies, or quite dense dog/human populations (affecting rabies, bite, and transmission incidences).
I think there could be room for further research to identify whether there would be better-looking (sub-country) regions, though as Helene_K found, the data would be difficult.
Hey Alexander, thanks for the write-up! I found it useful as a local, and it seems valuable to be sharing/coordinating on this globally.
One thing that occurred to me would be to zoom in on the sectors of the economy that are exposed to AI. I think that in Australia, exposure might be relatively more concentrated than elsewhere, specifically in education, which is one of our biggest exports (though I think it gets accounted for domestically).
That could mean:
- If there are distinct challenges in education vs other knowledge work, some calculus may change (not sure what exactly)
- There might be stakeholders/coalitions we haven't tapped yet to support less narrowly economic concerns
Another tentative implication that goes without saying, but I'll say it anyway: review who you're listening to.
Who got these developments "right" and "wrong"? How will you weight what those people say in the next 12 months?