An independent researcher of ethics, AI safety, and AI impacts. LessWrong: https://www.lesswrong.com/users/roman-leventov. Twitter: https://twitter.com/leventov. E-mail: leventov.ru@gmail.com (the preferred mode of communication).
Roman Leventov
[Question] Has private AGI research made independent safety research ineffective already? What should we do about this?
Joscha Bach on Synthetic Intelligence [annotated]
Bing definitely “helps” people to over-anthropomorphise it by actively corroborating that it has emotions (via self-report and over-use of emojis), consciousness, etc.
Scientism vs. people
Note: this comment is cross-posted on LessWrong.
Classification of AI safety work
Here I proposed a systematic framework for classifying AI safety work. This is a matrix, where one dimension is the system level:
A monolithic AI system, e.g., a conversational LLM
AGI lab (= the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs)
A cyborg, human + AI(s)
A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future, we may see more systems like this, operating at a larger scale, up to a fully automated AI economy; or a swarm of CoEms automating science)
A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence, e.g., The Collective Intelligence Project
The whole civilisation, e.g., Open Agency Architecture, or the Gaia network
Another dimension is the “time” of consideration:
Design time: research into how the corresponding system should be designed (engineered, organised), considering its functional properties (“capability”, quality of decisions), adversarial robustness (= misuse safety, memetic virus security), and security. For AGI labs: org design and charter.
Manufacturing and deployment time: research into how to create the desired designs of systems successfully and safely:
AI training and monitoring of training runs.
Offline alignment of AIs during (or after) training.
AI strategy (= research into how to transition into the desirable civilisational state = design).
Designing upskilling and educational programs for people to become cyborgs is also here (= designing efficient procedures for manufacturing cyborgs out of people and AIs).
Operations time: ongoing (online) alignment of systems on all levels to each other, ongoing monitoring, inspection, anomaly detection, and governance.
Evolutionary time: research into how the (evolutionary lineages of) systems at the given level evolve long-term:
How the human psyche evolves when it is in a cyborg
How humans will evolve over generations as cyborgs
How AI safety labs evolve into AGI capability labs :/
How groups, communities, and society evolve.
Designing feedback systems that don’t let systems “drift” into undesired states over evolutionary time.
Considering a system property: flexibility of values (i.e., the property opposite to value lock-in, Riedel (2021)).
IMO, it (sometimes) makes sense to think about this separately from alignment per se. Systems could be perfectly aligned with each other but drift into undesirable states and not even notice this if they don’t have proper feedback loops and procedures for reflection.
There would be 6*4 = 24 slots in this matrix, and almost all of them have something interesting to research and design, and none of them is “too early” to consider.
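To make the combinatorics concrete, here is a minimal sketch (a pure illustration; the labels are just shorthand for the lists above) that enumerates the 24 slots of the matrix:

```python
from itertools import product

# System levels and "times" of consideration, as listed above (shorthand labels).
system_levels = [
    "monolithic AI system",
    "AGI lab",
    "cyborg (human + AIs)",
    "system of AIs with emergent qualities",
    "human+AI group / community / society",
    "whole civilisation",
]
times = ["design", "manufacturing and deployment", "operations", "evolutionary"]

# Every (level, time) pair is a slot with something to research and design.
matrix = list(product(system_levels, times))
assert len(matrix) == 6 * 4 == 24

for level, time in matrix:
    print(f"{level} * {time} time")
```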
Richard’s directions within the framework
Scalable oversight: (monolithic) AI system * manufacturing time
Mechanistic interpretability: (monolithic) AI system * manufacturing time, also design time (e.g., in the context of the research agenda of weaving together theories of cognition and cognitive development, ML, deep learning, and interpretability through the abstraction-grounding stack, where interpretability plays the role of empirical/experimental science work)
Alignment theory: Richard phrases it vaguely, but his primary references to MIRI-style work suggest that he means “(monolithic) AI system * design, manufacturing, and operations time”.
Evaluations, unrestricted adversarial training: (monolithic) AI system * manufacturing, operations time
Threat modeling: system of AIs (rarely), human + AI group, whole civilisation * deployment time, operations time, evolutionary time
Governance research, policy research: human + AI group, whole civilisation * mostly design and operations time.
Takeaways
To me, it seems almost certain that many current governance institutions and democratic systems will not survive the AI transition of civilisation. Bengio recently hinted at the same conclusion.
Human+AI group design (scale-free: small group, org, society) and the civilisational intelligence design must be modernised.
Richard mostly classifies this as “governance research”, which has a connotation that this is a sort of “literary” work and not science, with which I disagree. There is a ton of cross-disciplinary hard science to be done on group intelligence and civilisational intelligence design: game theory, control theory, resilience theory, linguistics, political economy (rebuilt as a hard science, of course, on the basis of resource theory, bounded rationality, economic game theory, etc.), cooperative reinforcement learning, etc.
I feel that the design of group intelligence and civilisational intelligence is an under-appreciated area by the AI safety community. Some people do this (Eric Drexler, davidad, the cip.org team, ai.objectives.institute, the Digital Gaia team, and the SingularityNET team, although the latter are less concerned about alignment), but I feel that far more work is needed in this area.
There is also a place for “literary”, strategic research, but I think it should mostly concern the deployment time of group and civilisational intelligence designs, i.e., the questions of transition from the current governance systems to the next-generation, computation- and AI-assisted systems.
Also, operations and evolutionary time concerns of everything (AI systems, systems of AIs, human+AI groups, civilisation) seem to be under-appreciated and under-researched: alignment is not a “problem to solve”, but an ongoing, manufacturing-time and operations-time process.
However, talented individuals who have invested in upskilling themselves to go do AIS research (e.g. SERI MATS graduates) are largely unable to secure research positions.
It would be interesting to see the actual numbers; I think Ryan Kidd should have them.
• Increasing the number of research bets: additional independent research might increase the number of research directions being pursued. After all, as independent researchers individuals have more agency over deciding which research agendas to pursue. Pursuing more research bets could be very beneficial in this pre-paradigmatic field.
I somewhat disagree that increasing the number of “bets” is a good idea, where a “bet” is taken to be an idiosyncratic framework or theory. I explained this position here: https://www.alignmentforum.org/posts/FnwqLB7A9PenRdg4Z/for-alignment-we-should-simultaneously-use-multiple-theories#Creating_as_many_new_conceptual_approaches_to_alignment_as_possible__No and also touched upon it and discussed it with Ryan Kidd in the comments to this post: https://www.lesswrong.com/posts/bRtP7Mub3hXAoo4vQ/an-open-letter-to-seri-mats-program-organisers.
But independent researchers are not obliged to craft their own theories, of course: they could work within existing, established frameworks (and collaborate with other researchers who work in these frameworks) while remaining organisationally independent.
The things that the proposed startup is going to do seem to overlap in various ways with MATS, AI Safety Camp, Orthogonal (https://www.lesswrong.com/posts/b2xTk6BLJqJHd3ExE/orthogonal-a-new-agent-foundations-alignment-organization), the European Network for AI Safety (ENAIS, https://forum.effectivealtruism.org/posts/92TAmcppCL7t54Ajn/announcing-the-european-network-for-ai-safety-enais), Nonlinear.org, and LTFF (if you plan to ‘hire’ researchers and pay them a salary, i.e., effectively fund them, you basically plan to increase the total fundraising for AI safety, which is currently the LTFF’s role).
Detailing similarities, differences, and partnerships with these projects and orgs would be useful.
Re: Nonlinear, they directly provide services that you plan to provide as well:
The Nonlinear Network: Funders get access to AI safety deal flow similar to large EA funders. People working in AI safety can apply to >45 AI safety funders in one application. The Nonlinear Support Fund: Automatically qualify for mental health or productivity grants if you work full-time in AI safety.
(Note that both are targeted not only at AI safety founders, as it may seem from the website, but at independent researchers as well.)
I think it’s better not to increase the number of distinct Slack spaces without necessity. We can create a channel for independent researchers in the AI Alignment Slack (see https://coda.io/@alignmentdev/alignmentecosystemdevelopment).
If OpenAI still had a moral compass, and were still among the good guys, they would pause AGI (and ASI) capabilities research until they have achieved a viable, scalable, robust set of alignment methods that have the full support and confidence of AI researchers, AI safety experts, regulators, and the general public.
I disagree with multiple things in this sentence. First, you take a deontological stance, whereas OpenAI clearly acts within a consequentialist stance, assuming that if they don’t create ‘safe’ AGI, reckless open-source hackers will (given the continuing exponential decrease in the cost of effective training compute, and/or the next breakthrough in DNN architecture or training that will make it much more efficient and/or enable effective online training). Second, I largely agree with OpenAI, as well as Anthropic, that iteration is important for building an alignment solution. One probably cannot design a robust, safe AI without empirical iteration, including with increasing capabilities.
I agree with your assessment that the strategy they are taking will probably fail, but mainly because I think we have inadequate human intelligence, human psychology, and coordination mechanisms to execute it. That is, I would support Yudkowsky’s proposal: halt all AGI R&D, develop narrow AI and tech for improving the human genome, make humans much smarter (a von Neumann level of intelligence should be just the average) and give them a much more peaceful psychology, like that of bonobos, reform coordination and collective decision-making, and only then revisit the AGI project with roughly the same methodology as OpenAI proposes, albeit a more diversified one: I agree with your criticism that OpenAI is too narrowly focused on some sort of computationalism, to the detriment of perspectives from psychology, neuroscience, biology, etc. BTW, it seems that DeepMind is more diversified in this regard.
It’s hard to imagine a more general and capability-demanding activity than doing good (superhuman!) science in such an absurdly cross-disciplinary field as AI safety (and among the disciplines involved, there are those that are notoriously not very scientific yet: psychology, sociology, economics, the study of consciousness, ethics, etc.). So if there is an AI that can do that but still doesn’t count as AGI, I don’t know what the heck ‘AGI’ should even refer to. Compare with chess, which is a very narrow problem that can be formally defined and doesn’t require the AI to operate with any science (or world models) whatsoever.
There are many more interventions that might work on decades-long timelines that you didn’t mention:
Collective intelligence/sense-making/decision-making/governance/democracy innovation (and its introduction into organisations, communities, and societies at larger scales), such as https://cip.org
Innovation in social network technology that fosters better epistemics and social cohesion rather than polarisation
Innovation in economic mechanisms to combat the deficiencies and blind spots of free markets and the modern money-on-money return financial system, such as various crypto projects, or https://digitalgaia.earth
Fixing other structural problems of the internet and money infrastructure that exacerbate risks: too much interconnectedness, too much centralisation of information storage, traceless money, as I explained in this comment. Possible innovations: https://www.inrupt.com/, https://trustoverip.org/, other trust-based (cryptocurrency) systems.
Other infrastructure projects that might address certain risks, notably https://worldcoin.org, albeit this is a double-edged sword (could be used for surveillance?)
OTOH, fostering better interconnectedness between humans, and between humans and computers, primarily via brain-computer interfaces such as Neuralink. (Also, I think that in the mid to long term, a human-AI merge is the only viable “good” outcome, for humanity at least.) However, this is a double-edged sword (could be used by AI to manipulate humans or quickly take over humans?)
AI safety is a field concerned with preventing negative outcomes from AI systems and ensuring that AI is beneficial to humanity.
This is a bad definition of “AI safety” as a field, which muddies the waters somewhat. I would say that AI safety is a particular R&D branch (plus we can add here meta and proxy activities for this R&D field, such as AI safety fieldbuilding, education, outreach and marketing among students, grantmaking, and platform development such as what apartresearch.com is doing) within the gamut of activity that strives to “prevent the negative result of civilisational AI transition”.
There are also other sorts of activity that strive for that more or less directly, some of which are also R&D (such as governance R&D (cip.org), and R&D in cryptography, infosec, and internet decentralisation (trustoverip.org)), and others that are not R&D: good old activism and outreach to the general public (StopAI, PauseAI), good old governance (policy development, the UK foundational model task force), and various “mitigation” or “differential development” projects and startups, such as Optic, Digital Gaia, Ought, social innovations (I don’t know of any good examples as of yet, though), and innovations in education and psychological training of people (I don’t know of any good examples as of yet). See more details and ideas in this comment.
It’s misleading to call this whole gamut of activities “AI safety”. It’s maybe “AI risk mitigation”. By the way, 80,000 Hours, despite properly calling the problem “Preventing an AI-related catastrophe”, also suggests that the only two ways to apply one’s efforts to this cause are “technical AI safety research” and “governance research and implementation”, which is wrong, as I demonstrated above.
Somebody may ask, isn’t technical AI safety research a more direct and more effective way to tackle this cause area? I suspect that it might not be the case for people who don’t work at AGI labs. That is, I suspect that independent or academic AI safety research might be inefficient enough (at least for most people attempting it) that it would be more effective for them to apply themselves to various other activities and “mitigation” or “differential development” projects of the kind described above. (I will publish a post that details the reasoning behind this suspicion later, but for now this comment has the beginning of it.)
From the AI “engineering” perspective, values/valued states are “rewards” that the agent assigns to itself in order to train (in RL style) its reasoning/planning network (i.e., generative model) to produce behaviours that are adaptive but also that it likes and finds interesting (aesthetics). This RL-style training happens during conscious reflection.
Under this perspective, but also more generally, you cannot distinguish between intrinsic and instrumental values because intrinsic values are instrumental to each other, but also because there is nothing “intrinsic” about self-assigned reward labels. In the end, what matters is the generative model that is able to produce highly adaptive (and, ideally, interesting/beautiful) behaviours in a certain range of circumstances.
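To make the analogy concrete, here is a toy sketch (my own illustration, not a formal model from the comment; the behaviour repertoire and reward numbers are assumed) of “RL-style training on self-assigned reward labels during reflection”:

```python
import numpy as np

rng = np.random.default_rng(0)
behaviours = ["a", "b", "c", "d"]                        # hypothetical behaviour repertoire
logits = np.zeros(len(behaviours))                       # parameters of the generative model
self_assigned_rewards = np.array([0.1, 0.9, 0.3, 0.5])   # the agent's own "value" labels

for _ in range(2000):                                    # reflection as simulated runs
    probs = np.exp(logits) / np.exp(logits).sum()
    b = rng.choice(len(behaviours), p=probs)             # imagine a behaviour
    r = self_assigned_rewards[b]                         # label it with a self-assigned reward
    baseline = probs @ self_assigned_rewards             # variance-reduction baseline
    grad_log_prob = -probs
    grad_log_prob[b] += 1.0                              # gradient of log softmax w.r.t. logits
    logits += 0.1 * (r - baseline) * grad_log_prob       # REINFORCE-style update

final = np.exp(logits) / np.exp(logits).sum()
print({beh: round(float(p), 2) for beh, p in zip(behaviours, final)})
```

The point of the sketch is only that the “reward” here is not handed down by the environment: it is a label the agent itself attaches to imagined behaviours, and the “training” shapes the very model that generates those behaviours.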
I think your confusion about the ontological status of values is further corroborated by this phrase from the post: “people are mostly guided by forces other than their intrinsic values [habits, pleasure, cultural norms]”. Values are not forces, but rather inferences about some features of one’s own generative model (that help to “train” this very model in “simulated runs”, i.e., conscious analysis of plans and reflections). However, the generative model itself is effectively the product of environmental influences, development, culture, physiology (pleasure, pain), etc. Thus, ultimately, values are not somehow distinct from all these “forces”, but are indirectly (through the generative model) derived from these forces.
Under the perspective described above, valuism appears to switch the ultimate objective (“good” behaviour) for “optimisation of metrics” (values). Thus, there is a risk of Goodharting. I also agree with dan.pandori who noted in another comment that valuism pretty much redefines utilitarianism, whose equivalent in AI engineering is RL.
You may say that I suggest an infinite regress, because how is “good behaviour” determined, other than through “values”? Well, as I explained above, it couldn’t be through “values”, because values are our own creation within our own ontological/semiotic “map”. Instead, there could be the following guides to “good behaviour”:
Good old adaptivity (survival) [roughly corresponds to so-called “intrinsic value” in expected free energy functional, under Active Inference]
Natural ethics, if it exists (see the discussion here: https://www.lesswrong.com/posts/3BPuuNDavJ2drKvGK/scientism-vs-people#The_role_of_philosophy_in_human_activity). If “truly” scale-free ethics couldn’t be derived from basic physics alone, there is still the evolutionary/game-theoretic/social/group stage on which we can look for an “optimal” ethical arrangement of the agent’s behaviour (and, therefore, of the values that should help to train these behaviours), whose “optimality”, in turn, is derived either from adaptivity or aesthetics at the higher system level (i.e., the group level).
Aesthetics and interestingness: there are objective, information-theoretic ways to measure these, see Schmidhuber’s works. Also, this roughly corresponds to “epistemic value” in expected free energy functional under Active Inference.
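For concreteness, a common decomposition of the expected free energy of a policy in the Active Inference literature (the term labels below follow one common convention; terminology varies across papers, so this is a reference sketch rather than the exact source the list above has in mind) separates a preference-satisfying term from an information-gain term, which is roughly the mapping alluded to in the first and third items above:

```latex
G(\pi) \;\approx\;
\underbrace{-\,\mathbb{E}_{Q(o \mid \pi)}\big[\ln P(o \mid C)\big]}_{\text{pragmatic term (preferences, adaptivity)}}
\;\underbrace{-\,\mathbb{E}_{Q(o \mid \pi)}\Big[D_{\mathrm{KL}}\big(Q(s \mid o, \pi)\,\Vert\,Q(s \mid \pi)\big)\Big]}_{\text{epistemic term (expected information gain)}}
```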
If the “ultimate” objective is the physical behaviour itself (happening in the real world), not abstract “values” (which appear only in the agent’s mind), I think Valuism could be cast as any philosophy that emphasises the creation of a “good life” and “right action”, such as Stoicism, plus some extra emphasis on reflection and meta-awareness, albeit I think Stoicism already puts significant emphasis on these.
[...] we are impressed by [...] ‘Eliciting Latent Knowledge’ [that] provided conceptual clarity to a previously confused concept
To me, it seems that ELK is (was) attention-captivating (among the AI safety community) but doesn’t rest on a solid basis of logic and theories of cognition and language, and is therefore actually confusing, which prompted at least several clarification and interpretation attempts (1, 2, 3). I’d argue that most people leave the original ELK writings more confused than they were before. So, I’d classify ELK as a mind-teaser and maybe a problem statement (maybe more useful than distracting, or maybe more distracting than useful; it’s hard to judge as of now), but definitely not as great “conceptual clarification” work.
AI romantic partners will harm society if they go unregulated
This is a more general form of the whataboutism that I considered in the previous section. We are not talking just about some abstract “traditional option”, we are talking about the total fertility rate. I think everybody agrees that it’s important: conservatives and progressives, long-termists and politicians.
If the argument is that childbirth (full families, and parenting) is not important because we will soon have artificial wombs, which, in tandem with artificial insemination and automated systems for child rearing from birth through adulthood, will give us a “full-cycle automated human reproduction and development system” and make the traditional mode of human being (relationships and kids) “unnecessary” for realising value in the Solar system, then I would say: OK, let’s wait until we actually have an artificial womb and then reconsider AI partners (if we get to do it).
My “conservative” side would also say that AI partners (and even AI friends/companions, to some degree!) will harm society because they would reduce total human-to-human interaction and culture transfer, and may ultimately precipitate an intersubjectivity collapse. However, this is a much less clear story for me, so I’ve left it out and don’t oppose AI friends/companions in this post.
Harris and Raskin talked about the risk that AI partners will be used for “product placement” or political manipulation here, but I’m sceptical about this. These AI partners will surely have a subscription business model rather than a freemium model, and, given how important user trust will be for these businesses, I don’t think they will try to manipulate their users in this way.
More broadly speaking, values will surely change; there is no doubt about that. The very value of “human connection” and “human relationships” is eroded by definition if people are in AI relationships. A priori, I don’t think value drift is a bad thing. But in this particular case, this value change will inevitably go along with a reduction of the population, which is a bad thing (according to my ethics, and the ethics of most other people, I believe).
Alien values are guaranteed unless we explicitly impart non-alien ethics to AI, which we currently don’t know how to do, and we don’t know (or can’t agree on) what that ethics should be like. The next two points are synonyms of each other and are also basically synonymous with “alien values”. The treacherous turn is indeed unlikely (link).
Self-improvement is a given; the only question is where the “ceiling” of this improvement lies. It might not be that “far”, by some measure, from human intelligence, or the difference may still not allow AI to plan that far ahead, due to the intrinsic unpredictability of the world. So the world may start to move extremely fast (see below), but the horizon of planning and predictability of that movement may not be longer than it is now (or it could be even shorter).
I think you implicitly underestimate the cost of coordination among humans. Huge corporations are powerful but also very slow to act. AI corporations will be very powerful and also very fast and potentially very coherent in their strategy. This will be a massive change.