Worrisome misunderstanding of the core issues with AI transition
This post was prompted by “Generative AI dominates Davos discussions as companies focus on accuracy” (CNBC) and “AI has a trust problem — meet the startups trying to fix it” (Sifted).
It’s just remarkable (and worrying) how business leaders and journalists misunderstand the core issues with AI adoption and transition[1]. All they talk about is “accuracy”, “correctness”, and “proving that AI is actually right”(!). The second piece has a hilarious passage: “Cassar says this aspect of AI systems creates a trust issue because it goes against the human instinct to make ‘rule-based’ decisions.”(!)
There are many short- and medium-term applications where this “rule-following and accuracy” framing of the issue is correct, but they are all, by necessity, about automating and greasing bureaucratic procedures and formal compliance with rule books: filing tax forms, checking compliance with the law, etc. But these applications are not intrinsically productive, and on a longer time scale, they may lead to a Jevons effect: the cheaper bureaucratic compliance becomes, the more it is demanded, without actually making coordination, cooperation, and control more reliable and safe.
“Factual accuracy” and hallucinations are the lowest-hanging pieces of context alignment
Taking the viewpoints of information theory[2], philosophy of language, and institutional economics[3], it’s not the sophistication of bureaucracies that reduces the cumulative risk exposure and transaction costs of the interaction between humans, AIs, and organisations. Rather, it’s building shared reference frames (a shared language) for these agents to communicate about their preferences, plans, and risks. The sophistication of bureaucratic procedures sometimes does have this effect (new concepts are invented that increase the expressiveness of communication about preferences, plans, and risks), but this is only an accidental byproduct of the bureaucratisation process. Seen this way, making AIs use language effectively to communicate with humans and with each other is not an “accuracy” or “factual correctness” problem; it is the context (meaning, intent, outer) alignment problem.
Indeed, this is the core problem that Perplexity, Copilot, Bard, OpenAI, and other universal RAG helpers are facing: alignment with users’ context on a hierarchy of timescales: pre-training[4], fine-tuning[5], RAG dataset curation, and online alignment through dialogue with the user[6][7]. Preventing outright hallucinations is just the lowest-hanging part of this problem. And “aligning LLMs with human values” is hardly a part of this problem at all. Perhaps the fact that this kind of “value alignment” is surprisingly ineffective at combating jailbreaks is evidence that jailbreaks expose the deeper problem: misunderstanding of the user’s context (and therefore of the user’s intent, which, from the enactivist perspective, lives in the coupling between the user and their environment/context).
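To make the distinction concrete, here is a minimal, purely illustrative sketch of what “online context alignment through dialogue” could look like, in contrast with post-hoc accuracy checking. Everything in it (the UserContext structure, infer_user_context, the 0.8 confidence threshold) is an assumption made up for the example, not any product’s actual API:

```python
"""
A minimal sketch of "online context alignment through dialogue", as
opposed to checking factual accuracy after the fact. All names and the
0.8 threshold are illustrative assumptions, not any product's real API.
"""
from dataclasses import dataclass, field
from typing import List

@dataclass
class UserContext:
    utterances: List[str] = field(default_factory=list)  # what the user has told us so far
    confidence: float = 0.0  # how well we believe we understand the user's intent

def infer_user_context(query: str, ctx: UserContext) -> UserContext:
    # Stand-in for inference over the dialogue so far (in practice an LLM
    # plus retrieval over the user's documents). Here, confidence simply
    # grows as the user supplies more detail.
    ctx.utterances.append(query)
    ctx.confidence = min(1.0, ctx.confidence + 0.4)
    return ctx

def respond(query: str, ctx: UserContext) -> str:
    ctx = infer_user_context(query, ctx)
    if ctx.confidence < 0.8:
        # The context-alignment move: ask, don't guess. A merely "accurate"
        # system would answer the literal question and miss the intent.
        return "Before I answer: can you say more about what you're trying to do?"
    return f"Answering, taking {len(ctx.utterances)} pieces of user context into account."

if __name__ == "__main__":
    ctx = UserContext()
    print(respond("Fix my report", ctx))  # asks a clarifying question first
    print(respond("It's a quarterly tax filing for a UK partnership", ctx))  # now answers
```

The only point of the sketch is that, under uncertainty, the system’s first move is to repair its model of the user’s context by asking, rather than to produce a maximally “accurate” answer to a question it has not yet understood.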
Then, as far as scale-free biology and Active Inference agency are concerned[8][9][10], there is no difference between understanding a context and aligning with that context, and hence we get the Waluigi effect, which can only be addressed at the meta-cognitive level (output filters, meta-cognitive dialogue, and other approaches). Therefore, sharing arbitrarily capable “bare” LLMs as open source is inherently risky, and there is no way to fix this with pre-training or fine-tuning. Humans have evolved to have obligatory meta-cognition for a good reason!
Real “safe and reliable reasoning” is compositional reasoning and provably correct computing
It’s richer language, better context alignment, and better capacities for (compositional, collective) reasoning, bargaining, planning, and decision-making[11] that make the economy more productive and civilisation (and Gaia) safer at the end of the day, not “better bureaucracies”. To a degree, we can also think of bureaucracies as scaffolding for “better reasoning, bargaining, planning, and decision-making”. There is a grain of truth in this view, but again, nobody currently thinks about bureaucracies, rule books, and compliance in this way; any such benefit arises only as an accidental side-effect of bureaucratisation.
In this sense, making LLMs “accurate” and “correct” followers of some formal rules hardly moves the needle on reasoning correctness (accuracy). The right agenda for improving the correctness and accuracy of reasoning is scaffolding it in (or delegating it to) more “traditional” computing paradigms, symbolic and statistical: algorithms written in probabilistic programming languages (calling out to NN modules) or other neurosymbolic frameworks, together with mathematical proofs of correctness for those algorithms[12] (a toy sketch of this scaffolding pattern follows the two points below). The last two miles of safety on this agenda would be:
Proving properties of the NN components themselves, by treating them as humongous but precise statistical algorithms, to rule out some forms of deceptive alignment[13] (see the second sketch below), and
Generating proofs of correctness and tamper-safety[12] for the hardware that is going to run the above software.
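Here is a toy, hedged sketch of the scaffolding pattern described above: an unreliable neural proposer wrapped in a deterministic symbolic verifier, so that only exactly-checked candidates are ever returned. propose_candidates is a hypothetical stand-in for a call to an LLM or another NN module; the problem itself (solving a linear equation in exact rational arithmetic) is deliberately trivial:

```python
"""
Toy sketch of scaffolded reasoning: an unreliable neural proposer wrapped
in a deterministic symbolic verifier. `propose_candidates` is a hypothetical
stand-in for an LLM/NN call; only verified candidates are ever returned.
"""
from fractions import Fraction
from typing import Callable, Dict, Iterable, Optional

def propose_candidates(problem: Dict[str, int]) -> Iterable[Dict[str, Fraction]]:
    """Hypothetical NN proposer. Enumerates guesses so the sketch runs;
    in practice this would sample solutions from an LLM or NN module."""
    for x in range(-10, 11):
        yield {"x": Fraction(x)}

def verify(problem: Dict[str, int], candidate: Dict[str, Fraction]) -> bool:
    """Exact symbolic check: does a*x + b == c hold in rational arithmetic?
    The verifier, not the proposer, carries the correctness guarantee."""
    a, b, c = (Fraction(problem[k]) for k in ("a", "b", "c"))
    return a * candidate["x"] + b == c

def scaffolded_solve(
    problem: Dict[str, int],
    proposer: Callable[[Dict[str, int]], Iterable[Dict[str, Fraction]]] = propose_candidates,
    checker: Callable[[Dict[str, int], Dict[str, Fraction]], bool] = verify,
) -> Optional[Dict[str, Fraction]]:
    """Return the first proposed candidate the checker accepts, or None.
    Nothing the proposer outputs is trusted without verification."""
    for candidate in proposer(problem):
        if checker(problem, candidate):
            return candidate
    return None

if __name__ == "__main__":
    # Solve 3x + 1 = 10 exactly; only a verified answer is printed.
    print(scaffolded_solve({"a": 3, "b": 1, "c": 10}))  # -> {'x': Fraction(3, 1)}
```

The correctness burden sits entirely on the verifier, which is small enough to prove things about; the NN only supplies candidates.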
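And a second toy sketch of the simplest flavour of “proving the NN components themselves”: interval bound propagation over a tiny ReLU network, which yields a sound (if loose) guarantee on the output for a whole box of inputs. The weights are made up for the example, and this is of course nowhere near what ruling out deceptive alignment would require; it only illustrates treating an NN as a precise mathematical object rather than an oracle:

```python
"""
Toy illustration of proving a property of an NN component: interval bound
propagation over a tiny 2-layer ReLU network with made-up weights. The
resulting output interval is a sound mathematical bound, not an estimate.
"""
from typing import List, Tuple

Interval = Tuple[float, float]

def affine_bounds(W: List[List[float]], b: List[float],
                  x: List[Interval]) -> List[Interval]:
    """Sound bounds for y = W x + b given interval-valued inputs."""
    out = []
    for row, bias in zip(W, b):
        lo = bias + sum(w * (xl if w >= 0 else xu) for w, (xl, xu) in zip(row, x))
        hi = bias + sum(w * (xu if w >= 0 else xl) for w, (xl, xu) in zip(row, x))
        out.append((lo, hi))
    return out

def relu_bounds(x: List[Interval]) -> List[Interval]:
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in x]

if __name__ == "__main__":
    # A made-up 2-2-1 network; inputs are known only up to intervals.
    W1, b1 = [[1.0, -2.0], [0.5, 1.5]], [0.0, -1.0]
    W2, b2 = [[1.0, -1.0]], [0.5]
    x = [(-0.1, 0.1), (0.9, 1.1)]
    h = relu_bounds(affine_bounds(W1, b1, x))
    y = affine_bounds(W2, b2, h)
    # Every input in the box is guaranteed to map into this output interval.
    print(y)  # -> [(-0.2, 0.2)] up to floating-point rounding
```

Scaling this kind of guarantee from a toy network to humongous models is exactly the open “last mile” referred to above.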
The bottom line: AI safety = context alignment + languages and protocols + provably correct computing + governance mechanisms, incentive design, and immune systems
AI safety =
Context alignment throughout pre-training[4], fine-tuning[5], and online inference[6] +
Languages and protocols for context alignment and (collective) reasoning (negotiation, bargaining, coordination, planning, decision-making) about preferences, plans, and risk bounding to make them (alignment and reasoning) effective, precise, and compositional[11] +
Provably correct computing +
(Not covered in this post) governance mechanisms, incentive design, and immune systems to negotiate and encode high-level, collective preferences, goals, and plans and ensure that the collective sticks to the current versions of these.[14]
Note that “control”, “rule following” (a.k.a. bureaucratisation), “trust”, and “value alignment” are not parts of the above decomposition of the problem of making beneficial transformative AI (cf. Davidad’s AI Neorealism). They in some sense emerge or follow from the interaction of the components listed above.
In general, I’m a methodological pluralist and open to the idea that the “control” and “value alignment” frames capture something about AI safety and alignment that is not captured by the above decomposition. Still, I think what they capture is small and not commensurate with the attention share that these frames receive from the public, key decision-makers, and even the AGI labs and the AI safety community. This misallocation of attention is ineffective and could also instil dangerous overconfidence, deluding decision-makers and the public about the actual progress on AI safety and the risks of the AI transition.
And even granting that, bureaucratisation specifically is probably just net harmful.
“Trust”, while important from the sociotechnical perspective and for optimal adoption of the technology, should not lead us to oversimplify algorithms and concepts just so that people can understand them: this would only increase the “alignment tax”, would ultimately be futile, and is unnecessary anyway if we have mathematical proofs of the correctness of protocols and algorithms. So, I think that to address the trust issue, AI developers and the AI community will ultimately need to educate decision-makers and the public about the difference between “trust in science” (context alignment) and “trust in math” (algorithms and computing): stay vigilant about the former, and don’t unduly question the latter.
I realise that business leaders may not be interested in this problem either, but then it’s our problem, i.e., the AI (safety) community’s and the public’s, to push businesses to recognise it, or else they will externalise the risks onto all of us.
Footnotes
Fields, C., Fabrocini, F., Friston, K., Glazebrook, J. F., Hazan, H., Levin, M., & Marcianò, A. (2023). Control flow in active inference systems. OSF Preprints. https://doi.org/10.31219/osf.io/8e4ra
Khan, M., & Wiblin, R. (2021). Mushtaq Khan on using institutional economics to predict effective government reforms [Podcast episode]. The 80,000 Hours Podcast.
This is what OpenAI’s Superalignment agenda is about.
This is what Stuart Armstrong and Rebecca Gorman’s Aligned AI seems to tackle.
Who’s working on this?
This online inference is usually called “in-context learning” for LLMs, though note that the meaning of the word “context” in this phrase is very different from the meaning of “context” in the quantum free energy principle[8] and in information theory.
Fields, C., Friston, K., Glazebrook, J. F., & Levin, M. (2022). A free energy principle for generic quantum systems. Progress in Biophysics and Molecular Biology, 173, 36–59. https://doi.org/10.1016/j.pbiomolbio.2022.05.006
Pezzulo, G., Parr, T., Cisek, P., Clark, A., & Friston, K. (2023). Generating Meaning: Active Inference and the Scope and Limits of Passive AI [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/8xgzv
Fields, C., & Levin, M. (2020). How Do Living Systems Create Meaning? Philosophies, 5. https://doi.org/10.3390/philosophies5040036
Open Agency Architecture, Gaia Network, Friston et al.’s Ecosystems of Intelligence, Infra-Bayesianism, and probably Conjecture’s CoEms are the agendas that I’m aware of that approach the design of effective, precise, and compositional (collective) reasoning languages and protocols.
Tegmark, M., & Omohundro, S. (2023). Provably safe systems: The only path to controllable AGI (arXiv:2309.01933). arXiv. https://doi.org/10.48550/arXiv.2309.01933
However, I don’t know how to deal with evolutionary and dynamic NN architectures, such as Liquid.ai.
Governance mechanisms should also include secession protocols for hedging against value lock-in and meta-ethical opportunity cost, but this is far outside the scope of this post.