Hey, cool stuff! I have ideated and read a lot on similar topics and proposals. Love to see it!
Is the “Thinking Tools” concept worth exploring further as a direction for building a more trustworthy AI core?
I am agnostic about whether you will hit technical paydirt. I don’t really understand what you are proposing on a “gears level,” I guess, and I’m not sure I could make a good guess even if I did. But I will say that the vibe of your approach sounded pleasant and empowering. It was a little abstract to me, I guess I’m saying, but that need not be a bad thing; maybe you’re just visionary.
It reminds me of the idea of using RAG or Toolformer to get LLMs to “show their work” and “cite their sources” and stuff. There is surely a lot of room for improvement there bc Claude bullshits me with links on the regular.
This also reminds me of Conjecture’s Cognitive Emulation work, and even just Max Tegmark and Steve Omohundro’s emphasis on having inscrutable LLMs lean heavily on deterministic proof checkers to win back certain guarantees.
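(To gesture at what I mean by “lean on deterministic checkers”: this is purely my own toy sketch, not their actual proposal. The opaque model emits a structured derivation, and a small deterministic verifier recomputes each step and rejects anything that doesn’t check out, so the trust rests on the checker rather than on the model.)

```python
# Toy illustration (hypothetical, not any published system): the model proposes a
# structured arithmetic trace; a deterministic checker recomputes every step.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def check_derivation(steps):
    """Each step is (left_operand, op_symbol, right_operand, claimed_result)."""
    for a, op, b, claimed in steps:
        actual = OPS[op](a, b)
        if actual != claimed:
            return False, f"step {a} {op} {b}: claimed {claimed}, checker gets {actual}"
    return True, "all steps verified"

# Pretend this trace came from an LLM asked to show its work:
model_trace = [(2, "+", 3, 5), (5, "*", 7, 35), (35, "-", 4, 31)]
print(check_derivation(model_trace))
```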
Is the “LED Layer” a potentially feasible and effective approach to maintain transparency within a hybrid AI system, or are there inherent limitations?
I don’t have a clear enough sense of what you’re even talking about, but there are definitely at least some additional interventions you could run in addition to the thinking tools… e.g. monitoring, faithful-CoT techniques for marginally truer reasoning traces, probes on internal activations, the kind of classifier Anthropic runs to robustly block jailbreaks aimed at misuse, etc.
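(To make the “probes” item concrete: a linear probe is just a small classifier trained on a model’s hidden activations to flag some property you care about. The sketch below is purely illustrative; the activations and labels are random stand-ins, not from any real model.)

```python
# Hypothetical linear-probe sketch: random stand-in data in place of real activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n_examples = 512, 1000

# In a real setup these would be hidden states captured at one layer of the model,
# with labels gathered from human review of its outputs (e.g. "was this link real?").
activations = rng.normal(size=(n_examples, hidden_dim))
labels = rng.integers(0, 2, size=n_examples)

# The probe itself is just logistic regression on top of the activations.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("probe train accuracy:", probe.score(activations, labels))
```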
I think “defense in depth” is something like the current slogan of AI Safety. So, sure, I can imagine all sorts of stuff you could try to run for more transparency beyond deterministic tool use, but w/o a clearer conception of the finer points it feels like I should say that there are quite an awful lot of inherent limitations, but plenty of options / things to try as well.
Like, “robustly managing interpretability” is more like a holy grail than a design spec in some ways lol.
What are the biggest practical hurdles in considering the implementation of CCACS, and what potential avenues might exist to overcome them?
I think that a lot of what it is shooting for is aspirational and ambitious and correctly points out limitations in the current approaches and designs of AI. All of that is spot on and there is a lot to like here.
However, I think the problem of interpreting, and building appropriate trust in, complex learned algorithmic systems like LLMs is a tall order. “Transparency by design” is truly one of the great technological mandates of our era, but without more context it can feel like a buzzword, like “security by design”.
I think the biggest “barrier” I can see is just that this framing isn’t sticky enough to survive memetically, and people keep trying to do transparency, tool use, control, reasoning, etc. under different frames.
But still, I think there is a lot of value in this space, and you would get paid big bucks if you could even marginally improve the current ability to get trustworthy, interpretable work out of LLMs. So, y’know, keep up the good work!
Hello :) thank you for the thoughtful comment on my old post. I really appreciate you taking the time to engage with it, and you’re spot on—it was a high-level, abstract vision.
It’s funny you ask for the “gears-level” design, because I did spend a long time trying to build it out. That effort resulted in a massive (and honestly, monstrously complex and still naive/amateur) paper on the G-CCACS architecture (https://doi.org/10.6084/m9.figshare.28673576.v5).
However, my own perspective has shifted significantly since then. Really.
My current diagnosis, detailed in my latest work “Warped Wetware” (https://doi.org/10.6084/m9.figshare.29183669.v13) and in the article “The Engine of Foreclosure” (https://forum.effectivealtruism.org/posts/6be7xQHFREPYJKmyE/the-engine-of-foreclosure), is that the AI control problem is formally intractable. Not because we can’t design clever technical architectures, but because the global human system (I call it the “Distributed Human Optimizer”) is structurally wired to reject them. The evidence, from the 100:1 (or even 400+:1) capability-vs-safety funding gap to the failure of every governance paradigm we’ve tried, seems conclusive.
This has led me to a stark conclusion: focusing on purely technical solutions like G-CCACS, while intellectually interesting, feels dangerously naive until we confront these underlying systemic failures. The best blueprint in the world is useless if the builders are locked in a race to the bottom.
That’s why my work has pivoted entirely to the Doctrine of Material Friction—pragmatic, physical interventions designed to slow the system down rather than “solve” alignment. Your point about “memetic stickiness” was incredibly sharp, and it’s even more of a challenge for this grimmer diagnosis.
Anyway, thanks again for the great feedback. It’s exactly the kind of clear-eyed engagement this field needs.