In the case of Confidential Monitoring: the mechanism seems to rely on the ability of the monitoring system to verify and aggregate signals about agents’ behavior. How does this remain robust in an environment where generative AI — especially with open-weight models — makes it cheap to produce plausible but hard-to-verify evidence?
What prevents such a system from gradually legitimizing synthetic signals, rather than filtering them out?
This is basically why I regard this as the most technically challenging of the things we're presenting here. You eventually want a system that is not just a passive consumer of data but can actively explore. You may need to give it access to robots with cameras and internet access so that it can verify some of the basics of its setup. It might still fear that the entire thing is being spoofed, but I think it's vastly harder to generate a plausible world that's robust to the agent exploring and running consistency probes.
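To make "consistency probes" concrete, here is a minimal sketch (not from the original discussion) of the intuition: the monitor issues queries the adversary can't predict across several independently sourced evidence channels and flags disagreement. Everything here is a hypothetical placeholder, including the simulated channels and the tolerance value; a real system would wire in actual sensors and feeds.

```python
import random

# Minimal sketch: a monitoring agent cross-checks independent evidence
# channels with probes an adversary cannot easily predict. The simulated
# channels below are hypothetical stand-ins for, e.g., a robot camera,
# a third-party API, and a direct web lookup.

def simulated_channel(true_value: float, noise: float, spoofed: bool):
    """Stand-in for one evidence source. A spoofed channel drifts from
    ground truth on queries the spoofer did not anticipate."""
    def read() -> float:
        bias = random.uniform(0.5, 1.5) if spoofed else 0.0
        return true_value + bias + random.gauss(0.0, noise)
    return read

def consistency_probe(channels, tolerance: float = 0.1) -> bool:
    """Return True if independently queried channels agree within
    tolerance. Faking one feed is cheap; keeping many independent
    feeds mutually consistent under active probing is much harder."""
    readings = [read() for read in channels]
    return max(readings) - min(readings) <= tolerance

if __name__ == "__main__":
    truth = 20.0  # e.g., the reading on a physical wall thermometer
    honest = [simulated_channel(truth, 0.02, spoofed=False) for _ in range(3)]
    one_fake = honest[:2] + [simulated_channel(truth, 0.02, spoofed=True)]
    print("all honest:", consistency_probe(honest))     # usually True
    print("one spoofed:", consistency_probe(one_fake))  # usually False
```

The design choice doing the work is independence plus unpredictability: any single channel can be synthesized cheaply, but a fabricated world has to stay mutually consistent across all of them for every probe the agent chooses to run.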
Thanks — grounding verification in physical reality makes sense. But most coordination problems these sketches address involve socially constructed states: commitments, contractual intent, whether a sequence of actions counts as compliance or evasion. These are mediated by language and interpretation, not camera-visible facts.
In that setting, doesn’t the monitoring layer risk becoming an interpretive laundering mechanism rather than a truth-tracking one — especially once open-weight models can cheaply produce plausible accounts that fit the system’s expected format?
You can have a smart system make inferences from camera-visible information.
But yeah, the main use case we had in mind for the monitoring layer was not about these very tricky-to-observe states, but expanding the space of things you can make agreements about (potentially including some high-stakes cases, as I write about at the end of this story: https://strangecities.substack.com/p/some-days-soon).