It’s not currently clear how to find training procedures that train “giving non-deceptive answers to questions” as opposed to “giving answers to questions that appear non-deceptive to the most sophisticated human arbiters” (more at Eliciting Latent Knowledge).
It also appears that the link to ELK in this section is incorrect.
Making use of an AI’s internal state, not just its outputs. For example, giving positive reinforcement to an AI when it seems likely to be “honest” based on an examination of its internal state (and negative reinforcement when it seems likely not to be). Eliciting Latent Knowledge provides some sketches of how this might look.
The link to ELK in this bullet point is broken.
It may be intended to point here: https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge
Very belatedly fixed—thanks!