I have never been satisfied by the “AI infers that it is simulated and changes its behavior” argument, because the root issue always seems to be that some information has leaked into the simulation. The problem shifts from “how do we prevent an AI from escaping a box?” to “how do we prevent information from entering the box?” This breaks down into three questions:
What information is communicated via the nature of the box itself?
What information is built into the AI itself?
What information is otherwise entering the box?
These questions seem relatively approachable compared to other avenues of AI safety research.
What other methods are there that would in principle allow iteration?
If it is true that “a failed AGI attempt could result in unrecoverable loss of human potential within the bounds of everything it can affect,” then our options are to (A) not fail or (B) limit the bounds of everything it can affect. In this sense, any strategy that hopes to allow for iteration is abstractly equivalent to a box, simulation, or sandbox, whatever you choose to call it.