Executive summary: This section distinguishes between two concepts of an “episode” in machine learning training—the intuitive episode and the incentivized episode. The intuitive episode is a natural unit of training (e.g. a game), while the incentivized episode is the period of time over which training directly pressures the model to optimize.
Key points:
The incentivized episode is the period over which training actively punishes the model for not optimizing. It may be shorter than the full training period.
The intuitive episode is a natural unit picked for training (e.g. a game). It is not necessarily the same as the incentivized episode.
Care is needed in assessing whether the intuitive episode matches the incentivized episode, i.e. whether training incentivizes cross-episode optimization.
Some training methods directly pressure cross-episode optimization, while others do not; the details of the training algorithm matter.
Conflating the two concepts can lead to inappropriate assumptions about incentivized time horizons. Empirical testing is important.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.