I work at Netflix on the recommender. It’s interesting to read this abstract article about something that’s very concrete for me.
For example, the article asks, “The key question any model of the problem needs to answer is—why aren’t recommender systems already aligned.”
Despite working on a recommender system, I genuinely don’t know what this means. How does one go about measuring how much a recommender is aligned with user interests? Like, I guarantee 100% that people would rather have the recommendations given by Netflix and YouTube than a uniform random distribution. So in that basic sense, I think we are already aligned. It’s really not obvious to me that Netflix and YouTube are doing anything wrong. I’m not really sure how to go about measuring alignment, and without a measurement, I don’t know how to tell whether we’re making progress toward fixing it.
My two cents.
I’m not sure users definitely prefer the existing recommendations to random ones; I have actually been trying to turn off YouTube recommendations because they make me spend more time on YouTube than I want. Meanwhile, other recommendation systems send me news that is worse on average than the rest of the news I consume (from different channels). So in some cases at least, we could use a very minimal standard: a system is aligned if the user is better off because the recommendation system exists at all.
This is a pretty blunt metric, and probably we want something more nuanced, but at least to start off with it’d be interesting to think about how to improve whichever recommender systems are currently not aligned.
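To make that blunt metric a bit more concrete, here is a rough sketch of the kind of test I have in mind: compare a holdout arm (no recommendations, or random ones) against the recommender on some outcome the user would actually endorse. Everything below (names, numbers, the wellbeing measure) is made up for illustration, not a claim about how any real system measures this.

```python
# Rough sketch of the minimal bar above: is the user better off because the
# recommender exists at all? All names, numbers, and the wellbeing measure are
# hypothetical; a real test would need a carefully chosen outcome and sample size.
from dataclasses import dataclass
from statistics import mean


@dataclass
class UserOutcome:
    arm: str            # "recommender" or "holdout" (no recommendations, or random ones)
    wellbeing: float    # any outcome the user would endorse: survey score, retention, etc.


def alignment_gap(outcomes: list[UserOutcome]) -> float:
    """Positive means users do better with the recommender than without it."""
    treated = [o.wellbeing for o in outcomes if o.arm == "recommender"]
    control = [o.wellbeing for o in outcomes if o.arm == "holdout"]
    return mean(treated) - mean(control)


outcomes = [
    UserOutcome("recommender", 3.8), UserOutcome("recommender", 4.1),
    UserOutcome("holdout", 3.2), UserOutcome("holdout", 3.5),
]
print(alignment_gap(outcomes))  # > 0 clears the minimal bar; <= 0 fails it
```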
Thanks for sharing your perspective. I find it really helpful to hear reactions from practitioners.
I just added a comment above which aims to provide a potential answer to this question—that you can use “approaches like those I describe here (end of the article; building on this which uses mini-publics)”. This may not directly get you something to measure, but it may be able to elicit the values needed for defining an objective function.
You provide the example of this very low bar:
I guarantee 100% that people would rather have the recommendations given by Netflix and YouTube than a uniform random distribution. So in that basic sense, I think we are already aligned.
The goal here would be to scope out what a much higher bar might look like.
Thanks for raising this. I appreciate that specification is hard, but I think there’s a broader lens on ‘user interests’ that gives more acknowledgement to the behavioural side.
What users want in one moment isn’t always the same as what they might endorse in a less slippery behavioural setting or upon reflection. You might say this is a human problem, not a technical one. True, but we can design systems that help us optimize for our long-term goals, and that is a different task from optimizing for what we click on in a given moment. Sure, it’s much harder to specify, but I think the user research can be done. Thinking about the user more holistically could open up new innovations too. Imagine a person has watched several videos in a row about weight loss, and rather than keeping them on the couch longer, the system learns to respond with good nudges: it prompts them to get up and go for a run, reminds them of their personal goals for the day (because it has such integrations), messages their running buddy, closes itself (and has nice configurable settings with good defaults), or advertises joining a local running group (right now the local running group could not afford the advert, but in a world where recommenders weight ad quality to somehow include the long-term preferences of the user, that might be different).
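As a purely illustrative sketch of that hypothetical (none of these integrations exist, and the topics, nudges, and streak threshold are all invented), the behaviour might look something like this:

```python
# Purely illustrative sketch of the hypothetical nudge behaviour imagined above;
# the topics, nudges, and streak threshold are invented, and no such integrations exist.
import random

NUDGES = [
    "Prompt: get up and go for the run you planned today?",
    "Reminder: here are your personal goals for the day",
    "Offer: message your running buddy (with their consent)",
    "Action: close the app after this video",
    "Ad: a local running group meets near you this week",
]


def maybe_nudge(recent_topics: list[str], streak: int = 3) -> str | None:
    """If the last few videos share one topic, offer a nudge instead of more of the same."""
    if len(recent_topics) >= streak and len(set(recent_topics[-streak:])) == 1:
        return random.choice(NUDGES)
    return None


print(maybe_nudge(["weight loss", "weight loss", "weight loss"]))  # some nudge
print(maybe_nudge(["cooking", "weight loss"]))                     # None
```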
I understand the frustration around measurement; the task is harder than just optimising for views and clicks (not just technically, but also in aligning with the company’s bottom line). However, I do think little steps towards better specification can help, and I’d love to read future user research on it at Netflix.
Sorry, I don’t think I addressed the measurement issue very well, and I assumed your notion of user interests meant simply optimizing for views, when maybe it doesn’t. I still think that through user research you can learn to develop good measures. For example: surveys; cohort tests (e.g. if you discount ratings over time within a viewing session, to down-weight lower-agency views, do you see changes such as users searching more instead of just letting autoplay run?); or checking whether there is a relationship between how much a user feels Netflix is improving their life (in a survey) and how much they are sucked in by autoplay. Learning these higher-order behavioural indicators can help give users a better long-term experience, if that’s what the company optimizes for.
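To make the cohort-test idea concrete, here is a rough sketch of the kind of within-session discounting I mean; the decay constant and the searched-versus-autoplayed split are placeholders, not a claim about how any real system works:

```python
# Sketch of the within-session discounting described above: views deeper in an
# autoplay chain count for less, on the assumption that they reflect less agency.
# The decay constant and the searched/autoplayed flag are placeholders.

def weighted_session_value(views: list[tuple[bool, float]], decay: float = 0.7) -> float:
    """views: (was_searched, rating) pairs in the order they happened in the session."""
    total, weight = 0.0, 1.0
    for was_searched, rating in views:
        if was_searched:
            weight = 1.0        # an active search resets to full weight
        total += weight * rating
        weight *= decay         # each autoplayed follow-on counts for less
    return total


# One deliberate pick followed by three autoplays:
session = [(True, 4.0), (False, 3.5), (False, 3.0), (False, 4.5)]
print(weighted_session_value(session))       # later autoplays contribute less
print(weighted_session_value(session, 1.0))  # decay of 1.0 recovers the plain sum
```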
Absolutely. A few comments:
Stated preference (uplifting documentaries) and revealed preference (reality TV crime shows) are different; a toy sketch of this gap follows after these comments.
Asking people for their preference is quite difficult—only a small fraction of Netflix users give star ratings or thumb ratings. In general, users like using software to achieve their immediate goals. It’s tough to get them to invest time and skill into making it better in the future. For most people, each app is a tiny tiny slice of their day and they don’t want to do work to optimize anything. Customization and user controls often fail because no one uses them.
If serving recommendations according to stated preferences causes people to unsubscribe more, how should we interpret that? That their true preference is to not be subscribed to Netflix? It’s unclear.
In any case, Netflix is financially incentivized to optimize for subscriptions, not viewing. So if people pay for what they want, then Netflix ought to be aligned with what they want. Netflix is only misaligned with what people want if people’s own spending is not aligned with what they want (theoretically).
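To tie back to the stated vs. revealed preference point above, here is a toy sketch of how that gap might show up in data; the user, genres, and hours are all invented for illustration:

```python
# Toy sketch of the stated vs. revealed preference gap; the user, genres, and
# hours are invented purely for illustration.
from collections import Counter

stated_prefs = {"user_1": {"documentary", "drama"}}                         # survey answers
watch_hours = {"user_1": Counter({"reality_crime": 12, "documentary": 1})}  # viewing logs


def preference_gap(user: str, top_n: int = 2) -> list[str]:
    """Genres the user actually watches most but never stated as a preference."""
    revealed = [genre for genre, _ in watch_hours[user].most_common(top_n)]
    return [genre for genre in revealed if genre not in stated_prefs[user]]


print(preference_gap("user_1"))  # ['reality_crime']
```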