The Off-Policy Theory of Happiness
Why the metrics we use to evaluate decisions are not the ones we should use to make them.
When I was a sophomore in college, I realized something for the first time. My parents had never told me: “Son, we just want you to be happy.” It seemed like everyone else’s parents had told them that whatever they did, it was okay as long as it made them happy. At first, I was taken aback. Did my parents not care about my happiness?
But that didn’t seem like the right explanation. Their actions certainly reflected a concern for my happiness. And it’s not like they were forcing me to study something just because it would lead to a job that was prestigious or met some conventional definition of success. In fact, my personal happiness seemed like a huge priority for them.
The more I thought about this, the more I realized that it wasn’t out of a disregard for my happiness—but because they held a different theory about how to achieve it.
Our society has two fundamental beliefs about happiness: (1) that we can become more happy, and (2) that more happy is a desirable thing to be. And yet—I think most people would agree, when you frame it this way, that one of the most efficient ways to become less happy is to spend a great deal of time worried about your own happiness.
This presents a bit of a riddle. We all want to be happy. But the key to pursuing it is... not pursuing it. How do you get more of something without trying to get more of it?
One of the most successful frameworks used in modern artificial intelligence is called reinforcement learning. The basic idea is simple. If an action leads to a good outcome, do it again; if it leads to a bad outcome, don’t. It is a definition of intelligence that places reward maximization at its heart. Simply put, there is something that you want, and intelligent behavior consists in getting as much of it as possible.
At the core of reinforcement learning is what’s known as a “policy.” For example, if your agent is a robot that plays basketball, then its reward comes in the form of points. The more baskets the robot makes, the more points it gets. The more likely it is to win the game, the more intelligently it behaved. The policy is the robot’s playbook.
A policy says, in mathematical abstraction, “This is where I am right now. This is what I have to do next to maximize my points.” In basketball, a good policy might be to get the ball, dribble it toward the basket, and toss in a lay-up. Each time the robot does this, it looks at how effective it was in getting points, and adjusts its behavior to do better next time. The robot might start off bad, but using reinforcement learning it could become better over time.
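To make the "try it, see how it went, adjust" loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two plays, their success probabilities, the exploration rate, and the function name `train_policy` are all hypothetical, and a real basketball robot would face a far richer problem. The robot keeps a running estimate of how many points each play earns, usually picks the play it currently rates highest, and occasionally tries the other one.

```python
import random

def train_policy(success_prob, episodes=5000, epsilon=0.1, seed=0):
    """Estimate the average points of each play by trial and error.

    success_prob: hypothetical chance that each play scores a 2-point basket.
    epsilon: fraction of the time the robot picks a play at random
             instead of the one it currently rates highest.
    """
    rng = random.Random(seed)
    values = {play: 0.0 for play in success_prob}  # estimated points per play
    counts = {play: 0 for play in success_prob}    # times each play was tried

    for _ in range(episodes):
        if rng.random() < epsilon:
            play = rng.choice(list(values))        # explore: pick at random
        else:
            play = max(values, key=values.get)     # exploit: best estimate so far

        reward = 2.0 if rng.random() < success_prob[play] else 0.0

        # Nudge the estimate toward the observed reward (running average).
        counts[play] += 1
        values[play] += (reward - values[play]) / counts[play]

    return values

# Made-up numbers: lay-ups go in 60% of the time, jumpers 40%.
plays = {"layup": 0.6, "jumper": 0.4}
values = train_policy(plays)
best_play = max(values, key=values.get)
print(values, best_play)
```

The "policy" here is just the rule "pick the play with the highest estimate." It starts out knowing nothing and improves only because good outcomes raise a play's estimate and bad outcomes lower it.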
The hard part is that there’s no direct way of knowing the best policy. You have to try out different policies, and figure out which one is most effective. Is the best policy to drive toward the basket? Or should you sit back and shoot jumpers? How do you know which is going to work out better next time around? Will the same policy work against a different opponent?
In general, there are two strategies for how to learn a policy.
The first is called on-policy. It’s the more straightforward of the two strategies. On-policy means that the robot uses the same information to make decisions and evaluate whether or not they were the right ones. Basically, it’s going to make decisions based on what it thinks will most quickly increase its number on the scoreboard. If its current policy says to drive toward the basket and that results in a lot of points right away, then it’ll be more likely to keep going with that same plan in the future.
The second strategy is called off-policy. This means that the robot is using different information to make decisions than it is to evaluate them. The robot could make decisions based on, for instance, its time of possession on the ball. Or how many passes it completes before taking a shot. Or how little time it lets the opponent spend in its own half. At the end of the game, it could then look back at its play based on the different policies and see if focusing on something else actually made it more likely to win.
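For readers who want to see how this distinction usually appears in code, here is a small sketch. It rests on an assumption: the essay names no specific algorithms, so the sketch uses the two standard textbook examples, SARSA (on-policy) and Q-learning (off-policy). In that setting the difference is whether the robot scores its next position by the move it will actually make, or by the best move available regardless of what it actually does. The state name, the moves, and the numbers are hypothetical.

```python
def sarsa_target(q, reward, next_state, next_action, gamma=0.9):
    """On-policy target: judge the next state by the action actually taken there."""
    return reward + gamma * q[next_state][next_action]

def q_learning_target(q, reward, next_state, gamma=0.9):
    """Off-policy target: judge the next state by its best available action,
    whatever the robot ends up doing."""
    return reward + gamma * max(q[next_state].values())

# Hypothetical value estimates for one position on the court.
q = {"three_point_line": {"drive": 1.0, "shoot": 0.5}}

# Suppose the robot's behavior in that position is to shoot.
on_policy = sarsa_target(q, reward=0.0, next_state="three_point_line",
                         next_action="shoot")
off_policy = q_learning_target(q, reward=0.0, next_state="three_point_line")

print(on_policy)   # 0.45: evaluated by what the robot actually does
print(off_policy)  # 0.9: evaluated by the best move it could have made
```

The two targets agree only when the robot's actual behavior is already the best available move. Whenever they differ, the robot is behaving one way and evaluating itself another way, which is the gap the essay's analogy builds on.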
At first, it might seem like the better strategy is always going to be on-policy. How could you achieve your goal more effectively by focusing on something totally irrelevant? But that’s not necessarily true. The empirical fact in artificial intelligence research is that some problems are better solved by off-policy methods.
And framed in another way, this makes total sense. It’s actually a pretty neat solution to Goodhart’s Law, which says that “once a measure becomes a target, it ceases to be a good measure.” How do you prevent a good measure from becoming a target? Simple. Pick something else as the target. Sometimes the best way to attain a goal is indirectly.
So, is happiness better pursued with an on-policy strategy, or an off-policy one?
It depends on what you mean by “happiness.”
Psychologists often distinguish between two kinds of happiness. The first is “subjective” well-being. This is a straightforward definition of hedonism: in any given moment, do you find yourself pleased with what you have? The second kind is “eudaimonic” well-being. This is an idea that goes back to Aristotle, that well-being is about a cultivation of virtue and skill in service of contributing to the larger, more worthy goals of humanity. The first kind of well-being is a vertical cross-section of happiness in the moment; the second takes a broader view of overall life-satisfaction. In a Christian context, this is often thought of as the difference between happiness and joy.
A recent study led by happiness research kingpin Kennon Sheldon put these two kinds of well-being to the on-policy/off-policy test. Do people who pursue subjective well-being feel more subjective well-being? What about for eudaimonic well-being?
Sheldon and his co-authors gave people questionnaires to measure both their subjective and eudaimonic well-being—how happy they were feeling in the moment, according to both definitions. They also gave participants a questionnaire to measure how motivated they were to change these respective aspects of their well-being. If an on-policy strategy works well, then you’d expect that the more motivated they are in their pursuit, the higher they’ll score on current well-being.
And for the pursuit of eudaimonic well-being, that’s exactly what they found. The more people prioritized eudaimonic well-being, the more of it they felt. But the pursuit of subjective well-being? It was also correlated with subjective well-being, but in the opposite direction! The more they claimed to pursue it, the less of it they had overall.
But here’s the true test of the off-policy strategy: Did the students who were more motivated to pursue eudaimonic well-being also achieve more subjective well-being as a by-product? Of course they did! The correlation was even stronger than with achieving eudaimonic well-being itself. There was no such by-product effect for the pursuit of subjective well-being.
Now, it should be said that this is a correlational study. The authors ran it on mostly white, college-age students enrolled in a psychology course. So this study isn’t the final word in the discussion. But it’s nonetheless a clear test of on-policy versus off-policy approaches. If the on-policy strategy worked for subjective well-being, then pursuing more of it should lead to getting more of it.
As Sheldon and his co-authors put it in another paper: “the pursuit of happiness involves trying out different kinds of goals, values, behaviors, and activities, to determine which ones bring one satisfaction and happiness. Ironically (and reassuringly), the best happiness boosting behaviors tend be the ones that focus on long-term self-improvement and on deepening connections with others, just as most lay and eudaimonic theories of ‘a life well-lived’ have long proposed.”
Naturally, as Sheldon alludes to, these researchers weren’t the first to make this observation. My favorite characterization of the off-policy theory of happiness comes from John Stuart Mill, in a passage from his Autobiography:
Ask yourself whether you are happy, and you cease to be so. The only chance is for you to have as your purpose in life not happiness but something external to it. Let your self-consciousness, your scrutiny, your self-interrogation, exhaust themselves on that; and if you are otherwise fortunately circumstanced you will inhale happiness with the air you breathe, without dwelling on it or thinking about it, forestalling it in imagination, or putting it to flight by fatal questioning.
The reason, then, that my parents never told me to pursue happiness directly was that they, like Mill, believe in an off-policy approach to happiness. When someone tells you that you should “do what makes you happy,” they’re advocating for an on-policy approach: making decisions and evaluating them by the same metric. That’s exactly what my parents didn’t want me to do. And while my parents didn’t learn this from reading Mill, the surprising thing about this position on happiness is that it is shared, in some version or another, by practically every other philosopher who has weighed in on the matter.
Another of my favorites comes from Bertrand Russell. He says more or less the same thing as Mill, but with a certain flair of nonchalance in contrast to Mill’s solemn weightiness. Russell writes in The Conquest of Happiness: “Fundamental happiness depends more than anything else upon what may be called a friendly interest in persons and things.” He continues, “let your interests be as wide as possible, and let your reactions to the things and persons that interest you be as far as possible friendly rather than hostile.”
Oh, and by the way—happiness isn’t the only thing the off-policy strategy applies to!
Earlier this week, I published a podcast episode on the perils of gamification, featuring game designer Adrian Hon. “Gamification” is the term for when we assign points to our actions in real life in an effort to incentivize doing the right actions by maximizing our points. For example, learning a language via Duolingo. Learning a new language is hard, because it usually involves a lot of boring vocab learning. Duolingo uses game-inspired features like experience points, leaderboards, and streak counts to make the drudgery of vocabulary learning more fun. And the trend in modern life is toward increased gamification: from our fitness goals, to our diets, to the number of books we read.
In his latest book, Adrian argues that the problem arises when we start to care more about the fictional game elements (for instance, keeping our streak alive) than we do about our original goal (learning the language). The reason Adrian thinks we should be concerned about this is that, psychologically speaking, this problematic shift can happen pretty quickly.
Gamification starts off as an off-policy approach to learning. We make our immediate decisions according to what’s going to get us the most points, or hit our goal of 10,000 steps in a day, or get us the most likes on a Tweet. Then in the long run this will translate into having learned a language, or maintaining our overall health and fitness, or the feeling that people are listening to us and care about what we’re doing. But at some point, we shift into on-policy learning. We forget why we started in the first place. We’re just after the points.
Neither on- nor off-policy learning is a one-size-fits-all approach. For example, what if our basketball player decided to try an off-policy strategy based on maximizing possession? Maybe it would just take the ball to the corner and hang onto it for dear life. That’s not going to help you win the game any more than naively dribbling the ball straight toward the basket. The off-policy approach may help you avoid Goodhart’s Law. But it’s possible for it to lead you too far astray, as well.
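The possession example can be written out as a toy calculation. This is only a sketch under loud assumptions: the three strategies and every number attached to them are invented for illustration, not measured from anything. The point is simply to choose by one metric and then check the choice against the metric you actually care about.

```python
# Hypothetical strategies with made-up numbers, for illustration only.
strategies = {
    "hold ball in corner": {"possession_seconds": 24.0, "expected_points": 0.0},
    "pass then shoot": {"possession_seconds": 15.0, "expected_points": 1.1},
    "drive straight to basket": {"possession_seconds": 5.0, "expected_points": 0.7},
}

def pick(options, metric):
    """Return the name of the strategy that scores highest on the given metric."""
    return max(options, key=lambda name: options[name][metric])

chosen_by_proxy = pick(strategies, "possession_seconds")  # what we decide by
chosen_by_goal = pick(strategies, "expected_points")      # what we evaluate by

# How many points the proxy-driven choice gives up per possession.
points_lost = (strategies[chosen_by_goal]["expected_points"]
               - strategies[chosen_by_proxy]["expected_points"])

print(chosen_by_proxy)  # hold ball in corner
print(chosen_by_goal)   # pass then shoot
print(points_lost)
```

With these invented numbers, deciding purely by possession picks the corner-hugging strategy, which scores nothing. A proxy only helps if you keep looking back at the real goal to check that the two are still moving together.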
That’s why I like framing personal goals in terms of meaningfulness. Framed that way, it’s a lot easier not to get lost in your own pursuit of personal happiness. As Mill says, focus on something larger than yourself. One day you’ll wake up to realize that you inhale happiness with the air you breathe.
Thank you! Love your work.
I am reminded of the apocryphal tale of an Alcan CEO who made the firm more profitable by focusing on worker safety rather than on output.