Know when to explore versus exploit

One of the best things about being a parent is seeing the world through your kid’s eyes. Things you have grown used to and don’t think twice about are suddenly new again. Questions you asked as a child and long ago forgot make you curious again. Like, why does water swirl down a drain? [Before you open the drain, the water already has some motion. It’s probably too slow for you to notice, but as the water moves toward the drain, its rotational motion is amplified and it swirls faster.]

Children aren’t little adults. A child’s brain is wired to explore. Childhood is when we gather the data and build the models we will continue to use for our whole lives. Pulling all your pots out of the cupboards? That’s data gathering. Dropping carrots onto the floor from the high chair? That’s exploring cause and effect. Asking “why?” a million times? That’s the drive for sensemaking.

As we live our daily lives, we have to decide whether to act on the information we know or whether to invest in collecting different information about the world. Every decision we make requires solving this dilemma.

A classic example of the explore/exploit dilemma is choosing a meal at your favorite restaurant. You know you love the pizza, so ordering it yields no new information but all but guarantees a good meal. Or you can choose the special. You’ll gain new information about the world but risk a meal that falls below your expectations.
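The restaurant choice above can be sketched as a tiny simulation. This is only an illustration under stated assumptions: it uses the epsilon-greedy rule, a common textbook strategy for this kind of problem rather than anything the essay prescribes, and the dish names, quality scores, and the `epsilon` parameter are all hypothetical.

```python
import random

def choose_dish(estimates, epsilon, rng):
    """Explore with probability epsilon; otherwise exploit the best-known dish."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))      # explore: pick any dish at random
    return max(estimates, key=estimates.get)    # exploit: pick the highest estimate

def simulate(true_quality, epsilon, meals, seed=0):
    """Eat `meals` times, updating a running-average estimate for each dish."""
    rng = random.Random(seed)
    estimates = {dish: 0.0 for dish in true_quality}
    counts = {dish: 0 for dish in true_quality}
    total = 0.0
    for _ in range(meals):
        dish = choose_dish(estimates, epsilon, rng)
        # Hypothetical noise model: enjoyment varies around the dish's true quality.
        enjoyment = rng.gauss(true_quality[dish], 1.0)
        counts[dish] += 1
        # Incremental mean: nudge the estimate toward the latest experience.
        estimates[dish] += (enjoyment - estimates[dish]) / counts[dish]
        total += enjoyment
    return estimates, counts, total

# Made-up scores: the pizza is reliably good, the special is an unknown.
estimates, counts, total = simulate({"pizza": 8.0, "special": 6.0}, epsilon=0.1, meals=50)
print(counts)
```

With `epsilon=0` the diner never orders anything but the current favorite, and with `epsilon=1` every order is a random pick. Values in between trade a few possibly disappointing meals for better information about the menu.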

According to Alison Gopnik, professor of psychology at UC Berkeley, nature resolves the explore/exploit dilemma by giving us a childhood. Gopnik says that, with the possible exception of language, no single psychological trait is unique to humans. Our uniqueness comes from our distinctive combination of brain and cognitive features.

We have long and expensive childhoods that afford us the time and protection to gather extraordinary amounts of information about the world. We use this time to develop our capacity for social interaction. We learn to read others’ intentions, to cooperate, and to learn through our cultures.

Children have a general capability for learning. Their brains are more plastic, which makes them more sensitive to a wide range of possibilities. Children not only have different computational and neuronal capacities than adults, they also have different motivations, emotions, and drives for action. They are more novelty-seeking, curious, and active.

But fun as this time is, we can’t stay children forever. As adults, we need different cognitive skills. We need attentional focus, the ability to inhibit certain behaviors, and the ability to plan and act in line with long-term goals. The explore/exploit dilemma explains why, as we get older, we are more likely to stick with what we know. We start to run out of time to put new information to use and we don’t want to waste the time we have left on bad experiences.

If there is a bias here, it’s to over-explore and underuse what we know. This likely has something to do with the fact that humans use the same reward circuits for many different things. The first worms appeared five hundred million years ago. These worms learned the location of food from only a few examples. They also learned what to avoid and what constituted a favorable environment. Learning works by reward-prediction error: the difference between received and predicted rewards. The brains of these worms used dopamine to signal that a behavior was useful and worth repeating.

Dopamine worked, so evolution kept it. A successful evolutionary strategy motivates the search for a soon-to-be-depleted substance by pairing a rising unpleasant feeling with the prediction of a pleasant one. Dopamine is a way to update our expectations about the world. We think: did we get what we thought we would? Are things going better (or worse) than expected?

The winning evolutionary strategy, wired deep in our nature, is to remember what happened whenever there’s a big surprise (either pleasant or unpleasant) and then use this memory to select a behavior when it happens again. If the reward prediction error is positive, neurons deliver a pulse of pleasure. If it’s negative, there’s no good feeling. As Sterling notes, “we can live without daily rewarding pulses of dopamine—but we may not want to.”

Because we use the same dopamine-powered reward circuits to serve diverse behaviors and learning, our happiness is fleeting. We use the dopamine signal to learn, so we can’t be permanently happy or pleasantly surprised or delighted. Learning is the process of developing an accurate prediction about the world. Sterling’s insight is that satisfaction cannot be stored because of the way it is generated. We have to seek out new information about the world. We seek novelty, we are naturally curious, and we favor exploration. We are restless by nature.
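To make the reward-prediction error described above concrete, here is a minimal sketch of how surprise fades once a reward becomes expected. It assumes a simple delta-rule update (the prediction moves a fixed fraction of the way toward each reward), which is a standard simplification and not a model taken from Sterling or Gopnik. The function name and the `learning_rate` parameter are hypothetical.

```python
def prediction_errors(rewards, learning_rate=0.5, initial_prediction=0.0):
    """Return the reward-prediction error for each reward in sequence."""
    prediction = initial_prediction
    errors = []
    for reward in rewards:
        error = reward - prediction           # received minus predicted reward
        errors.append(error)
        prediction += learning_rate * error   # update the expectation toward reality
    return errors

# The same reward, five times in a row: each repeat is less of a surprise.
print(prediction_errors([1.0] * 5))
# → [1.0, 0.5, 0.25, 0.125, 0.0625]
```

The positive error shrinks toward zero as the prediction becomes accurate, which is one way to picture why the same treat stops delighting us. A reward that fails to arrive after being predicted produces a negative error instead.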

Great Human Strength: We have the cognitive and emotional capacity to consciously decide whether to gather new information about the world, seek novelty, and try new things or whether to work with what we have and know best.

Great Human Weakness: We can be led to seek quick hits of pleasure, which can disrupt our short-term decision-making.

Machine Opportunity: Help us see places in the solution space that we can’t find on our own.

Machine Threat: Lead us wildly astray or keep us too constrained.