Caprice as Exploration: The Underground Man as a Reinforcement Learning Agent

Sai Sourabh Madur

June 6, 2026

In Notes from the Underground, Dostoevsky’s narrator makes a claim that has irritated rationalists ever since: people do not reliably act in their own best interest. He builds his argument around the idea of “advantage” — the bundle of goods that supposedly governs human behavior. Advantage, on this account, means wealth, prosperity, honor, peace, and reason itself: the things every sensible person is assumed to want and to pursue. The Underground Man’s complaint is that if human beings really did organize their lives around these advantages, then we would be entirely predictable. Behavior would reduce to a kind of arithmetic. Give a person the optimal path to wealth or peace, and they would take it every time.

But they don’t. And the Underground Man insists this is not an accident or a failure of nerve. It is human nature to rebel against the very catalog of advantages that is supposed to define us. People will act against their own interests, spite themselves, choose the worse option simply because it is theirs to choose. What looks like irrationality, he argues, is actually the assertion of something the ledger of advantages cannot capture: independent will. And here is the turn that makes the book more than a tantrum — he claims that this capacity to act against one’s own advantage is itself the highest advantage of all. It is the unseen one, the most precious one, the advantage that no list of advantages can ever contain.

When I first read this, it struck me as deeply plausible, but for a reason Dostoevsky could not have had in mind. It maps almost exactly onto how adaptation works, whether you describe it in the language of evolutionary psychology or of reinforcement learning.

The primary advantages are an evolved policy

Start with the evolutionary reading. The “primary advantages” the Underground Man lists are, more or less, the things that improve reproductive fitness. Honor and status secure social standing. Prosperity and the accumulation of resources buffer against scarcity. Peace preserves the organism. Reason coordinates all of it. Over deep time, selection would have favored the dispositions that push us toward these goods — the instincts to gather resources, to compete for status, to improve our odds of finding and keeping mates. In other words, the drive toward the primary advantages is not arbitrary. It is the learned policy that evolution has handed down, refined across countless generations because it worked.

Exploitation versus exploration

But a learned policy that only does what has worked before has a fatal weakness, and this is where reinforcement learning makes the picture sharp. An RL agent faces a constant tension between exploitation and exploration. Exploitation means taking the action that, given everything learned so far, yields the highest expected reward. Exploration means trying something else, an action with no obvious payoff and possibly a real cost, for the sake of discovering whether the current strategy is really the best available. An agent that always exploits tends to lock onto whatever paid off early and never finds out what the untried actions were worth. It gets trapped in a local optimum, confident and stuck. A learning system that hopes to escape needs some exploration term, some willingness to deviate from the policy it already trusts.
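To make the tension concrete, here is a minimal sketch of an epsilon-greedy agent on a three-armed bandit. Everything in it is my own illustrative assumption rather than anything from Dostoevsky or a particular library: the function name run_bandit, the payoff values, and the choice to make payoffs noise-free so the behavior is easy to follow. With epsilon set to zero the agent only exploits; with a small positive epsilon it occasionally picks an arm at random.

```python
import random


def run_bandit(payoffs, epsilon, steps, seed=0):
    """Epsilon-greedy agent on a bandit with fixed, noise-free payoffs.

    payoffs: hypothetical reward for each arm (an assumption for this sketch).
    epsilon: probability of exploring (picking a uniformly random arm).
    Returns (pull_counts, value_estimates, total_reward).
    """
    rng = random.Random(seed)
    n_arms = len(payoffs)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms  # the agent starts out knowing nothing
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: ignore the current estimates entirely.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: take the best-looking arm (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = payoffs[arm]
        counts[arm] += 1
        # Incremental sample-average update of the arm's estimated value.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return counts, estimates, total


if __name__ == "__main__":
    payoffs = [0.2, 0.5, 0.8]  # the best arm is the last one
    print("greedy  :", run_bandit(payoffs, epsilon=0.0, steps=1000))
    print("explorer:", run_bandit(payoffs, epsilon=0.1, steps=1000))
```

In this toy setup the purely greedy agent tries the first arm, sees a positive payoff, and never pulls anything else. The exploring agent pays for its random pulls but should end up spending most of its time on the best arm.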

Caprice is the exploration term

Lay these two frames over each other and the Underground Man’s argument falls into place. The pursuit of the primary advantages — wealth, honor, peace, the rational accumulation of resources — is exploitation. It is the policy evolution optimized and passed down precisely because it reliably collects reward. The seemingly irrational behaviors are exploration: actions that bring no immediate advantage, that may even cut against survival, taken for no better reason than the assertion of one’s own will. Independence, contrariness, the refusal to be predictable, the impulse to act against the apparent laws of one’s own nature — these are the deviations from the trusted policy. They look like noise. They are noise. But they are the kind of noise a learning system cannot do without.

The crucial point is that this exploration cannot be optional, something a person decides to switch on. It has to be ingrained, built into the psychology at the level of disposition, because the environment is not stationary. The world does not hold still. Conditions shift, and a strategy that was optimal in one regime can become a trap in the next. In a changing environment, an agent with no exploration will sooner or later fall behind, because it keeps exploiting a model of a world that no longer exists. Without exploration wired into human psychology, we would never have adapted. We would be perfectly rational and perfectly stuck, optimized for a world that had already moved on.
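The non-stationary case can be sketched the same way. The toy below is again an assumption-laden illustration, not a model of anything real: the name run_shifting_bandit, the two-arm payoffs, the step size, and the single abrupt shift halfway through are all hypothetical choices of mine. The world starts out rewarding arm 0 and then flips to rewarding arm 1. The agent uses a constant step size so that recent rewards count for more than old ones.

```python
import random


def run_shifting_bandit(epsilon, steps_per_phase=500, step_size=0.1, seed=0):
    """Epsilon-greedy agent on a two-armed bandit whose payoffs flip once.

    Returns (rewards_by_phase, pulls_by_phase), one entry per phase.
    """
    rng = random.Random(seed)
    # Hypothetical payoffs before and after the world changes.
    phases = [[0.8, 0.2], [0.1, 0.9]]
    estimates = [0.0, 0.0]  # carried across the shift; the agent is not told
    rewards_by_phase = []
    pulls_by_phase = []
    for payoffs in phases:
        total = 0.0
        pulls = [0, 0]
        for _ in range(steps_per_phase):
            if rng.random() < epsilon:
                arm = rng.randrange(2)  # explore
            else:
                arm = 0 if estimates[0] >= estimates[1] else 1  # exploit
            reward = payoffs[arm]
            # Constant step size: old evidence fades, recent evidence dominates.
            estimates[arm] += step_size * (reward - estimates[arm])
            total += reward
            pulls[arm] += 1
        rewards_by_phase.append(total)
        pulls_by_phase.append(pulls)
    return rewards_by_phase, pulls_by_phase


if __name__ == "__main__":
    print("no exploration  :", run_shifting_bandit(epsilon=0.0))
    print("with exploration:", run_shifting_bandit(epsilon=0.1))
```

In this sketch the agent without exploration never pulls arm 1 at all, so after the flip it keeps collecting the small payoff from arm 0 while its estimate of arm 1 stays frozen at its starting value. The exploring agent earns no more than the greedy one before the shift, which is the cost of caprice, and it is the only one of the two that can notice the world has changed.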

So the irrationality the Underground Man celebrates is not a defect in the human design. It is the exploration term that keeps the species from collapsing into a brittle, frozen optimum. The quirky individual who throws away his obvious advantage, who insists on his own will against all reason, is not malfunctioning. He is running the exploration that the whole system depends on, even when — especially when — it costs him. Dostoevsky framed this as a defense of human freedom against the tyranny of reason. Read through reinforcement learning, it becomes something stranger and maybe deeper: the unseen advantage the Underground Man could name but not explain is exploration itself, and the caprice he so fiercely defended is exactly the mechanism by which a learning creature stays alive in a world that refuses to stop changing.

Cite this post

Sai Sourabh Madur (2026). Caprice as Exploration: The Underground Man as a Reinforcement Learning Agent. sourabhmadur.github.io. https://sourabhmadur.github.io/2026/caprice-as-exploration/

@misc{madur2026_caprice_as_exploration,
  author       = {Sai Sourabh Madur},
  title        = {Caprice as Exploration: The Underground Man as a Reinforcement Learning Agent},
  year         = {2026},
  howpublished = {\url{https://sourabhmadur.github.io/2026/caprice-as-exploration/}},
  publisher    = {sourabhmadur.github.io}
}