What's Wrong with Reinforcement Learning
On Homeostatic Learning, Differentiated Rewards, and Curious Exploration
Reinforcement Learning (RL) has chalked up a number of successes in the past few years through its use in fine-tuning LLMs for human consumption (RLHF, RLAIF) and in catalyzing the DeepSeek moment (RLVR). As a result, there is a general opinion that just as backprop had to wait three decades for improvements in computation to make it useful, so too RL simply had to wait for its moment. However, while the recent successes of RL in LLMs are impressive, they primarily result from the fact that language models provide a useful vehicle for exploration that happens to interact well with the RL algorithm, as I'll discuss below.
RL was originally intended to train agents, especially robots, to interact with their environments. For this task, RL has proved woefully insufficient, at least so far. Here, I’ll outline four reasons why RL has failed to deliver in the robotics setting, including:
Maximizing lifetime reward is not a natural principle for autonomy.
Reinforcement does not distinguish bad goals from bad plans.
Policy Gradient RL in particular is marred by noisy gradient estimates.
RL provides no mechanism to discover good action trajectories.
Organisms Seek Balance, Not Maximal Lifetime Rewards
The basic premise of RL is that an agent seeks to maximize its lifetime rewards. But, at least in my case, I am more interested in autonomy: how to build robots that can act and think independently to perform useful tasks. For this purpose, robots need to set and achieve goals. But what goals should they set? For this, let’s look at biology.
Real organisms do not seek to maximize lifetime rewards. Rather, they seek to maintain homeostasis. That is, they seek to keep certain quantities between lower and upper bounds. When blood sugar drops, you feel hunger. When your stomach stretches, you feel full. When you are hungry you are motivated to seek food, and when you become full you cease to eat. These drives are regulated by built-in chemo-sensitive neural pathways monitored by the hypothalamus in the brain. I call this system of built-in drives the motivational system.
The motivational system includes basic things like hunger, thirst, excretion, etc., but also higher-order drives such as sociality, reproduction, curiosity, and fear. Each of these has a different biological pathway, but they are primarily located below the neocortex, especially in the hypothalamus, amygdala, and the surrounding grey matter. For the most part, they also operate on a homeostatic principle. Take sociality: for many of us, if we spend too much time around people, we need some time alone. If we spend too much time alone, we just want to be around people. Thus the principle of learning should revolve around knowing how to keep motivations within acceptable upper and lower limits.
In many if not most cases, when a person seeks rewards without limit, this is regarded as pathological. If a person cannot stop drinking alcohol, we call them an alcoholic. If a person eats too much, they become obese. A person who drinks too much water may in fact die. A person who never stops needing sociality is insecure. And so on.
Now, we can rescue reinforcement learning through a concept of dynamic or contextual rewards: an action that triggers a reward at one time, such as eating a large meal when hungry, may not trigger a reward under different conditions, as when full after eating. The term homeostatic reinforcement learning has been introduced to describe this kind of setting (see e.g. Keramati & Gutkin, 2014, Yoshida et al., 2024). But another way to think of it is that an agent should be able to change goals depending on its current needs. It is the differentiation and selection of these goals, and not the rewards, that enables learning for autonomy.
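As a rough illustration, here is a minimal Python sketch of a contextual, homeostasis-based reward. The variables, setpoints, and drive function are invented for the example (loosely following the drive-reduction formulation in the homeostatic RL literature cited above); the point is that the same action earns a positive or a negative reward depending on the agent's current internal state.

```python
# Hypothetical internal variables and setpoints, invented for illustration.
SETPOINTS = {"blood_sugar": 90.0, "hydration": 0.7, "social_contact": 0.5}

def drive(state):
    """Total 'drive' = squared distance of each internal variable from its setpoint."""
    return sum((state[k] - SETPOINTS[k]) ** 2 for k in SETPOINTS)

def homeostatic_reward(state_before, state_after):
    """Drive-reduction reward: positive if an action moves the internal state
    toward its setpoints, negative if it pushes the state further away."""
    return drive(state_before) - drive(state_after)

# Eating when hungry (blood sugar low) is rewarded...
hungry = {"blood_sugar": 60.0, "hydration": 0.7, "social_contact": 0.5}
fed    = {"blood_sugar": 85.0, "hydration": 0.7, "social_contact": 0.5}
print(homeostatic_reward(hungry, fed))      # positive: drive reduced

# ...but the same meal when already sated is punished.
sated   = {"blood_sugar": 100.0, "hydration": 0.7, "social_contact": 0.5}
stuffed = {"blood_sugar": 125.0, "hydration": 0.7, "social_contact": 0.5}
print(homeostatic_reward(sated, stuffed))   # negative: drive increased
```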
Bad Goal or Bad Plan?
Another fundamental problem with RL is the inability to distinguish different kinds of problems with control policies. For example, suppose I am hungry, and I reach out to take a plastic apple from a tabletop display. This action will not satisfy hunger and should generate a negative reward (aka, a punishment). But the reaching action succeeded, and hence deserves a positive reward, whereas the choice of target was wrong and should be punished. In this case, the agent chose a bad goal but a good plan. RL cannot distinguish between the two, and hence will fail to reinforce the good plan in this case.
Human brains have differentiated reward pathways. There are at least five different reward pathways that are activated in different situations:
Value: Rewards are triggered when a target provides the expected value, and punishments when the value is less than expected. The neural pathway (actor → critic → target) is OFC, vmPFC, BLA, LH → medial VTA, PBP → OFC, vmPFC, BLA, NAc core.
Outcome: Rewards are triggered when an action plan does what it purported to do (e.g., reaching out to take the apple), and punishments when the plan fails (e.g., the apple is out of reach, or is dropped). The neural pathway is dlPFC, PMC, MC, dorsal striatum → lateral VTA, SNc → dlPFC, dorsal striatum.
Uncertainty: Rewards are triggered when new things are observed and uncertainty is resolved. Neural pathway: ACC, Sup. Colliculus, Hippocampus, BNST → lateral VTA, RRF → dmPFC, ACC, NAc shell, Hippocampus.
Danger: Punishments are triggered when danger is perceived, resulting in avoidance behavior. Neural pathway: Amygdala, PAG, LHb, dorsal raphe → ventral VTA+ → BLA, CeA, PAG, BNST.
Advantage: Rewards are triggered when expectations are correct, and punishments when expectations are wrong. Neural pathway: PFC, Amygdala, Hypothalamus, Insula → m/l VTA → NAc core, NAc shell.
This kind of system is very different from reinforcement learning. Although it shares many features — including reward-based reinforcements, baseline expectations, anticipatory reward prediction (Q-Learning) — there is no single reward that is being tracked and maximized over a time horizon. Instead, rewards are provided along different facets or dimensions and used to train separate neural systems.
At the center of this reward system is the ventral tegmental area (VTA), which assesses whether an outcome met expectations. If expectations are exceeded, then rewards are given. If the results disappoint, then VTA doles out punishments to the relevant systems. Different parts of the VTA supervise different reward pathways. So, a bad goal can be punished while a good plan to achieve the bad goal is rewarded. In RL terms, the VTA calculates the advantage, that is, the difference A(t) = R(t) - B(t) between the reward R(t) and its expected baseline value B(t).
In order to determine expectations, the VTA references the nucleus accumbens (NAc), which is responsible for generating the baseline expectations using input from all major systems, including the whole prefrontal cortex (PFC), the amygdala, and the hypothalamus. When that expectation is wrong, the VTA will punish the NAc to update the expectations (pathway #5 above).
Bad goals trigger the value pathway (#1), which trains the orbitofrontal cortex (OFC) and ventromedial prefrontal cortex (vmPFC) to learn the worth of perceived objects. Bad plans trigger the outcome pathway (#2), which trains the dorsolateral prefrontal cortex (dlPFC) to generate better plans. The novelty / curiosity circuit (#3) involves the hippocampus (spatial memory), the anterior cingulate cortex (ACC), and the dorsomedial prefrontal cortex (dmPFC). And of course, the amygdala, especially the basolateral amygdala (BLA), is trained to perceive danger and anticipate pain (#4).
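To make the plastic-apple example concrete, here is a minimal sketch of differentiated reward channels. It is my own simplification, not a model of the actual circuitry: a separate signal is computed for the value of the chosen goal and for the outcome of the motor plan, so a bad goal can be punished while the successful plan that reached it is still reinforced.

```python
from dataclasses import dataclass

@dataclass
class RewardChannels:
    value: float    # did the chosen target deliver the expected value? (value pathway, #1)
    outcome: float  # did the motor plan accomplish what it claimed? (outcome pathway, #2)

def evaluate(expected_value, delivered_value, plan_succeeded):
    """Return separate reward signals rather than one scalar reward."""
    value_signal = delivered_value - expected_value     # negative if the goal was bad
    outcome_signal = 1.0 if plan_succeeded else -1.0    # positive if the plan worked
    return RewardChannels(value=value_signal, outcome=outcome_signal)

# Reaching for the plastic apple while hungry: the reach succeeds (good plan),
# but the "food" delivers no value (bad goal).
print(evaluate(expected_value=1.0, delivered_value=0.0, plan_succeeded=True))
# RewardChannels(value=-1.0, outcome=1.0)
```

Each channel would then update a different module: value errors train goal selection, while outcome errors train the planner.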
By differentiating reward pathways, neural learning induces different functions in different neural regions. The effect should be an exponential speedup in learning over unidimensional reinforcement learning. And of course, there are other non-reward-based local learning algorithms active throughout the brain as well.
Implementing some of these ideas should result in an improved learning framework that outperforms bare RL.
The Variance Problem
Mathematically, reinforcement learning has serious problems as well, problems that may be partially solved by some of the biological systems listed above.
One of these is the variance problem. RL seeks to learn a policy π_θ that maximizes the expected return

J(θ) = E_{τ ∼ π_θ} [ Σ_t R(t) ],

where τ is a trajectory of states and actions generated by running the policy. A policy is a probability distribution over possible actions given prior actions and states, that is, a way of choosing what to do next in a given situation with a given history. Policy gradient algorithms, based on the REINFORCE algorithm from the 1990s (Sutton et al., 1999), ascend the derivative of the above quantity:

∇_θ J(θ) = E_{τ ∼ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) R(t) ],

where R(t) is the return collected from time t onward. This derivative is estimated using Monte Carlo methods. That is, we run simulations using the policy and average the results.
The problem is that simulations can get wildly different results, which means that the estimate of the derivative has high variance. That in turn means that a large number of simulations are needed to get an accurate estimate. Using the formula above to update a neural network policy is therefore quite noisy. As a result, it takes a lot of updates to get the neural network to learn.
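The variance is easy to see numerically. Here is a small numpy sketch (a toy two-armed bandit of my own construction, not any particular benchmark) that computes the REINFORCE estimate from a handful of rollouts and shows how much it fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reinforce_gradient(theta, n_rollouts):
    """Monte Carlo estimate of dJ/dtheta for a two-armed bandit policy.

    The policy picks arm 1 with probability sigmoid(theta); arm 1 pays about 1,
    arm 0 pays about 0, both with noise. The score-function (REINFORCE)
    estimator averages grad-log-prob * reward over the rollouts."""
    p = sigmoid(theta)
    arm1 = rng.random(n_rollouts) < p
    rewards = np.where(arm1, 1.0, 0.0) + rng.normal(0.0, 2.0, n_rollouts)
    grad_log_prob = np.where(arm1, 1.0 - p, -p)
    return np.mean(grad_log_prob * rewards)

# Repeat the 10-rollout estimate many times: the spread of the estimates
# exceeds the true gradient itself (0.25 at theta = 0).
estimates = [reinforce_gradient(theta=0.0, n_rollouts=10) for _ in range(1000)]
print("mean:", np.mean(estimates), "std:", np.std(estimates))
```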
Swapping out the reward R(t) for the advantage A(t) = R(t) - B(t), as discussed in the last section, reduces the variance of the Monte Carlo estimator without introducing bias, but it does not eliminate the variance. Many of the biggest advancements in policy gradient research involve learning better and better baseline expectations (GAE, PPO, GRPO).
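For instance, GRPO drops the learned value function and instead uses a group-relative baseline: several rollouts are sampled from the same state (or prompt), and each reward is normalized against the group's mean and standard deviation. A minimal sketch of that normalization step, simplified from the published method:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: baseline each rollout against its own group.

    `rewards` are the returns of several rollouts generated from the same
    state/prompt by the current policy; the group mean acts as B(t)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rollouts better than the group average get positive advantages, worse get negative.
print(group_relative_advantages([1.0, 0.0, 0.0, 2.0]))
```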
The core problem of gradient variance, however, remains a hindrance to learning good policies with reinforcement learning.
The Exploration Conundrum
The biggest problem with reinforcement learning can be stated thus: you cannot reinforce what you have not observed. In order to learn by reinforcement, you must have action sequences that provide significant rewards. Nothing in the formulation of the reinforcement learning problem guarantees that all possible good actions will be observed. In fact, the most obvious and simplistic approach, randomly choosing actions one at a time, makes good action policies, which involve temporally correlated behavior, statistically unlikely, and can cause RL training to take exponentially long.
This task of proposing good action sequences that should be reinforced is known as exploration, since the agent must explore the space of possible action plans to find the good plans. One distinguishes between on-policy RL, where exploration is done using the policy that is being trained with RL, and off-policy RL, where exploration and action proposal are handled by a different control policy. From my perspective, most successful RL methods seem to be off-policy. In general, on-policy exploration tends to hyperfixate early on a particular type of solution, which hinders the discovery of new, more rewarding action trajectories.
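One concrete illustration of the temporal-correlation point: exploration noise that is independent at every step tends to cancel itself out, while correlated noise, such as the Ornstein-Uhlenbeck process used in some continuous-control methods like DDPG, produces sustained excursions in action space. The sketch below is generic and not tied to any of the methods discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def iid_noise(n_steps, scale=0.3):
    """Independent perturbation at every step: jittery, rarely composes into
    a coherent, temporally extended behavior."""
    return rng.normal(0.0, scale, n_steps)

def ou_noise(n_steps, theta=0.15, sigma=0.3):
    """Ornstein-Uhlenbeck noise: each perturbation is correlated with the last,
    giving smooth, sustained excursions in action space."""
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        x[t] = x[t - 1] - theta * x[t - 1] + sigma * rng.normal()
    return x

# On average, correlated noise drifts much farther from its starting point,
# so the agent actually commits to extended behaviors instead of dithering.
iid_drift = np.mean([abs(iid_noise(200).cumsum()[-1]) for _ in range(100)])
ou_drift = np.mean([abs(ou_noise(200).cumsum()[-1]) for _ in range(100)])
print("typical i.i.d. drift:", iid_drift)
print("typical OU drift:    ", ou_drift)
```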
The use of large language models (LLMs) has revolutionized RL to some degree because the LLM, trained on vast amounts of data, turns out to yield great exploration policies. Thus for RL with human feedback (RLHF), we can use a base LLM to generate candidate prompt responses that will be judged by a human, and these candidates will generally be both diverse and of high quality from the outset. This is fundamentally why RL has been so effective at refining LLMs, including RLHF, RLAIF (AI feedback), and RLVR (verified rewards, where the LLM writes code that can be tested and verified).
LLMs might help solve the exploration problem for more general autonomous robot control. The vision-language-action (VLA) paradigm uses a vision+language model (VLM) to translate commands such as “pick up the red ball” into low-level robot actions (Brohan 2023, Physical Intelligence 2025, Bjorck 2025). The original versions generated motor commands directly from the LLM, but more recent versions (including Sharpa Craftnet) do learn a control policy. However, these algorithms do not typically use reinforcement learning, but instead opt for (self-)supervised learning from teleoperation examples or datasets of robot or human behaviors. The reason VLAs don't commonly use RL is precisely the exploration problem: there is presently no good way to generate high-quality exploratory trajectories.
Furthermore, using a language model that is trained on trillions of human words and equally capable of discussing quantum field theory or fashion trends in the 1680s just to teach a robot to fold paper seems a bit like using a sledgehammer to kill a gnat.
It is not completely clear how to develop better exploration trajectories. One research approach that did show some early promise was curiosity-based exploration. Varun Kompella, an old labmate of mine in Juergen Schmidhuber's lab, wrote his 2015 Ph.D. thesis “Curious Dr. MISFA” on a method for controlling a humanoid robot based on learning to produce behaviors that caused structural changes to the environment. Work like this could be revisited or extended, leading to new methods of exploration that are directed towards temporally correlated, semantically meaningful action trajectories.
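As one generic member of this family (a simple prediction-error curiosity bonus, not the specific method from that thesis), an intrinsic reward can be attached to transitions that a learned forward model fails to predict, which pushes exploration toward parts of the environment the agent does not yet understand:

```python
import numpy as np

class CuriosityBonus:
    """Intrinsic reward = prediction error of a simple linear forward model.

    Transitions the model already predicts well pay little bonus; surprising
    transitions (e.g. structural changes in the environment) pay a large one."""

    def __init__(self, state_dim, action_dim, lr=0.01):
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def reward(self, state, action, next_state):
        x = np.concatenate([state, action])
        error = next_state - self.W @ x
        # Online update so familiar transitions gradually stop paying a bonus.
        self.W += self.lr * np.outer(error, x)
        return float(np.sum(error ** 2))

bonus = CuriosityBonus(state_dim=2, action_dim=1)
s, a, s_next = np.array([0.0, 1.0]), np.array([0.5]), np.array([0.5, 1.0])
print(bonus.reward(s, a, s_next))            # surprising at first: large bonus
for _ in range(500):
    bonus.reward(s, a, s_next)               # ...seen many times...
print(bonus.reward(s, a, s_next))            # ...the bonus has shrunk toward zero
```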
Conclusion
Traditional reinforcement learning has major inadequacies that make it undesirable as the learning paradigm for autonomous systems. An improved paradigm might feature:
Differentiated reward pathways for value prediction, plan outcomes, uncertainty reduction, danger assessment, and advantage baselining.
A homeostatic learning scenario with contextualized reward supervision.
New methods for curiosity-driven exploration of semantically structured action trajectories.
Such an approach could still be meaningfully referred to as reinforcement learning, though it would be structured differently from most historical RL methods.
It will be exciting to see how we can develop new approaches of this type!
