5. Reinforcement Learning

Reinforcement Learning (RL) is a subfield of machine learning concerned with how intelligent agents ought to take actions in an environment to maximize some notion of cumulative reward. An RL agent learns by interacting with an environment, receiving feedback in the form of rewards or penalties based on its actions.

Core Concepts

At its heart, Reinforcement Learning involves the following key components:

  • Agent: The entity that learns and makes decisions.
  • Environment: The external system with which the agent interacts.
  • State ($s$): A representation of the current situation or configuration of the environment.
  • Action ($a$): A choice made by the agent in a given state.
  • Reward ($r$): A scalar feedback signal received by the agent from the environment after taking an action. Positive rewards indicate desirable outcomes, while negative rewards (penalties) indicate undesirable ones.
  • Policy ($\pi$): The agent's strategy or mapping from states to actions. It defines what action the agent will take in any given state.
  • Value Function: A prediction of the future cumulative reward.
    • State-Value Function ($V(s)$): The expected cumulative reward starting from state $s$ and following a particular policy.
    • Action-Value Function ($Q(s, a)$): The expected cumulative reward starting from state $s$, taking action $a$, and then following a particular policy.
  • Model (Optional): A representation of how the environment behaves. It predicts the next state and reward given the current state and action.
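
To make these pieces concrete, here is a minimal sketch of the agent-environment interaction loop. The `RandomAgent` and `CoinFlipEnvironment` classes and their `reset`/`step` methods are hypothetical illustrations modelled on common RL interfaces, not part of any particular library.

```python
import random

class RandomAgent:
    """A hypothetical agent whose policy picks an action uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)  # policy: state -> action

class CoinFlipEnvironment:
    """A hypothetical one-state environment: guessing a coin flip pays +1, else -1."""
    def reset(self):
        return 0  # single dummy state

    def step(self, action):
        outcome = random.choice(["heads", "tails"])
        reward = 1.0 if action == outcome else -1.0
        done = True  # one-step episodes
        return 0, reward, done

env = CoinFlipEnvironment()
agent = RandomAgent(["heads", "tails"])

# One episode of the loop: observe state, act, receive reward, repeat until done.
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = agent.act(state)
    state, reward, done = env.step(action)
    total_reward += reward
print("cumulative reward:", total_reward)
```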

Types of Reinforcement Learning Algorithms

RL algorithms can be broadly categorized based on whether they use a model of the environment.

Model-Based Methods

Model-based RL algorithms learn a model of the environment. This model can then be used for planning, allowing the agent to simulate future outcomes and choose actions that lead to optimal rewards without necessarily interacting with the real environment.

  • How it works:

    1. Learn a model of the environment (transition probabilities and reward functions).
    2. Use the learned model to plan by simulating future states and rewards.
    3. Choose actions based on the planning results.
  • Pros: Can be more sample-efficient if the model is accurate. Allows for planning and lookahead.

  • Cons: Learning an accurate model can be challenging, especially in complex environments. Model errors can propagate and lead to suboptimal policies.
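
As a rough sketch of the planning step, assume the agent has already estimated a transition table `P` and a reward table `R` from experience; the tiny two-state, two-action problem below is invented purely for illustration. Value iteration can then be run entirely inside the learned model, with no further interaction with the real environment.

```python
# Hypothetical learned model: P[s][a] maps next states to probabilities,
# R[s][a] is the expected immediate reward for taking action a in state s.
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
gamma = 0.9

# Planning: value iteration on the learned model.
V = {s: 0.0 for s in P}
for _ in range(100):
    V = {
        s: max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in P[s]
        )
        for s in P
    }

# Greedy policy extracted from the planned values.
policy = {
    s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
    for s in P
}
print(V, policy)
```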

Model-Free Methods

Model-free RL algorithms learn a policy or value function directly from experience without explicitly learning a model of the environment's dynamics. They rely on trial-and-error.

  • How it works:

    1. Interact with the environment.
    2. Update the policy or value function directly based on observed rewards and state transitions.
  • Pros: Simpler to implement and can work well even when a precise model is hard to obtain.

  • Cons: Can be less sample-efficient, requiring more interactions with the environment.

Key Model-Free Algorithms

Here are some fundamental model-free RL algorithms:

Monte Carlo Methods

Monte Carlo methods learn from complete episodes of experience. They estimate value functions by averaging the returns obtained from many episodes.

  • How it works:

    1. Run many episodes of interaction with the environment using the current policy.
    2. For each state-action pair, calculate the return (cumulative reward) obtained in those episodes.
    3. Update the value function estimates based on these returns.
  • Key characteristics:

    • Learn from complete episodes.
    • No bootstrapping (values are not updated based on other estimated values).
    • Can be applied to episodic tasks.
  • Example: Imagine a simple game of Blackjack. A Monte Carlo agent plays many hands. For each hand, it records the sequence of states, actions, and the final outcome (win, lose, draw). To estimate the value of a particular state (e.g., having 17 while the dealer shows 6), it averages the returns of all hands in which that state occurred.
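
A minimal sketch of first-visit Monte Carlo state-value estimation is shown below. The `run_episode` function is a hypothetical placeholder for playing one episode with the current policy; in practice it would come from the actual game.

```python
from collections import defaultdict
import random

def run_episode():
    """Hypothetical episode generator: returns a list of (state, reward) steps."""
    length = random.randint(1, 4)
    return [(random.choice(["low", "mid", "high"]), random.choice([-1.0, 0.0, 1.0]))
            for _ in range(length)]

gamma = 1.0  # no discounting for short episodic tasks like Blackjack
returns_sum = defaultdict(float)
returns_count = defaultdict(int)
V = defaultdict(float)

for _ in range(10_000):
    episode = run_episode()

    # Record the first time step at which each state is visited.
    first_visit = {}
    for t, (state, _) in enumerate(episode):
        first_visit.setdefault(state, t)

    # Work backwards to accumulate the discounted return from each time step.
    returns_at = [0.0] * len(episode)
    G = 0.0
    for t in reversed(range(len(episode))):
        G = episode[t][1] + gamma * G
        returns_at[t] = G

    # Average returns over first visits only (first-visit Monte Carlo).
    for state, t in first_visit.items():
        returns_sum[state] += returns_at[t]
        returns_count[state] += 1
        V[state] = returns_sum[state] / returns_count[state]

print(dict(V))
```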

Q-Learning

Q-Learning is a value-based, off-policy temporal difference (TD) learning algorithm. It learns an action-value function (Q-function) that estimates the expected future reward of taking a specific action in a specific state.

  • How it works: The Q-function is updated with a temporal difference rule based on the Bellman optimality equation: $$ Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)] $$ where:

    • $s$: current state
    • $a$: current action
    • $r$: immediate reward
    • $s'$: next state
    • $a'$: a candidate action in the next state (the variable being maximized over)
    • $\alpha$: learning rate
    • $\gamma$: discount factor ($0 \le \gamma \le 1$)
    • $\max_{a'} Q(s', a')$: the maximum expected future reward from the next state $s'$, corresponding to the greedy choice.
  • Off-policy: Q-learning learns the optimal policy regardless of the policy used to generate the data. This means it can learn from experience produced by a different behavior policy (e.g., an $\epsilon$-greedy policy) while still converging toward the optimal Q-values.
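
Below is a sketch of tabular Q-learning built around this update rule. The `env.reset()` / `env.step(action)` interface returning `(next_state, reward, done)` is an assumption modelled on common RL environment APIs, not a specific library.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (behavior policy).
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Off-policy target: bootstrap from the greedy (max) action in the next state.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            td_target = reward + gamma * best_next
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])

            state = next_state
    return Q
```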

SARSA (State-Action-Reward-State-Action)

SARSA is another value-based, on-policy temporal difference (TD) learning algorithm. It also learns an action-value function but updates it based on the action actually taken in the next state according to the current policy.

  • How it works: The Q-function is updated with a temporal difference rule based on the Bellman equation: $$ Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)] $$ where:

    • $s$, $a$, $r$, $s'$: current state, action, reward, and next state.
    • $a'$: the next action actually taken in state $s'$ according to the current policy.
  • On-policy: SARSA learns the value of the policy it is currently following, including its exploratory actions. If the policy is $\epsilon$-greedy, SARSA learns the Q-values of that $\epsilon$-greedy policy.

  • Difference from Q-Learning: The key difference is the target term: $\max_{a'} Q(s', a')$ in Q-learning versus $Q(s', a')$ in SARSA, where $a'$ is the action actually selected by the current policy.
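
A matching SARSA sketch, under the same assumed environment interface as the Q-learning example, makes the on-policy difference explicit: the bootstrap term uses the action that the $\epsilon$-greedy policy actually selects next.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control with an epsilon-greedy policy."""
    Q = defaultdict(float)

    def choose(state):
        # The same epsilon-greedy policy is used both for acting and for the update target.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        action = choose(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = None if done else choose(next_state)

            # On-policy target: bootstrap from the action actually taken next.
            bootstrap = 0.0 if done else Q[(next_state, next_action)]
            td_target = reward + gamma * bootstrap
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])

            state, action = next_state, next_action
    return Q
```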

Actor-Critic Methods

Actor-Critic methods combine the strengths of value-based (like Q-learning) and policy-based methods. They maintain two components:

  1. Actor: Controls how the agent acts. It learns the policy and outputs actions.
  2. Critic: Evaluates the actions taken by the actor. It learns a value function (e.g., state-value or action-value) and provides feedback to the actor.
  • How it works:

    • The actor updates its policy based on the error signal (e.g., TD error) provided by the critic.
    • The critic updates its value function based on the rewards and state transitions, often using TD learning.
  • Pros: Can learn stochastic policies, often more stable than pure policy gradient methods, and can handle continuous action spaces.

  • Cons: Can be more complex to implement and tune due to having two learning components.
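
The sketch below illustrates a tabular one-step actor-critic with a softmax policy. The environment interface is the same assumed `reset`/`step` convention as in the earlier examples, and the learning rates are arbitrary placeholders.

```python
import math
import random
from collections import defaultdict

def actor_critic(env, actions, episodes=500, actor_lr=0.05, critic_lr=0.1, gamma=0.99):
    """Sketch of a tabular one-step actor-critic."""
    prefs = defaultdict(float)   # actor: action preferences H[(state, action)]
    V = defaultdict(float)       # critic: state-value estimates

    def policy_probs(state):
        # Softmax over action preferences gives a stochastic policy.
        exps = {a: math.exp(prefs[(state, a)]) for a in actions}
        total = sum(exps.values())
        return {a: e / total for a, e in exps.items()}

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            probs = policy_probs(state)
            action = random.choices(actions, weights=[probs[a] for a in actions])[0]
            next_state, reward, done = env.step(action)

            # Critic: TD error measures how much better or worse the outcome was than expected.
            td_error = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            V[state] += critic_lr * td_error

            # Actor: adjust preferences in the direction of the policy gradient, scaled by the TD error.
            for a in actions:
                grad = (1.0 if a == action else 0.0) - probs[a]
                prefs[(state, a)] += actor_lr * td_error * grad

            state = next_state
    return prefs, V
```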

Deep Q-Networks (DQN)

The Deep Q-Network (DQN) is a pioneering algorithm that leverages deep neural networks to approximate the Q-function in Q-learning. This allows RL agents to handle high-dimensional state spaces, such as raw pixel inputs from video games.

  • Key Innovations:

    • Experience Replay: Stores past experiences (state, action, reward, next state, done) in a replay buffer. During training, mini-batches are randomly sampled from this buffer, which breaks the correlation between consecutive samples and improves learning stability.
    • Target Network: Uses a separate, periodically updated "target network" to calculate the target Q-values. This further stabilizes training by providing a consistent target for the Q-network updates. The target network's weights are periodically copied from the main Q-network.
  • How it works:

    1. The Q-network takes the current state as input and outputs Q-values for all possible actions.
    2. An action is selected (e.g., using $\epsilon$-greedy).
    3. The experience tuple is stored in the replay buffer.
    4. Periodically, a batch of experiences is sampled from the buffer.
    5. For each sampled experience, the target Q-value is computed using the target network: $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$.
    6. The Q-network is trained to minimize the mean squared error (MSE) between its current prediction and the target Q-value: $L(\theta) = \mathbb{E}[(y - Q(s, a; \theta))^2]$.
    7. The target network weights are periodically updated.
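
A compact sketch of the DQN training step is given below, assuming PyTorch and a small fully connected Q-network. The state/action dimensions and hyperparameters are placeholders, and the surrounding interaction loop that fills `replay_buffer` is omitted.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Hypothetical sizes; in practice these come from the environment.
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

def make_q_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net = make_q_net()
target_net = make_q_net()
target_net.load_state_dict(q_net.state_dict())  # start with identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)  # experience replay: (s, a, r, s', done)

def train_step(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)  # random sampling breaks temporal correlation
    states, actions, rewards, next_states, dones = map(list, zip(*batch))

    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Target y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at episode end.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q * (1.0 - dones)

    # Current prediction Q(s, a; theta) for the actions actually taken.
    q_values = q_net(states).gather(1, actions).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Periodically copy the online network's weights into the target network.
    target_net.load_state_dict(q_net.state_dict())
```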

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a policy gradient method that aims to improve stability and reliability by constraining the policy updates. It uses a clipped surrogate objective function to prevent large policy updates that could destabilize learning.

  • How it works:

    • PPO calculates the advantage of an action (how much better that action was than the average action in that state).
    • It uses a ratio between the new policy and the old policy to measure how much the policy has changed.
    • The objective function is designed to maximize the expected advantage, but with a clipping mechanism that penalizes large deviations from the old policy. This helps ensure that policy updates are not too drastic, preventing performance collapse.
  • Key objective function component (simplified): $$ L^{CLIP}(\theta) = \mathbb{E}[\min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t)] $$ where:

    • $r_t(\theta)$ is the probability ratio of the new policy to the old policy for the action taken at timestep $t$.
    • $\hat{A}_t$ is the estimated advantage at timestep $t$.
    • $\epsilon$ is a clipping hyperparameter (e.g., 0.1 or 0.2).
  • Pros: Relatively simple to implement, good performance across a wide range of tasks, and more stable than many other policy gradient methods.
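
The clipped surrogate loss itself is only a few lines. The sketch below assumes PyTorch and that per-timestep log-probabilities and advantage estimates have already been computed elsewhere; advantage estimation, the value loss, and the entropy bonus are omitted.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP, returned as a loss to minimize.

    new_log_probs: log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t) under the policy that collected the data
    advantages:    estimated advantages A_hat_t (e.g., from GAE)
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Elementwise minimum, then negate: maximizing L^CLIP equals minimizing this loss.
    return -torch.min(unclipped, clipped).mean()

# Hypothetical usage with dummy tensors:
new_lp = torch.tensor([-0.9, -1.2, -0.3])
old_lp = torch.tensor([-1.0, -1.0, -0.5])
adv = torch.tensor([0.5, -0.2, 1.0])
print(ppo_clip_loss(new_lp, old_lp, adv))
```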

Probabilistic Models in RL Context

While not RL algorithms themselves, probabilistic graphical models can be used in conjunction with RL or for understanding related concepts in sequential decision-making.

Bayesian Networks

Bayesian Networks (BNs) are probabilistic graphical models that represent conditional dependencies between random variables.

  • Relevance to RL:

    • Can be used to model the state transitions and reward functions in model-based RL.
    • Can represent uncertainty in the environment or in the agent's knowledge.
    • Can be used for belief state tracking in Partially Observable Markov Decision Processes (POMDPs).
  • Structure: A directed acyclic graph (DAG) where nodes represent random variables and directed edges represent conditional dependencies. Each node has a conditional probability distribution (CPD) that quantifies the relationship with its parents.
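
As a concrete (and entirely made-up) example, a single conditional probability distribution of the kind a BN node would hold, here modelling $P(s' \mid s, a)$ for a model-based agent, can be written and sampled directly:

```python
import random

# Hypothetical CPD: NextState depends on its parents (State, Action).
cpd_next_state = {
    ("dry", "water"): {"dry": 0.1, "wet": 0.9},
    ("dry", "wait"):  {"dry": 0.95, "wet": 0.05},
    ("wet", "water"): {"dry": 0.0, "wet": 1.0},
    ("wet", "wait"):  {"dry": 0.3, "wet": 0.7},
}

def sample_next_state(state, action):
    """Sample the child node given the values of its parents."""
    dist = cpd_next_state[(state, action)]
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs)[0]

print(sample_next_state("dry", "water"))
```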

Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) are statistical models where the system being modeled is assumed to be a Markov process with unobserved (hidden) states.

  • Relevance to RL:

    • HMMs are a fundamental model for sequential data and can be seen as a POMDP stripped of actions and rewards: the true state is hidden and must be inferred from observations via a belief state.
    • Can be used to model environments where the underlying state is not directly observable, and actions influence the observable outputs or transitions.
    • The underlying dynamics of an HMM can be learned and then used in an RL setting to guide decision-making.
  • Components:

    • States: A finite set of hidden states.
    • Observations: A finite set of possible observations.
    • Transition Probabilities: Probabilities of transitioning from one hidden state to another.
    • Emission Probabilities: Probabilities of observing a particular observation given a hidden state.
    • Initial State Probabilities: Probabilities of starting in each hidden state.
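
Belief tracking in an HMM is performed by the forward (filtering) recursion, which is essentially what "belief state tracking" amounts to. The sketch below uses a hypothetical two-state weather model; the probabilities are illustrative only.

```python
# Hypothetical two-state HMM: hidden weather, observed activity.
states = ["rainy", "sunny"]
initial = {"rainy": 0.5, "sunny": 0.5}
transition = {  # P(next hidden state | current hidden state)
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
emission = {  # P(observation | hidden state)
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def filter_belief(observations):
    """Forward algorithm: P(hidden state | observations so far) at each step."""
    belief = {s: initial[s] * emission[s][observations[0]] for s in states}
    norm = sum(belief.values())
    beliefs = [{s: p / norm for s, p in belief.items()}]
    for obs in observations[1:]:
        # Predict with the transition model, then correct with the emission likelihood.
        predicted = {s: sum(beliefs[-1][prev] * transition[prev][s] for prev in states)
                     for s in states}
        updated = {s: predicted[s] * emission[s][obs] for s in states}
        norm = sum(updated.values())
        beliefs.append({s: p / norm for s, p in updated.items()})
    return beliefs

for step, b in enumerate(filter_belief(["walk", "clean", "clean"])):
    print(step, b)
```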