SARSA: On-Policy Reinforcement Learning Explained

Learn SARSA, an on-policy reinforcement learning algorithm for optimal policy learning in MDPs. Understand its state-action-reward-state-action updates for AI agents.

SARSA (State-Action-Reward-State-Action)

SARSA is a popular on-policy reinforcement learning algorithm for learning optimal policies in Markov Decision Processes (MDPs). It enables an agent to learn how to act optimally by interacting with its environment. The algorithm updates its action-value estimates using the current state, the action taken, the reward received, the next state, and the next action chosen by the current policy (the quintuple that gives the algorithm its name).

How SARSA Works

SARSA updates its action-value function, denoted as $Q(s, a)$, by strictly following the current policy. At each time step, the agent observes the following:

  • Current state ($s$): The current situation the agent is in.
  • Action taken ($a$): The action the agent chooses to perform in state $s$.
  • Reward received ($r$): The feedback from the environment after taking action $a$.
  • Next state ($s'$): The new state the agent transitions to.
  • Next action ($a'$): The action the agent chooses to perform in the next state $s'$ according to its current policy.

The core update rule for the Q-value is:

$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]$

Where:

  • $\alpha$ (alpha): The learning rate. It controls how much new information overrides old information. A higher alpha means faster learning, but can lead to instability.
  • $\gamma$ (gamma): The discount factor. It determines the importance of future rewards. A gamma closer to 1 emphasizes future rewards more, while a gamma closer to 0 prioritizes immediate rewards.

Because SARSA updates its Q-values using the actual next action ($a'$) chosen by the current policy, it learns the value of that specific policy. This makes SARSA an on-policy algorithm.
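
To make the arithmetic concrete, the short sketch below applies one SARSA update to made-up values; the Q-estimates, reward, and hyperparameters are arbitrary numbers chosen purely for illustration.

# A single SARSA update with illustrative (made-up) numbers.
alpha, gamma = 0.1, 0.9               # learning rate and discount factor
q_sa = 2.0                            # current estimate Q(s, a)
reward = 1.0                          # reward r received after taking a in s
q_next = 3.0                          # Q(s', a') for the action the policy actually chose in s'

td_target = reward + gamma * q_next   # r + gamma * Q(s', a') = 3.7
td_error = td_target - q_sa           # 3.7 - 2.0 = 1.7
q_sa = q_sa + alpha * td_error        # 2.0 + 0.1 * 1.7, approximately 2.17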

SARSA vs. Q-Learning

The fundamental difference between SARSA and Q-Learning lies in their approach to updating the Q-values:

  • SARSA: Is on-policy. It updates its Q-values using the action ($a'$) that the agent's current policy actually chooses in the next state ($s'$).
  • Q-Learning: Is off-policy. It updates its Q-values using the maximum possible Q-value in the next state ($s'$), regardless of what action the current policy would actually take. This is often referred to as the "greedy" action in the next state.

This distinction makes SARSA generally more cautious. In environments where taking exploratory or suboptimal actions can lead to significant negative consequences (e.g., falling off a cliff), SARSA tends to learn a safer policy because it accounts for the potential risks associated with the policy it's actually following. Q-Learning, by contrast, might learn a policy that seems optimal in theory but is dangerous in practice if the exploratory steps are penalized heavily.
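
This one-line difference in the bootstrap target is easy to see in code. The sketch below contrasts the two tabular updates; the function names and arguments are illustrative and assume a NumPy Q-table like the one used in the example further down.

import numpy as np

def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    # On-policy target: bootstrap from the action the policy actually took in s'.
    target = reward + gamma * Q[next_state, next_action]
    Q[state, action] += alpha * (target - Q[state, action])

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    # Off-policy target: bootstrap from the greedy (maximum-value) action in s'.
    target = reward + gamma * np.max(Q[next_state, :])
    Q[state, action] += alpha * (target - Q[state, action])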

Key Advantages of SARSA

  • Stable Learning: By learning the value of the policy being executed, SARSA often leads to more stable and predictable learning.
  • Safety in Risky Environments: SARSA is well-suited for scenarios where exploratory actions might have severe penalties. It learns a policy that avoids these dangerous explorations.
  • Simplicity and Intuition: The update rule is relatively straightforward and intuitive, making it easier to implement and understand.

Python Example Code for SARSA

This example demonstrates a basic tabular SARSA implementation using the gym library, specifically the 'Taxi-v3' environment. It assumes the modern Gym/Gymnasium API (gym >= 0.26), in which env.reset() returns a (state, info) pair and env.step() returns five values; on older gym releases the unpacking differs slightly.

import gym
import numpy as np

# Initialize the environment.
# The 'ansi' render mode makes env.render() return the grid as a text string.
env = gym.make('Taxi-v3', render_mode='ansi')

# Initialize Q-table with zeros
# Dimensions: (number of states, number of actions)
Q = np.zeros((env.observation_space.n, env.action_space.n))

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Exploration rate (probability of choosing a random action)
episodes = 5000  # Number of training episodes

def choose_action(state, epsilon):
    """
    Chooses an action using an epsilon-greedy policy.
    Explores with probability epsilon, exploits with probability 1-epsilon.
    """
    if np.random.uniform(0, 1) < epsilon:
        return env.action_space.sample()  # Explore: choose a random action
    else:
        return np.argmax(Q[state, :])     # Exploit: choose the action with the highest Q-value

# Training the SARSA agent
print("Starting SARSA training...")
for episode in range(episodes):
    state, _ = env.reset()  # Reset the environment for a new episode (reset() returns (state, info))
    action = choose_action(state, epsilon) # Choose the first action
    done = False

    while not done:
        # Take the chosen action and observe the results
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # Episode ends on success or when the step limit is hit

        # Choose the *next* action based on the *next state* and the *current policy*
        next_action = choose_action(next_state, epsilon)

        # SARSA update rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma*Q(s', a') - Q(s,a))
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

        # Move to the next state and update the action
        state = next_state
        action = next_action

    if (episode + 1) % 500 == 0:
        print(f"Episode {episode + 1}/{episodes} completed.")

print("Training finished.")

# --- Testing the learned policy ---
print("\nTesting the learned policy...")
state, _ = env.reset()
print(env.render())  # In 'ansi' mode, render() returns the grid as a string
done = False
total_reward = 0

while not done:
    # Choose the best action based on the learned Q-table (exploitation only)
    action = np.argmax(Q[state, :])
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    print(env.render())
    total_reward += reward

print(f"Test finished. Total reward: {total_reward}")
env.close()
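
A single rendered test episode can be noisy, so the learned policy is often judged by its average return over many greedy episodes. The sketch below reuses the trained Q-table from above in a fresh environment; the evaluate name and the choice of 100 episodes are illustrative.

def evaluate(Q, n_episodes=100):
    """Run the greedy policy for several episodes and return the average return."""
    eval_env = gym.make('Taxi-v3')
    returns = []
    for _ in range(n_episodes):
        state, _ = eval_env.reset()
        done, episode_return = False, 0
        while not done:
            action = np.argmax(Q[state, :])
            state, reward, terminated, truncated, _ = eval_env.step(action)
            episode_return += reward
            done = terminated or truncated
        returns.append(episode_return)
    eval_env.close()
    return np.mean(returns)

print(f"Average greedy return over 100 episodes: {evaluate(Q):.2f}")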

SARSA Hyperparameters

Tuning hyperparameters is crucial for effective SARSA learning:

  • Learning Rate ($\alpha$):
    • Too high: Can cause oscillations and prevent convergence.
    • Too low: Can lead to very slow learning.
    • Common range: $0.01$ to $0.5$. Often decayed over time.
  • Discount Factor ($\gamma$):
    • Close to 1: Prioritizes long-term rewards, making the agent look further into the future.
    • Close to 0: Prioritizes immediate rewards.
    • Common range: $0.9$ to $0.999$.
  • Exploration Rate ($\epsilon$):
    • Controls the trade-off between exploration (trying new actions) and exploitation (using known best actions).
    • A common strategy is epsilon-decay, where epsilon starts high (e.g., 1.0) and gradually decreases over episodes to favor exploitation as the agent learns (a short sketch follows this list).
    • Common initial range: $0.1$ to $1.0$.
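
As a concrete example of epsilon decay, the sketch below multiplies epsilon by a fixed factor after each episode until it reaches a floor; the specific decay rate and minimum value are illustrative choices, not prescribed ones.

# Illustrative epsilon-decay schedule for a training loop like the one above.
epsilon = 1.0          # start fully exploratory
epsilon_min = 0.05     # keep a small amount of exploration
epsilon_decay = 0.995  # multiplicative decay per episode
episodes = 5000

for episode in range(episodes):
    # ... run one SARSA training episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)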

Frequently Asked Questions About SARSA

  • What is the SARSA algorithm in reinforcement learning? SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference learning algorithm that estimates the action-value function $Q(s, a)$ by following the current policy.
  • How does SARSA differ from Q-Learning? SARSA is on-policy, updating with the actual next action taken by the policy ($Q(s', a')$), while Q-Learning is off-policy, updating with the maximum possible Q-value in the next state ($\max_{a'} Q(s', a')$).
  • Explain the SARSA update rule in detail. $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]$. It updates the current state-action pair's value based on the immediate reward plus the discounted value of the next state-action pair determined by the current policy.
  • What does it mean that SARSA is an on-policy algorithm? It means the algorithm learns the value of the policy it is currently following. The updates are based on actions that the policy actually chooses, including its exploratory actions.
  • How do you choose the next action in SARSA? The next action ($a'$) is chosen using the same policy that is currently being used to choose the action ($a$) in the current state ($s$). This is typically done using an $\epsilon$-greedy strategy.
  • What are the key hyperparameters in SARSA, and how do they affect learning? The key hyperparameters are the learning rate ($\alpha$), discount factor ($\gamma$), and exploration rate ($\epsilon$). $\alpha$ controls update magnitude, $\gamma$ controls future reward importance, and $\epsilon$ controls the exploration-exploitation balance.
  • In what scenarios is SARSA preferred over Q-learning? SARSA is preferred in environments where taking suboptimal or exploratory actions can lead to severe penalties or safety concerns, as it learns a more conservative policy.
  • How does SARSA handle exploration and exploitation? SARSA typically uses an $\epsilon$-greedy strategy, where with probability $\epsilon$, it explores by choosing a random action, and with probability $1-\epsilon$, it exploits by choosing the action with the highest estimated Q-value.
  • Can SARSA be used in continuous state or action spaces? The basic SARSA algorithm as described is for discrete state and action spaces, using a Q-table. For continuous spaces, function approximation methods (like neural networks) are needed to represent the Q-function, often referred to as Deep SARSA; a minimal linear-approximation sketch follows this list.
  • How do you evaluate the performance of a SARSA agent? Performance is typically evaluated by the total cumulative reward achieved over a set of test episodes after training. Metrics like convergence speed, stability of the learned policy, and success rate in achieving goals are also important.
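
To make the last point about continuous state spaces concrete, the sketch below performs a single semi-gradient SARSA update with linear function approximation; the feature vectors, reward, and dimensions are stand-ins invented purely for illustration rather than taken from any particular environment.

import numpy as np

# Semi-gradient SARSA with linear function approximation (illustrative sketch).
# States are represented by 4-dimensional feature vectors; there are 3 actions.
n_features, n_actions = 4, 3
rng = np.random.default_rng(0)
w = np.zeros((n_actions, n_features))   # one weight vector per action

def q_value(features, action):
    """Approximate Q(s, a) as a linear function of the state features."""
    return w[action] @ features

def epsilon_greedy(features, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax([q_value(features, a) for a in range(n_actions)]))

# One illustrative update using made-up transition data.
alpha, gamma = 0.01, 0.99
s_feat = rng.normal(size=n_features)        # current state features (stand-in)
a = epsilon_greedy(s_feat)
reward = 1.0                                # stand-in reward
s_next_feat = rng.normal(size=n_features)   # next state features (stand-in)
a_next = epsilon_greedy(s_next_feat)        # next action chosen by the same policy

td_error = reward + gamma * q_value(s_next_feat, a_next) - q_value(s_feat, a)
w[a] += alpha * td_error * s_feat           # gradient of a linear Q is just the feature vector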
