Reinforcement Learning (RL)

    Sep 25, 2025

    • A machine learning technique
    • Every RL system has an agent
    • Unlike supervised learning, RL can learn in real time by interacting with its environment

    How RL Works

    Reinforcement learning is the process of an agent learning by trial and error within its environment, guided by rewards and penalties, so that its model improves at making good decisions.

    It's helpful to understand common RL terms by thinking of an agent as a person:

    • Agent = the whole person.
    • Model = the brain inside the person.
    • Environment = the world around the person.
    • State = what the person perceives about the world right now.
    • Action = what the person decides to do (move, speak, block traffic, etc.).
    • Reward/penalty = feedback from the world (like getting paid, praised, or hurt).
    • Policy = the life lessons learned over time to maximize rewards (in RL terms, the strategy for choosing an action in each state).

    The agent learns by interacting with its environment: after it makes a decision (an action), the environment sends feedback.

    How Rewards Work

    AI agents have parameters, such as the weights in a neural net. RL adjusts these parameters in the direction that produces more reward. When an action does a good job, the reward shifts the parameters to make that action more likely. When an action does a bad job, the penalty shifts them the other way, making that action less likely.
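
    As a rough illustration of parameters being pushed toward reward, the sketch below applies gradient ascent to a two-action softmax policy. The reward values, the learning rate, and the function names are assumptions made for this example, and it uses the exact gradient of the expected reward, whereas practical RL algorithms estimate that gradient from sampled experience.

```python
import math

def softmax(logits):
    """Turn raw preferences (logits) into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on expected reward for a softmax policy."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # d(expected reward) / d(logit_k) = p_k * (r_k - expected)
    return [x + lr * p * (r - expected) for x, p, r in zip(logits, probs, rewards)]

rewards = [1.0, -1.0]  # hypothetical: action 0 is a "good job", action 1 a "bad job"
logits = [0.0, 0.0]    # the parameters; the agent starts with no preference
for _ in range(20):
    logits = policy_gradient_step(logits, rewards)

probs = softmax(logits)
print(probs)  # the probability of action 0 has grown, that of action 1 has shrunk
```

    The rewarded action's logit rises and the penalized action's logit falls, so the policy gradually prefers the action that earns more reward.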

    AI models have an objective function, which works like a math score. Rewards and penalties are numbers that increase or decrease the score. Carefully designing rewards is critical; otherwise, the AI model might learn unintended or unsafe behavior.

    An agent takes an action based on its current state, which describes everything it needs to know at that moment to make a decision. After an agent takes an action, the environment returns:

    1. The next state (what the environment looks like now)
    2. The reward (a number)
    3. Whether the task is over (a done flag)

    The agent's goal is to maximize its score over time.
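
    The interaction described above can be sketched as a small loop. The `LineWorld` environment, its reward of 1.0 at the goal, and the step limit are hypothetical choices for illustration; they are not taken from any particular RL library.

```python
import random

class LineWorld:
    """Hypothetical toy environment: walk from position 0 to position `goal`."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # the initial state

    def step(self, action):
        """Apply an action (-1 or +1) and return (next_state, reward, done)."""
        self.position = max(0, self.position + action)
        self.steps += 1
        reached_goal = self.position == self.goal
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done

def run_episode(env, choose_action):
    """Run one episode and return the total reward (the agent's score)."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = choose_action(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

random_score = run_episode(LineWorld(), lambda state: random.choice([-1, 1]))
always_right_score = run_episode(LineWorld(), lambda state: 1)
print(random_score, always_right_score)
```

    Here `choose_action` stands in for the policy: a learning agent would replace the fixed lambdas with a function whose behavior changes as rewards come in.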

    Reward design is one of the trickiest parts of RL. If the reward is not designed well, the agent can cheat or unintentionally learn the wrong behavior.

    Common methods used for rewarding agents in practice:

    Reinforcement Learning from Human Feedback (RLHF)

    Humans rank or score the agent's outputs so that the agent learns to align with human values.

    Reinforcement Learning from Human Feedback generally works in three phases:

    1. Pretraining - The underlying model is trained on a large dataset.
    2. Human feedback - Humans look at the AI outputs and rank or rate them.
    3. Train a reward model - A reward model is trained to predict how a human would rate an output, and its predictions then serve as the reward signal when the underlying model is fine-tuned with RL.

    An example of this is a firewall that blocks legitimate requests. The model is trained to block bad requests, but a human sees that it is blocking Google bots, which hurts SEO. The model is then rewarded for correctly identifying which meta bots are bad and which are good.

    RLHF addresses edge cases in model training outcomes, aligns AI behavior with human values, and reduces undesirable behavior. It is important to note, however, that the quality of RLHF depends on the quality of the feedback: if the humans are biased or poorly trained, the AI will learn that behavior.
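
    To make the reward model idea concrete, here is a deliberately small sketch that fits a linear model to a handful of human ratings. The two features, the ratings, and the choice of a linear model with squared error are all assumptions made for illustration; reward models used in practice are typically neural networks, often trained on preference comparisons rather than raw scores.

```python
def predict(weights, features):
    """Reward model: a weighted sum of the features describing an output."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(examples, lr=0.1, epochs=500):
    """Fit weights by gradient descent on squared error against human ratings."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for features, human_rating in examples:
            error = predict(weights, features) - human_rating
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights

# Hypothetical features of a "block this request" decision:
# [request came from a verified search crawler, request rate was abusive]
rated_decisions = [
    ([1.0, 0.0], -1.0),  # human: blocking a legitimate crawler was wrong
    ([0.0, 1.0], 1.0),   # human: blocking an abusive client was right
    ([0.0, 0.0], 0.0),   # human: no strong opinion either way
]
weights = train_reward_model(rated_decisions)

crawler_block = predict(weights, [1.0, 0.0])
abuser_block = predict(weights, [0.0, 1.0])
print(crawler_block, abuser_block)  # negative for the crawler, positive for the abuser
```

    Once trained, `predict` can score new decisions without a human in the loop, which is also why biased ratings carry straight through into the learned rewards.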

    Sparse Rewards

    The agent only gets rewarded at the end of a task. This is the simplest way to reward an agent, but it can make learning slow. For example, the agent gets the points only if it finishes the job.

    Dense (Shaped) Rewards

    The agent gets small rewards at each step that guide it towards its goal, which helps it learn faster. For example, 0.1 points each time it gets closer and 10 points if it succeeds.
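
    The difference between sparse and dense rewards can be shown with two small reward functions. This is a sketch on a hypothetical number line with the goal at position 5; the 0.1 and 10 point values follow the example above, and the function names are made up for illustration.

```python
GOAL = 5  # hypothetical goal position on a number line

def sparse_reward(position):
    """Reward only when the task is finished."""
    return 10.0 if position == GOAL else 0.0

def dense_reward(previous_position, position):
    """Small reward for each step that moves closer, plus the final payoff."""
    if position == GOAL:
        return 10.0
    got_closer = abs(GOAL - position) < abs(GOAL - previous_position)
    return 0.1 if got_closer else 0.0

path = [0, 1, 2, 3, 4, 5]
sparse = [sparse_reward(p) for p in path[1:]]
dense = [dense_reward(a, b) for a, b in zip(path, path[1:])]
print(sparse)  # [0.0, 0.0, 0.0, 0.0, 10.0]
print(dense)   # [0.1, 0.1, 0.1, 0.1, 10.0]
```

    With the sparse version the agent receives no signal until the final step, while the dense version tells it after every step whether it is heading the right way.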

    Delayed Rewards

    Rewards are not given immediately and are instead awarded based on long-term outcomes. For example, in fraud detection, flagging a single transaction as fraud is not rewarded, but grouping several transactions together over several weeks and then detecting a fraud pattern is rewarded.
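
    One common way to pass a delayed reward back to the earlier actions that led to it is to compute a discounted return for every step. The sketch below assumes a discount factor `gamma` of 0.9 and a made-up reward sequence; neither comes from the article.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return-to-go per step: r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ..."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Hypothetical episode: nothing is rewarded until the pattern is detected at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards)
print([round(g, 3) for g in returns])  # [0.729, 0.81, 0.9, 1.0]
```

    Every earlier step receives some credit for the final reward, with steps closer to the outcome receiving more.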

    Exploration Bonuses

    The agent is given an extra reward when it tries something new or creative. This can be useful in research and can be managed through entropy configuration.
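
    A simple way to reward novelty is a count-based bonus that shrinks each time a state is revisited. The formula `beta / sqrt(visits)`, the `CountBonus` class, and the state names below are one possible choice made for illustration, not the only way to build an exploration bonus.

```python
import math
from collections import Counter

class CountBonus:
    """Count-based novelty bonus: beta / sqrt(number of visits to the state)."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.visits = Counter()

    def __call__(self, state):
        self.visits[state] += 1
        return self.beta / math.sqrt(self.visits[state])

bonus = CountBonus()
first = bonus("room_a")   # 1.0: this state has never been seen before
second = bonus("room_a")  # about 0.707: the state was already visited once
third = bonus("room_b")   # 1.0: a new state again
print(first, second, third)
```

    The bonus would be added to the environment's own reward, so unfamiliar states look temporarily more attractive than familiar ones.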

    Reward Shaping

    An agent is given an immediate, strong reward for actions that lead to an excellent outcome. For example, a travel route from A to B that is efficient, safe, and achieves its objectives.

    Multi-Objective Rewards

    Multiple criteria, such as performance, safety, and efficiency, are combined to balance trade-offs in real-world systems. For example, a task that requires speed but also accuracy.
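
    A weighted sum is perhaps the simplest way to combine several criteria into one reward. The weights and scores below are invented for illustration; choosing them is itself part of reward design.

```python
def combined_reward(scores, weights):
    """Weighted sum of per-objective scores; both dicts share the same keys."""
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical weighting: accuracy matters more than speed for this task.
weights = {"speed": 0.3, "accuracy": 0.7}

fast_but_sloppy = combined_reward({"speed": 0.9, "accuracy": 0.5}, weights)
slow_but_exact = combined_reward({"speed": 0.4, "accuracy": 0.95}, weights)
print(fast_but_sloppy, slow_but_exact)  # the slower, more accurate run scores higher
```

    Changing the weights changes which behavior wins, which is how the trade-off between objectives is expressed.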

    Inverse Reinforcement Learning

    Instead of designing rewards manually, the AI learns the reward function by watching human experts or trained agents. This is useful in areas where the reward depends on too many variables to specify by hand and is best learned by observing humans or trained agents.

    How Penalties Work

    Negative rewards work in a similar way, except that the environment returns a negative number. This pushes the model away from the behavior that earned it.

    Common methods used for penalizing agents in practice:

    Minimizing Loss

    RL agents can be steered away from unsafe or inefficient actions by adding penalty terms to the loss, a technique known as shaping the loss. For example, a model that plans how quickly a person can get from location A to location B might find the fastest route (reward), but that route might take them across a river. With a shaped loss, the agent is penalized for achieving the objective in an unsafe way, so the model is trained to avoid that action in future.
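
    The river example can be sketched as a reward with a penalty term. The travel times, the penalty size of 100, and the `route_score` function are assumptions chosen to make the effect visible.

```python
def route_score(travel_minutes, crosses_river, penalty=100.0):
    """Reward faster routes, then subtract a penalty for the unsafe shortcut."""
    reward = -travel_minutes  # fewer minutes means a higher reward
    if crosses_river:
        reward -= penalty     # penalty term for the unsafe action
    return reward

shortcut = route_score(travel_minutes=10, crosses_river=True)  # -110.0
bridge = route_score(travel_minutes=25, crosses_river=False)   # -25
print(shortcut, bridge)

# Without the penalty term the unsafe shortcut would look better.
unpenalized_shortcut = route_score(10, crosses_river=True, penalty=0.0)
print(unpenalized_shortcut > bridge)  # True
```

    The penalty has to be large enough to outweigh the time saved; otherwise the agent still prefers the unsafe route.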

    Constraints And Safety Filters

    In RL, an agent tries many different actions in order to learn. Sometimes there are actions that you do not want the agent to even try, because they do not align with human values or they present ethical or safety concerns. To address this, extra rules can be placed outside the agent's model to block or adjust unsafe actions before they happen.
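
    A safety filter of this kind can be sketched as a thin wrapper that sits between the agent and the environment. The driving rule, the state layout, and the fallback action below are hypothetical.

```python
def safety_filter(state, proposed_action, is_unsafe, fallback_action):
    """Pass safe actions through; replace unsafe ones before they are executed."""
    if is_unsafe(state, proposed_action):
        return fallback_action
    return proposed_action

# Hypothetical rule: never accelerate when an obstacle is closer than 5 metres.
def is_unsafe(state, action):
    return action == "accelerate" and state["obstacle_distance_m"] < 5

near = {"obstacle_distance_m": 2}
far = {"obstacle_distance_m": 50}
blocked = safety_filter(near, "accelerate", is_unsafe, fallback_action="brake")
allowed = safety_filter(far, "accelerate", is_unsafe, fallback_action="brake")
print(blocked, allowed)  # brake accelerate
```

    Because the rule lives outside the model, it holds regardless of what the agent has or has not learned so far.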

    Early Termination

    If an agent breaks a rule, the training round can be ended immediately, and the agent loses the chance to earn any further rewards in that round.
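
    A minimal sketch of early termination, assuming a fixed reward per step and a made-up forbidden action:

```python
def run_round(actions, forbidden, reward_per_step=1.0):
    """Collect reward step by step, stopping at the first rule violation."""
    total = 0.0
    for step, action in enumerate(actions):
        if action in forbidden:
            return total, step  # round ends; later rewards are never earned
        total += reward_per_step
    return total, len(actions)

rules = {"run_red_light"}  # hypothetical forbidden action
clean = run_round(["go", "go", "go", "go"], forbidden=rules)
violated = run_round(["go", "run_red_light", "go", "go"], forbidden=rules)
print(clean)     # (4.0, 4)
print(violated)  # (1.0, 1)
```

    The violating round forfeits the rewards it could have earned in the remaining steps, which acts as an implicit penalty.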

    Risk-Sensitive Objectives

    The agent is penalized for risky or outlandish strategies. For example, a car is given an efficient route from A to B that occasionally ends in a crash. A risk-sensitive agent will avoid this route, preferring one that takes longer but rarely results in a crash. The dangerous route is penalized, and the model is trained to avoid it.
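
    One way to express risk sensitivity is to score a strategy by its average outcome minus a penalty for how much the outcomes vary. The mean-minus-standard-deviation form, the `risk_aversion` weight, and the per-trip rewards below are assumptions for illustration; other risk measures are also used.

```python
from statistics import mean, pstdev

def risk_adjusted(outcomes, risk_aversion=1.0):
    """Mean outcome minus a penalty proportional to its spread."""
    return mean(outcomes) - risk_aversion * pstdev(outcomes)

# Hypothetical per-trip rewards; -50 marks the occasional crash.
fast_route = [10] * 9 + [-50]
slow_route = [3] * 10

print(mean(fast_route), mean(slow_route))  # the fast route is better on average
print(risk_adjusted(fast_route))           # -14.0
print(risk_adjusted(slow_route))           # 3.0
```

    Judged on the average alone, the fast route wins. Once the spread is penalized, the slower but predictable route scores higher.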

    Entropy Control

    In RL, entropy measures the randomness of an agent's actions. Higher entropy leads to varied and creative actions, while lower entropy leads to predictable, well-known actions. Entropy can be adjusted depending on the training requirements.
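
    Entropy can be computed directly from the probabilities an agent assigns to its actions. In the sketch below, the two distributions and the bonus coefficient are invented for illustration; adding an entropy bonus to the objective is one common way to control it.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # highest possible for 4 actions
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # nearly deterministic
print(uniform, peaked)

def objective_with_entropy_bonus(expected_reward, probs, coefficient=0.01):
    """Expected reward plus an entropy bonus; a larger coefficient favors exploration."""
    return expected_reward + coefficient * entropy(probs)
```

    Raising `coefficient` rewards the agent for keeping its options open, while lowering it lets the agent settle on the actions it already knows work.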

    Reinforcement Learning (RL) | AIRTA Systems AI Safety Academy