Reinforcement Learning (RL) is about finding optimal actions automatically.
So you have an environment env which has

- env.reset() -> observation: Start a new episode and get the first observation. This could be a new game in the case of chess.
- env.step(action) -> observation, reward, is_done, additional_information: Make a step in the environment. The is_done flag says if the episode is over, e.g. if a game of chess is over. If it is over, then the environment needs a reset.
and an agent which has

- agent.reset() -> agent: Reset internal variables.
- agent.act(observation, no_exploration) -> action: Let the agent take an action. If you want to evaluate the agent, set no_exploration to True.
- agent.remember(prev_state, action, reward, state, is_done) -> agent: Store what is necessary. Here the learning happens.
- agent.save(path) -> agent: Serialize the agent to path.
- agent.load(path) -> agent: De-serialize the agent from path.
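As a minimal sketch of how these two interfaces fit together, the following plays one episode. CoinFlipEnv, RandomAgent and run_episode are hypothetical names made up for this illustration, and the sketch assumes that env.reset() hands back the first observation, which is how the training code further down uses it.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, three guesses per episode."""

    def reset(self):
        self.steps_left = 3
        return 0  # there is only one (dummy) observation

    def step(self, action):
        self.steps_left -= 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        is_done = self.steps_left == 0
        return 0, reward, is_done, {}


class RandomAgent:
    """Hypothetical agent that follows the interface but never learns."""

    def reset(self):
        return self

    def act(self, observation, no_exploration=False):
        return random.randint(0, 1)

    def remember(self, prev_state, action, reward, state, is_done):
        return self  # a learning agent would update its model here


def run_episode(env, agent):
    """Play one episode and return the sum of all rewards."""
    agent.reset()
    observation = env.reset()
    total_reward, is_done = 0.0, False
    while not is_done:
        action = agent.act(observation)
        next_observation, reward, is_done, _ = env.step(action)
        agent.remember(observation, action, reward, next_observation, is_done)
        observation = next_observation
        total_reward += reward
    return total_reward
```

Any agent that implements the same five methods can be dropped into run_episode without touching the loop.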
The idea of Q-Learning
The following is a mixed introduction to RL / Q-Learning. You might want to have a look at my Reinforcement Learning post as well.
If there is a finite set of observations $\mathcal{S}$ (states) and a finite set of actions $\mathcal{A}$, then you have $|\mathcal{S}| \cdot |\mathcal{A}|$ state/action pairs to rate. For some of the observations you also receive a reward. But rewards might be delayed:
                              a0 --- s3, r = 10
                            /
       a0 -- s1, r = 10   -
     /                      \
s0 -                          a1 --- s4, r = 0
     \
       a1 -- s2, r = -10  -   a0 --- s5, r = 100
This shows that you start in state s0, where you can execute the actions a0 and a1. Action a0 gives you a reward of 10, action a1 a reward of -10. So if you act greedily, you take a0 and end up in state s1. But if you look one step ahead, you can see that s2 leads to state s5 with a reward of 100, whereas from s1 you can only get a reward of 10 or 0.
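The difference between the greedy choice and looking ahead can be checked with a few lines of Python. This is only a sketch: the dictionary TREE and the three helper functions are names made up for this illustration, and the rewards are simply summed up without any discounting.

```python
# The example tree: state -> {action: (next_state, reward)}
TREE = {
    "s0": {"a0": ("s1", 10), "a1": ("s2", -10)},
    "s1": {"a0": ("s3", 10), "a1": ("s4", 0)},
    "s2": {"a0": ("s5", 100)},
}


def greedy_action(state):
    """Pick the action with the highest immediate reward."""
    return max(TREE[state], key=lambda a: TREE[state][a][1])


def best_total(state):
    """Best undiscounted sum of rewards reachable from state."""
    if state not in TREE:  # leaf: the episode is over
        return 0
    return max(r + best_total(nxt) for nxt, r in TREE[state].values())


def lookahead_action(state):
    """Pick the action with the best immediate plus future reward."""
    return max(
        TREE[state],
        key=lambda a: TREE[state][a][1] + best_total(TREE[state][a][0]),
    )


print(greedy_action("s0"), lookahead_action("s0"))
```

The greedy choice collects 10 + 10 = 20 in total, while the path via s2 collects -10 + 100 = 90.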
In many cases, one does not want a greedy action. But one also does not want to rely completely on very high rewards in the very far future. Direct rewards are preferred, but if a later reward is really high, it is worth waiting a bit longer. This thought leads to the value of a state or of a state/action pair: its current reward plus its future rewards. As we want to prefer rewards which come directly, we discount the future rewards with a factor $\gamma \in [0, 1]$:
$$V(s) = \max_{a \in \mathcal{A}} (R(s, a) + \gamma V(s'))$$
Here $s'$ is the state you reach by executing action $a$ in state $s$. The $\max_{a \in \mathcal{A}}$ means we execute the optimal action all the time.
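To see how the discount factor changes the decision, here is a small sketch that evaluates the value recursively on the example tree. It reads $s'$ as the single state that follows from executing $a$ in $s$ (a deterministic environment) and gives leaf states the value 0. TREE and value are names made up for this illustration.

```python
# The example tree: state -> {action: (next_state, reward)}
TREE = {
    "s0": {"a0": ("s1", 10), "a1": ("s2", -10)},
    "s1": {"a0": ("s3", 10), "a1": ("s4", 0)},
    "s2": {"a0": ("s5", 100)},
}


def value(state, gamma):
    """V(s) = max_a (R(s, a) + gamma * V(s')) for a deterministic tree."""
    if state not in TREE:  # leaf: no further rewards
        return 0.0
    return max(r + gamma * value(nxt, gamma) for nxt, r in TREE[state].values())


for gamma in (0.9, 0.1):
    print(gamma, value("s0", gamma))
```

With $\gamma = 0.9$ the path via s2 wins ($-10 + 0.9 \cdot 100 = 80$ against $10 + 0.9 \cdot 10 = 19$). With $\gamma = 0.1$ the future counts so little that the greedy action a0 is the better one ($10 + 0.1 \cdot 10 = 11$ against $-10 + 0.1 \cdot 100 = 0$).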
Most of the time, environments are not deterministic. Then you need to take into account the transition probability $T(s, a, s')$ of getting from state $s$ into state $s'$ when you execute action $a$:
$$V(s) = \max_{a \in \mathcal{A}} (R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s, a, s') V(s'))$$
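If $R$ and $T$ were known, the equation could be applied over and over again until the values stop changing. The following is a sketch of that idea under stated assumptions: the function value_iteration, the two-state example and the fixed number of sweeps are all made up for this illustration.

```python
def value_iteration(states, actions, R, T, gamma, sweeps=100):
    """Repeatedly apply V(s) = max_a (R(s, a) + gamma * sum_s' T(s, a, s') V(s'))."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # The right-hand side only reads the old V; the new dict replaces it afterwards
        V = {
            s: max(
                R[s, a]
                + gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                for a in actions
            )
            for s in states
        }
    return V


# Hypothetical example: "safe" ends the episode, "risky" stays in "start" half of the time
STATES = ["start", "end"]
ACTIONS = ["safe", "risky"]
R = {
    ("start", "safe"): 1.0,
    ("start", "risky"): 0.8,
    ("end", "safe"): 0.0,
    ("end", "risky"): 0.0,
}
T = {
    ("start", "safe", "end"): 1.0,
    ("start", "risky", "start"): 0.5,
    ("start", "risky", "end"): 0.5,
    ("end", "safe", "end"): 1.0,
    ("end", "risky", "end"): 1.0,
}
V = value_iteration(STATES, ACTIONS, R, T, gamma=0.9)
print(V)
```

In this example "risky" is the better action in "start": its value solves $V = 0.8 + 0.9 \cdot 0.5 \cdot V$, so $V = 0.8 / 0.55 \approx 1.45$, which is more than the 1.0 that "safe" yields.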
Ok, awesome! But now comes the tricky part: We don't have the function $V$. If both $\mathcal{S}$ and $\mathcal{A}$ are finite, we can simply:
- Initialize a table which has the columns (state, value of action 1, value of action 2, ..., value of action $n$) and one row per state. You could initialize it to zero.
- Run the agent. Update the $(state, action)$ cell with a weighted average of what was in the table + what was observed. The weighting factor is $\alpha \in (0, 1)$.
That's it.
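Written as a formula, the weighted average for an observed transition from $s$ to $s'$ with action $a$ and reward $r$ is the usual Q-Learning update:

$$Q(s, a) \leftarrow (1 - \alpha) \cdot Q(s, a) + \alpha \cdot (r + \gamma \max_{a'} Q(s', a'))$$

A minimal sketch of this update on a plain Python table could look as follows. The names make_table and update are made up for this illustration.

```python
from collections import defaultdict


def make_table(nb_actions):
    """Create a Q-table that starts with zeros for every state it sees."""
    return defaultdict(lambda: [0.0] * nb_actions)


def update(table, prev_state, action, reward, state, is_done, alpha, gamma):
    """Blend the old table entry with the newly observed target."""
    target = reward
    if not is_done:
        # Only non-terminal states have a future worth adding
        target += gamma * max(table[state])
    old = table[prev_state][action]
    table[prev_state][action] = (1 - alpha) * old + alpha * target


table = make_table(nb_actions=2)
update(table, 0, 1, 10.0, 1, False, alpha=0.7, gamma=0.99)
print(table[0])  # entry for action 1: 0.3 * 0 + 0.7 * 10 = 7
```

Rearranged, the same update reads $Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \delta$ with $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$.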
You might want to read Best practice for Machine Learning Projects to understand why the following code was written as it is.
The latest code can be found on GitHub: MartinThoma:algorithms/
First, the configuration file:
model_name: 'qlearning'
problem:
  gamma: 0.99  # discounting factor
training:
  nb_epochs: 100000
  learning_rate: 0.7  # alpha
  print_score: 500  # each 500 episodes
  exploration:
    name: 'Boltzmann'
    clip: [-500, 500]
testing:
  nb_epochs: 10000
Now the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Q-Table Reinforcement Learning agent."""

# core modules
import logging
import os
import pickle
import sys

# 3rd party modules
import gym
import numpy as np
import yaml

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
np.set_printoptions(formatter={"float_kind": lambda x: "{:.2f}".format(x)})
def main(environment_name, agent_cfg_file):
    """
    Load, train and evaluate a Reinforcement Learning agent.

    Parameters
    ----------
    environment_name : str
    agent_cfg_file : str
    """
    cfg = load_cfg(agent_cfg_file)

    # Set up environment and agent
    env = gym.make(environment_name)
    cfg["env"] = env
    cfg["serialize_path"] = "artifacts/{}-{}.pickle".format(
        cfg["model_name"], environment_name
    )
    agent = load_agent(cfg, env)
    agent = train_agent(cfg, env, agent)
    rewards = test_agent(cfg, env, agent)
    print("Average reward: {:5.3f}".format(rewards))
    print("Trained epochs: {}".format(agent.epoch))
class QTableAgent:
    """
    Q-Table Agent.

    Parameters
    ----------
    agent_cfg : dict
    nb_observations : int
    nb_actions : int
    """

    def __init__(self, agent_cfg, nb_observations, nb_actions):
        self.nb_obs = nb_observations
        self.nb_act = nb_actions
        self.Q = np.zeros([nb_observations, nb_actions])
        self.alpha = agent_cfg["training"]["learning_rate"]  # learning rate
        self.gamma = agent_cfg["problem"]["gamma"]  # discount
        self.epoch = 0
        self.exploration = agent_cfg["training"]["exploration"]

    def reset(self):
        """Reset the agent. Call this at the beginning of an episode."""
        self.epoch += 1

    def act(self, observation, no_exploration=False):
        """
        Decide which action to execute.

        Parameters
        ----------
        observation : int
        no_exploration : bool, optional (default: False)

        Returns
        -------
        action : int
        """
        assert self.epoch >= 1, "Reset before you run an episode."
        action = np.argmax(self.Q[observation, :])
        if not no_exploration:
            if self.exploration["name"] == "epsilon-greedy":
                if np.random.uniform() < self.exploration["epsilon"]:
                    # randint excludes the upper bound
                    action = np.random.randint(0, self.nb_act)
            elif self.exploration["name"] == "Boltzmann":
                T = 1  # temperature
                clip = self.exploration["clip"]
                q_values = self.Q[observation, :].astype("float64")
                q_values = np.clip(q_values / T, clip[0], clip[1])
                exp_values = np.exp(q_values)
                probs = exp_values / np.sum(exp_values)
                action = np.random.choice(range(self.nb_act), p=probs)
            else:
                raise NotImplementedError(self.exploration["name"])
        return action

    def remember(self, prev_state, action, reward, state, is_done):
        """
        Store data in the Q-Table. Here, the learning happens.

        Parameters
        ----------
        prev_state : int
        action : int
        reward : float
        state : int
        is_done : bool
        """
        delta = reward - self.Q[prev_state, action]
        if not is_done:
            # Terminal states have no future reward to bootstrap from
            delta += self.gamma * np.max(self.Q[state, :])
        self.Q[prev_state, action] += self.alpha * delta
        return self

    def save(self, path):
        """Serialize an agent."""
        data = {"Q": self.Q, "epoch": self.epoch}
        with open(path, "wb") as handle:
            pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
        return self

    def load(self, path):
        """Load an agent."""
        with open(path, "rb") as handle:
            data = pickle.load(handle)
        self.Q = data["Q"]
        self.epoch = data["epoch"]
        return self
def load_agent(cfg, env):
    """
    Create (and load) a QTableAgent.

    Parameters
    ----------
    cfg : dict
    env : OpenAI environment
    """
    agent = QTableAgent(cfg, env.observation_space.n, env.action_space.n)
    if os.path.isfile(cfg["serialize_path"]):
        # Continue with a previously saved agent
        agent.load(cfg["serialize_path"])
    return agent


# General training and testing code
def train_agent(cfg, env, agent):
    """
    Train an agent in environment.

    Parameters
    ----------
    cfg : dict
    env : OpenAI environment
    agent : object

    Returns
    -------
    agent : object
    """
    cum_reward = 0.0
    for episode in range(cfg["training"]["nb_epochs"]):
        agent.reset()  # act() asserts that reset() was called before
        observation_previous = env.reset()
        is_done = False
        while not is_done:
            action = agent.act(observation_previous)
            observation, reward, is_done, _ = env.step(action)
            cum_reward += reward
            agent.remember(observation_previous, action, reward, observation, is_done)
            observation_previous = observation
        if episode % cfg["training"]["print_score"] == 0 and episode > 0:
            agent.save(cfg["serialize_path"])
            print("Average score: {:>5.2f}".format(cum_reward / (episode + 1)))
    return agent


def test_agent(cfg, env, agent):
    """Calculate average reward."""
    cum_reward = 0.0
    for episode in range(cfg["testing"]["nb_epochs"]):
        observation_previous = env.reset()
        is_done = False
        while not is_done:
            action = agent.act(observation_previous, no_exploration=True)
            observation, reward, is_done, _ = env.step(action)
            cum_reward += reward
            observation_previous = observation
    return cum_reward / cfg["testing"]["nb_epochs"]
# General code for loading ML configuration files
def load_cfg(yaml_filepath):
    """
    Load a YAML configuration file.

    Parameters
    ----------
    yaml_filepath : str

    Returns
    -------
    cfg : dict
    """
    # Read YAML experiment definition file
    with open(yaml_filepath, "r") as stream:
        cfg = yaml.safe_load(stream)
    cfg = make_paths_absolute(os.path.dirname(yaml_filepath), cfg)
    return cfg


def make_paths_absolute(dir_, cfg):
    """
    Make all values for keys ending with `_path` absolute to dir_.

    Parameters
    ----------
    dir_ : str
    cfg : dict

    Returns
    -------
    cfg : dict
    """
    for key in cfg.keys():
        if key.endswith("_path"):
            cfg[key] = os.path.join(dir_, cfg[key])
            cfg[key] = os.path.abspath(cfg[key])
            if not os.path.isfile(cfg[key]):
                logging.error("%s does not exist.", cfg[key])
        if isinstance(cfg[key], dict):
            cfg[key] = make_paths_absolute(dir_, cfg[key])
    return cfg
def get_parser():
    """Get parser object."""
    from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter

    parser = ArgumentParser(
        description=__doc__, formatter_class=ArgumentDefaultsHelpFormatter
    )
    # Assumption: both arguments are positional; their names match the
    # attributes used in the __main__ block.
    parser.add_argument("environment_name", help="OpenAI Gym environment")
    parser.add_argument("agent_cfg_file", help="Configuration file for the agent")
    return parser


if __name__ == "__main__":
    args = get_parser().parse_args()
    main(args.environment_name, args.agent_cfg_file)
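The two exploration strategies used in act can also be looked at in isolation. The following is a sketch in plain Python without NumPy; the function names are made up for this illustration and the default clip range mirrors the configuration file from above.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon take a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature=1.0, clip=(-500.0, 500.0)):
    """Softmax over the Q-values; clipping keeps exp() from overflowing."""
    scaled = [min(max(q / temperature, clip[0]), clip[1]) for q in q_values]
    exp_values = [math.exp(v) for v in scaled]
    total = sum(exp_values)
    return [v / total for v in exp_values]


def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]


print(boltzmann_probs([1.0, 0.0]))
```

Epsilon-greedy treats all non-optimal actions the same, whereas Boltzmann exploration picks actions with higher Q-values more often. A higher temperature flattens the probabilities.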
| Environment                       | Config File    | Time  | Score    |
| --------------------------------- | -------------- | ----- | -------- |
| FrozenLake-v0                     | qlearning.yaml | 50s   | 0.166    |
| FrozenLake-v0                     | q-lr10.yaml    | 48s   | 0.743    |
| FrozenLake-v0                     | q-lr90.yaml    | 48s   | 0.156    |
| CliffWalking-v0                   | q-lr10.yaml    | 128s  | -13.000  |
| FrozenLake8x8-v0                  | q-lr10.yaml    | 179s  | 0.569    |
| NChain-v0                         | q-lr90.yaml    | 4614s | 1760.048 |
| OneRoundDeterministicReward-v0    | q-lr10.yaml    | 5s    | 1.00     |
| OneRoundNondeterministicReward-v0 | q-lr10.yaml    | 6s    | 2.475    |
| Roulette-v0                       | q-lr10.yaml    | 533s  | -2.764   |
| Taxi-v2                           | q-lr10.yaml    | 68s   | 8.471    |
| TwoRoundDeterministicReward-v0    | q-lr10.yaml    | 10s   | 3.000    |
| TwoRoundNondeterministicReward-v0 | q-lr10.yaml    | ERROR | ERROR    |