Build a Reinforcement Learning Terran Agent with the PySC2 2.0 Framework

Juan Arturo Cruz Cardona
8 min read · Nov 26, 2020

USEFUL LINKS FOR THE TUTORIAL

BEFORE STARTING

Through this tutorial, you will build a Smart Terran Agent capable of learning a better strategy over time through a reward system based on the actions it takes and the states that result from them.

System requirements

  • PySC2 2.0 framework installed
  • Pandas python library
  • Numpy python library
  • Basic python programming skills

Also take into consideration that applying reinforcement learning to the complete game is extremely complex and takes a lot of time and computational power. We can make things easier for ourselves by significantly reducing the complexity of the game; that is why the bot will be playing against “itself”. Instead of facing an opponent with all the abilities in the game, the opponent will have the same abilities and restrictions we define, and it will behave randomly.

1.- IMPORTS

Import the libraries at the top of the file.
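A minimal set that covers everything used below could look like this (random for the random agent, numpy and pandas for the QLearningTable, and the pysc2 modules for the agents, the environment, and the raw actions):

```python
import random

import numpy as np
import pandas as pd
from absl import app

from pysc2.agents import base_agent
from pysc2.env import run_loop, sc2_env
from pysc2.lib import actions, features, units
```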

2.- CREATE A QTABLE FOR REINFORCEMENT LEARNING

We must define the algorithm for our machine learning agent, and this is where the QLearningTable comes into action; it is a simplified form of reinforcement learning.

It is essentially a spreadsheet of all the states the game has been in, and how good or bad each action is within each state. The bot updates the values of each action depending on whether it wins or loses, and over time it builds a fairly good strategy for a variety of scenarios.

Inside the QLearningTable class we will also define the following methods.
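As a sketch, the class holds its action names and hyperparameters (the exact default values here are illustrative) and stores the table itself as a pandas DataFrame, one row per state seen so far and one column per action:

```python
class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions            # action names the agent can pick from
        self.learning_rate = learning_rate
        self.reward_decay = reward_decay  # the discount rate (0.9)
        self.e_greedy = e_greedy          # probability of exploiting the best-known action
        # One row per state seen so far, one column per action.
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)
```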

Choose Action Method

The main method of the learning table is choosing the action to perform. The e_greedy parameter means it will choose the preferred action 90% of the time, and the remaining 10% of the time it will choose at random to explore other possibilities.

In order to choose the best action, it first retrieves the value of each action for the current state, then chooses the highest-valued action. If multiple actions share the same highest value, it will choose one of those actions at random.
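A possible implementation, assuming the table structure sketched above:

```python
    # Inside the QLearningTable class
    def choose_action(self, observation):
        self.check_state_exist(observation)
        if np.random.uniform() < self.e_greedy:
            # Exploit: pick the highest-valued action, breaking ties at random.
            state_action = self.q_table.loc[observation, :]
            action = np.random.choice(
                state_action[state_action == np.max(state_action)].index)
        else:
            # Explore: pick any action at random the other 10% of the time.
            action = np.random.choice(self.actions)
        return action
```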

Learn Method

The next important method here is learn. It takes as parameters:

  • s refers to the previous state
  • a is the action that was performed in that state
  • r is the reward that was received after taking the action
  • s_ is the state the bot landed in after taking the action

First, in q_predict we retrieve the value the table currently stores for taking that action in the previous state.

Next we determine the maximum possible value across all actions in the current state, discount it by the decay rate (0.9), and add the reward we received.

Finally we take the difference between the new value and the previous value and multiply it by the learning rate. We then add this to the previous action value and store it back in the Q table.

The result of all this is that the action's value will either increase or decrease a little depending on the state we end up in, which will make it either more or less likely to be chosen if we ever reach the previous state again in the future.

In simple terms, it performs these calculations, updates the table accordingly, and that is how the agent learns over time.
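Putting those three steps together, the learn method could look like this:

```python
    # Inside the QLearningTable class
    def learn(self, s, a, r, s_):
        self.check_state_exist(s)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            self.check_state_exist(s_)
            # Best possible value in the new state, discounted, plus the reward.
            q_target = r + self.reward_decay * self.q_table.loc[s_, :].max()
        else:
            q_target = r
        # Nudge the stored value toward the target by the learning rate.
        self.q_table.loc[s, a] += self.learning_rate * (q_target - q_predict)
```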

The check_state_exist method just checks whether the state is already in the QLearningTable, and if not it adds it with a value of 0 for all possible actions.
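For example (this sketch uses pd.concat, since DataFrame.append is no longer available in recent pandas versions):

```python
    # Inside the QLearningTable class
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # Add a new row of zeros for a state we have never seen before.
            new_row = pd.Series([0] * len(self.actions),
                                index=self.q_table.columns,
                                name=state)
            self.q_table = pd.concat([self.q_table, new_row.to_frame().T])
```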

3.- DEFINE A BASE AGENT

Then we must define a Base Agent, a class that both our random and learning agents will use. It contains all of the actions that can be performed, plus a few other methods both agents can share to keep them a little simpler.

Helper Functions

To perform these actions the bot will need the helper functions described below:

  • Return a specific set of our units (applies to buildings and troops)
  • Return only the units that are finished, not the ones still under construction (applies to buildings and troops)
  • Calculate the distances between a list of units and a specified point
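A sketch of the base agent with its action list and these helpers (the helper names are just the ones used in this tutorial, not part of PySC2 itself; an enemy-side lookup is included because the learning agent will also want to count enemy units later):

```python
class Agent(base_agent.BaseAgent):
    # The shared set of action names both agents can choose from.
    actions = ("do_nothing",
               "harvest_minerals",
               "build_supply_depot",
               "build_barracks",
               "train_marine",
               "attack")

    def get_my_units_by_type(self, obs, unit_type):
        # All of our own units/buildings of a given type.
        return [unit for unit in obs.observation.raw_units
                if unit.unit_type == unit_type
                and unit.alliance == features.PlayerRelative.SELF]

    def get_enemy_units_by_type(self, obs, unit_type):
        # The same lookup, but for the opponent's units.
        return [unit for unit in obs.observation.raw_units
                if unit.unit_type == unit_type
                and unit.alliance == features.PlayerRelative.ENEMY]

    def get_my_completed_units_by_type(self, obs, unit_type):
        # Only the units/buildings that have finished construction.
        return [unit for unit in obs.observation.raw_units
                if unit.unit_type == unit_type
                and unit.build_progress == 100
                and unit.alliance == features.PlayerRelative.SELF]

    def get_distances(self, obs, units, xy):
        # Euclidean distance from each unit in the list to a point.
        units_xy = [(unit.x, unit.y) for unit in units]
        return np.linalg.norm(np.array(units_xy) - np.array(xy), axis=1)
```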

Specific Actions

And finally the actions that can be done are:

  • Send an idle SCV back to a mineral patch
  • Build structures (for further and deeper reinforcement learning you can consider adding more types of buildings)
  • Train an army of marines and send them to attack
  • A step method that records where our base is placed, plus a do-nothing action (no operation to perform)

It's important to mention that, unlike regular actions, raw actions do not crash if the action cannot be performed, but it is best to perform your own checks so that error notifications do not appear in the game.

Finally, each of these methods receives the observation from each step so that it can act independently.
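As an illustration, a few of these actions could be written roughly like this (the attack coordinates are rough Simple64 positions and purely illustrative, and the building and harvesting actions follow the same pattern of checking preconditions before returning a raw function call):

```python
    # Inside the Agent class
    def step(self, obs):
        super(Agent, self).step(obs)
        if obs.first():
            # Remember which corner we spawned in so later targets make sense.
            command_center = self.get_my_units_by_type(
                obs, units.Terran.CommandCenter)[0]
            self.base_top_left = (command_center.x < 32)

    def do_nothing(self, obs):
        return actions.RAW_FUNCTIONS.no_op()

    def train_marine(self, obs):
        completed_barracks = self.get_my_completed_units_by_type(
            obs, units.Terran.Barracks)
        if len(completed_barracks) > 0 and obs.observation.player.minerals >= 50:
            return actions.RAW_FUNCTIONS.Train_Marine_quick(
                "now", completed_barracks[0].tag)
        return actions.RAW_FUNCTIONS.no_op()

    def attack(self, obs):
        marines = self.get_my_units_by_type(obs, units.Terran.Marine)
        if len(marines) > 0:
            attack_xy = (38, 44) if self.base_top_left else (19, 23)
            return actions.RAW_FUNCTIONS.Attack_pt(
                "now", [marine.tag for marine in marines], attack_xy)
        return actions.RAW_FUNCTIONS.no_op()
```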

4.- RANDOM AGENT

Here we choose an action at random from our predefined list, and then use Python's getattr, which essentially converts the action name into a method call, passing in the observation as an argument.
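The whole agent fits in a few lines:

```python
class RandomAgent(Agent):
    def step(self, obs):
        super(RandomAgent, self).step(obs)
        # Turn a random action name into a bound method and call it.
        action = random.choice(self.actions)
        return getattr(self, action)(obs)
```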

5.- SMART AGENT

The Smart Agent is like the random agent, but with the machine learning added: this is where we initialize the QLearningTable when the agent is created. It receives the actions of the Base Agent, which is how the QLearningTable knows which actions it can choose from and perform.
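Something along these lines:

```python
class SmartAgent(Agent):
    def __init__(self):
        super(SmartAgent, self).__init__()
        # The Q-table learns over the same action names the base Agent exposes.
        self.qtable = QLearningTable(self.actions)
        self.new_game()
```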

New Game Method

Here we start a new game once the current one is finished, simply by resetting a few values: where the base is, and the previous state and action.

The previous state and previous action are important for the reinforcement learning: each time the agent performs an action, it stores that action in the previous-action variable so that in the next step it knows what it has already performed, and the same goes for the state.
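A simple version of that method:

```python
    # Inside the SmartAgent class
    def new_game(self):
        self.base_top_left = None
        self.previous_state = None
        self.previous_action = None
```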

Get State Method

It essentially takes all of the game values we find useful and important, for example how many barracks, supply depots, or idle SCVs we have, and returns them in a tuple that can feed into our machine learning algorithm so it knows the current state of the game at a given point.

To make the machine learning algorithm learn faster, we simply want to know whether or not we can afford certain actions, rather than paying attention to exactly how many minerals we have.

If you want to add more types of units or buildings and store their values, you can check the links at the top to find the related functions.
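A sketch of the state tuple (which counts you include is up to you; the mineral thresholds are the standard Terran costs, and the enemy marine count uses the enemy-side helper from the base agent):

```python
    # Inside the SmartAgent class
    def get_state(self, obs):
        scvs = self.get_my_units_by_type(obs, units.Terran.SCV)
        idle_scvs = [scv for scv in scvs if scv.order_length == 0]
        supply_depots = self.get_my_units_by_type(obs, units.Terran.SupplyDepot)
        barracks = self.get_my_units_by_type(obs, units.Terran.Barracks)
        marines = self.get_my_units_by_type(obs, units.Terran.Marine)
        enemy_marines = self.get_enemy_units_by_type(obs, units.Terran.Marine)

        # Affordability flags instead of raw mineral counts keep the state space small.
        can_afford_supply_depot = obs.observation.player.minerals >= 100
        can_afford_barracks = obs.observation.player.minerals >= 150
        can_afford_marine = obs.observation.player.minerals >= 50

        return (len(scvs), len(idle_scvs),
                len(supply_depots), len(barracks),
                len(marines), len(enemy_marines),
                can_afford_supply_depot, can_afford_barracks, can_afford_marine)
```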

That seems like a lot of code, but really we're just keeping track of our units and the enemy units.

Step Action Method

It gets the current state of the game and chooses an action: it feeds the state into the QLearningTable, which then picks either the best action or a random one and returns it.

Then we want to learn from the action and state that were previously saved, so we call the learn method of the QLearningTable and pass in those values along with the reward received from the game (most of the time it will be 0, with 1 if we win or -1 if we lose). We also pass in whether the step is terminal or not, which matters because a reward at the end of the game is not treated the same way as the rewards we receive along the way.

Then we store the new previous state/action to use in the next step of the game, and finally we execute the action we have chosen.
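Putting the step method together (the state tuple is converted to a string so it can be used as a row index in the pandas table):

```python
    # Inside the SmartAgent class
    def step(self, obs):
        super(SmartAgent, self).step(obs)
        state = str(self.get_state(obs))
        action = self.qtable.choose_action(state)
        if self.previous_action is not None:
            # Learn from the transition we just finished; obs.reward is -1, 0 or 1.
            self.qtable.learn(self.previous_state,
                              self.previous_action,
                              obs.reward,
                              'terminal' if obs.last() else state)
        self.previous_state = state
        self.previous_action = action
        return getattr(self, action)(obs)
```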

If we don't reset previous_state and previous_action, the agent could learn incorrectly at the start of each game, so let's reset those values whenever a new game starts.
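One way to do that is to override the agent's reset method, which PySC2 calls between episodes:

```python
    # Inside the SmartAgent class
    def reset(self):
        super(SmartAgent, self).reset()
        self.new_game()
```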

6.- MAIN METHOD

At the end we have the method that runs the game so we can see what happens in real time. Here we create the SmartAgent and the RandomAgent, set them as the players, and pass them into the run loop so it controls both agents instead of one. Once it starts, it will open two windows, one for each agent.
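A sketch of that main function, assuming the Simple64 map and raw actions/observations (the step multiplier and episode count are just example values):

```python
def main(unused_argv):
    agent1 = SmartAgent()
    agent2 = RandomAgent()
    try:
        with sc2_env.SC2Env(
                map_name="Simple64",
                players=[sc2_env.Agent(sc2_env.Race.terran),
                         sc2_env.Agent(sc2_env.Race.terran)],
                agent_interface_format=features.AgentInterfaceFormat(
                    action_space=actions.ActionSpace.RAW,
                    use_raw_units=True,
                    raw_resolution=64),
                step_mul=48,
                disable_fog=True) as env:
            # Drive both agents in the same environment.
            run_loop.run_loop([agent1, agent2], env, max_episodes=1000)
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    app.run(main)
```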

Here you can see both agents running:

CONCLUSIONS

When you initially run this, both agents will do pretty much the same random actions (training some marines and attacking, building randomly, or other odd things), but this is just because the smart agent is not smart enough yet.

Eventually the Smart Agent will discover that it can build more units for attacking and place buildings at the front line to stop the enemy attack, and by the time we reach hundreds of games played, the strategy has already evolved.

There is also the fact that the random agent can stumble onto a better strategy simply because it is operating randomly, so the win percentage of the Smart Agent will not always be 100%.

There are things not considered here, like the health of units and buildings, where the enemy marines are, and more, so implementing those in the future could lead to better decisions and a better win rate.

Thanks for reading :) You can find the complete code here.
