Liminal Bits

Using Neural Networks to Detect Bots

26 Jun 17 19:55 -0400

Hi, in this blog post, I hope to explore the performance of neural networks, particularly auto-encoders, in detecting anomalous Dota 2 matches using the feature pipeline (if it can even be called such) I have published on my github. Essentially, the idea was to collect a bunch of matches from Patch 7.06c and then feed them into a neural network to detect weird matches.

To give a bit of a context, Dota 2 is an online video game designed and managed by Valve. In this game, five players are placed on two different factions, Radiant and Dire, for a total of ten players in a match. The goal of the game is clear objectives on the game map and make way to destroy the “throne” of the opposing faction. Each of the five players plays a different role on the team, so players are well differentiated, and teamwork is required. Basically, one has to acquire resources such as gold and magical items, eliminate the enemy multiple times, and do whatever it takes to clear the main objective of destroying the opposing throne.

The sort of data used for the project will be what I consider to be highly representative bits of information from Dota 2 matches. Normally, some features would be categorical (such as the type of hero a player is selecting) or numerical (such as the gold spent by a player), and the categorical features would need to be encoded into some sparse representation. This was the case in earlier attempts when I used the character, or hero, chosen by each player and the items they bought. Unfortunately, these categorical features had high cardinalities, and I ended up only using the numerical features such as the kills per minute or gold per minute of the different players— while also making sure to differentiate by role and faction. A more complete list of the features used would include:

kills, gold, and experience per minute
total gold, kills, assists, and deaths and differentiated by player as well
hero damage and healing
tower kills and courier kills by player
game duration
abandons
negative and positive votes
ward usage

The network architecture consists of a single hidden layer with sigmoid activation function and the output layer using the identity activation function, but more complicated networks might have multiple layers of weights as the encoder and decoder.

The matches in the designated training and test sets would be used train the neural network and subsequently generate a distribution of reconstruction errors. If the auto-encoder and the features used were both perfect, then the matches would be reconstructed perfectly with no errors; in actuality, some the features of some matches will be reconstructed with higher errors than others.

The idea is that an auto-encoder can find a sparse representation to encode most matches, and the matches with higher reconstruction errors could be considered anomalous. Other uses of auto-encoders include dimension reduction or de-noising, but I will go along with the anomalous interpretation. Ideally, the most anomalous matches will have situations where some of the players feed or troll or bots are present in the game.

The behavior, training, and tuning of auto-encoders is very similar to that of regular (not necessarily regularized) multi-layer perceptrons. Feed-forward and backpropagation are still key parts of the algorithm. The main constraint is that the final output layer needs to match the input layer as much as possible.

To gather data, I used the OPENDOTA API over a period a little over a week to gather about 170,000 matches. The data gathered included things such as what heroes went to which lane, the kills per minute of the Position 1 player from Radiant (and Dire), ward uses, and various other things that could be relevant in a match. I selected for All Pick or Captain’s Mode Public and Ranked matches. Since the game is mostly balanced for these sorts of games, then I figured the auto-encoder would have an easier time training on this sort of data.

After generating the data, I did standard feature scaling to make the numerical data for susceptible to learning. I replaced missing values with zeros, as a missing value for the ancient creep kills for the position 5 player on a team indicated that the player just did not harvest ancient creeps. At this point, I also selected out some of the data I gathered because it data quality issues or because it confused the network model in reaching my purpose.

The data goes into a TensorFlow implementation of a simple auto-encoder model. The model has one hidden layer with three quarters of the number of neurons of the input layer (and output layer). I used the AdamOptimizer to train the network. Stochastic gradient descent was having difficulties with local minima. After training, I calculate the reconstruction error for each match, and the reconstruction error is simply only the sum of the square of the residual across all the features.

x = tf.placeholder(tf.float32, [None, NumFeatures])
y = x

weights_1 = tf.Variable(tf.random_normal([NumFeatures, layer_size[0]], stddev = 1.0/NumFeatures/100), name='weights_1')
bias_1 = tf.Variable(tf.random_normal([layer_size[0]], stddev = 1.0/NumFeatures/100), name='bias_1')

weights_2 = tf.Variable(tf.random_normal([layer_size[0], layer_size[1]], stddev = 1.0/NumFeatures/100), name='weights_2')
bias_2 = tf.Variable(tf.random_normal([layer_size[1]], stddev = 1.0/NumFeatures/100), name='bias_2')
  
layer1 = tf.tanh(tf.matmul(x, weights_1) + bias_1)
output = tf.tanh(tf.matmul(layer1, weights_2) + bias_2)

cost = tf.reduce_mean(tf.reduce_sum(tf.pow(y-output, 2), 1))
rank = tf.rank(cost)

learning_rate = 0.000001
beta1 = 0.5
beta2 = 0.5
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1, beta2=beta2)
gradients, variables = zip(*optimizer.compute_gradients(cost))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
train_op = optimizer.apply_gradients(zip(gradients, variables))
    
variable_dict = {'weights_1': weights_1, 'weights_2': weights_2,
                     'bias_1': bias_1, 'bias_2': bias_2}
saver = tf.train.Saver(variable_dict)
init = tf.global_variables_initializer()

ckpoint_dir = os.path.join(os.getcwd(), 'model-backups/model.ckpt')

Afterwards, the anomalous matches are the matches with a total reconstruction cost at the 99th percentile or higher. This level was determined by manual inspection at the 90th, 95th, 98th and 99.9th percentiles.

I realize that there are many false positives in the anomalous matches. I think simple filtering rules and further improvements to the model mitigate the real risk of these concerns. In order to detect leavers by using anomalous matches, one could filter matches where an abandon happened given certain game durations and first blood timers. A particularly useful filter for detecting feeders is the following: In matches with a total residual being in the 99th percentile, if the feature with the highest reconstruction error was related to player kills (such as kills per minute or total deaths), then the match quite often included feeders or bots. This is not to say that matches in this higher end of the spectrum of total residuals did not include feeders or bots when most deformed feature was something unrelated to player kills, but false negatives are less costly than false positives.

Some results are pretty indicative of partial success. The follow matches had high enough residuals to count as anomalous matches and are clearly bots (or really dedicated players):

Some matches are just really weird:

https://www.dotabuff.com/matches/3215162338
https://www.dotabuff.com/matches/3215344415 (In this particular match, Disruptor placed more wards than expected, and a player abandoned.)

I am not really sure what is going with these matches:

Following the idea of success, using some of these bot matches, I was, incidentally, able to find some players that seemed to be losing a large amount of games via bots in order to play Battle Cup games at skill levels lower than their typical tier level. Since I happened onto these accounts by accident, I think further focus could serve the purpose of finding players that tend to exploit the matchmaking rating (MMR) system. Although, this is limited to players exposing their player data to third-parties.

Unfortunately, I had some issues with the data that were not foreseen later, and there is nothing beyond a rudimentary design in the neural network, nonetheless, I found what are some interesting trends that corresponded to some of my prior intuitions, and, more excitingly, some trends completely contradicted my expectations. Because of this, I’d say that some more effort into one of the tasks I mention above could enable the data (or a similar batch) into a more directed effort such.

These data quality issues did impact the distribution of the residuals in initial iterations. In fact, the first few times I tried this experiment, I was getting matches from the arcade mode Dark Moon.

Not all the matches I recorded are now available on Dotabuff.

A somewhat nuanced issue is the large number of matches were a substantial number of the players do not give third parties their match data. In these games, a substantial number of players would not have some features accessible such as actions per minute, so I stopped collecting that feature in later iterations of this pipeline/model.

Other issues in data arise from me collecting the wrong features or missing some features that could have been useful. The training and test data has which heroes were selected for the particular roles—along with their items—but the high cardinality of these dimensions made training infeasible on the machine I was using and the limits of my patience.

I also messed up in encoding the positive and negative votes that a match has. In fact, the match with the highest residual also had among the highest of negative votes. One should be suspect if a match has a residual when it comes to the number of votes, at least for this batch of data. When graphing, I excluded matches with these data abnormalities to better show behavior.

Often times, my intuition about certain features was just wrong. In a first few of the iterations, matches were sometimes considered anomalous if one of the players pinged a lot. Often times my intuition was right. Sometimes a match would be considered highly anomalous because a player spent an abnormal amount of gold for his role. Often times, this would be a player spam buying wards or teleport scrolls such as in match 3215353856 (Dire Position 2 spent an abnormal amount of gold relative to his performance).

I removed neutral kills as a feature at some point because they were dominating the effect of detecting outliers in lower priority positions, but I retained ancients because ancients still had signal towards anomalies (such as games with 5 carries). I also do not consider rune pickups or pings anymore.

In the iteration of the project at the time of writing this article, there is some interesting behavior on-going with wards. I expected that sentry and observer ward usage would be highly indicative of anomalous matches where players get frustrated and spam buy sentries. Often times, these matches actually corresponded to a player just buying more wards than usual but with good intentions such as in this match https://www.dotabuff.com/matches/3215448302.

Here are some generated graphs that attempt to demonstrate the effect of some of the features (or their combinations) on the residual. Surprisingly, the overall trends when looking at some of the individual features are the opposite of what one might expect in high residual games. There is only a weak correlation, if any, in most of these graphs, which highlights how contextual DotA matches are; any particular feature is meaningless without other metrics because of how tightly correlated everything is.

You can see that the Position 1 player on the Radiant side tended to have no activity in high residual matches.

The logarithm of the absolute total kill difference between teams against the residual. There is a downward trend, suggesting that most anomalous games have little action.

Below is a sample table of games in the 99th percentile of reconstruction error that correspond to the filtering. The possible identification of feeders and bots using the appropriate filters is pretty consistent. If you want to dive at the data yourself, feel free to access the Github repository for this project. If you want access to all the data, feel free to message.

actual	column	match id	predicted
49.0	negative votes	3215497967	1.0
15.0	negative votes	3215097220	1.0
9.0	negative votes	3215312675	1.0
0.997236430645	dire pos5 courier kills	3215162338	0.254108220339
0.967831552029	radiant pos3 gpm	3215181105	0.215277791023
0.986187636852	dire pos5 kda	3215377994	0.675353765488
7.0	negative votes	3215271631	1.0
0.915403366089	dire pos4 gpm	3215289100	0.162800624967
0.915403366089	dire pos4 gpm	3215279863	0.155206382275
5.0	negative votes	3215253055	1.0
0.977942168713	radiant pos1 sentry uses	3215448302	0.730815529823
0.0236208867282	dire pos3 gpm	3215416370	0.595808267593
0.243853777647	radiant pos3 gpm	3215261563	0.536930799484
4.0	negative votes	3215294948	0.999999940395
0.996014475822	radiant pos5 kills	3215155134	0.729707717896
0.172037020326	radiant pos4 gpm	3215350255	0.478152662516
0.765351176262	radiant pos1 kpm	3215341221	0.473023504019
0.796920537949	radiant pos2 gpm	3215306187	0.594073414803
0.995038449764	dire pos4 gpm	3215302636	0.616404891014
0.995248615742	dire pos5 gpm	3215316820	0.661065816879
0.991396725178	dire pos3 sentry uses	3215190109	0.725191056728
0.994378209114	dire pos2 gpm	3215355847	0.634865164757
0.992697298527	radiant pos3 kills	3215130854	0.734208106995
0.99599146843	radiant pos1 kills	3215259897	0.731749773026
0.420948237181	dire pos1 totalgold	3215394671	0.15580162406
0.853668451309	dire pos4 deaths	3215152225	0.677466273308
0.97054040432	radiant pos2 kpm	3215274711	0.548053383827
0.417959868908	dire pos3 xpm	3215366609	0.140711635351
0.837321698666	radiant pos2 courier kills	3215449992	0.687711238861
0.417725324631	radiant pos2 xpm	3215376405	0.142942205071
0.617924034595	dire pos5 kpm	3215268540	0.222700610757
0.772488415241	radiant pos4 kpm	3215420902	0.579163551331
0.720203399658	dire pos1 hero heal	3215332931	0.546277999878
0.720203399658	dire pos1 hero heal	3215327547	0.547204256058
0.570976436138	dire pos5 kpm	3215354992	0.277706980705
0.696351587772	dire pos2 lasthits	3215254939	0.59213912487
0.896779537201	dire pos5 tower kills	3215149372	0.706301689148
0.846275269985	radiant pos1 observer uses	3215370025	0.688222169876
0.99476057291	dire pos2 sentry uses	3215201012	0.0275219380856
0.737460970879	dire pos3 totalxp	3215272070	0.62419462204
2.0	negative votes	3215223639	0.999818563461
0.702699840069	dire pos1 hero dmg	3215422492	0.599272608757
0.420081436634	radiant pos1 xpm	3215260456	0.113636702299
0.13533718884	radiant pos3 sentry uses	3215354934	0.0243814960122
2.0	negative votes	3215377592	0.999819636345
2.0	negative votes	3215445972	0.999818325043
0.992697298527	radiant pos3 kills	3215101148	0.748484134674
0.776657044888	radiant pos4 sentry uses	3215174827	0.654739558697

There is still a lot of work to attempt in the architecture of the auto-encoder. I attempted ReLu activation functions, but I was having issues with dying neurons, especially in the iterations when I had multiple hidden layers. The hyperparameters have yet to be explored. To introduce regularization, one could attempt to modify the objective function or introduce dropout.

Nonetheless, hopefully this suggests a path to automate the moderation of the player community in order to have higher quality games.