I am looking at evaluating chess rating systems such as Elo, Chessmetrics and Glicko. Inferring predictions from ratings is simple: if player A has a higher rating than player B, then player A is expected to win the game.

The problem is that, in evaluating the different rating systems and their predictions, there are three possible outcomes:

1. player A can win;
2. player B can win; 
3. the game is a draw.

It is not really satisfactory to ignore draws, because 50 to 60 per cent of chess games end in a draw. Moreover, draws hold useful information: if a highly rated player draws with a lower-rated player, that is a good result for the lower-rated player.

What is the best way to evaluate chess rating systems without ignoring draws?

asked Jul 15 '10 at 20:54

Anton Ballus

edited Jul 15 '10 at 21:27

Joseph Turian ♦♦


3 Answers:

Elo-type rating systems are constructed to model p(win for A) as Rating(A) / (Rating(A) + Rating(B)). If we just want to choose a system as "best", we can compute the likelihood of the actual scores (assuming independent games) under each system and select the one that maximizes it. Ties are no problem, since we want our rating system to give a win probability close to 1/2 when the ratings are close. The formula for the likelihood, given the ratings and players for n games, would be:

L = ∏_{i=1..n} p_i^{s_i} (1 - p_i)^{1 - s_i},  where p_i = Rating(A_i) / (Rating(A_i) + Rating(B_i)) and s_i is player A's actual score in game i (1 = win, 1/2 = draw, 0 = loss).

Usually we want to compute the log-likelihood instead, for numerical convenience.
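A minimal sketch of this comparison in Python (the ratio form of p is just the placeholder from above; swap in each system's own expected-score formula, and the data layout is illustrative):

    import math

    def predicted_win_prob(rating_a, rating_b):
        # Ratio model from above; substitute the Elo, Chessmetrics or Glicko
        # expected-score formula for the system being evaluated.
        return rating_a / (rating_a + rating_b)

    def log_likelihood(games):
        # games: iterable of (rating_a, rating_b, score_a), score_a in {1, 0.5, 0}
        total = 0.0
        for rating_a, rating_b, score_a in games:
            p = predicted_win_prob(rating_a, rating_b)
            # log of p^s * (1-p)^(1-s); a draw contributes 0.5*log(p) + 0.5*log(1-p)
            total += score_a * math.log(p) + (1 - score_a) * math.log(1 - p)
        return total

    # Higher (log-)likelihood on the same games = better-fitting rating system.
    games = [(1850, 1700, 1.0), (1600, 1720, 0.5), (1500, 1900, 0.0)]
    print(log_likelihood(games))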

This answer is marked "community wiki".

answered Jul 17 '10 at 07:23

gordon007

edited Jul 17 '10 at 07:39

I think TrueSkill might interest you. The linked paper uses TrueSkill to build a time series of chess skills.

I'm not sure about the train-a-classifier approach. Reading briefly about these metrics, each of them is already designed to capture a particular form of win/lose/draw behavior, and predicting directly from their values might violate some important assumption in their design. For example, measuring the hinge/squared/log loss of a classifier will usually assume a symmetry that goes against the nature of these methods (since they go to a lot of trouble to weight the prior probabilities of a win/loss/draw in a match).

Maybe you could do a bootstrap evaluation of these metrics. You start with a set of matches, and want to evaluate the skills of all players in these matches. To do that, you sample (with replacement) many copies of this match set, and estimate the skills of all players according to one of these methods. Presumably the best one should have lower variance than the others (lower variance means it tends to agree with itself across resamples and is not too sensitive to one or two results).
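A rough sketch of that idea, assuming you have some estimate_ratings(matches) function per rating system that returns a dict of player -> rating (the function name and the variance summary are illustrative choices, not a fixed recipe):

    import random
    from collections import defaultdict
    from statistics import pvariance

    def bootstrap_rating_spread(matches, estimate_ratings, n_resamples=200, seed=0):
        # Resample the match set with replacement and record each player's
        # estimated rating on every resample.
        rng = random.Random(seed)
        ratings_per_player = defaultdict(list)
        for _ in range(n_resamples):
            resample = [rng.choice(matches) for _ in matches]
            for player, rating in estimate_ratings(resample).items():
                ratings_per_player[player].append(rating)
        # Average per-player variance; lower means the system is more stable
        # under resampling (compare systems on comparable rating scales).
        return sum(pvariance(r) for r in ratings_per_player.values()) / len(ratings_per_player)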

answered Jul 16 '10 at 13:42

Alexandre Passos ♦

edited Jul 16 '10 at 13:47

Is there really no other evaluation metric that will work for this problem? It seems like a relatively simple problem (but I may be missing something).

(Jul 16 '10 at 23:27) Anton Ballus

What about a measure like pseudo-r-squared?

(Jul 17 '10 at 00:18) Anton Ballus

I think the problem is that the ratings are already a model of the winning probability. Something else you could do is compute perplexity ( http://en.wikipedia.org/wiki/Perplexity ) or just held-out likelihood according to these models, and pick the one with the lowest perplexity or the highest held-out likelihood.
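For example, given the log-probability each rating system assigned to the observed outcome of each held-out game, perplexity is just (a small illustrative sketch):

    import math

    def perplexity(heldout_log_probs):
        # heldout_log_probs: log p(observed outcome) for each held-out game
        # under one rating system. Perplexity = exp(-average log-likelihood);
        # lower is better.
        return math.exp(-sum(heldout_log_probs) / len(heldout_log_probs))

    print(perplexity([math.log(0.6), math.log(0.5), math.log(0.55)]))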

(Jul 17 '10 at 08:05) Alexandre Passos ♦

Thanks Alexandre. Will look into this. Much appreciated.

(Jul 17 '10 at 18:55) Anton Ballus

Here is a possible way to evaluate the scores, using all three game outcomes. Admittedly, it's a bit of a funny, indirect way, but here goes:

You have a database of past games. For each game, you know the outcome (A wins, A loses, or draw) as well as the Elo, Chessmetrics and Glicko scores of the two players. Use 80% of the chess games as training data and train three different classifier models: The first model uses only the Elo scores as features to predict the outcome, the second model uses only the Chessmetrics scores as features, and the third one uses only the Glicko scores. After the three models are trained, apply each of them to the test portion of the chess games to predict the outcome and compare the accuracies of the models.

As a classifier you could use the Megam maxent classifier, for example. You have three classes (0 = A wins, 1 = A loses, 2 = draw). You can specify the strength with which a feature fires. For example, player A has score 1100, player B has score 1250, and player A wins; you describe the game with just one feature, the relative strength of player A according to the score, which here is -150. You say that the feature fires with strength -150, and the outcome class is 0.
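A sketch of that setup, with scikit-learn's multinomial logistic regression standing in for Megam (the games and ratings below are made up for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # One feature per game: player A's relative strength under one rating
    # system (pretend these are Elo scores). Classes: 0 = A wins, 1 = A loses,
    # 2 = draw.
    games = [
        (1100, 1250, 0),   # the example above: A is rated lower but wins
        (1500, 1400, 0),
        (1390, 1460, 1),
        (1620, 1615, 2),
        # ... the rest of the database
    ]
    X = np.array([[a - b] for a, b, _ in games], dtype=float)
    y = np.array([outcome for _, _, outcome in games])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000)   # multinomial with the default lbfgs solver
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))

    # Train one such model per rating system (Elo, Chessmetrics, Glicko) and
    # compare the test accuracies.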

Note that the classifier can derive non-obvious information, such as: whenever Elo rates someone as the worse player they end up winning, so Elo is an excellent score.

answered Jul 15 '10 at 23:25

Frank


I think he should probably use a nonlinear classifier, however, because with only two features a logistic regression will only learn very limited functions. Maybe an SVM with a Gaussian kernel would do better.
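For example, a sketch swapping in an RBF-kernel SVM on the same kind of single-feature encoding (made-up data again):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[-150.0], [100.0], [5.0], [-20.0]])   # rating differences
    y = np.array([0, 0, 2, 1])                           # 0 = A wins, 1 = A loses, 2 = draw
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X, y)
    print(clf.predict([[30.0]]))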

(Jul 16 '10 at 07:09) Alexandre Passos ♦

I agree that logistic regression might not be a powerful enough model with this few features.

(Jul 16 '10 at 08:10) Joseph Turian ♦♦

@Alexandre: Thanks, yes an SVM would be a good choice.

(Jul 16 '10 at 08:16) Frank