Tuesday 9 October 2012

More Rating, Time and Score

In Rating, Time and Score, I suggested that a chess player’s rating increases by K rating points for every doubling of his time on the clock.  I was not the first person to make this suggestion; see:

http://web.zone.ee/chessanalysis/summary450.pdf

The linked paper concludes that K is about 120 Elo points, based on time-handicap matches. It does not, however, include the clock simultaneous match between Topalov and the four-player Irish team; see:

http://www.chessvibes.com/reports/topalov-draws-irish-team-in-clock-simul

This match was drawn, with a rating difference of 367 Elo points.  Including this result would have raised the estimated value of K a little.

In Rethinking Chess Problem Server Ratings, I suggested applying the K relationship to a set of problems that are statistically representative of the moves of a chess game, rather than to chess games themselves.  In A Simpler Server Rating Method, I suggested applying this relationship to a single problem or a set of problems at the same level of difficulty.  I proposed the formula:

sU = 1/(1+10^[(RP - RU - K*log2(t/T))/400])

In this formula, sU is the user’s expected score, RP is the problem rating, RU is the user rating, t is the time taken on the problem, and T is the target time at which the user and problem ratings are valid.  If the time that the user takes doubles, the rating of the problem is reduced by K rating points.  This results in a K rating point reduction in the user’s rating.  We can see this by solving the equation above for RU:

RU = RP - K*log2(t/T) - 400*log(1/sU - 1)

The corresponding expected score for a problem is:

sP = 1 - sU = 1/(1+10^[(RU - RP + K*log2(t/T))/400])

If the user solves a set of problems with different ratings, and the time that he takes increases by a factor r, then in the formula for sU above the ratings of these problems are all reduced by K*log2(r) rating points.  As a result, the user’s rating is reduced by K*log2(r) rating points.
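As a minimal sketch, here are these formulas in Python.  The function names are my own, K = 120 is just an illustrative value (taken from the paper linked above), and target_t stands for the target time T at which the ratings are valid:

    import math

    K = 120.0  # rating points per doubling of time; illustrative value

    def expected_user_score(r_problem, r_user, t, target_t, k=K):
        # sU = 1/(1 + 10^[(RP - RU - K*log2(t/T))/400])
        adjusted_problem = r_problem - k * math.log2(t / target_t)
        return 1.0 / (1.0 + 10.0 ** ((adjusted_problem - r_user) / 400.0))

    def expected_problem_score(r_problem, r_user, t, target_t, k=K):
        # sP = 1 - sU
        return 1.0 - expected_user_score(r_problem, r_user, t, target_t, k)

    def user_rating_from_score(r_problem, s_user, t, target_t, k=K):
        # RU = RP - K*log2(t/T) - 400*log10(1/sU - 1)
        return (r_problem - k * math.log2(t / target_t)
                - 400.0 * math.log10(1.0 / s_user - 1.0))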

The graph below plots the expected score against the user rating for Michael de la Maza's last 19 games (in red), and a single-problem approximation to this curve (in green):
[Graph: expected score against user rating; red curve for the last 19 games, green curve for the single-problem approximation]
(See Rating by Expected Score.)  The curve for a single problem satisfies the Elo formula, but the curve for the last 19 games does not.  Even if the Elo formula (without time adjustment) applies to single opponents or problems, it does not, in general, apply to sets of problems or opponents.

In Rating, Time and Score, I showed that if we assume both the K formula and the Elo formula, it follows that:

t = h*[s/(1-s)]^(1/k)

In this formula, t is the time taken, h is the time to score 50%, s is the score, and k = K/(400*log(2)).  This is a very useful formula.  However, if it applies to individual problems, it does not (in general) apply to sets of problems.  In particular, it does not apply to sets of problems that are representative of the positions in a chess game.
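A small Python sketch of this relationship and its inverse, assuming base-10 logarithms as in the Elo formula (the function names and the default K value are purely illustrative):

    import math

    def time_for_score(h, s, K=120.0):
        # t = h * [s/(1-s)]^(1/k), with k = K/(400*log10(2))
        k = K / (400.0 * math.log10(2.0))
        return h * (s / (1.0 - s)) ** (1.0 / k)

    def score_for_time(h, t, K=120.0):
        # Inverse: (t/h)^k = s/(1-s), so s = (t/h)^k / (1 + (t/h)^k)
        k = K / (400.0 * math.log10(2.0))
        ratio = (t / h) ** k
        return ratio / (1.0 + ratio)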

Applying the Elo formula to individual problems, as I proposed in A Simpler Server Rating Method, provides a very simple, logical and mathematically sound rating method that takes time into account.  All we need to do is:

(1).  Subtract K*log2(t/T) from the problem ratings, when calculating user ratings.

(2).  Add K*log2(t/T) to the user ratings, when calculating problem ratings.

We can then use a standard rating method that does not take time into account.  If we calculate the user ratings from the problem ratings and then recalculate the problem ratings, we will get back to where we started; provided that the rating method used is mathematically consistent.  Few rating systems are mathematically consistent, but my Rating by Expected Score method is very simple and does the job.
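As an illustration, the two adjustments might look like this in Python.  The K value and the target time are placeholders, and the adjusted ratings would then be fed into whatever time-blind rating method is used (for example, my Rating by Expected Score method):

    import math

    K = 120.0        # rating points per doubling of time; illustrative
    TARGET_T = 60.0  # assumed target time per problem, in seconds

    def problem_rating_for_user_update(r_problem, t, k=K, target_t=TARGET_T):
        # Step (1): subtract K*log2(t/T) from the problem rating.
        return r_problem - k * math.log2(t / target_t)

    def user_rating_for_problem_update(r_user, t, k=K, target_t=TARGET_T):
        # Step (2): add K*log2(t/T) to the user rating.
        return r_user + k * math.log2(t / target_t)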

Existing chess problem servers adjust the score that a user receives for tackling a single problem according to the time that he took on that problem.  This makes superficial sense if the user solves the problem correctly (the longer he takes, the lower the score that he receives), but runs into difficulties if the user fails to solve the problem correctly.  Here too, the longer he takes, the worse his result, and the lower the score that he deserves.  If he instantly rejects a problem without even looking at it, he does not deserve to be penalised at all.  On the other hand, if he spends all day failing to solve a simple problem, he deserves to be penalised heavily.  The existing servers give him a zero score for a failure, however long he takes.  This does not make sense, but we cannot give him a negative problem score.  I believe that we have to cross this approach off our list.

In Rethinking Chess Problem Server Ratings, I tried to get round this problem by adjusting the users’ scores on sets of problems rather than on individual problems, and using these adjusted scores to calculate the user ratings.  Specifically, I suggested adjusting a user’s score to what it would have been if his average time taken was equal to a fixed target time. I also suggested adjusting a problem’s score to what it would have been if the average time taken by the users on that problem was equal to the fixed target time.  I ran into difficulties, however, when I tried to implement this method.  I found that successively calculating the user ratings from the problem ratings, and the problem ratings from the user ratings did not converge to a consistent solution.  I believe that we also have to cross this approach off our list.

My scheduling method (see Rethinking Chess Problem Server Ratings) remains an interesting possibility.  This method is based on the idea of using the server’s scheduling to ensure that all the users take the same average time per problem, and that the same average time is taken on every problem.  This method is potentially very convincing, because it does not rely on any assumptions concerning the relationship between time and score, but it needs reliable problem ratings calculated by another method.

We are left with the Simpler Server Rating method.  The only question that remains is whether we can validly apply the Elo formula to individual problems, rather than sets of problems representative of the positions in a chess game.  Nonetheless, if we want to calculate ratings for individual problems, we have little choice but to make that assumption.

2 comments:

  1. I am glad to see I am not the only one who is interested in such things! I have long wondered if there is a predictable/quantifiable pattern with respect to increased time and playing strength.

    I find the K value of 120 cited by the paper to be suspicious, not so much because 120 seems implausible itself (I don't really have any strong intuition about what the value would be), but because it is unclear how he calculated his numbers.

    Most conspicuously, for the Kramnik simul against Sebag and Werle in 2008, he gives them as having a performance rating of 2576, but since they lost 0-2, their performance rating should be 800 points lower than their average rating.

    I have started compiling results from all the clock simuls for which I can find reliable information, and so far it is quite hard to discern any sort of reliable pattern.

    Right now the average value of K in that review is 86, but the individual results are quite varied. The one in which K is the highest is the Topalov-Irish Team simul you mentioned in the post, where K is 168. The lowest value is for Kazimdzhanov-Uzbek Team 2007, where K=10(!).

    I haven't even gotten to the Kasparov-Israeli Team clock simul from 1998, where K will actually be negative, I think, because the Israeli team had an average ELO around 2600, if I remember correctly, and scored 1/8, which corresponds to about -330 rating points, which even from Kasparov's 2851, still has the team underperforming their ratings.

    I also think that in general there are issues with using clock simuls, as they encourage certain sorts of strategies that skew the results (for example, the stronger player makes easy draws on a couple boards so that the remaining time can be distributed among a fraction of the original boards; those half-points given away may not have been earned if the stronger player were more combative).

    As you indicate in another post (I think, and I'm too lazy to look it up right now), I believe the best/only way to settle this is to have a large number of time-odds matches/tournaments held, and to draw correlations from that.

    One of the other things on my to-do list is to analyze the results from the Puhajarve rapid that just concluded recently, as that sort of format is more in line with that sort of testing.

    Cheers, and keep up the excellent work on the blog!

  2. I believe that the most practical method here is to play matches against a fast computer program playing at a fixed strength. There are then no issues concerning thinking on the opponent's clock time, or having all the opponents move at the same time. Thirty games at one minute per move, and another thirty at half a minute per move would give us a much clearer picture than we have at present.
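    As a rough sketch, assuming the K relationship above, the two match scores would translate into an estimate of K like this (the function name and the example numbers are purely illustrative):

        import math

        def estimate_k(s_slow, t_slow, s_fast, t_fast):
            # Performance gap in Elo points between the two time controls,
            # divided by the number of doublings between them.
            gap = 400.0 * (math.log10(s_slow / (1.0 - s_slow))
                           - math.log10(s_fast / (1.0 - s_fast)))
            return gap / math.log2(t_slow / t_fast)

        # e.g. scoring 55% at one minute per move and 45% at half a minute
        # per move would suggest K of roughly 70 points per doubling.
        print(estimate_k(0.55, 60.0, 0.45, 30.0))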
