Empirical Rabbit: Finding K

This article discusses various attempts to find the value of K (the number of rating points a chess player gains for each doubling of his clock time.) It turns out that none of the attempts using results from problem servers work, which leaves us with just the more direct methods.

Using Server Times and Scores

I simulated a set of solution times and scores for 50 users solving the same 500 problems once, with K=200 (my last run in Varying K). I used these results to calculate sets of user and problem ratings, using various values of K. I then calculated the log of the probability of obtaining the set of solution times and scores, for each of these values of K (see Rating by Maximum Likelihood). The graph below plots the log this probability against the value of K used in the rating calculation:

The smaller the value of K, the more likely the set of results. There is no peak in this graph at K=200. It is clearly impossible to recover the value of K used in this simulation from the solution times and scores. This graph explains why my proposed method to find K failed, see Varying K. (N.B. We can prove that my method maximises the likelihood with respect to changes in the ratings, by applying the proof that I gave in Rating by Expected Score to each user and problem rating in turn.)

I believe that it is impossible to use solution times and scores alone to find the value of K that was used to generate these solution times and scores. (I am assuming that only first attempts at solving problems are included in the results.) I do not have a formal mathematical proof, but I expect that a proof is possible. No doubt there will be conditions for this proof, but I do not expect that they will be of practical importance.

I had much the same result when calculating a rating for each user based on the user's shorter solution times, and another based on his longer solution times. The difference between these two sets of ratings was minimised by the smallest value of K, not the value of K that was used to generate the solution times and scores.

Using Server Ratings, Times and Scores

If we have ratings, we can calculate a value for K, but this value of K will be the value used to calculate the ratings, not the value used to generate the solution times and scores. On my last simulation run, with K=200, five users happened to score exactly 0.75. The graph below plots the ratings of these users, calculated with K=100, against log₂ of the user’s average solution time:

(Two of the data points are coincident on the scale of this graph). K is the negative slope of this straight line. The negative slope is 100, which is the K value that I used to calculate the ratings, not the K=200 that I used to generate the solution times and scores. Since the user and problems ratings are calculated from the solution times and scores, they do not add any new information, and do not increase our prospects of finding the correct value of K.

I again had much the same result when calculating ratings for each user based on the user's shorter solution times, and another based on his longer solution times. If I used problem ratings previously calculated with a fixed value of K, the difference between these two sets of ratings was minimised by the value of K used to calculate the problem ratings, not the value of K that was used to generate the solution times and scores.

Using Single User Times and Scores

The approach that I suggested in Rating vs. Time on the Clock fails too. I sorted the times and scores for user number 49 (from my last simulation run), into ascending order of solution time. For this calculation, I assumed that the set of problems tackled could be approximated by a set of equally rated problems. I worked out the theoretical score 1/(1+ (h/t)^k), for each of the solution times. (As previously, h is the time to score 50%, t is the solution time, and k=K/(400*log(2)).) The graph below plots the cumulative actual score (blue) and cumulative theoretical score (red), with the values of h and k that gave the best fit:

The optimum value of k was 0.923, corresponding to K=111, which is a long way from the correct value of K=200.

Methods that Work

All these methods are hopeless, so how can we find K? There are several possibilities:

(1). We can get a rough idea from clock simultaneous matches, as discussed in More Rating, Time and Score.

(2). On a problem server, we can find the value of K that gives the best fit to the users’ ratings for playing chess games.

(3). We can construct equally difficult problem sets with the same statistical profile, as discussed in Tactics Performance Measurement, and solve them with differing time limits. This method does not require problem ratings, and the problem sets could be generated by a computer.

(4). We can play matches against a fixed strength computer program with differing time limits. A very rapid response time would minimise the effect of thinking on the computer's clock time. This method can be viewed as using a computer program to generate equally difficult problem sets.

2 comments:

Temposchlucker13 October 2012 at 02:21
I always thought that the so called K-factor was meant to regulate the speed of adaptation of the rating. High K is used for junior players since their rating changes fast overtime and low K for grandmasters, which gives a more reliable rating but slower adaptation to changes in level.
Geoff Fergusson13 October 2012 at 04:56
That K factor is completely different to mine. Perhaps I should have used a different letter. The USCF (and possibly others) use a K factor to combine a player’s results from a tournament with his previous results, to give an estimate of his current playing strength. I suggest a simpler approach akin to that used by the ECF, where players grades are calculated afresh each year. On a problem server, we should have plenty of results, so there is no need to muddy the issue. This approach was used in calculating the Edo Historical Ratings:

http://www.edochess.ca/Edo.explanation.html

(The rating calculation used there is mathematically equivalent to mine.) Each user effectively becomes a new user every N problems. If a user improves, we count him as a different user. This approach ensures that there is no rating inflation or deflation. It also gives a clear picture of any improvement, as a series of snap shots.

Friday 12 October 2012

Finding K

2 comments: