Empirical Rabbit: Problem Server Ratings Revisited

In my earlier article Rethinking Problem Server Ratings, I discussed the rating systems used by the current generation of online chess problem servers, and demonstrated that these systems do not give reliable ratings. Nobody has questioned the accuracy of this assessment, but it has been said that the original purpose of these rating systems was to serve up problems at a reasonable level for the user concerned, and that this objective has been largely met. Nonetheless, many less well informed players do take these numbers seriously. In my previous article, I proposed a sounder approach. The purpose of this article is to clarify the reasoning behind this approach, and expand upon the practical issues.

For the sake of clarity, I will make a few definitions. Let “time taken” be the time that a user takes to solve a problem either correctly or incorrectly, or to give up. For a user, let his “average time taken” be the sum of all the times taken on the problems that he has tackled, divided by the number of problems that he has tackled. For a problem, let the “average time taken” be the sum of all the times taken on that problem by all the users who have tackled it, divided by the number of these users.

The simplest sound approach to rating both users and problems is to use problem scheduling (see below) to ensure that the AVERAGE times taken for all the users AND all the problems are the same. This gives us a level playing field on the average times taken, so that we can calculate ratings for both the users and the problems using the Elo formula.

Clearly, it is important that all the users have the same average time taken if we are to apply the Elo formula. If we played a game against an opponent who had twice as much time on the clock, we would not expect the game to be rated as usual, ignoring our time handicap. Consider two problems on an online server. Suppose that:

* Half the users who tackled problem A got it right, and that half the users who tacked problem B got it right.

* The average rating of the users who tackled problem A was the same as the average rating of the users who tackled problem B.

* The average time taken by the users who tackled problem A was 10 seconds, and that the average time taken by the users who tackled problem B was 100 seconds.

Clearly, problem B was much harder than problem A, and it is not at all reasonable give both these problems the same rating.

What data do we need for the Elo calculations? For users, we need their average score, as a fraction between 0 and 1, and the average rating of the problems that they tackled. For problems, we need their average score, and the average rating of the users tackling them. The average score for a problem is the average of all its scores against the users who tackled it. The problem’s score against an individual user is 1 minus the user’s score. (If a user tackles a problem more than once, any attempt after the first is ignored for rating purposes.)

This method has the advantage of not restricting the learning process. If the user is too slow, we serve him easier problems. If he is too fast, we serve him harder problems. He can just forget about time and just solve the problems. He has the freedom to work on speed or accuracy, as he chooses. If he does more checking, he will be served easier problems to allow time for the checking.

I do not believe that we can do any better than this. I do not believe that it would at all be reasonable, for example, to attempt to make a user always take the same time to solve any problem that is served up to him. It might be argued that if the problem ratings were very accurate, we could ensure that all the problems served were of equal difficulty, and the user should be able to take the same time on each. One difficulty with this proposal is some users with the same rating will find a problem easy, and others will find it hard. Indeed, we all sometimes solve a problem quickly on one occasion and slowly on another. Most problems seem easy if the first move we try works! Problems are also be easier to solve if we happen to have seen a similar problem recently. Having recently seen a problem that looks similar, but has a different solution can have a negative effect. If we serve up a problem that the user solves straight away, it is not at all reasonable to ask him to twiddle his thumbs until a target time has arrived. He should be able to use the time productively to tackle another problem. Equally, it is unreasonable to time a user out when a target time arrives, just as he is about to find the solution. In a real game, we do not have to spend the same time on every move, but we do typically have to make a fixed number of moves in a fixed time, i.e. we have a fixed average time per move. Since we cannot force all our users always to take the same time on each problem, I do not see any way of ensuring that the same time is always taken on every problem.

How can we achieve equal average times taken for both users and problems? Let us consider an example. Suppose that the target time for the average time taken, for both users and problems is 30 seconds. Suppose also that a new user has tackled 10 problems with an average rating of 1500, and that his average time taken on these problems is 20 seconds. Suppose that we want to bring his average time taken up to the target time of 30 seconds, after he has solved about 30 problems more. Let the average time taken for the next 30 problems be x seconds. We then have the equation:

(10*20 + 30*x) / 40 = 30

Solving this equation gives x = 33.33... seconds. The user therefore needs to slow down by a factor of 33.33.../20 = 1.66... On the assumption that that the average time taken doubles for every 200 Elo points increase in rating (see my earlier article Rating vs. Time on the Clock), the average problem rating needs to increase by 200*log2(1.66...) = 147 Elo points. The ideal rating for the problems served to this user is therefore 1500 + 147 = 1647. Planning to reach the target average time after tackling 30 more problems may not be optimal, but it looks like a reasonable choice. We update this calculation every time the user tackles a new problem. If the user resists our attempts to make him speed up, this algorithm ensures that he will be served up progressively harder problems, until he does conform. (N.B. Because we update the calculation every time the user solves a problem, we claw back an average of about one thirtieth of what remains of the shortfall every time the user tackles a problem. It is therefore likely to take several times 30 problems to get very close to the target time.)

While this user is catching up, his solution times will be higher on average than the target time of 30 seconds. We therefore serve him only problems that have an average time taken of less than 30 seconds, so that their average times are also dragged towards to the target time. After applying this constraint, we simply select the problem with the rating closest to 1647, that has not already been tackled by the user concerned. (If the user wishes to revisit problems that he has already solved, these attempts must be excluded from the rating calculations of both that user and the problems concerned.) We update the ratings for users and problems only when their average time taken is very close to the target time.

I discussed an alternative approach in my previous article. [See my later article More Rating, Time and Score for a discussion of alternative approaches.] This method is based on the idea of adjusting the scores both for the users and the problems to what they would have been if their average times taken had all been equal to the target time. To do this, we can use the equation:

(t₁ / t₂)^k = [s₁/(1-s₁)] / [s₂/(1-s₂)]

See my earlier article Rating Time and Score. (In this equation, s₁ is the score with average time taken t₁, and s₂ is the score at the target time t₂. The optimum value of the constant k is not known, but it is believed to be about 1.7.) N.B. A score of zero becomes a score of zero and a score of 1 becomes a score of 1, irrespective of the times. With this method, we do not need to constrain the solution times. The scheduling method is the more convincing, however, because with that method time does not enter into the rating calculation. As noted in my earlier article, the two methods can be combined.

There is a potential chicken and egg problem with the scheduling method. With this method, we need problem ratings to get problem ratings. If we have no problem ratings at the outset, we have to serve problems up at random. Initially, we can rate users relative to the average difficulty of the problem set. The equation above can be used to adjust the users’ scores to what they would have been if they had all spent the same average time taken on each problem. Given sufficient results from one or more users with reliable real world FIDE ratings, we can convert these rating differences to full ratings. The next step is calculating estimated ratings for the problems. We can use the equation above again to adjust the scores of the problems, and use the method described in the previous article to optimise the value of k. When we have problem ratings, we can use the scheduling method.

It is highly desirable that a problem server serves up problems that none of the users have seen before. If we have seen a problem before, that not only inflates our rating, but also deflates the rating of that problem. Our inflated rating also inflates the ratings of the other problems that we solve. Our inflated rating does not matter to other users, but their ratings are not reliable unless the problem ratings are reliable. Outside a laboratory setting, it is impossible to solve this problem completely. Perfection is not likely to be achieved in practice, but the methods that I have described are easy to implement, and a big improvement on those currently in use.

Empirical Rabbit

Wednesday 1 August 2012

Problem Server Ratings Revisited

No comments:

Post a Comment