This material concerns the concept and use of ratings systems
for games such as chess, backgammon, and, yes, even pitch.
The subject seems to be mysterious and little understood by
most players. The principle is simple but the implementation
often leads to dissatisfaction.
ELO Ratings
“Things are always least interesting when they're most
clear, ...when everybody understands what's going on.”
Brian Eno, regarding music. He liked it mysterious, which is fine for music.
Questions & Answers
What is the ELO rating system?
The ELO Rating system assumes that one's skill level can be represented by a
single number (as opposed to a formula). It also assumes that one's performance,
though it may vary over a number of games, will have a mean value representing
the player's true skill. It provides a number that purportedly represents
that level. It was developed for one-on-one competition in games of skill
with no elements of chance.
What does "ELO" stand for?
Nothing. It's the anglicized spelling of the name of the li'l
feller that originated the system. It's capitalized to distinguish references
to the system from Mr. Élő himself.
Does it work?
Properly configured and calculated, it can work sufficiently well. Your skill
level determines your rating. Your rating is used to determine your skill
level. Obviously that's a chicken-egg thing, a circular process. Things
can go kerplookety in such systems (those with "feedback").
Does my win/loss percentage affect my rating?
No, not directly.
Why not?
It shouldn't. If you amass a 60/40 record against a lobotomized chimpanzee,
your skill is not the same as if you amass a 60/40 record against a world-class
champion. Your win/loss record would be a great indicator if you
played every possible opponent an equal number of games and so did
everyone else. This could be the case in a small tournament, but the
situation is not common, particularly in the online gaming environment,
so another method must be used.
Tell me more about ELO, then.
If your skill level (and that of your opponent) can be represented
numerically, then the outcome of a series of games (your wins versus
your losses) can be quite closely predicted. If chance (luck) is a factor,
then the series needs to be long enough for the effects of chance to
balance out.
Conversely, if the outcome of a series of games is known, and the skill level
of your opponent is known, your skill level can be accurately judged.
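As a rough sketch of that reverse calculation, the function below estimates a rating from a win fraction against one opponent of known rating. It assumes the logistic curve with a 400-point scale that many chess implementations use; this page does not give a formula, and the name `implied_rating` is hypothetical.

```python
import math

def implied_rating(opponent_rating: float, win_fraction: float,
                   scale: float = 400.0) -> float:
    """Estimate a rating from a win fraction against one rated opponent.

    Assumes a logistic expected-score curve with a 400-point scale;
    other implementations use other curves and coefficients.
    """
    if not 0.0 < win_fraction < 1.0:
        raise ValueError("win_fraction must be strictly between 0 and 1")
    odds = win_fraction / (1.0 - win_fraction)
    return opponent_rating + scale * math.log10(odds)

# The same 60/40 record implies very different skill levels
# depending on who the opponent was.
print(round(implied_rating(800, 0.60)))    # against a weak opponent
print(round(implied_rating(2700, 0.60)))   # against a world-class opponent
```

Under these assumptions a 60/40 record places you about 70 points above whoever you played, which is why the record alone says little without the opponent's rating.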
So what's the problem?
One problem is the starting point. Without the rating, how does one know
what the skill level of a given player is? Without knowing the skill
level, how does one calculate a meaningful rating? One might estimate
the skills of a group of players by observing the outcome of a large
number of games. One might assume that all players are equal, start them
with the same rating, and let the games adjust the ultimate outcome.
One might assume that new players are below the average by some
amount and let their games adjust the outcome. Differing approaches are
taken. For the results to be meaningful, the representation of the formula
must be correct.
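The "start everyone equal and let the games sort it out" approach can be sketched as follows. This is a minimal illustration, not any site's actual method: the logistic curve, the 400-point scale, the coefficient `k`, and the function names are all assumptions. Each game is scored at its long-run average so that luck is deliberately left out.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Win probability under an assumed logistic curve, 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))

def settle(true_a: float, true_b: float, start: float = 1500.0,
           k: float = 16.0, games: int = 1000) -> tuple[float, float]:
    """Start two players at the same rating and let results pull them apart.

    Every game is scored at the win probability implied by the true
    skills, i.e. the average result of many real games.
    """
    rating_a = rating_b = start
    average_result = expected_score(true_a, true_b)
    for _ in range(games):
        predicted = expected_score(rating_a, rating_b)
        change = k * (average_result - predicted)
        rating_a += change   # one player's gain...
        rating_b -= change   # ...is the other's loss
    return rating_a, rating_b

a, b = settle(true_a=1700, true_b=1500)
print(round(a - b))   # the gap approaches the true 200-point difference
```

In this sketch only the gap between the two players is recovered. Where the pair ends up on the scale is set by the starting value, which is one reason the choice of starting point matters.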
You mean, all ELO systems aren't alike?
"ELO system" is a generic term for the approach used.
Different users have different views regarding the type of distribution
that represents a single player's performance variation from game to
game. They have different views about how much a player's rating should
be adjusted in response to a single loss or win. They have different
views about the uncertainty associated with a single prediction. Their
versions of the formula therefore differ in respect to certain coefficients
used.
| Copyright © 2005 David Mills |
Contact Me |
How Ratings Work
A rating describes your absolute skill level. Properly defined, it bears no
relationship to the skill of the other people you happen to compete with.
If other people are also rated correctly, the rating makes it possible to
predict the outcome of your matches against those other people. Given two
ratings, it is easy to mathematically derive the probability of a win for
either party. That calculation is not the place where most ratings systems
tend to break down.
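One common form of that calculation is sketched below. It assumes a logistic curve with a 400-point scale, as in many chess implementations; the formula used by any particular site may differ, and `win_probability` is a hypothetical name.

```python
def win_probability(rating: float, opponent: float,
                    scale: float = 400.0) -> float:
    """Probability that `rating` beats `opponent` (assumed logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / scale))

# Win probability for a player rated some number of points above a 1500 opponent.
for gap in (0, 100, 200, 400):
    p = win_probability(1500 + gap, 1500)
    print(f"{gap:>3} points ahead: {p:.2f}")
```

Only the difference between the two ratings enters the calculation, and the two players' probabilities always add up to one.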
If you play a number of games with another properly rated person, and neither
of you gains or loses skill during the match, you should, on average, come out
at the far end with the same ratings you went in with. Whether one of you is
stronger than the other will only affect the win/loss ratio.
If your rating indicates that you should win two games out of three with a
particular opponent, then you will be awarded half as much for a win as you
are penalized for a loss. If you get 10 points for a win, you will be penalized
20 points for a loss. Your two wins will give you 20 points, your loss will
cost you 20 points. There will be no net change.
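The arithmetic above matches the common update rule "coefficient times (result minus expected result)". The sketch below assumes that rule; the coefficient `k` is hypothetical and is set to 30 only so that the numbers match the 10-point win in the example.

```python
def rating_change(expected: float, won: bool, k: float) -> float:
    """Points gained (positive) or lost (negative) for one game.

    Assumes the update rule k * (result - expected), where result is
    1 for a win and 0 for a loss.
    """
    result = 1.0 if won else 0.0
    return k * (result - expected)

expected = 2.0 / 3.0   # rated to win two games out of three
k = 30.0               # hypothetical coefficient

win = rating_change(expected, won=True, k=k)
loss = rating_change(expected, won=False, k=k)
net = 2 * win + loss   # two wins and one loss, as predicted

print(round(win, 6), round(loss, 6), round(net, 6))
```

A win is worth about 10 points, a loss costs about 20, and the predicted two-wins-one-loss result leaves the rating where it started (up to floating-point rounding).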
If your skill level has increased prior to (or during) the match, you will
win more than two out of three. If that is the case, your rating will be
increased at the end of the match. How much it will (and should) increase
depends upon several factors. It is in applying those factors that most ratings
systems fail to work entirely correctly.
Obviously, no one always plays a game that perfectly reflects their skill
level. Variations in concentration and perception will affect the outcome
to some degree. In games which include an element of chance, in addition
to pure skill, the outcome of any given game is subject to a degree of uncertainty
larger than that associated with a mere lack of attention and application.
One of the difficulties in determining appropriate rating changes lies in
dealing with this uncertainty. Clearly, a larger number of games will have
less uncertainty, overall, than a single game. Chance ("luck") will tend
to cancel out, as will any mental lack of application due to other factors.
If you play one game, the reward or penalty should be less than the reward
or penalty for each game of a longer series. Any rating system that fails
to account for match length cannot be accurate. In an environment that doesn't
fix match lengths in advance (Freeverse pitch, for instance), the difficulty
of correctly adjusting ratings increases.
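How quickly the uncertainty shrinks with series length can be illustrated with a small calculation. It treats every game as an independent win or loss with a fixed win probability, which is a simplifying assumption, and `win_fraction_spread` is a hypothetical name.

```python
import math

def win_fraction_spread(p: float, games: int) -> float:
    """Standard deviation of the observed win fraction over `games` games,
    treating each game as an independent win/loss with win probability p."""
    return math.sqrt(p * (1.0 - p) / games)

# Two evenly matched players: how far the observed win fraction
# typically strays from the true 50%.
for games in (1, 4, 16, 64):
    print(f"{games:>2} games: +/- {win_fraction_spread(0.5, games):.3f}")
```

Under this assumption the spread falls with the square root of the number of games, so a series four times as long is only twice as certain.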
Freeverse Pitch Ratings
In comparison to a number of other sites with other games, Freeverse has very
few pitch players. It is somewhat difficult to evaluate the ratings system
given the sparseness of the data. Approximately 350 players are represented
in the published figures on the ratings page. Because people may have the
same rating as someone else (same skill level), there are only about 175
distinct ratings. Freeverse also does not have player history data in the
same way that Yahoo does -- data that shows a player's games, the
rating of the player and the opponent at the time of the game, and the outcome
of the game.
Elo's rating system presumed player ratings would assume a normal (Gaussian)
distribution. Chess implementations of the formula have been tailored slightly
to accommodate what they call a "logistic" distribution. The difference
between the two is primarily at the extremes.
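To see where the two curves part ways, the sketch below compares the win probability each one predicts for a given rating gap. The coefficients are assumptions borrowed from common chess practice (a 400-point logistic scale, and a normal curve with a spread of 200 points per player), not values taken from this page.

```python
import math

def logistic_win_probability(gap: float) -> float:
    """Win probability for a player `gap` points ahead, logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

def normal_win_probability(gap: float,
                           sigma: float = 200.0 * math.sqrt(2.0)) -> float:
    """Same quantity under a normal curve.

    sigma is the spread of the difference between two performances,
    each assumed to vary by 200 points.
    """
    return 0.5 * (1.0 + math.erf(gap / (sigma * math.sqrt(2.0))))

for gap in (100, 200, 400, 800):
    normal = normal_win_probability(gap)
    logistic = logistic_win_probability(gap)
    print(f"gap {gap:>3}: normal {normal:.4f}  logistic {logistic:.4f}")
```

With these coefficients the two curves agree closely for small gaps, while at an 800-point gap the logistic curve gives the underdog a noticeably better chance than the normal curve does.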
A Gaussian distribution has a large number of values near the mean (average,
center of the curve). As one moves away from the average, fewer values are
found in equal ranges about the center. The result is a bell-shaped curve,
a dome that curls outward near the extremes.
This figure shows a randomly generated set of values with a Gaussian distribution. A
"perfect" Gaussian bell-curve is superposed for visual reference. Note that
the majority of the values occur in the middle third of the curve, not at
the extremes.
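A set of values like the one in the figure can be generated and checked in a few lines. This is a sketch only: the mean of 1500, the spread of 200, and the fixed seed are hypothetical choices, and the count of 350 simply mirrors the number of published ratings.

```python
import random

def fraction_in_middle_third(values, low, high):
    """Share of `values` lying in the middle third of the range [low, high]."""
    third = (high - low) / 3.0
    inside = [v for v in values if low + third <= v <= high - third]
    return len(inside) / len(values)

rng = random.Random(2005)        # fixed seed so the run is repeatable
mean, spread = 1500.0, 200.0     # hypothetical rating scale
ratings = [rng.gauss(mean, spread) for _ in range(350)]

# Take the curve to run from three spreads below the mean to three above.
share = fraction_in_middle_third(ratings, mean - 3 * spread, mean + 3 * spread)
print(f"{share:.0%} of the values fall in the middle third")
```

For a Gaussian distribution the middle third of that range should hold roughly two thirds of the values.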
The second figure, on the right, is the distribution of the published Freeverse pitch
ratings. There is a rather obvious "hole" in the center, rather than a majority.
If you have felt that you tend to slide from one extreme of the ranks to
the other after a relatively brief streak of good or bad luck, guess what?
You are looking at a depiction of a system that is improperly adjusted and
is, in fact, unstable, or nearly so.
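The effect of an oversized adjustment can be sketched with a short calculation. The Freeverse formula is not known, so this is not a reconstruction of it: the logistic curve, the 400-point scale, the coefficient values, and the function names are all assumptions chosen for illustration.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Win probability under an assumed logistic curve, 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))

def swing_after_streak(k: float, wins: int, start: float = 1500.0,
                       opponent: float = 1500.0) -> float:
    """Rating points gained from `wins` straight wins over one opponent,
    using the update rule k * (result - expected)."""
    rating = start
    for _ in range(wins):
        rating += k * (1.0 - expected_score(rating, opponent))
    return rating - start

for k in (16, 32, 64, 128):
    moved = swing_after_streak(k, 5)
    print(f"k={k:>3}: five straight wins move the rating {moved:.0f} points")
```

The larger the coefficient, the further a short lucky streak carries a player, which is the kind of sliding between extremes described above.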
I have contacted the support people in an effort to get the actual formula
implemented for the site. They have promised to try to put me in touch with
the appropriately cognizant people. I have volunteered to help with either
additional analysis or programmatic adjustments. If you have
any suggestions or comments, please use the link below.
Contact Me |