Scores and Rankings
Scoring functions are measures that reduce multi-dimensional data to a single value, highlighting some particular property.
Rankings order items, usually by sorting scores.
Assigning Grades
Course grades are assigned by scoring functions. Observe that grading systems have:
- Degrees of arbitrariness: each teacher differs.
- Lack of validation data: there is no right grade.
- General robustness: students tend to get similar grades in all their classes anyway.
Calling scores "statistics" lends them more dignity.
Scoring vs. Regression
The critical issue in designing scoring functions is that there is no gold standard/right answer.
Machine learning techniques like linear regression can learn a scoring function from features, but only if you have training data, which you generally don't.
Google’s ranking algorithms train on click data.
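If you did have gold-standard training data, fitting a scoring function would be routine. A minimal sketch, assuming scikit-learn and entirely made-up features and target scores:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training data: feature vectors (GPA, SAT) and
# gold-standard class scores -- invented for illustration only.
X = [[3.8, 1450], [3.1, 1210], [2.5, 1000], [3.5, 1300]]
y = [92, 81, 70, 88]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # the learned scoring weights
print(model.predict([[3.3, 1280]]))    # score a new student
```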
The Body-Mass Index (BMI)
BMI is a score designed to capture whether your weight is under control:
$\text{BMI} = \frac{\text{mass}}{\text{height}^2}$
where mass is in kilograms and height is in meters.
- Underweight: below 18.5
- Normal: 18.5 to 25
- Overweight: 25 to 30
- Obese: over 30
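A minimal sketch of the score and the cutoffs above (the example values are made up):

```python
def bmi(mass_kg, height_m):
    """Body-Mass Index: mass (kg) divided by the square of height (m)."""
    return mass_kg / height_m ** 2

def bmi_category(score):
    """Map a BMI value onto the standard categories listed above."""
    if score < 18.5:
        return "Underweight"
    if score < 25:
        return "Normal"
    if score < 30:
        return "Overweight"
    return "Obese"

print(bmi_category(bmi(80, 1.80)))   # 80 / 1.8**2 ~= 24.7 -> "Normal"
```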
BMI: Pro Basketball and US Football
BMI is easy to interpret, and correlates with body fat. But should mass scale with the square of height, or the cube? Very tall basketball players and heavily muscled football players expose the limits of the score.
Gold Standards and Proxies
Gold standards are labels or answers we trust to be correct, reflecting the scoring goal.
Proxies are available quantities correlated with what we want to measure.
Your GPA or SAT/GRE score is a proxy for how you should do in my class.
Scoring vs. Machine Learning
When you have a gold standard, you can train a regression function to predict accurately.
When all you have are proxies, all you can do is evaluate your scoring function.
‘Weapons of Math Destruction’ happen when you confuse proxies for gold standards: e.g. student test scores for teaching quality.
Scores vs. Rankings
Which is more interpretable depends on:
- Will the numbers be presented in isolation? (e.g., "Stony Brook ranks 111th of 351 teams" vs. "RPI = 39.18")
- What is the distribution of scores? How much better is #1 than #2?
- Do you care about the middle or the extremes? Small changes in score can cause big differences in rank.
Recognizing Good Scoring Functions
- Easily computable
- Easily understandable
- Monotonic interpretation of variables
- Produces satisfying results on outliers
- Uses systematically normalized variables
Normalization and Z-scores
It is critical to normalize different variables to make their range/distribution comparable.
Z-scores are computed: $Z_i = \frac{X_i - \bar X}{\sigma}$
Z-scores of height measured in inches are identical to Z-scores of height measured in miles.
Your biggest analysis sins will come from using unnormalized variables.
Z-score Examples
Z-scores have mean $0$ and standard deviation $\sigma = 1$.
Thus Z-scores of different variables are of comparable magnitude.
The sign identifies if it is above/below the mean.
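A minimal sketch showing that Z-scores are unit-free, so inches and miles normalize to identical values (the heights are invented for illustration):

```python
import statistics

def z_scores(values):
    """Normalize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(x - mean) / sigma for x in values]

heights_inches = [60, 66, 72, 78]
heights_miles = [h / 63360 for h in heights_inches]   # same data, rescaled

print(z_scores(heights_inches))   # [-1.342..., -0.447..., 0.447..., 1.342...]
print(z_scores(heights_miles))    # identical, up to floating-point error
```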
Advanced Ranking Techniques
Linear combinations of normalized values generally yield reasonable scores, but other techniques include:
- Elo rankings
- Merging rank orderings
- Directed graph orderings
- PageRank
Binary Comparisons
Rankings are often formed by analyzing a series of binary comparisons:
- Team A beats team B
- Expert votes for A instead of B
- A student chooses university A over B
Vote counts fail to pick the best when different teams face different levels of competition.
Elo Rankings
After starting equally ranked, scores are then adjusted to reflect the surprise of each match.
$r'(A) = r(A) + k(S_A - \mu_A)$
where $S_A \in \{-1, +1\}$ is A's actual match score, and $\mu_A$ is the expected score computed from the prior ratings $r(A)$ and $r(B)$.
Parameter k modulates the maximum possible swing in any one match.
What is the Expected Match Score?
If P(A > B) estimates the probability that A beats B, then:
$\mu_A = 1 \times P_{A>B} + (-1) \times (1 - P_{A>B}) = 2P_{A>B} - 1$
If the ranking system is meaningful, this probability should be a function of the difference between the scores $r(A)$ and $r(B)$.
The Logit Function
We need a function f(x) that maps the score difference x to a probability:
- $f(0) = 1/2$
- $f(\infty) = 1$
- $f(-\infty) = 0$
The logistic function $f(x) = \frac{1}{1 + e^{-cx}}$ satisfies all three; the constant $c$ controls how strongly score differences translate into win probability.
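Putting the Elo update and the logistic expected score together, a minimal sketch (the parameter defaults c = 0.01 and k = 32 are illustrative assumptions, not canonical choices):

```python
import math

def expected_score(r_a, r_b, c=0.01):
    """Expected score mu_A in [-1, 1]: logistic probability that A
    beats B as a function of the rating difference. The scale c is
    an assumed value controlling the steepness."""
    p_a_wins = 1.0 / (1.0 + math.exp(-c * (r_a - r_b)))
    return 2.0 * p_a_wins - 1.0          # mu_A = 2 * P(A>B) - 1

def elo_update(r_a, r_b, s_a, k=32):
    """Adjust A's rating by the surprise of the outcome.
    s_a = +1 if A won, -1 if A lost; k caps the per-match swing."""
    return r_a + k * (s_a - expected_score(r_a, r_b))

# An upset (low-rated A beats high-rated B) moves the rating a lot:
print(elo_update(1400, 1600, s_a=+1))    # big gain for the underdog
print(elo_update(1600, 1400, s_a=+1))    # small gain for the favorite
```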
Merging Rankings/Votes
Consider determining the winner of a multiparty election where each voter ranks the candidates in order of preference.
- Stony Brook
- MIT
- Illinois
- …
Equivalently, consider merging rankings independently drawn on different features.
Borda’s Method
By assigning an increasing score per position, the resulting point total ranks the items:
[Example table: four voters, each ranking five items.]
Weights for Borda’s Method
Linear position weights make sense when we have equal confidence across all positions.
But we presumably trust our distinctions among the best and worst more than among the middle elements, suggesting normally distributed weights.
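A sketch of Borda's method with a pluggable position-weight function; the ballots are invented, and the Gaussian scheme is one way to realize the end-heavy weights just described:

```python
from statistics import NormalDist

def borda(ballots, weight=None):
    """Borda's method: each ballot awards points by position.
    ballots: preference orders, best first. weight(pos, m) gives the
    points for position pos (0 = best) among m items; the default is
    the classic linear scheme m - 1 - pos."""
    if weight is None:
        weight = lambda pos, m: m - 1 - pos
    scores = {}
    for ballot in ballots:
        m = len(ballot)
        for pos, item in enumerate(ballot):
            scores[item] = scores.get(item, 0) + weight(pos, m)
    return sorted(scores, key=scores.get, reverse=True)

# Gaussian position weights: big gaps between the extreme positions,
# small gaps in the muddled middle.
gaussian = lambda pos, m: NormalDist().inv_cdf(1 - (pos + 0.5) / m)

ballots = [list("ABCDE"), list("BACDE"), list("ACBDE"), list("ABCED")]
print(borda(ballots))            # linear weights
print(borda(ballots, gaussian))  # end-weighted scheme
```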
Directed Graph Orderings
Treating the vote (A>B) as an edge (A, B) yields a directed graph.
If there are no inconsistencies, we get a directed acyclic graph (DAG).
Topologically sorting this DAG gives an order consistent with every comparison; several valid orders may exist.
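A sketch of this approach using Kahn's topological-sort algorithm on hypothetical (winner, loser) pairs:

```python
from collections import defaultdict, deque

def topological_order(beats):
    """Order items so every winner precedes every item it beat.
    beats: list of (winner, loser) pairs forming a DAG; returns None
    if the comparisons contain a cycle."""
    succ = defaultdict(list)
    in_deg = defaultdict(int)
    items = set()
    for a, b in beats:
        succ[a].append(b)
        in_deg[b] += 1
        items.update((a, b))
    queue = deque(v for v in items if in_deg[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            in_deg[w] -= 1
            if in_deg[w] == 0:
                queue.append(w)
    return order if len(order) == len(items) else None   # None: cycle

print(topological_order([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]))
# -> ['A', 'B', 'C', 'D']
```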
Ranking General Digraphs
For general directed graphs, we seek the order minimizing the number of “wrong way” edges.
Cutting the minimum number of edges to leave a DAG is NP-complete.
But reasonable heuristics start by sorting vertices by the difference between their out- and in-degrees.
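A minimal version of that heuristic: rank items by out-degree minus in-degree (wins minus losses). It is only a starting point, not a minimizer of wrong-way edges:

```python
from collections import defaultdict

def degree_order(beats):
    """Rank items by wins minus losses (out-degree minus in-degree),
    descending. Works on any digraph, cycles included."""
    margin = defaultdict(int)
    for winner, loser in beats:
        margin[winner] += 1
        margin[loser] -= 1
    return sorted(margin, key=margin.get, reverse=True)

# Even with a cycle A>B>C>A, the heuristic produces some total order:
print(degree_order([("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]))
```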
Arrow’s Impossibility Theorem
There is no ranking system that satisfies all of the following desirable properties:
- The system should be complete: given A and B it must pick one or say equal preference.
- The system must be transitive, meaning that if A > B and B > C then A > C.
- If every voter prefers A to B, then A wins over B.
- Non-dictatorship: the outcome cannot simply mirror the preferences of a single voter.
- The preference of A to B should be independent of preferences for all other candidates.
For example, suppose three voters rank three candidates (1 = most preferred):

Voter | Red | Green | Blue
---|---|---|---
X | 1 | 2 | 3 |
Y | 2 | 3 | 1 |
Z | 3 | 1 | 2 |
Majority comparisons are cyclic: Red beats Green, Green beats Blue, yet Blue beats Red (each 2–1), so no transitive ranking exists.
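A short script re-creating the table and tallying each pairwise contest confirms the cycle:

```python
from itertools import combinations

ballots = {                      # 1 = most preferred, as in the table
    "X": {"Red": 1, "Green": 2, "Blue": 3},
    "Y": {"Red": 2, "Green": 3, "Blue": 1},
    "Z": {"Red": 3, "Green": 1, "Blue": 2},
}

for a, b in combinations(["Red", "Green", "Blue"], 2):
    a_wins = sum(prefs[a] < prefs[b] for prefs in ballots.values())
    b_wins = len(ballots) - a_wins
    winner = a if a_wins > b_wins else b
    print(f"{a} vs {b}: {winner} wins {max(a_wins, b_wins)}-{min(a_wins, b_wins)}")
# Red vs Green: Red wins 2-1
# Red vs Blue: Blue wins 2-1
# Green vs Blue: Green wins 2-1   -> a cycle, so no transitive order
```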
Ranking Example: Who’s Bigger?
We analyzed Wikipedia to extract measures of historical significance: PageRank, article length, page hits, …
- Mapped each measure onto a normal distribution.
- Combined them by a linear combination (factor analysis).
- Corrected for time by decaying the scores of modern figures.
- Computed separate scores for celebrity and gravitas.
Who’s Biggest?
The top 20 most significant historical figures were identified among more than 800,000 in the English Wikipedia.
The Decline of the Great Scientist
The significance of Nobel Prize–winning scientists has been declining over time, but not that of the literature or peace laureates.
What Can You Learn from Rankings?
- Women are underrepresented in Wikipedia.
- Halls of Fame/textbooks do not always pick the strongest figures.
- Certain fields (e.g., poetry) no longer produce figures as historically significant as they did in the past.