Scores and Rankings

Scoring functions are measures that reduce multi-dimensional data to a single value, to highlight some particular property.

Rankings order items, usually by sorting scores.

Assigning Grades

Course grades get assigned by scoring functions.

Observe that grading systems have:

  • Degrees of arbitrariness: each teacher differs.
  • Lack of validation data: there is no right grade.
  • General robustness: students tend to get similar grades in all their classes anyway.

Calling scores statistics lends them more dignity.

Scoring vs. Regression

The critical issue in designing scoring functions is that there is no gold standard/right answer.

Machine learning techniques like linear regression could learn a scoring function from features if you had training data, which you generally don't.

Google’s ranking algorithms train on click data.
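As a minimal sketch of what such training would look like if labeled data existed, here is a least-squares fit of a linear scoring function; the features and gold-standard scores below are made up purely for illustration:

```python
import numpy as np

# Hypothetical training data: each row describes an item by two features,
# and y holds gold-standard scores for those items (exactly the kind of
# labels you usually don't have when designing a score).
X = np.array([[3.0, 1.0], [2.0, 4.0], [5.0, 2.0], [4.0, 5.0]])
y = np.array([10.0, 14.0, 15.0, 21.0])

# Fit a linear scoring function score(x) = w . x + b by least squares.
A = np.hstack([X, np.ones((len(X), 1))])   # extra column for the intercept b
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

# Score a new, unseen item with the learned function.
new_item = np.array([3.0, 3.0, 1.0])       # two features plus the intercept term
print(coeffs, new_item @ coeffs)
```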

The Body-Mass Index (BMI)

BMI is a score designed to capture whether your weight is under control:

$\text{BMI} = \frac{\text{mass}}{\text{height}^2}$

where mass is in kilograms and height in meters.

  • Underweight: below 18.5
  • Normal: 18.5 to 25
  • Overweight: 25 to 30
  • Obese: over 30
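A minimal sketch of the score and its categories, with hypothetical helper names:

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body-Mass Index: mass (kg) divided by height (m) squared."""
    return mass_kg / height_m ** 2

def bmi_category(score: float) -> str:
    """Map a BMI value to the standard categories listed above."""
    if score < 18.5:
        return "underweight"
    if score < 25:
        return "normal"
    if score < 30:
        return "overweight"
    return "obese"

print(bmi_category(bmi(70, 1.75)))   # 70 kg at 1.75 m -> BMI ~22.9 -> "normal"
```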

BMI: Pro Basketball and US Football

BMI is easy to interpret and correlates with body fat. It implicitly assumes mass scales with the square of height, though arguably it should scale with the cube.

Gold Standards and Proxies

Gold standards are labels or answers we trust to be correct, reflecting the scoring goal.

Proxies are available quantities correlated with what we want to measure.

Your GPA or SAT/GRE is a proxy for how you should do in my class.

Scoring vs. Machine Learning

When you have a gold standard, you can train a regression function to predict it accurately.

When all you have are proxies, all you can do is evaluate your scoring function.

‘Weapons of Math Destruction’ happen when you confuse proxies for gold standards: e.g. student test scores for teaching quality.

Scores vs. Rankings

Which is more interpretable depends on:

  • Will the numbers be presented in isolation?

    Stony Brook ranks 111th of 351 teams, with RPI = 39.18.

  • What is the distribution of scores?

    How much better is #1 than #2?

  • Do you care about the middle or extremes?

    Small changes in score can cause big rank differences.

Recognizing Good Scoring Functions

  • Easily computable
  • Easily understandable
  • Monotonic interpretation of variables
  • Produces satisfying results on outliers
  • Uses systematically normalized variables

Normalization and Z-scores

It is critical to normalize different variables to make their ranges and distributions comparable.

Z-scores are computed as $Z_i = \frac{X_i - \bar X}{\sigma}$, where $\bar X$ is the mean and $\sigma$ the standard deviation of X.

The Z-scores of heights measured in inches are identical to the Z-scores of the same heights measured in miles.

Your biggest analysis sins will come from using unnormalized variables.

Z-score Examples

Z-scores have mean 0 and standard deviation $\sigma = 1$.

Thus Z-scores of different variables are of comparable magnitude.

The sign identifies whether the value lies above or below the mean.
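A quick sketch of the normalization, and of the fact that Z-scores are unit-free (inches and miles differ only by a constant factor):

```python
import numpy as np

def z_scores(x):
    """Normalize a variable to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

heights_in = np.array([60.0, 65.0, 70.0, 75.0, 80.0])   # heights in inches
heights_mi = heights_in / 63360.0                        # the same heights in miles

print(z_scores(heights_in))                              # approx [-1.41 -0.71 0. 0.71 1.41]
print(np.allclose(z_scores(heights_in), z_scores(heights_mi)))   # True
```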

Advanced Ranking Techniques

Linear combinations of normalized values generally yield reasonable scores, but other techniques include:

  • Elo rankings
  • Merging rank orderings
  • Directed graph orderings
  • PageRank

Binary Comparisons

Rankings are often formed by analyzing a series of binary comparisons:

  • Team A beats team B
  • Expert votes for A instead of B
  • Student chooses university A over B

Vote counts fail to pick the best when different teams face different levels of competition.

Elo Rankings

All players start with equal scores, which are then adjusted to reflect the surprise of each match.

$r'(A) = r(A) + k(S_A - \mu_A)$

$S_A$ is the actual score for A (+1 for a win, -1 for a loss), and $\mu_A$ is the expected score computed from the previous values of r(A) and r(B).

Parameter k modulates the maximum possible swing in any one match.

What is the Expected Match Score?

If P(A > B) estimates the probability that A beats B, then:

$\mu_A = 1\times P_{A>B} + (-1)\times (1 - P_{A>B}) = 2P_{A>B} - 1$

If the ranking system is meaningful, this probability should be a function of the difference between the scores r(A) and r(B).

The Logit Function

We need a function f(x) that maps the score difference x = r(A) - r(B) to a probability:

  • f(0) = 1/2
  • f(∞) = 1
  • f(-∞) = 0

The function $f(x) = \frac{1}{1 + e^{-cx}}$ satisfies all three conditions, with the constant c controlling how quickly a rating advantage turns into near-certain victory.
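As a minimal sketch, the following combines a logistic estimate of P(A > B) with the update rule above, using this slide's (-1, +1) scoring convention; the constants k and c are tuning parameters, and the exact numbers here are illustrative (classical chess Elo uses a base-10 curve with c = 400):

```python
import math

def prob_a_beats_b(r_a, r_b, c=400.0):
    """Logistic estimate of P(A beats B) from the score difference."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b) / c))

def elo_update(r_a, r_b, a_won, k=32.0, c=400.0):
    """One update r'(A) = r(A) + k*(S_A - mu_A) for player A."""
    p = prob_a_beats_b(r_a, r_b, c)
    s_a = 1.0 if a_won else -1.0      # actual score S_A in {-1, +1}
    mu_a = 2.0 * p - 1.0              # expected score: (+1)*p + (-1)*(1 - p)
    return r_a + k * (s_a - mu_a)

# An upset (lower-rated A beats higher-rated B) produces a large upward swing.
print(elo_update(1400, 1600, a_won=True))   # about 1439.8
```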

Merging Rankings/Votes

Consider determining the winner of a multiparty election where each voter ranks the candidates in order of preference.

  1. Stony Brook
  2. MIT
  3. Illinois

Equivalently, consider merging rankings independently drawn on different features.

Borda’s Method

By assigning an increasing score to each position, we can rank the items by their total points:

     Voter 1   Voter 2   Voter 3   Voter 4
  1|    A         B         A         A
  2|    C         A         B         B
  3|    B         C         C         D
  4|    D         D         E         C
  5|    E         E         D         E

Totals: A: 5, B: 8, C: 12, D: 16, E: 19, so A (the lowest total) wins.

Four voters, each ranking five items.
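A short sketch of Borda's method on these four ballots, assigning each item a score equal to its position (1 = best) so that the lowest total wins:

```python
from collections import defaultdict

# The four ballots above, each listing the five items from best to worst.
ballots = [
    ["A", "C", "B", "D", "E"],
    ["B", "A", "C", "D", "E"],
    ["A", "B", "C", "E", "D"],
    ["A", "B", "D", "C", "E"],
]

totals = defaultdict(int)
for ballot in ballots:
    for position, item in enumerate(ballot, start=1):
        totals[item] += position          # weight = position; smaller is better

print(sorted(totals.items(), key=lambda kv: kv[1]))
# [('A', 5), ('B', 8), ('C', 12), ('D', 16), ('E', 19)]
```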

Weights for Borda’s Method

Linear position weights make sense when we have equal confidence across all positions.

But we presumably trust our distinctions among the best/worst more than the middle elements, suggesting normally distributed weights.

Directed Graph Orderings

Treating the vote (A > B) as an edge (A, B) yields a directed graph.

If there are no inconsistencies, we get a directed acyclic graph (DAG).

Topologically sorting this DAG gives a reasonable order, like GABCDEF or AGBCDEF.

edge(A, B)
edge(B, C)
edge(B, D)
edge(B, E)
edge(C, E)
edge(D, E)
edge(G, D)
edge(E, F)
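A small sketch that topologically sorts this edge list using Kahn's algorithm (not specified on the slide, but a standard choice), printing one valid order:

```python
from collections import defaultdict, deque

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"),
         ("C", "E"), ("D", "E"), ("G", "D"), ("E", "F")]

out_edges, in_degree = defaultdict(list), defaultdict(int)
nodes = {v for e in edges for v in e}
for a, b in edges:
    out_edges[a].append(b)
    in_degree[b] += 1

# Kahn's algorithm: repeatedly emit a vertex with no remaining incoming edges.
queue = deque(sorted(v for v in nodes if in_degree[v] == 0))
order = []
while queue:
    v = queue.popleft()
    order.append(v)
    for w in out_edges[v]:
        in_degree[w] -= 1
        if in_degree[w] == 0:
            queue.append(w)

print("".join(order))    # "AGBCDEF", one valid topological order
```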

Ranking General Digraphs

For general directed graphs, we seek the order minimizing the number of “wrong way” edges.

Cutting the minimum number of edges to leave a DAG is NP-complete.

But reasonable heuristics start by sorting vertices by the difference between their out- and in-degrees.
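A rough sketch of that heuristic on a made-up digraph with one inconsistency: order vertices by out-degree minus in-degree, then count the edges that point the wrong way:

```python
from collections import defaultdict

def degree_heuristic_order(edges):
    """Order vertices by (out-degree - in-degree), largest first."""
    score = defaultdict(int)
    for a, b in edges:            # edge (a, b) means a is preferred to b
        score[a] += 1
        score[b] -= 1
    return sorted(score, key=lambda v: -score[v])

def wrong_way_edges(order, edges):
    """Count edges that point from a later vertex back to an earlier one."""
    position = {v: i for i, v in enumerate(order)}
    return sum(1 for a, b in edges if position[a] > position[b])

# A small digraph with one inconsistency: C beats A despite losing overall.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
order = degree_heuristic_order(edges)
print(order, wrong_way_edges(order, edges))   # ['A', 'B', 'C'] 1
```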

Arrow’s Impossibility Theorem

There is no ranking system that satisfies all of these desirable properties:

  • The system should be complete: given A and B it must pick one or say equal preference.
  • The system must be transitive, meaning that if A > B and B > C then A > C.
  • If every voter prefers A to B, then A wins over B.
  • The group preference cannot simply mirror the preferences of a single voter (no dictator).
  • The preference of A to B should be independent of preferences for all other candidates.
For example, consider three voters ranking three candidates (1 = most preferred):

Voter   Red   Green   Blue
  X      1      2       3
  Y      2      3       1
  Z      3      1       2

Pairwise, Red beats Green, Green beats Blue, and Blue beats Red: the majority preference is cyclic, violating transitivity.
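A quick sketch that tallies the pairwise contests from this table and exposes the cycle:

```python
from itertools import combinations

# Each voter's ranking position for each candidate (1 = most preferred).
prefs = {
    "X": {"Red": 1, "Green": 2, "Blue": 3},
    "Y": {"Red": 2, "Green": 3, "Blue": 1},
    "Z": {"Red": 3, "Green": 1, "Blue": 2},
}

for a, b in combinations(["Red", "Green", "Blue"], 2):
    votes_for_a = sum(1 for ranks in prefs.values() if ranks[a] < ranks[b])
    votes_for_b = len(prefs) - votes_for_a
    if votes_for_a > votes_for_b:
        print(f"{a} beats {b} {votes_for_a}-{votes_for_b}")
    else:
        print(f"{b} beats {a} {votes_for_b}-{votes_for_a}")
# The three lines reveal the cycle: Red > Green, Green > Blue, Blue > Red.
```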

Ranking Example: Who’s Bigger?

Analyzed Wikipedia to extract measures of historical significance: PageRank, length, hits…

  • Mapped values to normal distributions.
  • Used a linear combination (factor analysis).
  • Corrected for time by decaying the scores of modern figures.
  • Computed separate scores for celebrity and gravitas.

Who’s Biggest?

The top 20 most significant historical figures were identified among over 800,000 people in the English Wikipedia.

The Decline of the Great Scientist

The historical significance of Nobel Prize-winning scientists has been declining over time, but not that of the literature/peace laureates.

What Can You Learn from Rankings?

  • Women are underrepresented in Wikipedia.
  • Halls of Fame/textbooks do not always pick the strongest figures.
  • Certain fields (e.g. poetry) are not producing historically significant figures as they did in the past.