Preliminaries
Ideas and methods that will repeatedly appear:
- Bonferroni’s principle
- Normalization (TF.IDF)
- Power Laws
- Hash Functions
- IO Bounded (Secondary Storage)
- Unstructured Data
- Parallelism
- Functional Programming
Bonferroni’s Principle (Doubt)
If we are willing to view as an interesting feature of data something of which many instances can be expected to exist in random data, then we cannot rely on such features being significant. This observation limits our ability to mine data for features that are not sufficiently rare in practice.
Roughly, the probability that at least one of n findings appears purely by chance grows about n times as fast as the probability for a single finding. In brief, one can only look for so many patterns in the data before one finds something just by chance (i.e., something that does not generalize).
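One way to make the rough "n times" statement precise is the union bound: if each of the n patterns we test for would show up in purely random data with probability at most p, then
$$
P(\text{at least one spurious finding}) \le n \cdot p
$$
so the more patterns we test, the more spurious findings we should expect.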
Examples of How to Apply Bonferroni’s Principle
Another Example of Bonferroni’s Principle (from the book Mining of Massive Datasets)
Size of the problem and its assumptions (a worked estimate follows the list):
- There are one billion people who might be evil-doers.
- Everyone goes to a hotel one day in 100.
- A hotel holds 100 people. Hence, there are 100,000 hotels - enough to hold the 1% of a billion people who visit a hotel on any given day.
- We shall examine hotel records for 1000 days.
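Following the book's back-of-the-envelope arithmetic (using the same approximations it does, e.g. C(n, 2) ≈ n²/2), here is a small sketch of how many pairs of people look "suspicious", i.e. are seen at the same hotel on two different days, even though everyone chooses hotels at random:

```python
# Expected number of coincidental "suspicious" pairs under purely random
# behavior, using the assumptions listed above (Mining of Massive Datasets).

people = 10**9          # potential evil-doers
hotels = 10**5          # hotels, each holding 100 people
days = 1000             # days of hotel records examined
p_visit = 0.01          # chance a given person is at some hotel on a given day

# Two given people are at the same hotel on one given day:
p_same_hotel_one_day = p_visit * p_visit / hotels       # 1e-9
# ...and again on a second given day:
p_same_hotel_two_days = p_same_hotel_one_day ** 2       # 1e-18

pairs_of_people = people * people / 2                   # C(people, 2) ~ 5e17
pairs_of_days = days * days / 2                         # C(days, 2)   ~ 5e5

expected = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print(expected)  # ~250,000 pairs flagged even if there are no evil-doers at all
```

If the number of genuinely suspicious pairs is tiny compared with 250,000, then almost everything the system flags will be a coincidence, which is exactly Bonferroni's warning.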
Statistical Limits on Data Mining
Total Information Awareness (TIA)
This program was intended to mine all the data it could find in order to track terrorist activity.
What are the advantages and disadvantages of TIA?
Pros:
- May help catch terrorists
Cons:
- Raises serious privacy concerns
- May misjudge: flags innocent people purely by chance (Bonferroni’s principle)
How to avoid treating random occurrences as if they were real?
Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.
Normalization (TF.IDF)
What is TF.IDF and why do we need to use it?
TF.IDF stands for Term Frequency times Inverse Document Frequency. It is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
How to calculate it?
Term frequency
$$
TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \text{ where } f_{ij} \text{ is the frequency of term (word) } i \text{ in document } j
$$
The most frequent term in document j gets a TF of 1, and other terms get fractions as their term frequency for this document.
Inverse document frequency
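Inverse document frequency measures how rare a term is across the whole corpus. Using the standard definition from the same book: if term i appears in n_i of the N documents, then
$$
IDF_i = \log_2\left(\frac{N}{n_i}\right)
$$
and the TF.IDF score of term i in document j is the product
$$
TF.IDF_{ij} = TF_{ij} \times IDF_i
$$
Terms that appear in almost every document get an IDF near 0, so they receive low scores even when they are frequent.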
When to use it?
It is often used as a weighting factor in information retrieval searches, text mining, and user modeling.
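To make the two formulas concrete, here is a minimal Python sketch (the toy corpus and the tf_idf helper are illustrative, not from the slides):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF.IDF scores for a list of tokenized documents.

    TF_ij = f_ij / max_k f_kj   (normalized term frequency in document j)
    IDF_i = log2(N / n_i)       (term i appears in n_i of the N documents)
    score = TF_ij * IDF_i
    """
    n_docs = len(docs)
    doc_freq = Counter()            # n_i for every term
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        scores.append({
            term: (count / max_count) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# "data" occurs in every document, so its IDF (and score) is 0;
# rarer, document-specific terms get the highest scores.
corpus = [
    ["data", "mining", "data", "hotels"],
    ["data", "tfidf", "weighting"],
    ["data", "mining", "power", "laws"],
]
for i, doc_scores in enumerate(tf_idf(corpus)):
    print(i, sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```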