Correlation Analysis
Correlation Coefficients: Pearson and Spearman Rank
Two primary statistics are used to measure correlation. Both operate on the same scale from -1 to 1: -1 means perfectly anti-correlated, 1 means perfectly correlated, and 0 means uncorrelated.
- Pearson correlation
- Spearman rank correlation
  - Counts the number of disordered pairs
  - NOT how well the data fits a line
  - Thus better with non-linear relationships and outliers
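To make the contrast concrete, here is a minimal pure-Python sketch. It assumes the standard definition of Spearman correlation as the Pearson correlation of the ranks, and the helper names (`pearson`, `ranks`, `spearman`) and the toy data are illustrative, not taken from any library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n. Assumes no tied values, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y_cubic = [v ** 3 for v in x]     # monotone but strongly non-linear
y_outlier = x[:-1] + [1000]       # linear except for one extreme point

r_cubic, rho_cubic = pearson(x, y_cubic), spearman(x, y_cubic)
r_out, rho_out = pearson(x, y_outlier), spearman(x, y_outlier)

print(f"cubic:   Pearson {r_cubic:.3f}, Spearman {rho_cubic:.3f}")
print(f"outlier: Pearson {r_out:.3f}, Spearman {rho_out:.3f}")
```

In both cases the ordering of the points is perfectly preserved, so the rank-based statistic stays at 1 while Pearson drops.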
The Power and Significance of Correlation
Strength of correlation $r^2$
The square of the sample correlation coefficient estimates the fraction of the variance in Y explained by X in a simple linear regression.
The predictive value of a correlation falls off quadratically with r: a correlation of r = 0.5 explains only 25% of the variance.
Variance Reduction and $r^2$
If there is a good linear fit f(x), then the residuals y - f(x) will have lower variance than y. For the least-squares line, the residual variance is $(1 - r^2)$ times the variance of y.
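The link between $r^2$ and variance reduction can be checked numerically. The sketch below fits an ordinary least-squares line by hand and compares $r^2$ with the fraction of variance removed from y. The data values are made up for illustration.

```python
def variance(values):
    """Population variance (divide by n)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Made-up data, roughly y = 2x with small deviations.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# Least-squares line f(x) = slope * x + intercept.
slope = sxy / sxx
intercept = my - slope * mx
r_squared = sxy ** 2 / (sxx * syy)

residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
explained = 1 - variance(residuals) / variance(y)

print(f"r^2 = {r_squared:.6f}, variance explained = {explained:.6f}")
```

The two printed numbers agree up to floating-point error, which is the sense in which $r^2$ is the fraction of variance explained.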
Statistical significance
The statistical significance of a correlation depends upon its sample size n as well as r.
Even small correlations become significant (at the 0.05 level) with large-enough sample sizes.
This motivates:
- big data
- multiple-parameter models: each single correlation may explain or predict only small effects, but large numbers of weak but independent correlations may together have strong predictive power.
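The dependence of significance on sample size can be sketched with the Fisher z-transformation. This is a large-sample normal approximation chosen here for brevity, not an exact test, and the function name `approx_p_value` is hypothetical.

```python
from math import atanh, erfc, sqrt

def approx_p_value(r, n):
    """Approximate two-sided p-value for the null hypothesis of no correlation.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated as
    a standard normal variable. This is an approximation for large n.
    """
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))

# The same weak correlation, r = 0.05, at two sample sizes.
p_small_n = approx_p_value(0.05, 100)
p_large_n = approx_p_value(0.05, 10_000)

print(f"n = 100:    p = {p_small_n:.3f}")
print(f"n = 10000:  p = {p_large_n:.2e}")
```

Under this approximation the correlation is far from significant at n = 100 but clears the 0.05 level easily at n = 10,000, even though it explains only 0.25% of the variance in both cases.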
Correlation Does Not Imply Causation
At best, the implication works only one way. But many observed correlations are completely spurious, with neither variable having any real impact on the other.
Generally speaking, few statistical tools are available to tease out whether A really causes B. We can conduct controlled experiments, if we can manipulate one of the variables and watch the effect on the other.
Detecting Periodicities by Autocorrelation
Generally speaking, the autocorrelation function for many quantities tends to be highest for very short lags. This is why long-term predictions are less accurate than short-term forecasts: the autocorrelations are generally much weaker. But periodic cycles do sometimes stretch much longer.
Time-series data often exhibits cycles which affect its interpretation
- A cycle of length k can be identified by unexpectedly large autocorrelation between S[t] and S[t+k] for all 0 < t < n - k
- Computing the lag-k autocorrelation for a single k takes O(n) time. Doing this separately for every lag takes O(n^2), but the full set can be computed in O(n log n) via the fast Fourier transform (FFT).
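As an illustration of cycle detection, the sketch below computes the lag-k autocorrelation directly, at O(n) per lag, on a synthetic noise-free sine wave and picks the lag with the largest value. The series, the period of 12, and the range of lags searched are all assumptions made for the example. The FFT route is not shown.

```python
from math import pi, sin, sqrt

def autocorrelation(series, k):
    """Pearson correlation between S[t] and S[t+k], computed in O(n)."""
    a, b = series[:-k], series[k:]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sqrt(sum((u - ma) ** 2 for u in a))
    sb = sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

# Synthetic series with a known cycle length of 12 samples.
period = 12
series = [sin(2 * pi * t / period) for t in range(240)]

# Search lags shorter than two periods so that only one full cycle is in range.
acf = {k: autocorrelation(series, k) for k in range(1, 19)}
best_lag = max(acf, key=acf.get)

print("lag with the largest autocorrelation:", best_lag)
```

On real data the peaks are less clean, and multiples of the true period also score highly, so the lag range and the threshold for "unexpectedly large" need some care.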
Logarithms
Logarithms and Multiplying Probabilities
- Logarithms were first invented as an aid to computation, by reducing the problem of multiplication to that of addition.
- Multiplying long chains of probabilities yields very small numbers that govern the chances of very rare events. There are serious numerical stability problems with floating-point multiplication on real computers.
- Summing logs of probabilities is more numerically stable than multiplying them
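A small sketch of the stability problem, using an artificial chain of 1000 probabilities of 0.01 each. The true product is 10^-2000, which is far below the smallest positive double-precision value, so the running product underflows to zero while the sum of logs remains an ordinary number.

```python
import math

probabilities = [0.01] * 1000   # artificial chain of rare events

# Direct multiplication underflows to exactly 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Summing logarithms keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in probabilities)

print("product:", product)
print("sum of logs:", log_sum)
```

Two such chains can still be compared through their log sums, whereas both direct products would be indistinguishable zeros.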
Logarithms and Ratios
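- Always plot logarithms of ratios: on a log scale, a doubling (ratio 2) and a halving (ratio 1/2) lie at equal distances from zero, on opposite sides.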
Logarithms and Normalizing Skewed Distributions
- Hitting a skewed data distribution with a log often yields a more bell-shaped distribution
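The effect of the log transform on skew can be sketched with synthetic data. The example assumes log-normally distributed values, the idealized case where the log is exactly bell-shaped, and uses a hand-written sample skewness. The seed and sample size are arbitrary.

```python
import math
import random

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric sample)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum(((v - m) / sd) ** 3 for v in values) / n

random.seed(0)  # arbitrary seed, only for repeatability
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_logged:.2f}")
```

Real data is rarely exactly log-normal, so the transformed distribution is usually only closer to symmetric rather than perfectly so.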
Calculating correlation coefficients in Python
Using scipy.stats.pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
Parameters:
- x : (N,) array_like. Input.
- y : (N,) array_like. Input.
Returns:
- (Pearson’s correlation coefficient, 2-tailed p-value)
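A minimal usage sketch, assuming SciPy is installed. The data values are made up. The result is unpacked as a pair, matching the return value described above.

```python
from scipy.stats import pearsonr

# Made-up data, roughly y = 2x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

r, p_value = pearsonr(x, y)

print(f"Pearson r = {r:.4f}, p-value = {p_value:.2e}")
```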
Using scipy.stats.spearmanr(a, b = None, axis = 0)
Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.
The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
Parameters:
- a, b : 1D or 2D array_like, b is optional. One or two 1-D or 2-D arrays containing multiple variables and observations. Each column of a and b represents a variable, and each row entry a single observation of those variables. See also axis. Both arrays need to have the same length in the axis dimension.
- axis : int or None, optional. If axis=0 (default), then each column represents a variable, with observations in the rows. If axis=1, the relationship is transposed: each row represents a variable, while the columns contain observations. If axis=None, then both arrays will be raveled.
Returns:
- rho : float or ndarray (2-D square). Spearman correlation matrix or correlation coefficient (if only 2 variables are given as parameters). The correlation matrix is square with length equal to the total number of variables (columns or rows) in a and b combined.
- p-value : float. The two-sided p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated; it has the same dimension as rho.
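A short usage sketch, assuming SciPy and NumPy are installed. It shows the two-variable form, the matrix form for a 2-D array, and the effect of the axis argument. The data values are invented for the example.

```python
import numpy as np
from scipy.stats import spearmanr

# Two variables: a monotone but non-linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]
rho, p_value = spearmanr(x, y)

# One 2-D array: each column is a variable (axis=0, the default).
data = np.array([
    [1, 10, 3],
    [2, 20, 1],
    [3, 30, 4],
    [4, 40, 2],
    [5, 50, 5],
])
rho_matrix, p_matrix = spearmanr(data)

# Same data with variables in rows instead: pass axis=1.
rho_rows, _ = spearmanr(data.T, axis=1)

print("rho for (x, x^3):", rho)
print("correlation matrix shape:", rho_matrix.shape)
```

With three variables the result is a 3 x 3 matrix, and transposing the input while switching to axis=1 gives the same matrix.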
Using pandas Series.corr
clean_data['dist'].corr(clean_data['fare_amount'])
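A self-contained version of the call above, assuming pandas is installed. The column names `dist` and `fare_amount` come from the line above, but the five rows here are invented stand-ins for the real `clean_data`, which is not shown in these notes.

```python
import pandas as pd

# Invented stand-in for clean_data; only the column names match the notes.
clean_data = pd.DataFrame({
    "dist": [1.0, 2.5, 3.0, 5.5, 8.0],
    "fare_amount": [5.0, 9.5, 10.0, 18.0, 26.5],
})

# Series.corr uses the Pearson correlation by default.
r = clean_data["dist"].corr(clean_data["fare_amount"])

# DataFrame.corr gives all pairwise correlations at once.
corr_matrix = clean_data.corr()

print(f"Pearson r between dist and fare_amount: {r:.4f}")
print(corr_matrix)
```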