Correlation

Correlation Analysis

Correlation Coefficients: Pearson and Spearman Rank

Pearson and Spearman rank correlation are the two primary statistics used to measure correlation. Both operate on the same scale from -1 to 1, where -1 means fully anti-correlated, 1 means fully correlated, and 0 means uncorrelated.

  • Pearson correlation
    • Measures how well the data fits a line (the strength of the linear relationship)
    • Sensitive to outliers, since large deviations dominate the sums
  • Spearman rank correlation
    • Counts the number of disordered pairs: it correlates the ranks of the values rather than the values themselves
    • NOT how well the data fits a line
    • Thus better with non-linear (but monotone) relationships and outliers
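The contrast above can be seen on a small example. This is a minimal sketch using `scipy.stats`: the cubic relationship below is perfectly monotone but not linear, so Spearman reports a perfect correlation while Pearson falls short of 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# A perfectly monotone but non-linear relationship: y = x**3.
x = np.arange(1.0, 21.0)
y = x ** 3

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)

# Spearman sees a perfect monotone relationship (rho = 1.0);
# Pearson is high but below 1 because the points do not fit a line.
print(r_pearson, r_spearman)
```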

The Power and Significance of Correlation

  • Strength of correlation $r^2$

    The square of the sample correlation coefficient estimates the fraction of the variance in Y explained by X in a simple linear regression.

    • The predictive value of a correlation scales with $r^2$, so it falls off quadratically as $|r|$ shrinks

    • Variance Reduction and $r^2$

      If there is a good linear fit f(x), then the residuals y - f(x) will have lower variance than y: the residual variance is $(1 - r^2)$ times the variance of y
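This variance-reduction identity can be checked numerically. A minimal sketch on synthetic data (a linear signal plus Gaussian noise) showing that the residuals of the least-squares fit have exactly $(1 - r^2)$ times the variance of y:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)  # linear signal plus noise

# Least-squares line f(x) = a*x + b
a, b = np.polyfit(x, y, 1)
residuals = y - (a * x + b)

r = np.corrcoef(x, y)[0, 1]
# Residual variance equals (1 - r^2) * Var(y): the fraction r^2 of
# the variance in y is explained away by the linear fit.
print(np.var(residuals), (1 - r ** 2) * np.var(y))
```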

  • Statistical significance

    • The statistical significance of a correlation depends upon its sample size n as well as r.

    • Even small correlations become significant (at the 0.05 level) with large-enough sample sizes.

      • This motivates big data multiple parameter models:

        Each single correlation may explain/predict only small effects, but large numbers of weak but independent correlations may together have strong predictive power.

Correlation Does Not Imply Causation

  • At best, the implication works only one way. But many observed correlations are completely spurious, with neither variable having any real impact on the other.

  • Generally speaking, few statistical tools are available to tease out whether A really causes B. The main recourse is a controlled experiment: manipulate one of the variables directly and watch the effect on the other.

Detecting Periodicities by Autocorrelation

  • Generally speaking, the autocorrelation function for many quantities tends to be highest for very short lags. This is why long-term predictions are less accurate than short-term forecasts: the autocorrelations are generally much weaker. But periodic cycles do sometimes stretch much longer.

  • Time-series data often exhibits cycles which affect its interpretation

  • A cycle of length k reveals itself as an unexpectedly large lag-k autocorrelation, i.e. high correlation between S[t] and S[t+k] over the positions 0 < t < n - k
  • Computing a single lag-k autocorrelation takes O(n) time, but the full set of all n lags can be computed in O(n log n) total via the fast Fourier transform (FFT).
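The FFT route works because the autocorrelation is the inverse transform of the power spectrum (the Wiener-Khinchin theorem). A minimal sketch on a synthetic noisy series with a planted cycle of length 24 (the series and cycle length are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 24
t = np.arange(n)
# A noisy series with a hidden cycle of length k.
s = np.sin(2 * np.pi * t / k) + 0.5 * rng.normal(size=n)

def autocorr(s):
    """All lag autocorrelations in O(n log n) via the FFT."""
    s = s - s.mean()
    # Zero-pad to 2n to avoid circular wrap-around.
    f = np.fft.rfft(s, 2 * len(s))
    acf = np.fft.irfft(f * np.conj(f))[:len(s)]
    return acf / acf[0]  # normalize so lag 0 has autocorrelation 1

acf = autocorr(s)
# The lag-k autocorrelation stands out, revealing the cycle;
# the half-period lag is strongly negative.
print(acf[k], acf[k // 2])
```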

Logarithms

Logarithms and Multiplying Probabilities

  • Logarithms were first invented as an aid to computation, reducing the problem of multiplication to that of addition.
  • Multiplying long chains of probabilities yields the very small numbers that govern the chances of rare events, and floating-point multiplication on real computers suffers serious numerical stability problems (underflow) at such scales.
  • Summing the logs of probabilities is far more numerically stable than multiplying the probabilities themselves.
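The underflow problem is easy to trigger. A minimal sketch: multiplying 2000 probabilities of 0.01 should give $10^{-4000}$, far below the smallest representable double, so the direct product collapses to zero while the log-sum remains exact.

```python
import math

# 2000 independent events, each with probability 0.01.
probs = [0.01] * 2000

p = 1.0
for q in probs:
    p *= q
# The true product is 10**-4000, far below the smallest positive
# double (~5e-324), so the running product underflows to 0.0.
print(p)

# Summing logs stays stable: log10 of the product is -4000.
log_p = sum(math.log10(q) for q in probs)
print(log_p)
```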

Logarithms and Ratios

  • Always plot the logarithms of ratios: raw ratios are asymmetric (doubling maps to 2 but halving to 0.5), while log-ratios treat gains and losses symmetrically about zero

Logarithms and Normalizing Skewed Distributions

  • Hitting a skewed data distribution with a log often yields a more bell-shaped distribution
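This normalizing effect can be demonstrated on lognormal data, which is right-skewed by construction. A minimal sketch measuring the (moment-based) skewness before and after the log transform:

```python
import numpy as np

rng = np.random.default_rng(3)
# Lognormal data: heavily right-skewed by construction.
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def skewness(a):
    """Standardized third moment: 0 for a symmetric distribution."""
    a = a - a.mean()
    return (a ** 3).mean() / a.std() ** 3

# Taking logs recovers a symmetric, bell-shaped distribution,
# so the skewness drops from large and positive to near zero.
print(skewness(data), skewness(np.log(data)))
```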

Calculating Correlation Coefficients in Python

Using scipy.stats.pearsonr(x, y)

Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

Parameters:
x : (N,) array_like — input array
y : (N,) array_like — input array
Returns: (Pearson’s correlation coefficient, 2-tailed p-value)

Using scipy.stats.spearmanr(a, b=None, axis=0)

Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.

The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

Parameters: a, b : 1D or 2D array_like, b is optional
One or two 1-D or 2-D arrays containing multiple variables and observations. Each column of a and b represents a variable, and each row entry a single observation of those variables. See also axis. Both arrays need to have the same length in the axis dimension.
axis : int or None, optional
If axis=0 (default), then each column represents a variable, with observations in the rows. If axis=1, the relationship is transposed: each row represents a variable, while the columns contain observations. If axis=None, then both arrays will be raveled.
Returns: rho : float or ndarray (2-D square)
Spearman correlation matrix or correlation coefficient (if only 2 variables are given as parameters). The correlation matrix is square with length equal to the total number of variables (columns or rows) in a and b combined.
p-value : float — The two-sided p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated; has the same dimension as rho.
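The matrix behavior described above is worth seeing in action. A minimal sketch passing a single 2-D array of three variables (random data for illustration): with more than two variables, `spearmanr` returns a square correlation matrix rather than a single coefficient.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
a = rng.normal(size=(100, 3))  # 100 observations (rows) of 3 variables (columns)

rho, p = spearmanr(a)          # axis=0 (default): columns are variables
# With 3 variables, rho is a 3x3 matrix with 1.0 on the diagonal.
print(rho.shape)
print(np.diag(rho))
```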

Using pandas.corr

clean_data['dist'].corr(clean_data['fare_amount'])
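The one-liner above assumes an existing DataFrame `clean_data`, which is not shown here. A minimal runnable sketch with a toy stand-in (the dist/fare numbers below are invented for illustration); note that `Series.corr` computes Pearson by default and accepts `method="spearman"` for rank correlation.

```python
import pandas as pd

# Toy stand-in for the clean_data DataFrame (values invented).
clean_data = pd.DataFrame({
    "dist": [1.0, 2.5, 4.0, 5.5, 8.0],
    "fare_amount": [4.5, 7.0, 11.0, 14.5, 20.0],
})

# Pearson by default; pass method="spearman" for rank correlation.
r = clean_data["dist"].corr(clean_data["fare_amount"])
print(r)
```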

Reference

Lecture 5: Correlation

The Data Science Design Manual Chapter 2.3, 2.4