3 Statistics Foundations
To properly understand multivariate models, an analyst needs to have a decent grasp of foundational statistics. Many of the assumptions and results of multivariate models require an understanding of these foundations in order to be properly interpreted. There are three topics that are particularly important for those proceeding further in this book:
 Descriptive statistics of populations and samples
 Distribution of random variables
 Hypothesis testing
If you have never really studied these topics, I would strongly recommend taking a course in them and spending good time getting to know them. Again, just as the last chapter was not intended to be a comprehensive tutorial on R, neither is this chapter intended to be a comprehensive tutorial on introductory statistics. However, we will introduce some key concepts here that are critical to understanding later chapters, and as always we will illustrate using real data examples.
In preparation for this chapter we are going to download a data set that we will work through in a later chapter, and use it for practical examples and illustration purposes. The data are a set of information on the sales, customer ratings and performance ratings on a set of 351 salespeople as well as an indication of whether or not they were promoted.
# if needed, use online url to download salespeople data
url < "http://peopleanalyticsregressionbook.org/data/salespeople.csv"
salespeople < read.csv(url)
Let’s take a brief look at the first few rows of this data to make sure we know what is inside it.
head(salespeople)
## promoted sales customer_rate performance
## 1 0 594 3.94 2
## 2 0 446 4.06 3
## 3 1 674 3.83 4
## 4 0 525 3.62 2
## 5 1 657 4.40 3
## 6 1 918 4.54 2
And let’s understand the structure of this data.
str(salespeople)
## 'data.frame': 351 obs. of 4 variables:
## $ promoted : int 0 0 1 0 1 1 0 0 0 0 ...
## $ sales : int 594 446 674 525 657 918 318 364 342 387 ...
## $ customer_rate: num 3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
## $ performance : int 2 3 4 2 3 2 3 1 3 3 ...
It looks like:

promoted
is a binary value, either 1 or 0, indicating ‘promoted’ or ‘not promoted’ respectively 
sales
andcustomer_rate
look like normal numerical values 
performance
looks like a set of performance categories—there appear to be four based on what we can see
3.1 Elementary descriptive statistics of populations and samples
Any collection of numerical data on one or more variables can be described using a number of common statistical concepts. Let \(x = x_1, x_2, \dots, x_n\) be a sample of \(n\) observations of a variable drawn from a population.
3.1.1 Mean, variance and standard deviation
The mean is the average value of the observations and is defined by adding up all the values and dividing by the number of observations. The mean \(\bar{x}\) of our sample \(x\) is defined as:
\[
\bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_i
\]
While the mean of a sample \(x\) is denoted by \(\bar{x}\), the mean of an entire population is usually denoted by \(\mu\). The mean can have a different interpretation depending on the type of data being studied. Let’s look at the mean of three different columns of our salespeople
data, making sure to ignore any missing data.
mean(salespeople$sales, na.rm = TRUE)
## [1] 527.0057
This looks very intuitive and appears to be the average amount of sales made by the individuals in the data set.
mean(salespeople$promoted, na.rm = TRUE)
## [1] 0.3219373
Given that this data can only have the value of 0 or 1, we interpret this mean as likelihood or expectation that an individual will be labeled as 1. That is, the average probability of promotion in the data set. If this data showed a perfectly random likelihood of promotion, we would expect this to take the value of 0.5. But it is lower than 0.5, which tells us that the majority of individuals are not promoted.
mean(salespeople$performance, na.rm = TRUE)
## [1] 2.5
Given that this data can only have the values 1, 2, 3 or 4, we interpret this as the expected value of the performance rating in the data set. Higher or lower means inform us about the distribution of the performance ratings. A low mean will indicate a skew towards a low rating, and a high mean will indicate a skew towards a high rating.
Other common statistical summary measures include the median, which is the middle value when the values are ranked in order, and the mode, which is the most frequently occurring value.
The variance is a measure of how much the data varies around its mean. There are two different definitions of variance. The population variance assumes that that we are working with the entire population and is defined as the average squared difference from the mean:
\[ \mathrm{Var}_p(x) = \frac{1}{n}\sum_{i = 1}^{n}(x_i  \bar{x})^2 \] The sample variance assumes that we are working with a sample and attempts to estimate the variance of a larger population by applying Bessel’s Correction to account for potential sampling error. The sample variance is:
\[ \mathrm{Var}_s(x) = \frac{1}{n1}\sum_{i = 1}^{n}(x_i  \bar{x})^2 \]
You can see that
\[ \mathrm{Var}_p(x) = \frac{n  1}{n}\mathrm{Var}_s(x) \] So as the data set gets larger, the sample variance and the population variance become less and less distinguishable, which intuitively makes sense.
Because we rarely work with full populations, the sample variance is calculated by default in R and in many other statistical software packages.
# sample variance
(sample_variance_sales < var(salespeople$sales, na.rm = TRUE))
## [1] 34308.11
So where necessary, we need to apply a transformation to get the population variance.
# population variance (need length of nonNA data)
n < length(na.omit(salespeople$sales))
(population_variance_sales < ((n1)/n) * sample_variance_sales)
## [1] 34210.09
Variance does not have intuitive scale relative to the data being studied, because we have used a ‘squared distance metric,’ therefore we can squareroot it to get a measure of ‘deviance’ on the same scale as the data. We call this the standard deviation \(\sigma(x)\), where \(\mathrm{Var}(x) = \sigma(x)^2\). As with variance, standard deviation has both population and sample versions, and the sample version is calculated by default. Conversion between the two takes the form
\[ \sigma_p(x) = \sqrt{\frac{n1}{n}}\sigma_s(x) \]
# sample standard deviation
(sample_sd_sales < sd(salespeople$sales, na.rm = TRUE))
## [1] 185.2245
# verify that sample sd is sqrt(sample var)
sample_sd_sales == sqrt(sample_variance_sales)
## [1] TRUE
# calculate population standard deviation
(population_sd_sales < sqrt((n1)/n) * sample_sd_sales)
## [1] 184.9597
Given the range of sales is [151, 945] and the mean is 527, we see that the standard deviation gives a more intuitive sense of the ‘spread’ of the data relative to its inherent scale.
3.1.2 Covariance and correlation
The covariance between two variables is a measure of the extent to which one changes as the other changes. If \(y = y_1, y_2, \dots, y_n\) is a second variable, and \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\) respectively, then the sample covariance of \(x\) and \(y\) is defined as
\[ \mathrm{cov}_s(x, y) = \frac{1}{n  1}\sum_{i = 1}^{n}(x_i  \bar{x})(y_i  \bar{y}) \] and as with variance, the population covariance is
\[ \mathrm{cov}_p(x, y) = \frac{n1}{n}\mathrm{cov}_s(x, y) \]
Again, the sample covariance is the default in R and we need to transform to obtain the population covariance.
# get sample covariance for sales and customer_rate,
# ignoring observations with missing data
(sample_cov < cov(salespeople$sales, salespeople$customer_rate,
use = "complete.obs"))
## [1] 55.81769
# convert to population covariance (need number of complete obs)
cols < subset(salespeople, select = c("sales", "customer_rate"))
n < nrow(cols[complete.cases(cols), ])
(population_cov < ((n1)/n) * sample_cov)
## [1] 55.65821
As can be seen, the difference in covariance is very small between the sample and population versions, and both confirm a positive relationship between sales and customer rating. However, we again see this issue that there is no intuitive sense of scale for this measure.
Pearson’s correlation coefficient divides the covariance by the product of the standard deviations of the two variables:
\[ r_{x, y} = \frac{\mathrm{cov}(x, y)}{\sigma(x)\sigma(y)} \] This creates a scale of 1 to 1 for \(r_{x, y}\) which is an intuitive way of understanding both the direction and strength of the relationship between \(x\) and \(y\), with 1 indicating that \(x\) increases perfectly as \(y\) decreases, 1 indicating that \(x\) increases perfectly as \(y\) increases, and 0 indicating that there is no relationship between the two.
As before, there is a sample and population version of the correlation coefficient and R calculates the sample version by default. Similar transformations can be used to determine a population correlation coefficient and over large samples the two measures converge.
# calculate sample correlation between sales and customer_rate
cor(salespeople$sales, salespeople$customer_rate, use = "complete.obs")
## [1] 0.337805
This tells us that there is a moderate positive correlation between sales and customer rating.
You will notice that we have so far used two variables on a continuous scale to demonstrate covariance and correlation. Pearson’s correlation can also be used between a continuous scale and a dichotomous (binary) scale variable, and this is known as a pointbiserial correlation.
cor(salespeople$sales, salespeople$promoted, use = "complete.obs")
## [1] 0.8511283
Correlating ranked variables involves an adjusted approach leading to Spearman’s rho (\(\rho\)) or Kendall’s tau (\(\tau\)), among others. We will not dive into the mathematics of this here, but a good source is P. K. Bhattacharya (2016). Spearman’s or Kendall’s variant should be used whenever at least one of the variables is a ranked variable, and both variants are available in R.
# spearman's rho correlation
cor(salespeople$sales, salespeople$performance,
method = "spearman", use = "complete.obs")
## [1] 0.2735446
# kendall's tau correlation
cor(salespeople$sales, salespeople$performance,
method = "kendall", use = "complete.obs")
## [1] 0.2073609
In this case, both indicate a low to moderate correlation. Spearman’s rho or Kendall’s tau can also be used to correlate a ranked and a dichotomous variable, and this is known as a rankbiserial correlation.
3.2 Distribution of random variables
As we outlined in Section 1.2, when we build a model we are using a set of sample data to infer a general relationship on a larger population. A major underlying assumption in our inference is that we believe the real life variables we are dealing with are random in nature. For example, we might be trying to model the drivers of the voting choice of millions of people in a national election, but we may only have sample data on a few thousand people. When we infer nationwide voting intentions from our sample, we assume that the characteristics of the voting population are random variables.
3.2.1 Sampling of random variables
When we describe variables as random, we are assuming that they take a form which is independent and identically distributed. Using our salespeople
data as an example, we are assuming that the sales of one person in the data set is not influenced by the sales of another person in the data set. In this case, this seems like a reasonable assumption, and we will be making it for many (though not all) of the statistical methods used in this book. However, it is good to recognize that there are scenarios where this assumption cannot be made. For example, if the salespeople worked together in serving the same customers on the same products, and each individual’s sales represented some proportion of the overall sales to the customer, we cannot say that the sales data is independent and identically distributed. In this case we will expect to see some hierarchy in our data and will need to adjust our techniques accordingly to take this into consideration.
Under the central limit theorem, if we take samples from a random variable and calculate a summary statistic for each sample, that statistic is itself a random variable, and its mean converges to the true population statistic with more and more sampling. Let’s test this with a little experiment on our salespeople
data. Figure 3.1 shows the results of taking 10, 100 and 1000 different random samples of 50, 100 and 150 salespeople from the salespeople
data set and creating a histogram of the resulting mean sales values. We can see how greater numbers of samples (down the rows) leads to a more normal distribution curve and larger sample sizes (across the columns) leads to a ‘spikier’ distribution with a smaller standard deviation.
3.2.2 Standard errors, the \(t\)distribution and confidence intervals
One consequence of the observations in Figure 3.1 is that the summary statistics calculated from larger sample sizes fall into distributions that are ‘narrower’ and hence represent more precise estimations of the population statistic. The standard deviation of a sampled statistic is called the standard error of that statistic. In the special case of a sampled mean, the formula for the standard error of the mean can be derived to be
\[ SE = \frac{\sigma}{\sqrt{n}} \] where \(\sigma\) is the (sample) standard deviation and \(n\) is the sample size^{8}. This confirms that the standard error of the mean decreases with greater sample size, confirming our intuition that the estimation of the mean is more precise with larger samples.
To apply this logic to our salespeople
data set, let’s take a random sample of 100 values of customer_rate
.
# set seed for reproducibility of sampling
set.seed(123)
# generate a sample of 100 observations
custrate < na.omit(salespeople$customer_rate)
n < 100
sample_custrate < sample(custrate, n)
We can calculate the mean of the sample and the standard error of the mean.
# mean
(sample_mean < mean(sample_custrate))
## [1] 3.6485
## [1] 0.08494328
Because the normal distribution is a frequency (or probability) distribution, we can interpret the standard error as a fundamental unit of ‘sensitivity’ around the sample mean. For greater multiples of standard errors around the sample mean, we can have greater certainty that the range contains the true population mean.
To calculate how many standard errors we would need around the sample mean to have a 95% probability of including the true population mean, we need to use the \(t\)distribution. The \(t\)distribution is essentially an approximation of the normal distribution acknowledging that we only have a sample estimate of the true population standard deviation in how we calculate the standard error. In this case where we are dealing with a single sample mean, we use the \(t\)distribution with \(n1\) degrees of freedom. We can use the qt()
function in R to find the standard error multiple associated with the level of certainty we need. In this case we are looking for our true population mean to be outside the top 2.5% or bottom 2.5% of the distribution^{9}.
# get se multiple for 0.975
(t < qt(p = 0.975, df = n  1))
## [1] 1.984217
We see that approximately 1.98 standard errors either side of our sample mean will give us 95% confidence that our range contains the true population mean. This is called the 95% confidence interval^{10}.
# 95% confidence interval lower and upper bounds
lower_bound < sample_mean  t*se
upper_bound < sample_mean + t*se
cat(paste0('[', lower_bound, ', ', upper_bound, ']'))
## [3.47995410046738, 3.81704589953262]
3.3 Hypothesis testing
Observations about the distribution of statistics on samples of random variables allow us to construct tests for hypotheses of difference or similarity. Such hypothesis testing is useful in itself for simple bivariate analysis in practice settings, but it will be particularly critical in later chapters in determining whether models are useful or not. Before we go through some technical examples of hypothesis testing, let’s overview the logic and intuition for how hypothesis testing works.
The purpose of hypothesis testing is to establish a high degree of statistical certainty regarding a claim of difference in a population based on the properties of a sample. Consistent with a high burden of proof, we start from the hypothesis that there is no difference, called the null hypothesis. We only reject the null hypothesis if the statistical properties of the sample data render it very unlikely, in which case we confirm the alternative hypothesis that a statistical difference does exist in the population.
Most hypothesis tests can return a pvalue, which is the maximum probability of finding the sample results (or results that are more extreme or unusual than the sample results) when the null hypothesis is true for the population. The analyst must decide on the level of pvalue needed to reject the null hypothesis. This threshold is referred to as the significance level \(\alpha\) (alpha). A common standard is to set \(\alpha\) at 0.05. That is, we reject the null hypothesis if the pvalue that we find for our sample results is less than 0.05. If we reject the null hypothesis at \(\alpha = 0.05\), this means that the results we observe in the sample are so extreme or unusual that they would only occur by chance at most 1 in 20 times if the null hypothesis were true. An alpha of 0.05 is not the only standard of certainty used in research and practice, and in some fields of study smaller alphas are the norm, particularly if erroneous conclusions might have very serious consequences.
Three of the most common types of hypothesis tests are^{11}:
 Testing for a difference in the means of two groups
 Testing for a nonzero correlation between two variables
 Testing for a difference in frequency distributions between different categories
We will go through an example of each of these. In each case you will see a three step process. First, we calculate a test statistic. Second, we determine an expected distribution for that test statistic. Finally, we determine where our calculated statistic falls in that distribution in order to assess the likelihood that our null hypothesis is true. During these examples we will go through the steps of manually calculating all the steps needed to do the hypothesis testing, before we demonstrate the simple functions that perform all the steps for you in R. Readers don’t absolutely need to know all the details contained in this section, but a strong understanding of the underlying methods is encouraged.
3.3.1 Testing for a difference in means (Welch’s \(t\)test)
Imagine that we are asked if, in general, the sales of low performing salespeople are different from the sales of high performing salespeople. This question refers to all salespeople, but we only have data for the sample in our salespeople
data set. Let’s take two subsets of our data for those with a performance rating of 1 and those with a performance rating of 4, and calculate the difference in mean sales.
# take two performance group samples
perf1 < subset(salespeople, subset = performance == 1)
perf4 < subset(salespeople, subset = performance == 4)
# calculate the difference in mean sales
(diff < mean(perf4$sales)  mean(perf1$sales))
## [1] 154.9742
We can see that those with a higher performance rating in our sample did generate higher mean sales than those with a lower performance rating. But these are just samples, and we are being asked to give a conclusion about the populations they are drawn from.
Let’s take a null hypothesis that there is no difference in true mean sales between the two performance groups that these samples are drawn from. We combine the two samples and calculate the distribution around the difference in means. To reject the null hypothesis at \(\alpha = 0.05\), we would need to determine that the 95% confidence interval of this distribution does not contain zero.
We calculate the standard error of the combined sample using the formula^{12}:
\[ \sqrt{\frac{\sigma_{\mathrm{perf1}}^2}{n_{\mathrm{perf1}}} + \frac{\sigma_{\mathrm{perf4}}^2}{n_{\mathrm{perf4}}}} \] where \(\sigma_{\mathrm{perf1}}\) and \(\sigma_{\mathrm{perf4}}\) are the standard deviations of the two samples and \(n_{\mathrm{perf1}}\) and \(n_{\mathrm{perf4}}\) are the two sample sizes.
We use a special formula called the WelchSatterthwaite approximation to calculate the degrees of freedom for the two samples, which in this case calculates to 100.98^{13}. This allows us to construct a 95% confidence interval for the difference between the means, and we can test whether this contains zero.
# calculate standard error of the two sets
se < sqrt(sd(perf1$sales)^2/length(perf1$sales)
+ sd(perf4$sales)^2/length(perf4$sales))
# calculate the required tstatistic
t < qt(p = 0.975, df = 100.98)
# calculate 95% confidence interval
(lower_bound < diff  t*se)
## [1] 88.56763
(upper_bound < diff + t*se)
## [1] 221.3809
# test if zero is inside this interval
(0 <= upper_bound) & (0 >= lower_bound)
## [1] FALSE
Since this has returned FALSE
, we conclude that a mean difference of zero is outside the 95% confidence interval of our sample mean difference, and so we cannot have 95% certainty that the difference in population means is zero. We reject the null hypothesis that the mean sales of both performance levels are the same.
Looking at this graphically, we are assuming a \(t\)distribution of the mean difference, and we are determining where zero sits in that distribution, as in Figure 3.2.
The red dashed lines in this diagram represent the 95% confidence interval around the mean difference of our two samples. The ‘tails’ of the curve outside of these two lines each represent a maximum of 0.025 probability for the true population mean. So we can see that the position of the blue line can correspond to a maximum probability that the population mean difference is zero. This is the pvalue of the hypothesis test^{14}.
The pvalue can be derived by calculating the standard error multiple associated with zero in the \(t\)distribution (called the \(t\)statistic or \(t\)value), by applying the conversion function pt()
to obtain the upper tail probability and then multiplying by 2 to get the probability associated with both tails of the distribution.
# get tstatistic
t_actual < diff/se
# convert tstatistic to pvalue
2*pt(t_actual, df = 100.98, lower = FALSE)
## [1] 1.093212e05
Nowadays, it is never necessary to do these manual calculations ourselves because hypothesis tests are a standard part of statistical software. In R, the t.test()
function performs a hypothesis test of difference in means of two samples and confirms our manually calculated pvalue and 95% confidence interval.
t.test(perf4$sales, perf1$sales)
##
## Welch Two Sample ttest
##
## data: perf4$sales and perf1$sales
## t = 4.6295, df = 100.98, pvalue = 1.093e05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 88.5676 221.3809
## sample estimates:
## mean of x mean of y
## 619.8909 464.9167
Because our pvalue is less than our alpha of 0.05, we reject the null hypothesis in favor of the alternative hypothesis. The standard \(\alpha = 0.05\) is associated with the term statistically significant. Therefore we could say here that the two performance groups have a statistically significant difference in mean sales.
In practice, there are numerous alphas that are of interest to analysts, each reflecting different levels of certainty. While 0.05 is the most common standard in many disciplines, more stringent alphas of 0.01 and 0.001 are often used in situations where a high degree of certainty is desirable (for example some medical fields). Similarly, a less stringent alpha standard of 0.1 can be of interest particularly when sample sizes are small and the analyst is satisfied with ‘indications’ from the data. In many statistical software packages, including those that we will see in this book, tests that meet an \(\alpha = 0.1\) standard are usually marked with period(.
), those that meet \(\alpha = 0.05\) with an asterisk(*
), \(\alpha = 0.01\) a double asterisk(**
) and \(\alpha = 0.001\) a triple asterisk(***
).
Many leading statisticians have argued that pvalues are more a test of sample size than anything else and have cautioned against too much of a focus on pvalues in making statistical conclusions from data. In particular, situations where data and methodology have been deliberately manipulated to achieve certain alpha standards—a process known as ‘phacking’—has been of increasing concern recently. See Chapter 11 for a better understanding of how the significance level and the sample size contribute to determining statistical power in hypothesis testing.
3.3.2 Testing for a nonzero correlation between two variables (\(t\)test for correlation)
Imagine that we are given a sample of data for two variables and we are asked if the variables are correlated in the overall population. We can take a null hypothesis that the variables are not correlated, determine a tstatistic associated with a zero correlation and convert this to a pvalue. The tstatistic associated with a correlation \(r\) between two samples of length \(n\) is often notated \(t^*\) and is defined as
\[ t^* = \frac{r\sqrt{n2}}{\sqrt{1r^2}} \] \(t^*\) can be converted to an associated pvalue using a \(t\)distribution in a similar way to the previous section, this time with \(n  2\) degrees of freedom in our \(t\)distribution. As an example, let’s calculate \(t^*\) for the correlation between sales and customer rating in our sample and convert it to a pvalue.
# remove NAs from salespeople
salespeople < salespeople[complete.cases(salespeople), ]
# calculate t_star
r < cor(salespeople$sales, salespeople$customer_rate)
n < nrow(salespeople)
t_star < (r*sqrt(n  2))/sqrt(1  r^2)
# convert to pvalue on tdistribution with n  2 degrees of freedom
2*pt(t_star, df = n  2, lower = FALSE)
## [1] 8.647952e11
Again, there is a useful function in R to cut out the need for all our manual calculations. The cor.test()
function in R performs a hypothesis test on the null hypothesis that two variables have zero correlation.
cor.test(salespeople$sales, salespeople$customer_rate)
##
## Pearson's productmoment correlation
##
## data: salespeople$sales and salespeople$customer_rate
## t = 6.6952, df = 348, pvalue = 8.648e11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2415282 0.4274964
## sample estimates:
## cor
## 0.337805
This confirms our manual calculations and we see the null hypothesis has been rejected and we can conclude that there is a significant correlation between sales and customer rating.
3.3.3 Testing for a difference in frequency distribution between different categories in a data set (Chisquare test)
Imagine that we are asked if the performance category of each person in the salespeople
data set has a relationship with their promotion likelihood. We will test the null hypothesis that there is no difference in the distribution of promoted versus not promoted across the four performance categories.
First we can produce a contingency table which is matrix containing counts of how many people were promoted or not promoted in each category.
# create contingency table of promoted vs performance
(contingency < table(salespeople$promoted, salespeople$performance))
##
## 1 2 3 4
## 0 50 85 77 25
## 1 10 25 48 30
We can see by summing each row that for the total sample we can expect 113 people to be promoted and 237 to miss out on promotion. We can use this ratio to compute an expected proportion in each performance category under the assumption that the distribution was exactly the same across all four categories.
# calculate expected promoted and not promoted
(expected_promoted < (sum(contingency[2, ])/sum(contingency)) *
colSums(contingency))
## 1 2 3 4
## 19.37143 35.51429 40.35714 17.75714
## 1 2 3 4
## 40.62857 74.48571 84.64286 37.24286
Now we can compare our observed versus expected values using the difference metric:
\[ \frac{(\mathrm{observed}  \mathrm{expected})^2}{\mathrm{expected}} \] and add these all up to get a total, known as the \(\chi^2\) statistic.
# calculate the difference metrics for promoted and not promoted
promoted < sum((expected_promoted  contingency[2, ])^2/
expected_promoted)
notpromoted < sum((expected_notpromoted  contingency[1, ])^2/
expected_notpromoted)
# calculate chisquared statistic
(chi_sq_stat < notpromoted + promoted)
## [1] 25.89541
The \(\chi^2\) statistic has an expected distribution that can be used to determine the pvalue associated with this statistic. As with the \(t\)distribution, the \(\chi^2\)distribution depends on the degrees of freedom. This is calculated by subtracting one from the number of rows and from the number of columns in the contingency table and multiplying them together. In this case we have 2 rows and 4 columns which calculates to 3 degrees of freedom. Armed with our \(\chi^2\) statistic and our degrees of freedom, we can now calculate the pvalue for the hypothesis test using the pchisq()
function.
# calculate pvalue from chi_squared stat
pchisq(chi_sq_stat, df = 3, lower.tail=FALSE)
## [1] 1.003063e05
The chisq.test()
function in R performs all the steps involved in a chisquare test of independence on a contingency table and returns the \(\chi^2\) statistic and associated pvalue for the null hypothesis, in this case confirming our manual calculations.
chisq.test(contingency)
##
## Pearson's Chisquared test
##
## data: contingency
## Xsquared = 25.895, df = 3, pvalue = 1.003e05
Again, we can reject the null hypothesis and confirm the alternative hypothesis that there is a difference in the distribution of promoted/not promoted individuals between the four performance categories.
3.4 Foundational statistics in Python
Elementary descriptive statistics can be performed in Python using various packages. Descriptive statistics of numpy
arrays are usually available as methods.
import pandas as pd
import numpy as np
# get data
= "http://peopleanalyticsregressionbook.org/data/salespeople.csv"
url = pd.read_csv(url)
salespeople
# mean sales
= salespeople.sales.mean()
mean_sales print(mean_sales)
## 527.0057142857142
# sample variance
= salespeople.sales.var()
var_sales print(var_sales)
## 34308.11458043389
# sample standard deviation
= salespeople.sales.std()
sd_sales print(sd_sales)
## 185.2244977869663
Population statistics can be obtained by setting the ddof
parameter to zero.
# population standard deviation
= salespeople.sales.std(ddof = 0)
popsd_sales print(popsd_sales)
## 184.9597020864771
The numpy
covariance function produces a covariance matrix.
# generate a sample covariance matrix between two variables
= salespeople[['sales', 'customer_rate']]
sales_rate = sales_rate[~np.isnan(sales_rate)]
sales_rate = sales_rate.cov()
cov print(cov)
## sales customer_rate
## sales 34308.114580 55.817691
## customer_rate 55.817691 0.795820
Specific covariances between variable pairs can be pulled out of the matrix.
# pull out specific covariances
print(cov['sales']['customer_rate'])
## 55.817691199345006
Similarly for Pearson correlation.
# sample pearson correlation matrix
= sales_rate.corr()
cor print(cor)
## sales customer_rate
## sales 1.000000 0.337805
## customer_rate 0.337805 1.000000
Specific types of correlation coefficients can be accessed via the stats
module of the scipy
package.
from scipy import stats
# spearman's correlation
stats.spearmanr(salespeople.sales, salespeople.performance, ='omit') nan_policy
## SpearmanrResult(correlation=0.27354459847452534, pvalue=2.0065434379079837e07)
# kendall's tau
stats.kendalltau(salespeople.sales, salespeople.performance, ='omit') nan_policy
## KendalltauResult(correlation=0.20736088105812, pvalue=2.7353258226376615e07)
Common hypothesis testing tools are available in scipy.stats
. Here is an example of how to perform Welch’s \(t\)test on a difference in means of samples of unequal variance.
# get sales for top and bottom performers
= salespeople[salespeople.performance == 1].sales
perf1 = salespeople[salespeople.performance == 4].sales
perf4
# welch's ttest with unequal variance
= stats.ttest_ind(perf4, perf1, equal_var=False)
ttest print(ttest)
## Ttest_indResult(statistic=4.629477606844271, pvalue=1.0932443461577038e05)
As seen above, hypothesis tests for nonzero correlation coefficients are performed automatically as part of scipy.stats
correlation calculations.
# calculate correlation and pvalue
= salespeople.sales[~np.isnan(salespeople.sales)]
sales
= salespeople.customer_rate[
cust_rate ~np.isnan(salespeople.customer_rate)
]
= stats.pearsonr(sales, cust_rate)
cor print(cor)
## (0.33780504485867796, 8.647952212091035e11)
Finally, a chisquare test of difference in frequency distribution can be performed on a contingency table as follows. The first value of the output is the \(\chi^2\) statistic and the second value is the pvalue.
# create contingency table for promoted versus performance
= pd.crosstab(salespeople.promoted, salespeople.performance)
contingency
# perform chisquare test
= stats.chi2_contingency(contingency)
chi2_test print(chi2_test)
## (25.895405268094862, 1.0030629464566802e05, 3, array([[40.62857143, 74.48571429, 84.64285714, 37.24285714],
## [19.37142857, 35.51428571, 40.35714286, 17.75714286]]))
3.5 Learning exercises
3.5.1 Discussion questions
Where relevant in these discussion exercises, let \(x = x_1, x_2, \dots, x_n\) and \(y = y_1, y_2, \dots, y_m\) be samples of two random variables of length \(n\) and \(m\) respectively.
 If the values of \(x\) can only take the form 0 or 1, and if their mean is 0.25, how many of the values equal 0?
 If \(m = n\) and \(x + y\) is formed from the elementwise sum of \(x\) and \(y\), show that the mean of \(x + y\) is equal to the sum of the mean of \(x\) and the mean of \(y\).
 For a scalar multiplier \(a\), show that \(\mathrm{Var}(ax) = a^2\mathrm{Var}(x)\).
 Explain why the standard deviation of \(x\) is a more intuitive measure of the deviation in \(x\) than the variance.
 Describe which two types of correlation you could use if \(x\) is an ordered ranking.
 Describe the role of sample size and sampling frequency in the distribution of sampling means for a random variable.
 Describe what a standard error of a statistic is and how it can be used to determine a confidence interval for the true population statistic.
 If we conduct a ttest on the null hypothesis that \(x\) and \(y\) are drawn from populations with the same mean, describe what a pvalue of 0.01 means.
 Extension: The sum of variance law states that, for independent random variables \(x\) and \(y\), \(\mathrm{Var}(x \pm y) = \mathrm{Var}(x) + \mathrm{Var}(y)\). Use this together with the identity from Exercise 3 to derive the formula for the standard error of the mean of \(x = x_1, x_2, \dots, x_n\):
\[ SE = \frac{\sigma(x)}{\sqrt{n}} \]
 Extension: In a similar way to Exercise 9, show that the standard error for the difference between the means of \(x\) and \(y\) is
\[ \sqrt{\frac{\sigma(x)^2}{n} + \frac{\sigma(y)^2}{m}} \]
3.5.2 Data exercises
For these exercises, load the charity_donation
data set via the peopleanalyticsdata
package, or download it from the internet^{15}. This data set contains information on a sample of individuals who made donations to a nature charity.
 Calculate the mean
total_donations
from the data set.  Calculate the sample variance for
total_donation
and convert this to a population variance.  Calculate the sample standard deviation for
total_donations
and verify that it is the same as the square root of the sample variance.  Calculate the sample correlation between
total_donations
andtime_donating
. By using an appropriate hypothesis test, determine if these two variables are independent in the overall population.  Calculate the mean and the standard error of the mean for the first 20 entries of
total_donations
.  Calculate the mean and the standard error of the mean for the first 50 entries of
total_donations
. Verify that the standard error is less than in Exercise 5.  By using an appropriate hypothesis test, determine if the mean age of those who made a recent donation is different from those who did not.
 By using an appropriate hypothesis test, determine if there is a difference in whether or not a recent donation was made according to where people reside.
 Extension: By using an appropriate hypothesis test, determine if the age of those who have recently donated is at least 10 years older than those who have not recently donated in the population.
 Extension: By using an appropriate hypothesis test, determine if the average donation amount is at least ten dollars higher for those who recently donated versus those who did not. Retest for twenty dollars higher.