1 The Importance of Regression in People Analytics
In the 19th century, when Francis Galton first used the term ‘regression’ to describe a statistical phenomenon (see Chapter 4), little did he know how important that term would be today. Many of the most powerful tools of statistical inference that we now have at our disposal can be traced back to the types of early analysis that Galton and his contemporaries were engaged in. The sheer number of different regression-related methodologies and variants that are available to researchers and practitioners today is mind-boggling, and there are still rich veins of ongoing research that are focused on defining and refining new forms of regression to tackle new problems.
Neither could Galton have imagined the advent of the age of data we now live in. Those of us (like me) who entered the world of work even as recently as 20 years ago remember a time when most problems could not be expected to be solved using a data-driven approach, because there simply was no data. Things are very different now, with data being collected and processed all around us and available to use as direct or indirect measures of the phenomena we are interested in.
Along with the growth in data that we have seen in recent years, we have also seen a rapid growth in the availability of statistical tools—open source and free to use—that fundamentally change how we go about analytics. Gone are the clunky, complex, repeated steps on calculators or spreadsheets. In their place are lean statistical programming languages that can implement a regression analysis in milliseconds with a single line of code, allowing us to easily run and reproduce multivariable analysis at scale.
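To make the phrase 'a single line of code' concrete, here is a minimal sketch of how a multivariable regression might be fitted in R (the language used throughout this book). The data frame `salaries` and its columns are hypothetical and used purely for illustration:

```r
# Fit a linear regression of salary on experience and education in one line
# (hypothetical data frame and column names, for illustration only)
model <- lm(salary ~ experience + education, data = salaries)
summary(model)  # coefficient estimates, standard errors and p-values
```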
So given that we have access to well-developed methodology, rich sources of data and readily accessible tools, it is somewhat surprising that many analytics practitioners have a limited knowledge and understanding of regression and its applications. The aim of this book is to encourage inexperienced analytics practitioners to ‘dip their toes’ further into the wide and varied world of regression in order to deliver more targeted and precise insights to their organizations, stakeholders and research communities on the problems they are most interested in. While the primary subject matter focus of this book is the analysis of people-related phenomena, the material is easily and naturally transferable to other disciplines. Therefore this book can be regarded as a practical introduction to a wide range of regression methods for any analytics student or practitioner.
It is my firm belief that all people analytics professionals should have a strong understanding of regression models and how to implement and interpret them in practice, and my aim with this book is to provide those who need it with help in getting there. In this chapter we will set the scene for the technical learning in the remainder of the book by outlining the relevance of regression models in people analytics practice. We also touch on some general modeling theory to set a context for later chapters, distinguishing between the core goals of prediction, inference, and causal explanation. Finally, we provide a preview of the contents, structure and learning objectives of this book.
1.1 Why is regression modeling so important in people analytics?
People analytics involves the study of the behaviors and characteristics of people or groups in relation to important business, organizational or institutional outcomes. This can involve both qualitative methods and quantitative methods, but if data is available related to a particular topic of interest, then quantitative methods are almost always considered important. With such a specific focus on outcomes, any analyst working in people analytics will frequently need to model these outcomes both to understand what influences them and to potentially predict them in the future.
When we model an outcome, we are usually trying to achieve at least one of three research goals. First, we may want to predict whether or not the outcome will occur in the future based on a set of input variables. Second, we may want to infer what influences the outcome based on a set of input variables. Third, we may want to establish whether certain input variables might cause the outcome to occur. Each of these goals has different implications for the modeling approach.
When we model an outcome purely with the goal of accurately predicting its future occurrence, we are less concerned with understanding how the input variables relate to the outcome, and more concerned with ensuring that the model has learned a pattern which allows it to predict accurately based on new data. For this reason, we can use a wide range of modeling approaches, many of which we will not be able to explicitly describe or interpret because of their underlying complexity. It is often these more complex models that are able to deliver the highest levels of predictive accuracy on the new data. For example, imagine we need to build a model that accurately predicts the loss of a customer from a subscription service based on the customer’s usage patterns, so that we can send an automatic and timely discounted renewal offer to them to encourage them to maintain their subscription. In this case, the overriding business priority is to accurately predict the customers who will leave the service, ensuring that they (and only they) receive the discounted offer. The more accurate the model can be in its predictions, the better the business outcome. It does not matter so much if the most accurate model is too complex to understand or explain to stakeholders. We call such a model a predictive model.
In contrast, when we model an outcome with the goal of understanding what influences it, we are less concerned with prediction accuracy and more concerned with the explainability of the relationship between the input variables and the outcome. Often when we build such models, we have no intention of asking them to predict based on new data. Instead, our goal is to advise our stakeholders on which input variables are most important in influencing the outcome, how they influence it, and to what extent. For example, imagine we need to build a model that helps us understand which factors influence absenteeism in our organization, so that we can design a program to reduce employee absence. In this case, the overriding business priority is to make a judgment on what elements such a program should contain in order to maximize its effectiveness. The more explainable the model is in terms of how these elements relate to employee absence, the easier it will be for decision makers to design an effective program. In this case, it does not matter if the model is less accurate in predicting employee absence based on new data. We call such a model an inferential model.
Finally, there are situations where we aim to go further than understanding which input variables influence an outcome, and instead want to isolate a specific input variable and determine whether it directly causes the outcome. When we model an outcome with the goal of establishing causality, we are interested in the direct impact of a specific intervention or change on that outcome. More often than not in organizational settings, we are trying to do this on the basis of observational data rather than the results of a true experiment. When estimating causal effects from observational data, we need to carefully consider how our input variables interact in a causal framework, often based on informed hypotheses, and we need to carefully construct our inferential models to ensure that we can isolate the true effect of the intervention. For example, imagine we need to build a model that helps us understand whether a new training program for customer service representatives causes an increase in customer satisfaction scores. In this case, the overriding business priority is to determine whether the training program is effective in improving customer satisfaction, so that we can decide whether to continue or expand it. The more rigorously we can estimate a causal relationship between the training program and customer satisfaction, the more confident we can be in our decision. We call an inferential model which is built to estimate causal effects a causal inference model, or simply a causal model.
Primarily this book is concerned with the latter two goals: selecting, implementing and interpreting inferential or causal models. The current reality in the field of people analytics is that inferential models, whether causal or not, are in substantially greater demand than predictive models. There are two reasons for this. First, data sets in people analytics are rarely large enough and in good enough condition to facilitate satisfactory prediction accuracy, and so attention is usually shifted to inference for this reason alone. Second, in the field of people analytics, decisions often have a real impact on individuals. Therefore, even in the rare situations where accurate predictive modeling is attainable, stakeholders are unlikely to trust the output and bear the consequences of predictive models without some sort of elementary understanding of how the predictions are generated. This requires the analyst to consider explainability and inferential clarity as well as predictive accuracy in selecting their modeling approach. Here, many regression models come to the fore because they are commonly able to provide both inferential and predictive value.
A notable trend in most professional and clinical fields in the past decade or so is the growing importance of evidence-based practice. This has generated a need for more advanced modeling skills to satisfy rising demand for quantitative evidence from decision makers. In people-related fields such as human resources and in finance-related fields such as econometrics, many varieties of specialized regression-based models such as survival models or latent variable models have crossed from academic and clinical settings into business settings in recent years, and there is an increasing need for qualified individuals who can implement and interpret these models in practice. Of particular value are individuals who are fluent in the two main philosophies of statistical inference: frequentist and Bayesian statistics. Different organizational contexts require different types of analytic outputs, and being able to select, implement and interpret the right approach for the right context is a highly valuable skill. Of note here is the ability to understand the trade-off between models that are easier to run and interpret based on point estimates of the most likely values of parameters, versus those which require more computationally intensive simulations in order to interpret across a range of uncertainty around parameters. This is the essence of the choice between classical (frequentist) and Bayesian approaches to statistical modeling. Each has its place, and both approaches are covered in this book.
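To make this trade-off concrete, here is a minimal sketch contrasting the two approaches in R, assuming a hypothetical data frame `perf` with an outcome `rating` and an input `tenure`. The frequentist fit returns point estimates and confidence intervals almost instantly; the Bayesian fit (shown here using the `rstanarm` package as one possible choice) runs a simulation to produce a posterior distribution for each parameter:

```r
# Frequentist: point estimates and confidence intervals from a single fit
freq_model <- lm(rating ~ tenure, data = perf)
confint(freq_model)

# Bayesian: MCMC simulation of the posterior distribution of each parameter
# (requires the rstanarm package; priors can encode existing knowledge)
library(rstanarm)
bayes_model <- stan_glm(rating ~ tenure, data = perf)
posterior_interval(bayes_model, prob = 0.95)
```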
1.2 What do we mean by ‘modeling’?
The term ‘modeling’ has a very wide range of meaning in everyday life and work. In this book we are focused on inferential modeling, and we define that as a specific form of statistical learning, which tries to discover and understand a mathematical relationship between a set of measurements of certain constructs and a measurement of an outcome of interest, based on a sample of data on each. Modeling is both a concept and a process.
1.2.1 The theory of inferential modeling
We will start with a theoretical description and then provide a real example from a later chapter to illustrate.
Imagine we have a population \(\mathscr{P}\) for which we believe there may be a non-random relationship between a certain construct or set of constructs \(\mathscr{C}\) and a certain measurable outcome \(\mathscr{O}\). Imagine that for a certain sample \(S\) of observations from \(\mathscr{P}\), we have a collection of data which we believe measure \(\mathscr{C}\) to some acceptable level of accuracy, and for which we also have a measure of the outcome \(\mathscr{O}\). By convention, we denote the set of data that measure \(\mathscr{C}\) on our sample \(S\) as \(X = x_1, x_2, \dots, x_p\), where each \(x_i\) is a vector (or column) of data measuring at least one of the constructs in \(\mathscr{C}\). We denote the set of data that measure \(\mathscr{O}\) on our sample set \(S\) as \(y\). An upper-case \(X\) is used because the expectation is that there will be several columns of data measuring our constructs, and a lower-case \(y\) is used because the expectation is that the outcome is a single column.
Inferential modeling is the process of learning about a relationship (or lack of relationship) between the data in \(X\) and \(y\) and using that to describe a relationship (or lack of relationship) between our constructs \(\mathscr{C}\) and our outcome \(\mathscr{O}\) that is valid to a high degree of statistical certainty on the population \(\mathscr{P}\). This process may include:
- Testing a proposed mathematical relationship in the form of a function, structure or iterative method
- Comparing that relationship against other proposed relationships
- Describing the relationship statistically
- Determining whether the relationship (or certain elements of it) can be generalized from the sample set \(S\) to the population \(\mathscr{P}\)
When we test a relationship between \(X\) and \(y\), we acknowledge that data and measurements are imperfect and so each observation in our sample \(S\) may contain random error that we cannot control. Therefore we define our relationship as:
\[ y = f(X) + \epsilon \] where \(f\) is some transformation or function of the data in \(X\) and \(\epsilon\) is a random, uncontrollable error.
\(f\) can take the form of a predetermined function with a formula defined on \(X\), like a linear function for example. In this case we can call our model a parametric model. In a parametric model, the modeled value of \(y\) is known as soon as we know the values of \(X\) by simply applying the formula. In a non-parametric model, there is no predetermined formula that defines the modeled value of \(y\) purely in terms of \(X\). Non-parametric models need further information in addition to \(X\) in order to determine the modeled value of \(y\)—for example the value of \(y\) in other observations with similar \(X\) values.
Regression models are designed to derive \(f\) using estimation based on statistical likelihood and expectation, founded on the theory of the distribution of random variables. Regression models can be both parametric and non-parametric, but by far the most commonly used methods (and the majority of those featured in this book) are parametric. Because of their foundation in statistical likelihood and expectation, they are particularly suited to helping answer questions of generalizability—that is, to what extent can the relationship being observed in the sample \(S\) be inferred for the population \(\mathscr{P}\), which is usually the driving force in any form of inferential modeling. There are many different types of regression models available to us—far too numerous to list here—and they can all be implemented using both frequentist and Bayesian approaches. We will cover the most commonly used regression models for people-related outcomes in this book.
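The following sketch ties these ideas together using simulated data: we construct a known linear \(f\), add random error \(\epsilon\), and then estimate \(f\) both parametrically (with a linear model) and non-parametrically (with local regression). The sample size, coefficient values and variable names are arbitrary choices for illustration only:

```r
set.seed(123)

# Simulate a sample S of 200 observations with a known linear f plus random error
n  <- 200
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- rnorm(n, mean = 30, sd = 5)
y  <- 2 + 0.5 * x1 + 1.2 * x2 + rnorm(n, sd = 4)   # y = f(X) + epsilon
sample_data <- data.frame(x1, x2, y)

# Parametric: a predetermined (linear) formula for f, estimated from the sample
parametric_fit <- lm(y ~ x1 + x2, data = sample_data)
summary(parametric_fit)   # estimates of the parameters of f
confint(parametric_fit)   # uncertainty: can the relationship be generalized to P?

# Non-parametric: no fixed formula; fitted values depend on nearby observations
nonparametric_fit <- loess(y ~ x1 + x2, data = sample_data)
head(predict(nonparametric_fit))
```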
Note that there is a difference between establishing a statistical relationship between \(\mathscr{C}\) and \(\mathscr{O}\) and establishing a causal relationship between the two. This can be a common trap that inexperienced statistical analysts fall into when communicating the conclusions of their modeling. Establishing that a relationship exists between a construct and an outcome is a far cry from being able to say that one causes the other. This is the common truism that ‘correlation does not equal causation’. Running regression models with the goal of finding evidence of a causal effect requires a more rigorous framework, an enhanced toolkit and skilled practice in bringing expert knowledge to bear on the problem. In the final part of this book, we will introduce modern methods for causal inference that provide a language and a toolkit for moving beyond association to estimate causal effects under specific, clearly stated assumptions.
To bring our theory to life, consider the walkthrough example in Chapter 4 of this book. In this example, we discuss how to establish a relationship between the academic results of students in the first three years of their education program and their results in the fourth year. In this case, our population \(\mathscr{P}\) is all past, present and future students who take similar examinations, and our sample \(S\) is the students who completed their studies in the past three years. \(X = x_1, x_2, x_3\) are each of the three scores from the first three years, and \(y\) is the score in the fourth year. We test \(f\) to be a linear relationship, and we establish that such a relationship can be generalized to the entire population \(\mathscr{P}\) with a substantial level of statistical confidence[1].
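In R, testing a linear form for \(f\) in this example might look like the single line below, assuming a hypothetical data frame `academic_results` with columns `Yr1`, `Yr2`, `Yr3` and `Final` for the four sets of scores (the actual data used is introduced in Chapter 4):

```r
# Linear model of the fourth-year score on the three prior years' scores
# (hypothetical data frame and column names, for illustration)
year4_model <- lm(Final ~ Yr1 + Yr2 + Yr3, data = academic_results)
summary(year4_model)   # which years' scores generalize to the population?
```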
Almost all our work in this book will refer to the variables \(X\) as input variables and the variable \(y\) as the outcome variable. There are many other common terms for these which you may find in other sources—for example \(X\) are often known as independent variables or covariates while \(y\) is often known as a dependent or response variable.
1.2.2 The process of inferential modeling
Inferential modeling—regression or otherwise—is a process of numerous steps. Typically the main steps are[2]:
- Defining the outcome of interest \(\mathscr{O}\) and the input constructs \(\mathscr{C}\) based on a broader evidence-based objective
- Confirming that \(\mathscr{O}\) has reliable measurement data
- Determining which data can be used to measure \(\mathscr{C}\)
- Determining a sample \(S\) and collecting, refining and cleaning data
- Performing exploratory data analysis (EDA) and proposing a set of models to test for \(f\)
- Putting the data in an appropriate format for each model
- Running the models to estimate their parameters
- Interpreting the parameter estimates and performing model diagnostics
- Selecting an optimal model or models
- Articulating the uncertainty around the parameter estimates and inferring which input variables have a meaningful effect on the outcome for the entire population \(\mathscr{P}\)
This book is primarily focused on steps 7–10 of this process[3]. That is not to say that steps 1–6 are not important. Indeed these steps are critical and often loaded with analytic traps. Defining the problem, collecting reliable measures and cleaning and organizing data are still the source of much pain and angst for analysts, but these topics are for another day.
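As a rough sketch of what steps 7 through 10 look like in practice, the core modeling loop in R might resemble the following, again using a hypothetical data frame and variable names for illustration:

```r
# Step 7: run the model to estimate its parameters
model_a <- lm(outcome ~ input1 + input2 + input3, data = model_data)

# Step 8: interpret the parameter estimates and perform diagnostics
summary(model_a)              # coefficients, standard errors, fit statistics
plot(model_a, which = 1)      # residuals vs fitted values diagnostic

# Step 9: compare against a competing model and select the better one
model_b <- lm(outcome ~ input1 + input2, data = model_data)
AIC(model_a, model_b)         # lower AIC suggests a better fit-complexity trade-off

# Step 10: articulate uncertainty and infer which inputs matter in the population
confint(model_a)              # confidence intervals around the parameter estimates
```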
1.3 The structure, system and organization of this book
The purpose of this book is to put inexperienced practitioners firmly on a path to the confident and appropriate use of regression techniques in their day-to-day work. This requires enough of an understanding of the underlying theory so that judgments can be made about results, but also a clear workflow to help practitioners apply the most common regression methods to a variety of typical modeling scenarios in a reliable and reproducible way.
In most chapters, some time is spent on the underlying mathematics: not to the degree expected of an academic theorist, but enough to ensure that the reader can associate some mathematical meaning with the outputs of models. While it may be tempting to skip the math, I strongly recommend against it if you intend to be a high performer in your field. The best analysts are those who can genuinely understand what the numbers are telling them.
The statistical programming language R is used for most of the practical demonstration in each chapter. Because R is open source and particularly well geared to statistics, it is an excellent choice for those whose work involves a lot of inferential analysis. Though R will be the primary mode of technical instruction, we will additionally show implementations of the methodologies in Python, which is also a powerful open source tool for this sort of work and is more widely used across the broader field of data science. These can usually be found at the end of chapters.
Most chapters involve a walkthrough example to illustrate the specific method and to allow the reader to replicate the analysis for themselves. The exercises at the end of each chapter are designed so that the reader can try the same method on a different data set, or a different problem on the same data set, to test their learning and understanding. In the final chapter, a series of data sets and exercises are provided with limited instruction in order to give the reader an opportunity to test their overall knowledge in selecting and applying regression methods to a variety of people analytics data sets and problems. All in all, nineteen different data sets are used as walkthrough or exercise examples, and all of these data sets are fictitious constructions unless otherwise indicated. Despite the fiction, they are deliberately designed to present the reader with something resembling how the data might look in practice, albeit cleaner and more organized.
In anticipation that there will be readers who have little or no experience in using programming languages to perform analytics, Chapter 2 covers the basics of the R programming language. Experienced R programmers can skip this chapter. The next nine chapters cover a range of classical statistical modeling methods, starting with foundational statistical concepts and moving through a variety of regression models for different types of outcome variables. The methods in these chapters will align closely with the statistical approaches most commonly taught as part of undergraduate statistical methods programs in psychology, sociology, econometrics and other disciplines. They are suitable for most situations where a reasonable amount of data is available for analysis.
- Chapter 3 covers the essential statistical concepts needed to understand multivariable regression models. It also serves as a tutorial in the classical approach to univariate and bivariate statistics illustrated with real data. If you need help developing a decent understanding of descriptive statistics, random distributions and classical hypothesis testing, this is an important chapter to study.
- Chapter 4 covers linear regression and in the course of that introduces many other foundational concepts. The walkthrough example involves modeling academic results from previous results. The exercises involve modeling income levels based on various work and demographic factors.
- Chapter 5 covers binomial logistic regression. The walkthrough example involves modeling promotion likelihood based on performance metrics. The exercises involve modeling charitable donation likelihood based on prior donation behavior and demographics.
- Chapter 6 covers multinomial regression. The walkthrough example and exercises involve modeling the choice of three health insurance products by company employees based on demographic and position data.
- Chapter 7 covers ordinal regression. The walkthrough example involves modeling in-game disciplinary action against soccer players based on prior discipline and other factors. The exercises involve modeling manager performance based on varied data.
- Chapter 8 covers count data regression, including Poisson, Quasi-Poisson and negative binomial regression. The walkthrough example involves modeling absenteeism in a technology company. The exercises involve modeling the number of complaints received about telephone customer service representatives in a retail company.
- Chapter 9 covers modeling options for data with explicit or latent hierarchy. The first part covers mixed modeling and uses a model of speed dating decisions as a walkthrough and example. The second part covers structural equation modeling and uses a survey for a political party as a walkthrough example. The exercises involve modeling latent variables in an employee engagement survey.
- Chapter 10 covers survival analysis, Cox proportional hazard regression and frailty models. The chapter uses employee attrition as a walkthrough example and exercise.
- Chapter 11 covers power analysis, focusing in particular on estimating the minimum sample sizes required to establish meaningful inferences for both simple statistical tests and multivariate models. Examples related to experimental studies are used to illustrate, such as concurrent validity studies of selection instruments.
The next three chapters cover Bayesian statistical modeling methods, introducing the reader to a different paradigm for statistical inference. While the models covered in these chapters are similar to those covered in the previous chapters, the Bayesian approach to inference is substantially different from the classical approach, and this has important implications for how models are implemented and interpreted. The methods in these chapters would most commonly be found in postgraduate methods programs, and are most suitable for situations where analysts have a high degree of prior knowledge about the processes they are modeling, or where modeling is being conducted on limited data and sample sizes, requiring the analyst to be more explicit about the underlying uncertainty of their estimates:
- Chapter 12 introduces the Bayesian approach to statistical inference, contrasting it with the classical approach. It covers foundational concepts such as prior, likelihood and posterior distributions, and illustrates these with simple examples. The chapter also covers hypothesis testing using the Bayes Factor. The walkthrough example involves estimating the success of a learning program based on expert beliefs and a small number of learners who took a pilot program. The exercises go back to our academic results dataset in Chapter 4 to provide practice in estimating posteriors and performing Bayesian hypothesis testing.
- Chapter 13 covers Bayesian linear regression, introducing models that use Markov-Chain Monte Carlo (MCMC) methods to simulate posterior distributions of model parameters. This chapter covers all the practical aspects of implementing and interpreting Bayesian regression models which will be carried over to the next chapter. The walkthrough example and exercises are similar to Chapter 4.
- Chapter 14 goes through Bayesian implementations of the other regression models covered in the earlier chapters of the book, including binomial logistic regression, multinomial regression, ordinal regression, count data regression, mixed models and survival models. The walkthrough examples and exercises are similar to those in the corresponding classical chapters.
We complete the instructional material in Chapter 15 with an overview of causal inference theory and methods. In this chapter we introduce the potential outcomes framework, concepts such as confounding, mediation and moderation, and how to use directed acyclic graphs (DAGs) to state causal hypotheses and construct causal models. The walkthrough example involves estimating the causal effect of the score in a selection test on the decision to hire a candidate in a recruiting process. The exercises return to those in Chapter 4, encouraging the reader to estimate possible causal effects in the data.
The final chapter is a set of problems and data sets which will allow the reader to practice the skills they have learned in this book and apply them to a variety of people analytics domains such as recruiting, performance, promotion, compensation and learning. Sets of discussion questions and data exercises will guide the reader through each problem, but these are designed in a way that encourages the independent selection and application of the methods covered in this book. These data sets, problems and exercises would suit as homework material for classes in statistical modeling or people analytics.
[1] We also determine that \(x_1\) (the first-year examination score) plays no significant role in \(f\) and that introducing some non-linearity into \(f\) further improves the statistical accuracy of the inferred relationship.
[2] Additional critical steps are required if the goal is causal inference, but these are covered in Chapter 15.
[3] The book also addresses steps 5 and 6 in some chapters.