Handbook of Regression Modeling in People Analytics: With Examples in R and Python

Keith McNulty

1 The Importance of Regression in People Analytics

In the 19th century, when Francis Galton first used the term ‘regression’ to describe a statistical phenomenon (see Chapter 4), little did he know how important that term would be today. Many of the most powerful tools of statistical inference that we now have at our disposal can be traced back to the types of early analysis that Galton and his contemporaries were engaged in. The sheer number of different regression-related methodologies and variants that are available to researchers and practitioners today is mind-boggling, and there are still rich veins of ongoing research that are focused on defining and refining new forms of regression to tackle new problems.

Neither could Galton have imagined the advent of the age of data we now live in. Those of us (like me) who entered the world of work even as recently as 20 years ago remember a time when most problems could not be expected to be solved using a data-driven approach, because there simply was no data. Things are very different now, with data being collected and processed all around us and available to use as direct or indirect measures of the phenomena we are interested in.

Along with the growth in data that we have seen in recent years, we have also seen a rapid growth in the availability of statistical tools—open source and free to use—that fundamentally change how we go about analytics. Gone are the clunky, complex, repeated steps on calculators or spreadsheets. In their place are lean statistical programming languages that can implement a regression analysis in milliseconds with a single line of code, allowing us to easily run and reproduce multivariate analysis at scale.

So given that we have access to well-developed methodology, rich sources of data and readily accessible tools, it is somewhat surprising that many analytics practitioners have a limited knowledge and understanding of regression and its applications. The aim of this book is to encourage inexperienced analytics practitioners to ‘dip their toes’ further into the wide and varied world of regression in order to deliver more targeted and precise insights to their organizations and stakeholders on the problems they are most interested in. While the primary subject matter focus of this book is the analysis of people-related phenomena, the material is easily and naturally transferable to other disciplines. Therefore this book can be regarded as a practical introduction to a wide range of regression methods for any analytics student or practitioner.

It is my firm belief that all people analytics professionals should have a strong understanding of regression models and how to implement and interpret them in practice, and my aim with this book is to provide those who need it with help in getting there. In this chapter we will set the scene for the technical learning in the remainder of the book by outlining the relevance of regression models in people analytics practice. We also touch on some general inferential modeling theory to set a context for later chapters, and we provide a preview of the contents, structure and learning objectives of this book.

1.1 Why is regression modeling so important in people analytics?

People analytics involves the study of the behaviors and characteristics of people or groups in relation to important business, organizational or institutional outcomes. This can involve both qualitative methods and quantitative methods, but if data is available related to a particular topic of interest, then quantitative methods are almost always considered important. With such a specific focus on outcomes, any analyst working in people analytics will frequently need to model these outcomes both to understand what influences them and to potentially predict them in the future.

Modeling an outcome with the primary goal of understanding what influences it can be quite a different matter to modeling an outcome with the primary goal of predicting if it will happen in the future. If we need to understand what influences an outcome, we need to get inside a model and construct a formula or structure to infer how each variable acts on that outcome, we need to get a sense of which variables are meaningful or not, and we need to quantify the ‘explainability’ of the outcome based on our variables. If our primary aim is to predict the outcome, getting inside the model is less important because we don’t have to explain the outcome, we just need to be confident that it predicts accurately.

A model constructed to understand an outcome is often called an inferential model. Regression models are the most well-known and well-used inferential models available, providing a wide range of measures and insights that help us explain the relationship between our input variables and our outcome of interest, as we shall see in later chapters of this book.

The current reality in the field of people analytics is that inferential models are more required than predictive models. There are two reasons for this. First, data sets in people analytics are rarely large enough to facilitate satisfactory prediction accuracy, and so attention is usually shifted to inference for this reason alone. Second, in the field of people analytics, decisions often have a real impact on individuals. Therefore, even in the rare situations where accurate predictive modeling is attainable, stakeholders are unlikely to trust the output and bear the consequences of predictive models without some sort of elementary understanding of how the predictions are generated. This requires the analyst to consider inference power as well as predictive accuracy in selecting their modeling approach. Again, many regression models come to the fore because they are commonly able to provide both inferential and predictive value.

Finally, the growing importance of evidence-based practice in many clinical and professional fields has generated a need for more advanced modeling skills to satisfy rising demand for quantitative evidence from decision makers. In people-related fields such as human resources, many varieties of specialized regression-based models such as survival models or latent variable models have crossed from academic and clinical settings into business settings in recent years, and there is an increasing need for qualified individuals who understand and can implement and interpret these models in practice.

1.2 What do we mean by ‘modeling’‍?

The term ‘modeling’ has a very wide range of meaning in everyday life and work. In this book we are focused on inferential modeling, and we define that as a specific form of statistical learning, which tries to discover and understand a mathematical relationship between a set of measurements of certain constructs and a measurement of an outcome of interest, based on a sample of data on each. Modeling is both a concept and a process.

1.2.1 The theory of inferential modeling

We will start with a theoretical description and then provide a real example from a later chapter to illustrate.

Imagine we have a population \(\mathscr{P}\) for which we believe there may be a non-random relationship between a certain construct or set of constructs \(\mathscr{C}\) and a certain measurable outcome \(\mathscr{O}\). Imagine that for a certain sample \(S\) of observations from \(\mathscr{P}\), we have a collection of data which we believe measure \(\mathscr{C}\) to some acceptable level of accuracy, and for which we also have a measure of the outcome \(\mathscr{O}\).

By convention, we denote the set of data that measure \(\mathscr{C}\) on our sample \(S\) as \(X = x_1, x_2, \dots, x_p\), where each \(x_i\) is a vector (or column) of data measuring at least one of the constructs in \(\mathscr{C}\). We denote the set of data that measure \(\mathscr{O}\) on our sample set \(S\) as \(y\). An upper-case \(X\) is used because the expectation is that there will be several columns of data measuring our constructs, and a lower-case \(y\) is used because the expectation is that the outcome is a single column.

Inferential modeling is the process of learning about a relationship (or lack of relationship) between the data in \(X\) and \(y\) and using that to describe a relationship (or lack of relationship) between our constructs \(\mathscr{C}\) and our outcome \(\mathscr{O}\) that is valid to a high degree of statistical certainty on the population \(\mathscr{P}\). This process may include:

Testing a proposed mathematical relationship in the form of a function, structure or iterative method
Comparing that relationship against other proposed relationships
Describing the relationship statistically
Determining whether the relationship (or certain elements of it) can be generalized from the sample set \(S\) to the population \(\mathscr{P}\)

When we test a relationship between \(X\) and \(y\), we acknowledge that data and measurements are imperfect and so each observation in our sample \(S\) may contain random error that we cannot control. Therefore we define our relationship as:

\[ y = f(X) + \epsilon \] where \(f\) is some transformation or function of the data in \(X\) and \(\epsilon\) is a random, uncontrollable error.

\(f\) can take the form of a predetermined function with a formula defined on \(X\), like a linear function for example. In this case we can call our model a parametric model. In a parametric model, the modeled value of \(y\) is known as soon as we know the values of \(X\) by simply applying the formula. In a non-parametric model, there is no predetermined formula that defines the modeled value of \(y\) purely in terms of \(X\). Non-parametric models need further information in addition to \(X\) in order to determine the modeled value of \(y\)—for example the value of \(y\) in other observations with similar \(X\) values.

Regression models are designed to derive \(f\) using estimation based on statistical likelihood and expectation, founded on the theory of the distribution of random variables. Regression models can be both parametric and non-parametric, but by far the most commonly used methods (and the majority of those featured in this book) are parametric. Because of their foundation in statistical likelihood and expectation, they are particularly suited to helping answer questions of generalizability—that is, to what extent can the relationship being observed in the sample \(S\) be inferred for the population \(\mathscr{P}\), which is usually the driving force in any form of inferential modeling.

Note that there is a difference between establishing a statistical relationship between \(\mathscr{C}\) and \(\mathscr{O}\) and establishing a causal relationship between the two. This can be a common trap that inexperienced statistical analysts fall into when communicating the conclusions of their modeling. Establishing that a relationship exists between a construct and an outcome is a far cry from being able to say that one causes the other. This is the common truism that ‘correlation does not equal causation’‍.

To bring our theory to life, consider the walkthrough example in Chapter 4 of this book. In this example, we discuss how to establish a relationship between the academic results of students in the first three years of their education program and their results in the fourth year. In this case, our population \(\mathscr{P}\) is all past, present and future students who take similar examinations, and our sample \(S\) is the students who completed their studies in the past three years. \(X = x_1, x_2, x_3\) are each of the three scores from the first three years, and \(y\) is the score in the fourth year. We test \(f\) to be a linear relationship, and we establish that such a relationship can be generalized to the entire population \(\mathscr{P}\) with a substantial level of statistical confidence¹.

Almost all our work in this book will refer to the variables \(X\) as input variables and the variable \(y\) as the outcome variable. There are many other common terms for these which you may find in other sources—for example \(X\) are often known as independent variables or covariates while \(y\) is often known as a dependent or response variable.

1.2.2 The process of inferential modeling

Inferential modeling—regression or otherwise—is a process of numerous steps. Typically the main steps are:

Defining the outcome of interest \(\mathscr{O}\) and the input constructs \(\mathscr{C}\) based on a broader evidence-based objective
Confirming that \(\mathscr{O}\) has reliable measurement data
Determining which data can be used to measure \(\mathscr{C}\)
Determining a sample \(S\) and collecting, refining and cleaning data.
Performing exploratory data analysis (EDA) and proposing a set of models to test for \(f\)
Putting the data in an appropriate format for each model
Running the models
Interpreting the outputs and performing model diagnostics
Selecting an optimal model or models
Articulating the inferences that can be generalized to apply to \(\mathscr{P}\)

This book is primarily focused on steps 7–10 of this process². That is not to say that steps 1–6 are not important. Indeed these steps are critical and often loaded with analytic traps. Defining the problem, collecting reliable measures and cleaning and organizing data are still the source of much pain and angst for analysts, but these topics are for another day.

1.3 The structure, system and organization of this book

The purpose of this book is to put inexperienced practitioners firmly on a path to the confident and appropriate use of regression techniques in their day-to-day work. This requires enough of an understanding of the underlying theory so that judgments can be made about results, but also a practical set of steps to help practitioners apply the most common regression methods to a variety of typical modeling scenarios in a reliable and reproducible way.

In most chapters, time is spent on the underlying mathematics. Not to the degree of an academic theorist, but enough to ensure that the reader can associate some mathematical meaning to the outputs of models. While it may be tempting to skip the math, I strongly recommend against it if you intend to be a high performer in your field. The best analysts are those who can genuinely understand what the numbers are telling them.

The statistical programming language R is used for most of the practical demonstration in each chapter. Because R is open source and particularly well geared to inferential statistics, it is an excellent choice for those whose work involves a lot of inferential analysis. In later chapters, we show implementations of all of the available methodologies in Python and Julia, which are also powerful open source tools for this sort of work.

Each chapter involves a walkthrough example to illustrate the specific method and to allow the reader to replicate the analysis for themselves. The exercises at the end of each chapter are designed so that the reader can try the same method on a different data set, or a different problem on the same data set, to test their learning and understanding. All in all, eleven different data sets are used as walkthrough or exercise examples, and all of these data sets are fictitious constructions unless otherwise indicated. Despite the fiction, they are deliberately designed to present the reader with something resembling how the data might look in practice, albeit cleaner and more organized.

The chapters of this book are arranged as follows:

Chapter 2 covers the basics of the R programming language for those who want to attempt to jump straight in to the work in subsequent chapters but have very little R experience. Experienced R programmers can skip this chapter.
Chapter 3 covers the essential statistical concepts needed to understand multivariate regression models. It also serves as a tutorial in univariate and bivariate statistics illustrated with real data. If you need help developing a decent understanding of descriptive statistics, random distribution and hypothesis testing, this is an important chapter to study.
Chapter 4 covers linear regression and in the course of that introduces many other foundational concepts. The walkthrough example involves modeling academic results from prior results. The exercises involve modeling income levels based on various work and demographic factors.
Chapter 5 covers binomial logistic regression. The walkthrough example involves modeling promotion likelihood based on performance metrics. The exercises involve modeling charitable donation likelihood based on prior donation behavior and demographics.
Chapter 6 covers multinomial regression. The walkthrough example and exercise involves modeling the choice of three health insurance products by company employees based on demographic and position data.
Chapter 7 covers ordinal regression. The walkthrough example involves modeling in-game disciplinary action against soccer players based on prior discipline and other factors. The exercises involve modeling manager performance based on varied data.
Chapter 8 covers modeling options for data with explicit or latent hierarchy. The first part covers mixed modeling and uses a model of speed dating decisions as a walkthrough and example. The second part covers structural equation modeling and uses a survey for a political party as a walkthrough example. The exercises involve modeling latent variables in an employee engagement survey.
Chapter 9 covers survival analysis, Cox proportional hazard regression and frailty models. The chapter uses employee attrition as a walkthrough example and exercise.
Chapter 10 outlines alternative technical approaches to regression modeling in R, Python and Julia. Models from previous chapters are used to illustrate these alternative approaches.
Chapter 11 covers power analysis, focusing in particular on estimating the required minimum sample sizes in establishing meaningful inferences for both simple statistical tests and multivariate models. Examples related to experimental studies are used to illustrate, such as concurrent validity studies of selection instruments. Example implementations in R, Python and Julia are outlined.

Introduction

2 The Basics of the R Programming Language