Introduction
As a fresh-faced undergraduate in mathematics in the 1990s, I took an introductory course in statistics in my first term. I would never take another. I struggled with the subject, scored my lowest grade in it and swore I would never go anywhere near it again.
How wrong I was. Today I live and breathe statistics. How did that happen?
Firstly, statistics is about solving real-world problems, and amazingly there was not a single mention of a relatable problem from real life in that course I took all those years ago, just abstract mathematics. Nowadays, I know from my work and my personal learning activities that the mathematics has no meaning without a motivating problem to apply it to, and you’ll see example problems all through this book.
Secondly, statistics is all about data, and working with real data has encouraged me to reengage with statistics and come at it from a different angle—bottom-up you could say. Suddenly all those concepts that were put up on whiteboards using abstract formulas now had real meaning and consequence to the data I was working with. For me, real data helps statistical theory come to life, and this book is supported by numerous data sets designed for the reader to engage with.
But one more step solidified my newfound love of statistics, and that was when I put regression modeling into practice. Faced with data sets that I initially believed were just far too messy and random to be able to produce genuine insights, I progressively became more and more fascinated by how regression can cut through the messiness, compartmentalize the randomness and lead you straight to inferences that are often surprising both in their clarity and in their conclusions.
Hence my motivation for writing this book, which is to give others—whether working in people analytics or otherwise—a starting point for a practical learning of regression methods, with the hope that they will see immediate applications to their work and take advantage of a much-underused toolkit that provides strong support for evidence-based practice.
I am a mathematician who is now a practitioner of analytics. For this reason you should see that this book is neither afraid of nor obsessed with the mathematics of the methodologies covered. It is my general observation that many students and practitioners make the mistake of trying to run multivariable models without even a basic understanding of the underlying mathematics of those models, and I find it very difficult to see how they can be credible in responding to a wide range of questions or critique about their work without such an understanding. That said, it is also not necessary for students and practitioners to understand the deepest levels of theory in order to be fluent in running and interpreting multivariable models. In this book I have tried to limit the mathematical exposition to a level that allows confident and fluent execution and interpretation.
I subscribe strongly to the principles of open source sharing of knowledge. If you want to reference the material in this book or use the exercises or data sets in trainings or classes, you are free to do so and you do not need to request my permission. I only ask that you make reference to this book as the source.
I expect this book to improve continuously. If you found this book or any part of it helpful in solving a problem, I’d love to hear about it. If you have comments that would improve the book, or questions about any aspect of its contents, I encourage you to leave an issue on its GitHub repository. This is the most reliable way for me to see your comment. I promise to consider all comments and input, but I do have to make a personal judgment about whether they are helpful to the aims and purpose of this book.
I would like to thank the many individuals who have contributed to the content of this book by making suggestions, finding errors and suggesting clarifications. Over the past five years since the first edition was published, those names have become too numerous to mention, but they have all played an important part and this book would not be where it is today without their generous time and effort. My sincere thanks to my friend and colleague Alexis Fink for drawing on her years of people analytics experience to set the context for this book in her foreword. Her constant encouragement and positivity, along with our shared passion to improve the science and study of work, have been an important motivator for me in creating this resource. My thanks to the people analytics community for their constant encouragement and support in sharing theory, content and method, and to the R and Python communities for all the work they do in giving us amazing and constantly improving statistical software tools to work with. Finally, I would like to thank my family for their patience and understanding on the evenings and weekends I dedicated to the writing of this book, and for tolerating far too much dinner conversation on the topic of statistics.
Updates for the second edition
Since the first edition of this book was published in 2021, I have been genuinely and pleasantly surprised by the level of interest it has generated, the many comments I have received from readers, and the widespread use of the book in both academic and practitioner settings. I have also heeded the many requests to expand the content, and this second edition represents a substantial expansion of the first edition. The chapters on classical regression methods have been expanded to include new material on modeling count data, while a number of the explanations have been improved and the methods updated to ensure they are in line with the latest available software. Five years is a long time in the open-source software universe!
An entire new set of chapters has been added introducing Bayesian methods for regression modeling. Bayesian methods have become increasingly popular in recent years, and I believe they have a lot to offer people analytics practitioners, especially those working with prior knowledge of their data generating processes or those working with small data samples in highly uncertain environments. The new chapters introduce the fundamental concepts of Bayesian statistics and then build on those concepts to show how all of our regression methods can be implemented and interpreted using a Bayesian framework.
Finally, I have added a chapter on causal inference methods, which are garnering increasing interest not just in people analytics but in most fields of applied statistics. Causal inference is essentially a more disciplined approach to regression modeling which encourages the analyst to construct their model with extra care on the basis of a belief about the causal relationships between variables. It is a nice way to wrap up the extensive range of methods covered in the book, adding a framework and set of tools which can be applied to everything learned in previous chapters.
Keith McNulty
January 2025