10 Alternative Technical Approaches in R and Python
As outlined earlier in this book, the technical implementations of the modeling techniques in previous chapters have relied wherever possible on base R code and on specialist packages for specific methodologies. This allowed a focus on the basics of understanding, running and interpreting these models, which is the key aim of this book. For those interested in a wider range of technical options for running inferential statistical models, this chapter illustrates some alternative options; it should be considered a starting point rather than an in-depth exposition.
First we look at options for generating models in more predictable formats in R. We have seen in prior chapters that the output of many models in R can be inconsistent. In many cases we are given more information than we need, and in some cases we have less than we need. Formats can vary, and we sometimes need to look in different parts of the output to find the specific statistics we seek. The tidymodels set of packages tries to bring the principles of tidy data into the realm of statistical modeling, and we will illustrate this briefly.
Second, for those whose preference is to use Python, we provide some examples of how inferential regression models can be run in Python. While Python is particularly well-tooled for running predictive models, it does not have the full range of statistical inference tools that are available in R. In particular, using predictive modeling or machine learning packages like scikit-learn to conduct regression modeling can leave the analyst lacking when seeking certain model statistics, because those statistics are not typically sought in a predictive modeling workflow. We briefly illustrate some Python packages which perform modeling with a greater emphasis on inference versus prediction.
10.1 ‘Tidier’ modeling approaches in R
The tidymodels meta-package is a collection of packages which collectively apply the principles of tidy data to the construction of statistical models. More information and learning resources on tidymodels can be found here. Within tidymodels there are two packages which are particularly useful in controlling the output of models in R: the broom and parsnip packages.
10.1.1 The broom package
Consistent with its name, broom aims to tidy up the output of models into a predictable format. It works with over 100 different types of models in R. To illustrate its use, let's run a model from a previous chapter: specifically, our salesperson promotion model from Chapter 5.
# obtain salespeople data
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
As in Chapter 5, we convert the promoted column to a factor and run a binomial logistic regression model on the promoted outcome.
# convert promoted to factor
salespeople$promoted <- as.factor(salespeople$promoted)
# build model to predict promotion based on sales and customer_rate
promotion_model <- glm(formula = promoted ~ sales + customer_rate,
                       family = "binomial",
                       data = salespeople)
We now have our model sitting in memory. We can use three key functions in the broom package to view a variety of model statistics. First, the tidy() function allows us to see the coefficient statistics of the model.
# load tidymodels metapackage
library(tidymodels)
# view coefficient statistics
broom::tidy(promotion_model)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -19.5 3.35 -5.83 5.48e- 9
## 2 sales 0.0404 0.00653 6.19 6.03e-10
## 3 customer_rate -1.12 0.467 -2.40 1.63e- 2
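Since this is a binomial logistic regression model, we may prefer to see odds ratios rather than raw coefficients. The following is a brief sketch of how this could be done, assuming a broom version in which tidy() supports the exponentiate and conf.int arguments for glm models:

# sketch: view coefficients as odds ratios with 95% confidence intervals
broom::tidy(promotion_model, exponentiate = TRUE, conf.int = TRUE)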
The glance() function allows us to see a row of overall model statistics:
# view model statistics
broom::glance(promotion_model)
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 440. 349 -32.6 71.1 82.7 65.1 347 350
And the augment() function augments the observations in the dataset with a range of observation-level model statistics, such as residuals.
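For example, the first few rows of the augmented data for our promotion model can be viewed as follows:

# view the first six rows of observation-level statistics
head(broom::augment(promotion_model))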
## # A tibble: 6 x 9
## promoted sales customer_rate .fitted .resid .std.resid .hat .sigma .cooksd
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 594 3.94 0.0522 -1.20 -1.22 0.0289 0.429 1.08e- 2
## 2 0 446 4.06 -6.06 -0.0683 -0.0684 0.00212 0.434 1.66e- 6
## 3 1 674 3.83 3.41 0.255 0.257 0.0161 0.434 1.84e- 4
## 4 0 525 3.62 -2.38 -0.422 -0.425 0.0153 0.433 4.90e- 4
## 5 1 657 4.4 2.08 0.485 0.493 0.0315 0.433 1.40e- 3
## 6 1 918 4.54 12.5 0.00278 0.00278 0.0000174 0.434 2.24e-11
These functions are model-agnostic for a very wide range of common models in R. For example, we can use them on our proportional odds model on soccer discipline from Chapter 7 and they will generate the relevant statistics in tidy tables.
# get soccer data
url <- "http://peopleanalytics-regression-book.org/data/soccer.csv"
soccer <- read.csv(url)
# convert discipline to ordered factor
soccer$discipline <- ordered(soccer$discipline,
                             levels = c("None", "Yellow", "Red"))
# run proportional odds model
library(MASS)
soccer_model <- polr(
  formula = discipline ~ n_yellow_25 + n_red_25 + position +
    country + level + result,
  data = soccer
)
# view model statistics
broom::glance(soccer_model)
## # A tibble: 1 x 7
## edf logLik AIC BIC deviance df.residual nobs
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10 -1722. 3465. 3522. 3445. 2281 2291
broom functions integrate well with other tidyverse methods and allow models to be run easily over nested subsets of data. For example, if we want to run our soccer discipline model across the different countries in the dataset and see all the model statistics in a neat table, we can use typical tidyverse grammar to do so with dplyr.
# load the tidyverse metapackage (includes dplyr)
library(tidyverse)
# define function to run soccer model and glance at results
soccer_model_glance <- function(df) {
  broom::glance(
    polr(
      formula = discipline ~ n_yellow_25 + n_red_25 +
        position + level + result,
      data = df
    )
  )
}
# run it nested by country
soccer %>%
  dplyr::nest_by(country) %>%
  dplyr::summarise(soccer_model_glance(data))
## # A tibble: 2 x 8
## # Groups: country [2]
## country edf logLik AIC BIC deviance df.residual nobs
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 England 9 -843. 1704. 1749. 1686. 1123 1132
## 2 Germany 9 -877. 1773. 1818. 1755. 1150 1159
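In the same way, coefficient-level statistics could be generated for each country by swapping glance() for tidy() in the nested workflow. The following is a brief sketch along the same lines as the example above:

# sketch: tidy coefficient tables for each country
soccer_model_tidy <- function(df) {
  broom::tidy(
    polr(
      formula = discipline ~ n_yellow_25 + n_red_25 +
        position + level + result,
      data = df
    )
  )
}

soccer %>%
  dplyr::nest_by(country) %>%
  dplyr::summarise(soccer_model_tidy(data))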
10.1.2 The parsnip package
The parsnip package aims to create a unified interface to running models, so that users do not need to learn different model terminology and other minutiae. It also takes a more hierarchical approach to defining models that is similar in nature to the object-oriented approaches that Python users will be more familiar with.
Again let’s use our salesperson promotion model example to illustrate. We start by defining a model family that we wish to use, in this case logistic regression, and define a specific engine and mode.
model <- parsnip::logistic_reg() %>%
  parsnip::set_engine("glm") %>%
  parsnip::set_mode("classification")
We can use the translate() function to see what kind of model we have created:
model %>%
  parsnip::translate()
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
##
## Model fit template:
## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
## family = stats::binomial)
Now, with our model defined, we can fit it using a formula and data, and then use broom to view the coefficients:
model %>%
  parsnip::fit(formula = promoted ~ sales + customer_rate,
               data = salespeople) %>%
  broom::tidy()
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -19.5 3.35 -5.83 5.48e- 9
## 2 sales 0.0404 0.00653 6.19 6.03e-10
## 3 customer_rate -1.12 0.467 -2.40 1.63e- 2
parsnip functions are primarily motivated by tooling for machine learning model workflows, in a similar way to scikit-learn in Python, but they can offer an attractive approach to coding inferential models, particularly where common families of models are used.
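The value of this unified interface becomes clearer when switching computational engines. As a brief sketch (assuming the glmnet engine and package are available on the system), the same logistic regression family can be mapped to a regularized implementation simply by changing the engine:

# sketch: the same model family mapped to a different engine
parsnip::logistic_reg(penalty = 0.01) %>%
  parsnip::set_engine("glmnet") %>%
  parsnip::set_mode("classification") %>%
  parsnip::translate()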
10.2 Inferential statistical modeling in Python
In general, the modeling functions contained in scikit-learn, which tends to be the go-to modeling package for most Python users, are oriented towards predictive modeling and can be challenging to navigate for those who are primarily interested in inferential modeling. In this section we briefly review approaches for running some of the models contained in this book in Python. The statsmodels package is highly recommended, as it offers a wide range of models which report similar statistics to those reviewed in this book. Full statsmodels documentation can be found here.
10.2.1 Ordinary least squares (OLS) linear regression
The OLS linear regression model reviewed in Chapter 4 can be generated using the statsmodels package, which can report a reasonably thorough set of model statistics. By using the statsmodels formula API, model formulas similar to those used in R can be used.
import pandas as pd
import statsmodels.formula.api as smf
# get data
= "http://peopleanalytics-regression-book.org/data/ugtests.csv"
url = pd.read_csv(url)
ugtests
# define model
= smf.ols(formula = "Final ~ Yr3 + Yr2 + Yr1", data = ugtests)
model
# fit model
= model.fit()
ugtests_model
# see results summary
print(ugtests_model.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Final R-squared: 0.530
## Model: OLS Adj. R-squared: 0.529
## Method: Least Squares F-statistic: 365.5
## Date: Mon, 25 Jan 2021 Prob (F-statistic): 8.22e-159
## Time: 16:10:39 Log-Likelihood: -4711.6
## No. Observations: 975 AIC: 9431.
## Df Residuals: 971 BIC: 9451.
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 14.1460 5.480 2.581 0.010 3.392 24.900
## Yr3 0.8657 0.029 29.710 0.000 0.809 0.923
## Yr2 0.4313 0.033 13.267 0.000 0.367 0.495
## Yr1 0.0760 0.065 1.163 0.245 -0.052 0.204
## ==============================================================================
## Omnibus: 0.762 Durbin-Watson: 2.006
## Prob(Omnibus): 0.683 Jarque-Bera (JB): 0.795
## Skew: 0.067 Prob(JB): 0.672
## Kurtosis: 2.961 Cond. No. 858.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
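If only specific statistics are needed rather than the full summary, they can be accessed directly from the fitted results object. For example:

# individual statistics are available from the results object
print(ugtests_model.params)      # coefficient estimates
print(ugtests_model.conf_int())  # confidence intervals
print(ugtests_model.rsquared)    # R-squared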
10.2.2 Binomial logistic regression
Binomial logistic regression models can be generated in a similar way to OLS linear regression models using the statsmodels
formula API, calling the binomial family from the general statsmodels
API.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# obtain salespeople data
= "http://peopleanalytics-regression-book.org/data/salespeople.csv"
url = pd.read_csv(url)
salespeople
# define model
= smf.glm(formula = "promoted ~ sales + customer_rate",
model = salespeople,
data = sm.families.Binomial())
family
# fit model
= model.fit()
promotion_model
# see results summary
print(promotion_model.summary())
## Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable: promoted No. Observations: 350
## Model: GLM Df Residuals: 347
## Model Family: Binomial Df Model: 2
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -32.566
## Date: Mon, 25 Jan 2021 Deviance: 65.131
## Time: 16:10:40 Pearson chi2: 198.
## No. Iterations: 9
## Covariance Type: nonrobust
## =================================================================================
## coef std err z P>|z| [0.025 0.975]
## ---------------------------------------------------------------------------------
## Intercept -19.5177 3.347 -5.831 0.000 -26.078 -12.958
## sales 0.0404 0.007 6.189 0.000 0.028 0.053
## customer_rate -1.1221 0.467 -2.403 0.016 -2.037 -0.207
## =================================================================================
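Odds ratios for this model can be obtained by exponentiating the coefficients and their confidence intervals, for example:

import numpy as np

# odds ratios and 95% confidence intervals
print(np.exp(promotion_model.params))
print(np.exp(promotion_model.conf_int()))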
10.2.3 Multinomial logistic regression
Multinomial logistic regression is similarly available using the statsmodels
formula API. As usual, care must be taken to ensure that the reference category is appropriately defined, dummy input variables need to be explictly constructed, and a constant input variable must be added to ensure an intercept is calculated.
import pandas as pd
import statsmodels.api as sm

# load health insurance data
url = "http://peopleanalytics-regression-book.org/data/health_insurance.csv"
health_insurance = pd.read_csv(url)

# convert product to categorical as an outcome variable
y = pd.Categorical(health_insurance['product'])

# create dummies for gender
X1 = pd.get_dummies(health_insurance['gender'], drop_first = True)

# replace back into input variables
X2 = health_insurance.drop(['product', 'gender'], axis = 1)
X = pd.concat([X1, X2], axis = 1)

# add a constant column so that an intercept is calculated
Xc = sm.add_constant(X)

# define model
model = sm.MNLogit(y, Xc)

# fit model
insurance_model = model.fit()
# see results summary
print(insurance_model.summary())
## MNLogit Regression Results
## ==============================================================================
## Dep. Variable: y No. Observations: 1453
## Model: MNLogit Df Residuals: 1439
## Method: MLE Df Model: 12
## Date: Mon, 25 Jan 2021 Pseudo R-squ.: 0.5332
## Time: 16:10:40 Log-Likelihood: -744.68
## converged: True LL-Null: -1595.3
## Covariance Type: nonrobust LLR p-value: 0.000
## ==================================================================================
## y=B coef std err z P>|z| [0.025 0.975]
## ----------------------------------------------------------------------------------
## const -4.6010 0.511 -9.012 0.000 -5.602 -3.600
## Male -2.3826 0.232 -10.251 0.000 -2.838 -1.927
## Non-binary 0.2528 1.226 0.206 0.837 -2.151 2.656
## age 0.2437 0.015 15.790 0.000 0.213 0.274
## children -0.9677 0.069 -13.938 0.000 -1.104 -0.832
## position_level -0.4153 0.089 -4.658 0.000 -0.590 -0.241
## tenure 0.0117 0.013 0.900 0.368 -0.014 0.037
## ----------------------------------------------------------------------------------
## y=C coef std err z P>|z| [0.025 0.975]
## ----------------------------------------------------------------------------------
## const -10.2261 0.620 -16.501 0.000 -11.441 -9.011
## Male 0.0967 0.195 0.495 0.621 -0.286 0.480
## Non-binary -1.2698 2.036 -0.624 0.533 -5.261 2.721
## age 0.2698 0.016 17.218 0.000 0.239 0.301
## children 0.2043 0.050 4.119 0.000 0.107 0.302
## position_level -0.2136 0.082 -2.597 0.009 -0.375 -0.052
## tenure 0.0033 0.012 0.263 0.793 -0.021 0.028
## ==================================================================================
10.2.4 Structural equation models
The semopy package is a specialized package for the implementation of structural equation models in Python, and its usage is very similar to that of the lavaan package in R. However, its reporting is not as intuitive as that of lavaan. A full tutorial is available here. Here is an example of how to run the same model as that studied in Section 8.2 using semopy.
import pandas as pd
from semopy import Model
# get data
= "http://peopleanalytics-regression-book.org/data/politics_survey.csv"
url = pd.read_csv(url)
politics_survey
# define full measurement and strucural model
= """
measurement_model # measurement model
Pol =~ Pol1 + Pol2
Hab =~ Hab1 + Hab2 + Hab3
Loc =~ Loc2 + Loc3
Env =~ Env1 + Env2
Int =~ Int1 + Int2
Pers =~ Pers2 + Pers3
Nat =~ Nat1 + Nat2
Eco =~ Eco1 + Eco2
# structural model
Overall ~ Pol + Hab + Loc + Env + Int + Pers + Nat + Eco
"""
full_model = Model(measurement_model)
# fit model to data and inspect
full_model.fit(politics_survey)
Then to inspect the results:
# inspect the results of SEM
full_model.inspect()
## lval op rval Estimate Std. Err z-value p-value
## 0 Pol1 ~ Pol 1.000000 - - -
## 1 Pol2 ~ Pol 0.704494 0.0285709 24.6577 0
## 2 Hab1 ~ Hab 1.000000 - - -
## 3 Hab2 ~ Hab 1.182646 0.0308335 38.3558 0
## 4 Hab3 ~ Hab 1.124635 0.0305709 36.7878 0
## .. ... .. ... ... ... ... ...
## 74 Int1 ~~ Int1 0.522234 0.0232766 22.436 0
## 75 Nat1 ~~ Nat1 0.216159 0.00951019 22.7292 0
## 76 Pers3 ~~ Pers3 0.216122 0.0102881 21.0069 0
## 77 Hab3 ~~ Hab3 0.258835 0.0116792 22.162 0
## 78 Eco1 ~~ Eco1 0.159955 0.010466 15.2834 0
##
## [79 rows x 7 columns]
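Overall fit measures such as CFI and RMSEA are not part of the inspect() output. Assuming a recent version of semopy, they can be obtained with the calc_stats() function, for example:

from semopy import calc_stats

# view fit statistics (e.g. CFI, RMSEA, AIC) for the fitted model
stats = calc_stats(full_model)
print(stats.T)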
10.2.5 Survival analysis
The lifelines package in Python is designed to support survival analysis, with functions to calculate survival estimates, plot survival curves, perform Cox proportional hazard regression and check proportional hazard assumptions. A full tutorial is available here.
Here is an example of how to plot Kaplan-Meier survival curves in Python using our Chapter 9 walkthrough example. The survival curves are displayed in Figure 10.1.
import pandas as pd
from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
# get data
= "http://peopleanalytics-regression-book.org/data/job_retention.csv"
url = pd.read_csv(url)
job_retention
# fit our data to Kaplan-Meier estimates
= job_retention["month"]
T = job_retention["left"]
E = KaplanMeierFitter()
kmf = E)
kmf.fit(T, event_observed
# split into high and not high sentiment
= (job_retention["sentiment"] >= 7)
highsent
# set up plot
= plt.subplot()
survplot
# plot high sentiment survival function
= E[highsent],
kmf.fit(T[highsent], event_observed = "High Sentiment")
label
= survplot)
kmf.plot_survival_function(ax
# plot not high sentiment survival function
~highsent], event_observed = E[~highsent],
kmf.fit(T[= "Not High Sentiment")
label
= survplot)
kmf.plot_survival_function(ax
# show survival curves by sentiment category
plt.show()

Figure 10.1: Survival curves by sentiment category in the job retention data
And here is an example of how to fit a Cox proportional hazard model similarly to Section 9.2.
from lifelines import CoxPHFitter
# fit Cox PH model to job_retention data
cph = CoxPHFitter()
cph.fit(job_retention, duration_col = 'month', event_col = 'left',
        formula = "gender + field + level + sentiment")
# view results
cph.print_summary()
## <lifelines.CoxPHFitter: fitted with 3770 total observations, 2416 right-censored observations>
## duration col = 'month'
## event col = 'left'
## baseline estimation = breslow
## number of observations = 3770
## number of events observed = 1354
## partial log-likelihood = -10722.96
## time fit was run = 2021-01-25 16:10:42 UTC
##
## ---
## coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
## covariate
## gender[T.M] -0.05 0.95 0.06 -0.16 0.07 0.85 1.07
## field[T.Finance] 0.22 1.25 0.07 0.09 0.35 1.09 1.42
## field[T.Health] 0.28 1.32 0.13 0.02 0.53 1.02 1.70
## field[T.Law] 0.10 1.11 0.15 -0.18 0.39 0.83 1.47
## field[T.Public/Government] 0.11 1.12 0.09 -0.06 0.29 0.94 1.33
## field[T.Sales/Marketing] 0.09 1.09 0.10 -0.11 0.29 0.89 1.33
## level[T.Low] 0.14 1.15 0.09 -0.03 0.32 0.97 1.38
## level[T.Medium] 0.17 1.19 0.10 -0.03 0.37 0.97 1.45
## sentiment -0.12 0.89 0.01 -0.15 -0.09 0.86 0.91
##
## z p -log2(p)
## covariate
## gender[T.M] -0.79 0.43 1.22
## field[T.Finance] 3.31 <0.005 10.06
## field[T.Health] 2.15 0.03 4.99
## field[T.Law] 0.69 0.49 1.03
## field[T.Public/Government] 1.25 0.21 2.23
## field[T.Sales/Marketing] 0.86 0.39 1.36
## level[T.Low] 1.59 0.11 3.16
## level[T.Medium] 1.69 0.09 3.45
## sentiment -8.57 <0.005 56.42
## ---
## Concordance = 0.58
## Partial AIC = 21463.93
## log-likelihood ratio test = 92.29 on 9 df
## -log2(p) of ll-ratio test = 50.65
Proportional hazard assumptions can be checked using the check_assumptions() method.
cph.check_assumptions(job_retention, p_value_threshold = 0.05)
## The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
## covariates will be below the threshold by chance. This is compounded when there are many covariates.
## Similarly, when there are lots of observations, even minor deviances from the proportional hazard
## assumption will be flagged.
##
## With that in mind, it's best to use a combination of statistical tests and visual tests to determine
## the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
## and looking for non-constant lines. See link [A] below for a full example.
##
## <lifelines.StatisticalResult: proportional_hazard_test>
## null_distribution = chi squared
## degrees_of_freedom = 1
## model = <lifelines.CoxPHFitter: fitted with 3770 total observations, 2416 right-censored observations>
## test_name = proportional_hazard_test
##
## ---
## test_statistic p -log2(p)
## field[T.Finance] km 1.24 0.27 1.91
## rank 1.13 0.29 1.79
## field[T.Health] km 4.29 0.04 4.70
## rank 4.11 0.04 4.55
## field[T.Law] km 1.18 0.28 1.86
## rank 0.89 0.35 1.53
## field[T.Public/Government] km 1.94 0.16 2.62
## rank 1.89 0.17 2.57
## field[T.Sales/Marketing] km 1.97 0.16 2.64
## rank 2.19 0.14 2.85
## gender[T.M] km 0.40 0.53 0.93
## rank 0.38 0.54 0.89
## level[T.Low] km 1.48 0.22 2.16
## rank 1.46 0.23 2.14
## level[T.Medium] km 0.08 0.78 0.37
## rank 0.12 0.73 0.45
## sentiment km 1.94 0.16 2.61
## rank 1.54 0.21 2.22
##
##
## 1. Variable 'field[T.Health]' failed the non-proportional test: p-value is 0.0383.
##
## Advice: with so few unique values (only 2), you can include `strata=['field[T.Health]', ...]` in
## the call in `.fit`. See documentation in link [E] below.
##
## ---
## [A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
## [B] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
## [C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
## [D] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
## [E] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification
##
## []
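Where a variable fails the proportional hazard test, one of the options suggested in the output above is stratification. As a brief sketch (not run here), the model could be refitted with the field variable as a stratum rather than a covariate:

from lifelines import CoxPHFitter

# sketch: stratify on field instead of including it as a covariate
cph_strat = CoxPHFitter()
cph_strat.fit(job_retention, duration_col = 'month', event_col = 'left',
              strata = ['field'],
              formula = "gender + level + sentiment")
cph_strat.print_summary()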
10.2.6 Other model variants
Implementation of other model variants featured in earlier chapters becomes thinner in Python. However, the following are of note:

- Ordinal regression is a recent addition to the statsmodels package and, at the time of writing, is only available in the development version. The mord package offers an implementation for predictive analytics purposes, but for inferential modeling users will need to wait for the next release of statsmodels, or install the development version from source for immediate use (see the sketch below).
- Mixed models currently only have an implementation for linear mixed modeling in statsmodels (also sketched below). Generalized linear mixed models equivalent to those found in the lme4 R package are not yet available in Python.
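For ordinal regression, the following is a minimal sketch assuming a statsmodels installation that includes the OrderedModel class (only in the development version at the time of writing). It fits a cumulative logit (proportional odds) model to the soccer discipline data, using only the numeric inputs for simplicity:

import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# get soccer data and define an ordered outcome
url = "http://peopleanalytics-regression-book.org/data/soccer.csv"
soccer = pd.read_csv(url)
soccer['discipline'] = pd.Categorical(soccer['discipline'],
                                      categories = ['None', 'Yellow', 'Red'],
                                      ordered = True)

# cumulative logit (proportional odds) model on numeric inputs
model = OrderedModel(soccer['discipline'],
                     soccer[['n_yellow_25', 'n_red_25']],
                     distr = 'logit')
discipline_model = model.fit(method = 'bfgs')
print(discipline_model.summary())

Linear mixed models can be fitted with the mixedlm() function in the statsmodels formula API. The following sketch fits a random intercept model on the speed dating data from Chapter 8; note that, unlike the binomial mixed model used in that chapter, it treats the numeric prob rating as the outcome, since only linear mixed models are currently supported:

import pandas as pd
import statsmodels.formula.api as smf

# get speed dating data
url = "http://peopleanalytics-regression-book.org/data/speed_dating.csv"
speed_dating = pd.read_csv(url)

# linear mixed model with a random intercept for each rater (iid)
model = smf.mixedlm("prob ~ agediff + samerace + attr + intel",
                    data = speed_dating,
                    groups = speed_dating["iid"])
dating_model = model.fit()
print(dating_model.summary())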