# 10 Alternative Technical Approaches in R and Python

As outlined earlier in this book, all technical implementations of the modeling techniques in previous chapters have relied wherever possible on base R code and specialist packages for specific methodologies—this allowed a focus on the basics of understanding, running and interpreting these models which is the key aim of this book. For those interested in a wider range of technical options for running inferential statistical models, this chapter illustrates some alternative options and should be considered a starting point for those interested rather than an in-depth exposition.

First we look at options for generating models in more predictable formats in R. We have seen in prior chapters that the output of many models in R can be inconsistent. In many cases we are given more information than we need and in some cases we have less than we need. Formats can vary and we sometimes need to look in different parts of the output to see specific statistics that we seek. The `tidymodels`

set of packages tries to bring the principles of tidy data into the realm of statistical modeling and we will illustrate this briefly.

Second, for those whose preference is to use Python, we provide some examples for how inferential regression models can be run in Python. While Python is particularly well-tooled for running predictive models, it does not have the full range of statistical inference tools that are available in R. In particular, using predictive modeling or machine learning packages like `scikit-learn`

to conduct regression modeling can often leave the analyst lacking when seeking information about certain model statistics when those statistics are not typically sought after in a predictive modeling workflow. We briefly illustrate some Python packages which perform modeling with a greater emphasis on inference versus prediction.

## 10.1 ‘Tidier’ modeling approaches in R

The `tidymodels`

meta-package is a collection of packages which collectively apply the principles of tidy data to the construction of statistical models. More information and learning resources on `tidymodels`

can be found here. Within `tidymodels`

there are two packages which are particular useful in controlling the output of models in R: the `broom`

and `parsnip`

packages.

### 10.1.1 The `broom`

package

Consistent with how it is named, `broom`

aims to tidy up the output of the models into a predictable format. It works with over 100 different types of models in R. In order to illustrate its use, let’s run a model from a previous chapter—specifically our salesperson promotion model in Chapter 5.

```
# obtain salespeople data
<- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
url <- read.csv(url) salespeople
```

As in Chapter 5, we convert the `promoted`

column to a factor and run a binomial logistic regression model on the `promoted`

outcome.

```
# convert promoted to factor
$promoted <- as.factor(salespeople$promoted)
salespeople
# build model to predict promotion based on sales and customer_rate
<- glm(formula = promoted ~ sales + customer_rate,
promotion_model family = "binomial",
data = salespeople)
```

We now have our model sitting in memory. We can use three key functions in the `broom`

package to view a variety of model statistics. First, the `tidy()`

function allows us to see the coefficient statistics of the model.

```
# load tidymodels metapackage
library(tidymodels)
# view coefficient statistics
::tidy(promotion_model) broom
```

```
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -19.5 3.35 -5.83 5.48e- 9
## 2 sales 0.0404 0.00653 6.19 6.03e-10
## 3 customer_rate -1.12 0.467 -2.40 1.63e- 2
```

The `glance()`

function allows us to see a row of overall model statistics:

```
# view model statistics
::glance(promotion_model) broom
```

```
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 440. 349 -32.6 71.1 82.7 65.1 347 350
```

The `augment()`

function augments the observations in the data set with a range of observation-level model statistics such as residuals:

```
# view augmented data
head(broom::augment(promotion_model))
```

```
## # A tibble: 6 x 9
## promoted sales customer_rate .fitted .resid .std.resid .hat .sigma .cooksd
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 594 3.94 0.0522 -1.20 -1.22 0.0289 0.429 1.08e- 2
## 2 0 446 4.06 -6.06 -0.0683 -0.0684 0.00212 0.434 1.66e- 6
## 3 1 674 3.83 3.41 0.255 0.257 0.0161 0.434 1.84e- 4
## 4 0 525 3.62 -2.38 -0.422 -0.425 0.0153 0.433 4.90e- 4
## 5 1 657 4.4 2.08 0.485 0.493 0.0315 0.433 1.40e- 3
## 6 1 918 4.54 12.5 0.00278 0.00278 0.0000174 0.434 2.24e-11
```

These functions are model-agnostic for a very wide range of common models in R. For example, we can use them on our proportional odds model on soccer discipline from Chapter 7 and they will generate the relevant statistics in tidy tables.

```
# get soccer data
<- "http://peopleanalytics-regression-book.org/data/soccer.csv"
url <- read.csv(url)
soccer
# convert discipline to ordered factor
$discipline <- ordered(soccer$discipline,
soccerlevels = c("None", "Yellow", "Red"))
# run proportional odds model
library(MASS)
<- polr(
soccer_model formula = discipline ~ n_yellow_25 + n_red_25 + position +
+ level + result,
country data = soccer
)
# view model statistics
::glance(soccer_model) broom
```

```
## # A tibble: 1 x 7
## edf logLik AIC BIC deviance df.residual nobs
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10 -1722. 3465. 3522. 3445. 2281 2291
```

`broom`

functions integrate well into other tidyverse methods, and allow easy running of models over nested subsets of data. For example, if we want to run our soccer discipline model across the different countries in the data set and see all the model statistics in a neat table, we can use typical tidyverse grammar to do so using `dplyr`

.

```
# load the tidyverse metapackage (includes dplyr)
library(tidyverse)
# define function to run soccer model and glance at results
<- function(form, df) {
soccer_model_glance <- polr(formula = form, data = df)
model ::glance(model)
broom
}
# run it nested by country
%>%
soccer ::nest_by(country) %>%
dplyr::summarise(
dplyrsoccer_model_glance("discipline ~ n_yellow_25 + n_red_25", data)
)
```

```
## # A tibble: 2 x 8
## # Groups: country [2]
## country edf logLik AIC BIC deviance df.residual nobs
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 England 4 -883. 1773. 1794. 1765. 1128 1132
## 2 Germany 4 -926. 1861. 1881. 1853. 1155 1159
```

In a similar way, by putting model formulas in a dataframe column, numerous models can be run in a single command and results viewed in a tidy dataframe.

```
# create model formula column
<- c(
formula "discipline ~ n_yellow_25",
"discipline ~ n_yellow_25 + n_red_25",
"discipline ~ n_yellow_25 + n_red_25 + position"
)
# create dataframe
<- data.frame(formula)
models
# run models and glance at results
%>%
models ::group_by(formula) %>%
dplyr::summarise(soccer_model_glance(formula, soccer)) dplyr
```

```
## # A tibble: 3 x 8
## formula edf logLik AIC BIC deviance df.residual nobs
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 discipline ~ n_yellow_25 3 -1861. 3728. 3745. 3722. 2288 2291
## 2 discipline ~ n_yellow_25 + n_red_25 4 -1809. 3627. 3650. 3619. 2287 2291
## 3 discipline ~ n_yellow_25 + n_red_25 + position 6 -1783. 3579. 3613. 3567. 2285 2291
```

### 10.1.2 The `parsnip`

package

The `parsnip`

package aims create a unified interface to running models, to avoid users needing to understand different model terminology and other minutiae. It also takes a more hierarchical approach to defining models that is similar in nature to the object-oriented approaches that Python users would be more familiar with.

Again let’s use our salesperson promotion model example to illustrate. We start by defining a model family that we wish to use, in this case logistic regression, and define a specific engine and mode.

```
<- parsnip::logistic_reg() %>%
model ::set_engine("glm") %>%
parsnip::set_mode("classification") parsnip
```

We can use the `translate()`

function to see what kind of model we have created:

```
%>%
model ::translate() parsnip
```

```
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
##
## Model fit template:
## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
## family = stats::binomial)
```

Now with our model defined, we can fit it using a formula and data and then use `broom`

to view the coefficients:

```
%>%
model ::fit(formula = promoted ~ sales + customer_rate,
parsnipdata = salespeople) %>%
::tidy() broom
```

```
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -19.5 3.35 -5.83 5.48e- 9
## 2 sales 0.0404 0.00653 6.19 6.03e-10
## 3 customer_rate -1.12 0.467 -2.40 1.63e- 2
```

`parsnip`

functions are particularly motivated around tooling for machine learning model workflows in a similar way to `scikit-learn`

in Python, but they can offer an attractive approach to coding inferential models, particularly where common families of models are used.

## 10.2 Inferential statistical modeling in Python

In general, the modeling functions contained in `scikit-learn`

—which tends to be the go-to modeling package for most Python users—are oriented towards predictive modeling and can be challenging to navigate for those who are primarily interested in inferential modeling. In this section we will briefly review approaches for running some of the models contained in this book in Python. The `statsmodels`

package is highly recommended as it offers a wide range of models which report similar statistics to those reviewed in this book. Full `statsmodels`

documentation can be found here.

### 10.2.1 Ordinary least squares (OLS) linear regression

The OLS linear regression model reviewed in Chapter 4 can be generated using the `statsmodels`

package, which can report a reasonably thorough set of model statistics. By using the `statsmodels`

formula API, model formulas similar to those used in R can be used.

```
import pandas as pd
import statsmodels.formula.api as smf
# get data
= "http://peopleanalytics-regression-book.org/data/ugtests.csv"
url = pd.read_csv(url)
ugtests
# define model
= smf.ols(formula = "Final ~ Yr3 + Yr2 + Yr1", data = ugtests)
model
# fit model
= model.fit()
ugtests_model
# see results summary
print(ugtests_model.summary())
```

```
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Final R-squared: 0.530
## Model: OLS Adj. R-squared: 0.529
## Method: Least Squares F-statistic: 365.5
## Date: Tue, 02 Mar 2021 Prob (F-statistic): 8.22e-159
## Time: 16:54:34 Log-Likelihood: -4711.6
## No. Observations: 975 AIC: 9431.
## Df Residuals: 971 BIC: 9451.
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 14.1460 5.480 2.581 0.010 3.392 24.900
## Yr3 0.8657 0.029 29.710 0.000 0.809 0.923
## Yr2 0.4313 0.033 13.267 0.000 0.367 0.495
## Yr1 0.0760 0.065 1.163 0.245 -0.052 0.204
## ==============================================================================
## Omnibus: 0.762 Durbin-Watson: 2.006
## Prob(Omnibus): 0.683 Jarque-Bera (JB): 0.795
## Skew: 0.067 Prob(JB): 0.672
## Kurtosis: 2.961 Cond. No. 858.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

### 10.2.2 Binomial logistic regression

Binomial logistic regression models can be generated in a similar way to OLS linear regression models using the `statsmodels`

formula API, calling the binomial family from the general `statsmodels`

API.

```
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# obtain salespeople data
= "http://peopleanalytics-regression-book.org/data/salespeople.csv"
url = pd.read_csv(url)
salespeople
# define model
= smf.glm(formula = "promoted ~ sales + customer_rate",
model = salespeople,
data = sm.families.Binomial())
family
# fit model
= model.fit()
promotion_model
# see results summary
print(promotion_model.summary())
```

```
## Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable: promoted No. Observations: 350
## Model: GLM Df Residuals: 347
## Model Family: Binomial Df Model: 2
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -32.566
## Date: Tue, 02 Mar 2021 Deviance: 65.131
## Time: 16:54:34 Pearson chi2: 198.
## No. Iterations: 9
## Covariance Type: nonrobust
## =================================================================================
## coef std err z P>|z| [0.025 0.975]
## ---------------------------------------------------------------------------------
## Intercept -19.5177 3.347 -5.831 0.000 -26.078 -12.958
## sales 0.0404 0.007 6.189 0.000 0.028 0.053
## customer_rate -1.1221 0.467 -2.403 0.016 -2.037 -0.207
## =================================================================================
```

### 10.2.3 Multinomial logistic regression

Multinomial logistic regression is similarly available using the `statsmodels`

formula API. As usual, care must be taken to ensure that the reference category is appropriately defined, dummy input variables need to be explicitly constructed, and a constant term must be added to ensure an intercept is calculated.

```
import pandas as pd
import statsmodels.api as sm
# load health insurance data
= "http://peopleanalytics-regression-book.org/data/health_insurance.csv"
url = pd.read_csv(url)
health_insurance
# convert product to categorical as an outcome variable
= pd.Categorical(health_insurance['product'])
y
# create dummies for gender
= pd.get_dummies(health_insurance['gender'], drop_first = True)
X1
# replace back into input variables
= health_insurance.drop(['product', 'gender'], axis = 1)
X2 = pd.concat([X1, X2], axis = 1)
X
# add a constant term to ensure intercept is calculated
= sm.add_constant(X)
Xc
# define model
= sm.MNLogit(y, Xc)
model
# fit model
= model.fit()
insurance_model
# see results summary
print(insurance_model.summary())
```

```
## MNLogit Regression Results
## ==============================================================================
## Dep. Variable: y No. Observations: 1453
## Model: MNLogit Df Residuals: 1439
## Method: MLE Df Model: 12
## Date: Tue, 02 Mar 2021 Pseudo R-squ.: 0.5332
## Time: 16:54:34 Log-Likelihood: -744.68
## converged: True LL-Null: -1595.3
## Covariance Type: nonrobust LLR p-value: 0.000
## ==================================================================================
## y=B coef std err z P>|z| [0.025 0.975]
## ----------------------------------------------------------------------------------
## const -4.6010 0.511 -9.012 0.000 -5.602 -3.600
## Male -2.3826 0.232 -10.251 0.000 -2.838 -1.927
## Non-binary 0.2528 1.226 0.206 0.837 -2.151 2.656
## age 0.2437 0.015 15.790 0.000 0.213 0.274
## household -0.9677 0.069 -13.938 0.000 -1.104 -0.832
## position_level -0.4153 0.089 -4.658 0.000 -0.590 -0.241
## absent 0.0117 0.013 0.900 0.368 -0.014 0.037
## ----------------------------------------------------------------------------------
## y=C coef std err z P>|z| [0.025 0.975]
## ----------------------------------------------------------------------------------
## const -10.2261 0.620 -16.501 0.000 -11.441 -9.011
## Male 0.0967 0.195 0.495 0.621 -0.286 0.480
## Non-binary -1.2698 2.036 -0.624 0.533 -5.261 2.721
## age 0.2698 0.016 17.218 0.000 0.239 0.301
## household 0.2043 0.050 4.119 0.000 0.107 0.302
## position_level -0.2136 0.082 -2.597 0.009 -0.375 -0.052
## absent 0.0033 0.012 0.263 0.793 -0.021 0.028
## ==================================================================================
```

### 10.2.4 Structural equation models

The `semopy`

package is a specialized package for the implementation of Structural Equation Models in Python, and its implementation is very similar to the `lavaan`

package in R. However, its reporting is not as intuitive compared to `lavaan`

. A full tutorial is available here. Here is an example of how to run the same model as that studied in Section 8.2 using `semopy`

.

```
import pandas as pd
from semopy import Model
# get data
= "http://peopleanalytics-regression-book.org/data/politics_survey.csv"
url = pd.read_csv(url)
politics_survey
# define full measurement and structural model
= """
measurement_model # measurement model
Pol =~ Pol1 + Pol2
Hab =~ Hab1 + Hab2 + Hab3
Loc =~ Loc2 + Loc3
Env =~ Env1 + Env2
Int =~ Int1 + Int2
Pers =~ Pers2 + Pers3
Nat =~ Nat1 + Nat2
Eco =~ Eco1 + Eco2
# structural model
Overall ~ Pol + Hab + Loc + Env + Int + Pers + Nat + Eco
"""
= Model(measurement_model)
full_model
# fit model to data and inspect
full_model.fit(politics_survey)
```

Then to inspect the results:

```
# inspect the results of SEM (first few rows)
full_model.inspect().head()
```

```
## lval op rval Estimate Std. Err z-value p-value
## 0 Pol1 ~ Pol 1.000000 - - -
## 1 Pol2 ~ Pol 0.713719 0.0285052 25.0382 0
## 2 Hab1 ~ Hab 1.000000 - - -
## 3 Hab2 ~ Hab 1.183981 0.0306792 38.5923 0
## 4 Hab3 ~ Hab 1.127639 0.0304292 37.0578 0
```

### 10.2.5 Survival analysis

The `lifelines`

package in Python is designed to support survival analysis, with functions to calculate survival estimates, plot survival curves, perform Cox proportional hazard regression and check proportional hazard assumptions. A full tutorial is available here.

Here is an example of how to plot Kaplan-Meier survival curves in Python using our Chapter 9 walkthrough example. The survival curves are displayed in Figure 10.1.

```
import pandas as pd
from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
# get data
= "http://peopleanalytics-regression-book.org/data/job_retention.csv"
url = pd.read_csv(url)
job_retention
# fit our data to Kaplan-Meier estimates
= job_retention["month"]
T = job_retention["left"]
E = KaplanMeierFitter()
kmf = E)
kmf.fit(T, event_observed
# split into high and not high sentiment
= (job_retention["sentiment"] >= 7)
highsent
# set up plot
= plt.subplot()
survplot
# plot high sentiment survival function
= E[highsent],
kmf.fit(T[highsent], event_observed = "High Sentiment")
label
= survplot)
kmf.plot_survival_function(ax
# plot not high sentiment survival function
~highsent], event_observed = E[~highsent],
kmf.fit(T[= "Not High Sentiment")
label
= survplot)
kmf.plot_survival_function(ax
# show survival curves by sentiment category
plt.show()
```

And here is an example of how to fit a Cox Proportional Hazard model similarly to Section 9.2^{43}.

```
from lifelines import CoxPHFitter
# fit Cox PH model to job_retention data
= CoxPHFitter()
cph = 'month', event_col = 'left',
cph.fit(job_retention, duration_col = "gender + field + level + sentiment") formula
```

```
# view results
cph.print_summary()
```

```
## <lifelines.CoxPHFitter: fitted with 3770 total observations, 2416 right-censored observations>
## duration col = 'month'
## event col = 'left'
## baseline estimation = breslow
## number of observations = 3770
## number of events observed = 1354
## partial log-likelihood = -10724.52
## time fit was run = 2021-03-02 16:54:35 UTC
##
## ---
## coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
## covariate
## gender[T.M] -0.05 0.96 0.06 -0.16 0.07 0.85 1.07
## field[T.Finance] 0.22 1.25 0.07 0.09 0.35 1.10 1.43
## field[T.Health] 0.28 1.32 0.13 0.03 0.53 1.03 1.70
## field[T.Law] 0.11 1.11 0.15 -0.18 0.39 0.84 1.48
## field[T.Public/Government] 0.11 1.12 0.09 -0.06 0.29 0.94 1.34
## field[T.Sales/Marketing] 0.09 1.09 0.10 -0.11 0.29 0.89 1.33
## level[T.Low] 0.15 1.16 0.09 -0.03 0.32 0.97 1.38
## level[T.Medium] 0.18 1.19 0.10 -0.02 0.38 0.98 1.46
## sentiment -0.12 0.89 0.01 -0.14 -0.09 0.87 0.91
##
## z p -log2(p)
## covariate
## gender[T.M] -0.77 0.44 1.19
## field[T.Finance] 3.34 <0.005 10.24
## field[T.Health] 2.16 0.03 5.02
## field[T.Law] 0.73 0.47 1.10
## field[T.Public/Government] 1.29 0.20 2.35
## field[T.Sales/Marketing] 0.86 0.39 1.36
## level[T.Low] 1.65 0.10 3.32
## level[T.Medium] 1.73 0.08 3.58
## sentiment -8.41 <0.005 54.49
## ---
## Concordance = 0.58
## Partial AIC = 21467.04
## log-likelihood ratio test = 89.18 on 9 df
## -log2(p) of ll-ratio test = 48.58
```

Proportional Hazard assumptions can be checked using the `check_assumptions()`

method^{44}.

`= 0.05) cph.check_assumptions(job_retention, p_value_threshold `

```
## The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
## covariates will be below the threshold by chance. This is compounded when there are many covariates.
## Similarly, when there are lots of observations, even minor deviances from the proportional hazard
## assumption will be flagged.
##
## With that in mind, it's best to use a combination of statistical tests and visual tests to determine
## the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
## and looking for non-constant lines. See link [A] below for a full example.
##
## <lifelines.StatisticalResult: proportional_hazard_test>
## null_distribution = chi squared
## degrees_of_freedom = 1
## model = <lifelines.CoxPHFitter: fitted with 3770 total observations, 2416 right-censored observations>
## test_name = proportional_hazard_test
##
## ---
## test_statistic p -log2(p)
## field[T.Finance] km 1.20 0.27 1.88
## rank 1.09 0.30 1.76
## field[T.Health] km 4.27 0.04 4.69
## rank 4.10 0.04 4.54
## field[T.Law] km 1.14 0.29 1.81
## rank 0.85 0.36 1.49
## field[T.Public/Government] km 1.92 0.17 2.59
## rank 1.87 0.17 2.54
## field[T.Sales/Marketing] km 2.00 0.16 2.67
## rank 2.22 0.14 2.88
## gender[T.M] km 0.41 0.52 0.94
## rank 0.39 0.53 0.91
## level[T.Low] km 1.53 0.22 2.21
## rank 1.52 0.22 2.20
## level[T.Medium] km 0.09 0.77 0.38
## rank 0.13 0.72 0.47
## sentiment km 2.78 0.10 3.39
## rank 2.32 0.13 2.97
##
##
## 1. Variable 'field[T.Health]' failed the non-proportional test: p-value is 0.0387.
##
## Advice: with so few unique values (only 2), you can include `strata=['field[T.Health]', ...]` in
## the call in `.fit`. See documentation in link [E] below.
##
## ---
## [A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
## [B] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
## [C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
## [D] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
## [E] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification
##
## []
```

### 10.2.6 Other model variants

Implementation of other model variants featured in earlier chapters becomes thinner in Python. However, of note are the following:

Ordinal regression is not currently available in the release version of the

`statsmodels`

package, but is available in the development version. The`mord`

package offers an implementation of ordinal regression for predictive analytics purposes, but for inferential modeling users will need to wait for a release of`statsmodels`

that contains ordinal regression methods or for immediate use they will need to install the development version from source.Mixed models only currently have an implementation for linear mixed modeling in

`statsmodels`

. Generalized linear mixed models equivalent to those found in the`lme4`

R package are not yet available in Python.