8 Modeling Explicit and Latent Hierarchical Structure in Data
So far in this book we have learned all of the most widely used and foundational regression techniques for inferential modeling. Starting with this chapter, we will look at situations where we need to adapt or combine techniques to address certain inference goals or data characteristics. In this chapter we look at some situations where data has a hierarchy and where we wish to consider this hierarchy in our modeling efforts.
It is very often the case that data has an explicit hierarchy. For example, each observation in our data may refer to a different individual and each such individual may be a member of a few different groups. Similarly, each observation might refer to an event involving an individual, and we may have data on multiple events for the same individual. For a particular problem that we are modeling, we may wish to take into consideration the effect of the hierarchical grouping on the observations. This requires a model which has a mixture of random effects and fixed effects - called a mixed model.
Separately, it can be the case that data we are given could have a latent hierarchy. The input variables in the data might be measures of a smaller set of higher level latent constructs, and we may have a more interpretable model if we hypothesize, confirm and model those latent constructs against our outcome of interest rather than using a larger number of explicit input variables. Latent variable modeling is a common technique to address this situation, and in this chapter we will review a form of latent variable modeling called structural equation modeling which is very effective especially in making inferences from survey instruments with large numbers of items.
These topics are quite broad and there are many different approaches, techniques and terms involved in mixed modeling and latent variable modeling. In this chapter we will only cover some of the simpler approaches, which would suffice for the majority of common situations in people analytics. For a deeper treatment of these topics, see Jiang (2007) for mixed models and David J. Bartholomew (2011) or Anders Skrondal (2004) for latent variable models.
8.1 Mixed models for explicit hierarchy in data
The most common explicit hierarchies that we see in data are group-based and time-based. A group-based hierarchy occurs when we are taking observations that belong to different groups. For example, in our first walkthrough example in Chapter 4, we modeled final examination performance against examination performance for the previous three years. In this case we considered each student observation to be independent and identically distributed and we ran a linear regression model on all the students. If we were to receive additional information that these students were actually a mix of students on different degree programs, then we may wish to take this into account in how we model the problem - that is, we would want to assume that each student observation is only independent and identically distributed within each degree program.
Similarly, a time-based hierarchy occurs when we have multiple observations of the same subject taken at different times. For example, if we are conducting a weekly survey on the same people over the course of a year, and we are modeling how answers to some questions might depend on answers to others, we may wish to consider the effect of the person on this model.
Both of these situations introduce a new grouping variable into the problem we are modeling, thus creating a hierarchy. It is not hard to imagine that analyzing each group may produce different statistical properties compared to analyzing the entire population - for example there could be correlations between the data inside groups which are less evident when looking at the overall population. It follows therefore that in some cases a model may provide more accurate and reliable estimates if this grouping is taken into account.
8.1.1 Fixed and random effects
Let’s imagine that we have a set of observations consisting of a continuous output variable \(y\) and input variables \(x_1, x_2, ..., x_p\). Let’s also assume that we have an additional data point for each observation where we assign it to a group \(G\). We are asked to determine the relationship between the outcome and the input variables. One option is to develop a linear model \(y = \beta_0 + \beta_1x_1 + ... + \beta_px_p + \epsilon\), ignoring the group data. In this model, we assume that the input variables each have a fixed effect on the outcome - that is, they act on every observation in the same way. This may be fine if there is trust that group membership is unlikely to have any impact on the relationship being modeled, or if we are comfortable making inferences about variables at the observation level only.
If, however, there is a belief that group membership may have an effect on the relationship being modeled, and if we are interested in interpreting our model at the group and observation level, then we need to adjust our model to a mixed model for more accurate and reliable inference. The most common adjustment is a random intercept. In this situation, we imagine that group membership has an effect on the ‘starting point’ of the relationship: the intercept. Therefore, for a given observation \(y = \alpha_G + \beta_0 + \beta_1x_1 + ... + \beta_px_p + \epsilon\), where \(\alpha_G\) is a random effect with a mean of zero associated with the group that the observation is a member of. This can be restated as:
\[ y = \beta_G + \beta_1x_1 + ... + \beta_px_p + \epsilon \]
where \(\beta_G = \alpha_G + \beta_0\), which is a random intercept with a mean of \(\beta_0\).
This model is very similar to a standard linear regression model, except instead of having a fixed intercept, we have an intercept that varies by group. Therefore, we will essentially have two ‘levels’ in our model: one at the observation level to describe \(y\) and one at the group level to describe \(\beta_G\). For this reason mixed models are sometimes known as multi-level models.
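To make this concrete, here is a minimal simulation sketch of data generated with a random intercept. All names and parameter values here are invented for illustration:

# minimal sketch: simulate grouped data with a random intercept
set.seed(123)
n_groups <- 10
n_per_group <- 50
G <- rep(1:n_groups, each = n_per_group)

# alpha_G: a random effect with mean zero, one value per group
alpha_G <- rnorm(n_groups, mean = 0, sd = 2)
x1 <- rnorm(n_groups * n_per_group)

# 'true' values beta_0 = 1 and beta_1 = 0.5, with epsilon ~ N(0, 1)
y <- 1 + alpha_G[G] + 0.5 * x1 + rnorm(n_groups * n_per_group)
sim_data <- data.frame(y, x1, G = factor(G))

A standard linear regression on sim_data would estimate a single intercept and absorb the group-to-group variation into its error term; a mixed model would estimate the variance of alpha_G separately.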
It is not too difficult to see how this approach can be extended. For example, suppose that we believe the groups also have an effect on the coefficient of the input variable \(x_1\) as well as the intercept. Then
\[ y = \beta_{G0} + \beta_{G1}x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon \] where \(\beta_{G0}\) is a random intercept and \(\beta_{G1}\) is a random slope. In this case, a mixed model would return the estimated coefficients at the observation level and the statistics for the random effects \(\beta_{G0}\) and \(\beta_{G1}\) at the group level.
Finally, our model does not need to be linear for this to apply. This approach also extends to logistic models and other generalized linear models. For example, if \(y\) was a binary outcome variable and our model was a binomial logistic regression model, our last equation would translate to
\[ \mathrm{ln}\left(\frac{P(y = 1)}{P(y = 0)}\right) = \beta_{G0} + \beta_{G1}x_1 + \beta_2x_2 + ... + \beta_px_p \]
8.1.2 Running a mixed model
Let’s look at a fun and straightforward example of how mixed models can be useful. The speed dating data set, which can be downloaded from http://peopleanalytics-regression-book.org/data/speed_dating.csv, is a set of information captured during experiments with speed dating by students at Columbia University in New York. Each row represents one meeting between an individual and a partner of the opposite sex. The data contains the following fields:
- iid is an id number for the individual
- gender is the gender of the individual, with 0 as Female and 1 as Male
- match indicates that the meeting resulted in a match
- samerace indicates that both the individual and the partner were of the same race
- race is the race of the individual, coded as follows: Black/African American=1, European/Caucasian-American=2, Latino/Hispanic American=3, Asian/Pacific Islander/Asian-American=4, Native American=5, Other=6
- goal is the reason why the individual is participating in the event, coded as follows: Seemed like a fun night out=1, To meet new people=2, To get a date=3, Looking for a serious relationship=4, To say I did it=5, Other=6
- dec is a binary rating from the individual as to whether they would like to see their partner again (1 is Yes and 0 is No)
- attr is the individual’s rating out of ten on the attractiveness of the partner
- intel is the individual’s rating out of ten on the intelligence level of the partner
- prob is the individual’s rating out of ten on whether they believe the partner will want to see them again
- agediff is the absolute difference in the ages of the individual and the partner.
This data can be explored in numerous ways, but we will focus here on modeling options. We are interested in the binary outcome dec (the decision of the individual), and we would like to understand how it relates to the age difference, the racial similarity and the ratings on attr, intel and prob. First, let’s assume that we don’t care about how an individual makes up their mind about their speed date, and that we are only interested in the dynamics of speed date decisions. Then we would simply run a binomial logistic regression on our data set, ignoring iid and other grouping variables like race, goal and gender.
# if needed, get data
url <- "http://peopleanalytics-regression-book.org/data/speed_dating.csv"
speed_dating <- read.csv(url)

# run standard binomial model
model <- glm(dec ~ agediff + samerace + attr + intel + prob,
             data = speed_dating,
             family = "binomial")

summary(model)
##
## Call:
## glm(formula = dec ~ agediff + samerace + attr + intel + prob,
## family = "binomial", data = speed_dating)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6497 -0.8514 -0.3477 0.8809 2.8871
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.812900 0.184340 -31.534 <2e-16 ***
## agediff -0.010518 0.009029 -1.165 0.2440
## samerace -0.093422 0.055710 -1.677 0.0936 .
## attr 0.661139 0.019382 34.111 <2e-16 ***
## intel -0.004485 0.020763 -0.216 0.8290
## prob 0.270553 0.014565 18.575 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10647.3 on 7788 degrees of freedom
## Residual deviance: 8082.9 on 7783 degrees of freedom
## (589 observations deleted due to missingness)
## AIC: 8094.9
##
## Number of Fisher Scoring iterations: 5
In general, we see that the factors which significantly influence the speed dating decision seem to be the attractiveness of the partner and the feeling of reciprocation of interest from the partner, and that age difference, racial similarity and intelligence do not seem to play a significant role at the level of the speed date itself.
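As with the binomial regression models covered earlier in this book, the coefficients can be exponentiated to interpret them as odds ratios. A quick sketch on the model object above:

# exponentiate coefficients to view them as odds ratios
exp(coef(model))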
Now let’s say that we are interested in how a given individual weighs up these factors in coming to a decision. Different individuals may have different ingoing criteria for making speed dating decisions. As a result, each individual may have varying base likelihoods of a positive decision, and each individual may be affected by the input variables in different ways as they come to their decision. Therefore we will need to assign random effects for individuals based on iid. The lme4 package in R contains functions for performing mixed linear regression models and mixed generalized linear regression models. These functions take formulas with additional terms to define the random effects to be estimated. The function for a linear model is lmer() and for a generalized linear model it is glmer().
In the simple case, let’s assume that each individual has a different ingoing base likelihood of making a positive decision on a speed date. We will therefore model a random intercept according to the iid of the individual. Here we would use the formula dec ~ agediff + samerace + attr + intel + prob + (1 | iid), where (1 | iid) means ‘a random effect for iid on the intercept of the model.’
# run binomial mixed effects model
library(lme4)

iid_intercept_model <- lme4::glmer(
  dec ~ agediff + samerace + attr + intel + prob + (1 | iid),
  data = speed_dating,
  family = "binomial"
)
# view summary without correlation table of fixed effects
summary(iid_intercept_model, correlation = FALSE)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
## Family: binomial ( logit )
## Formula: dec ~ agediff + samerace + attr + intel + prob + (1 | iid)
## Data: speed_dating
##
## AIC BIC logLik deviance df.resid
## 6420.3 6469.0 -3203.1 6406.3 7782
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -25.6965 -0.3644 -0.0606 0.3608 25.0368
##
## Random effects:
## Groups Name Variance Std.Dev.
## iid (Intercept) 5.18 2.276
## Number of obs: 7789, groups: iid, 541
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.88882 0.42143 -30.583 < 2e-16 ***
## agediff -0.03671 0.01401 -2.621 0.00877 **
## samerace 0.20187 0.08139 2.480 0.01313 *
## attr 1.07894 0.03334 32.363 < 2e-16 ***
## intel 0.31592 0.03473 9.098 < 2e-16 ***
## prob 0.61998 0.02873 21.581 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see the two levels of results in this summary. The fixed effects level gives the coefficients of the model at an observation (speed date) level, and the random effects tell us how the intercept (or base likelihood) of that model can vary according to the individual. We see that there is considerable variance in the intercept from individual to individual, and taking this into account, we now see that the decision of an individual on a given date is significantly influenced by all the factors in this model. If we had stuck with the simple binomial model, the effects of age difference, racial similarity and intelligence at an individual level would have gotten lost, and we could have reached the erroneous conclusion that none of these really matter in speed dating.
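One simple way to check that the extra complexity of the mixed model is rewarded is to compare information criteria for the two models, which is reasonable here because both were fit on the same observations. A quick sketch, assuming both model objects from above are still in memory:

# compare the standard and mixed models (lower AIC indicates better fit)
AIC(model, iid_intercept_model)

From the two summaries above, we would expect the mixed model to show the substantially lower AIC (approximately 6420 versus 8095).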
If we believe that different individuals are influenced differently by one or more of the various decision factors they consider during a speed date, we can extend our random effects to the slope coefficients of our model. For example, we could use (1 + agediff | iid) to model a random effect of iid on the intercept and the agediff coefficient. Similarly, if we wanted to consider the effect of two grouping variables - like iid and goal - on the intercept, we could add both (1 | iid) and (1 | goal) to our model formula.
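As a sketch of how these extensions look in lme4 syntax (the model object names are arbitrary, and note that random slope models on data of this size can be slow to fit and may generate convergence warnings):

# random intercept and a random slope for agediff, varying by individual
iid_slope_model <- lme4::glmer(
  dec ~ agediff + samerace + attr + intel + prob + (1 + agediff | iid),
  data = speed_dating,
  family = "binomial"
)

# crossed random intercepts for the two grouping variables iid and goal
iid_goal_model <- lme4::glmer(
  dec ~ agediff + samerace + attr + intel + prob + (1 | iid) + (1 | goal),
  data = speed_dating,
  family = "binomial"
)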
8.2 Structural Equation Models for latent hierarchies in data
In this section we will focus entirely on survey data use cases, as this is the most common application of Structural Equation Modeling in People Analytics. However, it should be noted that survey data is not the only situation where latent variables may be modeled, and the technique has substantially broader applications. Indeed, advanced practitioners may see opportunities to experiment with it in other use cases.
It is frequently the case with surveys conducted on large samples of people, such as a public survey or a large company survey, that attempts to run regression models are problematic due to the large number of survey questions or items. Often many of the items are highly correlated, and even if they were not, high dimensionality makes interpretability very challenging. Decision-makers are not usually interested in explanations that involve 50 or 100 variables.
Usually, such a large number of survey items are not all independently measuring a different construct. Many of the items can be considered to be addressing similar thematic constructs. For example, the items ‘I believe I am compensated well’ and ‘I am happy with the benefits offered by my employer’ could both be considered to be related to employee rewards. In some cases, survey instruments can be explicitly constructed around these themes, and in other cases surveys have grown organically over time to include a disorganized set of items that could be grouped into themes after the fact.
It is a common request for an analyst to model a certain outcome using the many items in a complex survey as input variables. In some cases the outcome being modeled is an item in the survey itself - usually some overall measure of sentiment - or in other cases the outcome could be independent of the survey instrument, for example future attrition from the organization. In this situation, a model using the themes as input variables is likely to be a lot more useful and interpretable than a model using the items as input variables.
Structural Equation Modeling is a technique that allows an analyst to hypothesize a smaller set of latent variables or factors that explain the responses to the survey items themselves (the ‘measured variables’), and then regresses the outcome of interest against these factors. It is a two part approach, each part being a separate model in and of itself, as follows:
Measurement Model: This is focused on how well the hypothesized factors explain the responses to the survey items using a technique called factor analysis. In the most common case, where a subject matter expert has pre-organized the items into several groups corresponding to hypothesized latent variables, the process is called confirmatory factor analysis, and the objective is to confirm that the groupings represent a high-quality measurement model, adjusting as necessary to refine the model. In the simplest case, items are fitted into separate independent themes with no overlap.
Structural Model: Assuming a satisfactory measurement model, the structural model is effectively a regression model which explains how each of the proposed factors relate to the outcome of interest.
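Before turning to a walkthrough, it may help to sketch the two components in generic notation (conventional structural equation modeling symbols, not tied to the example that follows). In the simplest case, where each measured variable \(x_i\) loads on exactly one latent factor \(\xi_j\), the measurement model takes the form

\[ x_i = \lambda_{ij}\xi_j + \delta_i \]

and the structural model regresses the outcome on the factors:

\[ y = \gamma_1\xi_1 + \gamma_2\xi_2 + ... + \gamma_k\xi_k + \zeta \]

where the factor loadings \(\lambda_{ij}\) are estimated in the measurement model, the coefficients \(\gamma_1, ..., \gamma_k\) are estimated in the structural model, and \(\delta_i\) and \(\zeta\) are error terms.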
As a walkthrough example, we will work with the politics_survey data set.
# if needed, get data
url <- "http://peopleanalytics-regression-book.org/data/politics_survey.csv"
politics_survey <- read.csv(url)
This data set represents the results of a survey conducted by a political party on a set of approximately 2100 voters. The results are on a Likert scale of 1 to 10, where 1 indicates strong negative sentiment towards a statement and 10 indicates strong positive sentiment. Subject matter experts have already grouped the items into proposed latent variables or factors, and the data takes the following form:
- Overall represents the overall intention to vote for the party in the next election
- Items beginning with Pol are considered to be related to the policies of the political party
- Items beginning with Hab are considered to be related to prior voting habits in relation to the political party
- Items beginning with Loc are considered to be related to interest in local issues around where the respondent resided
- Items beginning with Env are considered to be related to interest in environmental issues
- Items beginning with Int are considered to be related to interest in international issues
- Items beginning with Pers are considered to be related to the personalities of the party representatives/leaders
- Items beginning with Nat are considered to be related to interest in national issues
- Items beginning with Eco are considered to be related to interest in economic issues
The outcome of interest here is the Overall rating. Our first aim is to confirm that the eight factors suggested by the subject matter experts represent a satisfactory measurement model (that is, that they reasonably explain the responses to the 21 measured items), adjusting or refining if needed. Assuming we can confirm a satisfactory measurement model, our second aim is to run a structural model to determine how each factor relates to the overall intention to vote for the party in the next election.
8.2.1 Running and assessing the measurement model
The proposed measurement model can be seen in Figure 8.1. In this path diagram, we see the eight latent variables or factors (circles) and how they map to the individual measured items (squares) in the survey using single headed arrows. Here we are making a simplifying assumption that each latent variable influences an independent group of survey items. The diagram also notes that the latent variables may be correlated with each other, as indicated by the double headed arrows at the top. Dashed line paths indicate that a specific item will be used to scale the variance of the latent factor.

Figure 8.1: Simple path diagram showing proposed measurement model for politics_survey
The lavaan package in R is a specialized package for running analysis on latent variables. The function cfa() can be used to perform a confirmatory factor analysis on a specified measurement model. The measurement model can be specified using an appropriately commented and formatted text string as follows. Note the =~ notation, and note also that each factor is defined on a new line.
# define measurement model
meas_mod <- "
# measurement model
Pol =~ Pol1 + Pol2 + Pol3
Hab =~ Hab1 + Hab2 + Hab3
Loc =~ Loc1 + Loc2 + Loc3
Env =~ Env1 + Env2
Int =~ Int1 + Int2
Pers =~ Pers1 + Pers2 + Pers3
Nat =~ Nat1 + Nat2 + Nat3
Eco =~ Eco1 + Eco2
"
With the measurement model defined, the confirmatory factor analysis can be run and a summary viewed. The lavaan summary functions used in this section produce quite large outputs. We will proceed to highlight which parts of this output are important in interpreting and refining the model.
library(lavaan)
cfa_meas_mod <- lavaan::cfa(model = meas_mod, data = politics_survey)
lavaan::summary(cfa_meas_mod, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-7 ended normally after 96 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 70
##
## Number of observations 2108
##
## Model Test User Model:
##
## Test statistic 845.400
## Degrees of freedom 161
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 16483.222
## Degrees of freedom 210
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.958
## Tucker-Lewis Index (TLI) 0.945
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -39288.067
## Loglikelihood unrestricted model (H1) -38865.367
##
## Akaike (AIC) 78716.134
## Bayesian (BIC) 79111.878
## Sample-size adjusted Bayesian (BIC) 78889.481
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.045
## 90 Percent confidence interval - lower 0.042
## 90 Percent confidence interval - upper 0.048
## P-value RMSEA <= 0.05 0.997
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.035
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Pol =~
## Pol1 1.000 0.571 0.775
## Pol2 0.875 0.032 27.104 0.000 0.500 0.732
## Pol3 0.486 0.024 20.217 0.000 0.277 0.505
## Hab =~
## Hab1 1.000 0.625 0.754
## Hab2 1.206 0.032 37.731 0.000 0.754 0.887
## Hab3 1.136 0.031 36.337 0.000 0.710 0.812
## Loc =~
## Loc1 1.000 0.346 0.594
## Loc2 1.371 0.052 26.249 0.000 0.475 0.826
## Loc3 1.517 0.058 25.999 0.000 0.526 0.801
## Env =~
## Env1 1.000 0.427 0.812
## Env2 0.596 0.039 15.157 0.000 0.255 0.598
## Int =~
## Int1 1.000 0.608 0.643
## Int2 1.289 0.064 20.031 0.000 0.783 0.854
## Pers =~
## Pers1 1.000 0.502 0.634
## Pers2 1.037 0.040 25.763 0.000 0.520 0.769
## Pers3 0.937 0.038 24.422 0.000 0.470 0.695
## Nat =~
## Nat1 1.000 0.521 0.753
## Nat2 1.002 0.032 30.956 0.000 0.522 0.749
## Nat3 0.990 0.040 24.580 0.000 0.516 0.585
## Eco =~
## Eco1 1.000 0.527 0.791
## Eco2 1.097 0.042 26.138 0.000 0.578 0.745
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Pol ~~
## Hab 0.166 0.011 14.913 0.000 0.465 0.465
## Loc 0.107 0.007 15.069 0.000 0.539 0.539
## Env 0.084 0.008 10.878 0.000 0.344 0.344
## Int 0.144 0.012 11.892 0.000 0.416 0.416
## Pers 0.163 0.010 15.562 0.000 0.569 0.569
## Nat 0.176 0.010 17.044 0.000 0.592 0.592
## Eco 0.151 0.010 15.105 0.000 0.503 0.503
## Hab ~~
## Loc 0.069 0.006 10.927 0.000 0.319 0.319
## Env 0.051 0.008 6.714 0.000 0.190 0.190
## Int 0.133 0.012 11.063 0.000 0.350 0.350
## Pers 0.121 0.010 12.452 0.000 0.387 0.387
## Nat 0.105 0.009 11.149 0.000 0.323 0.323
## Eco 0.089 0.009 9.531 0.000 0.272 0.272
## Loc ~~
## Env 0.076 0.005 14.566 0.000 0.516 0.516
## Int 0.091 0.008 11.925 0.000 0.432 0.432
## Pers 0.099 0.007 14.786 0.000 0.571 0.571
## Nat 0.116 0.007 16.709 0.000 0.646 0.646
## Eco 0.091 0.006 14.355 0.000 0.498 0.498
## Env ~~
## Int 0.081 0.009 9.496 0.000 0.312 0.312
## Pers 0.076 0.007 10.866 0.000 0.355 0.355
## Nat 0.093 0.007 12.880 0.000 0.417 0.417
## Eco 0.076 0.007 10.715 0.000 0.339 0.339
## Int ~~
## Pers 0.158 0.012 13.061 0.000 0.517 0.517
## Nat 0.186 0.013 14.584 0.000 0.586 0.586
## Eco 0.133 0.011 11.898 0.000 0.416 0.416
## Pers ~~
## Nat 0.190 0.011 17.896 0.000 0.725 0.725
## Eco 0.157 0.010 15.973 0.000 0.592 0.592
## Nat ~~
## Eco 0.195 0.010 19.229 0.000 0.710 0.710
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .Pol1 0.217 0.012 18.559 0.000 0.217 0.399
## .Pol2 0.217 0.010 21.614 0.000 0.217 0.465
## .Pol3 0.225 0.008 29.470 0.000 0.225 0.745
## .Hab1 0.297 0.012 25.821 0.000 0.297 0.432
## .Hab2 0.154 0.011 14.322 0.000 0.154 0.214
## .Hab3 0.260 0.012 21.927 0.000 0.260 0.340
## .Loc1 0.220 0.008 29.061 0.000 0.220 0.647
## .Loc2 0.105 0.006 18.203 0.000 0.105 0.318
## .Loc3 0.155 0.008 20.350 0.000 0.155 0.359
## .Env1 0.094 0.011 8.207 0.000 0.094 0.340
## .Env2 0.117 0.005 21.867 0.000 0.117 0.643
## .Int1 0.526 0.023 23.035 0.000 0.526 0.587
## .Int2 0.228 0.028 8.265 0.000 0.228 0.271
## .Pers1 0.374 0.014 27.058 0.000 0.374 0.598
## .Pers2 0.187 0.009 20.583 0.000 0.187 0.408
## .Pers3 0.237 0.010 24.846 0.000 0.237 0.517
## .Nat1 0.208 0.009 23.285 0.000 0.208 0.434
## .Nat2 0.214 0.009 23.521 0.000 0.214 0.440
## .Nat3 0.510 0.018 28.965 0.000 0.510 0.657
## .Eco1 0.166 0.010 16.158 0.000 0.166 0.374
## .Eco2 0.268 0.014 19.838 0.000 0.268 0.445
## Pol 0.326 0.018 18.035 0.000 1.000 1.000
## Hab 0.391 0.020 19.209 0.000 1.000 1.000
## Loc 0.120 0.009 13.830 0.000 1.000 1.000
## Env 0.183 0.014 13.337 0.000 1.000 1.000
## Int 0.370 0.028 13.436 0.000 1.000 1.000
## Pers 0.252 0.017 14.609 0.000 1.000 1.000
## Nat 0.271 0.015 18.485 0.000 1.000 1.000
## Eco 0.277 0.015 17.936 0.000 1.000 1.000
This is a large set of results, but we can focus on some important parameters to examine. First, we note that the results did not come attached to a warning. One particular warning to look out for relates to the covariance matrix being non-positive definite. This renders some of the attempted measurements invalid and is usually caused by a sample size that is too small for the complexity of the measurement model. Since we did not receive this warning, we can proceed to examine the fit of the measurement model.
Second, we can examine the fit statistics. Numerous statistics are reported, but for larger samples such as this data set, the following measures should be examined:
- CFI and TLI, which compare the proposed model to a baseline (null or random) model to determine if it is better. Ideally we look for both of these measures to exceed 0.95. Our measurement model exceeds this on CFI (0.958) and falls just short on TLI (0.945).
- RMSEA, which should ideally be less than 0.06; this is met by our measurement model (0.045).
- SRMR, which should ideally be less than 0.08; this is met by our measurement model (0.035).
Finally, the parameter estimates for the latent variables should be examined, in particular the Std.all column, which is similar to standardized regression coefficients. These parameters are commonly known as factor loadings - they can be interpreted as the extent to which the item response is explained by the proposed latent variable. In general, factor loadings of 0.7 or above are considered reasonable; factor loadings less than this may be introducing unacceptable measurement error. One option if this occurs is to drop the item completely from the measurement model, or to explore an alternative measurement model with the item assigned to another latent variable. In any case, the analyst will need to balance these considerations against the need to have factors measured against multiple items wherever possible in order to minimize other aspects of measurement error.
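Rather than scanning the long summary output for these values, the standardized estimates can be extracted directly from the fitted object. A sketch using lavaan’s standardizedSolution() function to list the loadings that fall below the 0.7 guideline:

# extract standardized estimates and keep only factor loadings (op == "=~")
std_sol <- lavaan::standardizedSolution(cfa_meas_mod)
loadings <- std_sol[std_sol$op == "=~", c("lhs", "rhs", "est.std")]

# items whose loading falls below the 0.7 guideline
loadings[loadings$est.std < 0.7, ]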
In our case we could consider dropping Pol3, Loc1, Pers1 and Nat3 from the measurement model, as they have factor loadings of less than 0.7 and are in factors that contain three items.
meas_mod_revised <- "
# measurement model
Pol =~ Pol1 + Pol2
Hab =~ Hab1 + Hab2 + Hab3
Loc =~ Loc2 + Loc3
Env =~ Env1 + Env2
Int =~ Int1 + Int2
Pers =~ Pers2 + Pers3
Nat =~ Nat1 + Nat2
Eco =~ Eco1 + Eco2
"
cfa_meas_mod_rev <- lavaan::cfa(model = meas_mod_revised,
                                data = politics_survey)

lavaan::summary(cfa_meas_mod_rev,
                fit.measures = TRUE,
                standardized = TRUE)
## lavaan 0.6-7 ended normally after 90 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 62
##
## Number of observations 2108
##
## Model Test User Model:
##
## Test statistic 381.748
## Degrees of freedom 91
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 13401.444
## Degrees of freedom 136
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.978
## Tucker-Lewis Index (TLI) 0.967
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -31792.769
## Loglikelihood unrestricted model (H1) -31601.895
##
## Akaike (AIC) 63709.539
## Bayesian (BIC) 64060.055
## Sample-size adjusted Bayesian (BIC) 63863.075
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.039
## 90 Percent confidence interval - lower 0.035
## 90 Percent confidence interval - upper 0.043
## P-value RMSEA <= 0.05 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.027
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Pol =~
## Pol1 1.000 0.616 0.836
## Pol2 0.738 0.031 23.664 0.000 0.455 0.666
## Hab =~
## Hab1 1.000 0.624 0.753
## Hab2 1.209 0.032 37.662 0.000 0.755 0.888
## Hab3 1.137 0.031 36.288 0.000 0.710 0.812
## Loc =~
## Loc2 1.000 0.462 0.803
## Loc3 1.185 0.038 31.292 0.000 0.547 0.834
## Env =~
## Env1 1.000 0.432 0.821
## Env2 0.583 0.039 15.046 0.000 0.252 0.591
## Int =~
## Int1 1.000 0.609 0.644
## Int2 1.284 0.066 19.324 0.000 0.782 0.852
## Pers =~
## Pers2 1.000 0.518 0.766
## Pers3 0.954 0.037 25.541 0.000 0.494 0.731
## Nat =~
## Nat1 1.000 0.513 0.742
## Nat2 1.031 0.035 29.772 0.000 0.529 0.759
## Eco =~
## Eco1 1.000 0.531 0.797
## Eco2 1.081 0.042 25.761 0.000 0.574 0.739
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Pol ~~
## Hab 0.179 0.012 15.282 0.000 0.466 0.466
## Loc 0.154 0.009 16.740 0.000 0.541 0.541
## Env 0.091 0.008 11.020 0.000 0.341 0.341
## Int 0.158 0.013 12.092 0.000 0.421 0.421
## Pers 0.168 0.011 15.699 0.000 0.525 0.525
## Nat 0.194 0.011 17.624 0.000 0.612 0.612
## Eco 0.166 0.011 15.647 0.000 0.508 0.508
## Hab ~~
## Loc 0.089 0.008 10.991 0.000 0.308 0.308
## Env 0.051 0.008 6.730 0.000 0.189 0.189
## Int 0.133 0.012 10.989 0.000 0.350 0.350
## Pers 0.108 0.010 11.235 0.000 0.334 0.334
## Nat 0.103 0.009 10.826 0.000 0.320 0.320
## Eco 0.090 0.009 9.531 0.000 0.271 0.271
## Loc ~~
## Env 0.104 0.007 15.863 0.000 0.522 0.522
## Int 0.120 0.010 12.213 0.000 0.425 0.425
## Pers 0.129 0.008 16.012 0.000 0.541 0.541
## Nat 0.152 0.008 17.989 0.000 0.642 0.642
## Eco 0.118 0.008 14.965 0.000 0.481 0.481
## Env ~~
## Int 0.082 0.009 9.470 0.000 0.311 0.311
## Pers 0.073 0.007 10.272 0.000 0.327 0.327
## Nat 0.090 0.007 12.389 0.000 0.405 0.405
## Eco 0.077 0.007 10.736 0.000 0.336 0.336
## Int ~~
## Pers 0.152 0.012 12.762 0.000 0.481 0.481
## Nat 0.173 0.013 13.796 0.000 0.552 0.552
## Eco 0.134 0.011 11.795 0.000 0.415 0.415
## Pers ~~
## Nat 0.193 0.010 18.797 0.000 0.726 0.726
## Eco 0.157 0.010 16.362 0.000 0.571 0.571
## Nat ~~
## Eco 0.194 0.010 18.976 0.000 0.712 0.712
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .Pol1 0.163 0.014 11.407 0.000 0.163 0.300
## .Pol2 0.260 0.011 24.012 0.000 0.260 0.557
## .Hab1 0.298 0.012 25.850 0.000 0.298 0.433
## .Hab2 0.153 0.011 14.170 0.000 0.153 0.212
## .Hab3 0.260 0.012 21.911 0.000 0.260 0.341
## .Loc2 0.118 0.007 17.847 0.000 0.118 0.355
## .Loc3 0.132 0.009 15.077 0.000 0.132 0.305
## .Env1 0.090 0.012 7.674 0.000 0.090 0.325
## .Env2 0.118 0.005 22.215 0.000 0.118 0.650
## .Int1 0.524 0.023 22.407 0.000 0.524 0.586
## .Int2 0.231 0.029 8.010 0.000 0.231 0.274
## .Pers2 0.189 0.011 17.900 0.000 0.189 0.413
## .Pers3 0.213 0.010 20.581 0.000 0.213 0.466
## .Nat1 0.216 0.010 22.585 0.000 0.216 0.450
## .Nat2 0.206 0.010 21.308 0.000 0.206 0.424
## .Eco1 0.161 0.010 15.435 0.000 0.161 0.364
## .Eco2 0.273 0.014 20.019 0.000 0.273 0.454
## Pol 0.380 0.021 18.241 0.000 1.000 1.000
## Hab 0.390 0.020 19.177 0.000 1.000 1.000
## Loc 0.213 0.011 19.385 0.000 1.000 1.000
## Env 0.187 0.014 13.368 0.000 1.000 1.000
## Int 0.371 0.028 13.233 0.000 1.000 1.000
## Pers 0.269 0.016 17.249 0.000 1.000 1.000
## Nat 0.264 0.015 17.740 0.000 1.000 1.000
## Eco 0.282 0.016 17.952 0.000 1.000 1.000
We now see that our measurement model comfortably meets all fit requirements. In our case we chose to completely drop four items from the model. Analysts may wish to experiment with relaxing criteria on dropping items, or on reassigning items to other factors to achieve a good balance between fit and factor measurement reliability.
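For a quick side-by-side check of the original and revised measurement models, the key fit statistics can also be pulled directly rather than read from the full summaries. A sketch using lavaan’s fitMeasures() function:

# compare key fit statistics of the original and revised measurement models
lavaan::fitMeasures(cfa_meas_mod, c("cfi", "tli", "rmsea", "srmr"))
lavaan::fitMeasures(cfa_meas_mod_rev, c("cfi", "tli", "rmsea", "srmr"))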
8.2.2 Running and interpreting the structural model
With a satisfactory measurement model, the structural model is a simple regression formula. The sem() function in lavaan can be used to perform a full structural equation model, including the measurement model and structural model, as follows:
# define full SEM using revised measurement model
full_sem <- "
# measurement model
Pol =~ Pol1 + Pol2
Hab =~ Hab1 + Hab2 + Hab3
Loc =~ Loc2 + Loc3
Env =~ Env1 + Env2
Int =~ Int1 + Int2
Pers =~ Pers2 + Pers3
Nat =~ Nat1 + Nat2
Eco =~ Eco1 + Eco2
# structural model
Overall ~ Pol + Hab + Loc + Env + Int + Pers + Nat + Eco
"
# run full SEM
full_model <- lavaan::sem(model = full_sem, data = politics_survey)
lavaan::summary(full_model, standardized = TRUE)
## lavaan 0.6-7 ended normally after 99 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 71
##
## Number of observations 2108
##
## Model Test User Model:
##
## Test statistic 455.489
## Degrees of freedom 100
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Pol =~
## Pol1 1.000 0.631 0.856
## Pol2 0.704 0.029 24.656 0.000 0.444 0.650
## Hab =~
## Hab1 1.000 0.632 0.762
## Hab2 1.183 0.031 38.354 0.000 0.748 0.879
## Hab3 1.125 0.031 36.787 0.000 0.711 0.813
## Loc =~
## Loc2 1.000 0.463 0.805
## Loc3 1.178 0.037 31.951 0.000 0.546 0.831
## Env =~
## Env1 1.000 0.429 0.815
## Env2 0.592 0.039 15.198 0.000 0.254 0.596
## Int =~
## Int1 1.000 0.611 0.645
## Int2 1.277 0.065 19.514 0.000 0.780 0.850
## Pers =~
## Pers2 1.000 0.521 0.770
## Pers3 0.943 0.037 25.715 0.000 0.492 0.727
## Nat =~
## Nat1 1.000 0.513 0.741
## Nat2 1.033 0.035 29.904 0.000 0.530 0.760
## Eco =~
## Eco1 1.000 0.532 0.799
## Eco2 1.075 0.042 25.780 0.000 0.572 0.737
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Overall ~
## Pol 0.329 0.035 9.356 0.000 0.207 0.308
## Hab 0.257 0.024 10.667 0.000 0.162 0.241
## Loc 0.217 0.047 4.651 0.000 0.100 0.149
## Env -0.098 0.040 -2.432 0.015 -0.042 -0.063
## Int 0.041 0.029 1.430 0.153 0.025 0.037
## Pers 0.108 0.047 2.290 0.022 0.056 0.084
## Nat 0.123 0.071 1.738 0.082 0.063 0.094
## Eco -0.001 0.043 -0.033 0.974 -0.001 -0.001
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Pol ~~
## Hab 0.184 0.012 15.421 0.000 0.461 0.461
## Loc 0.157 0.009 16.960 0.000 0.537 0.537
## Env 0.091 0.008 10.954 0.000 0.335 0.335
## Int 0.161 0.013 12.186 0.000 0.417 0.417
## Pers 0.170 0.011 15.785 0.000 0.516 0.516
## Nat 0.195 0.011 17.730 0.000 0.604 0.604
## Eco 0.168 0.011 15.748 0.000 0.502 0.502
## Hab ~~
## Loc 0.091 0.008 11.091 0.000 0.311 0.311
## Env 0.052 0.008 6.734 0.000 0.190 0.190
## Int 0.137 0.012 11.108 0.000 0.354 0.354
## Pers 0.111 0.010 11.328 0.000 0.336 0.336
## Nat 0.105 0.010 10.926 0.000 0.323 0.323
## Eco 0.092 0.010 9.573 0.000 0.272 0.272
## Loc ~~
## Env 0.104 0.007 15.855 0.000 0.524 0.524
## Int 0.120 0.010 12.286 0.000 0.426 0.426
## Pers 0.131 0.008 16.113 0.000 0.541 0.541
## Nat 0.153 0.008 18.062 0.000 0.643 0.643
## Eco 0.118 0.008 15.021 0.000 0.481 0.481
## Env ~~
## Int 0.082 0.009 9.482 0.000 0.312 0.312
## Pers 0.073 0.007 10.268 0.000 0.328 0.328
## Nat 0.089 0.007 12.364 0.000 0.407 0.407
## Eco 0.077 0.007 10.732 0.000 0.338 0.338
## Int ~~
## Pers 0.153 0.012 12.835 0.000 0.482 0.482
## Nat 0.173 0.012 13.858 0.000 0.553 0.553
## Eco 0.135 0.011 11.831 0.000 0.414 0.414
## Pers ~~
## Nat 0.194 0.010 18.865 0.000 0.725 0.725
## Eco 0.158 0.010 16.426 0.000 0.571 0.571
## Nat ~~
## Eco 0.194 0.010 19.005 0.000 0.712 0.712
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .Pol1 0.145 0.014 10.588 0.000 0.145 0.267
## .Pol2 0.269 0.010 25.659 0.000 0.269 0.577
## .Hab1 0.289 0.011 25.534 0.000 0.289 0.419
## .Hab2 0.164 0.011 15.646 0.000 0.164 0.227
## .Hab3 0.259 0.012 22.160 0.000 0.259 0.339
## .Loc2 0.116 0.006 18.034 0.000 0.116 0.352
## .Loc3 0.133 0.008 15.688 0.000 0.133 0.309
## .Env1 0.093 0.011 8.099 0.000 0.093 0.336
## .Env2 0.117 0.005 22.032 0.000 0.117 0.645
## .Int1 0.522 0.023 22.438 0.000 0.522 0.583
## .Int2 0.234 0.028 8.237 0.000 0.234 0.278
## .Pers2 0.186 0.011 17.690 0.000 0.186 0.406
## .Pers3 0.216 0.010 21.007 0.000 0.216 0.472
## .Nat1 0.216 0.010 22.728 0.000 0.216 0.451
## .Nat2 0.205 0.010 21.337 0.000 0.205 0.422
## .Eco1 0.160 0.010 15.281 0.000 0.160 0.361
## .Eco2 0.275 0.014 20.184 0.000 0.275 0.457
## .Overall 0.245 0.008 29.567 0.000 0.245 0.542
## Pol 0.398 0.021 19.239 0.000 1.000 1.000
## Hab 0.399 0.020 19.519 0.000 1.000 1.000
## Loc 0.215 0.011 19.601 0.000 1.000 1.000
## Env 0.184 0.014 13.395 0.000 1.000 1.000
## Int 0.373 0.028 13.315 0.000 1.000 1.000
## Pers 0.272 0.016 17.403 0.000 1.000 1.000
## Nat 0.263 0.015 17.748 0.000 1.000 1.000
## Eco 0.283 0.016 18.008 0.000 1.000 1.000
The Std.all column of the Regressions section of the output provides standardized estimates, which can be approximately interpreted as the proportion of the variance of the outcome that is explained by each factor. Here we can make the following interpretations:
- Policies, habit and interest in local issues represent the three strongest drivers of likelihood of voting for the party at the next election, and explain approximately 70% of the overall variance in the outcome.
- Interest in national or international issues, and interest in the economy, each have no significant relationship with likelihood to vote for the party at the next election.
- Interest in the environment has a significant negative relationship with likelihood to vote for the party at the next election.
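As with the factor loadings earlier, these standardized estimates and their p-values can be pulled directly from the fitted object rather than read from the full summary. A sketch filtering standardizedSolution() to the regression paths:

# isolate the standardized structural (regression) estimates
std_sol <- lavaan::standardizedSolution(full_model)
std_sol[std_sol$op == "~", c("lhs", "rhs", "est.std", "pvalue")]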

Figure 8.2: Path diagram for full structural equation model on politics_survey
The full structural equation model can be seen in Figure 8.2. This simple example illustrates the value of structural equation modeling both in reducing the dimensions of a complex regression problem and in developing more intuitive and interpretable results for stakeholders. The underlying theory of latent variable modeling, and its implementation in the lavaan package, offer much more flexibility and parameter control than illustrated here, and further exploration is highly recommended. David J. Bartholomew (2011) and Anders Skrondal (2004) are excellent resources for a deeper study of the theory and a wider range of case examples.
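For readers who wish to draw path diagrams like Figures 8.1 and 8.2 for their own models, one option is the semPlot package (an assumption on our part, as this package is not otherwise used in this chapter). A minimal sketch:

# draw the fitted SEM as a path diagram, labeled with standardized estimates
library(semPlot)
semPaths(full_model, whatLabels = "std")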
8.3 Learning exercises
8.3.1 Discussion questions
- Describe some common forms of explicit hierarchies in data. Can you think of some datasets that you have worked with recently that contain an explicit hierarchy?
- Describe the meaning of ‘fixed effect’ and ‘random effect’ in a mixed regression model.
- Which parameter in a mixed regression model is most commonly used when applying a random effect?
- Describe why mixed models are sometimes referred to as multilevel models.
- In a two level mixed model, describe the two levels of statistics that are produced and how to interpret these statistics.
- In latent variable modeling, what is the difference between a latent variable and a measured variable?
- Describe some reasons why latent variable modeling can be valuable in practice.
- Describe the two components of a structural equation model. What is the purpose of each component?
- What are the steps involved in a confirmatory factor analysis on a sufficiently large data set? Describe some fit criteria and the ideal standards for those criteria.
- Describe a process for refining a factor analysis based on fit criteria and factor loadings. What considerations should be addressed during this process?
8.3.2 Data exercises
For Exercises 1-4, use the speed_dating data set used earlier in the chapter, which can be found at http://peopleanalytics-regression-book.org/data/speed_dating.csv.
- Split the data into two sets according to the gender of the participant. Run standard binomial logistic regression models on each set to determine the relationship between the dec decision outcome and the input variables samerace, agediff, attr, intel and prob.
- Run similar mixed models on these sets with a random intercept for iid.
- What different conclusions can you make in comparing the mixed models with the standard models?
- Experiment with some random slope effects to see if they reveal anything new about the input variables.
For Exercises 5-10, load the employee_survey data set via the peopleanalyticsdata package or download it from the internet. This data set contains the results of an engagement survey of employees of a technology company. Each row represents the responses of an individual to the survey and each column represents a specific survey question, with responses on a Likert scale of 1 to 5, where 1 indicates strongly negative sentiment and 5 indicates strongly positive sentiment. Subject matter experts have grouped the items into hypothesized latent factors as follows:
- Happiness is an overall measure of the employee’s current sentiment about their job
- Items beginning with Ben relate to employment benefits
- Items beginning with Work relate to the general work environment
- Items beginning with Man relate to perceptions of management
- Items beginning with Car relate to perceptions of career prospects
- Write out the proposed measurement model, defining the latent factors in terms of the measured items.
- Run a confirmatory factor analysis on the proposed measurement model. Examine the fit and the factor loadings.
- Experiment with the removal of measured items from the measurement model in order to improve the overall fit.
- Once satisfied with the fit of the measurement model, run a full structural equation model on the data.
- Interpret the results of the structural model. Which factors appear most related to overall employee sentiment? Approximately what proportion of the variance in overall sentiment does the model explain?
- If you dropped measured items from your measurement model, experiment with assigning them to other factors to see if this improves the fit of the model. What statistics would you use to compare different measurement models?