One common use of structural equation modeling (SEM) is confirmatory factor analysis (CFA), which lets us test how well our data fit the factor structure we think they should have. Naturally, a good way to find that structure in the first place is through exploratory factor analysis (EFA), and doing the two steps together in an SEM framework is sometimes called E/CFA. There are other types of SEM analyses, including path analysis. Here we’ll focus on EFA and CFA, but the methods and tools are similar across other types of SEM.
We’ll use the items from the “Nerdy Personality Attributes Scale” to build and evaluate our model. Download the data here: https://openpsychometrics.org/_rawdata/
On the right side, click “NPAS-data-16December2018.zip” to download the ZIP file with the data and codebook. More info is available on the page for the scale itself (and the link below on its development): https://openpsychometrics.org/tests/NPAS/
After extracting the CSV file, place it in your /data folder and rename it to npas.txt (since it’s actually tab delimited, not comma delimited). Then load it in, using the readr package because of a few messed-up rows:
## Parsed with column specification:
## cols(
## .default = col_double(),
## country = col_character(),
## major = col_character()
## )
## See spec(...) for full column specifications.
## Warning: 14 parsing failures.
## row col expected actual file
## 2290 major delimiter or quote , './data/npas.txt'
## 2290 major delimiter or quote h './data/npas.txt'
## 2405 major delimiter or quote , './data/npas.txt'
## 2405 major delimiter or quote b './data/npas.txt'
## 4267 major delimiter or quote './data/npas.txt'
## .... ..... .................. ...... .................
## See problems(...) for more details.
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20
## 1 3 5 3 3 5 5 5 3 5 5 4 5 5 5 3 5 4 5 5 5
## 2 4 4 4 3 5 2 5 1 4 4 1 5 4 4 1 3 3 3 1 3
## 3 5 5 5 5 5 5 5 5 5 5 5 4 5 5 4 5 5 4 5 5
## 4 5 5 5 5 5 5 5 3 5 5 5 5 4 4 4 2 5 5 5 5
## 5 4 4 4 4 4 4 4 4 4 5 4 4 5 5 1 1 5 5 3 5
## 6 4 4 4 4 4 4 5 3 3 4 4 4 3 5 3 4 3 4 5 4
## Q21 Q22 Q23 Q24 Q25 Q26 country introelapse testelapse surveyelapse
## 1 5 5 5 5 5 5 US 41 93 125
## 2 3 3 4 4 4 5 GB 13 131 161
## 3 5 5 5 5 3 5 PL 2 49 87
## 4 4 1 5 5 5 5 US 26 710 228
## 5 5 4 4 5 4 0 US 6 195 424
## 6 4 2 5 5 3 4 US 410 117 113
## TIPI1 TIPI2 TIPI3 TIPI4 TIPI5 TIPI6 TIPI7 TIPI8 TIPI9 TIPI10 VCL1 VCL2
## 1 1 1 7 1 6 7 7 1 7 1 1 1
## 2 5 3 5 4 5 5 5 5 6 1 1 1
## 3 1 6 3 7 6 7 2 6 2 1 1 1
## 4 7 2 7 4 7 2 6 1 7 1 1 1
## 5 2 5 4 3 7 6 5 5 6 2 1 1
## 6 2 3 5 5 5 5 4 3 4 4 1 1
## VCL3 VCL4 VCL5 VCL6 VCL7 VCL8 VCL9 VCL10 VCL11 VCL12 VCL13 VCL14 VCL15
## 1 1 1 1 0 1 1 0 1 1 0 1 1 1
## 2 1 1 1 0 0 1 0 1 1 0 1 1 1
## 3 1 1 1 1 0 0 0 1 1 1 1 1 1
## 4 1 1 1 0 1 1 0 1 1 0 1 1 1
## 5 0 1 1 0 0 0 0 1 0 0 1 1 1
## 6 1 1 1 1 0 0 0 1 0 1 1 1 1
## VCL16 education urban gender engnat age screenw screenh hand religion
## 1 1 3 2 2 1 69 2520 1418 1 6
## 2 1 4 2 2 1 50 854 534 1 1
## 3 1 3 1 2 2 22 1366 768 1 1
## 4 1 4 3 1 1 44 768 1024 1 2
## 5 1 1 1 2 1 17 1366 768 1 7
## 6 1 2 3 1 1 18 1366 768 1 1
## orientation voted married familysize major race_arab race_asian
## 1 1 1 3 4 Studio Art 0 0
## 2 1 1 1 3 biophysics 0 0
## 3 2 1 1 2 biology 0 0
## 4 1 2 3 4 Mathematics 0 0
## 5 4 2 1 1 <NA> 0 0
## 6 2 2 1 3 Geology 0 0
## race_black race_white race_hispanic race_nativeam race_nativeau
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 1 0 0 0
## 4 0 1 0 0 0
## 5 0 1 0 0 0
## 6 0 1 0 0 0
## race_other nerdy ASD
## 1 0 7 2
## 2 0 6 2
## 3 0 7 2
## 4 0 7 2
## 5 0 6 2
## 6 0 5 2
## 'data.frame': 19749 obs. of 80 variables:
## $ Q1 : num 3 4 5 5 4 4 5 5 5 4 ...
## $ Q2 : num 5 4 5 5 4 4 3 3 5 5 ...
## $ Q3 : num 3 4 5 5 4 4 4 4 5 5 ...
## $ Q4 : num 3 3 5 5 4 4 3 5 4 5 ...
## $ Q5 : num 5 5 5 5 4 4 5 5 2 4 ...
## $ Q6 : num 5 2 5 5 4 4 4 3 3 3 ...
## $ Q7 : num 5 5 5 5 4 5 4 3 0 5 ...
## $ Q8 : num 3 1 5 3 4 3 5 5 5 4 ...
## $ Q9 : num 5 4 5 5 4 3 5 3 4 4 ...
## $ Q10 : num 5 4 5 5 5 4 5 5 4 5 ...
## $ Q11 : num 4 1 5 5 4 4 3 4 4 1 ...
## $ Q12 : num 5 5 4 5 4 4 3 4 4 4 ...
## $ Q13 : num 5 4 5 4 5 3 4 3 5 4 ...
## $ Q14 : num 5 4 5 4 5 5 4 5 3 5 ...
## $ Q15 : num 3 1 4 4 1 3 1 3 5 5 ...
## $ Q16 : num 5 3 5 2 1 4 1 5 4 4 ...
## $ Q17 : num 4 3 5 5 5 3 3 4 5 5 ...
## $ Q18 : num 5 3 4 5 5 4 1 4 5 5 ...
## $ Q19 : num 5 1 5 5 3 5 3 3 4 2 ...
## $ Q20 : num 5 3 5 5 5 4 1 3 3 5 ...
## $ Q21 : num 5 3 5 4 5 4 5 4 5 2 ...
## $ Q22 : num 5 3 5 1 4 2 1 4 2 3 ...
## $ Q23 : num 5 4 5 5 4 5 4 3 5 5 ...
## $ Q24 : num 5 4 5 5 5 5 5 5 4 5 ...
## $ Q25 : num 5 4 3 5 4 3 3 2 2 2 ...
## $ Q26 : num 5 5 5 5 0 4 5 5 3 5 ...
## $ country : chr "US" "GB" "PL" "US" ...
## $ introelapse : num 41 13 2 26 6 410 7 418 37 144 ...
## $ testelapse : num 93 131 49 710 195 117 107 64 114 112 ...
## $ surveyelapse : num 125 161 87 228 424 113 145 97 568 128 ...
## $ TIPI1 : num 1 5 1 7 2 2 3 3 1 5 ...
## $ TIPI2 : num 1 3 6 2 5 3 1 6 1 6 ...
## $ TIPI3 : num 7 5 3 7 4 5 2 5 5 3 ...
## $ TIPI4 : num 1 4 7 4 3 5 1 7 5 6 ...
## $ TIPI5 : num 6 5 6 7 7 5 7 6 5 6 ...
## $ TIPI6 : num 7 5 7 2 6 5 5 5 7 7 ...
## $ TIPI7 : num 7 5 2 6 5 4 5 6 7 4 ...
## $ TIPI8 : num 1 5 6 1 5 3 5 7 7 7 ...
## $ TIPI9 : num 7 6 2 7 6 4 6 1 3 5 ...
## $ TIPI10 : num 1 1 1 1 2 4 1 4 1 2 ...
## $ VCL1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VCL2 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VCL3 : num 1 1 1 1 0 1 1 1 0 1 ...
## $ VCL4 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VCL5 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VCL6 : num 0 0 1 0 0 1 0 0 0 0 ...
## $ VCL7 : num 1 0 0 1 0 0 0 0 0 0 ...
## $ VCL8 : num 1 1 0 1 0 0 1 0 0 1 ...
## $ VCL9 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VCL10 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VCL11 : num 1 1 1 1 0 0 0 0 0 1 ...
## $ VCL12 : num 0 0 1 0 0 1 0 0 0 0 ...
## $ VCL13 : num 1 1 1 1 1 1 0 1 1 1 ...
## $ VCL14 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VCL15 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VCL16 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ education : num 3 4 3 4 1 2 2 2 3 1 ...
## $ urban : num 2 2 1 3 1 3 2 2 2 2 ...
## $ gender : num 2 2 2 1 2 1 1 2 2 3 ...
## $ engnat : num 1 1 2 1 1 1 1 1 1 2 ...
## $ age : num 69 50 22 44 17 18 18 21 25 17 ...
## $ screenw : num 2520 854 1366 768 1366 ...
## $ screenh : num 1418 534 768 1024 768 ...
## $ hand : num 1 1 1 1 1 1 2 1 1 1 ...
## $ religion : num 6 1 1 2 7 1 1 12 7 1 ...
## $ orientation : num 1 1 2 1 4 2 1 5 1 2 ...
## $ voted : num 1 1 1 2 2 2 2 1 2 2 ...
## $ married : num 3 1 1 3 1 1 1 1 1 1 ...
## $ familysize : num 4 3 2 4 1 3 2 3 4 3 ...
## $ major : chr "Studio Art" "biophysics" "biology" "Mathematics" ...
## $ race_arab : num 0 0 0 0 0 0 0 0 0 0 ...
## $ race_asian : num 0 0 0 0 0 0 0 0 0 0 ...
## $ race_black : num 0 0 0 0 0 0 1 0 1 0 ...
## $ race_white : num 1 1 1 1 1 1 0 1 0 0 ...
## $ race_hispanic: num 0 0 0 0 0 0 0 0 0 1 ...
## $ race_nativeam: num 0 0 0 0 0 0 0 0 0 0 ...
## $ race_nativeau: num 0 0 0 0 0 0 0 0 0 0 ...
## $ race_other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ nerdy : num 7 6 7 7 6 5 6 5 7 7 ...
## $ ASD : num 2 2 2 2 2 2 2 2 2 2 ...
We can see that, along with the “nerdiness” questions, there are a number of other scale items and demographic variables represented. Let’s change each of the Likert variables (and education) to ordered, using lapply().
## [1] "Q1" "Q2" "Q3" "Q4"
## [5] "Q5" "Q6" "Q7" "Q8"
## [9] "Q9" "Q10" "Q11" "Q12"
## [13] "Q13" "Q14" "Q15" "Q16"
## [17] "Q17" "Q18" "Q19" "Q20"
## [21] "Q21" "Q22" "Q23" "Q24"
## [25] "Q25" "Q26" "country" "introelapse"
## [29] "testelapse" "surveyelapse" "TIPI1" "TIPI2"
## [33] "TIPI3" "TIPI4" "TIPI5" "TIPI6"
## [37] "TIPI7" "TIPI8" "TIPI9" "TIPI10"
## [41] "VCL1" "VCL2" "VCL3" "VCL4"
## [45] "VCL5" "VCL6" "VCL7" "VCL8"
## [49] "VCL9" "VCL10" "VCL11" "VCL12"
## [53] "VCL13" "VCL14" "VCL15" "VCL16"
## [57] "education" "urban" "gender" "engnat"
## [61] "age" "screenw" "screenh" "hand"
## [65] "religion" "orientation" "voted" "married"
## [69] "familysize" "major" "race_arab" "race_asian"
## [73] "race_black" "race_white" "race_hispanic" "race_nativeam"
## [77] "race_nativeau" "race_other" "nerdy" "ASD"
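The column listing above helps identify which variables those are. A sketch of the conversion (the exact column selection, including the TIPI items, is my guess at what the original code did):

likertCols = c(paste0("Q", 1:26), paste0("TIPI", 1:10), "education")  # assumed selection
nerd[likertCols] = lapply(nerd[likertCols], ordered)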
First we’ll load up the packages we’ll need: psych for parallel analysis to determine our number of factors, and lavaan and semTools for the EFA and CFA.
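The corresponding calls (the load order only matters for the masking messages shown below):

library(psych)
library(lavaan)
library(semTools)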
## This is lavaan 0.5-23.1097
## lavaan is BETA software! Please report any bugs.
##
## Attaching package: 'lavaan'
## The following object is masked from 'package:psych':
##
## cor2cov
##
## ###############################################################################
## This is semTools 0.4-14
## All users of R (or SEM) are invited to submit functions or ideas for functions.
## ###############################################################################
##
## Attaching package: 'semTools'
## The following object is masked from 'package:psych':
##
## skew
## The following object is masked from 'package:readr':
##
## clipboard
One important point in the E/CFA process is that we can’t build our model and evaluate our model using the same data. If we did, we would risk building a model that is only valid for that particular data set. The best way to counteract this would be to collect two separate data sets and use one for EFA and the other for CFA. Since here we only have one data set to work with, we’ll do the next best thing, which is to split it randomly into two sets that we use for the two steps of E/CFA. Following convention in machine learning, in which this kind of splitting is common for training and validating predictive models, we’ll call them the “train” and “test” sets, respectively.
To split the data, we’ll use the sample() function to randomly select half of the rows in the data frame (rounded to a whole number). We use those rows as our train set, and the rest of the rows as our test set. Before sampling, we set the seed for the random number generator in R, so that anyone who reruns our code will get the same result. I tend to use the current date as my seed, but you could use any number.
set.seed(20190430)
trainSet = sample(nrow(nerd), round(nrow(nerd)/2))
nerdTrain = nerd[trainSet,]
nerdTest = nerd[-trainSet,]
head(nerdTrain)
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19
## 12801 1 5 2 4 4 4 5 4 4 1 1 5 2 4 1 1 1 5 1
## 178 2 5 2 2 4 3 5 5 2 4 5 5 4 4 3 3 3 2 5
## 15165 4 4 5 5 4 4 5 3 4 3 1 4 2 4 1 1 4 5 2
## 7646 5 4 5 5 5 4 4 4 4 5 1 4 3 3 4 3 4 5 1
## 3162 5 1 5 5 4 4 3 1 5 4 1 3 3 5 1 5 5 1 1
## 10896 4 5 4 2 1 5 4 4 4 4 3 4 4 4 4 2 3 5 3
## Q20 Q21 Q22 Q23 Q24 Q25 Q26 country introelapse testelapse
## 12801 4 1 1 1 4 1 5 IE 1103 83
## 178 3 4 2 2 3 3 3 CA 12 84
## 15165 4 1 2 1 4 3 5 US 21 131
## 7646 4 5 3 4 5 2 4 NONE 64 173
## 3162 4 3 5 5 5 1 2 US 5 107
## 10896 3 2 3 5 4 4 4 US 11 100
## surveyelapse TIPI1 TIPI2 TIPI3 TIPI4 TIPI5 TIPI6 TIPI7 TIPI8 TIPI9
## 12801 80 7 1 5 1 6 3 7 5 7
## 178 133 4 5 6 3 5 5 3 5 6
## 15165 213 3 5 2 6 2 4 3 7 5
## 7646 273 6 5 6 6 5 5 4 5 3
## 3162 126 1 7 2 7 6 6 1 7 1
## 10896 131 5 5 6 3 7 5 6 6 5
## TIPI10 VCL1 VCL2 VCL3 VCL4 VCL5 VCL6 VCL7 VCL8 VCL9 VCL10 VCL11
## 12801 2 1 1 1 1 1 0 1 1 1 1 0
## 178 5 1 1 0 1 1 0 0 0 0 1 0
## 15165 2 1 1 1 1 1 0 1 1 0 1 0
## 7646 2 1 1 1 0 1 0 0 0 0 1 0
## 3162 2 1 1 0 1 1 0 0 1 0 1 0
## 10896 2 1 1 1 1 1 0 0 0 0 1 0
## VCL12 VCL13 VCL14 VCL15 VCL16 education urban gender engnat age
## 12801 0 1 1 1 1 4 1 2 1 28
## 178 0 1 1 1 1 2 1 1 1 17
## 15165 0 1 1 1 1 4 3 1 1 52
## 7646 0 1 1 1 1 1 1 2 1 18
## 3162 0 0 1 1 1 1 1 2 1 15
## 10896 0 1 1 1 1 2 2 2 1 15
## screenw screenh hand religion orientation voted married familysize
## 12801 1280 1024 1 2 1 1 1 4
## 178 1920 1080 1 2 1 2 1 3
## 15165 360 640 2 9 1 1 2 2
## 7646 320 534 3 5 1 2 1 6
## 3162 1366 768 1 2 1 2 1 1
## 10896 375 667 1 6 3 2 1 2
## major race_arab race_asian race_black race_white race_hispanic
## 12801 science 0 0 0 1 0
## 178 <NA> 0 0 0 1 0
## 15165 Geology 0 0 0 1 0
## 7646 <NA> 0 0 0 1 0
## 3162 <NA> 0 0 0 1 0
## 10896 <NA> 0 0 0 1 0
## race_nativeam race_nativeau race_other nerdy ASD
## 12801 0 0 0 6 2
## 178 0 0 0 3 2
## 15165 0 0 0 5 2
## 7646 0 0 0 5 2
## 3162 0 0 0 2 2
## 10896 0 0 0 6 2
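The second preview below is presumably the matching call on the test half:

head(nerdTest)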
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20
## 1 3 5 3 3 5 5 5 3 5 5 4 5 5 5 3 5 4 5 5 5
## 2 4 4 4 3 5 2 5 1 4 4 1 5 4 4 1 3 3 3 1 3
## 3 5 5 5 5 5 5 5 5 5 5 5 4 5 5 4 5 5 4 5 5
## 4 5 5 5 5 5 5 5 3 5 5 5 5 4 4 4 2 5 5 5 5
## 6 4 4 4 4 4 4 5 3 3 4 4 4 3 5 3 4 3 4 5 4
## 7 5 3 4 3 5 4 4 5 5 5 3 3 4 4 1 1 3 1 3 1
## Q21 Q22 Q23 Q24 Q25 Q26 country introelapse testelapse surveyelapse
## 1 5 5 5 5 5 5 US 41 93 125
## 2 3 3 4 4 4 5 GB 13 131 161
## 3 5 5 5 5 3 5 PL 2 49 87
## 4 4 1 5 5 5 5 US 26 710 228
## 6 4 2 5 5 3 4 US 410 117 113
## 7 5 1 4 5 3 5 US 7 107 145
## TIPI1 TIPI2 TIPI3 TIPI4 TIPI5 TIPI6 TIPI7 TIPI8 TIPI9 TIPI10 VCL1 VCL2
## 1 1 1 7 1 6 7 7 1 7 1 1 1
## 2 5 3 5 4 5 5 5 5 6 1 1 1
## 3 1 6 3 7 6 7 2 6 2 1 1 1
## 4 7 2 7 4 7 2 6 1 7 1 1 1
## 6 2 3 5 5 5 5 4 3 4 4 1 1
## 7 3 1 2 1 7 5 5 5 6 1 1 1
## VCL3 VCL4 VCL5 VCL6 VCL7 VCL8 VCL9 VCL10 VCL11 VCL12 VCL13 VCL14 VCL15
## 1 1 1 1 0 1 1 0 1 1 0 1 1 1
## 2 1 1 1 0 0 1 0 1 1 0 1 1 1
## 3 1 1 1 1 0 0 0 1 1 1 1 1 1
## 4 1 1 1 0 1 1 0 1 1 0 1 1 1
## 6 1 1 1 1 0 0 0 1 0 1 1 1 1
## 7 1 1 1 0 0 1 0 1 0 0 0 1 1
## VCL16 education urban gender engnat age screenw screenh hand religion
## 1 1 3 2 2 1 69 2520 1418 1 6
## 2 1 4 2 2 1 50 854 534 1 1
## 3 1 3 1 2 2 22 1366 768 1 1
## 4 1 4 3 1 1 44 768 1024 1 2
## 6 1 2 3 1 1 18 1366 768 1 1
## 7 1 2 2 1 1 18 1280 720 2 1
## orientation voted married familysize major race_arab race_asian
## 1 1 1 3 4 Studio Art 0 0
## 2 1 1 1 3 biophysics 0 0
## 3 2 1 1 2 biology 0 0
## 4 1 2 3 4 Mathematics 0 0
## 6 2 2 1 3 Geology 0 0
## 7 1 2 1 2 <NA> 0 0
## race_black race_white race_hispanic race_nativeam race_nativeau
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 1 0 0 0
## 4 0 1 0 0 0
## 6 0 1 0 0 0
## 7 1 0 0 0 0
## race_other nerdy ASD
## 1 0 7 2
## 2 0 6 2
## 3 0 7 2
## 4 0 7 2
## 6 0 5 2
## 7 0 6 2
Then, to proceed with our EFA, we build a correlation matrix of the “nerdiness” scale items.
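A sketch of that step, assuming the items are converted back to numeric scores for cor() (the object name nerdTrainCor is mine):

nerdTrainCor = cor(sapply(nerdTrain[, paste0("Q", 1:26)], as.numeric))
dim(nerdTrainCor)
nerdTrainCor[1:10, 1:10]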
## [1] 26 26
## Q1 Q2 Q3 Q4 Q5 Q6 Q7
## Q1 1.0000000 0.2136586 0.4296777 0.6261649 0.2710970 0.2127082 0.1573807
## Q2 0.2136586 1.0000000 0.1826554 0.2220183 0.2216742 0.2021222 0.3988995
## Q3 0.4296777 0.1826554 1.0000000 0.3755428 0.1768708 0.2290472 0.1014630
## Q4 0.6261649 0.2220183 0.3755428 1.0000000 0.3002433 0.2415370 0.1656149
## Q5 0.2710970 0.2216742 0.1768708 0.3002433 1.0000000 0.3501039 0.3710673
## Q6 0.2127082 0.2021222 0.2290472 0.2415370 0.3501039 1.0000000 0.2492080
## Q7 0.1573807 0.3988995 0.1014630 0.1656149 0.3710673 0.2492080 1.0000000
## Q8 0.3799838 0.3032137 0.4246551 0.3222260 0.1638154 0.2601828 0.0573631
## Q9 0.2235768 0.2555981 0.1163953 0.2475506 0.4768827 0.4460943 0.3729157
## Q10 0.2593911 0.1776734 0.3222602 0.2636585 0.3251740 0.3110858 0.2065492
## Q8 Q9 Q10
## Q1 0.3799838 0.2235768 0.2593911
## Q2 0.3032137 0.2555981 0.1776734
## Q3 0.4246551 0.1163953 0.3222602
## Q4 0.3222260 0.2475506 0.2636585
## Q5 0.1638154 0.4768827 0.3251740
## Q6 0.2601828 0.4460943 0.3110858
## Q7 0.0573631 0.3729157 0.2065492
## Q8 1.0000000 0.1785132 0.2282437
## Q9 0.1785132 1.0000000 0.2980757
## Q10 0.2282437 0.2980757 1.0000000
To stick with a consistent set of tools, we’ll use efaUnrotate() in the semTools package to run our EFA this time.

Before we do, we need to determine the appropriate number of factors. Instead of maximum likelihood estimation (ml), in this case we’ll use weighted least squares methods (wls), because they tend to be more appropriate for ordinal data.
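A sketch of the parallel analysis, run on the correlation matrix above with the training-set sample size passed as n.obs and wls as the factoring method:

fa.parallel(nerdTrainCor, n.obs = nrow(nerdTrain), fm = "wls")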
## Parallel analysis suggests that the number of factors = 7 and the number of components = 5
Looks like 7. Let’s give it a try. efaUnrotate() takes raw data, but knows what to do with our ordinal variables, so it’s okay to use them.

Here we use WLSMV as our estimator, which is a statistically robust variant of WLS (meaning: good for both normal and non-normal data) available in lavaan and semTools. A list of options is available on the lavaan “Estimators” webpage: http://lavaan.ugent.be/tutorial/est.html

Then, just like our previous work with EFA, we rotate the result. Here I use oblimin rotation again, because we expect our different factors may be correlated (since they all measure “nerdiness”).
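A sketch of the unrotated EFA followed by the oblique rotation; the object names are mine, and passing the ordered-factor item columns straight to efaUnrotate() with estimator = "WLSMV" is an assumption about the original call:

nerdEFA7 = efaUnrotate(nerdTrain[, paste0("Q", 1:26)], nf = 7, estimator = "WLSMV")
nerdRotate7 = oblqRotate(nerdEFA7, method = "oblimin")  # oblique rotation via GPArotation
summary(nerdRotate7)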
## Loading required namespace: GPArotation
## Standardized Rotated Factor Loadings
## factor1 factor2 factor3 factor4 factor5 factor6 factor7
## Q17 -0.761*
## Q13 -0.629* 0.104* -0.147* -0.236*
## Q16 -0.597* -0.140*
## Q3 -0.499* -0.111* -0.337*
## Q8 -0.436* -0.171* -0.333*
## Q10 -0.292* 0.276*
## Q22 -0.285* 0.189* 0.113* -0.271* -0.189*
## Q5 0.682* 0.104*
## Q24 0.579* -0.121* 0.116*
## Q26 0.507* 0.160* -0.205*
## Q9 0.495* 0.107* -0.123* -0.222*
## Q1 -0.778*
## Q4 -0.756*
## Q19 -0.866*
## Q11 -0.750* -0.115*
## Q7 0.751* 0.115* -0.122*
## Q12 0.728*
## Q2 -0.179* 0.534* -0.358*
## Q25 0.248* -0.145* 0.502*
## Q15 -0.168* -0.134* 0.206* -0.536*
## Q23 0.246* 0.110* -0.470* -0.276*
## Q21 0.250* -0.152* -0.217* -0.400*
## Q6 0.141* -0.144* -0.513*
## Q18 0.195* -0.111* -0.491*
## Q20 -0.359* 0.178* 0.130* -0.436*
## Q14 -0.201* 0.139* 0.109* -0.320*
##
## Factor Correlation
## factor1 factor2 factor3 factor4 factor5
## factor1 1.0000000 -0.19992781 0.4353482 0.20338370 -0.12370052
## factor2 -0.1999278 1.00000000 -0.2843631 -0.02436977 0.42249015
## factor3 0.4353482 -0.28436309 1.0000000 0.30746979 -0.21769809
## factor4 0.2033837 -0.02436977 0.3074698 1.00000000 -0.29241720
## factor5 -0.1237005 0.42249015 -0.2176981 -0.29241720 1.00000000
## factor6 0.2430809 -0.14202300 0.2115626 0.13406027 -0.07220745
## factor7 0.3011301 -0.51064846 0.1433126 -0.12776643 -0.23331675
## factor6 factor7
## factor1 0.24308089 0.3011301
## factor2 -0.14202300 -0.5106485
## factor3 0.21156265 0.1433126
## factor4 0.13406027 -0.1277664
## factor5 -0.07220745 -0.2333167
## factor6 1.00000000 0.2314242
## factor7 0.23142418 1.0000000
##
## Method of rotation: Oblimin Quartimin
## [1] "The standard errors are close but do not match with other packages. Be mindful when using it."
And we get our 7 factors with the loadings for each item. It conveniently organizes them from highest to lowest loading (positive or negative) for the first factor, then for the next factor, and so on. It’s okay to see high negative loadings for some items on some factors; it just means that those items are strongly opposed to the positively loading items on that factor.
However, this immediately alerts us to a problem with this model, in that we don’t have any high positive loadings on the first factor. The highest is Q18 (“I love to read challenging material”), but it’s below 0.2, which means it really doesn’t fit there either.
The problem here is that we have too many factors, even though parallel analysis said this was the best fit. This is called “overdimensionalization” and sometimes happens with ordinal data like Likert surveys. Here is a reference to read more about it:
The solution is just to drop a factor. So we do:
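Same pattern as before, with one fewer factor:

nerdEFA6 = efaUnrotate(nerdTrain[, paste0("Q", 1:26)], nf = 6, estimator = "WLSMV")
nerdRotate6 = oblqRotate(nerdEFA6, method = "oblimin")
summary(nerdRotate6)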
## Standardized Rotated Factor Loadings
## factor1 factor2 factor3 factor4 factor5 factor6
## Q23 0.699* -0.243*
## Q9 0.603* 0.108* 0.183*
## Q6 0.571* 0.156*
## Q26 0.513* 0.245* 0.120* 0.251*
## Q21 0.464* -0.212* 0.187* -0.102* -0.192*
## Q24 0.451* 0.152* 0.230* 0.222*
## Q22 0.442* 0.277*
## Q5 0.391* 0.262* 0.201* 0.235*
## Q18 0.347* 0.154*
## Q17 0.725* 0.140*
## Q13 0.323* 0.666* -0.130*
## Q16 -0.157* 0.592* 0.116* -0.125*
## Q3 -0.119* 0.545* 0.308*
## Q20 0.241* 0.458* 0.130* 0.196*
## Q8 0.444* -0.330* 0.183*
## Q14 0.235* 0.281* 0.157* 0.185*
## Q10 0.230* 0.266* 0.165* 0.163*
## Q15 -0.550* -0.138* 0.146*
## Q1 0.767*
## Q4 0.746*
## Q19 -0.870*
## Q11 -0.765*
## Q7 0.811*
## Q12 -0.128* 0.784*
## Q25 -0.200* 0.555*
## Q2 -0.416* -0.186* 0.502*
##
## Factor Correlation
## factor1 factor2 factor3 factor4 factor5
## factor1 1.00000000 0.3301510 -0.03697804 0.27620315 0.01337266
## factor2 0.33015100 1.0000000 -0.16333641 0.41772927 -0.19033071
## factor3 -0.03697804 -0.1633364 1.00000000 -0.08160535 0.17238051
## factor4 0.27620315 0.4177293 -0.08160535 1.00000000 -0.31085697
## factor5 0.01337266 -0.1903307 0.17238051 -0.31085697 1.00000000
## factor6 0.38927694 0.1571819 0.10638474 0.26261581 -0.27637983
## factor6
## factor1 0.3892769
## factor2 0.1571819
## factor3 0.1063847
## factor4 0.2626158
## factor5 -0.2763798
## factor6 1.0000000
##
## Method of rotation: Oblimin Quartimin
## [1] "The standard errors are close but do not match with other packages. Be mindful when using it."
Looks better, but we have only negative loadings on factor 5 now (and spoiler: 5 factors doesn’t work well either). At this point the better approach would probably be for us to drop some of the more problematic items and re-run analyses from the start. For this lecture, I’m just going to cut it down to 4 factors so we have something to use for our CFA.
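The 4-factor run, same pattern as above (the bivariate-table warnings that follow come from this estimation step):

nerdEFA4 = efaUnrotate(nerdTrain[, paste0("Q", 1:26)], nf = 4, estimator = "WLSMV")
nerdRotate4 = oblqRotate(nerdEFA4, method = "oblimin")
summary(nerdRotate4)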
## Warning in lav_samplestats_from_data(lavdata = lavdata, missing = lavoptions$missing, : lavaan WARNING: 173 bivariate tables have empty cells; to see them, use:
## lavInspect(fit, "zero.cell.tables")
## Warning in lav_samplestats_from_data(lavdata = lavdata, missing = lavoptions$missing, : lavaan WARNING: 173 bivariate tables have empty cells; to see them, use:
## lavInspect(fit, "zero.cell.tables")
## Standardized Rotated Factor Loadings
## factor1 factor2 factor3 factor4
## Q7 0.760* 0.178*
## Q12 0.649* 0.254*
## Q26 0.625* 0.151* -0.123* -0.186*
## Q5 0.568* -0.244* -0.103*
## Q25 0.560* -0.107* 0.259*
## Q24 0.557* 0.122* -0.181*
## Q9 0.544* 0.304* -0.114*
## Q18 0.356* 0.159*
## Q14 0.308* 0.213* -0.175*
## Q23 0.170* 0.686* 0.115*
## Q13 0.646* -0.177*
## Q22 0.105* 0.548*
## Q8 -0.163* 0.511* -0.256* 0.252*
## Q6 0.306* 0.494* -0.153*
## Q21 0.411*
## Q20 0.244* 0.362* -0.162*
## Q1 0.101* -0.753*
## Q4 0.160* -0.703*
## Q3 -0.139* 0.236* -0.536* 0.105*
## Q17 -0.102* 0.360* -0.482* 0.162*
## Q16 -0.168* 0.253* -0.405* 0.230*
## Q10 0.219* 0.193* -0.304*
## Q19 -0.135* -0.102* 0.752*
## Q11 0.716*
## Q2 0.319* 0.234* 0.107* 0.494*
## Q15 0.362* 0.429*
##
## Factor Correlation
## factor1 factor2 factor3 factor4
## factor1 1.0000000 0.29674975 -0.1988301 0.11859358
## factor2 0.2967497 1.00000000 -0.4175340 0.05306424
## factor3 -0.1988301 -0.41753403 1.0000000 -0.26056564
## factor4 0.1185936 0.05306424 -0.2605656 1.00000000
##
## Method of rotation: Oblimin Quartimin
## [1] "The standard errors are close but do not match with other packages. Be mindful when using it."
If we want more details, like standard error values and confidence intervals, we could run:
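One option (my substitution, since the original call isn’t shown) is lavaan’s parameterEstimates() on the unrotated fit, which returns the estimates along with standard errors and confidence intervals:

parameterEstimates(nerdEFA4)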
We can use fitMeasures() to look at a number of different criteria used to assess the fit of the EFA. We can also look at the R-squared values (“communalities”) for each item to see how well it fits. We’ll come back to these for the CFA.
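A sketch, using the unrotated lavaan fit (fit measures don’t depend on the rotation); inspect(..., "rsquare") is lavaan’s accessor for the communalities:

fitMeasures(nerdEFA4, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))
round(inspect(nerdEFA4, "rsquare"), 3)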
## chisq df pvalue cfi tli rmsea srmr
## 6155.099 227.000 0.000 0.978 0.968 0.051 0.042
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12
## 0.569 0.444 0.458 0.518 0.462 0.449 0.599 0.498 0.494 0.267 0.569 0.520
## Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24
## 0.535 0.262 0.291 0.388 0.546 0.182 0.618 0.327 0.206 0.340 0.509 0.445
## Q25 Q26
## 0.396 0.523
Here’s how each of our factors breaks down. Sometimes it’s helpful to assign labels to your constructs to help make sense of them (though the labels themselves are arbitrary).
Factor 1 (“Geek”):
Q7 I watch science related shows.
Q12 I spend more time at the library than any other public place.
Q26 I can be socially awkward at times.
Q5 I collect books.
Q25 I care about super heroes.
Q9 I like science fiction.
Q24 I am a strange person.
Q18 I love to read challenging material.
Q14 I like to read technology news reports.
Factor 2 (“Reader”):
Q13 I would describe my smarts as bookish.
Q23 I get excited about my ideas and research.
Q22 I enjoy learning more than I need to.
Q8 I spend recreational time researching topics others might find dry or overly rigorous.
Q17 I am more comfortable interacting online than in person.
Q6 I prefer academic success to social success.
Q20 I was a very odd child.
Q21 I sometimes prefer fictional people to real ones.
Q16 I gravitate towards introspection.
Q10 I would rather read a book than go to a party.
Factor 3 (“Know-it-all”):
Q1 I am interested in science.
Q4 My appearance is not as important as my intelligence.
Q3 I like to play RPGs. (Ex. D&D)
Factor 4 (“Gamer”):
Q19 I have played a lot of video games.
Q11 I am more comfortable with my hobbies than I am with other people.
Q2 I was in advanced classes.
Q15 I have started writing a novel.
Then, for our subsequent analysis, we’ll need to define this model in a way that lavaan can understand. There are details on the model syntax in the two-part tutorial on the lavaan website.
nerdModel = "nerd =~ geek + reader + knowitall + gamer
geek =~ Q7 + Q12 + Q26 +Q5 + Q25 + Q9 + Q24 + Q18 + Q14
reader =~ Q13 + Q23 + Q22 + Q8 + Q17 + Q6 + Q20 + Q21 + Q16 + Q10
knowitall =~ Q1 + Q4 + Q3
gamer =~ Q19 + Q11 + Q2 + Q15"
Essentially we’re telling it that geek, reader, knowitall, and gamer are our 4 factors, and that they all contribute to a second-order factor called nerd.
Note: The most recent versions of the lavaan and semTools packages refused to fit the model for me. I had to roll back to older versions, using the commands below. If you’re trying to replicate it and can’t get it to work, try doing the same.
remove.packages(c("lavaan","semTools"))
install.packages("devtools")
devtools::install_version("lavaan", version="0.6.3")
devtools::install_version("semTools", version="0.4-14")
Now that we have our model, built using our train subset, we can test the fit of our test subset to that model using CFA. It’s fairly straightforward here (but more details are available on the lavaan webpage):
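A sketch of the call (the object name nerdCFA is mine; WLSMV as the estimator again is an assumption, and lavaan picks up the ordered-factor items automatically):

nerdCFA = cfa(nerdModel, data = nerdTest, estimator = "WLSMV")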
## Warning in lav_samplestats_from_data(lavdata = lavdata, missing = lavoptions$missing, : lavaan WARNING: 149 bivariate tables have empty cells; to see them, use:
## lavInspect(fit, "zero.cell.tables")
We get the warning again about empty bivariate cells, which we’re okay to ignore.
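The long parameter listing below comes from a plain summary call along these lines:

summary(nerdCFA)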
## lavaan (0.5-23.1097) converged normally after 34 iterations
##
## Number of observations 9875
##
## Estimator DWLS Robust
## Minimum Function Test Statistic 27890.631 29727.068
## Degrees of freedom 295 295
## P-value (Chi-square) 0.000 0.000
## Scaling correction factor 0.941
## Shift parameter 76.852
## for simple second-order correction (Mplus variant)
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Robust.sem
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## nerd =~
## geek 1.000
## reader 1.356 0.033 40.870 0.000
## knowitall 1.267 0.029 43.733 0.000
## gamer 0.739 0.022 33.384 0.000
## geek =~
## Q7 1.000
## Q12 1.009 0.015 67.554 0.000
## Q26 1.088 0.015 70.779 0.000
## Q5 1.048 0.015 68.615 0.000
## Q25 0.825 0.015 56.517 0.000
## Q9 1.059 0.016 68.226 0.000
## Q24 1.037 0.016 66.033 0.000
## Q18 0.576 0.017 33.949 0.000
## Q14 0.831 0.016 51.277 0.000
## reader =~
## Q13 1.000
## Q23 0.833 0.014 60.354 0.000
## Q22 0.795 0.014 57.554 0.000
## Q8 0.901 0.014 65.948 0.000
## Q17 1.012 0.013 76.021 0.000
## Q6 0.911 0.014 66.350 0.000
## Q20 0.861 0.014 62.427 0.000
## Q21 0.625 0.015 41.026 0.000
## Q16 0.775 0.014 55.468 0.000
## Q10 0.759 0.015 50.779 0.000
## knowitall =~
## Q1 1.000
## Q4 0.993 0.015 65.947 0.000
## Q3 0.901 0.016 56.565 0.000
## gamer =~
## Q19 1.000
## Q11 1.197 0.024 48.902 0.000
## Q2 1.200 0.025 48.867 0.000
## Q15 0.858 0.021 40.772 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .Q7 0.000
## .Q12 0.000
## .Q26 0.000
## .Q5 0.000
## .Q25 0.000
## .Q9 0.000
## .Q24 0.000
## .Q18 0.000
## .Q14 0.000
## .Q13 0.000
## .Q23 0.000
## .Q22 0.000
## .Q8 0.000
## .Q17 0.000
## .Q6 0.000
## .Q20 0.000
## .Q21 0.000
## .Q16 0.000
## .Q10 0.000
## .Q1 0.000
## .Q4 0.000
## .Q3 0.000
## .Q19 0.000
## .Q11 0.000
## .Q2 0.000
## .Q15 0.000
## nerd 0.000
## geek 0.000
## reader 0.000
## knowitall 0.000
## gamer 0.000
##
## Thresholds:
## Estimate Std.Err z-value P(>|z|)
## Q7|t1 -2.615 0.051 -50.912 0.000
## Q7|t2 -1.738 0.023 -76.640 0.000
## Q7|t3 -1.296 0.017 -74.797 0.000
## Q7|t4 -0.837 0.014 -58.293 0.000
## Q7|t5 -0.016 0.013 -1.298 0.194
## Q12|t1 -2.631 0.052 -50.285 0.000
## Q12|t2 -1.427 0.019 -76.748 0.000
## Q12|t3 -0.933 0.015 -62.952 0.000
## Q12|t4 -0.469 0.013 -35.747 0.000
## Q12|t5 0.344 0.013 26.703 0.000
## Q26|t1 -2.722 0.058 -46.680 0.000
## Q26|t2 -1.896 0.026 -74.250 0.000
## Q26|t3 -1.370 0.018 -76.057 0.000
## Q26|t4 -0.797 0.014 -56.171 0.000
## Q26|t5 0.010 0.013 0.755 0.450
## Q5|t1 -2.925 0.075 -38.790 0.000
## Q5|t2 -1.599 0.021 -77.489 0.000
## Q5|t3 -1.030 0.015 -67.003 0.000
## Q5|t4 -0.528 0.013 -39.783 0.000
## Q5|t5 0.313 0.013 24.342 0.000
## Q25|t1 -2.755 0.061 -45.402 0.000
## Q25|t2 -0.969 0.015 -64.497 0.000
## Q25|t3 -0.385 0.013 -29.679 0.000
## Q25|t4 0.096 0.013 7.616 0.000
## Q25|t5 0.799 0.014 56.284 0.000
## Q9|t1 -2.744 0.060 -45.840 0.000
## Q9|t2 -1.711 0.022 -76.900 0.000
## Q9|t3 -1.183 0.016 -72.102 0.000
## Q9|t4 -0.568 0.013 -42.440 0.000
## Q9|t5 0.275 0.013 21.497 0.000
## Q24|t1 -2.766 0.062 -44.951 0.000
## Q24|t2 -2.046 0.029 -70.773 0.000
## Q24|t3 -1.600 0.021 -77.487 0.000
## Q24|t4 -1.023 0.015 -66.736 0.000
## Q24|t5 -0.059 0.013 -4.659 0.000
## Q18|t1 -2.639 0.053 -49.962 0.000
## Q18|t2 -1.268 0.017 -74.211 0.000
## Q18|t3 -0.927 0.015 -62.650 0.000
## Q18|t4 -0.516 0.013 -38.974 0.000
## Q18|t5 0.077 0.013 6.107 0.000
## Q14|t1 -2.778 0.062 -44.485 0.000
## Q14|t2 -1.667 0.022 -77.230 0.000
## Q14|t3 -1.007 0.015 -66.078 0.000
## Q14|t4 -0.298 0.013 -23.241 0.000
## Q14|t5 0.447 0.013 34.199 0.000
## Q13|t1 -2.790 0.063 -44.004 0.000
## Q13|t2 -1.357 0.018 -75.857 0.000
## Q13|t3 -0.840 0.014 -58.441 0.000
## Q13|t4 -0.274 0.013 -21.417 0.000
## Q13|t5 0.240 0.013 18.810 0.000
## Q23|t1 -2.744 0.060 -45.840 0.000
## Q23|t2 -1.306 0.017 -74.991 0.000
## Q23|t3 -0.873 0.015 -60.111 0.000
## Q23|t4 -0.520 0.013 -39.211 0.000
## Q23|t5 0.093 0.013 7.335 0.000
## Q22|t1 -2.830 0.067 -42.462 0.000
## Q22|t2 -0.389 0.013 -29.998 0.000
## Q22|t3 0.202 0.013 15.899 0.000
## Q22|t4 0.693 0.014 50.318 0.000
## Q22|t5 1.190 0.016 72.301 0.000
## Q8|t1 -2.712 0.058 -47.083 0.000
## Q8|t2 -1.340 0.018 -75.586 0.000
## Q8|t3 -0.936 0.015 -63.076 0.000
## Q8|t4 -0.556 0.013 -41.654 0.000
## Q8|t5 0.131 0.013 10.392 0.000
## Q17|t1 -2.907 0.074 -39.469 0.000
## Q17|t2 -1.759 0.023 -76.405 0.000
## Q17|t3 -1.124 0.016 -70.325 0.000
## Q17|t4 -0.539 0.013 -40.552 0.000
## Q17|t5 0.202 0.013 15.859 0.000
## Q6|t1 -2.733 0.059 -46.266 0.000
## Q6|t2 -1.586 0.020 -77.504 0.000
## Q6|t3 -0.982 0.015 -65.069 0.000
## Q6|t4 -0.315 0.013 -24.522 0.000
## Q6|t5 0.520 0.013 39.211 0.000
## Q20|t1 -2.790 0.063 -44.004 0.000
## Q20|t2 -1.612 0.021 -77.462 0.000
## Q20|t3 -0.957 0.015 -63.991 0.000
## Q20|t4 -0.194 0.013 -15.276 0.000
## Q20|t5 0.477 0.013 36.283 0.000
## Q21|t1 -2.665 0.054 -48.947 0.000
## Q21|t2 -0.534 0.013 -40.197 0.000
## Q21|t3 -0.177 0.013 -13.950 0.000
## Q21|t4 0.092 0.013 7.315 0.000
## Q21|t5 0.617 0.014 45.591 0.000
## Q16|t1 -2.925 0.075 -38.790 0.000
## Q16|t2 -1.156 0.016 -71.333 0.000
## Q16|t3 -0.616 0.014 -45.532 0.000
## Q16|t4 -0.089 0.013 -7.013 0.000
## Q16|t5 0.441 0.013 33.742 0.000
## Q10|t1 -2.571 0.049 -52.643 0.000
## Q10|t2 -1.825 0.024 -75.506 0.000
## Q10|t3 -1.420 0.019 -76.674 0.000
## Q10|t4 -0.680 0.014 -49.490 0.000
## Q10|t5 0.071 0.013 5.604 0.000
## Q1|t1 -2.803 0.064 -43.508 0.000
## Q1|t2 -1.754 0.023 -76.460 0.000
## Q1|t3 -1.260 0.017 -74.024 0.000
## Q1|t4 -0.623 0.014 -46.020 0.000
## Q1|t5 0.242 0.013 18.970 0.000
## Q4|t1 -2.683 0.056 -48.229 0.000
## Q4|t2 -1.548 0.020 -77.483 0.000
## Q4|t3 -0.965 0.015 -64.358 0.000
## Q4|t4 -0.425 0.013 -32.588 0.000
## Q4|t5 0.328 0.013 25.503 0.000
## Q3|t1 -2.639 0.053 -49.962 0.000
## Q3|t2 -1.796 0.024 -75.930 0.000
## Q3|t3 -1.368 0.018 -76.020 0.000
## Q3|t4 -0.960 0.015 -64.131 0.000
## Q3|t5 -0.034 0.013 -2.667 0.008
## Q19|t1 -2.778 0.062 -44.485 0.000
## Q19|t2 -0.852 0.014 -59.067 0.000
## Q19|t3 -0.428 0.013 -32.787 0.000
## Q19|t4 -0.103 0.013 -8.180 0.000
## Q19|t5 0.352 0.013 27.283 0.000
## Q11|t1 -2.693 0.056 -47.857 0.000
## Q11|t2 -0.653 0.014 -47.866 0.000
## Q11|t3 -0.320 0.013 -24.903 0.000
## Q11|t4 0.082 0.013 6.470 0.000
## Q11|t5 0.575 0.013 42.911 0.000
## Q2|t1 -2.600 0.050 -51.512 0.000
## Q2|t2 -1.538 0.020 -77.463 0.000
## Q2|t3 -1.121 0.016 -70.218 0.000
## Q2|t4 -0.713 0.014 -51.489 0.000
## Q2|t5 0.032 0.013 2.506 0.012
## Q15|t1 -2.693 0.056 -47.857 0.000
## Q15|t2 -0.669 0.014 -48.872 0.000
## Q15|t3 -0.266 0.013 -20.836 0.000
## Q15|t4 0.147 0.013 11.598 0.000
## Q15|t5 0.732 0.014 52.598 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .Q7 0.585
## .Q12 0.578
## .Q26 0.509
## .Q5 0.544
## .Q25 0.718
## .Q9 0.535
## .Q24 0.554
## .Q18 0.862
## .Q14 0.714
## .Q13 0.552
## .Q23 0.689
## .Q22 0.717
## .Q8 0.636
## .Q17 0.541
## .Q6 0.628
## .Q20 0.668
## .Q21 0.825
## .Q16 0.731
## .Q10 0.742
## .Q1 0.449
## .Q4 0.457
## .Q3 0.553
## .Q19 0.615
## .Q11 0.448
## .Q2 0.445
## .Q15 0.716
## nerd 0.184 0.007 26.676 0.000
## geek 0.231 0.007 34.294 0.000
## reader 0.109 0.007 15.940 0.000
## knowitall 0.255 0.009 29.752 0.000
## gamer 0.285 0.008 33.629 0.000
##
## Scales y*:
## Estimate Std.Err z-value P(>|z|)
## Q7 1.000
## Q12 1.000
## Q26 1.000
## Q5 1.000
## Q25 1.000
## Q9 1.000
## Q24 1.000
## Q18 1.000
## Q14 1.000
## Q13 1.000
## Q23 1.000
## Q22 1.000
## Q8 1.000
## Q17 1.000
## Q6 1.000
## Q20 1.000
## Q21 1.000
## Q16 1.000
## Q10 1.000
## Q1 1.000
## Q4 1.000
## Q3 1.000
## Q19 1.000
## Q11 1.000
## Q2 1.000
## Q15 1.000
We see the results for all 4 factors and the items that they’re constructed from, as well as a bunch of other information. But how well does the model fit the data?
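Same fit-measure selection as before, now on the CFA fit:

fitMeasures(nerdCFA, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))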
## chisq df pvalue cfi tli rmsea srmr
## 27890.631 295.000 0.000 0.891 0.880 0.097 0.088
Chi-square statistics (including degrees of freedom and p-value) are commonly reported in the literature and are expected by some reviewers, but really mean nothing in this context as they’re almost always significant given a decent sample size. The ones to pay attention to are the last 4: CFI, TLI, RMSEA, and (sometimes) SRMR. Good criteria to use to evaluate them are:
Comparative Fit Index (“CFI”) > 0.95
Tucker-Lewis Index (“TLI”) > 0.96
Root Mean Square Error of Approximation (“RMSEA”) < 0.05
Standardized Root Mean Square Residual (“SRMR”) < 0.07
So, does our model fit? No, not well at all; every one of our fit indices falls outside the criteria. This means we’d have to go back, probably trim down some of the less well-fitting items, and see if we can build a better model.
Note: These values come from the following sources (Hu and Bentler is the classic source, and Yu updated a couple of the cutoffs based on additional analyses):
Hu L, Bentler PM (1999) Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling 6(1):1–55. doi: 10.1080/10705519909540118
Yu C-Y (2002) Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes. Dissertation (University of California, Los Angeles). Available: https://www.statmodel.com/download/Yudissertation.pdf
Before we go, let’s plot our model using the semPlot package.
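A minimal sketch; semPaths() has many options, and plotting standardized estimates (what = "std") is my assumption here:

library(semPlot)
semPaths(nerdCFA, what = "std", layout = "tree")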
You’d need to know a bit more about CFA and SEM to be able to interpret everything properly, but the lines from latent factors to lower factors and items represent loadings, so you can see the relative strength of each.
The figure could use some cleaning up, using additional parameters (see ?semPaths), to make everything a bit more visible. But here, we’ll leave it as-is and move on.
CFA is just a specialized instance of SEM, so there are many more ways to do SEM using the lavaan package. More details are available here: http://lavaan.ugent.be/tutorial/sem.html

The general function is sem():
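For instance, the same call pattern as cfa(), shown here on the model we already have just to illustrate the function (sem() is a wrapper like cfa(), with very similar defaults):

fitSEM = sem(nerdModel, data = nerdTest, estimator = "WLSMV")
summary(fitSEM, fit.measures = TRUE)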
Social network analysis
To have a data set to explore as a first look into social networks, we’ll use data on a Northern Alaskan community helping network from the following paper:
Lee, H. W., et al. (2018). Mapping the structure of perceptions in helping networks of Alaska Natives. PLOS ONE, 13(11), e0204343. doi: 10.1371/journal.pone.0204343
In the Supporting Information, download “S1 File” and extract “Edge data.csv” to your /data folder: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0204343#sec016

A little network data terminology: networks are made up of nodes and edges. Nodes are the individual points in the network, usually people, and are often represented as numerical IDs. Edges are the connections between nodes, and can represent any kind of association or interaction (time spent together, citations, etc.). When we read in the data, we can have separate lists of nodes and edges, and edge lists often contain other variables that relate to the links between nodes.
There are two types of networks: directed and undirected. In directed networks, the edges have a direction, e.g. A does something to B. In undirected networks, A is connected to B, but there is no direction to the connection, they’re just associated.
Here, all we have is an edge list and, from the paper, it’s directed (A helps B). We read it in:
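A sketch of the import, assuming the extracted file keeps the name “Edge data.csv” (the object name edges and the use of base read.csv() are mine):

edges = read.csv("./data/Edge data.csv")
head(edges)
str(edges)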
We can see the network has some negative values on its edges.
This is actually what’s called a “signed” network, in that it has edges with positive and negative signs. Positive values represent positive interactions (from the paper: “e.g. liking, loving, supporting”) and negative values represent negative interactions (“disliking, hating, opposing”). To simplify things here, we’re just going to look at the positive edges.
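A sketch of the filtering; the column holding the signed tie value is hypothetical here (I’m calling it weight), so check names(edges) for the real one:

edgesPos = edges[edges$weight > 0, ]  # 'weight' is a placeholder column name
nrow(edgesPos)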
Looks like it’s still a very big network, with 64,262 connections between 254 nodes.
There are a number of social network packages in R, the main ones being network, sna, and igraph. Here we’ll use igraph because it seems to be the most commonly used package, and is also cross-platform with other programming languages such as Python. But depending on your analyses and your needs, feel free to check the others out too.

We’ll use the graph_from_data_frame() function to make our data frame into an igraph object.
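A sketch, assuming the positive-edge data frame from above with its first two columns as the sender and receiver IDs (which is what graph_from_data_frame() expects); the object name helpNet is mine:

library(igraph)
helpNet = graph_from_data_frame(edgesPos, directed = TRUE)
helpNet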
Looks like it loaded in correctly. We can see the other variables from the data frame as attributes, and the beginning of the edge list, with direction indicated. We can get vectors of the nodes (in igraph, called “vertices”) with V() and the edges with E().
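For example:

V(helpNet)  # vertices (nodes)
E(helpNet)  # edges, with direction shown as from->to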
If we want to pull out any of the other attributes from our original edge list, we can use the edge_attr() function. And if we reference the graph like a matrix, we can see the connections between nodes:
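Sketches of both, again using the hypothetical weight attribute:

edge_attr(helpNet, "weight")  # pull one edge attribute by name (placeholder name)
helpNet[1:5, 1:5]             # matrix-style indexing shows connections among the first few nodes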
It looks like every node is connected to every other node. This may not make for a very informative graph, but we can still plot it to take a look:
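A basic plot call (the label, size, and arrow settings are just my attempt to keep the figure legible):

plot(helpNet, vertex.size = 5, vertex.label = NA, edge.arrow.size = 0.2)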
Yep, just a total mess. This can often be a downfall of social network diagrams, in that they can be very pretty, but contain such a high density of information that they don’t really communicate anything (refer back to our resources on information design).
We can try to remedy it by changing the layout of the data:
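For instance, with a circular layout (which layout the original used isn’t shown):

plot(helpNet, layout = layout_in_circle(helpNet),
     vertex.size = 5, vertex.label = NA, edge.arrow.size = 0.2)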
Nope. Still a mess. If we had less data, perhaps by looking at meaningful subsets, it could be better. Since all nodes are connected, another option is to use the edge attributes to, for instance, display different line weights for our edges. Though we won’t notice a difference here, we could do it like so:
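A sketch, again treating weight as the placeholder name for the edge attribute:

plot(helpNet, edge.width = edge_attr(helpNet, "weight"),
     vertex.size = 5, vertex.label = NA, edge.arrow.size = 0.2)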
We’ll do more with plotting next time, with a more tractable data set.