“The probability of getting the observed result, or a more extreme result, if the null hypothesis is true.”
Source: McDonald, Basic concepts of hypothesis testing
Source: Pomeroy, R. (2016). The biggest myth about p-values. RealClearScience.
Now let’s get some data!
Go to the page for the PanTHERIA database (hosted by the Ecological Society of America) at: http://esapubs.org/archive/ecol/E090/184/
Near the bottom, download the text file with “WR05” in the name (referring to the 2005 edition of a mammal classification book) and put it in your /data folder.
Metadata and descriptions of all the variables are available here: http://esapubs.org/archive/ecol/E090/184/metadata.htm
pan = read.table("./data/PanTHERIA_1-0_WR05_Aug2008.txt", header=T, sep="\t",
na.strings=c("-999","-999.00"), stringsAsFactors=F)
str(pan)
## 'data.frame': 5416 obs. of 55 variables:
## $ MSW05_Order : chr "Artiodactyla" "Carnivora" "Carnivora" "Carnivora" ...
## $ MSW05_Family : chr "Camelidae" "Canidae" "Canidae" "Canidae" ...
## $ MSW05_Genus : chr "Camelus" "Canis" "Canis" "Canis" ...
## $ MSW05_Species : chr "dromedarius" "adustus" "aureus" "latrans" ...
## $ MSW05_Binomial : chr "Camelus dromedarius" "Canis adustus" "Canis aureus" "Canis latrans" ...
## $ X1.1_ActivityCycle : num 3 1 2 2 2 2 NA 2 3 NA ...
## $ X5.1_AdultBodyMass_g : num 492714 10392 9659 11989 31757 ...
## $ X8.1_AdultForearmLen_mm : num NA NA NA NA NA NA NA NA NA NA ...
## $ X13.1_AdultHeadBodyLen_mm : num NA 745 828 872 1055 ...
## $ X2.1_AgeatEyeOpening_d : num NA NA 7.5 11.9 14 ...
## $ X3.1_AgeatFirstBirth_d : num 1652 NA NA 365 548 ...
## $ X18.1_BasalMetRate_mLO2hr : num 40293 NA NA 3699 11254 ...
## $ X5.2_BasalMetRateMass_g : num 407000 NA NA 10450 33100 ...
## $ X6.1_DietBreadth : num 3 6 6 1 1 3 2 2 NA NA ...
## $ X7.1_DispersalAge_d : num NA 330 NA 255 180 ...
## $ X9.1_GestationLen_d : num 386.5 65 61.2 61.7 63.5 ...
## $ X12.1_HabitatBreadth : num 1 1 1 1 1 NA NA 1 1 NA ...
## $ X22.1_HomeRange_km2 : num 196.32 1.01 2.95 18.88 159.86 ...
## $ X22.2_HomeRange_Indiv_km2 : num NA 1.01 3.13 19.91 43.13 ...
## $ X14.1_InterbirthInterval_d : num 614 NA 365 365 365 ...
## $ X15.1_LitterSize : num 0.98 4.5 3.74 5.72 4.98 1.22 1 1.22 1.01 NA ...
## $ X16.1_LittersPerYear : num 1 NA NA NA 2 1 1 1 NA NA ...
## $ X17.1_MaxLongevity_m : num 480 137 192 262 354 ...
## $ X5.3_NeonateBodyMass_g : num 36751 NA 212 200 412 ...
## $ X13.2_NeonateHeadBodyLen_mm : num NA NA NA NA NA NA NA NA NA NA ...
## $ X21.1_PopulationDensity_n.km2: num 0.98 0.74 0.22 0.25 0.01 0.54 NA 0.75 4.89 NA ...
## $ X10.1_PopulationGrpSize : num 11 NA NA NA NA NA NA 21 NA NA ...
## $ X23.1_SexualMaturityAge_d : num 1948 250 371 373 679 ...
## $ X10.2_SocialGrpSize : num 10 NA NA NA NA 40 110 40 2.05 NA ...
## $ X24.1_TeatNumber : int NA 8 8 8 9 NA NA NA NA NA ...
## $ X12.2_Terrestriality : num 1 1 1 1 1 NA NA 1 2 NA ...
## $ X6.2_TrophicLevel : int 1 2 2 3 3 1 1 1 NA NA ...
## $ X25.1_WeaningAge_d : num 389.4 52.9 61.3 43.7 44.8 ...
## $ X5.4_WeaningBodyMass_g : num NA NA NA NA NA NA NA NA NA NA ...
## $ X13.3_WeaningHeadBodyLen_mm : num NA NA NA NA NA NA NA NA NA NA ...
## $ References : chr "511;543;719;1274;1297;1594;1654;1822;1848;2655;3044" "542;543;730;1113;1297;1573;2655" "543;679;730;1113;1297;1573;2655" "367;542;543;730;1113;1297;1573;1822;2655" ...
## $ X5.5_AdultBodyMass_g_EXT : num NA NA NA NA NA NA NA NA NA NA ...
## $ X16.2_LittersPerYear_EXT : num NA NA 1.1 1.1 NA NA NA NA 1.05 NA ...
## $ X5.6_NeonateBodyMass_g_EXT : num NA NA NA NA NA NA NA NA NA NA ...
## $ X5.7_WeaningBodyMass_g_EXT : num NA NA NA NA NA NA NA NA NA NA ...
## $ X26.1_GR_Area_km2 : num NA 10581413 25739527 17099094 50803440 ...
## $ X26.2_GR_MaxLat_dd : num NA 16.7 47 71.4 83.3 ...
## $ X26.3_GR_MinLat_dd : num NA -28.73 -4.71 8.02 11.48 ...
## $ X26.4_GR_MidRangeLat_dd : num NA -6 21.1 39.7 47.4 ...
## $ X26.5_GR_MaxLong_dd : num NA 43.5 108.5 -67.1 179.7 ...
## $ X26.6_GR_MinLong_dd : num NA -17.5 -17.1 -168.1 -171.8 ...
## $ X26.7_GR_MidRangeLong_dd : num NA 13 45.7 -117.6 3.9 ...
## $ X27.1_HuPopDen_Min_n.km2 : int NA 0 0 0 0 1 0 1 0 NA ...
## $ X27.2_HuPopDen_Mean_n.km2 : num NA 35.2 79.3 27.3 37.9 ...
## $ X27.3_HuPopDen_5p_n.km2 : num NA 1 0 0 0 8 0 4 0 NA ...
## $ X27.4_HuPopDen_Change : num NA 0.14 0.1 0.06 0.04 0.09 0.05 0.11 0.05 NA ...
## $ X28.1_Precip_Mean_mm : num NA 90.8 44.6 53 34.8 ...
## $ X28.2_Temp_Mean_01degC : num NA 236.51 217.23 58.18 4.82 ...
## $ X30.1_AET_Mean_mm : num NA 923 438 503 313 ...
## $ X30.2_PET_Mean_mm : num NA 1534 1359 728 561 ...
Column names are messy. Let’s try to clean them up.
library(stringr)
Loop through each column name; if it contains an underscore, split it at the first underscore and keep the part after it:
for (i in 1:length(colnames(pan))) {
  if (grepl("_", colnames(pan)[i])) {
    colnames(pan)[i] = str_split_fixed(colnames(pan)[i], "_", n=2)[2]
  }
}
colnames(pan)
## [1] "Order" "Family"
## [3] "Genus" "Species"
## [5] "Binomial" "ActivityCycle"
## [7] "AdultBodyMass_g" "AdultForearmLen_mm"
## [9] "AdultHeadBodyLen_mm" "AgeatEyeOpening_d"
## [11] "AgeatFirstBirth_d" "BasalMetRate_mLO2hr"
## [13] "BasalMetRateMass_g" "DietBreadth"
## [15] "DispersalAge_d" "GestationLen_d"
## [17] "HabitatBreadth" "HomeRange_km2"
## [19] "HomeRange_Indiv_km2" "InterbirthInterval_d"
## [21] "LitterSize" "LittersPerYear"
## [23] "MaxLongevity_m" "NeonateBodyMass_g"
## [25] "NeonateHeadBodyLen_mm" "PopulationDensity_n.km2"
## [27] "PopulationGrpSize" "SexualMaturityAge_d"
## [29] "SocialGrpSize" "TeatNumber"
## [31] "Terrestriality" "TrophicLevel"
## [33] "WeaningAge_d" "WeaningBodyMass_g"
## [35] "WeaningHeadBodyLen_mm" "References"
## [37] "AdultBodyMass_g_EXT" "LittersPerYear_EXT"
## [39] "NeonateBodyMass_g_EXT" "WeaningBodyMass_g_EXT"
## [41] "GR_Area_km2" "GR_MaxLat_dd"
## [43] "GR_MinLat_dd" "GR_MidRangeLat_dd"
## [45] "GR_MaxLong_dd" "GR_MinLong_dd"
## [47] "GR_MidRangeLong_dd" "HuPopDen_Min_n.km2"
## [49] "HuPopDen_Mean_n.km2" "HuPopDen_5p_n.km2"
## [51] "HuPopDen_Change" "Precip_Mean_mm"
## [53] "Temp_Mean_01degC" "AET_Mean_mm"
## [55] "PET_Mean_mm"
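As an aside, the same cleanup can be done in one vectorized call instead of a loop. This is just a sketch of an alternative (it would replace the loop and be applied to the original column names, so it is demonstrated on a throwaway vector here): sub() strips everything up to and including the first underscore, and names without an underscore, like References, don’t match the pattern and are left alone.
example_names = c("MSW05_Order", "X5.1_AdultBodyMass_g", "References")
sub("^[^_]*_", "", example_names)   # "Order" "AdultBodyMass_g" "References"
# the equivalent of the loop would be: colnames(pan) = sub("^[^_]*_", "", colnames(pan))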
?binom.test
Classic example: toss a coin a bunch of times and test whether the probability of getting heads is different from 0.5.
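For instance, a quick made-up version of that coin example (hypothetical counts, not from our data): 62 heads out of 100 tosses.
# two-sided by default: is the true probability of heads different from 0.5?
binom.test(62, 100, p=0.5)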
“Is an artiodactyl species more likely to be an herbivore than an omnivore?”
table(pan$TrophicLevel[pan$Order == "Artiodactyla"])
##
## 1 2
## 139 25
binom.test(table(pan$TrophicLevel[pan$Order == "Artiodactyla"]),
alternative="greater")
##
## Exact binomial test
##
## data: table(pan$TrophicLevel[pan$Order == "Artiodactyla"])
## number of successes = 139, number of trials = 164, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.7936443 1.0000000
## sample estimates:
## probability of success
## 0.847561
?fisher.test
Tests a contingency table to ask whether the proportions of observations in each category differ across groups.
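To see the shape of input it expects, here is a minimal made-up 2x2 example (hypothetical counts, not from PanTHERIA): do two habitat types differ in their counts of nocturnal vs. diurnal species?
toy = matrix(c(8, 2,
               1, 9),
             nrow=2, byrow=TRUE,
             dimnames=list(habitat=c("forest", "grassland"),
                           activity=c("nocturnal", "diurnal")))
fisher.test(toy)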
“Does the time of day that primate species are active differ across families?”
table(pan$ActivityCycle[pan$Order == "Primates"],
pan$Family[pan$Order == "Primates"])
##
## Aotidae Atelidae Cebidae Cercopithecidae Cheirogaleidae Daubentoniidae
## 1 8 0 0 0 6 1
## 2 0 0 0 1 2 0
## 3 0 18 42 95 0 0
##
## Galagidae Hominidae Hylobatidae Indriidae Lemuridae Lepilemuridae
## 1 5 0 0 1 0 7
## 2 0 0 0 1 12 0
## 3 0 5 11 7 0 0
##
## Lorisidae Pitheciidae Tarsiidae
## 1 4 0 3
## 2 0 0 0
## 3 0 22 0
Notice it’s still a two-way contingency table, even though each variable has more than two categories.
fisher.test(table(pan$ActivityCycle[pan$Order == "Primates"],
pan$Family[pan$Order == "Primates"]))
## Error in fisher.test(table(pan$ActivityCycle[pan$Order == "Primates"], : FEXACT error 7(location). LDSTP=18600 is too small for this problem,
## (pastp=7.78322, ipn_0:=ipoin[itp=40]=86, stp[ipn_0]=6.44572).
## Increase workspace or consider using 'simulate.p.value=TRUE'
It doesn’t want to do it because the table is too big for an exact calculation, but we can force it to simulate a p-value, which will be good enough for now (see the McDonald page for alternatives).
fisher.test(table(pan$ActivityCycle[pan$Order == "Primates"],
pan$Family[pan$Order == "Primates"]),
simulate.p.value=T)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: table(pan$ActivityCycle[pan$Order == "Primates"], pan$Family[pan$Order == "Primates"])
## p-value = 0.0004998
## alternative hypothesis: two.sided
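The error message also suggests increasing the workspace for the exact calculation. That would look something like this (a sketch; with a table this large it may still be slow or run out of memory, which is why we settled for the simulated p-value):
fisher.test(table(pan$ActivityCycle[pan$Order == "Primates"],
                  pan$Family[pan$Order == "Primates"]),
            workspace=2e8)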
?t.test
Is the mean of a set of values different from some value I specify?
“Does the forearm length of fruit bats (Family Pteropodidae) tend to be greater than 100 mm?”
na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae"])[1:20]
## [1] 40.96 45.50 56.00 64.21 73.52 71.00 76.94 122.55 105.74 127.50
## [11] 122.37 115.64 114.76 107.25 148.84 80.00 138.69 110.83 115.10 116.50
mean(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae"], na.rm=T)
## [1] 100.1316
t.test(na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae"]),
mu=100, alternative="greater")
##
## One Sample t-test
##
## data: na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae"])
## t = 0.042288, df = 164, p-value = 0.4832
## alternative hypothesis: true mean is greater than 100
## 95 percent confidence interval:
## 94.98233 Inf
## sample estimates:
## mean of x
## 100.1316
?t.test
Do the means of two sets of values differ?
“Does the average forearm length of fruit bats (Family Pteropodidae) differ between species that have one litter of offspring per year and species that have two?”
na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae" & pan$LittersPerYear == 1])[1:10]
## [1] 105.74 138.69 127.72 120.50 50.56 77.00 65.36 165.99 177.97 132.00
na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae" & pan$LittersPerYear == 2])[1:10]
## [1] 64.21 71.00 76.94 107.25 80.00 116.50 69.00 85.27 73.23 80.66
t.test(na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae" & pan$LittersPerYear == 1]),
na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae" & pan$LittersPerYear == 2]))
##
## Welch Two Sample t-test
##
## data: na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae" & pan$LittersPerYear == 1]) and na.omit(pan$AdultForearmLen_mm[pan$Family == "Pteropodidae" & pan$LittersPerYear == 2])
## t = 5.9168, df = 25.131, p-value = 3.487e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 39.08336 80.80292
## sample estimates:
## mean of x mean of y
## 137.6268 77.6837
Note that if you have paired values (like pre- and post-test values from the same individuals), then a paired t-test is more powerful (a better chance of detecting a significant effect with a smaller sample size) and you should use that instead. You can do this in R by setting the t.test() argument paired=TRUE.
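A hypothetical sketch of what that looks like (made-up pre/post measurements on the same ten individuals, not from PanTHERIA):
pre  = c(21.3, 19.8, 22.5, 20.1, 23.0, 18.7, 21.9, 20.4, 22.2, 19.5)
post = c(22.1, 20.5, 23.4, 20.0, 23.9, 19.6, 22.8, 21.0, 23.1, 20.2)
t.test(pre, post, paired=TRUE)   # values are paired by position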
?cor.test
Are two variables correlated (positively or negatively)?
“Is wild mammal population density associated with human population density?”
pan[complete.cases(pan$PopulationDensity_n.km2, pan$HuPopDen_Mean_n.km2),
c("PopulationDensity_n.km2", "HuPopDen_Mean_n.km2")][1:20,]
## PopulationDensity_n.km2 HuPopDen_Mean_n.km2
## 2 0.74 35.20
## 3 0.22 79.29
## 4 0.25 27.27
## 5 0.01 37.87
## 6 0.54 152.67
## 8 0.75 139.21
## 9 4.89 1.07
## 19 11.86 53.45
## 21 62.32 39.68
## 22 8.85 3.51
## 24 0.06 1.83
## 27 2.39 313.30
## 29 0.74 30.81
## 30 1.20 99.87
## 41 13.49 2.79
## 42 3.39 1.62
## 44 19.06 55.14
## 47 8.00 3.25
## 48 18.43 2.40
## 54 13.03 47.74
cor.test(pan[complete.cases(pan$PopulationDensity_n.km2,
pan$HuPopDen_Mean_n.km2),
"PopulationDensity_n.km2"],
pan[complete.cases(pan$PopulationDensity_n.km2,
pan$HuPopDen_Mean_n.km2),
"HuPopDen_Mean_n.km2"])
##
## Pearson's product-moment correlation
##
## data: pan[complete.cases(pan$PopulationDensity_n.km2, pan$HuPopDen_Mean_n.km2), "PopulationDensity_n.km2"] and pan[complete.cases(pan$PopulationDensity_n.km2, pan$HuPopDen_Mean_n.km2), "HuPopDen_Mean_n.km2"]
## t = -0.077896, df = 891, p-value = 0.9379
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06820173 0.06300495
## sample estimates:
## cor
## -0.002609622
Correlation and regression can both examine how two variables are associated.
So what’s the difference between correlation and regression? Correlation gives you a single number describing the strength and direction of the association, while regression estimates the parameters (the m and b of a y = mx + b equation) so that you can create a model of your data and plot a line of best fit through it.
?lm
Does one variable predict another (positively or negatively)?
“Does litter size predict home range in canids?”
panCanid = pan[pan$Family == "Canidae",]
panCanid = panCanid[complete.cases(panCanid$LitterSize,
panCanid$HomeRange_km2),]
panCanidModel = lm(HomeRange_km2 ~ LitterSize, data=panCanid)
summary(panCanidModel)
##
## Call:
## lm(formula = HomeRange_km2 ~ LitterSize, data = panCanid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -115.07 -39.74 -16.59 26.48 254.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -85.60 50.20 -1.705 0.1054
## LitterSize 28.07 10.83 2.592 0.0184 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.41 on 18 degrees of freedom
## Multiple R-squared: 0.2718, Adjusted R-squared: 0.2314
## F-statistic: 6.72 on 1 and 18 DF, p-value: 0.0184
Response (Y variable) comes first, followed by predictors (X variables). Read the ~ as “as a function of” or “as a result of.”
So, from this, our model equation would look something like:
Y = b + m * X
HomeRange_km2 = -85.60 + 28.07 * LitterSize
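If we want those numbers straight from the fitted model object instead of reading them off the summary, coef() returns the intercept (b) and the slope (m):
coef(panCanidModel)   # (Intercept) about -85.60, LitterSize about 28.07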
We can plot our regression line:
library(ggplot2)
ggplot(panCanid, aes(x=LitterSize, y=HomeRange_km2)) +
geom_point() +
geom_smooth(method="lm", se=F, color="red")
This re-runs the lm() function with the specified variables to plot the line.
But if we want to make sure we’re using our own model estimates, we can fortify() our model results, which turns them into a data frame:
fortify(panCanidModel)
## HomeRange_km2 LitterSize .hat .sigma .cooksd .fitted
## 2 1.01 4.50 0.05053979 82.14831 6.842384e-03 40.737331
## 3 2.95 3.74 0.05626123 82.63931 1.321991e-03 19.400799
## 4 18.88 5.72 0.08517500 81.50889 2.477626e-02 74.988081
## 5 159.86 4.98 0.05772333 78.41785 5.611221e-02 54.213036
## 29 11.93 3.89 0.05347216 82.69005 6.298555e-04 23.611957
## 193 1.41 3.09 0.07778012 82.74129 4.691101e-07 1.152449
## 882 21.36 2.00 0.14827048 81.65682 4.080140e-02 -29.448630
## 1132 33.64 4.30 0.05001372 82.74049 9.418158e-06 35.122454
## 2224 5.05 4.88 0.05553743 81.92856 1.034593e-02 51.405598
## 2230 395.88 8.10 0.30816788 36.84250 3.214091e+00 141.805116
## 3133 1.15 6.36 0.12493847 79.24366 1.063425e-01 92.955687
## 3249 1.13 4.00 0.05194566 82.49580 2.922118e-03 26.700139
## 5312 1.83 3.71 0.05691700 82.63577 1.384868e-03 18.558567
## 5313 0.42 2.17 0.13443944 82.48226 8.739549e-03 -24.675985
## 5373 1.17 2.00 0.14827048 82.34912 1.481739e-02 -29.448630
## 5377 14.10 7.65 0.25025054 76.20518 4.558565e-01 129.171643
## 5378 8.10 4.50 0.05053979 82.34155 4.618042e-03 40.737331
## 5380 29.93 2.12 0.13839855 81.43678 4.522661e-02 -26.079704
## 5381 4.59 4.25 0.05010895 82.42318 3.643844e-03 33.718735
## 5382 3.50 4.59 0.05124998 82.14676 6.961768e-03 43.264026
## .resid .stdresid
## 2 -39.727331 -0.507037867
## 3 -16.450799 -0.210596178
## 4 -56.108081 -0.729534421
## 5 105.646964 1.353496665
## 29 -11.681957 -0.149326981
## 193 0.257551 0.003335303
## 882 50.808630 0.684661686
## 1132 -1.482454 -0.018915249
## 2224 -46.355598 -0.593197340
## 2230 254.074884 3.798837911
## 3133 -91.805687 -1.220505316
## 3249 -25.570139 -0.326592240
## 5312 -16.728567 -0.214226495
## 5313 25.095985 0.335463211
## 5373 30.618630 0.412595319
## 5377 -115.071643 -1.652721284
## 5378 -32.637331 -0.416548565
## 5380 56.009704 0.750411455
## 5381 -29.128735 -0.371684215
## 5382 -39.764026 -0.507696111
We can then use the .fitted values (the result of plugging each of those LitterSize values into the model) as the values for our line:
ggplot(panCanid, aes(x=LitterSize, y=HomeRange_km2)) +
geom_point() +
geom_line(data=fortify(panCanidModel), aes(x=LitterSize, y=.fitted),
color="red", size=1)
Finally, we can use our model to predict new responses.
?predict
predict(panCanidModel, newdata=data.frame(LitterSize=1:10))
## 1 2 3 4 5 6
## -57.523015 -29.448630 -1.374246 26.700139 54.774524 82.848908
## 7 8 9 10
## 110.923293 138.997678 167.072062 195.146447
range(panCanid$LitterSize)
## [1] 2.0 8.1
Note that we can predict responses for values outside the range of our data.
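predict() can also return a confidence interval around each of those fitted values (a quick sketch of the same call with one extra argument):
predict(panCanidModel, newdata=data.frame(LitterSize=1:10), interval="confidence")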
If we want to use a different “link function” (i.e. if we think our data fit a different distribution), we can use a generalized linear model with glm(). We can also do basic linear regression with glm(), because lm() is a special case of glm().
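For example (a sketch, not part of the original analysis): fitting the same canid model with glm() and the default gaussian family gives the same estimates as lm(), and a different family and link could be swapped in if we thought it suited the data better.
panCanidGlm = glm(HomeRange_km2 ~ LitterSize, data=panCanid, family=gaussian)
coef(panCanidGlm)   # same intercept and slope as panCanidModel
# e.g. a Gamma error distribution with a log link, if we thought that described
# home range better:
# glm(HomeRange_km2 ~ LitterSize, data=panCanid, family=Gamma(link="log"))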
We also see, with our plot, that we probably should have checked for and removed outliers first. Outliers can have a huge effect on the parameters of your model and its accuracy, because the model fitting procedure has to take those outliers into account when it’s trying to find the line that fits everything as closely as possible.
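One quick way to check (a sketch using a common rule of thumb, not something from the original lesson): the .cooksd column in the fortified model measures how strongly each point pulls on the fit, and points above roughly 4/n deserve a closer look.
fortCanid = fortify(panCanidModel)
# flag unusually influential observations
fortCanid[fortCanid$.cooksd > 4 / nrow(fortCanid),
          c("LitterSize", "HomeRange_km2", ".cooksd")]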
Do any of these variables predict another (positively or negatively)?
First, let’s make sure we’re working with the complete cases for our new variable, DispersalAge_d.
panCanid2 = pan[pan$Family == "Canidae",]
panCanid2 = panCanid2[complete.cases(panCanid2$LitterSize,
panCanid2$DispersalAge_d,
panCanid2$HomeRange_km2),]
Same function; we just add predictor variables with +:
panCanidModel2 = lm(HomeRange_km2 ~ LitterSize + DispersalAge_d, data=panCanid2)
summary(panCanidModel2)
##
## Call:
## lm(formula = HomeRange_km2 ~ LitterSize + DispersalAge_d, data = panCanid2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.25 -47.84 -13.82 36.70 160.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -206.284 78.959 -2.613 0.0227 *
## LitterSize 24.217 13.717 1.765 0.1029
## DispersalAge_d 0.629 0.337 1.866 0.0866 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.55 on 12 degrees of freedom
## Multiple R-squared: 0.4756, Adjusted R-squared: 0.3882
## F-statistic: 5.442 on 2 and 12 DF, p-value: 0.0208
Now that we take the effect of dispersal age into account, litter size is no longer significant. But dispersal age isn’t significant either (“almost significant” or “marginally significant” don’t count; it’s either below your alpha value or it’s not), which is incredibly frustrating. There must be something going on.
Let’s check for interactions. We can do so in the model by changing the + to a * to specify that it should include an interaction term between our main effects:
panCanidModel3 = lm(HomeRange_km2 ~ LitterSize * DispersalAge_d, data=panCanid2)
summary(panCanidModel3)
##
## Call:
## lm(formula = HomeRange_km2 ~ LitterSize * DispersalAge_d, data = panCanid2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.081 -20.475 -16.314 8.685 133.484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 312.64756 112.58178 2.777 0.01800 *
## LitterSize -64.69160 19.33256 -3.346 0.00652 **
## DispersalAge_d -1.92514 0.54285 -3.546 0.00458 **
## LitterSize:DispersalAge_d 0.42661 0.08471 5.036 0.00038 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.42 on 11 degrees of freedom
## Multiple R-squared: 0.8414, Adjusted R-squared: 0.7981
## F-statistic: 19.45 on 3 and 11 DF, p-value: 0.000105
Now everything is significant, including our interaction term.
What is an interaction?
“The effect of Factor A is different, depending on which level of Factor B we’re talking about.”
Source: Navarro, Section 16.2.1
Significant interactions are generally bad news, because it can be very difficult to explain them in a way that is biologically (or physically, or socially…) meaningful.
Sometimes it is helpful to visualize the effects of the two predictors to see what the interaction is. But to do so with measurement variables, we have to make them discrete so we can see the change across different levels. There are different ways to do this “binning.” For LitterSize, let’s split it into Low and High based on the median value:
panCanid2$LitterSizeD = ifelse(panCanid2$LitterSize < median(panCanid2$LitterSize),
"Low", "High")
And for DispersalAge_d, we’ll tell R we want to split it into 5 intervals using the cut() function:
panCanid2$DispersalAge_dD = cut(panCanid2$DispersalAge_d, 5)
You could also use quantiles, or cut2() from the Hmisc package, to cut into intervals containing equal numbers of values rather than equally spaced intervals.
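A sketch of that quantile approach (the Hmisc line assumes the package is installed; neither version was run in the original lesson):
# four bins with roughly equal numbers of species, based on quartiles
cut(panCanid2$DispersalAge_d,
    breaks=quantile(panCanid2$DispersalAge_d, probs=seq(0, 1, 0.25)),
    include.lowest=TRUE)
# or, equivalently:
# Hmisc::cut2(panCanid2$DispersalAge_d, g=4)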
ggplot(panCanid2, aes(x=DispersalAge_dD, y=HomeRange_km2, group=LitterSizeD)) +
geom_line(aes(color=LitterSizeD))
As I mentioned before, interactions are difficult to interpret. Usually, in an interaction plot, the lines will cross each other (showing that the relationship differs for each level of the factor) rather than being parallel (which would show similar effects); here, they do cross in the first level of dispersal age, though it’s difficult to see.
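A sketch of a more conventional version of this plot (an alternative, not what we ran above; the fun= argument requires a reasonably recent ggplot2): plotting the mean home range for each combination of bins can make the crossing easier to see.
ggplot(panCanid2, aes(x=DispersalAge_dD, y=HomeRange_km2,
                      group=LitterSizeD, color=LitterSizeD)) +
  stat_summary(fun=mean, geom="point") +
  stat_summary(fun=mean, geom="line")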