library(tidyverse)
library(magrittr)
data("USArrests")
names(USArrests)
[1] "Murder" "Assault" "UrbanPop" "Rape"
glimpse(USArrests)
Rows: 50
Columns: 4
$ Murder <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.6, 10.4, 7.2, 2.2, 6.0…
$ Assault <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 249, 113, 56, 115, 109, …
$ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 66, 52, 66, 51, 67, 85,…
$ Rape <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 20.2, 14.2, 24.0, 21.0,…
Variable ~ X1 + X2 + X3
Equation <- "Murder ~ Assault + UrbanPop + Rape"
Model <- lm(Equation,
data = USArrests)
summary(Model)
Call:
lm(formula = Equation, data = USArrests)
Residuals:
Min 1Q Median 3Q Max
-4.3990 -1.9127 -0.3444 1.2557 7.4279
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.276639 1.737997 1.885 0.0657 .
Assault 0.039777 0.005912 6.729 2.33e-08 ***
UrbanPop -0.054694 0.027880 -1.962 0.0559 .
Rape 0.061399 0.055740 1.102 0.2764
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.574 on 46 degrees of freedom
Multiple R-squared: 0.6721, Adjusted R-squared: 0.6507
F-statistic: 31.42 on 3 and 46 DF, p-value: 3.322e-11
# Murder_ESTIMATED = 3.276639 +
#   0.039777*Assault - 0.054694*UrbanPop + 0.061399*Rape + residual
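As a quick sanity check on that decomposition, the fitted values plus the residuals reproduce the observed Murder column exactly (a sketch using base functions):
# fitted + residuals should equal the observed values
all.equal(unname(fitted(Model) + residuals(Model)), USArrests$Murder)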
From the main output we get:
jtools::summ(Model)
MODEL INFO:
Observations: 50
Dependent Variable: Murder
Type: OLS linear regression
MODEL FIT:
F(3,46) = 31.42, p = 0.00
R² = 0.67
Adj. R² = 0.65
Standard errors: OLS
------------------------------------------------
Est. S.E. t val. p
----------------- ------- ------ -------- ------
(Intercept) 3.28 1.74 1.89 0.07
Assault 0.04 0.01 6.73 0.00
UrbanPop -0.05 0.03 -1.96 0.06
Rape 0.06 0.06 1.10 0.28
------------------------------------------------
TableResults <- broom::tidy(Model)
TableResults
coef(Model)
(Intercept) Assault UrbanPop Rape
3.27663918 0.03977717 -0.05469363 0.06139942
confint(Model)
2.5 % 97.5 %
(Intercept) -0.22176766 6.775046016
Assault 0.02787760 0.051676734
UrbanPop -0.11081365 0.001426387
Rape -0.05079988 0.173598724
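If we want the estimates and the confidence intervals in one tidy table, broom can produce both at once (conf.int and conf.level are arguments of broom::tidy() for lm objects):
# Coefficients plus 95% confidence intervals in a single tibble
broom::tidy(Model, conf.int = TRUE, conf.level = 0.95)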
plot(Model)
How do we interpret these plots?
Y_Estimated <- predict(Model)
Residuals <- residuals(Model)
Diagnostic <- data.frame(Y_Estimated, Residuals)
Diagnostic
ggplot(data = Diagnostic) +
geom_point(aes(x = Y_Estimated, y = Residuals)) +
ggtitle("Spread of the residuals") +
xlab("Estimated variable") + ylab("residuals") +
theme_light()
ggplot(data = Diagnostic) +
geom_histogram(aes(x = Residuals), binwidth = 0.7) +
theme_light()
Diagnostic %<>% arrange(Residuals)
Diagnostic
Observations <- dim(USArrests)[1]
# An equivalent alternative (also 50 values here): Probabilities <- seq(0.01, 0.99, 0.02)
Probabilities <- (1:Observations - 0.5)/Observations
Diagnostic %<>% mutate(Theoretical_Quantiles = qnorm(Probabilities))
Diagnostic
ggplot(data = Diagnostic) +
geom_point(aes(x = Theoretical_Quantiles, y = Residuals)) +
ggtitle("QQ Plot") +
xlab("Theoretical Quantiles") + ylab("residuals") +
theme_light()
Model$df.residual
[1] 46
estimated_sd <- sqrt(sum(Diagnostic$Residuals^2)/Model$df.residual)
Diagnostic %<>% mutate(Stand_Residuals = Residuals/estimated_sd)
Diagnostic
ggplot(data = Diagnostic) +
geom_point(aes(x = Theoretical_Quantiles, y = Stand_Residuals)) +
geom_line(aes(x = Theoretical_Quantiles, y = Theoretical_Quantiles)) +
ggtitle("QQ Plot") +
xlab("Theoretical Quantiles") + ylab("Standardized residuals") +
theme_light()
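ggplot2 also ships QQ geoms, so the same plot can be drawn without computing the theoretical quantiles by hand (a sketch using geom_qq()/geom_qq_line()):
ggplot(Diagnostic, aes(sample = Stand_Residuals)) +
geom_qq() +
geom_qq_line() +
ggtitle("QQ Plot (geom_qq)") +
theme_light()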
Diagnostic %<>%
mutate(sqrt_abs_std_resid = sqrt(abs(Stand_Residuals)) )
Diagnostic
ggplot(Diagnostic, aes(x = Y_Estimated, y = sqrt_abs_std_resid)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Predicted Values", y = "Scale-Location") +
ggtitle("Scale-Location Plot") +
theme_minimal()
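This is the same panel that plot(Model) produces as its third diagnostic plot; it can be requested directly with the which argument:
# Base-R equivalent of the scale-location plot
plot(Model, which = 3)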
The leverage plot, also known as the residuals-vs-leverage plot (and often annotated with Cook’s distance), is a diagnostic plot used in linear regression analysis. It helps identify influential observations that have a significant impact on the regression model’s results.
Leverage refers to the potential of an observation to have a disproportionate effect on the estimated regression coefficients. Observations with high leverage can have a substantial impact on the regression line, either by exerting a pulling effect or by being outliers.
In the leverage plot, each point represents an observation from the dataset. The vertical axis shows the standardized residuals, which indicate how far each observation’s actual value deviates from its predicted value, taking into account the variability of the residuals. The horizontal axis represents the leverage values of the observations, which measure the potential influence of each observation on the estimated regression coefficients.
Leverage values are calculated using the “Hat” matrix, which is a matrix that maps the observed values to the predicted values. High leverage values suggest that an observation has a greater ability to affect the estimated regression coefficients. Observations with high leverage are located towards the ends of the horizontal axis.
Influential Observations: Observations with high leverage and large residuals can be considered influential observations. They have the potential to significantly affect the regression model’s parameters and can distort the results if they are outliers or if their predictors have extreme values.
Cook’s Distance: In addition to the leverage plot, Cook’s distance is often shown on the same graph or as a separate plot. Cook’s distance measures the influence of each observation on the entire regression model. Large values of Cook’s distance suggest influential observations that significantly affect the regression model’s results.
By examining the leverage plot and Cook’s distance, you can identify influential observations that may require further investigation. These observations may have a substantial impact on the regression analysis and could potentially affect the model’s conclusions.
It’s important to note that the interpretation of influential observations should be done cautiously, considering the context and objectives of the analysis. Influential observations may warrant additional scrutiny, but they should not be automatically removed without careful consideration and justification.
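A minimal sketch of how these quantities can be extracted from the fitted model (hatvalues(), rstandard() and cooks.distance() are base stats functions; the names Influence, Leverage, Stand_Resid, Cooks_D and the 4/n cutoff are illustrative choices, not part of the original code):
# Leverage (diagonal of the hat matrix), standardized residuals and Cook's distance
Influence <- data.frame(Leverage = hatvalues(Model),
Stand_Resid = rstandard(Model),
Cooks_D = cooks.distance(Model))
# States whose Cook's distance exceeds the common 4/n rule of thumb
Influence[Influence$Cooks_D > 4/nrow(USArrests), ]
ggplot(Influence, aes(x = Leverage, y = Stand_Resid)) +
geom_point() +
ggtitle("Residuals vs Leverage") +
theme_light()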
VIF is calculated for each independent variable in the regression model. For each variable, the VIF is obtained by regressing that variable against all the other independent variables in the model.
VIF = 1 / (1 - R²), where R² is the R-squared of that auxiliary regression.
cor(USArrests)
Murder Assault UrbanPop Rape
Murder 1.00000000 0.8018733 0.06957262 0.5635788
Assault 0.80187331 1.0000000 0.25887170 0.6652412
UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
Rape 0.56357883 0.6652412 0.41134124 1.0000000
regclass::VIF(Model)
Assault UrbanPop Rape
1.794715 1.204229 2.015462
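As a check on the formula above, the VIF for Assault can be reproduced by hand: regress Assault on the other two predictors and plug its R² into 1/(1 - R²). A sketch (R2_Assault is just an illustrative name); the result should match the regclass::VIF value for Assault:
R2_Assault <- summary(lm(Assault ~ UrbanPop + Rape, data = USArrests))$r.squared
1/(1 - R2_Assault)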
car::durbinWatsonTest(Model)
lag Autocorrelation D-W Statistic p-value
1 0.1014268 1.77713 0.446
Alternative hypothesis: rho != 0
# H0 (null hypothesis): There is no correlation among the residuals.
# HA (alternative hypothesis): The residuals are autocorrelated.
# If the p-value were less than 0.05 we would reject the null hypothesis; here p = 0.446, so we fail to reject H0: there is no evidence that the residuals are autocorrelated
tseries::jarque.bera.test(residuals(Model))
Jarque Bera Test
data: residuals(Model)
X-squared = 3.2393, df = 2, p-value = 0.198
# The null hypothesis is that the data are normally distributed
# If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that the data are not normally distributed.
ks.test(residuals(Model), "pnorm")
Exact one-sample Kolmogorov-Smirnov test
data: residuals(Model)
D = 0.24998, p-value = 0.003069
alternative hypothesis: two-sided
# If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that the residuals do not follow a normal distribution.
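Note that ks.test(residuals(Model), "pnorm") compares the residuals against a standard normal with mean 0 and sd 1, while these residuals have a spread of roughly 2.57, which is largely why the test rejects. A sketch of a fairer comparison using the model's residual standard error (the p-value is only approximate when the sd is estimated from the same data):
ks.test(residuals(Model), "pnorm", mean = 0, sd = sigma(Model))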
lmtest::bptest(Model)
studentized Breusch-Pagan test
data: Model
BP = 2.9605, df = 3, p-value = 0.3978
# If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that there is evidence of heteroscedasticity in the residuals.
logLik(Model)
'log Lik.' -116.1404 (df=5)
AIC(Model)
[1] 242.2808
BIC(Model)
[1] 251.8409
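AIC and BIC are simple functions of the log-likelihood and the number of estimated parameters (here df = 5: four coefficients plus the residual variance), so the numbers above can be reproduced by hand:
k <- attr(logLik(Model), "df") # number of estimated parameters (5)
n <- nobs(Model) # number of observations (50)
-2*as.numeric(logLik(Model)) + 2*k # AIC = 242.2808
-2*as.numeric(logLik(Model)) + log(n)*k # BIC = 251.8409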
Model_1 <- lm("Murder ~ Assault + UrbanPop + Rape", data = USArrests)
Model_2 <- lm("Murder ~ Assault + UrbanPop", data = USArrests)
summary(Model_1)
Call:
lm(formula = "Murder ~ Assault + UrbanPop + Rape", data = USArrests)
Residuals:
Min 1Q Median 3Q Max
-4.3990 -1.9127 -0.3444 1.2557 7.4279
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.276639 1.737997 1.885 0.0657 .
Assault 0.039777 0.005912 6.729 2.33e-08 ***
UrbanPop -0.054694 0.027880 -1.962 0.0559 .
Rape 0.061399 0.055740 1.102 0.2764
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.574 on 46 degrees of freedom
Multiple R-squared: 0.6721, Adjusted R-squared: 0.6507
F-statistic: 31.42 on 3 and 46 DF, p-value: 3.322e-11
summary(Model_2)
Call:
lm(formula = "Murder ~ Assault + UrbanPop", data = USArrests)
Residuals:
Min 1Q Median 3Q Max
-4.5530 -1.7093 -0.3677 1.2284 7.5985
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.207153 1.740790 1.842 0.0717 .
Assault 0.043910 0.004579 9.590 1.22e-12 ***
UrbanPop -0.044510 0.026363 -1.688 0.0980 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.58 on 47 degrees of freedom
Multiple R-squared: 0.6634, Adjusted R-squared: 0.6491
F-statistic: 46.32 on 2 and 47 DF, p-value: 7.704e-12
summary(Model_1)$r.squared
[1] 0.6720656
summary(Model_2)$r.squared
[1] 0.6634156
summary(Model_1)$adj.r.squared
[1] 0.6506786
summary(Model_2)$adj.r.squared
[1] 0.6490928
logLik(Model_1)
'log Lik.' -116.1404 (df=5)
logLik(Model_2)
'log Lik.' -116.7913 (df=4)
AIC(Model_1)
[1] 242.2808
AIC(Model_2)
[1] 241.5826
BIC(Model_1)
[1] 251.8409
BIC(Model_2)
[1] 249.2307
anova(Model_1, Model_2)
Analysis of Variance Table
Model 1: Murder ~ Assault + UrbanPop + Rape
Model 2: Murder ~ Assault + UrbanPop
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 304.83
2 47 312.87 -1 -8.0407 1.2134 0.2764
If the p-value is small, then…?
The LRT is used to assess whether the more complex model (alternative model) provides a significantly better fit compared to the simpler model (null model).
A smaller p-value indicates stronger evidence against the null hypothesis (that the simpler, nested model fits the data as well as the more complex one).
Small p value: We can conclude that the more complex model provides a significantly better fit compared to the simpler model. The additional variables or complexity in the alternative model are considered important and contribute to the model’s improved fit.
Large p values: We do not have enough evidence to conclude that the more complex model provides a significantly better fit compared to the simpler model.
My note: I hate using the anova() function. It creates confusion. There are (at least) two other alternatives:
lmtest::lrtest(Model_1, Model_2)
Likelihood ratio test
Model 1: Murder ~ Assault + UrbanPop + Rape
Model 2: Murder ~ Assault + UrbanPop
#Df LogLik Df Chisq Pr(>Chisq)
1 5 -116.14
2 4 -116.79 -1 1.3018 0.2539
The likelihood ratio test statistic is based on the ratio of the likelihoods of the two models, i.e. minus twice the difference of their log-likelihoods:
LR = -2 * (log(L₀(β₀)) - log(L₁(β₁))), with df = p₁ - p₀ degrees of freedom
lrt <- -2 * (logLik(Model_2) - logLik(Model_1))
df <- length(coef(Model_1)) - length(coef(Model_2))
# Calculate the p-value using the chi-square distribution
p_value <- 1 - pchisq(lrt, df)
p_value
'log Lik.' 0.2538886 (df=4)
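The 'log Lik.' label is printed only because the logLik class carries through the subtraction; wrapping the result in as.numeric() gives a plain number:
as.numeric(p_value)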
# install.packages("datarium")
# devtools::install_github("kassambara/datarium")
library(datarium)
Warning: package ‘datarium’ was built under R version 4.2.3
data("marketing", package = "datarium")
marketing
cor(marketing)
youtube facebook newspaper sales
youtube 1.00000000 0.05480866 0.05664787 0.7822244
facebook 0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales 0.78222442 0.57622257 0.22829903 1.0000000
marketing %<>%
mutate(Total = youtube + facebook + newspaper) %>%
mutate(Prop_Newspaper = newspaper/Total) %>%
mutate(Prop_Youtube = youtube/Total)
ggplot(data = marketing) +
geom_point(aes(x = Prop_Newspaper, y = sales)) +
theme_light()
ggplot(data = marketing) +
geom_point(aes(x = Prop_Youtube, y = sales)) +
theme_light()
ggplot(data = marketing) +
geom_point(aes(x = newspaper, y = sales)) +
theme_light()
summary(marketing)
youtube facebook newspaper sales Total
Min. : 0.84 Min. : 0.00 Min. : 0.36 Min. : 1.92 Min. : 14.04
1st Qu.: 89.25 1st Qu.:11.97 1st Qu.: 15.30 1st Qu.:12.45 1st Qu.:148.26
Median :179.70 Median :27.48 Median : 30.90 Median :15.48 Median :248.82
Mean :176.45 Mean :27.92 Mean : 36.66 Mean :16.83 Mean :241.03
3rd Qu.:262.59 3rd Qu.:43.83 3rd Qu.: 54.12 3rd Qu.:20.88 3rd Qu.:337.35
Max. :355.68 Max. :59.52 Max. :136.80 Max. :32.40 Max. :520.32
Prop_Newspaper Prop_Youtube
Min. :0.001049 Min. :0.01429
1st Qu.:0.074894 1st Qu.:0.58580
Median :0.144936 Median :0.73248
Mean :0.178757 Mean :0.67809
3rd Qu.:0.234432 3rd Qu.:0.84160
Max. :0.654732 Max. :0.95846
Model_Sales_Marketing <- lm("sales ~ youtube + facebook + newspaper",
data = marketing)
summary(Model_Sales_Marketing)
Call:
lm(formula = "sales ~ youtube + facebook + newspaper", data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5932 -1.0690 0.2902 1.4272 3.3951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.526667 0.374290 9.422 <2e-16 ***
youtube 0.045765 0.001395 32.809 <2e-16 ***
facebook 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
plot(Model_Sales_Marketing)
summary(lm("sales ~ youtube + facebook + newspaper", data = marketing))
Call:
lm(formula = "sales ~ youtube + facebook + newspaper", data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5932 -1.0690 0.2902 1.4272 3.3951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.526667 0.374290 9.422 <2e-16 ***
youtube 0.045765 0.001395 32.809 <2e-16 ***
facebook 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
summary(lm("sales ~ youtube + facebook", data = marketing))
Call:
lm(formula = "sales ~ youtube + facebook", data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5572 -1.0502 0.2906 1.4049 3.3994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.50532 0.35339 9.919 <2e-16 ***
youtube 0.04575 0.00139 32.909 <2e-16 ***
facebook 0.18799 0.00804 23.382 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.018 on 197 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
summary(lm("sales ~ youtube", data = marketing))
Call:
lm(formula = "sales ~ youtube", data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.0632 -2.3454 -0.2295 2.4805 8.6548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.439112 0.549412 15.36 <2e-16 ***
youtube 0.047537 0.002691 17.67 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.91 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
summary(lm("sales ~ facebook + newspaper", data = marketing))
Call:
lm(formula = "sales ~ facebook + newspaper", data = marketing)
Residuals:
Min 1Q Median 3Q Max
-18.6347 -2.5739 0.8778 3.3188 9.5701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.026705 0.753206 14.640 <2e-16 ***
facebook 0.199045 0.021870 9.101 <2e-16 ***
newspaper 0.006644 0.014909 0.446 0.656
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.14 on 197 degrees of freedom
Multiple R-squared: 0.3327, Adjusted R-squared: 0.3259
F-statistic: 49.11 on 2 and 197 DF, p-value: < 2.2e-16
summary(lm("sales ~ newspaper", data = marketing))
Call:
lm(formula = "sales ~ newspaper", data = marketing)
Residuals:
Min 1Q Median 3Q Max
-13.473 -4.065 -1.007 4.207 15.330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.82169 0.74570 19.88 < 2e-16 ***
newspaper 0.05469 0.01658 3.30 0.00115 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.111 on 198 degrees of freedom
Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
data("Salaries", package = "carData")
Salaries
Model_1 <- lm("salary ~ sex", data = Salaries)
summary(Model_1)
Call:
lm(formula = "salary ~ sex", data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-57290 -23502 -6828 19710 116455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101002 4809 21.001 < 2e-16 ***
sexMale 14088 5065 2.782 0.00567 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30030 on 395 degrees of freedom
Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
class(Salaries$sex)
[1] "factor"
levels(Salaries$sex)
[1] "Female" "Male"
Salaries %<>%
mutate(Sex_Character = sex) %>%
mutate(Sex_Male = if_else(Sex_Character == "Male", 1, 0),
Sex_Female = if_else(Sex_Character == "Female", 1, 0))
# Salary = B0 + B_M*Male
Salaries
Model_2 <- lm("salary ~ Sex_Male", data = Salaries)
summary(Model_2)
Call:
lm(formula = "salary ~ Sex_Male", data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-57290 -23502 -6828 19710 116455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101002 4809 21.001 < 2e-16 ***
Sex_Male 14088 5065 2.782 0.00567 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30030 on 395 degrees of freedom
Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
Model_3 <- lm("salary ~ Sex_Female", data = Salaries)
summary(Model_3)
Call:
lm(formula = "salary ~ Sex_Female", data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-57290 -23502 -6828 19710 116455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 115090 1587 72.503 < 2e-16 ***
Sex_Female -14088 5065 -2.782 0.00567 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30030 on 395 degrees of freedom
Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
Salaries %>% group_by(sex) %>% summarise(MeanPerSex = mean(salary),
PeopleInSurvey = n())
101002.4 - 115090.4
[1] -14088
names(Salaries)
[1] "rank" "discipline" "yrs.since.phd" "yrs.service" "sex" "salary"
[7] "Sex_Character" "Sex_Male" "Sex_Female"
Model_1 <- Salaries %>% lm("salary ~ yrs.service + discipline", data = .)
summary(Model_1)
Call:
lm(formula = "salary ~ yrs.service + discipline", data = .)
Residuals:
Min 1Q Median 3Q Max
-77537 -19699 -5135 15631 106625
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 91335.8 3005.4 30.391 < 2e-16 ***
yrs.service 862.8 109.2 7.904 2.73e-14 ***
disciplineB 13184.0 2846.8 4.631 4.95e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27870 on 394 degrees of freedom
Multiple R-squared: 0.1579, Adjusted R-squared: 0.1536
F-statistic: 36.94 on 2 and 394 DF, p-value: 1.983e-15
class(Salaries$discipline)
[1] "factor"
levels(Salaries$discipline)
[1] "A" "B"
Salaries$discipline <- as.character(Salaries$discipline)
Salaries$discipline <- factor(Salaries$discipline, levels = c("B", "A"))
Salaries %>% group_by(discipline) %>% summarize(Mean = mean(salary), N = n())
names(Salaries)
[1] "rank" "discipline" "yrs.since.phd" "yrs.service" "sex" "salary"
[7] "Sex_Character" "Sex_Male" "Sex_Female"
names(Salaries)[4] <- "Years_Service"
names(Salaries)
[1] "rank" "discipline" "yrs.since.phd" "Years_Service" "sex" "salary"
[7] "Sex_Character" "Sex_Male" "Sex_Female"
Salaries %<>%
mutate(Sociology = if_else(discipline == "A", 1, 0),
Engineer = if_else(discipline == "B", 1, 0)) %>%
mutate(Interac_YrServices_Engineer = Years_Service*Engineer) %>%
mutate(Interac_YrServices_Sociology = Years_Service*Sociology)
Equation_0 <- "salary ~ Years_Service + discipline + Years_Service*discipline"
Equation_1 <- "salary ~ Years_Service + Sociology + Years_Service*Sociology"
Equation_2 <- "salary ~ Years_Service + Sociology + Interac_YrServices_Sociology"
summary(lm(Equation_0, data = Salaries))
Call:
lm(formula = Equation_0, data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-86326 -19779 -4999 16091 102274
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98895.4 3068.4 32.230 < 2e-16 ***
Years_Service 1222.0 155.2 7.874 3.37e-14 ***
disciplineA -857.4 4750.7 -0.180 0.85687
Years_Service:disciplineA -695.2 215.9 -3.220 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27540 on 393 degrees of freedom
Multiple R-squared: 0.1795, Adjusted R-squared: 0.1733
F-statistic: 28.67 on 3 and 393 DF, p-value: < 2.2e-16
summary(lm(Equation_1, data = Salaries))
Call:
lm(formula = Equation_1, data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-86326 -19779 -4999 16091 102274
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98895.4 3068.4 32.230 < 2e-16 ***
Years_Service 1222.0 155.2 7.874 3.37e-14 ***
Sociology -857.4 4750.7 -0.180 0.85687
Years_Service:Sociology -695.2 215.9 -3.220 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27540 on 393 degrees of freedom
Multiple R-squared: 0.1795, Adjusted R-squared: 0.1733
F-statistic: 28.67 on 3 and 393 DF, p-value: < 2.2e-16
summary(lm(Equation_2, data = Salaries))
Call:
lm(formula = Equation_2, data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-86326 -19779 -4999 16091 102274
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98895.4 3068.4 32.230 < 2e-16 ***
Years_Service 1222.0 155.2 7.874 3.37e-14 ***
Sociology -857.4 4750.7 -0.180 0.85687
Interac_YrServices_Sociology -695.2 215.9 -3.220 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27540 on 393 degrees of freedom
Multiple R-squared: 0.1795, Adjusted R-squared: 0.1733
F-statistic: 28.67 on 3 and 393 DF, p-value: < 2.2e-16
names(Salaries)
[1] "rank" "discipline" "yrs.since.phd"
[4] "Years_Service" "sex" "salary"
[7] "Sex_Character" "Sex_Male" "Sex_Female"
[10] "Sociology" "Engineer" "Interac_YrServices_Engineer"
[13] "Interac_YrServices_Sociology"
Equation_1 <- "salary ~ Years_Service + sex + discipline + sex*discipline"
Equation_2 <- "salary ~ Years_Service + Sex_Male + Sociology + Sex_Male*Sociology"
Equation_3 <- "salary ~ Years_Service + Sex_Male + Engineer + Sex_Male*Engineer"
summary(lm(Equation_1, data = Salaries))
Call:
lm(formula = Equation_1, data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-77749 -19940 -5111 16007 104763
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101606.3 6198.6 16.392 < 2e-16 ***
Years_Service 825.3 110.4 7.476 5.06e-13 ***
sexMale 3877.7 6400.1 0.606 0.5449
disciplineA -21986.2 8925.0 -2.463 0.0142 *
sexMale:disciplineA 9962.5 9415.8 1.058 0.2907
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27790 on 392 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.1585
F-statistic: 19.64 on 4 and 392 DF, p-value: 9.487e-15
summary(lm(Equation_2, data = Salaries))
Call:
lm(formula = Equation_2, data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-77749 -19940 -5111 16007 104763
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101606.3 6198.6 16.392 < 2e-16 ***
Years_Service 825.3 110.4 7.476 5.06e-13 ***
Sex_Male 3877.7 6400.1 0.606 0.5449
Sociology -21986.2 8925.0 -2.463 0.0142 *
Sex_Male:Sociology 9962.5 9415.8 1.058 0.2907
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27790 on 392 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.1585
F-statistic: 19.64 on 4 and 392 DF, p-value: 9.487e-15
summary(lm(Equation_3, data = Salaries))
Call:
lm(formula = Equation_3, data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-77749 -19940 -5111 16007 104763
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79620.1 6669.9 11.937 < 2e-16 ***
Years_Service 825.3 110.4 7.476 5.06e-13 ***
Sex_Male 13840.2 6979.6 1.983 0.0481 *
Engineer 21986.2 8925.0 2.463 0.0142 *
Sex_Male:Engineer -9962.5 9415.8 -1.058 0.2907
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27790 on 392 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.1585
F-statistic: 19.64 on 4 and 392 DF, p-value: 9.487e-15
“Only the dependent/response variable is log-transformed. Exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.
Only independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For x percent increase, multiply the coefficient by log(1.x). Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02.
Both dependent/response variable and independent/predictor variable(s) are log-transformed. Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 – 1) * 100 = 3.7 percent.”
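The worked numbers in the quote can be reproduced directly in R (0.198 is just the example coefficient used in the quote):
b <- 0.198 # example coefficient from the quote
(exp(b) - 1) * 100 # log(y) ~ x: ~21.9% change in y per one-unit change in x
b / 100 # y ~ log(x): ~0.00198 change in y per 1% change in x
b * log(1.10) # y ~ log(x): ~0.02 change in y for a 10% increase in x
(1.20^b - 1) * 100 # log(y) ~ log(x): ~3.7% change in y for a 20% increase in x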
Marginal effects refer to the change in the predicted value of the dependent variable that results from a one-unit change in one of the independent variables, while holding all other independent variables constant.
Think of it as the slope of the curve at each specific point.
We could:
# install.packages("margins")
library(margins)
mfx <- margins(m, variables = "x")
summary(mfx)
library(wooldridge)
data("card")
library(jtools)
export_summs()
mfx::logitmfx()
mfx::probitmfx()
margins::margins()
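For an ordinary linear model the marginal effect of each predictor is constant and equal to its coefficient, so margins() mostly becomes interesting for non-linear models (logit, probit, interactions, polynomials). A minimal sketch on the marketing model fitted earlier (mfx_sales is just an illustrative name; the margins package must be installed):
# install.packages("margins")
library(margins)
mfx_sales <- margins(Model_Sales_Marketing)
summary(mfx_sales) # for OLS these average marginal effects equal the coefficients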