Introduction
Linear Regression is one of the simplest, most intuitive, and most widely used modeling techniques. It predicts a quantitative response and thus falls under the umbrella of Supervised Learning. A search on Google Scholar for the term “Linear Regression” returns over 4 million results, and there are many good-quality courses available online that explain Linear Regression for free. Today there are many sophisticated techniques available for predictive and prescriptive analytics, but a solid foundation in Linear Regression is a must for understanding the nuances of modeling before moving on to more advanced techniques.
In its simplest form, a linear model is expressed as:
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_kx_k + \epsilon\]
where:
- \(y\) is the quantitative response variable, i.e., what we want to predict
- \(\beta_0\) is the intercept; it is not multiplied by any feature or predictor variable
- \(x_1, x_2, ..., x_k\) are predictor variables
- \(\beta_1, \beta_2, ..., \beta_k\) are the coefficients of their respective predictor variables
- \(\epsilon\) is the error term: the difference between the actual and predicted outcomes. In Linear Regression we assume the errors are centered around zero, with many small positive and negative errors and only a few large ones. We will be analyzing residuals later in the post.
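To make the notation concrete, here is a minimal sketch (on simulated data, not the climate data we use below) that generates observations from a known linear model and recovers the coefficients with lm():
# simulate y = 2 + 3*x1 - 1.5*x2 + epsilon and recover the coefficients
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)
coef(lm(y ~ x1 + x2))  # estimates should land close to 2, 3, and -1.5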
Climate Change
We will use the “climate_change” data provided in the course The Analytics Edge, offered by MIT through edX, for this exercise.
Preparing Data
There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.
In this problem, we will attempt to study the relationship between average global temperature and several other factors.
The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:
- Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.
- CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division.
- CO2, N2O and CH4 are expressed in ppmv (parts per million by volume – i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere)
- CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).
- Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun’s energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.
- TSI: the total solar irradiance (TSI) in W/m2 (the rate at which the sun’s energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.
- MEI: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.
We start by loading all the packages we need for data analysis.
library(tidyverse)
library(psych)
library(broom)
library(modelr)
library(ggridges)
library(ggrepel)
library(lubridate)
library(reshape)
library(kableExtra)
Reading data into R is super easy. The read_csv() function reads a .csv file into R and takes care of headers and data types automatically. If you need to pass some parameters you can always do that, but most of the time it’s not required. You can learn more about read_csv() or any other function by typing ?read_csv in your R console. It’s worth noting that read_csv() returns a “tibble” whereas read.csv() returns a data frame.
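A quick way to see this difference for yourself (assuming climate_change.csv is in your working directory):
# read_csv() returns a tibble; read.csv() returns a plain data frame
class(read_csv("climate_change.csv"))  # "spec_tbl_df" "tbl_df" "tbl" "data.frame"
class(read.csv("climate_change.csv"))  # "data.frame"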
# reading data into R
climate <- read_csv("climate_change.csv")
# printing header of climate data
knitr::kable(head(climate)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | Temp |
---|---|---|---|---|---|---|---|---|---|---|
1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.102 | 0.0863 | 0.109 |
1983 | 6 | 2.167 | 345.52 | 1633.71 | 303.746 | 192.057 | 351.848 | 1366.121 | 0.0794 | 0.118 |
1983 | 7 | 1.741 | 344.15 | 1633.22 | 303.795 | 192.818 | 353.725 | 1366.285 | 0.0731 | 0.137 |
1983 | 8 | 1.130 | 342.25 | 1631.35 | 303.839 | 193.602 | 355.633 | 1366.420 | 0.0673 | 0.176 |
1983 | 9 | 0.428 | 340.17 | 1648.40 | 303.901 | 194.392 | 357.465 | 1366.234 | 0.0619 | 0.149 |
1983 | 10 | 0.002 | 340.30 | 1663.79 | 303.970 | 195.171 | 359.174 | 1366.059 | 0.0569 | 0.093 |
When we’re dealing with an unknown dataset, it’s always a good first step to look at its structure. R provides the str() function exactly for this purpose.
str(climate)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 308 obs. of 11 variables:
## $ Year : num 1983 1983 1983 1983 1983 ...
## $ Month : num 5 6 7 8 9 10 11 12 1 2 ...
## $ MEI : num 2.556 2.167 1.741 1.13 0.428 ...
## $ CO2 : num 346 346 344 342 340 ...
## $ CH4 : num 1639 1634 1633 1631 1648 ...
## $ N2O : num 304 304 304 304 304 ...
## $ CFC-11 : num 191 192 193 194 194 ...
## $ CFC-12 : num 350 352 354 356 357 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
## $ Temp : num 0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
## - attr(*, "spec")=
## .. cols(
## .. Year = col_double(),
## .. Month = col_double(),
## .. MEI = col_double(),
## .. CO2 = col_double(),
## .. CH4 = col_double(),
## .. N2O = col_double(),
## .. `CFC-11` = col_double(),
## .. `CFC-12` = col_double(),
## .. TSI = col_double(),
## .. Aerosols = col_double(),
## .. Temp = col_double()
## .. )
We can see there are 11 numeric variables with 308 observations.
Next, we will use the summary() function to get a quick summary of the dataset.
summary(climate)
## Year Month MEI CO2
## Min. :1983 Min. : 1.000 Min. :-1.6350 Min. :340.2
## 1st Qu.:1989 1st Qu.: 4.000 1st Qu.:-0.3987 1st Qu.:353.0
## Median :1996 Median : 7.000 Median : 0.2375 Median :361.7
## Mean :1996 Mean : 6.552 Mean : 0.2756 Mean :363.2
## 3rd Qu.:2002 3rd Qu.:10.000 3rd Qu.: 0.8305 3rd Qu.:373.5
## Max. :2008 Max. :12.000 Max. : 3.0010 Max. :388.5
## CH4 N2O CFC-11 CFC-12 TSI
## Min. :1630 Min. :303.7 Min. :191.3 Min. :350.1 Min. :1365
## 1st Qu.:1722 1st Qu.:308.1 1st Qu.:246.3 1st Qu.:472.4 1st Qu.:1366
## Median :1764 Median :311.5 Median :258.3 Median :528.4 Median :1366
## Mean :1750 Mean :312.4 Mean :252.0 Mean :497.5 Mean :1366
## 3rd Qu.:1787 3rd Qu.:317.0 3rd Qu.:267.0 3rd Qu.:540.5 3rd Qu.:1366
## Max. :1814 Max. :322.2 Max. :271.5 Max. :543.8 Max. :1367
## Aerosols Temp
## Min. :0.00160 Min. :-0.2820
## 1st Qu.:0.00280 1st Qu.: 0.1217
## Median :0.00575 Median : 0.2480
## Mean :0.01666 Mean : 0.2568
## 3rd Qu.:0.01260 3rd Qu.: 0.4073
## Max. :0.14940 Max. : 0.7390
With a single line of code we can see a numerical statistical summary of every variable in the dataset. This step is very helpful for identifying outliers or other anomalies in the data.
Exploratory Data Analysis (EDA)
Before we start with model building, it makes sense to familiarize ourselves with the data. This initial step is known as “Exploratory Data Analysis”, or EDA for short. There are no hard and fast rules for this phase; it depends on your intuition and on asking meaningful questions as you learn more about the data itself. Let’s begin by analyzing the variables one by one.
1) Temperature
Temperature is the response variable in this exercise.
# adding Year-Month variable as date
climate1 <- climate %>%
  mutate(year_month = ymd(paste(Year, Month), truncated = 1))
# plotting Temperature by Year
ggplot(climate1, aes(year_month, Temp)) +
geom_line() +
geom_smooth(se=FALSE, linetype = "dotted") +
labs(title = "Temperature (1983-2010)",
x = "Year",
y = "Temperature") +
theme_minimal()
We can see from the plot that there has been a steady rise in temperature over the years, but since 2005 the curve has plateaued. Whether this is a permanent shift in the trend or just seasonality, we don’t know, and discussing it is beyond the scope of this post. However, let’s see if there’s any seasonality to temperature across the year.
# adding right and left hand sided labels
climate1 <- climate1 %>%
mutate(label_rt = if_else(Month == 12 & Year%%2 == 0,
as.character(Year), NA_character_)) %>%
mutate(label_lt = if_else(Month == 1 & Year%%2 != 0,
as.character(Year), NA_character_))
# creating label zones for left and right sided
x_lt <- c(NA,1)
x_rt <- c(12,NA)
# Temperature Month-wise plot for each year in data
ggplot(climate1, aes(as.factor(Month), Temp)) +
geom_point(aes(color = as.factor(Year))) +
geom_line(aes(group = as.factor(Year),
color = as.factor(Year)),
alpha = 0.7) +
labs(title = 'Temperature by month') +
xlab("Months") +
ylab("Temperature") +
geom_text_repel(aes(label = label_rt, color = as.factor(Year)),
nudge_x = 1, xlim = x_rt) +
geom_text_repel(aes(label = label_lt, color = as.factor(Year)),
nudge_x = 1, xlim = x_lt) +
scale_x_discrete(expand=c(0.1,0),
breaks=c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12"),
labels=c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12"),
limits=c("-1","1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "13"), drop=FALSE) +
theme(legend.position = "none")
We can see that, generally, the temperature has been steadily rising across years (same as the last plot), but I find this plot a little cluttered, so let’s try another approach and plot a ‘Temperature density’ distribution.
ggplot(climate1, aes(x = Temp, y = as.factor(Year))) +
geom_density_ridges_gradient(aes(fill = ..x..),
                             scale = 3, size = 0.3, alpha = 0.8) +
geom_vline(xintercept = 0.5, alpha = 0.5, color = "red", linetype = "dotted") +
scale_fill_gradientn(colours = c("#87CEFA", "#FFFFE0", "#FF0000"),
name = "Temp") +
labs(title = 'Temperature density',
subtitle = "Temperature is difference in degrees Celsius between the \n average global temperature in that period and a reference value") +
theme(legend.position = c(0.9,0.2)) +
xlab("Temperature") +
ylab("Year") +
theme_classic()
The plot reveals that over the last few decades there has been a sustained shift toward higher temperatures. Although there is a dip from 2005 through 2008, as we saw in the other plots, we notice that extreme temperatures (>0.5°C) have occurred in every year since 2000.
2) MEI
MEI, the “Multivariate El Nino Southern Oscillation index”, is a prime predictor of global climate disruptions. MEI is time-series data that combines multiple atmospheric and oceanic variables. Real-time monitoring of MEI enables governments to tackle regional issues affected by climate and to plan for food and water supply, health and safety, etc.
climate1 %>%
arrange(year_month) %>%
ggplot() +
geom_line(aes(year_month, MEI, color = MEI), size = 1) +
scale_color_gradient2(low = 'blue', high = 'red') +
labs(title = "Multivariate El Nino Southern Oscillation index by Year",
caption = "Source: ESRL/NOAA Physical Sciences Division") +
xlab("Year") +
ylab("MEI") +
theme(plot.title = element_text(hjust = 0.5)) +
theme_minimal()
There is definitely a seasonality to MEI but not a continuous positive or negative trend over the years.
3) Carbon dioxide
Carbon dioxide is a greenhouse gas which traps solar energy and helps warm the planet. Since the Industrial Revolution, humans have significantly added to naturally occurring \(CO_2\) levels, which may have a significant impact on rising global temperature. \(CO_2\) levels in the atmosphere have been increasing steadily over the period covered by this data. In model building it’d be interesting to know how temperature correlates with \(CO_2\) levels.
climate1 %>%
arrange(year_month) %>%
ggplot(aes(year_month, CO2)) +
geom_line() +
geom_smooth(se = FALSE, linetype = "dotted", color = "red") +
labs(title = "Carbon-dioxide by Year",
caption = "Source: ESRL/NOAA Global Monitoring Division") +
xlab("Year") +
ylab("CO2 (ppmv)") +
theme(plot.title = element_text(hjust = 0.5)) +
theme_classic()
4) Methane
Methane is also a greenhouse gas, and it traps more energy than carbon dioxide. We can see that its levels are constantly rising too; however, the curve has somewhat flattened around 2000.
climate1 %>%
arrange(year_month) %>%
ggplot(aes(year_month, CH4)) +
geom_line() +
geom_smooth(se=FALSE, linetype = "dotted", color = "red") +
labs(title = "Methane by Year",
caption = "Source: ESRL/NOAA Global Monitoring Division") +
xlab("Year") +
ylab("CH4 (ppmv)") +
theme(plot.title = element_text(hjust = 0.5)) +
theme_classic()
5) Nitrous Oxide
Nitrous oxide, another greenhouse gas, has a much higher Global Warming Potential (GWP) than \(CO_2\) or \(CH_4\). As we can see, all three of the biggest man-made contributors to greenhouse gases are on a constant rise.
climate1 %>%
arrange(year_month) %>%
ggplot() +
geom_line(aes(year_month, N2O)) +
labs(title = "Nitrous Oxide by Year",
caption = "Source: ESRL/NOAA Global Monitoring Division") +
xlab("Year") +
ylab("N2O (ppmv)") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
6) CFC-11 and CFC-12
We can see from the plots below that, after rising steadily through the early 1990s, the levels of CFC-11 and CFC-12 begin to decline. Read more about the Kyoto Protocol and the Montreal Protocol to understand the drivers behind this decline.
cfc11 <- climate1 %>%
arrange(year_month) %>%
ggplot() +
geom_line(aes(year_month, `CFC-11`)) +
labs(title = "CFC-11",
caption = "Source: ESRL/NOAA Global Monitoring Division") +
xlab("Year") +
ylab("CFC-11 (ppbv)") +
theme_classic()
cfc12 <- climate1 %>%
arrange(year_month) %>%
ggplot() +
geom_line(aes(year_month, `CFC-12`)) +
labs(title = "CFC-12",
caption = "Source: ESRL/NOAA Global Monitoring Division") +
xlab("Year") +
ylab("CFC-12 (ppbv)") +
theme_classic()
gridExtra::grid.arrange(cfc11, cfc12, nrow=1)
7) TSI
The TSI pattern exhibits a roughly 10-year cycle, or at least that’s what it looks like from the plot. I was interested to know more about it, so I googled “total solar irradiance cyclic?” and it took me to the Solar Cycle wiki page, which explains that “The solar cycle or solar magnetic activity cycle is the nearly periodic 11-year change in the Sun’s activity (including changes in the levels of solar radiation and ejection of solar material) and appearance (changes in the number and size of sunspots, flares, and other manifestations).”
climate1 %>%
arrange(year_month) %>%
ggplot(aes(year_month, TSI)) +
geom_line() +
labs(title = "Total Solar Irradiance (TSI)",
caption = "Source: SOLARIS-HEPPA project") +
xlab("Year") +
ylab("TSI (W/m^2)") +
theme_classic()
8) Aerosols
Well, at first glance it looks like something is off: the peak around 1992 looks like an anomaly or a data issue. However, if you go to the Goddard Institute for Space Studies at NASA website, you’ll notice that there is no problem with this data. It is accurate, and a bit of research will tell you that in 1991 Mount Pinatubo erupted, causing aerosols to form a global layer of sulfuric acid haze which caused global temperatures to drop by about 0.5°C.
climate1 %>%
arrange(year_month) %>%
ggplot() +
geom_line(aes(year_month, Aerosols)) +
labs(title = "Mean Stratospheric Aerosol Optical Depth at 550 nm",
caption = "Source: Godard Institute for Space Studies at NASA") +
xlab("Year") +
ylab("AOD")
Setting up Training and Testing data
In Machine Learning it’s pretty common to divide data into a training set and a testing set. The training set is used to learn the relationships between a) the response and predictor variables and b) the predictors themselves. We then apply the fitted model to the testing dataset to check the strength of the model.
There are many ways to split up data. One way is simply to split the data at random in whatever ratio makes sense, such as 60/40 or 80/20; a sketch of this approach appears below. This makes sense if there isn’t a particular pattern in the response variable. However, if you have a qualitative response variable such as “Yes” or “No”, “Democrat” or “Republican”, etc., you want to make sure that the training and testing datasets retain the same proportion of each response as in the original dataset. There are multiple approaches to do that and we will discuss some of them in upcoming posts.
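For illustration, here is a minimal sketch of a plain random 80/20 split in base R (we won’t use it for this dataset, since the data has a time structure):
# illustration only: a random 80/20 train/test split
set.seed(123)
train_idx <- sample(seq_len(nrow(climate)), size = floor(0.8 * nrow(climate)))
random_train <- climate[train_idx, ]
random_test  <- climate[-train_idx, ]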
For the climate dataset, we’ll put the data for all years up to and including 2006 in the training set and the rest in the testing set.
climate_train <- climate %>%
filter(Year <= 2006)
climate_test <- climate %>%
filter(Year > 2006)
Simple Linear Regression
Simple Linear Regression looks like this: \[Y = \beta_0 + \beta_1X + \epsilon\] where \(Y\) is the response variable and \(X\) is the predictor variable. In essence, there is only one predictor variable. To build a simple linear regression, let’s use \(CO_2\) as the predictor variable.
Model Building
Before we build a Simple Linear model, let’s understand what a “Baseline model” is and build one. A Baseline model predicts a single constant value: the mean for a quantitative response variable, or the most common value for a qualitative/classification response variable. For example, if you’re predicting rainfall for next month, a baseline model will use the average of all monthly rainfall data as the prediction; if you’re trying to predict who wins the next soccer World Cup, a baseline model will simply predict the team with the most wins. In this particular example, the baseline model will always predict the average temperature, which is:
mean(climate_train$Temp)
## [1] 0.2477993
Plotting the Baseline model against \(CO_2\) looks like this:
climate_train_baseline <- climate_train %>%
mutate(baseline_temp = mean(Temp))
ggplot(climate_train_baseline, aes(CO2, baseline_temp)) +
geom_point(alpha = 0.2) +
stat_smooth(geom = "line", se = FALSE, color = "red", alpha = 0.8) +
labs(title = "Baseline Temperature-CO2 model",
x = "CO2",
y = "Baseline Temperature") +
theme(plot.title = element_text(hjust = 0.5))
Basically, as stated above, this model will always predict the same value for Temperature regardless of the \(CO_2\) value. The error of this model, defined as the difference between actual and predicted values, can be calculated as follows:
climate_train_baseline <- climate_train_baseline %>%
mutate(error = Temp - baseline_temp)
ggplot(climate_train_baseline, aes(CO2, error)) +
geom_point(aes(color = ifelse(error > 0, "red", "blue"))) +
scale_color_identity() +
labs(title = "Error plot for Baseline model",
x = "CO2",
y = "Error")
We can see that we have both positive and negative errors, and they are distributed fairly uniformly on both sides of zero. The plot above is great for analyzing errors visually, but to compare the accuracy and strength of models we use a few other statistics such as ‘Sum of Squared Errors’, ‘Root Mean Squared Error’, and ‘\(R^2\)’.
The ‘Sum of Squared Errors’ (SSE) for the baseline model is calculated as:
\[SSE = \sum_{i=1}^N(y_i - \bar{y})^{2} \]
where \(\bar{y}\) is the mean of the response, since the baseline always predicts the mean.
(SSE_baseline <- sum(climate_train_baseline$error^2))
## [1] 9.285256
Note that the SSE of the Baseline model is also known as SST, i.e., the Total Sum of Squares.
SSE is a good indicator of model accuracy, but it depends on the number of observations (N): if we double the observations, the SSE will roughly double as well, which makes it unsuitable for comparing models across datasets of different sizes. Hence we use the ‘Root Mean Squared Error’ (RMSE), which is the square root of SSE divided by N.
\[RMSE = \sqrt[]{\frac{SSE}{N}}\]
(RMSE_baseline <- sqrt(SSE_baseline/nrow(climate_train_baseline)))
## [1] 0.1808164
Now that we understand our baseline model, we’re ready to build our Simple Linear model.
\[Temp = \beta_0 + \beta_1*CO_2 + \epsilon \]
model1 <- lm(Temp ~ CO2, data = climate_train)
We can look at a summary of the model by using the summary() function.
summary(model1)
##
## Call:
## lm(formula = Temp ~ CO2, data = climate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31912 -0.07819 -0.00129 0.05805 0.43420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2646515 0.2096815 -20.34 <2e-16 ***
## CO2 0.0124855 0.0005799 21.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1116 on 282 degrees of freedom
## Multiple R-squared: 0.6218, Adjusted R-squared: 0.6204
## F-statistic: 463.6 on 1 and 282 DF, p-value: < 2.2e-16
We observe that the coefficient of \(CO_2\) is 0.012, which means that for every increase of 10 ppmv in \(CO_2\), the temperature difference increases by about 0.12°C. The p-value establishes that the coefficient is statistically significant.
Note: \(R^2\) ( = 1 - \(\frac{SSE}{SST}\)) depends on SSE and SST (SST is the SSE of the baseline model). The \(R^2\) of a model evaluated on its training dataset is always between 0 and 1: a value of 0 means no improvement over the Baseline and 1 means a perfect predictive model. In other words, \(R^2\) tells us how much better our model is than the Baseline model. The \(R^2\) for the Simple model is 0.6218, which is a significant improvement over the Baseline model.
The summary() function also outputs ‘Adjusted \(R^2\)’, which adjusts the ‘Multiple \(R^2\)’ to account for the number of independent variables used relative to the number of data points. ‘Multiple \(R^2\)’ never decreases when you add an independent variable, whereas ‘Adjusted \(R^2\)’ will decrease if you add an independent variable that reduces the quality of the model.
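To make the adjustment concrete: with \(n\) observations and \(k\) predictors, \[R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\] We can verify this against the summary output (here \(n\) = 284 training observations and \(k\) = 1):
# adjusted R-squared from multiple R-squared, with n = 284 and k = 1
1 - (1 - 0.6217783) * (284 - 1) / (284 - 1 - 1)
# ~0.6204371, matching the Adjusted R-squared reported by summary(model1)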
Let’s calculate SSE and RMSE for the Simple Linear model:
climate_train <- climate_train %>%
add_predictions(model1) %>%
mutate(error_simple = Temp - pred)
(SSE_simple_model = sum(climate_train$error_simple^2))
## [1] 3.511885
(RMSE_simple_model = sqrt(SSE_simple_model/nrow(climate_train)))
## [1] 0.1112016
We can see that the Simple model has significantly lower SSE and RMSE than the Baseline model.
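As a quick consistency check, we can recover the model’s \(R^2\) from these two quantities:
# R-squared = 1 - SSE/SST, where SST is the baseline model's SSE
1 - SSE_simple_model / SSE_baseline
# ~0.6217783, matching the Multiple R-squared reported by summary(model1)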
Interpret model
We’re working with a single-variable model and we can already see it’s much better than the baseline model. Now let’s plot the fitted line versus the actual data points on a Temperature vs. Carbon dioxide plot.
aug_model1 <- augment(model1, climate_train)
ggplot(aug_model1, aes(CO2, Temp)) +
geom_point(color = "red") +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
geom_segment(aes(x = CO2, xend = CO2,
                 y = Temp, yend = .fitted), alpha = 0.5) +
ggtitle("Regression Error") +
theme_classic()
The summary() function is great for seeing the relevant stats of a model, but accessing the results from summary() is difficult as they are not in a data frame or tibble format. Luckily, we can do just that using the tidy() function from the broom package. It’s a very handy tool for writing reports or for further analysis.
tidy(model1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.26 0.210 -20.3 3.13e-57
## 2 CO2 0.0125 0.000580 21.5 1.74e-61
Assessing accuracy
There are a few ways we can assess the accuracy of the model built against the raw data.
- RSE
- \(R^2\)
- F-statistics
We can see that the \(R^2\) is 0.6217783 and the adjusted \(R^2\) is 0.6204371, but we don’t know if that’s good or not since we haven’t built any other model yet. We do know, however, that it reduces the baseline model’s SSE by about 62%. We can look up all three values using the glance() function from the broom package.
glance(model1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.622 0.620 0.112 464. 1.74e-61 2 221. -436. -425.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
As we can see, the R-squared value is 0.622 (1 is a perfect prediction), the Residual Standard Error (RSE) is 0.112 (minimize as much as possible), and the F-statistic is 463.6 (the larger the better).
Making Prediction
We have our model ready; now we can use it to predict temperature on unseen data and assess its accuracy.
climate_test <- climate_test %>%
add_predictions(model1)
climate_test %>%
select(Temp, pred)
## # A tibble: 24 x 2
## Temp pred
## <dbl> <dbl>
## 1 0.601 0.516
## 2 0.498 0.527
## 3 0.435 0.537
## 4 0.466 0.560
## 5 0.372 0.562
## 6 0.382 0.555
## 7 0.394 0.536
## 8 0.358 0.505
## 9 0.402 0.491
## 10 0.362 0.494
## # … with 14 more rows
Next, let’s calculate the Mean Squared Error (MSE):
# MSE (Mean Square Error)
climate_test %>%
summarise(MSE = mean((Temp - pred)^2))
## # A tibble: 1 x 1
## MSE
## <dbl>
## 1 0.0438
We’re going to compare model1 against other models in the next section.
Multiple Linear Regression
A Multiple Linear Regression is of the form: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k + \epsilon \]
Model Building
We can build a model where Temperature depends on all the other variables except Year and Month:
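# multiple linear regression on all predictors except Year and Month
model2 <- lm(Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` +
             TSI + Aerosols, data = climate_train)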
Let’s take a look at the summary() of this model.
summary(model2)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` +
## TSI + Aerosols, data = climate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25888 -0.05913 -0.00082 0.05649 0.32433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.246e+02 1.989e+01 -6.265 1.43e-09 ***
## MEI 6.421e-02 6.470e-03 9.923 < 2e-16 ***
## CO2 6.457e-03 2.285e-03 2.826 0.00505 **
## CH4 1.240e-04 5.158e-04 0.240 0.81015
## N2O -1.653e-02 8.565e-03 -1.930 0.05467 .
## `CFC-11` -6.631e-03 1.626e-03 -4.078 5.96e-05 ***
## `CFC-12` 3.808e-03 1.014e-03 3.757 0.00021 ***
## TSI 9.314e-02 1.475e-02 6.313 1.10e-09 ***
## Aerosols -1.538e+00 2.133e-01 -7.210 5.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7436
## F-statistic: 103.6 on 8 and 275 DF, p-value: < 2.2e-16
This is not a bad model: our \(R^2\) increased from 0.622 to 0.751 and there are several significant predictors. It appears to be a better model than the simple linear model; however, there is a problem associated with it: collinearity.
Collinearity (or multicollinearity) exists when two or more predictor variables are highly correlated. Because of this high collinearity, it’s difficult to attribute the variation in the response to individual variables. In short, we need to deal with this issue, and the easiest way is to drop one or more of the highly collinear variables. To measure the correlations we can use the built-in cor() function.
round(cor(climate_train), digits = 2)
## Year Month MEI CO2 CH4 N2O CFC-11 CFC-12 TSI Aerosols
## Year 1.00 -0.03 -0.04 0.98 0.92 0.99 0.57 0.90 0.17 -0.35
## Month -0.03 1.00 0.00 -0.11 0.02 0.01 -0.01 0.00 -0.03 0.01
## MEI -0.04 0.00 1.00 -0.04 -0.03 -0.05 0.07 0.01 -0.15 0.34
## CO2 0.98 -0.11 -0.04 1.00 0.88 0.98 0.51 0.85 0.18 -0.36
## CH4 0.92 0.02 -0.03 0.88 1.00 0.90 0.78 0.96 0.25 -0.27
## N2O 0.99 0.01 -0.05 0.98 0.90 1.00 0.52 0.87 0.20 -0.34
## CFC-11 0.57 -0.01 0.07 0.51 0.78 0.52 1.00 0.87 0.27 -0.04
## CFC-12 0.90 0.00 0.01 0.85 0.96 0.87 0.87 1.00 0.26 -0.23
## TSI 0.17 -0.03 -0.15 0.18 0.25 0.20 0.27 0.26 1.00 0.05
## Aerosols -0.35 0.01 0.34 -0.36 -0.27 -0.34 -0.04 -0.23 0.05 1.00
## Temp 0.79 -0.10 0.17 0.79 0.70 0.78 0.41 0.69 0.24 -0.38
## pred 0.98 -0.11 -0.04 1.00 0.88 0.98 0.51 0.85 0.18 -0.36
## error_simple 0.02 -0.03 0.33 0.00 0.02 0.01 0.00 0.02 0.17 -0.17
## Temp pred error_simple
## Year 0.79 0.98 0.02
## Month -0.10 -0.11 -0.03
## MEI 0.17 -0.04 0.33
## CO2 0.79 1.00 0.00
## CH4 0.70 0.88 0.02
## N2O 0.78 0.98 0.01
## CFC-11 0.41 0.51 0.00
## CFC-12 0.69 0.85 0.02
## TSI 0.24 0.18 0.17
## Aerosols -0.38 -0.36 -0.17
## Temp 1.00 0.79 0.61
## pred 0.79 1.00 0.00
## error_simple 0.61 0.00 1.00
We can clearly see that there are a few highly collinear variables in the dataset. However, there is a better way to visualize the correlation matrix, using the corrplot package.
library(corrplot)
x <- cor(climate_train)
corrplot.mixed(x, lower = "number", upper = "square", number.cex = 0.7,
order = "hclust")
We can see that the variables \(N_2O\) and \(CH_4\) have high correlations with other variables, so we can drop them from the dataset and build another model.
model3 <- lm(Temp ~ MEI + CO2 + `CFC-11` + `CFC-12` + TSI + Aerosols,
data = climate_train)
summary(model3)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + `CFC-11` + `CFC-12` + TSI + Aerosols,
## data = climate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26799 -0.06002 -0.00323 0.05651 0.33074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.223e+02 1.991e+01 -6.139 2.86e-09 ***
## MEI 6.421e-02 6.465e-03 9.932 < 2e-16 ***
## CO2 4.061e-03 1.928e-03 2.106 0.036112 *
## `CFC-11` -4.314e-03 1.109e-03 -3.891 0.000125 ***
## `CFC-12` 2.430e-03 6.431e-04 3.778 0.000194 ***
## TSI 8.852e-02 1.461e-02 6.060 4.44e-09 ***
## Aerosols -1.567e+00 2.132e-01 -7.347 2.28e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09201 on 277 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.742
## F-statistic: 136.6 on 6 and 277 DF, p-value: < 2.2e-16
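As a complementary diagnostic (not part of the original course material), variance inflation factors quantify how much each coefficient’s variance is inflated by collinearity; a common rule of thumb flags values above 5-10. A sketch using the car package:
# VIF above ~5-10 signals problematic collinearity; given the correlation
# matrix above, we'd expect very large values for CO2, CH4, N2O and the CFCs
library(car)
vif(model2)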
There is another approach in R to work through this. R provides the step() function, which searches for the model with the lowest AIC. If we start from model2, which contains all the variables, step() will figure out which variables to drop to balance model fit against complexity.
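For reference, AIC trades goodness of fit against model size: \[AIC = 2k - 2\ln(\hat{L})\] where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the model; lower AIC is better.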
model4 <- step(model2)
## Start: AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## - CH4 1 0.00049 2.3135 -1350.1
## <none> 2.3130 -1348.2
## - N2O 1 0.03132 2.3443 -1346.3
## - CO2 1 0.06719 2.3802 -1342.0
## - `CFC-12` 1 0.11874 2.4318 -1335.9
## - `CFC-11` 1 0.13986 2.4529 -1333.5
## - TSI 1 0.33516 2.6482 -1311.7
## - Aerosols 1 0.43727 2.7503 -1301.0
## - MEI 1 0.82823 3.1412 -1263.2
##
## Step: AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## <none> 2.3135 -1350.1
## - N2O 1 0.03133 2.3448 -1348.3
## - CO2 1 0.06672 2.3802 -1344.0
## - `CFC-12` 1 0.13023 2.4437 -1336.5
## - `CFC-11` 1 0.13938 2.4529 -1335.5
## - TSI 1 0.33500 2.6485 -1313.7
## - Aerosols 1 0.43987 2.7534 -1302.7
## - MEI 1 0.83118 3.1447 -1264.9
Assessing Models Numerically
We can numerically compare the quality of all four models.
list(model1 = broom::glance(model1),
model2 = broom::glance(model2),
model3 = broom::glance(model3),
model4 = broom::glance(model4))
## $model1
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.622 0.620 0.112 464. 1.74e-61 2 221. -436. -425.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
##
## $model2
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.751 0.744 0.0917 104. 1.94e-78 9 280. -540. -504.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
##
## $model3
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.747 0.742 0.0920 137. 9.15e-80 7 278. -540. -511.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
##
## $model4
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.751 0.745 0.0916 119. 1.77e-79 8 280. -542. -509.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
It is interesting to note that the step function does not directly address the collinearity of the variables, except insofar as adding highly correlated variables will not improve the \(R^2\) significantly. The consequence is that the step function will not necessarily produce a very interpretable model, just a model that balances quality and simplicity for a particular weighting of the two (AIC). Looking at the data above, it’s clear that model2, model3, and model4 are roughly the same in terms of quality.
Personally, I’d use model3 for its interpretability and also because it addresses the collinearity issue head-on.
Assessing Models Visually
So far we have compared the models numerically, which is great, but we should also assess model strength visually. We will use model3 for this analysis. In each model, a visual study of the residuals can provide valuable insights.
1. “Normality” of residuals
aug_model3 <- augment(model3, climate_train)
ggplot(aug_model3, aes(.resid)) +
geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
stat_function(fun = dnorm, color = "red",
args = list(mean = mean(aug_model3$.resid, na.rm = TRUE),
sd = sd(aug_model3$.resid, na.rm = TRUE))) +
xlab("Residuals from model3")
The plot clearly shows that the residuals of the model follow a normal distribution, which is in line with our expectations.
2. Residual vs. Fitted values
ggplot(aug_model3, aes(.fitted, .resid)) +
geom_ref_line(h = 0) +
geom_point(aes(color = as.factor(Year)), alpha = 0.6) +
geom_smooth(se = FALSE) +
ggtitle("Residuals vs Fitted") +
xlab("Fitted values (Predictions)") +
ylab("Residuals (Actual - Predicted values)") +
labs(color = "Year")
In this plot we expect a roughly horizontal line at .resid = 0 and an equal, homogeneous distribution of points on either side of the line. The model above satisfies both requirements. If there were a non-linear relationship between the predictor variables and the response variable, it would show up in this plot.
3. Standardized Residuals vs. Fitted values
ggplot(aug_model3, aes(.fitted, .std.resid)) +
geom_ref_line(h = 0) +
geom_point(aes(color = as.factor(Year)), alpha = 0.6) +
geom_smooth(se = FALSE) +
ggtitle("Standardized Residual vs Fitted") +
xlab("Fitted values (Predictions)") +
ylab("Standardized Residuals") +
labs(color = "Year")
The interpretation of the plot above is similar to that of the Residuals vs. Fitted values plot.
4. Scale-Location Plot
The Scale-Location plot shows whether the residuals are spread evenly along the range of predictors. This plot is used to check the assumption of equal variance (homoscedasticity). Points are scattered equally on both sides of the line and the line is roughly horizontal; both are good indicators of a robust and accurate model.
ggplot(aug_model3, aes(.fitted, sqrt(abs(.std.resid)))) +
geom_ref_line(h = 0) +
geom_point(aes(color = as.factor(Year)), alpha = 0.6) +
geom_smooth(se = FALSE) +
ggtitle("Scale-Location") +
xlab("Fitted values (Predictions)") +
ylab("sqrt(Standardized Residuals)") +
labs(color = "Year")
5. Q-Q plot
We plotted the residuals to check their normality in the first plot of this section. We can also use a Q-Q plot: if the data points fall on the straight reference line, that indicates a normal distribution, while pronounced skewness is not a good sign. The plot below shows an approximately normal distribution; a few points at the head and tail deviate from the line, but that’s not too bad. In real-world data, this is expected.
ggplot(aug_model3, aes(sample = .resid)) +
stat_qq(alpha = 0.3) +
stat_qq_line(color = "blue") +
ggtitle("Normal Q-Q Plot") +
xlab("Theoretical Quantiles") +
ylab("Sample Quantiles")
6. Cook’s distance and Residuals vs. Leverage
In Regression we always look for outliers, with the belief that outliers distort our model and must be avoided at all costs. However, that’s not true at all; in fact, outliers can reveal something deep about the environment or structure of the dataset and should be studied very carefully. Also, outliers don’t necessarily influence the regression model as much as we think: sometimes including or excluding an outlier makes only a marginal difference. However, there is another set of observations, high-leverage points, which can change the fitted model substantially even if the observation itself changes only a little. Obviously we’re very interested in studying these observations and their impact, and that’s where this plot comes into the picture. I’m using base R for these two plots.
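For reference, Cook’s distance for observation \(i\) combines its residual and its leverage \(h_{ii}\): \[D_i = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1-h_{ii}}\] where \(r_i\) is the standardized residual and \(p\) is the number of model parameters; observations with large \(D_i\) are the influential ones worth inspecting.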
In the Residuals vs. Leverage plot we can’t even see the Cook’s distance lines, as all observations are well within them. This indicates the absence of any influential observations.
plot(model3, which = 4, id.n = 5)
# id.n identifies "n" top outlier observations
plot(model3, which = 5, id.n = 5)
Making Prediction
Having analyzed the model on the training data both numerically and visually, we’re now ready to make predictions on the testing dataset. We will use model3 for the predictions.
(climate_test <- climate_test %>%
add_predictions(model3)) %>%
select(Temp, pred, everything())
## # A tibble: 24 x 12
## Temp pred Year Month MEI CO2 CH4 N2O `CFC-11` `CFC-12` TSI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.601 0.488 2007 1 0.974 383. 1800. 321. 248. 539. 1366.
## 2 0.498 0.462 2007 2 0.51 384. 1803. 321. 248. 539. 1366.
## 3 0.435 0.442 2007 3 0.074 385. 1803. 321. 248. 539. 1366.
## 4 0.466 0.440 2007 4 -0.049 386. 1802. 321. 248. 539. 1366.
## 5 0.372 0.454 2007 5 0.183 387. 1796. 320. 247. 538. 1366.
## 6 0.382 0.423 2007 6 -0.358 386. 1782. 320. 247. 537. 1366.
## 7 0.394 0.421 2007 7 -0.290 384. 1772. 320. 246. 537. 1366.
## 8 0.358 0.403 2007 8 -0.44 382 1779. 320. 246. 537. 1366.
## 9 0.402 0.349 2007 9 -1.16 381. 1794. 321. 246. 537. 1366.
## 10 0.362 0.354 2007 10 -1.14 381. 1802. 321. 246. 537. 1366.
## # … with 14 more rows, and 1 more variable: Aerosols <dbl>
Comparing the MSE of model3 versus model1:
climate_test %>%
gather_predictions(model1, model3) %>%
group_by(model) %>%
summarise(MSE = mean((Temp-pred)^2))
## # A tibble: 2 x 2
## model MSE
## <chr> <dbl>
## 1 model1 0.0438
## 2 model3 0.0104
This clearly shows that, when we use model1 and model3 for out-of-sample prediction, model3 is still the much better model, with a much lower MSE.
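As one more check (a sketch in the spirit of how the course reports test-set performance), we can compute an out-of-sample \(R^2\) for model3, using the training-set mean as the baseline:
# out-of-sample R^2 = 1 - SSE/SST, with SST based on the training mean
SSE_test <- sum((climate_test$Temp - predict(model3, newdata = climate_test))^2)
SST_test <- sum((climate_test$Temp - mean(climate_train$Temp))^2)
1 - SSE_test / SST_test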
Conclusion
In this post we learned:
- How to read data into R
- How to create Simple and Multiple Linear Regression models
- How to interpret a model, both numerically and visually
- How to make predictions using the models
Please note that the purpose of this post was as stated above, and not to make any comment on the current politics around climate change. I leave that to the climate science experts, the free market, and political leadership to figure out.