Statistical Models

Simple Linear Regression

Linear relationships are pervasive.
y = mx + b where
- y is the dependent variable
- x is the independent variable
- m is the slope
- b is the Y-intercept
Two variables can be positively or negatively correlated.
- Positive: m > 0
  - Y = wisdom, X = age
- Negative: m < 0
  - Y = years in jail, X = years of education
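Perfect correlation when r = 1 or r = -1 (all points lie exactly on a line; the slope m can be any nonzero value).
The correlation statistic is r, where -1 <= r <= 1.
The strength of the correlation is measured by r²: 0 <= r² <= 1.
Anscombe's quartet (four data sets with nearly identical summary statistics that look very different when plotted): https://en.m.wikipedia.org/wiki/Anscombe%27s_quartet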
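
As a minimal sketch of the definitions above, the following Python computes the least-squares slope m, the intercept b, and the correlation r by hand. The data are the first Anscombe data set (x1, y1) from R's built-in anscombe; the function name fit_line is just an illustrative choice. Note that the slope comes out near 0.5 while r is about 0.82, so slope and correlation are different quantities.

```python
import math

# Anscombe's first data set (x1, y1), as in R's built-in `anscombe`.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def fit_line(x, y):
    """Return (m, b, r) for the least-squares line y = m*x + b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx                    # slope
    b = y_bar - m * x_bar            # intercept
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
    return m, b, r

m, b, r = fit_line(x, y)
print(f"slope m = {m:.4f}, intercept b = {b:.4f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

The slope and intercept should agree with what R's lm reports for y1 ~ x1 on the same data.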

Just because you can apply linear regression does not mean that it is appropriate. Example: White House linear extrapolation story.
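
One way to check whether a straight line is appropriate is to look at the residuals (observed y minus fitted y). The sketch below fits a line to the second Anscombe data set (x2, y2 from R's built-in anscombe), which is visibly curved when plotted. If a line were a good model, the residuals would scatter around zero with no pattern; here they are negative at both ends of the x range and positive in the middle.

```python
# Anscombe's second data set (x2, y2), as in R's built-in `anscombe`.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
b = y_bar - m * x_bar

# Residual for each x value (the x values in this data set are all distinct).
residuals = {xi: yi - (m * xi + b) for xi, yi in zip(x, y)}

for xi in sorted(residuals):
    print(f"x = {xi:2d}  residual = {residuals[xi]:+.2f}")
```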

Spurious correlation: http://tylervigen.com/spurious-correlations
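
A common source of spurious correlation is a shared trend: two unrelated quantities that both grow over time will show a high r. The sketch below uses two made-up yearly series (the numbers are purely illustrative, not real data). Correlating the raw levels gives r close to 1, while correlating the year-over-year changes, one simple way to remove the trend, gives a much weaker r.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Two hypothetical, unrelated series that both happen to trend upward.
series_a = [10, 12, 13, 15, 18, 19, 22, 24]
series_b = [100, 103, 109, 112, 118, 121, 127, 130]

# Year-over-year changes remove the common upward trend.
changes_a = [later - earlier for earlier, later in zip(series_a, series_a[1:])]
changes_b = [later - earlier for earlier, later in zip(series_b, series_b[1:])]

r_levels = pearson_r(series_a, series_b)
r_changes = pearson_r(changes_a, changes_b)
print(f"r on levels:  {r_levels:.3f}")
print(f"r on changes: {r_changes:.3f}")
```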

Horace: “Est modus in rebus, sunt certi denique fines, quos ultra citraque nequit consistere rectum.” (“There is a proper measure in things. There are, finally, certain boundaries short of and beyond which what is right cannot exist.”) This can be read as an argument against linear models: a straight line extrapolates without bound, while real quantities have limits.
Linear regression in R: see Chapter 11 of the R Cookbook. The key command is lm (for linear model). Type ?lm in RStudio to see the documentation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html
anscombe is a built-in dataset in R. See https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/anscombe.html
For all built-in datasets, see https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

> require(stats); require(graphics)
> summary(anscombe)
       x1             x2             x3            x4             y1              y2              y3
 Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260   Min.   :3.100   Min.   : 5.39
 1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25
 Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580   Median :8.140   Median : 7.11
 Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501   Mean   :7.501   Mean   : 7.50
 3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98
 Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840   Max.   :9.260   Max.   :12.74
       y4
 Min.   : 5.250
 1st Qu.: 6.170
 Median : 7.040
 Mean   : 7.501
 3rd Qu.: 8.190
 Max.   :12.500
> ##-- now some "magic" to do the 4 regressions in a loop:
> ff <- y ~ x
> mods <- setNames(as.list(1:4), paste0("lm", 1:4))
> for(i in 1:4) {
+   ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
+   ## or   ff[[2]] <- as.name(paste0("y", i))
+   ##      ff[[3]] <- as.name(paste0("x", i))
+   mods[[i]] <- lmi <- lm(ff, data = anscombe)
+   print(anova(lmi))
+ }
Analysis of Variance Table

Response: y1
          Df Sum Sq Mean Sq F value  Pr(>F)
x1         1 27.510 27.5100   17.99 0.00217 **
Residuals  9 13.763  1.5292
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y2
          Df Sum Sq Mean Sq F value   Pr(>F)
x2         1 27.500 27.5000  17.966 0.002179 **
Residuals  9 13.776  1.5307
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y3
          Df Sum Sq Mean Sq F value   Pr(>F)
x3         1 27.470 27.4700  17.972 0.002176 **
Residuals  9 13.756  1.5285
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analysis of Variance Table

Response: y4
          Df Sum Sq Mean Sq F value   Pr(>F)
x4         1 27.490 27.4900  18.003 0.002165 **
Residuals  9 13.742  1.5269
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> ## See how close they are (numerically!)
> sapply(mods, coef)
                  lm1      lm2       lm3       lm4
(Intercept) 3.0000909 3.000909 3.0024545 3.0017273
x1          0.5000909 0.500000 0.4997273 0.4999091
> lapply(mods, function(fm) coef(summary(fm)))
$lm1
             Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 3.0000909  1.1247468 2.667348 0.025734051
x1          0.5000909  0.1179055 4.241455 0.002169629

$lm2
            Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 3.000909  1.1253024 2.666758 0.025758941
x2          0.500000  0.1179637 4.238590 0.002178816

$lm3
             Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 3.0024545  1.1244812 2.670080 0.025619109
x3          0.4997273  0.1178777 4.239372 0.002176305

$lm4
             Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 3.0017273  1.1239211 2.670763 0.025590425
x4          0.4999091  0.1178189 4.243028 0.002164602

> ## Now, do what you should have done in the first place: PLOTS
> op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
> for(i in 1:4) {
+   ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
+   plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
+        xlim = c(3, 19), ylim = c(3, 13))
+   abline(mods[[i]], col = "blue")
+ }
> mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
> par(op)

Multivariate (multivariable) Linear Regression

Simple linear regression involves only one independent variable, x.
Multivariable linear regression has more than one independent variable, that is, X is a vector. The model has multiple inputs to predict a single output variable.
Multivariate linear regression has more than one dependent variable, that is, Y is a vector. The model predicts multiple output values.
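
To illustrate the multivariable case (several inputs, one output), here is a minimal sketch that fits y = b0 + b1*x1 + b2*x2 by solving the normal equations (XᵀX)β = Xᵀy with plain Gaussian elimination. The function names solve and fit_multivariable are illustrative, and the data are made up so that the true coefficients are known to be 1, 2, and 3. In practice one would normally use lm in R or a numerical library rather than hand-written elimination.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for row in reversed(range(n)):
        s = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - s) / aug[row][row]
    return x

def fit_multivariable(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefs = fit_multivariable(rows, y)
print([round(c, 6) for c in coefs])
```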

Logistic Regression

Logistic regression is used when the output variable is binary, e.g., true/false, yes/no.
Examples:
- Epidemiology: Does the patient have the disease? Will the patient live or die?
- Politics: Will a subject vote for the Democrat or the Republican? (based on age, income, sex, race, state of residence, votes in previous elections)
- Marketing: Will a subject buy a particular new product?
- Internet ads: Will a user click on a particular ad on a web page? (Google and Facebook use this approach.)
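
A minimal sketch of how such a model works: logistic regression passes a linear function w*x + b through the sigmoid 1 / (1 + e^(-z)) to get a probability between 0 and 1, and the parameters are chosen to fit the observed yes/no outcomes. The example below fits one input by plain gradient descent on the log loss. The data (minutes spent on a product page versus whether the visitor bought), the learning rate, and the step count are all made-up assumptions for illustration; real work would use glm in R or an established library.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(w*x + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # The gradient of the mean log loss is the mean of (prediction - label) * input.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: minutes on the page (x) and whether the visitor bought (y).
minutes = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
bought = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(minutes, bought)
p_short = sigmoid(w * 1.0 + b)   # estimated purchase probability after 1 minute
p_long = sigmoid(w * 4.5 + b)    # estimated purchase probability after 4.5 minutes
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"P(buy | 1.0 min) = {p_short:.3f}, P(buy | 4.5 min) = {p_long:.3f}")
```

The output is a probability; a yes/no prediction comes from comparing it with a threshold such as 0.5.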

Applications

Epidemiology
Marketing
Capital Asset Pricing Model (CAPM)
Economics - everywhere
Machine Learning - most common technique
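
As one concrete instance from the list above, the CAPM can be estimated as a simple linear regression: an asset's excess return is regressed on the market's excess return, and the slope is the asset's beta. The sketch below uses made-up monthly excess returns, constructed so that the true alpha is 0.001 and the true beta is 1.5; the variable names are illustrative.

```python
# Hypothetical monthly excess returns of the market (return minus risk-free rate).
market = [0.01, -0.02, 0.03, 0.00, 0.02, -0.01]

# Hypothetical asset excess returns, built exactly as alpha + beta * market.
TRUE_ALPHA, TRUE_BETA = 0.001, 1.5
asset = [TRUE_ALPHA + TRUE_BETA * rm for rm in market]

n = len(market)
market_mean = sum(market) / n
asset_mean = sum(asset) / n

# Regression slope = covariance(asset, market) / variance(market).
covariance = sum((rm - market_mean) * (ra - asset_mean)
                 for rm, ra in zip(market, asset))
variance = sum((rm - market_mean) ** 2 for rm in market)

beta = covariance / variance
alpha = asset_mean - beta * market_mean   # regression intercept
print(f"beta = {beta:.4f}, alpha = {alpha:.4f}")
```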

Detailed Example: Forecasting Broadway Show Gross Revenue

A Stanford Business School project for the course Business Intelligence from Big Data:
http://zoo.cs.yale.edu/classes/cs458/lectures/Broadway/
- Data
- Final Report
- Python code
- R code
- Report files
Dependent variable: Broadway show gross revenue. Data gathered by scraping Internet sites.
Independent variables:
- Show capacity (how many seats in theatre, shows per week)
- Show genre (play, musical, other)
- Seasonality and holidays
- Weather
- Financial indicators (stock market)
- Social data (Google Trends)
Goal: generate a predictive model with minimum root-mean-squared-error (RMSE).
One finding: the length of a show's run did not reduce its revenue.
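
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values, so it is in the same units as the dependent variable. A minimal sketch, with made-up weekly gross figures in millions of dollars (not taken from the project data):

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical weekly gross revenue, in millions of dollars.
actual_gross = [3.0, 5.0, 7.0]
predicted_gross = [2.0, 5.0, 9.0]

error = rmse(actual_gross, predicted_gross)
print(f"RMSE = {error:.3f} million dollars")
```

When comparing candidate models, the one with the lower RMSE on data it was not fitted to is the better predictor by this measure.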

Optional Homework Assignment

The grade on this assignment can replace your lowest homework grade.
Use multivariable linear regression to predict either
- the stock market, e.g., the S&P 500 index or the Dow Jones index
- the price of a specific stock, e.g., Apple.
Sample independent variables could include macroeconomic data, such as interest rates, housing starts, unemployment, factory orders, lipstick sales, etc. See https://en.wikipedia.org/wiki/Economic_indicator and http://sys.vos.cz/pdf_view/AJ/Baumohl_The_secrets_of_Economic_Indicators.pdf
You should also consider other independent variables that might be available from online and social media sources, like Google or Twitter. Tobias Preis used Google Trends and Wikipedia usage data to predict the stock market. https://en.wikipedia.org/wiki/Tobias_Preis
The model should make sense. Avoid spurious correlations.
Following the model of the Broadway show project, use only publicly available data, which includes Yahoo Finance.
You are required to write programs in Python or R to scrape the web to gather at least one of the independent variables. You may also simply download relevant, predigested or preformatted datasets from government, academic, or other sites.
The project with the highest predictive value gets an extra 5 points. (You might also want to write a business plan.)

Slade, Automated Decision Systems
