Linear relationships are pervasive.
y = mx + b where
y is the dependent variable
x is the independent variable
m is the slope
b is the Y-intercept
Items can be positively or negatively correlated.
Positive: m > 0
Y = wisdom, X = age
Negative: M < 0
Y = years in jail, X = years of education
Perfect correlation when M = 1 or M = -1.
Correlation statistic is r, where -1 <= r <= 1
Measure of strength of correlation is r2: 0 <= r2 <=1
https://en.m.wikipedia.org/wiki/Anscombe%27s_quartet Anscombe Quartet
Just because you can apply linear regression does not mean that it is appropriate. Example: White House linear extrapolation story.
Spurious correlation: http://tylervigen.com/spurious-correlations
Horace: “Est modus in rebus, sunt certi denique fines, quos ultra citraque nequit consister rectrum.” (“There is a proper measure in things. There are, finally, certain boundaries short of and beyond which what is right cannot exist.”) argument against linear models.
Linear regression in R - see Chapter 11 of R Cookbook. The key command is lm (for linear model) Type lm? in rstudio to see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html
Anscombe is a built-in dataset in R. See https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/anscombe.html
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html (for all datasets)
>
require(stats); require(graphics) |
|
|
|
|
Simple linear regression involves only one independent variable: x
Multivariable linear regression has more than one independent variable, that is, X is a vector. The model has multiple inputs to predict a single output variable.
Multivariate linear regression has more than one dependent variable, that is, Y is a vector. The model predicts multiple output values.
Logistic regression is used when the output variable is binary, e.g., true/false, yes/no.
Examples:
Epidemiology: Does the patient have the disease? Will the patient live or die?
Politics: Will a subject vote for the Democrat or the Republican? (based on age, income, sex, race, state of residence, votes in previous elections)
Marketing: Will a subject buy a particular new product?
Internet ads: will a user click on a particular ad on a web page? (Google and Facebook use this approach.)
Epidemiology
Marketing
Capital Asset Pricing Model (CAPM)
Economics - everywhere
Machine Learning - most common technique
Stanford Business School project for course in Business Intelligence from Big Data
http://zoo.cs.yale.edu/classes/cs458/lectures/Broadway/
Data
Final Report
Python code
R code
Report files
Dependent variable: Broadway show gross revenue. Data gathered by scraping Internet sites.
Independent variables:
Show capacity (how many seats in theatre, shows per week)
Show genre (play, musical, other)
Seasonality and holidays
Weather
Financial indicators (stock market)
Social data (Google trends)
Goal: generate a predictive model with minimum root-mean-squared-error (RMSE).
Noticed that length of run did not reduce revenue.
The grade on this assignment can replace your lowest homework grade.
Use multivariable linear regression to predict either
the stock market, e.g., the S&P 500 index or the Dow Jones index
the price of a specific stock, e.g., Apple.
Sample independent variables could include macroeconomic data, like interest rates, housing starts, unemployment, factory orders, lipstick, etc. See https://en.wikipedia.org/wiki/Economic_indicator and http://sys.vos.cz/pdf_view/AJ/Baumohl_The_secrets_of_Economic_Indicators.pdf
You should also consider other independent variables that might be available from social media, like Google or Twitter. Tobias Preis used Google trends and Wikipedia usage data to predict the stock market. https://en.wikipedia.org/wiki/Tobias_Preis
The model should make sense. Avoid spurious correlations.
Following the model of the broadway show project, use only publically available data, which includes Yahoo Finance.
You are required to write programs in Python or R to scrape the web to gather at least one of the independent variables. You may also simply download relevant, predigested or preformatted datasets from government, academic, or other sites.
The project with the highest predictive value gets an extra 5 points. (You might also want to write a business plan.)
Slade, Automated Decision Systems |
Page |