Ordinary least squares regression II:
The univariate affine regression.
Clement de Chaisemartin and Doug Steigerwald
UCSB
Many people need to make predictions
• Traders: use today's GDP growth to predict tomorrow's oil price.
• Banks: use an applicant's FICO score to predict the amount that an April 2018 applicant will fail to reimburse on her one-year loan when it expires in April 2019.
• Gmail: use whether an incoming email has the word "free" in it to predict whether it is spam.
The relationship between FICO score and default
• Assume that the relationship between FICO score and the amount people fail to repay looks like the graph below: people with a low FICO score fail to repay more.
• If you use a univariate linear regression to predict the amount people fail to repay based on their FICO score, will you make good predictions? Discuss this question with your neighbor.
[Figure: scatter plot of the amount defaulted, "Default" (from −1000 to 6000), against FICO score (from 0 to 10); the relationship is decreasing.]
iClicker time
• If you use a univariate linear regression to
predict the amount people fail to repay based
on their FICO score, will you make good
predictions?
No!
• In this example, the OLS regression function is 250·FICO, which is increasing in FICO! We predict that people with better scores will fail to reimburse more.
• OLS regression makes large prediction errors.
• Why does the regression make large prediction errors?
[Figure: the same scatter plot of "Default" against FICO, with the line 250·FICO drawn through it. The vertical distance between a data point and the line, e.g. for the person with x = 2, is the prediction error of the univariate linear regression.]
iClicker time
• Why does the univariate linear regression make large prediction errors?
a) Because the relationship between FICO and the amount people fail to repay is decreasing.
b) Because the amount that people with a FICO score equal to 0 fail to repay is different from 0.
Because the amount that people with a FICO score equal to 0 fail to repay is different from 0.
• The univariate linear regression function is βx_k. Therefore, by construction, our prediction will be 0 for people with a FICO score of 0.
• However, as you can see from the graph, people with a FICO score equal to 0 fail to reimburse a strictly positive amount on their loan, not a 0 amount.
You should use an affine prediction function.
• The graph below shows that the function 5000 − 500·FICO does a much better job at predicting the amount that people fail to repay than the univariate linear regression function 250·FICO.
• 5000 − 500·FICO is an affine function of FICO, with an intercept equal to 5000 and a slope equal to −500.
• In these lectures, we study OLS univariate affine regression.
[Figure: the scatter plot of "Default" against FICO, with both prediction functions drawn: 250·FICO and 5000 − 500·FICO. The affine function fits the data much better.]
Roadmap
1. The OLS univariate affine regression function.
2. Estimating the OLS univariate affine regression function.
3. Interpreting β̂₁.
4. OLS univariate affine regression in practice.
Set up and notation.
• We consider a population of N units.
– N = number of people who apply for a one-year loan with bank A during April 2018.
– N = number of emails reaching Gmail accounts in April 2018.
• Each unit k has a variable y_k attached to it that we do not observe. We call this variable the dependent variable.
– In the loan example, y_k is the amount that loan applicant k will fail to reimburse when her loan expires in April 2019.
– In the email example, y_k = 1 if email k is spam and 0 otherwise.
• Each unit k also has 1 variable x_k attached to it that we do observe. We call this variable the independent variable.
– In the loan example, x_k could be the FICO score of applicant k.
– In the email example, x_k = 1 if the word "free" appears in the email.
• ȳ = (1/N) Σ_{k=1}^N y_k and x̄ = (1/N) Σ_{k=1}^N x_k: averages of the y_k s and the x_k s.
Your prediction should be a function of x_k.
• Based on the value of x_k of each unit, we want to predict her y_k.
• E.g.: in the loan example, we want to predict the amount that a unit will fail to repay on her loan based on her FICO score.
• Assume that applicant 1 has a very high (good) credit score, while applicant 2 has a very low (bad) credit score.
• Should you predict the same value of y_k for applicants 1 and 2?
• No! Your prediction should be a function of x_k: f(x_k).
• In these lectures, we focus on predictions that are an affine function of x_k: f(x_k) = b₀ + b₁x_k, for two real numbers b₀ and b₁.
Our prediction error is y_k − (b₀ + b₁x_k).
• Based on the value of x_k of each unit, we want to predict her y_k.
• Our prediction should be a function of x_k, f(x_k). We focus on predictions that are an affine function of x_k: f(x_k) = b₀ + b₁x_k, for two real numbers b₀ and b₁.
• y_k − (b₀ + b₁x_k), the difference between y_k and our prediction, is our prediction error.
• In the loan example, if y_k − (b₀ + b₁x_k) is large and positive, our prediction is much below the amount applicant k will fail to reimburse.
• If y_k − (b₀ + b₁x_k) is large and negative, our prediction is much above the amount person k will fail to reimburse.
• Large positive or negative values of y_k − (b₀ + b₁x_k) mean a bad prediction.
• y_k − (b₀ + b₁x_k) close to 0 means a good prediction.
We want to find the value of (b₀, b₁) that minimizes Σ_{k=1}^N (y_k − (b₀ + b₁x_k))².
• Σ_{k=1}^N (y_k − (b₀ + b₁x_k))² is positive. => minimizing it = the same thing as making it as close to 0 as possible.
• If Σ_{k=1}^N (y_k − (b₀ + b₁x_k))² is as close to 0 as possible, the sum of the squared values of our prediction errors is as small as possible.
• => we make small errors. That's good, that's what we want!
The OLS univariate affine regression function in the population.
• Let (β₀, β₁) = argmin_{(b₀,b₁) ∈ ℝ²} Σ_{k=1}^N (y_k − (b₀ + b₁x_k))².
• We call β₀ + β₁x_k the ordinary least squares (OLS) univariate affine regression function of y_k on x_k in the population.
• Affine: because the regression function is an affine function of x_k.
• Shortcut: OLS regression of y_k on a constant and x_k in the population.
• Constant: because there is the constant β₀ in our prediction function.
Decomposing y_k into predicted value and error.
• β₀ and β₁: coefficients of the constant and of x_k in the OLS regression of y_k on a constant and x_k in the population.
• Let ŷ_k = β₀ + β₁x_k. ŷ_k is the predicted value for y_k according to the OLS regression of y_k on a constant and x_k in the population.
• Let e_k = y_k − ŷ_k. e_k: error we make when we use the OLS regression in the population to predict y_k.
• We have y_k = ŷ_k + e_k: y_k = predicted value + error.
β₀ = ȳ − β₁x̄ …
• (β₀, β₁): the (b₀, b₁) minimizing Σ_{k=1}^N (y_k − (b₀ + b₁x_k))².
• The derivative wrt b₀ is: Σ_{k=1}^N −2(y_k − (b₀ + b₁x_k)). Why?
• The derivative wrt b₁ is: Σ_{k=1}^N −2x_k(y_k − (b₀ + b₁x_k)). Why?
• (β₀, β₁): the value of (b₀, b₁) for which the 2 derivatives = 0.
• We use the fact that the 1st derivative = 0 to write β₀ as a function of β₁:
Σ_{k=1}^N −2(y_k − (β₀ + β₁x_k)) = 0
iff −2 Σ_{k=1}^N (y_k − β₀ − β₁x_k) = 0
iff Σ_{k=1}^N (y_k − β₀ − β₁x_k) = 0
iff Σ_{k=1}^N y_k − Σ_{k=1}^N β₀ − Σ_{k=1}^N β₁x_k = 0
iff Σ_{k=1}^N y_k − Σ_{k=1}^N β₁x_k = Σ_{k=1}^N β₀
iff Σ_{k=1}^N y_k − β₁ Σ_{k=1}^N x_k = Nβ₀
iff (1/N) Σ_{k=1}^N y_k − β₁ (1/N) Σ_{k=1}^N x_k = β₀
iff β₀ = ȳ − β₁x̄.
2 useful formulas for the next derivation.
• During the sessions, you have proven that (1/N) Σ_{k=1}^N x_k² − x̄² = (1/N) Σ_{k=1}^N (x_k − x̄)².
• Multiplying both sides by N, this is equivalent to saying that Σ_{k=1}^N x_k² − Nx̄² = Σ_{k=1}^N (x_k − x̄)².
• Bear this 1st equality in mind, we use it in the next derivation.
• Moreover, Σ_{k=1}^N x̄(y_k − ȳ) = x̄ Σ_{k=1}^N (y_k − ȳ) = x̄ Σ_{k=1}^N y_k − x̄ Σ_{k=1}^N ȳ = x̄Nȳ − x̄Nȳ = 0.
• Therefore,
Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) = Σ_{k=1}^N (x_k(y_k − ȳ) − x̄(y_k − ȳ)) = Σ_{k=1}^N x_k(y_k − ȳ) − Σ_{k=1}^N x̄(y_k − ȳ) = Σ_{k=1}^N x_k(y_k − ȳ).
• Bear this 2nd equality in mind, we use it in the next derivation.
… and β₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)².
• Now, let's use the fact that the 2nd derivative = 0 and the formula for β₀ to find β₁.
Σ_{k=1}^N −2x_k(y_k − (β₀ + β₁x_k)) = 0
iff Σ_{k=1}^N x_k(y_k − (β₀ + β₁x_k)) = 0
iff Σ_{k=1}^N (x_k y_k − β₀x_k − β₁x_k²) = 0
iff Σ_{k=1}^N x_k y_k − Σ_{k=1}^N β₀x_k − Σ_{k=1}^N β₁x_k² = 0
iff Σ_{k=1}^N x_k y_k − β₀ Σ_{k=1}^N x_k − β₁ Σ_{k=1}^N x_k² = 0
iff Σ_{k=1}^N x_k y_k = β₀ Σ_{k=1}^N x_k + β₁ Σ_{k=1}^N x_k²
iff Σ_{k=1}^N x_k y_k = (ȳ − β₁x̄) Σ_{k=1}^N x_k + β₁ Σ_{k=1}^N x_k²
iff Σ_{k=1}^N x_k y_k = ȳ Σ_{k=1}^N x_k − β₁x̄ Σ_{k=1}^N x_k + β₁ Σ_{k=1}^N x_k²
iff Σ_{k=1}^N x_k y_k − Σ_{k=1}^N x_k ȳ = β₁ (Σ_{k=1}^N x_k² − x̄ Σ_{k=1}^N x_k)
iff Σ_{k=1}^N (x_k y_k − x_k ȳ) = β₁ (Σ_{k=1}^N x_k² − x̄Nx̄)
iff Σ_{k=1}^N x_k(y_k − ȳ) = β₁ (Σ_{k=1}^N x_k² − Nx̄²)
iff Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) = β₁ Σ_{k=1}^N (x_k − x̄)²
iff β₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)².
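These closed-form formulas can be checked numerically. Below is a minimal sketch in Python (not part of the original slides), using made-up data: the closed-form (β₀, β₁) coincides with the (b₀, b₁) found by directly minimizing the sum of squared errors.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)                      # made-up x_k's
y = 5000 - 500 * x + rng.normal(scale=300, size=200)  # made-up y_k's

# Closed-form OLS coefficients from the derivation above.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Direct numerical minimization of the sum of squared prediction errors.
sse = lambda b: np.sum((y - (b[0] + b[1] * x)) ** 2)
b_num = minimize(sse, x0=[0.0, 0.0]).x

print(beta0, beta1)  # closed form
print(b_num)         # numerical minimizer, should match
```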
Applying the formulas for β₀ and β₁ in an example.
• Assume for a minute that N = 3: there are only three units in the population.
• Assume that y₁ = 2, x₁ = 0, y₂ = 3, x₂ = 1, y₃ = 7, and x₃ = 2.
• Use the previous formulas to compute β₀ and β₁ in this example.
iClicker time
• If N = 3, y₁ = 2, x₁ = 0, y₂ = 3, x₂ = 1, y₃ = 7, and x₃ = 2, then:
a) β₀ = 3/2 and β₁ = 7/2
b) β₀ = 7/2 and β₁ = 3/2
c) β₀ = 3/2 and β₁ = 5/2
β₀ = 3/2 and β₁ = 5/2
• If N = 3, y₁ = 2, x₁ = 0, y₂ = 3, x₂ = 1, y₃ = 7, and x₃ = 2, then ȳ = 4 and x̄ = 1.
• Then,
β₁ = [(x₁ − x̄)(y₁ − ȳ) + (x₂ − x̄)(y₂ − ȳ) + (x₃ − x̄)(y₃ − ȳ)] / [(x₁ − x̄)² + (x₂ − x̄)² + (x₃ − x̄)²]
= [(0 − 1)(2 − 4) + (1 − 1)(3 − 4) + (2 − 1)(7 − 4)] / [(0 − 1)² + (1 − 1)² + (2 − 1)²]
= 5/2.
• And β₀ = ȳ − β₁x̄ = 4 − (5/2)·1 = 3/2.
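A quick check of this arithmetic in Python (a sketch with the same three data points, not from the slides):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([2.0, 3.0, 7.0])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  # 1.5 2.5, i.e. beta0 = 3/2 and beta1 = 5/2
```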
Two other useful formulas
• We let e_k = y_k − (β₀ + β₁x_k). e_k: error we make when we use the univariate affine regression to predict y_k.
• In the derivation of the formula of β₀, we have shown that Σ_{k=1}^N (y_k − β₀ − β₁x_k) = 0.
• This is equivalent to Σ_{k=1}^N e_k = 0, which is itself equivalent to (1/N) Σ_{k=1}^N e_k = 0: the average of our prediction errors is 0.
• In the derivation of the formula of β₁, we have also shown that Σ_{k=1}^N x_k(y_k − β₀ − β₁x_k) = 0.
• This is equivalent to Σ_{k=1}^N x_k e_k = 0, which is itself equivalent to (1/N) Σ_{k=1}^N x_k e_k = 0: the average of the product of our prediction errors and x_k is 0.
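These two equalities are easy to verify numerically. A minimal sketch in Python on made-up data (an illustration, not the slides' data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)
y = 5000 - 500 * x + rng.normal(scale=300, size=500)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
e = y - (beta0 + beta1 * x)   # prediction errors e_k

print(e.mean())        # ~0, up to floating-point error
print((x * e).mean())  # ~0 as well
```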
What you need to remember
• Population of N units. Each unit k has 2 variables attached to it: y_k is a variable we do not observe, x_k is a variable we observe.
• We want to predict the y_k of each unit based on her x_k.
• Our prediction should be a function of x_k, f(x_k).
• Focus on affine functions: b₀ + b₁x_k, for 2 numbers b₀ and b₁.
• The best (b₀, b₁) is the one minimizing Σ_{k=1}^N (y_k − (b₀ + b₁x_k))².
• We call that value (β₀, β₁), we call β₀ + β₁x_k the OLS regression function of y_k on a constant and x_k, and we let e_k = y_k − (β₀ + β₁x_k).
• β₀ = ȳ − β₁x̄, and β₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)².
• We have (1/N) Σ_{k=1}^N e_k = 0: the average prediction error is 0, and (1/N) Σ_{k=1}^N x_k e_k = 0.
Roadmap
1. The OLS univariate affine regression function.
2. Estimating the OLS univariate affine regression function.
3. Interpreting β̂₁.
4. OLS univariate affine regression in practice.
Can we compute (β₀, β₁)?
• Our prediction for y_k based on a univariate affine regression is β₀ + β₁x_k, the univariate affine regression function.
• => to be able to make a prediction for a unit's y_k based on her x_k, we need to know the value of (β₀, β₁).
• Under the assumptions we have made so far, can we compute (β₀, β₁)? Discuss this question with your neighbor for 1 minute.
iClicker time
• Can we compute (β₀, β₁)?
No!
• β₀ = ȳ − β₁x̄, and β₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)².
• Remember, we have assumed that we observe the x_k s of everybody in the population (e.g. applicants' FICO scores) but not the y_k s (e.g. the amount that a person applying for a one-year loan in April 2018 will fail to reimburse in April 2019 when that loan expires).
• => we cannot compute β₀ and β₁.
A method to estimate β₀ and β₁
• We draw n units at random from the population, and we measure the dependent and the independent variable of those units.
• For every i between 1 and n, Y_i and X_i = value of the dependent and of the independent variable of the i-th unit we randomly select.
• We want to use the Y_i s and the X_i s to estimate β₀ and β₁.
• (β₀, β₁): the (b₀, b₁) minimizing Σ_{k=1}^N (y_k − (b₀ + b₁x_k))².
• => to estimate (β₀, β₁), we use the (b₀, b₁) minimizing Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))².
• Instead of finding the (b₀, b₁) that minimizes the sum of squared prediction errors in the population, find the (b₀, b₁) that minimizes the sum of squared prediction errors in the sample.
• Intuition: if we find a method that predicts the dependent variable well in the sample, the method should work well in the entire population, given that the sample is representative of the population.
The OLS regression function in the sample.
• Let (β̂₀, β̂₁) = argmin_{(b₀,b₁) ∈ ℝ²} Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))².
• We call β̂₀ + β̂₁X_i the OLS regression function of Y_i on a constant and X_i in the sample.
• In the sample: because we only use the Y_i s and X_i s of the n units in the sample we randomly draw from the population.
• (β̂₀, β̂₁): coefficients of the constant and X_i in the OLS regression of Y_i on a constant and X_i in the sample.
• Let Ŷ_i = β̂₀ + β̂₁X_i. Ŷ_i is the predicted value for Y_i according to the OLS regression of Y_i on a constant and X_i in the sample.
• Let ê_i = Y_i − Ŷ_i. ê_i: error we make when we use the OLS regression in the sample to predict Y_i.
• We have Y_i = Ŷ_i + ê_i.
Find the value of (b₀, b₁) that minimizes Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))².
• Find a formula for the value of (b₀, b₁) that minimizes Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))². Hint: the formula is "almost" the same as that for (β₀, β₁), except that you need to replace:
– N, the size of the population, by n, the size of the sample,
– y_k, the dependent variable of unit k in the population, by Y_i, the dependent variable of unit i in the sample,
– x_k, the independent variable of unit k in the population, by X_i, the independent variable of unit i in the sample.
iClicker time
• Let Ȳ = (1/n) Σ_{i=1}^n Y_i and X̄ = (1/n) Σ_{i=1}^n X_i. Let (β̂₀, β̂₁) denote the value of (b₀, b₁) that minimizes Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))². We have:
a) β̂₀ = ȳ − β̂₁x̄, and β̂₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)².
b) β̂₀ = Ȳ − β̂₁X̄, and β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / (X_i − X̄)².
c) β̂₀ = Ȳ − β̂₁X̄, and β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)².
β̂₀ = Ȳ − β̂₁X̄, and β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)²!
• Sketch of the proof.
• Differentiate Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))² wrt b₀ and b₁.
• β̂₀ and β̂₁: values of b₀ and b₁ that cancel these two derivatives. That gives us a system of 2 equations with 2 unknowns to solve.
• The steps to solve it are exactly the same as those we used to find β₀ = ȳ − β₁x̄ and β₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)², except that we replace:
– N, the size of the population, by n, the size of the sample,
– y_k, the dependent variable of unit k in the population, by Y_i, the dependent variable of unit i in the sample,
– x_k, the independent variable of unit k in the population, by X_i, the independent variable of unit i in the sample.
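A sketch of this estimation method in Python, on a made-up population (an illustration, not the slides' data): the sample coefficients come out close to the population ones.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "population" of N = 100,000 units.
N = 100_000
x = rng.uniform(0, 10, size=N)
y = 5000 - 500 * x + rng.normal(scale=300, size=N)

def ols(xv, yv):
    """Closed-form intercept and slope of an OLS regression on a constant and x."""
    b1 = np.sum((xv - xv.mean()) * (yv - yv.mean())) / np.sum((xv - xv.mean()) ** 2)
    return yv.mean() - b1 * xv.mean(), b1

beta0, beta1 = ols(x, y)                         # population coefficients

idx = rng.choice(N, size=1000, replace=False)    # random sample of n = 1000 units
b0_hat, b1_hat = ols(x[idx], y[idx])             # sample coefficients

print(beta0, beta1)
print(b0_hat, b1_hat)   # close to the population values
```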
β̂₀ converges towards β₀, and β̂₁ converges towards β₁.
• Remember, when we studied the OLS regression of Y_i on X_i without a constant, we used the law of large numbers to prove that lim_{n→+∞} β̂ = β.
• When the sample we randomly draw gets large, β̂, the sample coefficient of the regression, gets close to β, the population coefficient, so β̂ is a good proxy for β.
• Here, one can also use the law of large numbers to prove that lim_{n→+∞} β̂₀ = β₀ and lim_{n→+∞} β̂₁ = β₁.
• Take-away: when the sample we randomly draw gets large, β̂₀ and β̂₁, the sample coefficients of the regression of Y_i on a constant and X_i, get close to β₀ and β₁, the population coefficients.
• Therefore, β̂₀ and β̂₁ = good proxies of β₀ and β₁ when the sample is large enough.
iClicker time
• We have shown that β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)².
• Is β̂₁ a real number, or is it a random variable? Discuss this question with your neighbour for 1 minute, and then answer.
a) β̂₁ is a real number.
b) β̂₁ is a random variable.
β̂₁ is a random variable!
• We have shown that β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)².
• The X_i s and Y_i s are random variables: their value depends on which unit we randomly draw when we draw the i-th unit in the sample.
• Therefore, β̂₁ is a random variable, with a variance.
• Let σ² = (1/N) Σ_{k=1}^N (e_k)² denote the average of the squared prediction errors in the population.
• One can show that V(β̂₁) ≈ σ² / Σ_{i=1}^n (X_i − X̄)².
• V(β̂₁) is small if the average squared prediction error is low, meaning that the regression model makes small prediction errors in the population.
• V(β̂₁) is small if the variability of the X_i is high.
• V(β̂₁) is small if the sample size is large.
Using the central limit theorem for β̂₁ to construct a test and a confidence interval.
• If n ≥ 100, (β̂₁ − β₁)/√V(β̂₁) follows a normal distribution with mean 0 and variance 1.
• We can use this to test null hypotheses on β₁.
• Often, we want to test β₁ = 0. If β₁ = 0, the OLS regression function is β₀ + 0×x_k = β₀. This means that x_k is actually useless to predict y_k. E.g.: the best prediction of the amount people will fail to repay on their loan is actually not a function of their FICO score, it is just a constant.
• If we want to have a 5% chance of wrongly rejecting β₁ = 0, the test is:
Reject β₁ = 0 if β̂₁/√V(β̂₁) > 1.96 or β̂₁/√V(β̂₁) < −1.96.
Otherwise, do not reject β₁ = 0.
• We can also construct a confidence interval for β₁:
[β̂₁ − 1.96√V(β̂₁), β̂₁ + 1.96√V(β̂₁)].
For 95% of the random samples we can draw, β₁ belongs to the confidence interval.
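A minimal sketch of this test and confidence interval in Python; the values of β̂₁ and V(β̂₁) below are made up for illustration, not taken from the slides:

```python
import numpy as np

beta1_hat = 0.20          # hypothetical slope estimate
var_beta1 = 0.023 ** 2    # hypothetical estimated variance of beta1_hat

t = beta1_hat / np.sqrt(var_beta1)
reject = abs(t) > 1.96                       # 5%-level test of beta1 = 0
ci = (beta1_hat - 1.96 * np.sqrt(var_beta1),
      beta1_hat + 1.96 * np.sqrt(var_beta1)) # 95% confidence interval
print(t, reject, ci)
```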
Assessing the quality of our predictions: the MSE
• For every individual in the sample, ê_i = Y_i − (β̂₀ + β̂₁X_i): error we make when we use the sample OLS regression to predict Y_i.
• We have Y_i = β̂₀ + β̂₁X_i + ê_i.
• e_k = y_k − (β₀ + β₁x_k): population prediction errors. ê_i: sample prediction errors.
• Earlier, we have shown that (1/N) Σ_{k=1}^N e_k = 0.
• Similarly, (1/n) Σ_{i=1}^n ê_i = 0. The average sample prediction error = 0.
• We cannot use (1/n) Σ_{i=1}^n ê_i to assess the quality of our predictions. Even if our regression makes bad predictions, (1/n) Σ_{i=1}^n ê_i is always equal to 0.
• Instead, we use (1/n) Σ_{i=1}^n ê_i²: the mean squared error (MSE) of the regression.
• Good to compare regressions: if regression A has a lower MSE than B, A is better than B: it makes smaller errors on average.
• However, (1/n) Σ_{i=1}^n ê_i² is hard to interpret: if it is equal to 10, what does that mean? It does not have a natural scale to which we can compare it.
Assessing the quality of our predictions: the R².
• Instead, we are going to use R² = 1 − [(1/n) Σ_{i=1}^n ê_i²] / [(1/n) Σ_{i=1}^n (Y_i − Ȳ)²].
• R² = 1 − MSE / sample variance of the Y_i s.
The R² has a natural scale (1/3)
• ê_i = Y_i − (β̂₀ + β̂₁X_i) = error we make when we use the sample OLS regression to predict Y_i. We have Y_i = β̂₀ + β̂₁X_i + ê_i.
• e_k = y_k − (β₀ + β₁x_k): population errors. ê_i: sample errors.
• Earlier, we have shown that (1/N) Σ_{k=1}^N e_k = 0 and (1/N) Σ_{k=1}^N x_k e_k = 0. The average population error = 0, and the average product of the x_k s and e_k s = 0.
• Similarly, one can show that (1/n) Σ_{i=1}^n ê_i = 0 and (1/n) Σ_{i=1}^n X_i ê_i = 0. The average sample error = 0, and the average product of the X_i s and ê_i s = 0.
• Because of this, one can show that (writing ê̄ for the average of the ê_i s)
(1/n) Σ_{i=1}^n (Y_i − Ȳ)² = (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i + ê_i) − (β̂₀ + β̂₁X̄ + ê̄))²
= (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄) + (ê_i − ê̄))²
= (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))² + (1/n) Σ_{i=1}^n ê_i².
That's because (1/n) Σ_{i=1}^n ê_i = 0 and (1/n) Σ_{i=1}^n X_i ê_i = 0 imply
(1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))(ê_i − ê̄) = 0.
The R² has a natural scale (2/3)
• Let ê_i = Y_i − (β̂₀ + β̂₁X_i) denote the error we make when we use the sample OLS regression function to predict Y_i. We have Y_i = β̂₀ + β̂₁X_i + ê_i.
• One can show that (1/n) Σ_{i=1}^n ê_i = 0 and (1/n) Σ_{i=1}^n X_i ê_i = 0. The average sample prediction error is 0, and the average product of the X_i s and ê_i s in the sample is 0.
• Because of this, one can show that
(1/n) Σ_{i=1}^n (Y_i − Ȳ)² = (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))² + (1/n) Σ_{i=1}^n (ê_i − ê̄)².
• (1/n) Σ_{i=1}^n (Y_i − Ȳ)² is the sample variance of the Y_i s, (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))² is the sample variance of the β̂₀ + β̂₁X_i s, and (1/n) Σ_{i=1}^n ê_i² is the MSE.
• The sample variance of the Y_i s is equal to the sample variance of β̂₀ + β̂₁X_i, our predictions for Y_i, plus the MSE of the regression.
iClicker time
• One can show that
(1/n) Σ_{i=1}^n (Y_i − Ȳ)² = (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))² + (1/n) Σ_{i=1}^n ê_i².
• R² = 1 − [(1/n) Σ_{i=1}^n ê_i²] / [(1/n) Σ_{i=1}^n (Y_i − Ȳ)²].
• Based on the equality above, and based on its definition, which of the following properties should the number R² satisfy?
a) R² must be included between 0.5 and 1.
b) R² must be included between 0.5 and 1.5.
c) R² must be included between 0 and 1.
The R² has a natural scale: it must be included between 0 and 1 (3/3)
• One has:
(1/n) Σ_{i=1}^n (Y_i − Ȳ)² = (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))² + (1/n) Σ_{i=1}^n ê_i².
• R² = 1 − [(1/n) Σ_{i=1}^n ê_i²] / [(1/n) Σ_{i=1}^n (Y_i − Ȳ)²], so R² ≤ 1.
• Then, using the fact that
(1/n) Σ_{i=1}^n ê_i² = (1/n) Σ_{i=1}^n (Y_i − Ȳ)² − (1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))²,
one can show that
R² = [(1/n) Σ_{i=1}^n ((β̂₀ + β̂₁X_i) − (β̂₀ + β̂₁X̄))²] / [(1/n) Σ_{i=1}^n (Y_i − Ȳ)²] ≥ 0.
• => R² = easily interpretable measure of the quality of our predictions. If close to 1, our predictions make almost no error (MSE close to 0), so excellent prediction. If close to 0, poor prediction.
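The decomposition and the R² can be verified numerically. A sketch in Python on made-up data (an illustration, not the slides' data):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=1000)
Y = 5000 - 500 * X + rng.normal(scale=800, size=1000)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X          # predicted values
e_hat = Y - Y_hat            # sample prediction errors

mse = np.mean(e_hat ** 2)
r2 = 1 - mse / np.mean((Y - Y.mean()) ** 2)

# Variance decomposition: var(Y) = var(Y_hat) + MSE.
print(np.mean((Y - Y.mean()) ** 2), np.mean((Y_hat - Y_hat.mean()) ** 2) + mse)
print(r2)   # between 0 and 1
```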
What you need to remember
• The prediction for y_k based on the OLS regression of y_k on a constant and x_k in the population is β₀ + β₁x_k, with β₀ = ȳ − β₁x̄ and β₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)².
• We can estimate (β₀, β₁) if we measure the y_k s for a random sample.
• For every i between 1 and n, Y_i and X_i = value of the dependent and independent variables of the i-th unit we randomly select.
• (β₀, β₁) is the (b₀, b₁) that minimizes Σ_{k=1}^N (y_k − (b₀ + b₁x_k))².
• To estimate (β₀, β₁), find the (b₀, b₁) minimizing Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))².
• Yields β̂₀ = Ȳ − β̂₁X̄, and β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)².
• V(β̂₁) ≈ σ² / Σ_{i=1}^n (X_i − X̄)², and if n ≥ 100, (β̂₁ − β₁)/√V(β̂₁) follows N(0,1). We can use this to test β₁ = 0 and get a 95% confidence interval for β₁.
• R² = 1 − [(1/n) Σ_{i=1}^n ê_i²] / [(1/n) Σ_{i=1}^n (Y_i − Ȳ)²]. Close to 1: good prediction. Close to 0: poor prediction.
Roadmap
1. The OLS univariate affine regression function.
2. Estimating the OLS univariate affine regression function.
3. Interpreting β̂₁.
4. OLS univariate affine regression in practice.
A useful reminder: the sample covariance
• Assume we randomly draw a sample of n units from a population, and for each unit we observe variables X_i and Y_i.
• The sample covariance between X_i and Y_i is (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ).
• Example: X_i = FICO score of the i-th person, Y_i: amount she defaults.
• If X_i > X̄ (person i's FICO > average FICO in the sample):
– If Y_i > Ȳ (amount i defaults > average default in the sample) then (X_i − X̄)(Y_i − Ȳ) > 0,
– If Y_i < Ȳ then (X_i − X̄)(Y_i − Ȳ) < 0.
• If X_i < X̄ (person i's FICO < average FICO in the sample):
– If Y_i < Ȳ then (X_i − X̄)(Y_i − Ȳ) > 0.
– If Y_i > Ȳ then (X_i − X̄)(Y_i − Ȳ) < 0.
• When many people have X_i > X̄ and Y_i > Ȳ, and many people have X_i < X̄ and Y_i < Ȳ, then (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) > 0.
X_i and Y_i move in the same direction.
• When many people have both X_i > X̄ and Y_i < Ȳ, and many people have both X_i < X̄ and Y_i > Ȳ, then (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) < 0.
X_i and Y_i move in opposite directions.
iClicker time
• Let X_i = FICO score of the i-th person, Y_i: amount she defaults.
• Let (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) be their sample covariance.
• Which of the two statements sounds the most likely to you?
a) (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) > 0
b) (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) < 0
In this example, it is likely that (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) < 0
• Typically, one would expect that people with a FICO score below average default more than the average on their loan.
• Similarly, one would expect that people with a FICO score above average default less than the average on their loan.
• Therefore, we expect that people with X_i > X̄ also have Y_i < Ȳ, and people with X_i < X̄ also have Y_i > Ȳ.
• Therefore, it is likely that (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) < 0.
iClicker time
• Go back to the formula we derived for β̂₁, the coefficient of X_i in the sample regression of Y_i on a constant and X_i.
• Which of the following statements is correct:
a) β̂₁ is the sample covariance between X_i and Y_i.
b) β̂₁ is the sample covariance between X_i and Y_i divided by the sample variance of X_i.
c) β̂₁ is the sample covariance between X_i and Y_i divided by the sample variance of Y_i.
β̂₁ = sample covariance between X_i and Y_i divided by the sample variance of X_i.
• β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)².
• Multiplying the numerator and the denominator by 1/n yields:
β̂₁ = [(1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ)] / [(1/n) Σ_{i=1}^n (X_i − X̄)²].
• β̂₁ = sample covariance between X_i and Y_i divided by the sample variance of X_i.
• Therefore, β̂₁ > 0 if X_i and Y_i move in the same direction, β̂₁ < 0 if they move in opposite directions.
• In the regression of the amount defaulted on a constant and FICO, do you expect that β̂₁ > 0 or β̂₁ < 0?
For now, we can interpret the sign of β̂₁, not its specific value.
• For now, we have seen that β̂₁ > 0 means that X_i and Y_i move in the same direction, and β̂₁ < 0 means that they move in opposite directions.
• Interesting, but this does not tell us how we should interpret a specific value of β̂₁.
• For instance, what does a specific value such as β̂₁ = 0.5 mean?
• That's what we are going to see now.
Interpreting β̂₁ when X_i is binary.
• Assume you run an OLS regression of Y_i on a constant and X_i, where X_i is a binary variable (a variable equal either to 0 or to 1).
• Example: you regress Y_i, whether email i is spam, on a constant and X_i, a binary variable equal to 1 if the email has the word "free" in it, and to 0 if the email does not contain that word.
• Then, you have shown / will show during sessions that
β̂₁ = (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i,
where n₁ is the number of units that have X_i = 1, n₀ is the number of units that have X_i = 0, Σ_{i: X_i=1} Y_i is the sum of the Y_i of all units with X_i = 1, and Σ_{i: X_i=0} Y_i is the sum of the Y_i of all units with X_i = 0.
• In the spam example, explain in words what (1/n₁) Σ_{i: X_i=1} Y_i, (1/n₀) Σ_{i: X_i=0} Y_i, and β̂₁ respectively represent. Discuss this question with your neighbour for one minute.
iClicker time
Assume you regress Y_i, whether email i is spam, on a constant and X_i, a binary variable equal to 1 if the email has the word "free" in it, and to 0 if the email does not contain that word. You know that
β̂₁ = (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i.
• Which of the following statements is correct?
a) (1/n₁) Σ_{i: X_i=1} Y_i is the percentage of emails that have the word free among the emails that are spams, (1/n₀) Σ_{i: X_i=0} Y_i is the percentage of emails that have the word free among the emails that are not spams, so β̂₁ is the difference between the percentage of emails that have the word free across spams and non-spams.
b) (1/n₁) Σ_{i: X_i=1} Y_i is the percentage of emails that are spams among the emails that have the word free, (1/n₀) Σ_{i: X_i=0} Y_i is the percentage of emails that are spams among the emails that do not have the word free, so β̂₁ is the difference between the percentage of emails that are spams across emails that have and do not have the word free.
β̂₁ = difference between the % of spams across emails with/without the word free.
• Σ_{i: X_i=1} Y_i counts the number of spams among emails that have the word free.
• n₁ is the number of emails that have the word free.
• Therefore, (1/n₁) Σ_{i: X_i=1} Y_i: percentage of spams among emails that have the word free.
• Similarly, (1/n₀) Σ_{i: X_i=0} Y_i: percentage of spams among emails that do not have the word free.
• β̂₁ = difference between the % of spams across emails with/without the word free.
• Outside of this example, we have the following, very important result:
When you regress Y_i on a constant and X_i, where X_i is a binary variable, β̂₁ is the difference between the average value of Y_i among units with X_i = 1 and among units with X_i = 0.
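This result is easy to verify numerically. A minimal sketch in Python on made-up binary data (an illustration, not the slides' data): the OLS slope equals the difference between the two group means.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=1000)                   # binary regressor, 0 or 1
Y = (rng.random(size=1000) < 0.2 + 0.3 * X) * 1.0   # e.g. a spam indicator

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
diff_means = Y[X == 1].mean() - Y[X == 0].mean()
print(b1, diff_means)   # identical, up to floating-point error
```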
Testing whether the average of a variable is significantly different between 2 groups.
• When you regress Y_i on a constant and X_i, where X_i is a binary variable, β̂₁ is the difference between the average value of Y_i among units with X_i = 1 and among units with X_i = 0 in the sample.
• Similarly, β₁ is the difference between the average Y_i among units with X_i = 1 and among units with X_i = 0 in the full population.
• Remember that if β̂₁/√V(β̂₁) > 1.96 or β̂₁/√V(β̂₁) < −1.96, we can reject at the 5% level the null hypothesis that β₁ = 0.
• When we reject β₁ = 0 in a regression of Y_i on a constant and X_i, where X_i is a binary variable, we reject the null hypothesis that the average of Y_i is the same among units with X_i = 1 and among units with X_i = 0 in the full population.
• The difference between the averages of Y_i between the two groups in our sample is unlikely to be due to chance.
• The groups have a significantly different average of Y_i at the 5% level.
What about β̂₀?
• Assume you run an OLS regression of Y_i on a constant and X_i, where X_i is a binary variable (a variable equal either to 0 or to 1).
• Then, you have shown / will show during sessions that β̂₀ = (1/n₀) Σ_{i: X_i=0} Y_i.
• β̂₀: average of Y_i among units with X_i = 0.
• β̂₁ is the difference between the average value of Y_i among units with X_i = 1 and among units with X_i = 0.
• People sometimes call units with X_i = 0 the reference category, because β̂₁ compares the average value of Y_i among units that do not belong to that reference category to units in that reference category.
• In the spam example, β̂₀: percentage of spams among emails that do not have the word free in them, β̂₁ = difference between the percentage of spams across emails that have the word free in them and emails that do not have that word.
To predict the Y_j of a unit, OLS uses the average Y_i among units with the same X_i as that unit.
• Now, let's consider some units j outside of our sample.
• We do not observe their Y_j but we observe their X_j.
• Predicted value of Y_j according to the OLS regression: Ŷ_j = β̂₀ + β̂₁X_j.
• β̂₀ = (1/n₀) Σ_{i: X_i=0} Y_i, and β̂₁ = (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i.
• So Ŷ_j = (1/n₀) Σ_{i: X_i=0} Y_i for units j such that X_j = 0.
• And Ŷ_j = (1/n₀) Σ_{i: X_i=0} Y_i + (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i = (1/n₁) Σ_{i: X_i=1} Y_i for units j such that X_j = 1.
• To make a prediction for a unit with X_j = 0, we use the average Y_i among units with X_i = 0 in the sample.
• To make a prediction for a unit with X_j = 1, we use the average Y_i among units with X_i = 1 in the sample.
• Prediction = average Y_i among units with the same X_i in the sample.
• In sessions: in the regression of Y_i on a constant only, the OLS prediction = average Y_i among units in the sample.
For now, we know how to interpret the value of β̂₁, but only when X_i is binary.
• When X_i is binary, β̂₁ = (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i.
• In that special case, β̂₁ has a very simple interpretation: the difference between the average Y_i among units with X_i = 1 and among units with X_i = 0.
• In other words, β̂₁ measures the difference between the average of Y_i across subgroups whose X_i differs by one (units with X_i = 1 versus units with X_i = 0).
• Does this result extend to the case where X_i is not binary?
β̂₁ measures the difference between the average of Y_i across subgroups whose X_i differs by one
• When X_i is binary, β̂₁ measures the difference between the average of Y_i across subgroups whose X_i differs by one (units with X_i = 1 versus X_i = 0).
• Now, assume that X_i can be equal to 0, 1, or 2.
• n₀: number of units with X_i = 0. n₁: number of units with X_i = 1. n₂: number of units with X_i = 2.
β̂₁ = w[(1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i] + (1 − w)[(1/n₂) Σ_{i: X_i=2} Y_i − (1/n₁) Σ_{i: X_i=1} Y_i],
where w is a number included between 0 and 1 that you don't need to know.
• β̂₁: weighted average of the difference between the average Y_i of units with X_i = 1 and X_i = 0, and of the difference between the average Y_i of units with X_i = 2 and X_i = 1.
• Units with X_i = 1 and X_i = 0 have a value of X_i that differs by one.
• Units with X_i = 2 and X_i = 1 have a value of X_i that differs by one.
• => β̂₁ measures the difference between the average of Y_i across subgroups whose X_i differs by one!
β̂₁ measures the difference between the average of Y_i across subgroups whose X_i differs by one
• When X_i is binary, β̂₁ measures the difference between the average of Y_i across subgroups whose X_i differs by one (units with X_i = 1 versus X_i = 0).
• Now, assume that X_i can be equal to 0, 1, 2, …, K.
• n₀: number of units with X_i = 0, n₁: number of units with X_i = 1, …, n_K: number of units with X_i = K.
β̂₁ = Σ_{k=1}^K w_k [(1/n_k) Σ_{i: X_i=k} Y_i − (1/n_{k−1}) Σ_{i: X_i=k−1} Y_i],
where the w_k are positive weights summing to 1 that you do not need to know.
• β̂₁: weighted average of the difference between the average Y_i of units with X_i = 1 and X_i = 0, of the difference between the average Y_i of units with X_i = 2 and X_i = 1, …, of the difference between the average Y_i of units with X_i = K and X_i = K − 1.
• Units with X_i = 1 and X_i = 0 have a value of X_i that differs by one.
• Units with X_i = K and X_i = K − 1 have a value of X_i that differs by one.
• β̂₁ = difference between the average of Y_i across subgroups whose X_i differs by one!
Logs versus levels
• Assume you regress Y_i on a constant and X_i, and β̂₁ = 0.5: when you compare people whose X_i differs by 1, the average Y_i is 0.5 larger among the people whose X_i is 1 unit larger.
• Assume you regress ln(Y_i) on a constant and X_i, and β̂₁ = 0.5: when you compare people whose X_i differs by 1, the average ln(Y_i) is 0.5 larger among the people whose X_i is 1 unit larger.
• Due to the properties of the ln function, if people whose X_i is 1 unit larger have an average ln(Y_i) 0.5 larger, the average of Y_i is roughly 50% larger among those people.
• Assume you regress ln(Y_i) on a constant and ln(X_i), and β̂₁ = 0.5: when you compare people whose X_i differs by 1%, the average Y_i is 0.5% larger among the people whose X_i is 1% larger.
• Regressing Y_i on a constant and X_i is useful to study how the mean of Y_i differs in levels across units whose X_i differs by one.
• Regressing ln(Y_i) on a constant and X_i is useful to study how the mean of Y_i differs in relative terms across units whose X_i differs by one.
• Regressing ln(Y_i) on a constant and ln(X_i) is useful to study how the mean of Y_i differs in relative terms across units whose X_i differs by 1%.
iClicker time
• Assume you observe the wages of a sample of wage earners in the US. You regress Y_i, the monthly wage of person i, on a constant and X_i, a binary variable equal to 1 if i is a female and to 0 if i is a male. Assume that you find β̂₁ = −200 and β̂₀ = 2000.
• Which of the following statements is correct?
a) In this sample, the average wage of females is 200 dollars higher than the average wage of males, and the average wage of females is 2000 dollars.
b) In this sample, the average wage of females is 200 dollars lower than the average wage of males, and the average wage of males is 2000 dollars.
The average wage of females is 200 dollars lower than the average wage of males.
• X_i is binary: X_i = 0 for males, X_i = 1 for females.
• β̂₁ = (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i. Therefore, β̂₁ = difference between the average wage of females and males.
• β̂₁ = −200 means that females make 200 dollars less than males on average.
• β̂₀ = (1/n₀) Σ_{i: X_i=0} Y_i. Therefore, β̂₀ = average wage of males.
• β̂₀ = 2000 means that males make 2000 dollars on average.
iClicker time
• Assume you observe the wages of a sample of 5,000 wage earners in the US. You regress Y_i, the monthly wage of person i, on a constant and X_i, a binary variable equal to 1 if i is a female and to 0 if i is a male. Assume that Eviews or Stata tells you that β̂₁ = −200 and V(β̂₁) = 20. Which of the following statements is correct?
a) In this sample, the average wage of females is 200 dollars lower than the average wage of males, and the difference between the average wage of the two groups is statistically significant at the 5% level.
b) In this sample, the average wage of females is 200 dollars lower than the average wage of males, and the difference between the average wage of the two groups is not statistically significant at the 5% level.
|β̂₁|/√V(β̂₁) > 1.96, so we reject β₁ = 0 at the 5% level.
• β̂₁/√V(β̂₁) = −200/√20 ≈ −44.7 < −1.96, so we reject β₁ = 0 at the 5% level.
• The difference between the average wage of males and females is statistically significant at the 5% level.
• It is very unlikely (less than a 5% chance) that in the US population males and females have the same average wage, but that we drew a random sample fairly different from the US population, in which males' average wage is 200 dollars higher than that of females.
• Given that our random sample is quite large (5,000 people), the fact that in our sample the average wage of males is 200 dollars higher than that of females indicates that in the US population, males also have a higher average wage than females.
iClicker time
• Assume you observe the wages of a sample of wage earners in the US. You regress Y_i, the monthly wage of person i, on a constant and X_i, a binary variable equal to 1 if i is a female and to 0 if i is a male. Assume that you find β̂₁ = −200 and β̂₀ = 2000.
• Which of the following statements is correct?
a) To predict the wage of a female not in the sample, this regression model will use the average wage of females in the sample.
b) To predict the wage of a female not in the sample, this regression model will use the average wage of males and females in the sample.
To predict the wage of a female not in the sample, the regression uses the average wage of females in the sample.
• Now, let's consider some units j outside of our sample => we do not observe their Y_j.
• Predicted value of Y_j according to the OLS regression: Ŷ_j = β̂₀ + β̂₁X_j.
• Given that j is female, X_j = 1, so her predicted wage is: Ŷ_j = β̂₀ + β̂₁.
• β̂₀ = (1/n₀) Σ_{i: X_i=0} Y_i, and β̂₁ = (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i, so
Ŷ_j = (1/n₀) Σ_{i: X_i=0} Y_i + (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i = (1/n₁) Σ_{i: X_i=1} Y_i.
• Predicted wage: the average wage of females in the sample.
iClicker time
• Assume you observe the wages of a sample of wage earners in the US. You regress ln(Y_i), the ln(monthly wage) of person i, on a constant and X_i, a binary variable equal to 1 if i is a female and to 0 if i is a male. Assume that you find β̂₁ = −0.1.
• Which of the following statements is correct?
a) The average wage of females is 10% lower than the average wage of males.
b) The average wage of females is 0.1 dollars lower than the average wage of males.
The average wage of females is 10% lower than the average wage of males.
• X_i is binary: X_i = 0 for males, X_i = 1 for females.
• β̂₁ = (1/n₁) Σ_{i: X_i=1} ln(Y_i) − (1/n₀) Σ_{i: X_i=0} ln(Y_i). Therefore, β̂₁ = difference between the average ln(wage) of females and males.
• β̂₁ = −0.1 means that the average ln(wage) of females is 0.1 lower than the average ln(wage) of males.
• As we discussed a few slides ago, using some properties of the ln function, one can show that this implies that the average wage of females is 10% lower than the average wage of males.
iClicker time
• Assume you observe the wages of a sample of wage earners in the US. You regress Y_i, the monthly wage of person i, on a constant and X_i, their number of years of professional experience (from 0 for people who just started working to 50 for people who have worked for 50 years). Assume that you find β̂₁ = 100.
• Which of the following statements is correct?
a) When we compare people whose years of experience differ by one, we find that on average, those who have one more year of experience earn 100 more dollars per month.
b) The covariance between years of experience and wage is equal to 100.
c) The covariance between years of experience and wage divided by the variance of years of experience is equal to 100.
Answers a) and c) are both correct
• X_i can be equal to 0 (no experience), 1, 2, …, 50.
• Let n₀ be the number of units with X_i = 0 (no experience), …, let n₅₀ be the number of units with X_i = 50 (50 years of experience).
β̂₁ = Σ_{k=1}^{50} w_k [(1/n_k) Σ_{i: X_i=k} Y_i − (1/n_{k−1}) Σ_{i: X_i=k−1} Y_i],
where the w_k are positive weights summing to 1 that you do not need to know.
• β̂₁: weighted average of the difference between the average wage of people with 1 and 0 years of experience, of the difference between the average wage of people with 2 and 1 years of experience, …, of the difference between the average wage of units with 50 and 49 years of experience.
• β̂₁ = 100 means that when we compare people whose years of experience differ by one, we find that on average, those who have one more year of experience earn 100 more dollars per month.
• Answer c) is also correct. However, the ratio of the covariance and the variance is hard to interpret, while the average difference in wages of people with one year of difference in their experience is easy to interpret.
iClicker time
• Assume you observe the wages of a sample of wage earners in the US. You regress ln(Y_i), the ln(monthly wage) of person i, on a constant and ln(X_i), the ln(number of years of professional experience) of that person. Assume that you find β̂₁ = 0.5.
• Which of the following statements is correct?
a) When we compare people whose years of experience differ by one, we find that on average, those who have one more year of experience earn 50% more.
b) When we compare people whose years of experience differ by 1%, we find that on average, those who have 1% more years of experience earn 0.5% more.
Answer b) is correct
• We regress ln(Y_i), the ln(monthly wage) of person i, on a constant and ln(X_i), the ln(number of years of professional experience) of that person.
• Because ln(X_i) and not X_i is in the regression, β̂₁ does not compare subgroups whose experience differs by one year, but subgroups whose experience differs by 1%!
• In this sample, when we compare subgroups of people whose years of experience differ by 1%, we find that on average, those who have 1% more years of experience earn 0.5% more.
What you need to remember
• β̂₁ = sample covariance between X_i and Y_i divided by the sample variance of X_i.
• β̂₁ > 0 (resp. β̂₁ < 0): the covariance between X_i and Y_i is > 0 (resp. < 0): X_i and Y_i are positively (resp. negatively) correlated, and move in the same (resp. opposite) direction.
• When X_i is binary, β̂₁ = (1/n₁) Σ_{i: X_i=1} Y_i − (1/n₀) Σ_{i: X_i=0} Y_i: the difference between the average of Y_i among subgroups whose X_i differs by one (units with X_i = 1 versus units with X_i = 0).
• When X_i is not binary, β̂₁ still measures the difference between the average of Y_i among subgroups whose X_i differs by one.
• You need to know how to interpret β̂₁ in a regression of Y_i on a constant and X_i, in a regression of ln(Y_i) on a constant and X_i, and in a regression of ln(Y_i) on a constant and ln(X_i).
Roadmap
1. The OLS univariate affine regression function.
2. Estimating the OLS univariate affine regression function.
3. Interpreting β̂₁.
4. OLS univariate affine regression in practice.
How Gmail uses OLS univariate affine regression
• Gmail wants to predict y_k: 1 if email k is spam, 0 otherwise.
• To do so, it uses x_k: 1 if "free" appears in the email, 0 otherwise.
• x_k is easy to measure (a computer can do it automatically, by searching for "free" in the email), but y_k is hard to measure: only a human can know whether an email is spam or not. => Gmail cannot observe y_k for all emails.
• To make good predictions, Gmail would like to compute (β₀, β₁), the value of (b₀, b₁) minimizing Σ_{k=1}^N (y_k − (b₀ + b₁x_k))², and then use β₀ + β₁x_k to predict y_k. β₀ + β₁x_k: the affine function of x_k for which the sum of squared prediction errors Σ_{k=1}^N (y_k − (b₀ + b₁x_k))² is minimized.
• Issue: β₀ = ȳ − β₁x̄, and β₁ = Σ_{k=1}^N (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^N (x_k − x̄)² => Gmail cannot compute these numbers because it does not observe the y_k s.
How Gmail uses OLS univariate affine regression
• Instead, Gmail draws a random sample of, say, 5000 emails, and asks humans to read them and determine whether they are spam or not.
• For i between 1 and 5000, Y_i: whether the i-th randomly drawn email is spam, X_i: whether the i-th randomly drawn email has free in it.
• (β₀, β₁) is the value of (b₀, b₁) minimizing Σ_{k=1}^N (y_k − (b₀ + b₁x_k))².
• To estimate (β₀, β₁): use the (b₀, b₁) minimizing Σ_{i=1}^n (Y_i − (b₀ + b₁X_i))².
• Yields β̂₀ = Ȳ − β̂₁X̄ and β̂₁ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)².
• For emails not in the sample, Gmail does not know whether they are spam, but uses β̂₀ + β̂₁x_k as its prediction of whether the email is spam or not.
• Because the random sample of emails is large, β̂₀ and β̂₁ should be close to β₀ and β₁, and therefore β̂₀ + β̂₁x_k should be close to β₀ + β₁x_k, the best univariate affine prediction of y_k given x_k.
• Use R² to assess whether the regression makes good predictions.
Application to a data set of 4601 emails
• 4601 emails which have been read by humans. Variable spam = 1 if the email is spam, 0 otherwise.
• We have another variable: the number of times the word "free" appears in the email / number of words in the email × 100. It ranges from 0 to 100: percentage points.
• We go to Eviews and write "ls spam c percent_word_free".
Dependent Variable: SPAM
Method: Least Squares
Date: 04/26/17 Time: 15:56
Sample: 1 4601
Included observations: 4601
Variable Coefficient Std. Error t-Statistic Prob.
PERCENT_WORD_FREE 0.201984 0.023411 8.627873 0.0000
C 0.372927 0.007555 49.35958 0.0000
R-squared 0.015928 Mean dependent var 0.394045
Adjusted R-squared 0.015714 S.D. dependent var 0.488698
S.E. of regression 0.484843 Akaike info criterion 1.390450
Sum squared resid 1081.098 Schwarz criterion 1.393247
Log likelihood -3196.730 Hannan-Quinn criter. 1.391434
F-statistic 74.44020 Durbin-Watson stat 0.032029
Prob(F-statistic) 0.000000
Interpretation of β̂₀ and β̂₁
• β̂₀ = 0.37 and β̂₁ = 0.20. Interpretation of β̂₁: when we compare emails whose percentage of words that are the word "free" differs by 1 point, the percentage of spams is 20 points higher among the emails whose percentage of the word free is 1 point higher.
• Emails where the word free appears more often are more likely to be spam!
Using β̂₀ and β̂₁ to make predictions
• β̂₀ = 0.37 and β̂₁ = 0.20. Assume you consider two emails outside of your sample, and therefore you do not know whether they are spam or not.
• In one email, the word "free" = 0% of the words of the email; in the other one, the word "free" = 1% of the words of the email.
• According to the OLS affine regression function, what is your prediction for the first email being spam? What is your prediction for the second email being spam? Discuss this question with your neighbor for 2 minutes.
iClicker time
• β̂₀ = 0.37 and β̂₁ = 0.20. Assume you consider two emails, one where the word "free" = 0% of the words of the email, the other one where the word "free" = 1% of the words of the email.
• According to the OLS affine regression function, what is your prediction for the first email being spam? What is your prediction for the second email being spam?
a) The predicted value for the first email being spam is 0.37, while the predicted value for the second email being spam is 0.372.
b) The predicted value for the first email being spam is 0.37, while the predicted value for the second email being spam is 0.57.
The predicted value for the 1st email being spam is 0.37; the predicted value for the 2nd email being spam is 0.57.
• β̂₀ = 0.37 and β̂₁ = 0.20. Assume you consider two emails, one where the word "free" = 0% of the words of the email, the other one where the word "free" = 1% of the words of the email.
• According to this regression, the predicted value for whether an email is spam is β̂₀ + β̂₁x, where x is the number of times "free" appears in the email / number of words in the email × 100.
• For the first email, x = 0 => predicted value = 0.37.
• For the second email, x = 1 => predicted value = 0.37 + 0.20 = 0.57.
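The same computation as a tiny Python sketch (the helper name below is ours for illustration, not Gmail's or Eviews'):

```python
beta0_hat, beta1_hat = 0.37, 0.20   # estimates from the Eviews output above

def predict_spam(percent_word_free):
    """Predicted value for the email being spam: beta0_hat + beta1_hat * x."""
    return beta0_hat + beta1_hat * percent_word_free

print(predict_spam(0.0))  # 0.37
print(predict_spam(1.0))  # 0.57
```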
Testing β₁ = 0.
• β̂₁ = 0.20, and √V(β̂₁) = 0.023.
• Can we reject at the 5% level the null hypothesis that β₁ = 0? Discuss this question with your neighbor for 1 minute.
iClicker time
• β̂₁ = 0.20, and √V(β̂₁) = 0.023.
• Can we reject at the 5% level the null hypothesis that β₁ = 0?
a) Yes
b) No
Yes!
• If we want to have a 5% chance of wrongly rejecting β₁ = 0, the test is:
Reject β₁ = 0 if β̂₁/√V(β̂₁) > 1.96 or β̂₁/√V(β̂₁) < −1.96.
Otherwise, do not reject β₁ = 0.
• Here, β̂₁/√V(β̂₁) = 0.20/0.023 ≈ 8.6 > 1.96 => we can reject β₁ = 0.
• The percentage of the words of the email that are the word "free" is a statistically significant predictor of whether the email is spam or not!
• Find the 95% confidence interval for β₁. You have 2 minutes.
iClicker time
• β̂₁ = 0.20, and √V(β̂₁) = 0.023.
• The 95% confidence interval for β₁ is:
a) [0.155, 0.245]
b) [0.143, 0.228]
The 95% confidence interval for β₁ is [0.155, 0.245]
• β̂₁ = 0.20, and √V(β̂₁) = 0.023.
• The 95% confidence interval for β₁ is [β̂₁ − 1.96√V(β̂₁), β̂₁ + 1.96√V(β̂₁)].
• Plugging in the values of β̂₁ and √V(β̂₁) yields [0.155, 0.245].
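The same computation in Python (a sketch using the estimates above):

```python
beta1_hat, se_beta1 = 0.20, 0.023   # slope estimate and its standard error

ci_low = beta1_hat - 1.96 * se_beta1
ci_high = beta1_hat + 1.96 * se_beta1
print(round(ci_low, 3), round(ci_high, 3))  # 0.155 0.245
```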
iClicker time
• Does the regression have a low or a high R-squared?
a) It has a low R-squared.
b) It has a high R-squared.
[Same Eviews output as shown above: R-squared = 0.015928.]
Our regression has a low R².
• The R² of the regression is equal to 0.016.
• R² is included between 0 and 1. Close to 0: bad prediction. Close to 1: good prediction.
• Here it is close to 0 => bad prediction.
If we use this regression to construct a spam filter, the filter will be pretty bad.
• We can compute β̂₀ + β̂₁X_i for each email in our sample.
• 39% of those 4601 emails are spams => we could say: we predict that the 39% of emails with the highest value of β̂₀ + β̂₁X_i are spams, while the other emails are not spams.
• We can look at how this spam filter performs in our sample.
• Among the non-spams, we correctly predict that 85% are not spams, but we wrongly predict that 15% are spams.
• Among the spams, we correctly predict that 35% are spams, but we wrongly predict that 65% are non-spams.
• => if Gmail used this spam filter, you would receive many spams, Gmail would send many legitimate emails to your trash, and you would change your email account to Microsoft.
• In the homework, you will see how to construct a better spam filter.
What you need to remember, and what's next
• In practice, there are many instances where we can measure y_k, the variable we do not observe for everyone, for a sample of the population.
• We can use that sample to compute β̂₀ and β̂₁, and then use β̂₀ + β̂₁x_k as our prediction of the y_k we do not observe.
• If that sample is a random sample from the population, β̂₀ + β̂₁x_k should be close to β₀ + β₁x_k, the best affine prediction for y_k.
• But univariate affine regression might still not give great predictions: spam example.
• There are better prediction methods available. Next lectures: we see one of them.

OLS multivariate regression.
Clement de Chaisemartin and Doug Steigerwald
UCSB
Banks have more than 1 variable to predict the amount applicants will fail to reimburse
• To predict the amount applicants will fail to reimburse, banks can use their FICO score (a score based on their current debts and on their history of loan repayments), and all other variables contained in their application: e.g. their income.
• Will the bank be able to make better predictions by using both variables rather than just the FICO score?
Yes, provided people with different incomes but the same FICO fail to reimburse different amounts on their loan.
• Assume the FICO score can take only two values: 0 and 100.
• Assume applicants' income can take two values: 2000 and 4000.
• If the average amount people fail to reimburse varies with FICO and income as in the table below, adding the applicant's income to the model improves the prediction.
• People with different income levels but with the same FICO score fail to reimburse different amounts on their loan => adding income to your prediction model will improve the quality of your predictions.

            Income=2000   Income=4000
FICO=0      2000          1000
FICO=100    500           200
Gmail has more than 1 variable to predict whether an email is spam.
• To predict whether an email is spam, Gmail can use a variable equal to 1 if "free" appears in the email, and a variable equal to 1 if "buy" appears in the email.
• If the percentage of spams varies as in the table below, adding the "buy" variable to the model will improve the prediction.
• Emails which have "buy" and "free" in them are more likely to be spams than emails which only have "free" in them.
• Emails which have "buy" but not "free" in them are more likely to be spams than emails which have neither "buy" nor "free" in them.
• => adding the "buy" variable will improve predictions.

% of spams                        Email has "buy" in it   Email doesn't have "buy" in it
Email has "free" in it            3%                      1.5%
Email doesn't have "free" in it   1%                      0.5%
Multivariate regression
• In these lectures, we are going to discuss OLS multivariate regressions, which are OLS regressions that use several independent variables to predict a dependent variable.
Roadmap
1. The OLS multivariate regression function.
2. Estimating the OLS multivariate regression function.
3. Advantages and pitfalls of multivariate regressions.
4. Interpreting coefficients in multivariate OLS regressions.
Set up and notation.
• We consider a population of N units.
– N could be the number of people who apply for a loan in bank A during May 2017.
• Each unit k has a variable y_k attached to it that we do not observe:
– In the loan example, y_k is the amount applicant k will fail to reimburse on her loan when her loan expires in May 2018.
• Each unit k also has J variables x_{1k}, x_{2k}, x_{3k}, …, x_{Jk} attached to it that we do observe:
– In the loan example, x_{1k} could be the FICO score of applicant k, x_{2k} could be the income of that applicant, etc.
Prediction = function of x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}.
• Based on the value of x_{1k}, x_{2k}, x_{3k}, …, x_{Jk} of each unit, we want to predict her y_k.
• E.g.: in the loan example, we want to predict the amount that a unit will fail to repay based on her FICO score and her income.
• The prediction should be a function of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}).
• In these lectures, we focus on predictions which are an affine function of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}) = c₀ + c₁x_{1k} + … + c_J x_{Jk}, for J + 1 real numbers c₀, c₁, …, c_J.
Prediction error: y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk})
• Based on the value of x_{1k}, …, x_{Jk} of each unit, we predict her y_k.
• Our prediction should be a function of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}).
• We focus on affine functions of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}) = c₀ + c₁x_{1k} + … + c_J x_{Jk}, for J + 1 real numbers c₀, c₁, …, c_J.
• y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}), the difference between the actual value of y_k and our prediction for y_k, is the prediction error.
• Large positive or negative values of y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}) mean a bad prediction.
• y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}) close to 0 means a good prediction.
Goal: find the value of (c₀, c₁, …, c_J) that minimizes Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))²
• Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² is positive. => minimizing it = making it as close to 0 as possible.
• If Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² is as close to 0 as possible, the sum of the squared values of our prediction errors is as small as possible.
• => we make small errors. That's good, that's what we want!
The OLS multivariate regression function
• Let (γ₀, γ₁, …, γ_J) = argmin_{(c₀, c₁, …, c_J)} Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))².
• (γ₀, γ₁, …, γ_J): the value of (c₀, c₁, …, c_J) minimizing Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))².
• We call γ₀ + γ₁x_{1k} + … + γ_J x_{Jk} the OLS multivariate regression function of y_k on a constant, x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}.
• We let ỹ_k = γ₀ + γ₁x_{1k} + … + γ_J x_{Jk} denote the prediction from the multivariate OLS regression.
• We let e_k = y_k − ỹ_k: the prediction error.
• We have y_k = ỹ_k + e_k.
How can we find (γ₀, γ₁, …, γ_J)?
• (γ₀, γ₁, …, γ_J): the value of (c₀, c₁, …, c_J) minimizing Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))².
• To minimize a function of several variables, we differentiate it wrt each of those variables, and we find the value of (c₀, c₁, …, c_J) for which all those derivatives are equal to 0. No need to worry about second derivatives because the objective function is convex.
• What is the derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² with respect to c₀? Discuss this question with your neighbor for 2 minutes.
iClicker time
• What is the derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² with respect to c₀?
a) −2 Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))
b) 2 Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))
c) Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))
−2 Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))!
• The derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² with respect to c₀ is −2 Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk})): P4Sum + chain rule.
• What is the derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² with respect to c₁? Discuss this question with your neighbor for 2 minutes.
iClicker time
• What is the derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² with respect to c₁?
a) −2 Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))
b) −2 Σ_{k=1}^N x_{1k}(y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))
−2 Σ_{k=1}^N x_{1k}(y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))!
• The derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² wrt c₁: −2 Σ_{k=1}^N x_{1k}(y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk})): P4Sum + chain rule.
• The derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² wrt c₂: −2 Σ_{k=1}^N x_{2k}(y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk})).
• …
• The derivative of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² wrt c_J: −2 Σ_{k=1}^N x_{Jk}(y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk})).
(γ₀, γ₁, …, γ_J) is the solution of a system of J + 1 equations with J + 1 unknowns.
• (γ₀, γ₁, …, γ_J): the value of (c₀, c₁, …, c_J) for which all those derivatives are equal to 0.
• Thus, we have:
−2 Σ_{k=1}^N (y_k − (γ₀ + γ₁x_{1k} + … + γ_J x_{Jk})) = 0
−2 Σ_{k=1}^N x_{1k}(y_k − (γ₀ + γ₁x_{1k} + … + γ_J x_{Jk})) = 0
…
−2 Σ_{k=1}^N x_{Jk}(y_k − (γ₀ + γ₁x_{1k} + … + γ_J x_{Jk})) = 0
• (γ₀, γ₁, …, γ_J) is the solution of a system of J + 1 equations with J + 1 unknowns.
• If we give the values of the y_k s, of the x_{1k} s, …, and of the x_{Jk} s to a computer, it can solve this system and give us the value of (γ₀, γ₁, …, γ_J).
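These first-order conditions are exactly the normal equations Z′Zγ = Z′y, where Z is the matrix with a column of ones and one column per regressor. A minimal sketch in Python on made-up data (an illustration, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
N, J = 10_000, 2
x = rng.normal(size=(N, J))                          # x_{1k}, x_{2k}
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 1] + rng.normal(size=N)

# Design matrix: a column of ones for the constant, then the regressors.
Z = np.column_stack([np.ones(N), x])

# The J+1 equations -2 * sum(...) = 0 stack into Z'(y - Z @ gamma) = 0,
# i.e. Z'Z @ gamma = Z'y; a computer solves that linear system directly.
gamma = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(gamma)   # (gamma_0, gamma_1, gamma_2), close to (1, 2, -3)
```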
What you need to remember
• Population of N units. Each unit has J + 1 variables attached to it: y_k is a variable we do not observe, x_{1k}, x_{2k}, x_{3k}, …, x_{Jk} are variables we observe. We want to predict y_k based on x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}.
• E.g.: a bank wants to predict the amount an applicant will fail to reimburse on her loan based on her FICO score and her income.
• Our prediction should be a function of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}). Affine functions of x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}: c₀ + c₁x_{1k} + … + c_J x_{Jk}, for some real numbers (c₀, c₁, …, c_J).
• A good prediction should be such that e_k = y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}), our prediction error, is as small as possible for most units.
• The best (c₀, c₁, …, c_J): minimizes Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))².
• We call that value (γ₀, γ₁, …, γ_J). γ₀ + γ₁x_{1k} + … + γ_J x_{Jk} is the OLS regression function of y_k on a constant, x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}.
• (γ₀, γ₁, …, γ_J): solution of a system of J + 1 equations with J + 1 unknowns: the derivatives of Σ_{k=1}^N (y_k − (c₀ + c₁x_{1k} + … + c_J x_{Jk}))² wrt c₀, c₁, …, c_J must = 0 at (γ₀, γ₁, …, γ_J).
Roadmap
1. The OLS multivariate regression function.
2. Estimating the OLS multivariate regression function.
3. Advantages and pitfalls of multivariate regressions.
4. Interpreting coefficients in multivariate OLS regressions.
We cannot compute (γ₀, γ₁, …, γ_J)
• Our prediction for y_k based on a multivariate regression is γ₀ + γ₁x_{1k} + … + γ_J x_{Jk}, the OLS multivariate regression function.
• => to be able to make a prediction for a unit's y_k based on her x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}, we need to know the value of (γ₀, γ₁, …, γ_J).
• Under the assumptions we have made so far, we cannot compute (γ₀, γ₁, …, γ_J), the solution of
−2 Σ_{k=1}^N (y_k − (γ₀ + γ₁x_{1k} + … + γ_J x_{Jk})) = 0
−2 Σ_{k=1}^N x_{1k}(y_k − (γ₀ + γ₁x_{1k} + … + γ_J x_{Jk})) = 0
…
−2 Σ_{k=1}^N x_{Jk}(y_k − (γ₀ + γ₁x_{1k} + … + γ_J x_{Jk})) = 0
• To solve this system, we need to know the y_k s, which we don't!
• E.g.: the bank knows the FICO score and income (x_{1k} and x_{2k}) of each applicant, but does not know the amount each applicant will fail to reimburse in May 2018 when the loan expires (y_k).
A method to estimate $(\gamma_0, \gamma_1, \dots, \gamma_J)$
• We draw $n$ units from the population, and we measure the dependent and the independent variables of those $n$ units.
• For $i$ included between 1 and $n$, $(Y_i, X_{1i}, \dots, X_{Ji})$ = value of the dependent and independent variables of the $i$th unit we randomly select.
• $(\gamma_0, \gamma_1, \dots, \gamma_J)$: value of $(c_0, c_1, \dots, c_J)$ minimizing $\sum_{k=1}^{N}\big(y_k - (c_0 + c_1 x_{1k} + \cdots + c_J x_{Jk})\big)^2$.
• => to estimate $(\gamma_0, \gamma_1, \dots, \gamma_J)$, we use the $(c_0, c_1, \dots, c_J)$ minimizing $\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)^2$.
• $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$ denotes that value.
• Instead of the $(c_0, c_1, \dots, c_J)$ minimizing the sum of squared errors in the population, use the $(c_0, c_1, \dots, c_J)$ minimizing the sum of squared errors in the sample.
• If we find a good prediction function in the sample, it should also work well in the entire population: the sample is representative of the population.
21
The OLS regression function in the sample.
• Let
$(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J) = \mathrm{argmin}_{(c_0, c_1, \dots, c_J) \in \mathbb{R}^{J+1}} \sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)^2$
• We call $\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji}$ the OLS regression function of $Y_i$ on a constant, $X_{1i}$, …, and $X_{Ji}$ in the sample.
• $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$: coefficients of the constant, $X_{1i}$, …, and $X_{Ji}$.
• Let $\hat{Y}_i = \hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji}$. $\hat{Y}_i$ is the predicted value for $Y_i$ according to the OLS regression function of $Y_i$ on a constant, $X_{1i}$, …, and $X_{Ji}$ in the sample.
• Let $\hat{e}_i = Y_i - \hat{Y}_i$. $\hat{e}_i$: error we make when we use the OLS regression in the sample to predict $Y_i$.
• We have $Y_i = \hat{Y}_i + \hat{e}_i$.
22
How can we find $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$?
• $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$: value of $(c_0, c_1, \dots, c_J)$ minimizing $\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)^2$.
• To minimize a function of several variables, we differentiate it wrt each of those variables, and we find the value of $(c_0, c_1, \dots, c_J)$ for which all those derivatives are equal to 0. No need to worry about second derivatives because the objective function is convex.
23
The derivatives of the objective function
• Derivative of $\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)^2$ wrt $c_0$:
$-2\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)$: P4Sum + chain rule + P2Sum.
• Derivative of $\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)^2$ wrt $c_1$:
$-2\sum_{i=1}^{n} X_{1i}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)$
• …
• Derivative of $\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)^2$ wrt $c_J$:
$-2\sum_{i=1}^{n} X_{Ji}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)$
24
$(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$ = solution of a system of $J+1$ equations with $J+1$ unknowns.
• $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$: value of $(c_0, c_1, \dots, c_J)$ for which all the derivatives = 0.
$-2\sum_{i=1}^{n}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji})\big) = 0$
$-2\sum_{i=1}^{n} X_{1i}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji})\big) = 0$
…
$-2\sum_{i=1}^{n} X_{Ji}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji})\big) = 0$
• To compute $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$, we:
– draw $n$ units from the population, measure their $Y_i$s and $(X_{1i}, \dots, X_{Ji})$s
– set up the above system, plugging in the actual values of the $Y_i$s and $(X_{1i}, \dots, X_{Ji})$s
– this yields a system of $J+1$ equations with $J+1$ unknowns, the $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$s: all the remaining quantities are real numbers
– ask a computer to solve that system.
25
The ls command in E-views solves that system of $J+1$ equations with $J+1$ unknowns for you.
• We have:
$-2\sum_{i=1}^{n}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji})\big) = 0$
$-2\sum_{i=1}^{n} X_{1i}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji})\big) = 0$
…
$-2\sum_{i=1}^{n} X_{Ji}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji})\big) = 0$
• To compute $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$, we:
– draw $n$ units from the population, measure their $Y_i$s and $(X_{1i}, \dots, X_{Ji})$s
– set up the system, plugging in the values of the $Y_i$s and $(X_{1i}, \dots, X_{Ji})$s
– ask a computer to solve that system.
• That is what the "ls" command in E-views does, where the $Y_i$s are the values of the first variable after the "ls" command, and the $(X_{1i}, \dots, X_{Ji})$s are the values of the variables after "c".
26
Doing E-views' job once in our life.
• Gmail example. Assume we sample 4 emails ($n = 4$).
• For each, we measure $Y_i$: whether it is a spam, $X_{1i}$: whether it has the word "free" in it, and $X_{2i}$: whether it has the word "buy" in it.
• E.g.: the 1st email we sample is a spam, and has the words "free" and "buy" in it. The 2nd email is not a spam, and has "free" in it but not "buy", etc.
• If you regress $Y_i$ on a constant, $X_{1i}$, and $X_{2i}$, what is the value of $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$? Hint: you need to write the system of 3 equations and three unknowns solved by $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$, plug the values of $Y_i$, $X_{1i}$, and $X_{2i}$ given in the table into the system, and then solve the system. You have 4 minutes to find the answer. 27
Email $Y_i$ $X_{1i}$ $X_{2i}$
1 1 1 1
2 0 1 0
3 1 1 0
4 0 0 0
iClicker time
• If you regress $Y_i$ on a constant, $X_{1i}$, and $X_{2i}$, what will be the value of $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$?
a) $\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 0.5$, $\hat{\gamma}_2 = 0.5$
b) $\hat{\gamma}_0 = 0.5$, $\hat{\gamma}_1 = 0.5$, $\hat{\gamma}_2 = 0$
c) $\hat{\gamma}_0 = 0.5$, $\hat{\gamma}_1 = 0$, $\hat{\gamma}_2 = 0.5$
28
Email $Y_i$ $X_{1i}$ $X_{2i}$
1 1 1 1
2 0 1 0
3 1 1 0
4 0 0 0
$\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 0.5$, $\hat{\gamma}_2 = 0.5$
• $n = 4$ and $J = 2$, so we have (we can forget the −2):
$\sum_{i=1}^{4}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \hat{\gamma}_2 X_{2i})\big) = 0$
$\sum_{i=1}^{4} X_{1i}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \hat{\gamma}_2 X_{2i})\big) = 0$
$\sum_{i=1}^{4} X_{2i}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \hat{\gamma}_2 X_{2i})\big) = 0$
• Plugging in the values of $Y_i$, $X_{1i}$, and $X_{2i}$ in the table yields:
$2 - 4\hat{\gamma}_0 - 3\hat{\gamma}_1 - \hat{\gamma}_2 = 0$
$2 - 3\hat{\gamma}_0 - 3\hat{\gamma}_1 - \hat{\gamma}_2 = 0$
$1 - \hat{\gamma}_0 - \hat{\gamma}_1 - \hat{\gamma}_2 = 0$
• Subtracting equation 1 from equation 2 yields $\hat{\gamma}_0 = 0$.
• Plugging in $\hat{\gamma}_0 = 0$ yields a system of 2 equations & 2 unknowns:
$2 - 3\hat{\gamma}_1 - \hat{\gamma}_2 = 0$
$1 - \hat{\gamma}_1 - \hat{\gamma}_2 = 0$
• Subtracting equation 2 from 1 yields $1 - 2\hat{\gamma}_1 = 0$, which is equivalent to $\hat{\gamma}_1 = 0.5$.
• Plugging $\hat{\gamma}_1 = 0.5$ into $1 - \hat{\gamma}_1 - \hat{\gamma}_2 = 0$ yields $\hat{\gamma}_2 = 0.5$. 29
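As a quick check of this derivation, here is a short Python sketch (numpy assumed available; not part of the course software) that solves the same 3-equation system for the 4 emails in the table:

import numpy as np

# The 4 emails from the table: Y_i (spam), X_1i ("free"), X_2i ("buy").
Y = np.array([1.0, 0.0, 1.0, 0.0])
Z = np.array([[1, 1, 1],    # constant, X_1i, X_2i for email 1
              [1, 1, 0],    # email 2
              [1, 1, 0],    # email 3
              [1, 0, 0]])   # email 4

# Solve the 3 first-order conditions, written as (Z'Z) gamma_hat = Z'Y.
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
print(gamma_hat)  # [0.  0.5 0.5], as derived on the slide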
$(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$ converges towards $(\gamma_0, \gamma_1, \dots, \gamma_J)$
• One can show that, as for univariate regressions, the estimators of the multivariate regression coefficients converge towards the true multivariate regression coefficients (those for the full population):
$\lim_{n \to +\infty} (\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J) = (\gamma_0, \gamma_1, \dots, \gamma_J)$.
• Intuition: when the sample size becomes large, the sample becomes similar to the population.
30
Using the central limit theorem for the $\hat{\gamma}_j$s to construct tests and confidence intervals.
• The formula for the variance of multivariate regression coefficients is complicated. No need to know it: E-views computes it for you.
• If $n \geq 100$, $\frac{\hat{\gamma}_j - \gamma_j}{\sqrt{V(\hat{\gamma}_j)}}$ follows a normal distribution with mean 0 and variance 1.
• We can use this to test a null hypothesis. Often, we want to test $\gamma_j = 0$.
• If we want to have a 5% chance of wrongly rejecting $\gamma_j = 0$, the test is:
Reject $\gamma_j = 0$ if $\frac{\hat{\gamma}_j}{\sqrt{V(\hat{\gamma}_j)}} > 1.96$ or $\frac{\hat{\gamma}_j}{\sqrt{V(\hat{\gamma}_j)}} < -1.96$. Otherwise, do not reject $\gamma_j = 0$.
• We can also construct a 95% confidence interval for $\gamma_j$:
$\big[\hat{\gamma}_j - 1.96\sqrt{V(\hat{\gamma}_j)},\ \hat{\gamma}_j + 1.96\sqrt{V(\hat{\gamma}_j)}\big]$.
31
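To make the test and interval concrete, here is a minimal sketch with a made-up coefficient and standard error (both hypothetical numbers, not taken from any slide):

gamma_hat = 0.20            # hypothetical estimate of gamma_j
se = 0.07                   # hypothetical standard error, sqrt(V(gamma_hat))

t_stat = gamma_hat / se                      # compare to +/- 1.96
reject_null = abs(t_stat) > 1.96             # 5%-level test of gamma_j = 0
ci_95 = (gamma_hat - 1.96 * se, gamma_hat + 1.96 * se)
print(t_stat, reject_null, ci_95)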
Assessing the quality of our predictions: the $R^2$.
• To assess the quality of our predictions, we are going to use the same measure as with the OLS affine regression:
$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} \hat{e}_i^2}{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
• $R^2$ = 1 − MSE / sample variance of the $Y_i$s.
• As in the previous lectures, $R^2$ is included between 0 and 1.
32
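As a small illustrative sketch (numpy assumed available), the $R^2$ above computed from outcomes and predictions:

import numpy as np

def r_squared(Y, Y_hat):
    """1 - MSE / sample variance of the Y_i's, as defined above."""
    mse = np.mean((Y - Y_hat) ** 2)
    var_y = np.mean((Y - Y.mean()) ** 2)
    return 1 - mse / var_y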
What you need to remember
• The prediction for $y_k$ based on a multivariate regression is $\gamma_0 + \gamma_1 x_{1k} + \cdots + \gamma_J x_{Jk}$, with $(\gamma_0, \gamma_1, \dots, \gamma_J)$: value of $(c_0, c_1, \dots, c_J)$ minimizing $\sum_{k=1}^{N}\big(y_k - (c_0 + c_1 x_{1k} + \cdots + c_J x_{Jk})\big)^2$.
• We can estimate $(\gamma_0, \gamma_1, \dots, \gamma_J)$ if we measure the $y_k$s for a random sample of the population.
• For every $i$ between 1 and $n$, $(Y_i, X_{1i}, \dots, X_{Ji})$ = value of the dependent and independent variables of the $i$th unit we randomly select.
• To estimate $(\gamma_0, \gamma_1, \dots, \gamma_J)$, find the $(c_0, c_1, \dots, c_J)$ minimizing $\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_J X_{Ji})\big)^2$.
• Differentiating this function wrt $(c_0, c_1, \dots, c_J)$ yields a system of $J+1$ equations with $J+1$ unknowns.
• We solved the system in a simple example; you should know how to do that.
• We used the central limit theorem to propose a 5%-level test of $\gamma_j = 0$, and to derive a 95% confidence interval for $\gamma_j$.
33
Roadmap
1. The OLS multivariate regression function.
2. Estimating the OLS multivariate regression function.
3. Advantages and pitfalls of multivariate regressions.
4. Interpreting coefficients in multivariate OLS regressions.
34
Adding variables to a regression always improves the $R^2$
• Assume you regress a variable $Y_i$ on a constant and on a variable $X_{1i}$.
• Then, you regress $Y_i$ on a constant and on two variables $X_{1i}$ and $X_{2i}$.
• The $R^2$ of your second regression will be at least as high as the $R^2$ of the first regression.
• Adding variables to a regression always increases its $R^2$.
• => a regression with many variables gives better predictions for the $Y_i$s in the sample than a regression with few variables.
35
Example
• Sample of 4601 emails, for which you observe whether
they are a spam or not.
• You regress spam on a constant and a variable equal to the percentage of the words of the email that are the word "free".
• Eviews command: ls spam c word_freq_free.
• Is the $R^2$ of that regression low or high?
36
Dependent Variable: SPAM
Method: Least Squares
Date: 05/16/17 Time: 18:22
Sample: 1 4601
Included observations: 4601
Variable Coefficient Std. Error t-Statistic Prob.
C 0.372927 0.007555 49.35958 0.0000
WORD_FREQ_FREE 0.201984 0.023411 8.627873 0.0000
R-squared 0.015928 Mean dependent var 0.394045
Adjusted R-squared 0.015714 S.D. dependent var 0.488698
S.E. of regression 0.484843 Akaike info criterion 1.390450
Sum squared resid 1081.098 Schwarz criterion 1.393247
Log likelihood -3196.730 Hannan-Quinn criter. 1.391434
F-statistic 74.44020 Durbin-Watson stat 0.032029
Prob(F-statistic) 0.000000
Example
• You regress the spam variable on a constant, a variable equal to the % of words of the email that are the word "free", and a variable equal to the % of words of the email that are the word "money".
• Eviews command: ls spam c word_freq_free word_freq_money.
• The $R^2$ is higher in that regression than in the previous one. $R^2$ = 1 − average of squared prediction errors / variance of the spam variable. => a higher $R^2$ means a lower sum of squared prediction errors => better predictions.
37
Dependent Variable: SPAM
Method: Least Squares
Date: 05/16/17 Time: 18:23
Sample: 1 4601
Included observations: 4601
Variable Coefficient Std. Error t-Statistic Prob.
C 0.358449 0.007483 47.90281 0.0000
WORD_FREQ_FREE 0.141932 0.023370 6.073346 0.0000
WORD_FREQ_MONEY 0.220177 0.016122 13.65706 0.0000
R-squared 0.054291 Mean dependent var 0.394045
Adjusted R-squared 0.053879 S.D. dependent var 0.488698
S.E. of regression 0.475350 Akaike info criterion 1.351121
Sum squared resid 1038.953 Schwarz criterion 1.355317
Log likelihood -3105.255 Hannan-Quinn criter. 1.352598
F-statistic 131.9791 Durbin-Watson stat 0.100016
Prob(F-statistic) 0.000000
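The slides run these regressions with E-views' ls command. A rough Python equivalent is sketched below; the file name spam.csv is hypothetical (the slides do not say where the 4,601 emails are stored), and pandas/statsmodels are assumed available:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file with one row per email and the columns used above.
df = pd.read_csv("spam.csv")  # columns: spam, word_freq_free, word_freq_money

# Same regression as `ls spam c word_freq_free word_freq_money`;
# the formula API includes the constant by default.
fit = smf.ols("spam ~ word_freq_free + word_freq_money", data=df).fit()
print(fit.params)     # coefficients
print(fit.rsquared)   # R-squared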
Should we include all variables in the regression?
• Sometimes we have many potential variables we can include in our regression.
• E.g.: Gmail example. We could use whether the words "free", "buy", "money" appear in the email, the number of exclamation marks, etc. to predict whether the email is a spam.
• The previous slides suggest we should include as many variables as possible in the regression, to get the highest $R^2$.
• If we do this, we run into a problem called overfitting: we will make excellent predictions within the sample we use to run the regression (high $R^2$), but bad predictions when we use the regression to predict the dependent variables of units outside of our sample.
• Issue: we do not care about in-sample prediction: for the units in the sample, we already know their $Y_i$, no need to predict them. It's for the units not in the sample, for which we do not know the value of their dependent variable, that we want to make good predictions.
38
Introduction to overfitting, through an example
• Assume that in your data, you only have 3 emails.
• Assume also that for each email, you measure 3 variables:
– $Y_i$: whether the email is a spam
– $X_{1i}$: whether the minute when the email was sent is an odd number
– $X_{2i}$: whether the second when the email was sent is an odd number.
• $X_{1i}$ and $X_{2i}$ should be poor predictors of whether the email is a spam: no reason why spams are more likely to be sent on odd minutes/seconds.
• Assume that the values of $(Y_i, X_{1i}, X_{2i})$ are as in the table below.
• Find $(c_0, c_1, c_2)$ such that $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2 = 0$.
39
Email $Y_i$ $X_{1i}$ $X_{2i}$
1 1 1 1
2 0 1 0
3 0 0 0
iClicker time
• Assume that the values of $(Y_i, X_{1i}, X_{2i})$ are as in the table below.
• The $(c_0, c_1, c_2)$ such that $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2 = 0$ is:
a) $c_0 = 0$, $c_1 = 1$, $c_2 = 0$.
b) $c_0 = 0$, $c_1 = 0$, $c_2 = 1$.
c) $c_0 = 1$, $c_1 = 0$, $c_2 = 0$.
d) $c_0 = 0$, $c_1 = 1$, $c_2 = 1$. 40
Email $Y_i$ $X_{1i}$ $X_{2i}$
1 1 1 1
2 0 1 0
3 0 0 0
$c_0 = 0$, $c_1 = 0$, $c_2 = 1$.
• The $(c_0, c_1, c_2)$ such that $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2 = 0$ is the solution of this system:
$1 - (c_0 + c_1 + c_2) = 0$
$0 - (c_0 + c_1) = 0$
$0 - c_0 = 0$
• You can check that the solution is $c_0 = 0$, $c_1 = 0$, $c_2 = 1$.
• Now assume that in this example, you regress $Y_i$ on a constant, $X_{1i}$ and $X_{2i}$. Let $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$ respectively denote the coefficients of the constant, of $X_{1i}$, and of $X_{2i}$ in this regression. What will be the value of $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$? Discuss this question during 2 minutes with your neighbor.
41
Email $Y_i$ $X_{1i}$ $X_{2i}$
1 1 1 1
2 0 1 0
3 0 0 0
iClicker time
• The values of $(Y_i, X_{1i}, X_{2i})$ are as in the table below.
• You regress $Y_i$ on a constant, $X_{1i}$ and $X_{2i}$. $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$ denote the coefficients of the constant, of $X_{1i}$, and of $X_{2i}$ in this regression. What will be the value of $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$?
a) $\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 1$, $\hat{\gamma}_2 = 0$.
b) $\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 0$, $\hat{\gamma}_2 = 1$.
c) $\hat{\gamma}_0 = 1$, $\hat{\gamma}_1 = 0$, $\hat{\gamma}_2 = 0$.
d) $\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 1$, $\hat{\gamma}_2 = 1$.
42
Email $Y_i$ $X_{1i}$ $X_{2i}$
1 1 1 1
2 0 1 0
3 0 0 0
$\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 0$, $\hat{\gamma}_2 = 1$.
• $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$: minimizer of $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2$.
• For any $(c_0, c_1, c_2)$, $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2 \geq 0$.
• If for a $(c_0, c_1, c_2)$, $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2 = 0$, this $(c_0, c_1, c_2)$ is the one minimizing $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2$, so $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$ is equal to that $(c_0, c_1, c_2)$.
• $\sum_{i=1}^{3}\big(Y_i - (c_0 + c_1 X_{1i} + c_2 X_{2i})\big)^2 = 0$ if $c_0 = 0$, $c_1 = 0$, $c_2 = 1$.
• Therefore, $\hat{\gamma}_0 = 0$, $\hat{\gamma}_1 = 0$, $\hat{\gamma}_2 = 1$.
• Prediction function for whether an email is a spam: $0 + 0 \times X_{1i} + 1 \times X_{2i}$.
• => you predict that all emails sent on an odd second are spams, while all emails sent on an even second are not spams.
• This regression has an $R^2 = 1$: the regression predicts perfectly whether emails are spams, in your sample of 3 observations.
• Do you think that this regression will give good predictions, when you use it to make predictions for emails outside of the sample of 3 emails in the regression? 43
iClicker time
• => in this example, the prediction function for whether the email is a spam is $0 + 0 \times X_{1i} + 1 \times X_{2i}$.
• => you predict that all emails sent on an odd second are spams, while all emails sent on an even second are not spams.
• This regression has an $R^2 = 1$: the regression model predicts perfectly whether emails are spams, in your sample of 3 observations.
• Do you think that this regression will give good predictions, when you use it to make predictions for emails outside of the sample of 3 emails you use in the regression?
a) Yes
b) No
44
No!
• There is no reason why emails sent on odd
seconds would be more likely to be spams than
emails sent on even seconds.
• => when you use the regression to make
predictions for whether emails out of your
sample are spams or not, you will get very bad
predictions.
• This is despite the fact that in your sample, your regression yields perfect predictions: $R^2 = 1$.
• So what is going on?
45
$R^2$ of a reg. with as many variables as units = 1…
• You have $n$ observations, and for each observation you measure the dependent variable $Y_i$ and $n-1$ independent variables $X_{1i}$, …, $X_{n-1,i}$.
• You regress $Y_i$ on a constant, $X_{1i}$, …, $X_{n-1,i}$.
• $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_{n-1})$, the coefficients of the constant, $X_{1i}$, …, $X_{n-1,i}$: value of $(c_0, c_1, \dots, c_{n-1})$ minimizing $\sum_{i=1}^{n}\big(Y_i - (c_0 + c_1 X_{1i} + \cdots + c_{n-1} X_{n-1,i})\big)^2$.
• We can make each term in the summation = 0. Equivalent to solving:
$Y_1 - (c_0 + c_1 X_{11} + \cdots + c_{n-1} X_{n-1,1}) = 0$
$Y_2 - (c_0 + c_1 X_{12} + \cdots + c_{n-1} X_{n-1,2}) = 0$
…
$Y_n - (c_0 + c_1 X_{1n} + \cdots + c_{n-1} X_{n-1,n}) = 0$
• System of $n$ equations with $n$ unknowns => has a solution.
=> $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_{n-1})$ is the solution of this system, and
$\sum_{i=1}^{n}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_{n-1} X_{n-1,i})\big)^2 = 0$
=> the MSE in this regression = 0
=> $R^2 = 1$: $R^2$ = 1 − MSE / variance of the dependent variable.
46
… even if the independent variables in the regression are actually really bad predictors of $Y_i$
• The $R^2$ of a reg. with as many independent variables as units = 1.
• Mechanical property, which just comes from the fact that a system of $n$ equations with $n$ unknowns has a solution.
• True even if the independent variables are actually bad predictors of $Y_i$.
• E.g.: the previous example, where we regressed whether an email is a spam or not on stupid variables (whether it was sent on an odd second…), still had an $R^2$ of 1.
• $R^2 = 1$ means that the regression predicts $Y_i$ perfectly well in the sample, but it will probably yield bad predictions outside of the sample.
• Overfitting: we give ourselves so many parameters we can play with (the coefficients of all the variables in the regression) that we end up fitting the variable $Y_i$ perfectly in our sample, but we will make very large prediction errors outside of our sample.
47
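A short simulation (a sketch, with numpy assumed available) makes the mechanical point concrete: regress pure noise on as many independent variables (counting the constant) as observations, and the $R^2$ is 1:

import numpy as np

rng = np.random.default_rng(0)
n = 5
Y = rng.normal(size=n)                   # pure noise: nothing real to predict
X = rng.normal(size=(n, n - 1))          # n - 1 useless independent variables
Z = np.column_stack([np.ones(n), X])     # constant + variables: n columns in total

gamma_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
e_hat = Y - Z @ gamma_hat
r2 = 1 - np.mean(e_hat ** 2) / np.mean((Y - Y.mean()) ** 2)
print(r2)  # ~1.0: a perfect in-sample fit of pure noise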
Another example of overfitting
• Figure below: 11 units, with their values of a variable $X_{1i}$ and of a variable $Y_i$.
• Black line: the regression function you obtain when you regress $Y_i$ on a constant and $X_{1i}$.
• Blue line: the regression function you obtain when you regress $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, $X_{1i}^3$, $X_{1i}^4$, …, $X_{1i}^{10}$.
• Which of these two regressions will have the highest $R^2$?
48
iClicker time
• Which of these two regressions will have the highest $R^2$?
a) The regression of $Y_i$ on a constant and $X_{1i}$.
b) The regression of $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, $X_{1i}^3$, $X_{1i}^4$, …, $X_{1i}^{10}$.
49
Regression of $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, …, $X_{1i}^{10}$.
• The regression of $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, $X_{1i}^3$, $X_{1i}^4$, …, $X_{1i}^{10}$ has 11 observations and 11 independent variables (counting the constant). $R^2 = 1$. The blue line fits the black dots perfectly.
• The black line does not perfectly fit the black dots => the regression of $Y_i$ on a constant and $X_{1i}$ has $R^2 < 1$.
• The goal of a regression is to make predictions for the value of the dependent variable of units not in your sample, for which you observe the $x$s but not $y$.
• Assume that one of these units has $x = -4.5$. Do you think you will get a better prediction for the $y$ of that unit using the regression of $Y_i$ on a constant and $X_{1i}$, or the regression of $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, $X_{1i}^3$, $X_{1i}^4$, …, $X_{1i}^{10}$?
50
iClicker time
• The goal of a regression is to make a prediction for the value of the dependent variable of units not in the sample, for which you observe the $x$s but not $y$.
• Assume that one of these units has $x = -4.5$. Do you think you will get a better prediction for the $y$ of that unit using the regression of $Y_i$ on a constant and $X_{1i}$, or the regression of $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, $X_{1i}^3$, $X_{1i}^4$, …, $X_{1i}^{10}$?
a) We will get a better prediction using the regression of $Y_i$ on a constant and $X_{1i}$.
b) We will get a better prediction using the regression of $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, $X_{1i}^3$, $X_{1i}^4$, …, $X_{1i}^{10}$.
51
Better prediction using the reg. of $Y_i$ on a constant & $X_{1i}$
• Prediction of the $y$ of the unit with $x = -4.5$:
– according to the reg. of $Y_i$ on a constant, $X_{1i}$, $X_{1i}^2$, …, $X_{1i}^{10}$: 13.
– according to the reg. of $Y_i$ on a constant and $X_{1i}$: −12.
• In the sample, units with $x$ close to −4.5 have $y$ much closer to −12 than to 13 => the regression of $Y_i$ on a constant and $X_{1i}$ will give the better prediction.
• Again, a regression with many independent variables might give very good in-sample predictions but very bad out-of-sample predictions.
• But making good out-of-sample predictions is the goal of regression.
• => Comparing the $R^2$ of 2 regs. is not the right way to assess which will give the best out-of-sample predictions. A reg. with many variables always has a very high $R^2$ but might end up making poor out-of-sample predictions.
52
Instead, use a training and validation sample
53
• You start from a sample of $n$ units for which you measure $Y_i$, the dependent variable, and $X_{1i}$, …, $X_{Ji}$, the independent variables.
• Randomly divide the sample into two subsamples of $n/2$ units. Subsample 1: the training sample. Subsample 2: the validation sample.
• In the training sample, you estimate the regressions you are interested in.
• For instance, in the training sample:
– Regression 1: $Y_i$ on a constant and $X_{1i}$, …, $X_{Ji}$. Coefficients $(\hat{\gamma}_0, \hat{\gamma}_1, \dots, \hat{\gamma}_J)$.
– Regression 2: $Y_i$ on a constant and $X_{1i}$. Coefficients $(\hat{\beta}_0, \hat{\beta}_1)$.
• Then, compute the squared prediction error according to each regression for the units in the validation sample.
• For instance, for each unit in the validation sample, compute:
– $\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 X_{1i} + \cdots + \hat{\gamma}_J X_{Ji})\big)^2$: squared pred. error with Reg. 1.
– $\big(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i})\big)^2$: squared pred. error with Reg. 2.
• Finally, choose the regression for which the sum of squared prediction errors for the units in the validation sample is lowest.
• Intuition: you want to use the reg. that gives the best out-of-sample predictions. By choosing the reg. that gives the best predictions in the validation sample, you ensure that your regression will give good out-of-sample predictions, because you did not use the validation sample to compute your reg. coefficients.
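Here is a minimal Python sketch of this procedure (made-up data; numpy assumed available), comparing a many-variable regression and a one-variable regression by their validation-sample MSE:

import numpy as np

def fit_ols(Z, Y):
    # Least-squares coefficients of Y on the columns of Z.
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 5))
Y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)    # only X_1 truly matters

idx = rng.permutation(n)                        # random split of the sample
train, valid = idx[: n // 2], idx[n // 2:]

Z1 = np.column_stack([np.ones(n), X])           # Reg. 1: constant + all 5 variables
Z2 = np.column_stack([np.ones(n), X[:, 0]])     # Reg. 2: constant + X_1 only

for Z in (Z1, Z2):
    coef = fit_ols(Z[train], Y[train])                    # estimate in training sample
    mse = np.mean((Y[valid] - Z[valid] @ coef) ** 2)      # evaluate in validation sample
    print(mse)   # choose the regression with the smaller validation MSE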
Machine learning in 2 minutes
54
• Using a training and a validation sample = the key idea underlying machine learning methods (statistical methods more sophisticated than, but inspired by, multivariate regressions, which are used by tech companies to do image recognition, spam detection, etc.)
• Goal: teach a computer to recognize whether an email is a Spam,
whether a picture of a letter is an “a”, a “b”, etc.
• Train the computer in a sample of emails for which the computer
knows whether the email is a spam and many other variables (all the
words in the email, etc.).
• The computer finds the model that predicts the best whether the
email is a spam given all these variables, in the training sample.
• Then, check whether prediction model works well in validation
sample, where you also know which emails are spams or not.
• If the statistical model also works well in the validation sample,
implement method in real life to predict whether new emails
reaching Gmail accounts are spams or not. If email predicted to be
spam, send to junk box. Otherwise, send to regular mail box.
Machine learning often works, but not always
55
What you need to remember
• Great advantage of multivariate regression over univariate regression: it improves the quality of our predictions.
• However, putting too many variables in a regression might result in overfitting: the regression fits the $y$s very well in the sample, but gives poor out-of-sample predictions.
• For instance, a regression with as many independent variables as units will automatically have an $R^2 = 1$, even if those independent variables are actually poor predictors of the dependent variable.
• => comparing $R^2$s is not a good way to choose between several regs.
• Instead, you should:
– randomly divide the sample into a training and a validation sample
– estimate your regressions in the training sample only
– compute the squared prediction errors according to each regression in the validation sample
– choose the regression for which the MSE in the validation sample is smallest.
• The training / validation sample idea underlies the machine learning models used for spam detection / image recognition, etc. by tech companies.
56
Roadmap
1. The OLS multivariate regression function.
2. Estimating the OLS multivariate regression function.
3. Advantages and pitfalls of multivariate regressions.
4. Interpreting coefficients in multivariate OLS regressions.
57
Interpreting coeff. of multivariate regs. An example.
• 6 units ($n = 6$). 3 variables: $Y_i$, $D_i$, and $X_i$. $D_i$ and $X_i$: binary.
• If you regress $Y_i$ on a constant and $D_i$, what will be the coeff. of $D_i$? If you regress $Y_i$ on a constant, $D_i$, and $X_i$, what will be the coeff. of $D_i$? Hint: to answer the first question, you can use a result you saw during sessions. To answer the second question, write the system of 3 equations and three unknowns solved by $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2)$, the coefficients of the constant, $D_i$, and $X_i$, plug in the values of $Y_i$, $D_i$, and $X_i$ in the table, and then solve the system.
58
Unit $Y_i$ $D_i$ $X_i$
1 5 1 1
2 3 1 1
3 4 0 1
4 1 1 0
5 0 0 0
6 2 0 0
iClicker time
If you regress $Y_i$ on a constant and $D_i$, what will be the coeff. of $D_i$? If you regress $Y_i$ on a constant, $D_i$, and $X_i$, what will be the coeff. of $D_i$?
a) In the reg. of $Y_i$ on a constant and $D_i$, the coeff. of $D_i$ is 2. In the reg. of $Y_i$ on a constant, $D_i$, and $X_i$, the coeff. of $D_i$ is 0.5.
b) In the reg. of $Y_i$ on a constant and $D_i$, the coeff. of $D_i$ is 1. In the reg. of $Y_i$ on a constant, $D_i$, and $X_i$, the coeff. of $D_i$ is 0.5.
c) In the reg. of $Y_i$ on a constant and $D_i$, the coeff. of $D_i$ is 1. In the reg. of $Y_i$ on a constant, $D_i$, and $X_i$, the coeff. of $D_i$ is 0.
d) In the reg. of $Y_i$ on a constant and $D_i$, the coeff. of $D_i$ is 1. In the reg. of $Y_i$ on a constant, $D_i$, and $X_i$, the coeff. of $D_i$ is −0.5.
59
Unit $Y_i$ $D_i$ $X_i$
1 5 1 1
2 3 1 1
3 4 0 1
4 1 1 0
5 0 0 0
6 2 0 0
In the regression of $Y_i$ on a constant + $D_i$, the coeff of $D_i$ is 1.
In the regression of $Y_i$ on a constant, $D_i$, + $X_i$, the coeff of $D_i$ is 0.
• Coeff of $D_i$ in the reg. of $Y_i$ on a constant and $D_i$. Result of sessions: (Average $Y_i$ for $D_i = 1$) − (Average $Y_i$ for $D_i = 0$) = 1/3(5+3+1) − 1/3(4+2+0) = 1.
• Coeff of $D_i$ in the reg. of $Y_i$ on a constant, $D_i$, $X_i$: 3 eqs. with 3 unknowns.
• $n = 6$ and $J = 2$, so we have (we can forget the −2):
$\sum_{i=1}^{6}\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 D_i + \hat{\gamma}_2 X_i)\big) = 0$
$\sum_{i=1}^{6} D_i\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 D_i + \hat{\gamma}_2 X_i)\big) = 0$
$\sum_{i=1}^{6} X_i\big(Y_i - (\hat{\gamma}_0 + \hat{\gamma}_1 D_i + \hat{\gamma}_2 X_i)\big) = 0$
• Plugging in the values of $Y_i$, $D_i$, and $X_i$ yields:
$15 - 6\hat{\gamma}_0 - 3\hat{\gamma}_1 - 3\hat{\gamma}_2 = 0$
$9 - 3\hat{\gamma}_0 - 3\hat{\gamma}_1 - 2\hat{\gamma}_2 = 0$
$12 - 3\hat{\gamma}_0 - 2\hat{\gamma}_1 - 3\hat{\gamma}_2 = 0$
• Subtracting eq. 2 from eq. 3: $3 + \hat{\gamma}_1 - \hat{\gamma}_2 = 0$
• Multiplying eq. 3 by 2 and subtracting eq. 1: $9 - \hat{\gamma}_1 - 3\hat{\gamma}_2 = 0$
• Adding the two preceding equations: $12 - 4\hat{\gamma}_2 = 0$, so $\hat{\gamma}_2 = 3$.
• Plugging $\hat{\gamma}_2 = 3$ into $3 + \hat{\gamma}_1 - \hat{\gamma}_2 = 0$: $\hat{\gamma}_1 = 0$. 60
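As a check of this computation, here is a short numpy sketch (an illustration, not course software) that solves the same 3-equation system for the 6 units in the table:

import numpy as np

Y = np.array([5.0, 3.0, 4.0, 1.0, 0.0, 2.0])
D = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
X = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

Z = np.column_stack([np.ones(6), D, X])        # constant, D_i, X_i
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)  # the 3 first-order conditions
print(gamma_hat)  # [1. 0. 3.]: the coefficient of D_i is 0, as derived above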
A general formula for the coefficient of a binary variable in a regression of $Y_i$ on a constant and 2 binary variables.
• Let $D_i$ and $X_i$ be 2 binary variables.
• $n_{00}$: number of units with $D_i = 0$, $X_i = 0$. $n_{10}$: number of units with $D_i = 1$, $X_i = 0$. $n_{01}$: number of units with $D_i = 0$, $X_i = 1$. $n_{11}$: number of units with $D_i = 1$, $X_i = 1$.
• The coeff of $D_i$ in the regression of $Y_i$ on a constant, $D_i$, $X_i$ is:
$w\Big(\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i\Big) + (1-w)\Big(\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i\Big)$
$w$: a number included between 0 and 1, no need to know its formula.
• $\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i$: difference between the average $Y_i$ of units with $D_i = 1$ and of units with $D_i = 0$, among units with $X_i = 0$.
• $\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i$: difference between the average $Y_i$ of units with $D_i = 1$ and of units with $D_i = 0$, among units with $X_i = 1$.
• The coeff of $D_i$ measures the difference between the average of $Y_i$ across subgroups whose $D_i$ differs by one, but that have the same $X_i$! 61
Applying the formula in the example.
• Sample with 6 units. 3 variables: $Y_i$, $D_i$, $X_i$. $D_i$, $X_i$: binary variables.
• What is the value of $\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i$? Of $\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i$?
62
Unit $Y_i$ $D_i$ $X_i$
1 5 1 1
2 3 1 1
3 4 0 1
4 1 1 0
5 0 0 0
6 2 0 0
iClicker time
a) $\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i = 1$ and $\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i = -1$
b) $\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i = 0$ and $\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i = 0$
c) $\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i = -1$ and $\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i = 1$
63
Unit $Y_i$ $D_i$ $X_i$
1 5 1 1
2 3 1 1
3 4 0 1
4 1 1 0
5 0 0 0
6 2 0 0
$\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i = 0$, and $\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i = 0$
• $\frac{1}{n_{10}}\sum_{i: D_i=1, X_i=0} Y_i - \frac{1}{n_{00}}\sum_{i: D_i=0, X_i=0} Y_i = 1 - \frac{1}{2}(0 + 2) = 0$
• $\frac{1}{n_{11}}\sum_{i: D_i=1, X_i=1} Y_i - \frac{1}{n_{01}}\sum_{i: D_i=0, X_i=1} Y_i = \frac{1}{2}(5 + 3) - 4 = 0$
• The coeff of $D_i$ in the regression of $Y_i$ on a constant, $D_i$, $X_i$ is a weighted average of these two numbers, so that's why it's equal to 0, as we have shown earlier. 64
Unit $Y_i$ $D_i$ $X_i$
1 5 1 1
2 3 1 1
3 4 0 1
4 1 1 0
5 0 0 0
6 2 0 0
Interpreting coefficients in multivariate regressions.
• Previous slides: in the reg. of $Y_i$ on a constant, $D_i$, and $X_i$, where $D_i$ and $X_i$ are binary, the coeff of $D_i$ = the difference between the average of $Y_i$ across groups whose $D_i$ differs by one, but that have the same $X_i$.
• This extends to all multivariate regressions.
• In a multivariate regression of $Y_i$ on a constant, $D_i$, $X_{1i}$, …, $X_{Ji}$, $\hat{\gamma}_1$, the coeff. of $D_i$, measures the difference between the average of $Y_i$ across subgroups whose $D_i$ differs by one, but that have the same $X_{1i}$, …, $X_{Ji}$.
• If $\hat{\gamma}_1 = x$, that means that if you compare the average $Y_i$ across units whose $D_i$ differs by one but that have the same value of $X_{1i}$, …, $X_{Ji}$, the average of $Y_i$ is $x$ larger (if $x > 0$) / smaller (if $x < 0$) among units whose $D_i$ is 1 unit larger.
• In a multivariate regression of $\ln(Y_i)$ on a constant, $D_i$, $X_{1i}$, …, $X_{Ji}$, if $\hat{\gamma}_1 = x$, that means that if you compare the average $Y_i$ across units whose $D_i$ differs by one but that have the same value of $X_{1i}$, …, $X_{Ji}$, the average of $Y_i$ is $x$% larger (if $x > 0$) / smaller (if $x < 0$) among units whose $D_i$ is 1 unit larger. 65
Women earn less than men
• Same representative sample of 14,086 US wage earners as in Homework 3.
• Regression of ln(weekly wage) on a constant and a binary variable equal to 1 for females, in Stata.
• Women earn 32% less than men, and the difference is very significant.
• From that regression, can we conclude that women are discriminated against in the labor market? Why? 66
. reg ln_weekly_wage female, r

Linear regression                    Number of obs = 14,086
                                     F(1, 14084)   = 516.54
                                     Prob > F      = 0.0000
                                     R-squared     = 0.0354
                                     Root MSE      = .84461

                          Robust
ln_weekly_~e     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
female       -.3235403   .0142357   -22.73   0.000   -.3514442   -.2956365
_cons         6.642133   .0099315   668.80   0.000    6.622666      6.6616
iClicker time
• Women earn 32% less than men, and the difference is very significant.
• Can we conclude that women are discriminated against in the labor market? Why?
a) Yes, we can conclude that women are discriminated against in the labor market; this 32% difference in wages must reflect discrimination.
b) No, we cannot conclude that women are discriminated against in the labor market, because the $R^2$ of the regression is too low.
c) No, we cannot conclude that women are discriminated against in the labor market. Maybe women earn less than men for reasons that have nothing to do with their gender.
67
Maybe women earn less for reasons that have nothing to do with their gender.
• Women earn less than men.
• But that difference could, for instance, come from the fact that they work fewer hours per week outside of the home.
• Maybe women are not discriminated against by their employer; maybe they just work fewer hours for their employer => get paid less.
• (Aside: women indeed tend to work fewer hours a week outside of the home than men, but that may be because they also tend to spend more time taking care of children in households with children, another form of gender imbalance, though that imbalance is taking place in the family, not in the labor market.)
68
A more complicated regression
• Regression of ln(weekly wage) on constant, variable for
females + years of schooling, age, hours worked per week.
• Interpret coeff. of female variable in that regression.
69
. reg ln_weekly_wage female age hours_worked years_schooling, r

Linear regression                    Number of obs = 14,086
                                     F(4, 14081)   = 1449.64
                                     Prob > F      = 0.0000
                                     R-squared     = 0.3883
                                     Root MSE      = .67267

                              Robust
ln_weekly_wage       Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
female           -.2731097   .0117557   -23.23   0.000   -.2961524    -.250067
age               .0104635   .0004288    24.40   0.000    .0096231     .011304
hours_worked      .0231192   .0005812    39.78   0.000      .02198    .0242583
years_schooling   .1024575    .002269    45.15   0.000    .0980099    .1069052
_cons             3.931402   .0398147    98.74   0.000     3.85336    4.009444
iClicker time
• Interpret coeff. of female variable reg. on previous slide.
a) On average, women earn 0.27 dollars less than men per week.
b) When we compare women and men that have the same
number of years of schooling, the same age, and that work the
same number of hours per week, we find that on average,
women earn 0.27 dollars less than men per week.
c) When we compare women and men that have the same
number of years of schooling, the same age, and that work the
same number of hours per week, we find that on average
women earn 27% less than men per week.
70
Answer c) !
• Remember: in a multivariate reg. of $Y_i$ on a constant, $D_i$, $X_{1i}$, …, $X_{Ji}$, if $\hat{\gamma}_1 = x$, it means that if you compare the average $Y_i$ across units whose $D_i$ differs by one but that have the same value of $X_{1i}$, …, $X_{Ji}$, the average of $Y_i$ is $x$ larger (if $x > 0$) / smaller (if $x < 0$) among units whose $D_i$ is 1 unit larger.
• Here: $D_i$ is the female variable. Females have $D_i = 1$, males have $D_i = 0$.
• The other variables in the regression are years of schooling, age, and number of hours worked / week.
• => $\hat{\gamma}_1 = -0.27$ means that when we compare women and men that have the same number of years of schooling, the same age, and that work the same number of hours per week, we find that on average women earn 27% less than men per week.
71
The complicated reg. is stronger, though still imperfect, evidence of gender discrimination on the labor market.
• The difference between men's and women's earnings cannot be explained by differences in education, hours worked per week, and professional experience.
• Even when we compare men and women with the same education, hours worked per week, and professional experience, women earn substantially less (27%).
• This is still not definitive evidence of discrimination. Maybe women tend to go into lower-paying jobs and industries than men.
• E.g.: fewer women in finance and engineering.
• But is this because women do not like those types of jobs (if so, no discrimination), or is it because those industries do not want to hire women (if so, discrimination), or because women would like to go into those jobs but do not do so because it is frowned upon due to social norms (if so, discrimination)?
• Overall, even though there are limits even with the complicated regression, the fact that women earn less even when we compare men and women with the same education, hours worked per week, and professional experience suggests that women are discriminated against on the labor market.
72
What is econometrics?
• Econometrics is a set of statistical techniques that we can use to
study economic questions empirically.
• The tools we use in econometrics are statistical techniques, which is
why the beginning of an intro to econometrics class looks more like
a stats class than an econ class: before we can apply the statistical
tools to study economics question, we need to master the tools!
• Why do we want to study economic questions empirically? Isn’t
economic theory enough?
• The issue with economic theory is that on a number of issues,
different theories lead to different conclusions.
• E.g.: neo‐classical economist will tell you that increasing minimum
wage will reduce employment, while a neo‐Keynesian will tell you
that increasing minimum wage will increase employment.
• Conflicting theories => we need to study these questions
empirically (with data) to say which theory is true.
• The wage regressions in Homework 3 and in these slides are a first example of how to use statistical tools to study an economic question, "are women discriminated against on the labor market?", empirically (with data).
• Other examples coming in the next slides.
73
What you need to remember
• In a multivariate regression of $Y_i$ on a constant, $D_i$, $X_{1i}$, …, $X_{Ji}$, if $\hat{\gamma}_1$, the coeff of $D_i$, is equal to $x$, it means that if you compare the average $Y_i$ across units whose $D_i$ differs by one but that have the same value of $X_{1i}$, …, $X_{Ji}$, the average of $Y_i$ is $x$ larger (if $x > 0$) / smaller (if $x < 0$) among units whose $D_i$ is 1 unit larger.
• In a multivariate regression of $\ln(Y_i)$ on a constant, $D_i$, $X_{1i}$, …, $X_{Ji}$, if $\hat{\gamma}_1$, the coeff of $D_i$, is equal to $x$, it means that if you compare the average $Y_i$ across units whose $D_i$ differs by one but that have the same value of $X_{1i}$, …, $X_{Ji}$, the average of $Y_i$ is $x$% larger (if $x > 0$) / smaller (if $x < 0$) among units whose $D_i$ is 1 unit larger.
74
Take-home examination for Econ 140A, Fall 2020
Your answers should be submitted in a single pdf document on GauchoSpace.
You may either type or handwrite your answers, or some combination if you like.
You can take pictures of your handwritten responses, then include them in the single
pdf document that you submit. Regardless of what you decide, it is important that your
answers are as clear as possible. Your answers should appear in the order in which the
questions are asked. Please review your answers before submitting them to confirm that
they are easily readable.
Be sure that you explicitly answer each question and explain each step, as if you
were writing solutions so that another student in the class would be able to follow your
thoughts. Part of your grade will depend on explaining each step of your answers.
1. Suppose you want to know about wage discrimination by gender in Santa Barbara.
The goal of this problem is to understand regression with a dummy variable.
(a) Suppose you run the following regression:
wage_i = β + ε_i
where wage_i is the monthly wage of individual i, and ε_i is an error term. Show the
objective function to get β and derive β in terms of wage_i from the objective
function. Interpret β based on the derivation. (3pt)
(b) In order to investigate wage discrimination in Santa Barbara, one of your
friends suggests to run
wage_i = α_f · female_i + α_m · male_i + ε_i, (1)
where female_i = 1 if i is female and 0 if i is male, and male_i = 1 if i is male
and 0 if i is female.
Show the objective function to get α_f and α_m and derive them. Finally,
interpret α_f and α_m in two sentences. (4pt)
(c) Suppose you estimate α_f and α_m with a sample of 1000 people in this city.
The result of these estimates is the following:
wage_i = 3700 · female_i + 5300 · male_i + ε̂_i (2)
          (542)             (1024)
The numbers in brackets are the standard errors for α̂_f and α̂_m. Derive the
95% confidence interval of the monthly wage of males. (2pt)
(d) To figure out the wage discrimination between males and females, one of your
friends suggests to run
wage_i = γ_1 + γ_2 · female_i + ε_i. (3)
If you run regression (3) with the same sample as in (c), what will γ̂_1 and γ̂_2 be?
In order to test the hypothesis that there is no wage discrimination between
males and females, which coefficient would you use to test this hypothesis,
γ̂_1 or γ̂_2? Set up the null hypothesis for this test explicitly. (3pt)
2. You realize that it is insufficient to compare the wage differential between the two
genders to know about wage discrimination because this wage differential could be
from an educational gap between males and females. So, you decide to obtain data
of years of education from all workers in addition to wage. The goal of this problem
is to understand regressions with an interaction term.
(a) Consider
wage_i = β_0 + β_1 · edu_i + ε_i, (4)
where edu_i is the years of education of worker i. Show the objective function to
obtain β_0 and β_1. Derive the first order conditions. And, finally, interpret β_0
and β_1 in (4) in two sentences. (3pt)
(You don't need to derive β_0 and β_1 explicitly.)
(b) One of your friends suggests that you need to run the following regression to
learn the gender wage gap controlling for education:
wage_i = α_f + α_m · male_i + β_f · edu_i + β_m · male_i · edu_i + ε_i (5)
where male_i = 1 if i is male and 0 otherwise. Since you do not know the
meaning of each coefficient, you decide to derive each coefficient in terms of
wage_i, male_i and edu_i. What is the objective function to get α_f, α_m, β_f, and
β_m? Derive the first order conditions for α_f, α_m, β_f, and β_m respectively. (3pt)
(You don't need to derive α_f, α_m, β_f and β_m explicitly.)
(c) From the first order conditions in (b), show that α_f and β_f depend only on
the wage_i and edu_i of the female population. Are these conditions the same as the
first order conditions of the regression y_i = α_f + β_f · edu_i + ε_i where individual
i belongs to the female population? (3pt)
Hint: $\sum_{i=1}^{n} x_i^2 = \sum_{i:\text{female}} x_i^2 + \sum_{i:\text{male}} x_i^2$.
(d) For now let γ_0 = α_f + α_m and γ_1 = β_f + β_m. From the first order conditions in
(b), show that γ_0 and γ_1 depend only on the wage_i and edu_i of the male population.
Are these conditions the same as the first order conditions of the regression
y_i = γ_0 + γ_1 · edu_i + ε_i where individual i belongs to the male population? (3pt)
Hint: Use the first order conditions with respect to α_m and β_m.
(e) Based on (c) and (d), interpret α_f, β_f, γ_0, and γ_1 in one sentence each.
Given these interpretations, interpret α_m and β_m in one sentence each.
(3pt) Hint: We defined γ_0 = α_f + α_m and γ_1 = β_f + β_m in (d). Then,
α_m = γ_0 − α_f, and β_m = γ_1 − β_f.
(f) From a sample of 1,000 individuals, you estimate (5) and the result is the
following:
wage_i = 1600 + 400 · male_i + 382 · edu_i + 132 · male_i · edu_i,
         (542)  (172)          (99)          (49)
where the numbers in brackets are the standard errors for each coefficient.
You want to test the hypothesis that there is no gender gap in the returns to
education. Write down the null hypothesis for this test, and state whether you can
reject this null hypothesis at the 95% level or not. (3pt)
Ordinary least squares regression I:
The univariate linear regression.
Clement de Chaisemartin and Doug Steigerwald
UCSB
1
Traders make predictions
• Traders, say oil traders, speculate on the price of oil.
• When they think the price of oil will go up, they buy oil.
• When they think the price will go down, they sell oil.
• To inform their buying / selling decisions, they need to
predict whether the price will go up or down.
• To make their predictions, they can use the state of the
economy today. E.g.: if world GDP is growing fast today,
the price of oil should increase tomorrow.
• => traders need to use variables available to them to
make predictions on a variable they do not observe:
the price of oil tomorrow.
2
Banks make predictions
• When someone applies for a loan, the bank needs to decide:
– Whether they should give the loan to that person.
– And if so, which interest rate they should charge that person.
• To answer these questions, the bank needs to predict the amount
of the loan that this person will fail to reimburse. They will charge
high interest rate to people who are predicted to fail to reimburse a
large amount.
• To do so, they can use all the variables contained in application:
gender, age, income, ZIP code…
• Can also use credit score of that person: FICO score, created by
FICO company. All banks in US share information about their
customers with FICO. Therefore, for each person FICO knows: total
amount of debt, history of loans repayment… People with lots of
debt and who often defaulted on their loan in the past get a low
score, while people with little debt and no default get high score.
• Here as well, banks try to predict a variable they do not observe
(amount of the loan the person will fail to reimburse) using
variables that they observe (the variables in her application + FICO).
3
Tech companies make predictions
• A reason why people prefer Gmail over other mailboxes is that
Gmail is better than many mailboxes at sending directly spam
emails into your trash box.
• They could ask a human to read the email and say whether it’s a
Spam or not. But that would be very costly and slow!
• Automated process: when a new email reaches your mailbox, Gmail
needs to decide whether it should go into your trash because it’s a
Spam, or whether it should go into your regular mailbox.
• To do so, the computer can extract a number of variables from that
email: number of words, email address of the sender, the specific
words used in the email and how many times they occur…
• Based on these variables, it can try to predict whether the email is a
real email or a spam.
• Here as well, Gmail tries to predict a variable they do not observe
(whether that email is a Spam or not) using variables that they
observe (number of words, email address of the sender, the specific
words used in the email…).
4
Using variables we observe to make predictions
on variables we do not observe.
• Many real world problems can be cast as using
variables we observe to make predictions on
variables we do not observe:
– either because they will be realized in the future
(e.g.: the amount that someone applying today for a
one year to loan will fail to reimburse will only be
known in one year from now)
– or because observing them would be too costly
(e.g.: assessing whether all the emails reaching all
Gmail accounts everyday are spams or not).
5
We will study a variety of models one can use to
make predictions.
• In all the following lectures, we are going to study
how we can construct statistical models to make
predictions.
• We will start by studying the simplest prediction
model: the ordinary least squares (OLS) univariate
linear regression.
6
Roadmap
1. The OLS univariate linear regression function.
2. Estimating the OLS univariate linear regression function.
3. OLS univariate linear regression in practice.
7
Set up and notation.
• We consider a population of 𝑁 units.
– 𝑁 could be number of people who apply for a one‐year loan with bank
A during April 2018.
– Or 𝑁 could be number of emails reaching all Gmail accounts in April
2018.
• Each unit 𝑘 has a variable 𝑦௞ attached to it that we do not observe.
We call this variable the dependent variable.
– In the loan example, 𝑦௞ is a variable equal to the amount of her loan
applicant 𝑘 will fail to reimburse when her loan expires in April 2019.
– In email example, 𝑦௞ is equal to 1 if email 𝑘 is a spam and 0 otherwise.
• Each unit $k$ also has 1 variable $x_k$ attached to it that we do observe.
We call this variable the independent variable.
– In the loan example, 𝑥௞ could be the FICO score of applicant 𝑘.
– In the email example, 𝑥௞ could be a variable equal to 1 if the word
“free” appears in the email.
8
Are units with different values of $x_k$ likely to have the same value of $y_k$?
• Based on the value of $x_k$ of each unit, we want to predict her $y_k$.
• E.g.: in the loan example, we want to predict the amount that unit $k$ will fail to reimburse based on her FICO score.
• Assume that applicant 1 has a very high (good) credit score, while applicant 2 has a very low (bad) credit score.
• Do you think that applicants 1 and 2 will fail to reimburse the same amount on their loan?
9
No!
• Based on the value of $x_k$ of each unit, we want to predict her $y_k$.
• E.g.: in the loan example, we want to predict the amount that unit $k$ will default on her loan based on her FICO score.
• Assume that applicant 1 has a very high (good) credit score, while applicant 2 has a very low (bad) credit score.
• Do you think that applicants 1 and 2 will fail to reimburse the same amount on their loan?
• No, applicant 2 is likely to fail to reimburse a larger amount than applicant 1.
• Should you predict the same value of $y_k$ for applicants 1 and 2?
10
No! Your prediction should be a function of $x_k$
• Based on the value of $x_k$ of each unit, we want to predict her $y_k$.
• E.g.: in the loan example, we want to predict the amount that unit $k$ will default on her loan based on her FICO score.
• Assume that applicant 1 has a very high (good) credit score, while applicant 2 has a very low (bad) credit score.
• Should you predict the same value of $y_k$ for applicants 1 and 2?
• No! If you want your prediction to be accurate, you should predict a higher value of $y_k$ for applicant 2 than for applicant 1.
• Your prediction should be a function of $x_k$, $f(x_k)$.
• In these lectures, we focus on predictions which are a linear function of $x_k$: $f(x_k) = a x_k$, for some real number $a$.
• Which measure can you use to assess whether $a x_k$ is a good prediction of $y_k$? Discuss this question with your neighbor for 1 minute.
11
iClicker time
• To assess whether $a x_k$ is a good prediction of $y_k$, we should use:
a) $y_k - a x_k$
b) $y_k + a x_k$
12
$y_k - a x_k$!
• Based on the value of $x_k$ of each unit, we want to predict her $y_k$.
• Our prediction should be a function of $x_k$, $f(x_k)$. We focus on predictions which are a linear function of $x_k$: $f(x_k) = a x_k$, for some real number $a$.
• Which measure can you use to assess whether $a x_k$ is a good prediction?
• $y_k - a x_k$, the difference between your prediction and $y_k$.
• In the loan example, if $y_k - a x_k$ is large and positive, our prediction is much below the amount applicant $k$ will fail to reimburse.
• If $y_k - a x_k$ is large and negative, our prediction is much above the amount person $k$ will fail to reimburse.
• Large positive or negative values of $y_k - a x_k$ mean a bad prediction.
• $y_k - a x_k$ close to 0 means a good prediction.
13
iClicker time
• Which of the following 3 possible values of $a$ should we choose to ensure that $a x_k$ predicts $y_k$ well in the population?
a) The value of $a$ that maximizes $\sum_{k=1}^{N}(y_k - a x_k)$.
b) The value of $a$ that minimizes $\sum_{k=1}^{N}(y_k - a x_k)$.
c) The value of $a$ that minimizes $\sum_{k=1}^{N}(y_k - a x_k)^2$.
14
Minimizing $\sum_{k=1}^{N}(y_k - a x_k)$ won't work!
• Minimizing $\sum_{k=1}^{N}(y_k - a x_k)$ means we try to avoid positive prediction errors, but we also try to make the largest possible negative prediction errors!
• Not a good idea: we will systematically overestimate $y_k$.
• We want a criterion that deals symmetrically with positive and negative errors: we want to avoid both positive and negative errors.
15
Answer: find the value of $a$ that minimizes $\sum_{k=1}^{N}(y_k - a x_k)^2$
• $\sum_{k=1}^{N}(y_k - a x_k)^2$ is positive. => minimizing it = the same thing as making it as close to 0 as possible.
• If $\sum_{k=1}^{N}(y_k - a x_k)^2$ is as close to 0 as possible, it means that the sum of the squared values of our prediction errors is as small as possible.
• => we make small errors. That's good, that's what we want!
16
Which prediction function is the best?
• The population has 11 units. The $x_k$ and $y_k$ of those 11 units are shown on the graph: blue dots.
• Two linear prediction functions for $y_k$: $0.5x_k$ and $0.8x_k$.
• Which one is the best? Discuss this 1mn with your neighbour.
17
iClicker time
• On the previous slide, which function of $x_k$ gives the best prediction for $y_k$:
a) $0.5x_k$
b) $0.8x_k$
18
$0.5x_k$ is the best prediction function!
• It is the function for which the sum of the squared prediction errors is the smallest.
19
[Figure: prediction error of $0.5x$ for the person with $x = 6$; prediction error of $0.8x$ for the person with $x = 8$]
The OLS univariate linear regression function in the population.
• Let
$\alpha = \mathrm{argmin}_{a \in \mathbb{R}} \sum_{k=1}^{N}(y_k - a x_k)^2$
• We call $\alpha x_k$ the ordinary least squares (OLS) univariate linear regression function of $y_k$ on $x_k$ in the population.
• Least squares: because $\alpha x_k$ minimizes the sum of the squared differences between $y_k$ and $a x_k$.
• Ordinary: because there are fancier ways of doing least squares.
• Univariate: because there is only one independent variable in the regression, $x_k$.
• Linear: because the regression function is a linear function of $x_k$.
• In the population: because we use the $y_k$s and $x_k$s of all the $N$ units in the population.
• Shortcut: OLS regression of $y_k$ on $x_k$ in the population.
20
Decomposing $y_k$ between predicted value and residual.
• $\alpha$: coefficient of $x_k$ in the OLS regression of $y_k$ on $x_k$ in the population.
• Let $\tilde{y}_k = \alpha x_k$. $\tilde{y}_k$ is the predicted value for $y_k$ according to the OLS regression of $y_k$ on $x_k$ in the population.
• Let $e_k = y_k - \tilde{y}_k$. $e_k$: error we make when we use the OLS regression in the population to predict $y_k$.
• We have $y_k = \tilde{y}_k + e_k$.
$y_k$ = predicted value + error.
21
[Figure: the data and the regression line $0.5x$; predicted value of $y_k$ for the person with $x = 8$; prediction error for the person with $x = 8$]
Finding a formula for $\alpha$ when $N = 2$.
• Assume for a minute that $N = 2$: there are only two units in the population.
• Then $\alpha$ is the value of $a$ that minimizes $(y_1 - a x_1)^2 + (y_2 - a x_2)^2$.
• Find a formula for $\alpha$, as a function of $x_1$, $y_1$, $x_2$, and $y_2$. You have 3 minutes to try to find the answer. Hint: you need to compute the derivative of $(y_1 - a x_1)^2 + (y_2 - a x_2)^2$ with respect to $a$, and then $\alpha$ is the value of $a$ for which that derivative is equal to 0.
22
iClicker time
• If $N = 2$, $\alpha$ is equal to:
a) $\frac{x_1 y_1 + x_2 y_2}{x_1 + x_2}$
b) $\frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$
c) $\frac{x_1^2 y_1 + x_2^2 y_2}{x_1 + x_2}$
23
When $N = 2$, $\alpha = \frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$.
• If $N = 2$, $\alpha$ is the value of $a$ that minimizes $(y_1 - a x_1)^2 + (y_2 - a x_2)^2$.
• The derivative of that function wrt $a$ is:
$-2x_1(y_1 - a x_1) - 2x_2(y_2 - a x_2)$.
• Let's find the value of $a$ for which the derivative = 0.
$-2x_1(y_1 - a x_1) - 2x_2(y_2 - a x_2) = 0$
iff $-2x_1 y_1 + 2a x_1^2 - 2x_2 y_2 + 2a x_2^2 = 0$
iff $2a(x_1^2 + x_2^2) = 2(x_1 y_1 + x_2 y_2)$
iff $a = \frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$.
• The second line of the derivation shows that the derivative is increasing in $a$. => if $a < \frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$, the derivative is negative. If $a > \frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$, the derivative is positive.
• The function reaches its minimum at $\frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$. => $\alpha = \frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$.
24
Reminder: P4Sum
• P4Sum: let $f_1(a)$, $f_2(a)$, …, $f_N(a)$ be functions of $a$ which are all differentiable wrt $a$. Let $f_1'(a)$, $f_2'(a)$, …, $f_N'(a)$ denote their derivatives. Then $\sum_{k=1}^{N} f_k(a)$ is differentiable wrt $a$, and its derivative is $\sum_{k=1}^{N} f_k'(a)$.
• In words: the derivative of a sum is the sum of the derivatives.
25
Finding a formula for $\alpha$ for any value of $N$.
• Let's get back to the general case where $N$ is left unspecified.
• Remember, $\alpha$ is the value of $a$ that minimizes $\sum_{k=1}^{N}(y_k - a x_k)^2$.
• Find a formula for $\alpha$, as a function of $x_1$, …, $x_N$ and $y_1$, …, $y_N$. You have 3 minutes to find the answer. Hint: you need to compute the derivative of $\sum_{k=1}^{N}(y_k - a x_k)^2$ with respect to $a$, and then $\alpha$ is the value of $a$ for which that derivative is equal to 0. 26
iClicker time
• $\alpha$ is equal to:
a) $\frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k}$
b) $\frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$
c) $\frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$
27
$\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$.
• $\alpha$ minimizes $\sum_{k=1}^{N}(y_k - a x_k)^2$. The derivative wrt $a$ is: $\sum_{k=1}^{N}\big[-2x_k(y_k - a x_k)\big]$. Why?
• Let's find the value of $a$ for which the derivative = 0.
$\sum_{k=1}^{N} -2x_k(y_k - a x_k) = 0$
iff $\sum_{k=1}^{N}(-2x_k y_k + 2a x_k^2) = 0$
iff $\sum_{k=1}^{N} -2x_k y_k + \sum_{k=1}^{N} 2a x_k^2 = 0$
iff $-2\sum_{k=1}^{N} x_k y_k + 2a\sum_{k=1}^{N} x_k^2 = 0$
iff $2a\sum_{k=1}^{N} x_k^2 = 2\sum_{k=1}^{N} x_k y_k$
iff $a = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$.
• The function reaches its minimum at $\frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$. => $\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$. 28
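As a tiny illustration of the formula (made-up numbers; numpy assumed available):

import numpy as np

# Made-up population data, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 1.8, 2.9, 4.2])

alpha = np.sum(x * y) / np.sum(x ** 2)   # alpha = sum of x_k y_k / sum of x_k^2
print(alpha)          # the OLS coefficient
print(alpha * x)      # the predicted values alpha * x_k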
What you need to remember
• Population of $N$ units. Each unit $k$ has 2 variables attached to it: $y_k$ is a variable we do not observe, $x_k$ is a variable we observe.
• We want to predict the $y_k$ of each unit based on her $x_k$.
• E.g.: a bank wants to predict the amount an applicant will fail to reimburse on her loan based on her FICO score.
• Our prediction should be a function of $x_k$, $f(x_k)$.
• For now, focus on linear functions of $x_k$: $a x_k$ for some number $a$.
• A good prediction should be such that $y_k - a x_k$, the difference between the prediction and $y_k$, is as small as possible for most units.
• The best value of $a$ is the one that minimizes $\sum_{k=1}^{N}(y_k - a x_k)^2$.
• We call that value $\alpha$, and we call $\alpha x_k$ the OLS univariate linear regression function of $y_k$ on $x_k$.
• If $N = 2$, $\alpha = \frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$. You should know how to prove that.
• In general, $\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$. You should know how to prove that.
29
Roadmap
1. The OLS univariate linear regression function.
2. Estimating the OLS univariate linear regression function.
3. OLS univariate linear regression in practice.
30
Can we compute $\alpha$?
• Our prediction for $y_k$ based on a univariate linear regression is $\alpha x_k$, the univariate linear regression function.
• => to be able to make a prediction for a unit's $y_k$ based on her $x_k$, we need to know the value of $\alpha$.
• Under the assumptions we have made so far, can we compute $\alpha$? Discuss this question with your neighbor during 1 minute.
31
iClicker time
• Under the assumptions we have made so far, can we compute $\alpha$?
a) Yes
b) No
32
We do not observe the $y_k$s, => we cannot compute $\alpha$
• Remember, we have assumed that we observe the $x_k$s of everybody in the population (e.g. applicants' FICO scores) but not the $y_k$s (e.g. the amount that a person applying for a one-year loan in April 2018 will fail to reimburse in April 2019 when that loan expires).
• => we cannot compute $\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$.
33
But we can estimate $\alpha$ if we observe the $y_k$s of a sample of the population.
• We can estimate $\alpha$ if we observe the $y_k$s of a sample of the population.
• E.g.: in the Gmail example, we could select a random sample of emails, and ask a human to determine whether those emails are spams or not.
34
Randomly sampling one unit.
• Assume we randomly select one unit in the population, and we measure the dependent and the independent variable of that unit.
• E.g.: we randomly select one email out of all the emails reaching Gmail accounts on May 1st, 2018, and we look at whether it is a spam or not, and whether it contains the word "free" or not.
• Let $Y_1$ and $X_1$ respectively denote the value of the dependent and of the independent variable for that randomly selected unit.
• $Y_1$ and $X_1$ are random variables, because their values depend on which unit of the population we randomly select.
• If we select the 34th unit in the population, $Y_1 = y_{34}$ and $X_1 = x_{34}$.
• Each unit in the population has the same probability, $\frac{1}{N}$, of being selected.
• What is the value of $E(X_1 Y_1)$? Hint: $E(X_1 Y_1)$ is a function of all the $y_k$s and of all the $x_k$s. Discuss this question with your neighbor during 2mns.
35
iClicker time
• Assume we randomly select one unit in the population, and we measure the dependent and the independent variable of that unit.
• Let $Y_1$ and $X_1$ respectively denote the value of the dependent and of the independent variable for that randomly selected unit.
• $Y_1$ and $X_1$ are random variables, because their values depend on which unit of the population we randomly select.
• Each unit in the population has a probability $\frac{1}{N}$ of being selected.
• What is the value of $E(X_1 Y_1)$?
a) $E(X_1 Y_1) = x_k y_k$
b) $E(X_1 Y_1) = \frac{1}{N}\sum_{k=1}^{N} x_k y_k$
c) $E(X_1 Y_1) = \sum_{k=1}^{N} x_k y_k$
36
$E(X_1 Y_1) = \frac{1}{N}\sum_{k=1}^{N} x_k y_k$
• $X_1 Y_1$ is equal to:
– $x_1 y_1$ if the first individual in the population is selected, which has a probability $\frac{1}{N}$ of happening
– $x_2 y_2$ if the second individual in the population is selected, which has a probability $\frac{1}{N}$ of happening
– …
– $x_N y_N$ if the $N$th individual in the population is selected, which has a probability $\frac{1}{N}$ of happening
• Therefore, $E(X_1 Y_1) = \sum_{k=1}^{N} \frac{1}{N} x_k y_k = \frac{1}{N}\sum_{k=1}^{N} x_k y_k$.
• What is the value of $E(X_1^2)$? Discuss this question with your neighbor during 1mn.
37
iClicker time
• We randomly select one unit, and we measure the dependent and the independent variable of that unit.
• Let $Y_1$ and $X_1$ respectively denote the value of the dependent and of the independent variable for that randomly selected unit.
• $Y_1$ and $X_1$ are random variables, because their values depend on which unit of the population we randomly select.
• Each unit in the population has a probability $\frac{1}{N}$ of being selected.
• What is the value of $E(X_1^2)$?
a) $E(X_1^2) = \frac{1}{N}\sum_{k=1}^{N} x_k^2$
b) $E(X_1^2) = \frac{1}{N}\sum_{k=1}^{N} x_k$
38
$E(X_1^2) = \frac{1}{N} \sum_{k=1}^{N} x_k^2$
• $X_1^2$ is equal to:
– $x_1^2$ if the first individual in the population is selected, which has a probability $\frac{1}{N}$ of happening
– $x_2^2$ if the second individual in the population is selected, which has a probability $\frac{1}{N}$ of happening
– …
– $x_N^2$ if the $N$th individual in the population is selected, which has a probability $\frac{1}{N}$ of happening
• Therefore, $E(X_1^2) = \sum_{k=1}^{N} \frac{1}{N} x_k^2 = \frac{1}{N} \sum_{k=1}^{N} x_k^2$.
39
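As a quick sanity check on these two expectations, here is a minimal simulation sketch in Python (the population arrays are hypothetical, purely for illustration): averaging $X_1 Y_1$ and $X_1^2$ over many single uniform draws should recover $\frac{1}{N} \sum_{k=1}^{N} x_k y_k$ and $\frac{1}{N} \sum_{k=1}^{N} x_k^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of N = 5 units (illustrative values).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([4000., 3000., 2000., 1000., 500.])
N = len(x)

# Draw one unit uniformly at random, many times over.
draws = rng.integers(0, N, size=1_000_000)

# Simulated expectations vs. their population counterparts.
print((x[draws] * y[draws]).mean(), np.mean(x * y))  # E(X1*Y1) ~ (1/N) sum x_k y_k
print((x[draws] ** 2).mean(), np.mean(x ** 2))       # E(X1^2)  ~ (1/N) sum x_k^2
```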
Randomly sampling $n$ units.
• We randomly draw $n$ units with replacement from the population, and we measure the dependent and the independent variable of those $n$ units (see the sampling sketch below).
• For every $i$ between 1 and $n$, $Y_i$ and $X_i$ = value of the dependent and of the independent variable of the $i$th unit we randomly select.
• The $Y_i$s and $X_i$s are independent and identically distributed.
• For every $i$ between 1 and $n$, $E(X_i Y_i) = \frac{1}{N} \sum_{k=1}^{N} x_k y_k$ and $E(X_i^2) = \frac{1}{N} \sum_{k=1}^{N} x_k^2$.
40
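A minimal sketch of this sampling scheme in Python (hypothetical population arrays, as before): drawing indices with replacement makes the $(X_i, Y_i)$ pairs iid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population (illustrative values).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([4000., 3000., 2000., 1000., 500.])

# Draw n units with replacement: the (X_i, Y_i) pairs are iid,
# so the X_i*Y_i's are iid and the X_i^2's are iid as well.
n = 100
idx = rng.integers(0, len(x), size=n)
X, Y = x[idx], y[idx]
```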
A method to estimate $\alpha$.
• We want to use the $Y_i$s and the $X_i$s to estimate $\alpha$.
• Remember: $\alpha$ is the value of $a$ that minimizes $\sum_{k=1}^{N} (y_k - a x_k)^2$.
• => to estimate $\alpha$, we could use $\hat{\alpha}$, the value of $a$ that minimizes $\sum_{i=1}^{n} (Y_i - a X_i)^2$.
• Instead of finding the value of $a$ that minimizes the sum of squared prediction errors in the population, find the value of $a$ that minimizes the sum of squared prediction errors in the sample (see the numerical sketch below).
• Intuition: if we find a method that predicts the dependent variable well in the sample, that method should also work well in the full population, as our sample is representative of the population.
41
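To preview the idea numerically, here is a minimal sketch in Python (hypothetical sample arrays, illustrative values) that minimizes the sample sum of squared errors by brute force over a grid of candidate slopes:

```python
import numpy as np

# Hypothetical sample of n = 4 units (illustrative values).
X = np.array([2.0, 4.0, 6.0, 8.0])
Y = np.array([3500., 2800., 1900., 1200.])

# Sum of squared prediction errors, evaluated on a grid of candidate a's.
grid = np.linspace(0.0, 1000.0, 100_001)
sse = ((Y[None, :] - grid[:, None] * X[None, :]) ** 2).sum(axis=1)

alpha_hat = grid[np.argmin(sse)]
print(alpha_hat)  # ~326.67, matching the closed form derived next
```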
The OLS regression function in the sample.
• Let $\hat{\alpha} = \operatorname{argmin}_{a \in \mathbb{R}} \sum_{i=1}^{n} (Y_i - a X_i)^2$.
• We call $\hat{\alpha} X_i$ the OLS regression function of $Y_i$ on $X_i$ in the sample.
• In the sample: because we only use the $Y_i$s and $X_i$s of the $n$ units in the sample we randomly draw from the population.
• $\hat{\alpha}$: coefficient of $X_i$ in the OLS regression of $Y_i$ on $X_i$ in the sample.
• Let $\hat{Y}_i = \hat{\alpha} X_i$. $\hat{Y}_i$ is the predicted value for $Y_i$ according to the OLS regression of $Y_i$ on $X_i$ in the sample.
• Let $\hat{e}_i = Y_i - \hat{Y}_i$. $\hat{e}_i$: error we make when we use the OLS regression in the sample to predict $Y_i$.
• We have $Y_i = \hat{Y}_i + \hat{e}_i$.
• Find a formula for $\hat{\alpha}$, the value of $a$ that minimizes $\sum_{i=1}^{n} (Y_i - a X_i)^2$. Hint: differentiate this function with respect to $a$ and find the value of $a$ that cancels the derivative.
42
iClicker time
• The value of $a$ that minimizes $\sum_{i=1}^{n} (Y_i - a X_i)^2$ is:
a) $\frac{\sum_{i=1}^{n} X_i^2 Y_i}{\sum_{i=1}^{n} X_i}$
b) $\frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$
c) $\frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i}$
43
Value of $a$ minimizing $\sum_{i=1}^{n} (Y_i - a X_i)^2$ is $\frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$
• The derivative with respect to $a$ of $\sum_{i=1}^{n} (Y_i - a X_i)^2$ is $\sum_{i=1}^{n} [-2 X_i (Y_i - a X_i)]$. Why?
• Let's find the value of $a$ for which the derivative equals 0:
$\sum_{i=1}^{n} -2 X_i (Y_i - a X_i) = 0$
iff $\sum_{i=1}^{n} (-2 X_i Y_i + 2 a X_i^2) = 0$
iff $\sum_{i=1}^{n} -2 X_i Y_i + \sum_{i=1}^{n} 2 a X_i^2 = 0$
iff $-2 \sum_{i=1}^{n} X_i Y_i + 2a \sum_{i=1}^{n} X_i^2 = 0$
iff $2a \sum_{i=1}^{n} X_i^2 = 2 \sum_{i=1}^{n} X_i Y_i$
iff $a = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$.
$\hat{\alpha} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$
44
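The closed form is a one-liner in code. A minimal sketch in Python (hypothetical sample arrays, the same illustrative values as in the grid-search sketch above):

```python
import numpy as np

# Hypothetical sample (illustrative values).
X = np.array([2.0, 4.0, 6.0, 8.0])
Y = np.array([3500., 2800., 1900., 1200.])

# alpha_hat = sum_i(X_i * Y_i) / sum_i(X_i^2)
alpha_hat = np.sum(X * Y) / np.sum(X ** 2)

Y_hat = alpha_hat * X   # predicted values Y_hat_i
e_hat = Y - Y_hat       # prediction errors e_hat_i, with Y = Y_hat + e_hat
print(alpha_hat)        # ~326.67, matching the grid search
```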
Reminder: the law of large numbers.
• LLN: let $Z_1, \ldots, Z_n$ be iid random variables, and let $E(Z_i)$ denote their expectation. Then $\lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} Z_i = E(Z_i)$.
• When the sample size grows, the average of iid random variables converges towards their expectation.
45
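A minimal simulation sketch of the LLN in Python (the iid distribution, a fair six-sided die, is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

# iid rolls of a fair six-sided die, whose expectation is 3.5.
for n in [10, 1_000, 100_000]:
    Z = rng.integers(1, 7, size=n)  # draws from {1, ..., 6}
    print(n, Z.mean())              # the average approaches E(Z_i) = 3.5
```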
$\hat{\alpha}$ converges towards $\alpha$ when the sample size grows.
• We randomly draw $n$ units with replacement from the population, and we measure the dependent and the independent variable of those $n$ units.
• For every $i$ between 1 and $n$, $Y_i$ and $X_i$ = value of the dependent and of the independent variable of the $i$th unit we randomly select.
• Because the $n$ units are drawn with replacement, the $(Y_i, X_i)$ are iid, and therefore the $X_i Y_i$s are iid, and the $X_i^2$s are also iid.
• For every $i$ between 1 and $n$, $E(X_i Y_i) = \frac{1}{N} \sum_{k=1}^{N} x_k y_k$ and $E(X_i^2) = \frac{1}{N} \sum_{k=1}^{N} x_k^2$.
• $\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$ and $\hat{\alpha} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$.
• Use the law of large numbers to show that $\lim_{n \to +\infty} \hat{\alpha} = \alpha$. Hint: you need to use the fact that $E(X_i Y_i) = \frac{1}{N} \sum_{k=1}^{N} x_k y_k$ and $E(X_i^2) = \frac{1}{N} \sum_{k=1}^{N} x_k^2$.
46
iClicker time
• Which of the following two arguments is correct?
a) The law of large numbers implies that $\lim_{n \to +\infty} \sum_{i=1}^{n} X_i Y_i = E(X_i Y_i)$ and $\lim_{n \to +\infty} \sum_{i=1}^{n} X_i^2 = E(X_i^2)$. We have $E(X_i Y_i) = \frac{1}{N} \sum_{k=1}^{N} x_k y_k$ and $E(X_i^2) = \frac{1}{N} \sum_{k=1}^{N} x_k^2$. Therefore, $\lim_{n \to +\infty} \hat{\alpha} = \alpha$.
b) The law of large numbers implies that $\lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} X_i Y_i = E(X_i Y_i)$ and $\lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} X_i^2 = E(X_i^2)$. We have $E(X_i Y_i) = \frac{1}{N} \sum_{k=1}^{N} x_k y_k$ and $E(X_i^2) = \frac{1}{N} \sum_{k=1}^{N} x_k^2$. Therefore, $\lim_{n \to +\infty} \hat{\alpha} = \alpha$.
47
The second argument is correct
• We randomly draw $n$ units with replacement from the population, and we measure the dependent and the independent variable of those $n$ units.
• For every $i$ between 1 and $n$, $Y_i$ and $X_i$ = value of the dependent and of the independent variable of the $i$th unit we randomly select.
• Because the $n$ units are drawn with replacement, the $(Y_i, X_i)$ are iid.
• The law of large numbers implies that $\lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} X_i Y_i = E(X_i Y_i)$ and $\lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} X_i^2 = E(X_i^2)$.
• Moreover, we have $E(X_i Y_i) = \frac{1}{N} \sum_{k=1}^{N} x_k y_k$ and $E(X_i^2) = \frac{1}{N} \sum_{k=1}^{N} x_k^2$.
• Therefore:
$\lim_{n \to +\infty} \hat{\alpha} = \lim_{n \to +\infty} \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2} = \lim_{n \to +\infty} \frac{\frac{1}{n} \sum_{i=1}^{n} X_i Y_i}{\frac{1}{n} \sum_{i=1}^{n} X_i^2} = \frac{\lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} X_i Y_i}{\lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} X_i^2} = \frac{E(X_i Y_i)}{E(X_i^2)} = \frac{\frac{1}{N} \sum_{k=1}^{N} x_k y_k}{\frac{1}{N} \sum_{k=1}^{N} x_k^2} = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2} = \alpha.
48
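A minimal simulation sketch of this convergence in Python (hypothetical population arrays, as in the earlier sketches): $\hat{\alpha}$ gets closer to $\alpha$ as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population (illustrative values).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([4000., 3000., 2000., 1000., 500.])

alpha = np.sum(x * y) / np.sum(x ** 2)  # population coefficient

for n in [10, 1_000, 100_000]:
    idx = rng.integers(0, len(x), size=n)        # n draws with replacement
    X, Y = x[idx], y[idx]
    alpha_hat = np.sum(X * Y) / np.sum(X ** 2)   # sample coefficient
    print(n, alpha_hat, alpha)                   # alpha_hat approaches alpha
```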
What you need to remember
• Prediction for $y_k$ based on the OLS regression in the population is $\alpha x_k$, with $\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$.
• We would like to compute $\alpha$, but we cannot, because we do not observe the $y_k$s of everybody in the population.
• => we randomly draw $n$ units with replacement from the population, and measure the dependent and the independent variable of those $n$ units.
• For every $i$ between 1 and $n$, $Y_i$ and $X_i$ = value of the dependent and independent variables of the $i$th unit we randomly select.
• Given that $\alpha$ is the value of $a$ that minimizes $\sum_{k=1}^{N} (y_k - a x_k)^2$, we use $\hat{\alpha}$, the value of $a$ that minimizes $\sum_{i=1}^{n} (Y_i - a X_i)^2$, to estimate $\alpha$.
• We have $\hat{\alpha} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$.
• The law of large numbers implies that $\lim_{n \to +\infty} \hat{\alpha} = \alpha$.
• When the sample we randomly draw gets large, $\hat{\alpha}$, the sample coefficient of the regression, gets close to $\alpha$, the population coefficient.
• Therefore, $\hat{\alpha}$ is a good proxy for $\alpha$ when the sample size is large enough.
49
Roadmap
1. The OLS univariate linear regression function.
2. Estimating the OLS univariate linear regression function.
3. OLS univariate linear regression in practice.
50
How Gmail uses univariate linear regression (1/2)
• Gmail would like to predict $y_k$, a variable equal to 1 if email $k$ is a spam and 0 otherwise.
• To do so, they use $x_k$, a variable equal to 1 if the word "free" appears in the email and 0 otherwise.
• $x_k$ is easy to measure (a computer can do it automatically, by searching for "free" in the email), but $y_k$ is hard to measure: only a human can know for sure whether an email is a spam or not. => they cannot observe $y_k$ for all emails reaching Gmail.
• To make good predictions, they would like to compute $\alpha$, the value of $a$ that minimizes $\sum_{k=1}^{N} (y_k - a x_k)^2$, and then use $\alpha x_k$ to predict $y_k$. $\alpha x_k$: best univariate linear prediction of $y_k$ given $x_k$.
• Issue: $\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$ => they cannot compute it unless they observe all the $y_k$s. But that would be very costly (a human would have to read every email reaching Gmail accounts), and once it was done there would be nothing left to predict, because the $y_k$s would all be known.
51
How Gmail uses univariate linear regression (2/2)
• Instead, Gmail can draw a random sample of, say, 5000 emails, and ask humans to read them and determine whether they are spams or not.
• For every $i$ between 1 and 5000, let $Y_i$ denote whether the $i$th randomly drawn email is a spam or not, and let $X_i$ denote whether the $i$th randomly drawn email has the word "free" in it.
• Then, people at Gmail can compute $\hat{\alpha} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$.
• For all the emails they have not randomly drawn, and for which they do not observe $y_k$, they can use $\hat{\alpha} x_k$ as their prediction of whether the email is a spam or not.
• Because their random sample of emails is large, $\hat{\alpha}$ should be close to $\alpha$, and therefore $\hat{\alpha} x_k$ should be close to $\alpha x_k$, the best univariate linear prediction of $y_k$ given $x_k$.
52
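A minimal sketch of this workflow in Python (the labeled sample below is simulated, purely for illustration; the slides do not specify Gmail's actual data or tooling):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated labeled sample of 5000 emails:
# X_i = 1 if the email contains "free", Y_i = 1 if a human labeled it spam.
X = rng.integers(0, 2, size=5000).astype(float)
Y = np.where(X == 1,
             rng.random(5000) < 0.7,     # "free" emails: mostly spam here
             rng.random(5000) < 0.1      # others: mostly not spam
             ).astype(float)

alpha_hat = np.sum(X * Y) / np.sum(X ** 2)

# Prediction for an unlabeled email k that contains the word "free":
x_k = 1.0
print(alpha_hat * x_k)  # ~0.7 with this simulated sample
```

With a binary $X_i$, $\hat{\alpha}$ reduces to the share of "free"-containing emails in the sample that are spam, which is why the prediction lands near 0.7 here.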
How banks use univariate linear regression (1/2)
• A bank would like to predict $y_k$, a variable equal to the amount that a person applying in April 2018 for a one-year loan will fail to reimburse in April 2019 when her loan expires.
• To do so, they use $x_k$, the FICO score of that applicant.
• $x_k$ is easy to measure (the bank has access to the FICO scores of all applicants), but $y_k$ is impossible to measure today: only in April 2019 will the bank know the amount the applicant fails to reimburse.
• To make good predictions, they would like to compute $\alpha$, the value of $a$ that minimizes $\sum_{k=1}^{N} (y_k - a x_k)^2$, and then use $\alpha x_k$ to predict $y_k$. $\alpha x_k$: best univariate linear prediction of $y_k$ given $x_k$.
• Issue: $\alpha = \frac{\sum_{k=1}^{N} x_k y_k}{\sum_{k=1}^{N} x_k^2}$ => they cannot compute it, because they do not observe the $y_k$s.
53
How banks use univariate linear regression (2/2)
• Instead, the bank can use data on people who applied in April 2017 for a one-year loan. For those people, the bank knows how much they failed to reimburse. Let's assume that the bank had 1000 applicants in April 2017 and has 1000 applicants in April 2018.
• For every $i$ between 1 and 1000, let $Y_i$ denote the amount that the $i$th April 2017 applicant failed to reimburse on her loan, and let $X_i$ denote the FICO score of that applicant.
• Then, people at the bank can compute $\hat{\alpha} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$.
• For their April 2018 applicants, for whom they do not observe $y_k$, they can use $\hat{\alpha} x_k$ as their prediction of the amount each applicant will fail to reimburse.
• Which condition should be satisfied to ensure $\hat{\alpha}$ is close to $\alpha$? Hint: look again at the Gmail example. There is one difference in the way we select the observations for which we measure $Y_i$ in the bank example and in the Gmail example. Discuss this question with your neighbor for one minute.
54
iClicker time
• Which condition should be satisfied to ensure $\hat{\alpha}$ is close to $\alpha$?
55
April 2017 and 2018 applicants should look similar
• Previous section: $\hat{\alpha}$ converges towards $\alpha$ if the sample of units for which we observe $Y_i$ is randomly drawn from the population.
• The bank cannot draw a random sample of April 2018 applicants and observe today the amount this sample will fail to reimburse.
• Instead, it can use the April 2017 applicants, for whom it can measure both $Y_i$, the amount that each applicant failed to reimburse, and $X_i$, the FICO score.
• Then, it can compute $\hat{\alpha}$, and for each April 2018 applicant it can use $\hat{\alpha} x_k$ as its prediction of $y_k$, the amount that applicant will fail to reimburse (see the sketch below).
• If the April 2017 applicants are "as good as" a random sample from the combined population of April 2017 and April 2018 applicants, then all our theoretical results apply: $\hat{\alpha}$ should be close to $\alpha$, and our predictions should be good.
• For the April 2017 applicants to be almost a random sample from the population of April 2017 and April 2018 applicants, the two cohorts should look very similar. E.g.: they should have similar FICO scores, demographics…
• => if the two groups look similar, $\hat{\alpha} x_k$ should be a good prediction of $y_k$ for 2018 applicants. Otherwise, we have to be careful.
56
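A minimal sketch of this workflow in Python (the 2017 training data and 2018 FICO scores are simulated, purely for illustration; real FICO scores and repayment amounts would replace them):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated April 2017 applicants: FICO scores, and the amounts they
# failed to reimburse, observed by the bank once those loans expired.
X_2017 = rng.uniform(300, 850, size=1000)
Y_2017 = rng.uniform(0, 5000, size=1000)

# Estimate the coefficient on the 2017 cohort.
alpha_hat = np.sum(X_2017 * Y_2017) / np.sum(X_2017 ** 2)

# April 2018 applicants: only their FICO scores are observable today.
x_2018 = rng.uniform(300, 850, size=1000)
predictions = alpha_hat * x_2018  # predicted amounts not reimbursed
print(alpha_hat, predictions[:3])
```

The prediction step is only trustworthy to the extent that the 2017 cohort resembles the 2018 cohort, which is exactly the condition discussed above.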
What you need to remember, and what's next
• In practice, there are many instances where we can measure the $y_k$s, the variable we do not observe for everyone, for a subsample of the population.
• We can use that subsample to compute $\hat{\alpha}$, and then use $\hat{\alpha} x_k$ as our prediction of the $y_k$s we do not observe.
• If that subsample is a random sample from the population (Gmail example), $\hat{\alpha} x_k$ should be close to $\alpha x_k$, the best linear prediction of $y_k$.
• On the other hand, if that subsample is not a random sample from the population (bank example), $\hat{\alpha} x_k$ will be close to $\alpha x_k$ only if the subsample looks very similar to the entire population (almost a random sample).
• Even when we have a random sample, univariate linear regression might still not give great predictions.
• There are better prediction methods available. Next lectures: we will see one of them.
57