Econometrics Take Home Exam

Hi, attached are the exam, PowerPoint slides, and practice problems with solutions. Please follow the PowerPoint slides for it to make sense for the class and to get a high grade.

Ordinary least squares regression II:

The univariate affine regression.

Clement de Chaisemartin and Doug Steigerwald

UCSB

1

Many people need to make predictions

• Traders: use today's GDP growth to predict tomorrow's oil price.

• Banks: use FICO score to predict the amount that an April 2018 applicant will fail to reimburse on her one-year loan in April 2019.

• Gmail: use whether an incoming email has the word "free" in it to predict whether it's spam.

2

The relationship between FICO score and default

• Assume that the relationship between FICO score and the amount people fail to repay looks like the graph below: people with low FICO fail to repay more.

• If you use a univariate linear regression to predict the amount people fail to repay based on their FICO score, will you make good predictions? Discuss this question with your neighbor.

[Figure: scatter plot of Default (y-axis, −1000 to 6000) against FICO score (x-axis, 0 to 10).]

3

iClicker time

• If you use a univariate linear regression to predict the amount people fail to repay based on their FICO score, will you make good predictions?

4

No!

• In this example, the OLS regression function is 250*FICO, increasing with FICO! We predict that people with better scores will fail to reimburse more.

• OLS regression makes large prediction errors.

• Why does the regression make large prediction errors?

[Figure: scatter plot of Default (−1000 to 6000) against FICO score (0 to 10), with the fitted line 250*FICO; the gap between the line and the data at x=2 marks the prediction error of the univariate linear regression for a person with x=2.]

5

iClicker time

• Why does the univariate linear regression make large prediction errors?

a) Because the relationship between FICO and the amount people fail to repay is decreasing.

b) Because the amount that people with FICO score equal to 0 fail to repay is different from 0.

6

Because the amount that people with a FICO score equal to 0 fail to repay is different from 0.

• The univariate linear regression function is $\beta_1 x_k$. Therefore, by construction, our prediction will be 0 for people with FICO score = 0.

• However, as you can see from the graph, people with a FICO score equal to 0 fail to reimburse a strictly positive amount on their loan, not a 0 amount.

7

You should use an affine prediction function.

• The graph below shows that the function 5000−500*FICO does a much better job at predicting the amount that people fail to repay than the univariate linear regression function 250*FICO.

• 5000−500*FICO is an affine function of FICO, with an intercept equal to 5000 and a slope equal to −500.

• In these lectures, we study OLS univariate affine regression.

[Figure: scatter plot of Default (−1000 to 6000) against FICO score (0 to 10), with the two lines 250*FICO and 5000−500*FICO.]

8

Roadmap

1. The OLS univariate affine regression function.
2. Estimating the OLS univariate affine regression function.
3. Interpreting $\hat\beta_1$.
4. OLS univariate affine regression in practice.

9

Set up and notation.

• We consider a population of $N$ units.
– $N$ = number of people who apply for a one-year loan with bank A during April 2018.
– $N$ = number of emails reaching Gmail accounts in April 2018.

• Each unit $k$ has a variable $y_k$ attached to it that we do not observe. We call this variable the dependent variable.
– In the loan example, $y_k$ is the amount of her loan that applicant $k$ will fail to reimburse when her loan expires in April 2019.
– In the email example, $y_k = 1$ if email $k$ is spam and 0 otherwise.

• Each unit $k$ also has 1 variable $x_k$ attached to it that we do observe. We call this variable the independent variable.
– In the loan example, $x_k$ could be the FICO score of applicant $k$.
– In the email example, $x_k = 1$ if the word "free" appears in the email.

• $\bar y = \frac{1}{N}\sum_{k=1}^N y_k$ and $\bar x = \frac{1}{N}\sum_{k=1}^N x_k$: averages of the $y_k$s and $x_k$s.

10

Your prediction should be a function of $x_k$

• Based on the value of $x_k$ of each unit, we want to predict her $y_k$.

• E.g.: in the loan example, we want to predict the amount that unit $k$ will fail to repay on her loan based on her FICO score.

• Assume that applicant 1 has a very high (good) credit score, while applicant 2 has a very low (bad) credit score.

• Should you predict the same value of $y_k$ for applicants 1 and 2?

• No! Your prediction should be a function of $x_k$, $f(x_k)$.

• In these lectures, we focus on predictions which are an affine function of $x_k$: $f(x_k) = b_0 + b_1 x_k$, for two real numbers $b_0$ and $b_1$.

11

Our prediction error is $y_k - (b_0 + b_1 x_k)$.

• Based on the value of $x_k$ of each unit, we want to predict her $y_k$.

• Our prediction should be a function of $x_k$, $f(x_k)$. We focus on predictions which are an affine function of $x_k$: $f(x_k) = b_0 + b_1 x_k$, for two real numbers $b_0$ and $b_1$.

• $y_k - (b_0 + b_1 x_k)$, the difference between $y_k$ and our prediction, is our prediction error.

• In the loan example, if $y_k - (b_0 + b_1 x_k)$ is large and positive, our prediction is much below the amount applicant $k$ will fail to reimburse.

• If $y_k - (b_0 + b_1 x_k)$ is large and negative, our prediction is much above the amount person $k$ will fail to reimburse.

• Large positive or negative values of $y_k - (b_0 + b_1 x_k)$ mean bad prediction.

• $y_k - (b_0 + b_1 x_k)$ close to 0 means good prediction.

12

We want to find the value of $(b_0, b_1)$ that minimizes $\sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$

• $\sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$ is positive. => minimizing it = same thing as making it as close to 0 as possible.

• If $\sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$ is as close to 0 as possible, this means that the sum of the squared values of our prediction errors is as small as possible.

• => we make small errors. That's good, that's what we want!

13

The OLS univariate affine regression function in the population.

• Let $(\beta_0, \beta_1) = \underset{(b_0, b_1) \in \mathbb{R}^2}{\operatorname{argmin}} \sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$.

• We call $\beta_0 + \beta_1 x_k$ the ordinary least squares (OLS) univariate affine regression function of $y_k$ on $x_k$ in the population.

• Affine: because the regression function is an affine function of $x_k$.

• Shortcut: OLS regression of $y_k$ on a constant and $x_k$ in the population.

• Constant: because there is the constant $\beta_0$ in our prediction function.

14

Decomposing $y_k$ between predicted value and error.

• $\beta_0$ and $\beta_1$: coefficients of the constant and $x_k$ in the OLS regression of $y_k$ on a constant and $x_k$ in the population.

• Let $\hat y_k = \beta_0 + \beta_1 x_k$. $\hat y_k$ is the predicted value for $y_k$ according to the OLS regression of $y_k$ on a constant and $x_k$ in the population.

• Let $e_k = y_k - \hat y_k$. $e_k$: error we make when we use the OLS regression in the population to predict $y_k$.

• We have $y_k = \hat y_k + e_k$: predicted value + error.

15

$\beta_0 = \bar y - \beta_1 \bar x$ …

• $(\beta_0, \beta_1)$: $(b_0, b_1)$ minimizing $\sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$.

• Derivative wrt $b_0$ is: $\sum_{k=1}^N -2(y_k - (b_0 + b_1 x_k))$. Why?

• Derivative wrt $b_1$ is: $\sum_{k=1}^N -2x_k(y_k - (b_0 + b_1 x_k))$. Why?

• $(\beta_0, \beta_1)$: value of $(b_0, b_1)$ for which the 2 derivatives = 0.

• We use the fact that the 1st derivative = 0 to write $\beta_0$ as a function of $\beta_1$:

$\sum_{k=1}^N -2(y_k - (\beta_0 + \beta_1 x_k)) = 0$
iff $-2 \sum_{k=1}^N (y_k - \beta_0 - \beta_1 x_k) = 0$
iff $\sum_{k=1}^N (y_k - \beta_0 - \beta_1 x_k) = 0$
iff $\sum_{k=1}^N y_k - \sum_{k=1}^N \beta_0 - \sum_{k=1}^N \beta_1 x_k = 0$
iff $\sum_{k=1}^N y_k - \sum_{k=1}^N \beta_1 x_k = \sum_{k=1}^N \beta_0$
iff $\sum_{k=1}^N y_k - \beta_1 \sum_{k=1}^N x_k = N \beta_0$
iff $\frac{1}{N} \sum_{k=1}^N y_k - \beta_1 \frac{1}{N} \sum_{k=1}^N x_k = \beta_0$
iff $\beta_0 = \bar y - \beta_1 \bar x$.

16

2 useful formulas for the next derivation.

• During the sessions, you have proven that $\frac{1}{N} \sum_{k=1}^N x_k^2 - \bar x^2 = \frac{1}{N} \sum_{k=1}^N (x_k - \bar x)^2$.

• Multiplying both sides by $N$, this is equivalent to saying that $\sum_{k=1}^N x_k^2 - N \bar x^2 = \sum_{k=1}^N (x_k - \bar x)^2$.

• Bear this 1st equality in mind, we use it in the next derivation.

• Moreover,
$\sum_{k=1}^N \bar x (y_k - \bar y) = \bar x \sum_{k=1}^N (y_k - \bar y) = \bar x \sum_{k=1}^N y_k - \bar x \sum_{k=1}^N \bar y = \bar x N \bar y - \bar x N \bar y = 0$.

• Therefore,
$\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y) = \sum_{k=1}^N \left[ x_k (y_k - \bar y) - \bar x (y_k - \bar y) \right] = \sum_{k=1}^N x_k (y_k - \bar y) - \sum_{k=1}^N \bar x (y_k - \bar y) = \sum_{k=1}^N x_k (y_k - \bar y)$.

• Bear this 2nd equality in mind, we use it in the next derivation.

17

… and $\beta_1 = \frac{\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^N (x_k - \bar x)^2}$

• Now, let's use the fact that the 2nd derivative = 0 and the formula for $\beta_0$ to find $\beta_1$.

$\sum_{k=1}^N -2 x_k (y_k - (\beta_0 + \beta_1 x_k)) = 0$
iff $\sum_{k=1}^N x_k (y_k - (\beta_0 + \beta_1 x_k)) = 0$
iff $\sum_{k=1}^N (x_k y_k - \beta_0 x_k - \beta_1 x_k^2) = 0$
iff $\sum_{k=1}^N x_k y_k - \sum_{k=1}^N \beta_0 x_k - \sum_{k=1}^N \beta_1 x_k^2 = 0$
iff $\sum_{k=1}^N x_k y_k - \beta_0 \sum_{k=1}^N x_k - \beta_1 \sum_{k=1}^N x_k^2 = 0$
iff $\sum_{k=1}^N x_k y_k = \beta_0 \sum_{k=1}^N x_k + \beta_1 \sum_{k=1}^N x_k^2$
iff $\sum_{k=1}^N x_k y_k = (\bar y - \beta_1 \bar x) \sum_{k=1}^N x_k + \beta_1 \sum_{k=1}^N x_k^2$
iff $\sum_{k=1}^N x_k y_k = \bar y \sum_{k=1}^N x_k - \beta_1 \bar x \sum_{k=1}^N x_k + \beta_1 \sum_{k=1}^N x_k^2$
iff $\sum_{k=1}^N x_k y_k - \sum_{k=1}^N x_k \bar y = \beta_1 \left( \sum_{k=1}^N x_k^2 - \bar x \sum_{k=1}^N x_k \right)$
iff $\sum_{k=1}^N (x_k y_k - x_k \bar y) = \beta_1 \left( \sum_{k=1}^N x_k^2 - \bar x N \bar x \right)$
iff $\sum_{k=1}^N x_k (y_k - \bar y) = \beta_1 \left( \sum_{k=1}^N x_k^2 - N \bar x^2 \right)$
iff $\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y) = \beta_1 \sum_{k=1}^N (x_k - \bar x)^2$
iff $\beta_1 = \frac{\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^N (x_k - \bar x)^2}$.

18
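As a sanity check on the closed-form solution above, a minimal sketch (assuming numpy is available, with made-up data) can compare the formulas $\beta_0 = \bar y - \beta_1 \bar x$ and $\beta_1 = \sum_k (x_k - \bar x)(y_k - \bar y) / \sum_k (x_k - \bar x)^2$ against a generic least-squares fit:

```python
import numpy as np

# Made-up data (any x with some variability works).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 3.5, 3.0, 1.0, 0.5])

# Closed-form OLS coefficients derived on the slides.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# A generic degree-1 least-squares fit should give the same numbers.
slope, intercept = np.polyfit(x, y, 1)
print(beta0, beta1)  # beta0 ~ 4.9, beta1 ~ -1.15
assert np.isclose(beta1, slope) and np.isclose(beta0, intercept)
```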

Applying the formulas for $\beta_0$ and $\beta_1$ in an example.

• Assume for a minute that $N = 3$: there are only three units in the population.

• Assume that $y_1 = 2$, $x_1 = 0$, $y_2 = 3$, $x_2 = 1$, $y_3 = 7$, and $x_3 = 2$.

• Use the previous formulas to compute $\beta_0$ and $\beta_1$ in this example.

19

iClicker time

• If $N = 3$, $y_1 = 2$, $x_1 = 0$, $y_2 = 3$, $x_2 = 1$, $y_3 = 7$, and $x_3 = 2$, then:

a) $\beta_0 = 3/2$ and $\beta_1 = 1/2$

b) $\beta_0 = 3/2$ and $\beta_1 = 2$

c) $\beta_0 = 3/2$ and $\beta_1 = 5/2$

20

$\beta_1 = 5/2$ and $\beta_0 = 3/2$

• If $N = 3$, $y_1 = 2$, $x_1 = 0$, $y_2 = 3$, $x_2 = 1$, $y_3 = 7$, and $x_3 = 2$, then $\bar y = 4$ and $\bar x = 1$.

• Then,
$\beta_1 = \frac{(x_1 - \bar x)(y_1 - \bar y) + (x_2 - \bar x)(y_2 - \bar y) + (x_3 - \bar x)(y_3 - \bar y)}{(x_1 - \bar x)^2 + (x_2 - \bar x)^2 + (x_3 - \bar x)^2}$
$= \frac{(0-1)(2-4) + (1-1)(3-4) + (2-1)(7-4)}{(0-1)^2 + (1-1)^2 + (2-1)^2} = \frac{5}{2}$.

• And $\beta_0 = \bar y - \beta_1 \bar x = 4 - \frac{5}{2} = \frac{3}{2}$.

21
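The three-unit example above is easy to verify numerically; a minimal sketch (numpy assumed available):

```python
import numpy as np

# The population from the slide: (x_k, y_k) for k = 1, 2, 3.
x = np.array([0.0, 1.0, 2.0])
y = np.array([2.0, 3.0, 7.0])

# Apply the closed-form OLS formulas.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  # -> 1.5 2.5, i.e. beta0 = 3/2 and beta1 = 5/2
```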

Two other useful formulas

• We let $e_k = y_k - (\beta_0 + \beta_1 x_k)$. $e_k$: error we make when we use the univariate affine regression to predict $y_k$.

• In the derivation of the formula of $\beta_0$, we have shown that $\sum_{k=1}^N (y_k - \beta_0 - \beta_1 x_k) = 0$.

• This is equivalent to $\sum_{k=1}^N e_k = 0$, which is itself equivalent to $\frac{1}{N} \sum_{k=1}^N e_k = 0$: the average of our prediction errors is 0.

• In the derivation of the formula of $\beta_1$, we have also shown that $\sum_{k=1}^N x_k (y_k - \beta_0 - \beta_1 x_k) = 0$.

• This is equivalent to $\sum_{k=1}^N x_k e_k = 0$, which is itself equivalent to $\frac{1}{N} \sum_{k=1}^N x_k e_k = 0$: the average of the product of our prediction errors and $x_k$ is 0.

22
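Both identities above follow from the first-order conditions, so they hold for any data; a small sketch (made-up numbers, numpy assumed) confirms them:

```python
import numpy as np

# Made-up population; the two identities hold for any data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 4.0, 2.0, 6.0])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
e = y - (beta0 + beta1 * x)  # prediction errors e_k

# First-order conditions: errors average to 0, and so does their product with x.
print(np.mean(e), np.mean(x * e))
assert np.isclose(np.mean(e), 0) and np.isclose(np.mean(x * e), 0)
```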

What you need to remember

• Population of $N$ units. Each unit $k$ has 2 variables attached to it: $y_k$ is a variable we do not observe, $x_k$ is a variable we observe.

• We want to predict the $y_k$ of each unit based on her $x_k$.

• Our prediction should be a function of $x_k$, $f(x_k)$.

• Focus on affine functions: $b_0 + b_1 x_k$, for 2 numbers $b_0$ and $b_1$.

• Best $(b_0, b_1)$ is that minimizing $\sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$.

• We call that value $(\beta_0, \beta_1)$, we call $\beta_0 + \beta_1 x_k$ the OLS regression function of $y_k$ on a constant and $x_k$, and we let $e_k = y_k - (\beta_0 + \beta_1 x_k)$.

• $\beta_0 = \bar y - \beta_1 \bar x$, and $\beta_1 = \frac{\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^N (x_k - \bar x)^2}$.

• We have $\frac{1}{N} \sum_{k=1}^N e_k = 0$: the average prediction error is 0, and $\frac{1}{N} \sum_{k=1}^N x_k e_k = 0$.

23

Roadmap

1. The OLS univariate affine regression function.
2. Estimating the OLS univariate affine regression function.
3. Interpreting $\hat\beta_1$.
4. OLS univariate affine regression in practice.

24

Can we compute $(\beta_0, \beta_1)$?

• Our prediction for $y_k$ based on a univariate affine regression is $\beta_0 + \beta_1 x_k$, the univariate affine regression function.

• => to be able to make a prediction for a unit's $y_k$ based on her $x_k$, we need to know the value of $(\beta_0, \beta_1)$.

• Under the assumptions we have made so far, can we compute $(\beta_0, \beta_1)$? Discuss this question with your neighbor for 1 minute.

25

iClicker time

• Can we compute $(\beta_0, \beta_1)$?

26

No!

• $\beta_0 = \bar y - \beta_1 \bar x$, and $\beta_1 = \frac{\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^N (x_k - \bar x)^2}$.

• Remember, we have assumed that we observe the $x_k$s of everybody in the population (e.g. applicants' FICO scores) but not the $y_k$s (e.g. the amount that a person applying for a one-year loan in April 2018 will fail to reimburse in April 2019 when that loan expires).

• => we cannot compute $\beta_0$ and $\beta_1$.

27

A method to estimate $\beta_0$ and $\beta_1$

• We draw $n$ units from the population, and we measure the dependent and the independent variable of those units.

• For every $i$ between 1 and $n$, $Y_i$ and $X_i$ = value of the dependent and of the independent variable of the $i$th unit we randomly select.

• We want to use the $Y_i$s and the $X_i$s to estimate $\beta_0$ and $\beta_1$.

• $(\beta_0, \beta_1)$: $(b_0, b_1)$ minimizing $\sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$.

• => to estimate $(\beta_0, \beta_1)$, we use the $(b_0, b_1)$ minimizing $\sum_{i=1}^n (Y_i - (b_0 + b_1 X_i))^2$.

• Instead of finding the $(b_0, b_1)$ that minimizes the sum of squared prediction errors in the population, find the $(b_0, b_1)$ that minimizes the sum of squared prediction errors in the sample.

• Intuition: if we find a method that predicts the dependent variable well in the sample, the method should work well in the entire population, given that the sample is representative of the population.

28

The OLS regression function in the sample.

• Let $(\hat\beta_0, \hat\beta_1) = \underset{(b_0, b_1) \in \mathbb{R}^2}{\operatorname{argmin}} \sum_{i=1}^n (Y_i - (b_0 + b_1 X_i))^2$.

• We call $\hat\beta_0 + \hat\beta_1 X_i$ the OLS regression function of $Y_i$ on a constant and $X_i$ in the sample.

• In the sample: because we only use the $Y_i$s and $X_i$s of the $n$ units in the sample we randomly draw from the population.

• $(\hat\beta_0, \hat\beta_1)$: coefficients of the constant and $X_i$ in the OLS regression of $Y_i$ on $X_i$ in the sample.

• Let $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$. $\hat Y_i$ is the predicted value for $Y_i$ according to the OLS regression of $Y_i$ on a constant and $X_i$ in the sample.

• Let $\hat e_i = Y_i - \hat Y_i$. $\hat e_i$: error we make when we use the OLS regression in the sample to predict $Y_i$.

• We have $Y_i = \hat Y_i + \hat e_i$.

29

Find the value of $(b_0, b_1)$ that minimizes $\sum_{i=1}^n (Y_i - (b_0 + b_1 X_i))^2$.

• Find a formula for the value of $(b_0, b_1)$ that minimizes $\sum_{i=1}^n (Y_i - (b_0 + b_1 X_i))^2$. Hint: the formula is "almost" the same as that for $(\beta_0, \beta_1)$, except that you need to replace:
– $N$, the size of the population, by $n$, the size of the sample,
– $y_k$, the dependent variable of unit $k$ in the population, by $Y_i$, the dependent variable of unit $i$ in the sample,
– $x_k$, the independent variable of unit $k$ in the population, by $X_i$, the independent variable of unit $i$ in the sample.

30

iClicker time

• Let $\bar Y = \frac{1}{n} \sum_{i=1}^n Y_i$, and $\bar X = \frac{1}{n} \sum_{i=1}^n X_i$. Let $(\hat\beta_0, \hat\beta_1)$ denote the value of $(b_0, b_1)$ that minimizes $\sum_{i=1}^n (Y_i - (b_0 + b_1 X_i))^2$. We have:

a) $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, and $\hat\beta_1 = \frac{\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^N (x_k - \bar x)^2}$.

b) $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$, and $\hat\beta_1 = \sum_{i=1}^n \frac{(X_i - \bar X)(Y_i - \bar Y)}{(X_i - \bar X)^2}$.

c) $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$, and $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$.

31

$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$, and $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$!

• Sketch of the proof.

• Differentiate $\sum_{i=1}^n (Y_i - (b_0 + b_1 X_i))^2$ wrt $b_0$ and $b_1$.

• $\hat\beta_0$ and $\hat\beta_1$: values of $b_0$ and $b_1$ that cancel these two derivatives. That gives us a system of 2 equations with 2 unknowns to solve.

• The steps to solve it are exactly the same as those we used to find $\beta_0 = \bar y - \beta_1 \bar x$ and $\beta_1 = \frac{\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^N (x_k - \bar x)^2}$, except that we replace:
– $N$, the size of the population, by $n$, the size of the sample,
– $y_k$, the dependent variable of unit $k$ in the population, by $Y_i$, the dependent variable of unit $i$ in the sample,
– $x_k$, the independent variable of unit $k$ in the population, by $X_i$, the independent variable of unit $i$ in the sample.

32

$\hat\beta_0$ converges towards $\beta_0$, and $\hat\beta_1$ converges towards $\beta_1$.

• Remember, when we studied the OLS regression of $Y_i$ on $X_i$ without a constant, we used the law of large numbers to prove that $\hat\beta_1 \underset{n \to +\infty}{\longrightarrow} \beta_1$.

• When the sample we randomly draw gets large, $\hat\beta_1$, the sample coefficient of the regression, gets close to $\beta_1$, the population coefficient, so $\hat\beta_1$ is a good proxy for $\beta_1$.

• Here, one can also use the law of large numbers to prove that $\hat\beta_0 \underset{n \to +\infty}{\longrightarrow} \beta_0$ and $\hat\beta_1 \underset{n \to +\infty}{\longrightarrow} \beta_1$.

• Take-away: when the sample we randomly draw gets large, $\hat\beta_0$ and $\hat\beta_1$, the sample coefficients of the regression of $Y_i$ on a constant and $X_i$, get close to $\beta_0$ and $\beta_1$, the population coefficients.

• Therefore, $\hat\beta_0$ and $\hat\beta_1$ = good proxies for $\beta_0$ and $\beta_1$ when the sample is large enough.

33
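The convergence can be illustrated by simulation; a minimal sketch (made-up population, numpy assumed) draws ever larger random samples and watches the sample coefficients approach the population ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up finite population with known OLS coefficients.
N = 100_000
x = rng.uniform(0, 10, N)
y = 5000 - 500 * x + rng.normal(0, 300, N)

def ols(xv, yv):
    b1 = np.sum((xv - xv.mean()) * (yv - yv.mean())) / np.sum((xv - xv.mean()) ** 2)
    return yv.mean() - b1 * xv.mean(), b1

beta0, beta1 = ols(x, y)  # population coefficients

# Draw increasingly large random samples: the sample coefficients
# get closer and closer to (beta0, beta1).
for n in (50, 500, 50_000):
    idx = rng.choice(N, size=n, replace=True)
    b0_hat, b1_hat = ols(x[idx], y[idx])
    print(n, round(b0_hat, 1), round(b1_hat, 1))
```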

iClicker time

• We have shown that $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$.

• Is $\hat\beta_1$ a real number, or is it a random variable? Discuss this question with your neighbor for 1 minute, and then answer.

a) $\hat\beta_1$ is a real number.

b) $\hat\beta_1$ is a random variable.

34

$\hat\beta_1$ is a random variable!

• We have shown that $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$.

• The $X_i$s and $Y_i$s are random variables: their value depends on which unit we randomly draw when we draw the $i$th unit in the sample.

• Therefore, $\hat\beta_1$ is a random variable, with a variance.

• Let $\sigma^2 = \frac{1}{N} \sum_{k=1}^N (e_k)^2$ denote the average of the squared prediction errors in the population.

• One can show that $V(\hat\beta_1) \approx \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}$.

• $V(\hat\beta_1)$ small if the average squared prediction error is low, meaning that the regression model makes small prediction errors in the population.

• $V(\hat\beta_1)$ small if the variability of $X_i$ is high.

• $V(\hat\beta_1)$ small if the sample size is large.

35

Using the central limit theorem for $\hat\beta_1$ to construct a test and a confidence interval.

• If $n \geq 100$, $\frac{\hat\beta_1 - \beta_1}{\sqrt{\hat V(\hat\beta_1)}}$ approximately follows a normal distribution with mean 0 and variance 1.

• We can use this to test a null hypothesis on $\beta_1$.

• Often, we want to test $\beta_1 = 0$. If $\beta_1 = 0$, the OLS regression function is $\beta_0 + 0 \times x_k = \beta_0$. This means that $x_k$ is actually useless to predict $y_k$. E.g.: the best prediction of the amount people will fail to repay on their loan is actually not a function of their FICO score, it is just a constant.

• If we want to have a 5% chance of wrongly rejecting $\beta_1 = 0$, the test is:
Reject $\beta_1 = 0$ if $\frac{\hat\beta_1}{\sqrt{\hat V(\hat\beta_1)}} > 1.96$ or $\frac{\hat\beta_1}{\sqrt{\hat V(\hat\beta_1)}} < -1.96$.
Otherwise, do not reject $\beta_1 = 0$.

• We can also construct a confidence interval for $\beta_1$:
$\left[ \hat\beta_1 - 1.96 \sqrt{\hat V(\hat\beta_1)},\ \hat\beta_1 + 1.96 \sqrt{\hat V(\hat\beta_1)} \right]$.
For 95% of the random samples we can draw, $\beta_1$ belongs to the confidence interval.

36
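A minimal sketch of the test and confidence interval (made-up sample with a strongly negative true slope, so the null should be rejected; numpy assumed, and $\sigma^2$ estimated by the mean squared sample error):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up sample where the true slope is -500 (so beta1 = 0 should be rejected).
n = 200
X = rng.uniform(0, 10, n)
Y = 5000 - 500 * X + rng.normal(0, 300, n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e_hat = Y - (b0 + b1 * X)

# Variance formula from the slides: sigma^2 / sum (X_i - Xbar)^2,
# with sigma^2 estimated by the mean squared sample error.
var_b1 = np.mean(e_hat ** 2) / np.sum((X - X.mean()) ** 2)
t_stat = b1 / np.sqrt(var_b1)
ci = (b1 - 1.96 * np.sqrt(var_b1), b1 + 1.96 * np.sqrt(var_b1))

print(t_stat, ci)
# |t| > 1.96: reject beta1 = 0 at the 5% level.
```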

Assessing the quality of our predictions: the MSE

• For every individual in the sample, $\hat e_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$: error we make when we use the sample OLS regression to predict $Y_i$.

• We have $Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat e_i$.

• $e_k = y_k - (\beta_0 + \beta_1 x_k)$: population prediction errors. $\hat e_i$: sample prediction errors.

• Slide 25: we have shown that $\frac{1}{N} \sum_{k=1}^N e_k = 0$.

• Similarly, $\frac{1}{n} \sum_{i=1}^n \hat e_i = 0$. Average sample prediction error = 0.

• We cannot use $\frac{1}{n} \sum_{i=1}^n \hat e_i$ to assess the quality of our predictions. Even if our regression makes bad predictions, $\frac{1}{n} \sum_{i=1}^n \hat e_i$ is always equal to 0.

• Instead, we use $\frac{1}{n} \sum_{i=1}^n \hat e_i^2$: the mean-squared error (MSE) of the regression.

• Good to compare regressions: if regression A has a lower MSE than B, A is better than B: it makes smaller errors on average.

• However, $\frac{1}{n} \sum_{i=1}^n \hat e_i^2$ is hard to interpret: if it is equal to 10, what does that mean? It does not have a natural scale to which we can compare it.

37

Assessing the quality of our predictions: the $R^2$.

• Instead, we are going to use $R^2 = 1 - \frac{\frac{1}{n} \sum_{i=1}^n \hat e_i^2}{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2}$

• = 1 − MSE / sample variance of the $Y_i$s.

38

The $R^2$ has a natural scale (1/3)

• $\hat e_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$ = error we make when we use the sample OLS regression to predict $Y_i$. We have $Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat e_i$.

• $e_k = y_k - (\beta_0 + \beta_1 x_k)$: population errors. $\hat e_i$: sample errors.

• Slide 25: we have shown that $\frac{1}{N} \sum_{k=1}^N e_k = 0$ and $\frac{1}{N} \sum_{k=1}^N x_k e_k = 0$. Average population error = 0, and average product of the $x_k$s and $e_k$s = 0.

• Similarly, one can show that $\frac{1}{n} \sum_{i=1}^n \hat e_i = 0$ and $\frac{1}{n} \sum_{i=1}^n X_i \hat e_i = 0$. Average sample error = 0, and average product of the $X_i$s and $\hat e_i$s = 0.

• Because of this, one can show that
$\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2 = \frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i + \hat e_i - (\hat\beta_0 + \hat\beta_1 \bar X) - \bar{\hat e} \right)^2$
$= \frac{1}{n} \sum_{i=1}^n \left( (\hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X)) + (\hat e_i - \bar{\hat e}) \right)^2$
$= \frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right)^2 + \frac{1}{n} \sum_{i=1}^n \hat e_i^2$.
That's because $\frac{1}{n} \sum_{i=1}^n \hat e_i = 0$ and $\frac{1}{n} \sum_{i=1}^n X_i \hat e_i = 0$ imply
$\frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right) \left( \hat e_i - \bar{\hat e} \right) = 0$.

39

39

The $R^2$ has a natural scale (2/3)

• Let $\hat e_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$ denote the error we make when we use the sample OLS regression function to predict $Y_i$. We have $Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat e_i$.

• One can show that $\frac{1}{n} \sum_{i=1}^n \hat e_i = 0$ and $\frac{1}{n} \sum_{i=1}^n X_i \hat e_i = 0$. The average sample prediction error is 0, and the average product of the $X_i$s and $\hat e_i$s in the sample is 0.

• Because of this, one can show that
$\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2 = \frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right)^2 + \frac{1}{n} \sum_{i=1}^n (\hat e_i - \bar{\hat e})^2$.

• $\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2$ is the sample variance of the $Y_i$s, $\frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right)^2$ is the sample variance of the $\hat\beta_0 + \hat\beta_1 X_i$s, and $\frac{1}{n} \sum_{i=1}^n \hat e_i^2$ is the MSE.

• The sample variance of the $Y_i$s is equal to the sample variance of $\hat\beta_0 + \hat\beta_1 X_i$, our predictions for $Y_i$, plus the MSE of the regression.

40

40

iClicker time

• One can show that
$\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2 = \frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right)^2 + \frac{1}{n} \sum_{i=1}^n \hat e_i^2$.

• $R^2 = 1 - \frac{\frac{1}{n} \sum_{i=1}^n \hat e_i^2}{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2}$.

• Based on the equality above, and based on its definition, which of the following properties should the number $R^2$ satisfy?

a) $R^2$ must be between 0.5 and 1.
b) $R^2$ must be between 0.5 and 1.5.
c) $R^2$ must be between 0 and 1.

41

The $R^2$ has a natural scale: it must be between 0 and 1 (3/3)

• One has:
$\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2 = \frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right)^2 + \frac{1}{n} \sum_{i=1}^n \hat e_i^2$.

• $R^2 = 1 - \frac{\frac{1}{n} \sum_{i=1}^n \hat e_i^2}{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2}$, so $R^2 \leq 1$.

• Then, using the fact that
$\frac{1}{n} \sum_{i=1}^n \hat e_i^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2 - \frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right)^2$,
one can show that
$R^2 = \frac{\frac{1}{n} \sum_{i=1}^n \left( \hat\beta_0 + \hat\beta_1 X_i - (\hat\beta_0 + \hat\beta_1 \bar X) \right)^2}{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2} \geq 0$.

• => $R^2$ = easily interpretable measure of the quality of our predictions. If close to 1, our predictions make almost no error (MSE close to 0), so excellent prediction. If close to 0, poor prediction.

42
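The variance decomposition and the two readings of $R^2$ can be checked numerically; a minimal sketch (made-up sample, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up sample.
n = 500
X = rng.uniform(0, 10, n)
Y = 5000 - 500 * X + rng.normal(0, 300, n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X
e_hat = Y - Y_hat

mse = np.mean(e_hat ** 2)
var_Y = np.mean((Y - Y.mean()) ** 2)
var_Yhat = np.mean((Y_hat - Y_hat.mean()) ** 2)

# Variance decomposition: var(Y) = var(predictions) + MSE.
assert np.isclose(var_Y, var_Yhat + mse)

r2 = 1 - mse / var_Y
print(round(r2, 3))  # between 0 and 1; close to 1 here since the noise is small
```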

What you need to remember

• The prediction for $y_k$ based on the OLS regression of $y_k$ on a constant and $x_k$ in the population is $\beta_0 + \beta_1 x_k$, with $\beta_0 = \bar y - \beta_1 \bar x$ and $\beta_1 = \frac{\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^N (x_k - \bar x)^2}$.

• We can estimate $(\beta_0, \beta_1)$ if we measure the $y_k$s for a random sample.

• For every $i$ between 1 and $n$, $Y_i$ and $X_i$ = value of the dependent and independent variables of the $i$th unit we randomly select.

• $(\beta_0, \beta_1)$ is the $(b_0, b_1)$ that minimizes $\sum_{k=1}^N (y_k - (b_0 + b_1 x_k))^2$.

• To estimate $(\beta_0, \beta_1)$, find the $(b_0, b_1)$ minimizing $\sum_{i=1}^n (Y_i - (b_0 + b_1 X_i))^2$.

• Yields $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$, and $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$.

• $V(\hat\beta_1) \approx \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}$, and if $n \geq 100$, $\frac{\hat\beta_1 - \beta_1}{\sqrt{\hat V(\hat\beta_1)}}$ follows N(0,1). We can use this to test $\beta_1 = 0$ and get a 95% confidence interval for $\beta_1$.

• $R^2 = 1 - \frac{\frac{1}{n} \sum_{i=1}^n \hat e_i^2}{\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2}$. Close to 1: good prediction. Close to 0: poor prediction.

43

Roadmap

1. The OLS univariate affine regression function.
2. Estimating the OLS univariate affine regression function.
3. Interpreting $\hat\beta_1$.
4. OLS univariate affine regression in practice.

44

A useful reminder: the sample covariance

• Assume we randomly draw a sample of $n$ units from a population, and for each unit we observe the variables $X_i$ and $Y_i$.

• The sample covariance between $X_i$ and $Y_i$ is $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)$.

• Example: $X_i$ = FICO score of the $i$th person, $Y_i$: amount she defaults.

• If $X_i > \bar X$ (person $i$'s FICO > average FICO in the sample):
– If $Y_i > \bar Y$ (amount $i$ defaults > average default in the sample), then $(X_i - \bar X)(Y_i - \bar Y) > 0$,
– If $Y_i < \bar Y$, then $(X_i - \bar X)(Y_i - \bar Y) < 0$.

• If $X_i < \bar X$ (person $i$'s FICO < average FICO in the sample):
– If $Y_i < \bar Y$, then $(X_i - \bar X)(Y_i - \bar Y) > 0$.
– If $Y_i > \bar Y$, then $(X_i - \bar X)(Y_i - \bar Y) < 0$.

• When many people have $X_i > \bar X$ and $Y_i > \bar Y$, and many people have $X_i < \bar X$ and $Y_i < \bar Y$, then $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) > 0$: $X_i$ and $Y_i$ move in the same direction.

• When many people have both $X_i > \bar X$ and $Y_i < \bar Y$, and many people have both $X_i < \bar X$ and $Y_i > \bar Y$, then $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) < 0$: $X_i$ and $Y_i$ move in opposite directions.

45
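The sign logic above is easy to see in code; a minimal sketch (made-up numbers, numpy assumed):

```python
import numpy as np

def sample_cov(X, Y):
    # Sample covariance as defined on the slide: (1/n) sum (X_i - Xbar)(Y_i - Ybar).
    return np.mean((X - X.mean()) * (Y - Y.mean()))

# Made-up illustration: Y falls when X rises (like default vs FICO),
# so the sample covariance is negative.
X = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
Y = np.array([5000.0, 4200.0, 2500.0, 1200.0, 300.0])
print(sample_cov(X, Y) < 0)  # -> True

# And it is positive when X and Y move in the same direction.
print(sample_cov(X, Y[::-1]) > 0)  # -> True
```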

iClicker time

• Let $X_i$ = FICO score of the $i$th person, $Y_i$: amount she defaults.

• Let $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)$ be their sample covariance.

• Which of the two statements sounds the most likely to you?

a) $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) > 0$

b) $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) < 0$

46

In this example, likely $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) < 0$

• Typically, one would expect that people with a FICO score below average default more than the average on their loan.

• Similarly, one would expect that people with a FICO score above average default less than the average on their loan.

• Therefore, we expect that people with $X_i > \bar X$ also have $Y_i < \bar Y$, and people with $X_i < \bar X$ also have $Y_i > \bar Y$.

• Therefore, it is likely that $\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) < 0$.

47

iClicker time

• Go back to the formula we derived for $\hat\beta_1$, the coefficient of $X_i$ in the sample regression of $Y_i$ on a constant and $X_i$.

• Which of the following statements is correct:

a) $\hat\beta_1$ = sample covariance between $X_i$ and $Y_i$ divided by the sample variance of $Y_i$.

b) $\hat\beta_1$ = sample covariance between $X_i$ and $Y_i$ divided by the sample variance of $X_i$.

c) $\hat\beta_1$ = sample variance of $X_i$ divided by the sample covariance between $X_i$ and $Y_i$.

48

$\hat\beta_1$ = sample covariance between $X_i$ and $Y_i$ divided by the sample variance of $X_i$.

• $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$.

• Multiplying the numerator and denominator by $\frac{1}{n}$ yields:
$\hat\beta_1 = \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2}$.

• $\hat\beta_1$ = sample covariance between $X_i$ and $Y_i$ divided by the sample variance of $X_i$.

• Therefore, $\hat\beta_1 > 0$ if $X_i$ and $Y_i$ move in the same direction, $\hat\beta_1 < 0$ if they move in opposite directions.

• In the regression of the amount defaulted on a constant and FICO, do you expect that $\hat\beta_1 > 0$ or $\hat\beta_1 < 0$?

49
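The covariance-over-variance reading of $\hat\beta_1$ can be verified directly; a minimal sketch (made-up default-vs-FICO data, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, 100)
Y = 5000 - 500 * X + rng.normal(0, 300, 100)  # made-up default-vs-FICO data

# Slope from the OLS formula...
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# ...equals the sample covariance divided by the sample variance of X.
cov = np.mean((X - X.mean()) * (Y - Y.mean()))
var = np.mean((X - X.mean()) ** 2)
assert np.isclose(b1, cov / var)
print(b1 < 0)  # -> True: X and Y move in opposite directions here
```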

For now, we can interpret the sign of $\hat\beta_1$, not its specific value.

• For now, we have seen that $\hat\beta_1 > 0$ means that $X_i$ and $Y_i$ move in the same direction, and $\hat\beta_1 < 0$ means that they move in opposite directions.

• Interesting, but this does not tell us how we should interpret a specific value of $\hat\beta_1$.

• For instance, what does $\hat\beta_1 = 0.5$ mean?

• That's what we are going to see now.

50

Interpreting $\hat\beta_1$ when $X_i$ is binary.

51

• Assume you run an OLS regression of $Y_i$ on a constant and $X_i$, where $X_i$ is a binary variable (a variable either equal to 0 or to 1).

• Example: you regress $Y_i$, whether email $i$ is spam, on a constant and $X_i$, a binary variable equal to 1 if the email has the word "free" in it, and to 0 if the email does not contain that word.

• Then, you have shown / will show during the sessions that
$\hat\beta_1 = \frac{1}{n_1} \sum_{i: X_i = 1} Y_i - \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$,
where $n_1$ is the number of units that have $X_i = 1$, $n_0$ is the number of units that have $X_i = 0$, $\sum_{i: X_i = 1} Y_i$ is the sum of the $Y_i$ of all units with $X_i = 1$, and $\sum_{i: X_i = 0} Y_i$ is the sum of the $Y_i$ of all units with $X_i = 0$.

• In the spam example, explain with words what $\frac{1}{n_1} \sum_{i: X_i = 1} Y_i$, $\frac{1}{n_0} \sum_{i: X_i = 0} Y_i$, and $\hat\beta_1$ respectively represent. Discuss this question with your neighbor for one minute.

iClicker time

Assume you regress $Y_i$, whether email $i$ is spam, on a constant and $X_i$, a binary variable equal to 1 if the email has the word "free" in it, and to 0 if the email does not contain that word. You know that
$\hat\beta_1 = \frac{1}{n_1} \sum_{i: X_i = 1} Y_i - \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$.

• Which of the following statements is correct?

a) $\frac{1}{n_1} \sum_{i: X_i = 1} Y_i$ is the percentage of emails that have the word free among the emails that are spams, $\frac{1}{n_0} \sum_{i: X_i = 0} Y_i$ is the percentage of emails that have the word free among the emails that are not spams, so $\hat\beta_1$ is the difference between the percentage of emails that have the word free across spams and non-spams.

b) $\frac{1}{n_1} \sum_{i: X_i = 1} Y_i$ is the percentage of emails that are spams among the emails that have the word free, $\frac{1}{n_0} \sum_{i: X_i = 0} Y_i$ is the percentage of emails that are spams among the emails that do not have the word free, so $\hat\beta_1$ is the difference between the percentage of emails that are spams across emails that have and do not have the word free.

52

52

$\hat\beta_1$ = difference between the % of spams across emails with/without the word free.

• $\sum_{i: X_i = 1} Y_i$ counts the number of spams among emails that have the word free.

• $n_1$ is the number of emails that have the word free.

• Therefore, $\frac{1}{n_1} \sum_{i: X_i = 1} Y_i$: percentage of spams among emails that have the word free.

• Similarly, $\frac{1}{n_0} \sum_{i: X_i = 0} Y_i$: percentage of spams among emails that do not have the word free.

• $\hat\beta_1$ = difference between the % of spams across emails with/without the word free.

• Outside of this example, we have the following, very important result: when you regress $Y_i$ on a constant and $X_i$, where $X_i$ is a binary variable, $\hat\beta_1$ is the difference between the average value of $Y_i$ among units with $X_i = 1$ and among units with $X_i = 0$.

53
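The binary-regressor result above can be checked on a tiny made-up spam sample (numpy assumed): the OLS slope equals the difference between the two group means.

```python
import numpy as np

# Tiny made-up spam sample: X_i = 1 if the email contains "free",
# Y_i = 1 if the email is spam.
X = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
Y = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# OLS slope of Y on a constant and X...
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# ...equals the difference between the two group means of Y.
diff_means = Y[X == 1].mean() - Y[X == 0].mean()
print(b1, diff_means)
assert np.isclose(b1, diff_means)
```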

Testing whether the average of a variable is significantly different between 2 groups.

• When you regress $Y_i$ on a constant and $X_i$, where $X_i$ is a binary variable, $\hat\beta_1$ is the difference between the average value of $Y_i$ among units with $X_i = 1$ and among units with $X_i = 0$ in the sample.

• Similarly, $\beta_1$ is the difference between the average $Y_i$ among units with $X_i = 1$ and among units with $X_i = 0$ in the full population.

• Remember that if $\frac{\hat\beta_1}{\sqrt{\hat V(\hat\beta_1)}} > 1.96$ or $\frac{\hat\beta_1}{\sqrt{\hat V(\hat\beta_1)}} < -1.96$, we can reject at the 5% level the null hypothesis that $\beta_1 = 0$.

• When we reject $\beta_1 = 0$ in a regression of $Y_i$ on a constant and $X_i$, where $X_i$ is a binary variable, we reject the null hypothesis that the average of $Y_i$ is the same among units with $X_i = 1$ and among units with $X_i = 0$ in the full population.

• The difference between the averages of $Y_i$ between the two groups in our sample is unlikely to be due to chance.

• The groups have a significantly different average of $Y_i$ at the 5% level.

54

What about $\hat\beta_0$?

• Assume you run an OLS regression of $Y_i$ on a constant and $X_i$, where $X_i$ is a binary variable (a variable either equal to 0 or to 1).

• Then, you have shown / will show during the sessions that $\hat\beta_0 = \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$.

• $\hat\beta_0$: average of $Y_i$ among units with $X_i = 0$.

• $\hat\beta_1$ is the difference between the average value of $Y_i$ among units with $X_i = 1$ and among units with $X_i = 0$.

• People sometimes call units with $X_i = 0$ the reference category, because $\hat\beta_1$ compares the average value of $Y_i$ among units that do not belong to that reference category to units in that reference category.

• In the spam example, $\hat\beta_0$: percentage of spams among emails that do not have the word free in them; $\hat\beta_1$ = difference between the percentage of spams across emails that have the word free in them and emails that do not have that word.

55

To predict the $Y_j$ of a unit, OLS uses the average $Y_i$ among units with the same $X_i$ as that unit

• Now, let's consider some units $j$ outside of our sample.

• We do not observe their $Y_j$ but we observe their $X_j$.

• Predicted value of $Y_j$ according to the OLS regression: $\hat Y_j = \hat\beta_0 + \hat\beta_1 X_j$.

• $\hat\beta_0 = \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$, and $\hat\beta_1 = \frac{1}{n_1} \sum_{i: X_i = 1} Y_i - \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$.

• So $\hat Y_j = \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$ for units $j$ such that $X_j = 0$.

• And $\hat Y_j = \frac{1}{n_0} \sum_{i: X_i = 0} Y_i + \frac{1}{n_1} \sum_{i: X_i = 1} Y_i - \frac{1}{n_0} \sum_{i: X_i = 0} Y_i = \frac{1}{n_1} \sum_{i: X_i = 1} Y_i$ for units $j$ such that $X_j = 1$.

• To make a prediction for a unit with $X_j = 0$, we use the average $Y_i$ among units with $X_i = 0$ in the sample.

• To make a prediction for a unit with $X_j = 1$, we use the average $Y_i$ among units with $X_i = 1$ in the sample.

• Prediction = average $Y_i$ among units with the same $X_i$ in the sample.

• In the sessions: in the regression of $Y_i$ on a constant, the OLS prediction = average $Y_i$ among units in the sample.

56

For now, we know how to interpret the value of $\hat\beta_1$, but only when $X_i$ is binary.

• When $X_i$ is binary, $\hat\beta_1 = \frac{1}{n_1} \sum_{i: X_i = 1} Y_i - \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$.

• In that special case, $\hat\beta_1$ has a very simple interpretation: difference between the average $Y_i$ among units with $X_i = 1$ and among units with $X_i = 0$.

• In other words, $\hat\beta_1$ measures the difference between the average of $Y_i$ across subgroups whose $X_i$ differs by one (units with $X_i = 1$ versus units with $X_i = 0$).

• Does this result extend to the case where $X_i$ is not binary?

57

$\hat\beta_1$ measures the difference between the average of $Y_i$ across subgroups whose $X_i$ differs by one

58

• When $X_i$ is binary, $\hat\beta_1$ measures the difference between the average of $Y_i$ across subgroups whose $X_i$ differs by one (units with $X_i = 1$ versus $X_i = 0$).

• Now, assume that $X_i$ can be equal to 0, 1, or 2.

• $n_0$: number of units with $X_i = 0$. $n_1$: number of units with $X_i = 1$. $n_2$: number of units with $X_i = 2$.

$\hat\beta_1 = w \left( \frac{1}{n_1} \sum_{i: X_i = 1} Y_i - \frac{1}{n_0} \sum_{i: X_i = 0} Y_i \right) + (1 - w) \left( \frac{1}{n_2} \sum_{i: X_i = 2} Y_i - \frac{1}{n_1} \sum_{i: X_i = 1} Y_i \right)$,

where $w$ is a number between 0 and 1 that you don't need to know.

• $\hat\beta_1$: weighted average of the difference between the average $Y_i$ of units with $X_i = 1$ and $X_i = 0$, and of the difference between the average $Y_i$ of units with $X_i = 2$ and $X_i = 1$.

• Units with $X_i = 1$ and $X_i = 0$ have a value of $X_i$ that differs by one.

• Units with $X_i = 2$ and $X_i = 1$ have a value of $X_i$ that differs by one.

• => $\hat\beta_1$ measures the difference between the average of $Y_i$ across subgroups whose $X_i$ differs by one!

$\hat\beta_1$ measures the difference between the average of $Y_i$ across subgroups whose $X_i$ differs by one

59

• When $X_i$ is binary, $\hat\beta_1$ measures the difference between the average of $Y_i$ across subgroups whose $X_i$ differs by one (units with $X_i = 1$ versus $X_i = 0$).

• Now, assume that $X_i$ can be equal to 0, 1, 2, …, K.

• $n_0$: number of units with $X_i = 0$, $n_1$: number of units with $X_i = 1$, …, $n_K$: number of units with $X_i = K$.

$\hat\beta_1 = \sum_{k=1}^K w_k \left( \frac{1}{n_k} \sum_{i: X_i = k} Y_i - \frac{1}{n_{k-1}} \sum_{i: X_i = k-1} Y_i \right)$,

where the $w_k$: positive weights summing to 1 that you do not need to know.

• $\hat\beta_1$: weighted average of the difference between the average $Y_i$ of units with $X_i = 1$ and $X_i = 0$, of the difference between the average $Y_i$ of units with $X_i = 2$ and $X_i = 1$, …, of the difference between the average $Y_i$ of units with $X_i = K$ and $X_i = K - 1$.

• Units with $X_i = 1$ and $X_i = 0$ have a value of $X_i$ that differs by one.

• Units with $X_i = K$ and $X_i = K - 1$ have a value of $X_i$ that differs by one.

• $\hat\beta_1$ = difference between the average of $Y_i$ across subgroups whose $X_i$ differs by 1!

Logs versus levels

• Assume you regress $Y_i$ on a constant and $X_i$, and $\hat\beta_1 = 0.5$: when you compare people whose $X_i$ differs by 1, the average $Y_i$ is 0.5 larger among people whose $X_i$ is 1 unit larger.

• Assume you regress $\ln(Y_i)$ on a constant and $X_i$, and $\hat\beta_1 = 0.5$: when you compare people whose $X_i$ differs by 1, the average $\ln(Y_i)$ is 0.5 larger among people whose $X_i$ is 1 unit larger.

• Due to the properties of the ln function, if people whose $X_i$ is 1 unit larger have an average $\ln(Y_i)$ 0.5 larger, the average of $Y_i$ is approximately 50% larger among those people.

• Assume you regress $\ln(Y_i)$ on a constant and $\ln(X_i)$, and $\hat\beta_1 = 0.5$: when you compare people whose $X_i$ differs by 1%, the average $Y_i$ is 0.5% larger among people whose $X_i$ is 1% larger.

• Regressing $Y_i$ on a constant and $X_i$ is useful to study how the mean of $Y_i$ differs in levels across units whose $X_i$ differs by one.

• Regressing $\ln(Y_i)$ on a constant and $X_i$ is useful to study how the mean of $Y_i$ differs in relative terms across units whose $X_i$ differs by one.

• Regressing $\ln(Y_i)$ on a constant and $\ln(X_i)$ is useful to study how the mean of $Y_i$ differs in relative terms across units whose $X_i$ differs by 1%.

60
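The "log difference ≈ percentage difference" reading above relies on the approximation $\ln(1+g) \approx g$ for small $g$; a minimal numerical sketch of how accurate it is:

```python
import math

# A 0.5 log-point gap corresponds exactly to exp(0.5) - 1, about 65%, in
# relative terms; the "50%" reading is the small-change approximation
# ln(1+g) ~ g, which is accurate for small coefficients.
print(math.exp(0.5) - 1)   # exact relative difference for a 0.5 log gap
print(math.exp(0.05) - 1)  # for small gaps, close to the coefficient itself
```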

iClicker time

• Assume you observe the wages of a sample of wage earners in the US. You regress $Y_i$, the monthly wage of person $i$, on a constant and $X_i$, a binary variable equal to 1 if $i$ is a female and to 0 if $i$ is a male. Assume that you find $\hat\beta_1 = -200$ and $\hat\beta_0 = 2000$.

• Which of the following statements is correct?

a) In this sample, the average wage of females is 200 dollars higher than the average wage of males, and the average wage of females is 2000 dollars.

b) In this sample, the average wage of females is 200 dollars lower than the average wage of males, and the average wage of males is 2000 dollars.

61

Average wage of females is 200 dollars lower than average wage of males.

• X_i binary: X_i = 0 for males, X_i = 1 for females.
• β̂₁ = (1/n₁) Σ_{i: X_i = 1} Y_i − (1/n₀) Σ_{i: X_i = 0} Y_i. Therefore, β̂₁ = difference between the average wage of females and males.
• β̂₁ = −200 means that females make 200 dollars less than males on average.
• β̂₀ = (1/n₀) Σ_{i: X_i = 0} Y_i. Therefore, β̂₀ = average wage of males.
• β̂₀ = 2000 means that males make 2000 dollars on average.

62
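The two identities above (β̂₁ = gap in group means, β̂₀ = male average) can be checked numerically. Below is a minimal sketch, not part of the course material, using NumPy with simulated data; all variable names and the simulated numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: X_i = 1 for females, X_i = 0 for males
x = rng.integers(0, 2, size=1000).astype(float)
y = 2000.0 - 200.0 * x + rng.normal(0.0, 100.0, size=1000)

# beta_1_hat = sample covariance(X, Y) / sample variance(X)
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
# beta_0_hat = Ybar - beta_1_hat * Xbar
b0 = y.mean() - b1 * x.mean()

# With a binary X these equal the group statistics exactly
print(np.isclose(b1, y[x == 1].mean() - y[x == 0].mean()))  # True
print(np.isclose(b0, y[x == 0].mean()))                     # True
```

With a binary regressor the equalities are exact, not approximate, so the check holds for any simulated sample.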

iClicker time

• Assume you observe the wages of a sample of 5,000 wage earners in the US. You regress Y_i, the monthly wage of person i, on a constant and X_i, a binary variable equal to 1 if i is a female and to 0 if i is a male. Assume that Eviews or Stata tells you that β̂₁ = −200 and V̂(β̂₁) = 20. Which of the following statements is correct?
a) In this sample, the average wage of females is 200 dollars lower than the average wage of males, and the difference between the average wage of the two groups is statistically significant at the 5% level.
b) In this sample, the average wage of females is 200 dollars lower than the average wage of males, and the difference between the average wage of the two groups is not statistically significant at the 5% level.

63

|β̂₁ / √V̂(β̂₁)| > 1.96, so we reject β₁ = 0 at 5%

• β̂₁ / √V̂(β̂₁) = −200/√20 ≈ −44.7 < −1.96, so we reject β₁ = 0 at the 5% level.
• The difference between the average wage of males and females is statistically significant at the 5% level.
• If males and females had the same average wage in the US population, it would be very unlikely (less than 5% chance) that we drew a random sample, fairly different from the US population, where males' average wage is 200 dollars higher than that of females.
• Given that our random sample is quite large (5,000 people), the fact that in our sample the average wage of males is 200 dollars higher than that of females indicates that in the US population, males also have a higher average wage than females.

64
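The rejection rule above is one line of arithmetic; a quick sketch in Python, plugging in the numbers from the slide:

```python
import math

beta1_hat = -200.0   # estimated coefficient on the female dummy
var_beta1 = 20.0     # estimated variance of beta1_hat

t_stat = beta1_hat / math.sqrt(var_beta1)
print(round(t_stat, 1))    # -44.7
print(abs(t_stat) > 1.96)  # True: reject beta_1 = 0 at the 5% level
```

The statistic is far beyond the ±1.96 cutoff, which is why the rejection is so clear-cut here.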

iClicker time

• Assume you observe the wages of a sample of wage earners in the US. You regress Y_i, the monthly wage of person i, on a constant and X_i, a binary variable equal to 1 if i is a female and to 0 if i is a male. Assume that you find β̂₁ = −200 and β̂₀ = 2000.
• Which of the following statements is correct?
a) To predict the wage of a female not in the sample, this regression model will use the average wage of females in the sample.
b) To predict the wage of a female not in the sample, this regression model will use the average wage of males and females in the sample.

65

To predict wage of a female not in sample, regression uses average wage of females in sample.

• Now, let's consider some units outside of our sample => we do not observe their Y.
• Predicted value of Y according to the OLS regression: β̂₀ + β̂₁x.
• Given that the unit is a female, x = 1, so the predicted wage is β̂₀ + β̂₁.
• β̂₀ = (1/n₀) Σ_{i: X_i = 0} Y_i, and β̂₁ = (1/n₁) Σ_{i: X_i = 1} Y_i − (1/n₀) Σ_{i: X_i = 0} Y_i, so
β̂₀ + β̂₁ = (1/n₀) Σ_{i: X_i = 0} Y_i + (1/n₁) Σ_{i: X_i = 1} Y_i − (1/n₀) Σ_{i: X_i = 0} Y_i = (1/n₁) Σ_{i: X_i = 1} Y_i.
• Predicted wage: average wage of females in the sample.

66

iClicker time

• Assume you observe the wages of a sample of wage earners in the US. You regress ln(Y_i), the ln(monthly wage) of person i, on a constant and X_i, a binary variable equal to 1 if i is a female and to 0 if i is a male. Assume that you find β̂₁ = −0.1.
• Which of the following statements is correct?

67

Average wage of females is 10% lower than average wage of males.

• X_i binary: X_i = 0 for males, X_i = 1 for females.
• β̂₁ = (1/n₁) Σ_{i: X_i = 1} ln(Y_i) − (1/n₀) Σ_{i: X_i = 0} ln(Y_i). Therefore, β̂₁ = difference between the average ln(wage) of females and males.
• β̂₁ = −0.1 means that the average ln(wage) of females is 0.1 lower than the average ln(wage) of males.
• As we discussed a few slides ago, using some properties of the ln function, one can show that this implies that the average wage of females is approximately 10% lower than the average wage of males.

68
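The "0.1 lower in logs ≈ 10% lower" step rests on the approximation ln(1 + z) ≈ z for small z; the exact relative gap implied by a −0.1 difference in logs is exp(−0.1) − 1 ≈ −9.5%. A one-line check in Python (the −0.1 is the slide's number; everything else is just the approximation vs. the exact value):

```python
import math

gap_in_logs = -0.1                 # difference in average ln(wage), from the slide

approx = gap_in_logs               # ln(1 + z) ≈ z reading: -10%
exact = math.exp(gap_in_logs) - 1  # exact relative difference in levels
print(round(100 * approx, 1))      # -10.0
print(round(100 * exact, 1))       # -9.5
```

For coefficients near zero the two readings are close, which is why the slide's shortcut is harmless; for large coefficients (say 0.5) the gap between them becomes substantial.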

iClicker time

• Assume you observe the wages of a sample of wage earners in the US. You regress Y_i, the monthly wage of person i, on a constant and X_i, their number of years of professional experience (from 0 for people who just started working to 50 for people who have worked for 50 years). Assume that you find β̂₁ = 100.
• Which of the following statements is correct?
a) When we compare people whose years of experience differ by one, we find that on average, those who have one more year of experience earn 100 more dollars per month.
b) The covariance between years of experience and wage is equal to 100.
c) The covariance between years of experience and wage divided by the variance of years of experience is equal to 100.

69

Answers a) and c) both correct

• X can be equal to 0 (no experience), 1, 2, …, 50.
• Let n₀ be the number of units with X = 0 (no experience), …, let n₅₀ be the number of units with X = 50 (50 years of experience).

β̂₁ = Σ_{k=1}^{50} w_k [ (1/n_k) Σ_{i: X_i = k} Y_i − (1/n_{k−1}) Σ_{i: X_i = k−1} Y_i ],

where the w_k are positive weights summing to 1 that you do not need to know.
• β̂₁: weighted average of the difference between the average wage of people with 1 and 0 years of experience, of the difference between the average wage of people with 2 and 1 years of experience, …, of the difference between the average wage of units with 50 and 49 years of experience.
• β̂₁ = 100 means that when we compare people whose years of experience differ by one, we find that on average, those who have one more year of experience earn 100 more dollars per month.
• Answer c) also correct. However, the ratio of covariance and variance is hard to interpret, while the average difference in wages of people with one year of difference in their experience is easy to interpret.

70

iClicker time

• Assume you observe the wages of a sample of wage earners in the US. You regress ln(Y_i), the ln(monthly wage) of person i, on a constant and ln(X_i), the ln(number of years of professional experience) of that person. Assume that you find β̂₁ = 0.5.
• Which of the following statements is correct?
a) When we compare people whose years of experience differ by one, we find that on average, those who have one more year of experience earn 50% more.
b) When we compare people whose years of experience differ by 1%, we find that on average, those who have 1% more years of experience earn 0.5% more.

71

Answer b) correct

• We regress ln(Y_i), the ln(monthly wage) of person i, on a constant and ln(X_i), the ln(number of years of professional experience) of that person.
• Because ln(X_i), and not X_i, is in the regression, β̂₁ does not compare subgroups whose experience differs by one year, but subgroups whose experience differs by 1%!
• In this sample, when we compare subgroups of people whose years of experience differ by 1%, we find that on average, those who have 1% more years of experience earn 0.5% more.

72

What you need to remember

• β̂₁ = sample covariance between X and Y divided by sample variance of X.
• β̂₁ > 0 (resp. β̂₁ < 0): covariance between X and Y positive (resp. negative): X and Y positively (resp. negatively) correlated, move in the same (resp. opposite) direction.
• When X binary, β̂₁ = (1/n₁) Σ_{i: X_i = 1} Y_i − (1/n₀) Σ_{i: X_i = 0} Y_i: difference between the average of Y among subgroups whose X differs by one (units with X = 1 versus units with X = 0).
• When X not binary, β̂₁ still measures the difference between the average of Y among subgroups whose X differs by one.
• You need to know how to interpret β̂₁ in a regression of Y on a constant and X, in a regression of ln(Y) on a constant and X, and in a regression of ln(Y) on a constant and ln(X).

73

Roadmap

1. The OLS univariate affine regression function.

2. Estimating the OLS univariate affine regression function.

3. Interpreting β̂₁.

4. OLS univariate affine regression in practice.

74

How Gmail uses OLS univariate affine regression

• Gmail wants to predict y_k: 1 if email k is spam, 0 otherwise.
• To do so, use x_k: 1 if "free" appears in email k, 0 otherwise.
• x_k easy to measure (a computer can do it automatically, by searching for "free" in the email), but y_k is hard to measure: only a human can know whether an email is a spam or not. => cannot observe y_k for all emails.
• To make good predictions, would like to compute (β₀, β₁), the value of (b₀, b₁) minimizing Σ_{k=1}^{N} (y_k − (b₀ + b₁x_k))², and then use β₀ + β₁x_k to predict y_k. β₀ + β₁x_k: affine function of x_k for which the sum of squared prediction errors (y_k − (b₀ + b₁x_k))² is minimized.
• Issue: β₀ = ȳ − β₁x̄, and β₁ = Σ_{k=1}^{N} (x_k − x̄)(y_k − ȳ) / Σ_{k=1}^{N} (x_k − x̄)² => Gmail cannot compute these numbers because it does not observe the y_k.

75

How Gmail uses OLS univariate affine regression

• Instead, Gmail draws a random sample of, say, 5000 emails, and asks humans to read them and determine whether they are spams or not.
• For i between 1 and 5000, Y_i: whether the ith randomly drawn email is a spam, X_i: whether the ith randomly drawn email has "free" in it.
• (β₀, β₁) is the value of (b₀, b₁) minimizing Σ_{k=1}^{N} (y_k − (b₀ + b₁x_k))².
• To estimate (β₀, β₁): use the (b₀, b₁) minimizing Σ_{i=1}^{n} (Y_i − (b₀ + b₁X_i))².
• Yields β̂₀ = Ȳ − β̂₁X̄ and β̂₁ = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^{n} (X_i − X̄)².
• For emails not in the sample, we do not know if they are spams, but we use β̂₀ + β̂₁x as our prediction of whether the email is a spam or not.
• Because the random sample of emails is large, β̂₀ and β̂₁ should be close to β₀ and β₁, and therefore β̂₀ + β̂₁x should be close to β₀ + β₁x, the best univariate affine prediction of y given x.
• Use R² to assess whether the regression makes good predictions.

76
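The sample-then-estimate pipeline above can be sketched in a few lines of Python. The spam rates and the simulated sample below are made up for illustration; the two formulas are the ones on the slide, and with a binary regressor β̂₁ equals the difference in spam rates between emails with and without "free".

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000  # size of the human-labeled random sample

# Hypothetical labeled sample: X_i = 1 if "free" appears, Y_i = 1 if spam
x = rng.integers(0, 2, size=n).astype(float)
y = (rng.random(n) < np.where(x == 1, 0.6, 0.3)).astype(float)  # assumed spam rates

# Sample OLS formulas from the slide
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

def predict(x_new):
    """Prediction for an email outside the sample."""
    return b0 + b1 * x_new

print(round(b0, 2), round(b1, 2))
```

Out-of-sample emails are then scored with `predict`, exactly as the slide describes for Gmail.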

Application to a data set of 4601 emails

• 4601 emails which have been read by humans. Variable spam =

1 if email = spam, 0 otherwise.

• We have another variable: number of times the word “free”

appears in the email/number of words in the email *100. Ranges

from 0 to 100: percentage points.

• We go to Eviews and write “ls spam c percent_word_free”.

77

Dependent Variable: SPAM

Method: Least Squares

Date: 04/26/17 Time: 15:56

Sample: 1 4601

Included observations: 4601

Variable Coefficient Std. Error t-Statistic Prob.

PERCENT_WORD_FREE 0.201984 0.023411 8.627873 0.0000

C 0.372927 0.007555 49.35958 0.0000

R-squared 0.015928 Mean dependent var 0.394045

Adjusted R-squared 0.015714 S.D. dependent var 0.488698

S.E. of regression 0.484843 Akaike info criterion 1.390450

Sum squared resid 1081.098 Schwarz criterion 1.393247

Log likelihood -3196.730 Hannan-Quinn criter. 1.391434

F-statistic 74.44020 Durbin-Watson stat 0.032029

Prob(F-statistic) 0.000000

Interpretation of β̂₁

• β̂₀ = 0.37 and β̂₁ = 0.20. Interpretation of β̂₁: when we compare emails whose percentage of words that are the word "free" differs by 1 point, the percentage of spams is 20 points higher among emails whose percentage of the word "free" is 1 point higher.
• Emails where the word "free" appears more often are more likely to be spams!

78

Using β̂₀ and β̂₁ to make predictions

• β̂₀ = 0.37 and β̂₁ = 0.20. Assume you consider two emails outside of your sample, and therefore you do not know whether they are spams or not.
• In one email, the word "free" = 0% of the words of the email; in the other one, the word "free" = 1% of the words of the email.
• According to the OLS affine regression function, what is your prediction for the first email being a spam? What is your prediction for the second email being a spam? Discuss this question with your neighbor for 2 minutes.

79

iClicker time

• β̂₀ = 0.37 and β̂₁ = 0.20. Assume you consider two emails, one where the word "free" = 0% of the words of the email, the other one where the word "free" = 1% of the words of the email.
• According to the OLS affine regression function, what is your prediction for the first email being a spam? What is your prediction for the second email being a spam?
a) The predicted value for the first email being a spam is 0.37, while the predicted value for the second email being a spam is 0.372.
b) The predicted value for the first email being a spam is 0.37, while the predicted value for the second email being a spam is 0.57.

80

Predicted value for 1st email being spam is 0.37, predicted value for 2nd email being spam is 0.57.

• β̂₀ = 0.37 and β̂₁ = 0.20. Assume you consider two emails, one where the word "free" = 0% of the words of the email, the other one where the word "free" = 1% of the words of the email.
• According to the OLS affine regression function, what is your prediction for the first email being a spam? What is your prediction for the second email being a spam?
• According to this regression, the predicted value for whether an email is a spam is β̂₀ + β̂₁x, where x is the number of times "free" appears in the email / number of words in the email × 100.
• For the first email, x = 0 => predicted value = 0.37.
• For the second email, x = 1 => predicted value = 0.37 + 0.20 = 0.57.

81
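The two predictions can be reproduced directly from the rounded estimates; a quick sketch in Python (the argument name `percent_word_free` is hypothetical, not from the data set):

```python
b0, b1 = 0.37, 0.20  # rounded estimates from the Eviews output

def predict_spam(percent_word_free):
    # Predicted value from the fitted affine regression function
    return b0 + b1 * percent_word_free

print(round(predict_spam(0.0), 2))  # 0.37
print(round(predict_spam(1.0), 2))  # 0.57
```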

Testing β₁ = 0.

• β̂₁ = 0.20, and √V̂(β̂₁) = 0.023.
• Can we reject at the 5% level the null hypothesis that β₁ = 0? Discuss this question with your neighbor for 1 minute.

82

iClicker time

• β̂₁ = 0.20, and √V̂(β̂₁) = 0.023.
• Can we reject at the 5% level the null hypothesis that β₁ = 0?
a) Yes
b) No

83

Yes!

• If we want to have 5% chances of wrongly rejecting β₁ = 0, the test is:
Reject β₁ = 0 if β̂₁/√V̂(β̂₁) > 1.96 or β̂₁/√V̂(β̂₁) < −1.96.
Otherwise, do not reject β₁ = 0.
• Here, β̂₁/√V̂(β̂₁) = 0.20/0.023 ≈ 8.7 > 1.96 => we can reject β₁ = 0.
• The percentage of the words of the email that are the word "free" is a statistically significant predictor of whether the email is a spam or not!
• Find the 95% confidence interval for β₁. You have 2 minutes.

84

iClicker time

• β̂₁ = 0.20, and √V̂(β̂₁) = 0.023.
• The 95% confidence interval for β₁ is:
a) [0.155, 0.245]
b) [0.143, 0.228]

85

95% confidence interval for β₁ is [0.155, 0.245]

• β̂₁ = 0.20, and √V̂(β̂₁) = 0.023.
• The 95% confidence interval for β₁ is [β̂₁ − 1.96 √V̂(β̂₁), β̂₁ + 1.96 √V̂(β̂₁)].
• Plugging in the values of β̂₁ and √V̂(β̂₁) yields [0.155, 0.245].

86
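Plugging the rounded slide values into the confidence-interval formula reproduces the interval; a minimal Python check:

```python
beta1_hat = 0.20  # rounded coefficient on the "free" percentage
se = 0.023        # rounded standard error, i.e. sqrt of V(beta1_hat)

lower = beta1_hat - 1.96 * se
upper = beta1_hat + 1.96 * se
print(round(lower, 3), round(upper, 3))  # 0.155 0.245
```

Using the unrounded Eviews numbers (0.201984 and 0.023411) shifts the interval only slightly.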

iClicker time

• Does the regression have a low or a high R-squared?
a) It has a low R-squared.
b) It has a high R-squared.

87

Dependent Variable: SPAM

Method: Least Squares

Date: 04/26/17 Time: 15:56

Sample: 1 4601

Included observations: 4601

Variable Coefficient Std. Error t-Statistic Prob.

PERCENT_WORD_FREE 0.201984 0.023411 8.627873 0.0000

C 0.372927 0.007555 49.35958 0.0000

R-squared 0.015928 Mean dependent var 0.394045

Adjusted R-squared 0.015714 S.D. dependent var 0.488698

S.E. of regression 0.484843 Akaike info criterion 1.390450

Sum squared resid 1081.098 Schwarz criterion 1.393247

Log likelihood -3196.730 Hannan-Quinn criter. 1.391434

F-statistic 74.44020 Durbin-Watson stat 0.032029

Prob(F-statistic) 0.000000

Our regression has a low R²

• The R² of the regression is equal to 0.016.
• R² is included between 0 and 1. Close to 0: bad predictions. Close to 1: good predictions.
• Here, R² is close to 0 => bad predictions.

88

If we use this regression to construct a spam filter, filter will be pretty bad.

• We can compute β̂₀ + β̂₁X_i for each email in our sample.
• 39% of those 4601 emails are spams => we could say: we predict that the 39% of emails with the highest value of β̂₀ + β̂₁X_i are spams, while the other emails are not spams.
• We can look at how this spam filter performs in our sample.
• Among the non-spams, we correctly predict that 85% are not spams, but we wrongly predict that 15% are spams.
• Among the spams, we correctly predict that 35% are spams, but we wrongly predict that 65% are non-spams.
• => if Gmail used this spam filter, you would receive many spams, Gmail would send many true emails to your trash, and you would change your email account to Microsoft.
• In the homework, you will see how to construct a better spam filter.

89

What you need to remember, and what's next

• In practice, there are many instances where we can measure the y_k, the variable we do not observe for everyone, for a sample of the population.
• We can use that sample to compute β̂₀ and β̂₁, and then use β̂₀ + β̂₁x as our prediction of the y we do not observe.
• If that sample is a random sample from the population, β̂₀ + β̂₁x should be close to β₀ + β₁x, the best affine prediction for y.
• But the univariate affine regression might still not give great predictions: spam example.
• There are better prediction methods available. Next lectures: we see one of them.

90


OLS multivariate regression.

Clement de Chaisemartin and Doug Steigerwald

UCSB

1

Banks have more than 1 variable to predict

amount applicants will fail to reimburse

• To predict the amount applicants will fail to

reimburse, banks can use their FICO score (score

based on their current debts and on their history of

loan repayments), and all other variables contained

in their application: e.g. their income.

• Will bank be able to make better predictions by using

both variables rather than just FICO score?

2

Yes, provided people with different incomes but same FICO fail to reimburse different amounts on their loan.

• Assume FICO score can take only two values: 0 and 100.

• Assume applicants’ income can take two values: 2000 and

4000.

• If average amount people fail to reimburse varies with FICO

and income as in table below, adding applicant’s income to

model improves prediction.

• People with different income levels but with same FICO score

fail to reimburse different amounts on their loan => adding

income to your prediction model will improve quality of your

predictions.

3

Income=2000 Income=4000

FICO=0 2000 1000

FICO=100 500 200

Gmail has more than 1 variable to predict

whether an email is a spam.

• To predict whether email is spam, Gmail can use variable

equal to 1 if “free” appears in email, and variable equal to 1 if

“buy” in email.

• If percentage of spams varies as in table below, adding the

“buy” variable to the model will improve prediction.

• Emails which have “buy” and “free” in it are more likely to be

spams than emails which only have “free” in it.

• Emails which have "buy" but not "free" in it are more likely to be spams than emails which have neither "buy" nor "free" in it.

• => adding “buy” variable will improve predictions. 4

% of spams Email has “buy” in it Email doesn’t have “buy” in it

Email has “free” in it 3% 1.5%

Email doesn’t have “free” in it 1% 0.5%

Multivariate regression

• In these lectures, we are going to discuss OLS

multivariate regressions, which are OLS regressions

with several independent variables to predict a

dependent variable.

5

Roadmap

1. The OLS multivariate regression function.

2. Estimating the OLS multivariate regression function.

3. Advantages and pitfalls of multivariate regressions.

4. Interpreting coefficients in multivariate OLS regressions.

6

Set up and notation.

• We consider a population of N units.
– N could be the number of people who apply for a loan in bank A during May 2017.
• Each unit k has a variable y_k attached to it that we do not observe:
– In the loan example, y_k is the amount applicant k will fail to reimburse on her loan when her loan expires in May 2018.
• Each unit also has J variables x_{1k}, x_{2k}, x_{3k}, …, x_{Jk} attached to it that we do observe:
– In the loan example, x_{1k} could be the FICO score of applicant k, x_{2k} could be the income of that applicant, etc.

7

Prediction = function of x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}.

• Based on the value of x_{1k}, x_{2k}, x_{3k}, …, x_{Jk} of each unit, we want to predict her y_k.
• E.g.: in the loan example, we want to predict the amount that unit k will fail to repay based on her FICO score and her income.
• Prediction should be a function of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}).
• In these lectures, we focus on predictions which are affine functions of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}) = c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}, for real numbers c₀, c₁, …, c_J.

8

Prediction error: y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk})

• Based on the value of x_{1k}, …, x_{Jk} of each unit, we predict her y_k.
• Our prediction should be a function of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}).
• We focus on affine functions of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}) = c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}, for J + 1 real numbers c₀, c₁, …, c_J.
• y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}), the difference between the prediction for y_k and the actual value of y_k, is the prediction error.
• A large positive or negative y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}) means a bad prediction.
• y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}) close to 0 means a good prediction.

9

Goal: find the value of (c₀, c₁, …, c_J) that minimizes Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))²

• Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² is positive.
=> minimizing it = making it as close to 0 as possible.
• If Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² is as close to 0 as possible, it means that the sum of the squared values of our prediction errors is as small as possible.
• => we make small errors. That's good, that's what we want!

10

The OLS multivariate regression function

• Let (γ₀, γ₁, …, γ_J) = argmin_{(c₀, c₁, …, c_J)} Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))².
• (γ₀, γ₁, …, γ_J): value of (c₀, c₁, …, c_J) minimizing Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))².
• We call γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk} the OLS multivariate regression function of y on a constant, x₁, x₂, x₃, …, x_J.
• We let ŷ_k = γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk} denote the prediction from the multivariate OLS regression.
• We let e_k = y_k − ŷ_k: prediction error.
• We have y_k = ŷ_k + e_k.

11

How can we find (γ₀, γ₁, …, γ_J)?

• (γ₀, γ₁, …, γ_J): value of (c₀, c₁, …, c_J) minimizing Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))².
• To minimize a function of several variables, we differentiate it wrt each of those variables, and we find the value of (c₀, c₁, …, c_J) for which all those derivatives are equal to 0. No need to worry about second derivatives because the objective function is convex.
• What is the derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² with respect to c₀? Discuss this question with your neighbor for 2 minutes.

12

iClicker time

• What is the derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² with respect to c₀?
a) −2 Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))
b) 2 Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))
c) Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))

13

• The derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² with respect to c₀ is
−2 Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk})): P4Sum + chain rule.
• What is the derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² with respect to c₁? Discuss this question with your neighbor for 2 minutes.

14

iClicker time

• What is the derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² with respect to c₁?
a) −2 Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))
b) −2 Σ_{k=1}^{N} x_{1k} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))

15

−2 Σ_{k=1}^{N} x_{1k} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))

• Derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² wrt c₁:
−2 Σ_{k=1}^{N} x_{1k} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk})): P4Sum + chain rule.
• Derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² wrt c₂:
−2 Σ_{k=1}^{N} x_{2k} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk})).
• …
• Derivative of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² wrt c_J:
−2 Σ_{k=1}^{N} x_{Jk} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk})).

16

(γ₀, γ₁, …, γ_J) is the solution of a system of J + 1 equations with J + 1 unknowns.

• (γ₀, γ₁, …, γ_J): value of (c₀, c₁, …, c_J) for which all those derivatives are equal to 0.
• Thus, we have:
−2 Σ_{k=1}^{N} (y_k − (γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk})) = 0
−2 Σ_{k=1}^{N} x_{1k} (y_k − (γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk})) = 0
…
−2 Σ_{k=1}^{N} x_{Jk} (y_k − (γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk})) = 0
• (γ₀, γ₁, …, γ_J) is the solution of a system of J + 1 equations with J + 1 unknowns.
• If we give the values of the y_k s, of the x_{1k}s, …, of the x_{Jk}s to a computer, it can solve this system and give us the value of (γ₀, γ₁, …, γ_J).

17

What you need to remember

• Population of N units. Each unit has J + 1 variables attached to it: y_k is a variable we do not observe, x_{1k}, x_{2k}, x_{3k}, …, x_{Jk} are variables we observe. We want to predict y_k based on x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}.
• E.g.: a bank wants to predict the amount an applicant will fail to reimburse on her loan based on her FICO score and her income.
• Our prediction should be a function of x_{1k}, …, x_{Jk}: f(x_{1k}, …, x_{Jk}). Affine functions of x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}: c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}, for some real numbers (c₀, c₁, …, c_J).
• A good prediction should be such that e_k = y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}), our prediction error, is as small as possible for most units.
• Best (c₀, c₁, …, c_J): minimizes Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))².
• We call that value (γ₀, γ₁, …, γ_J). γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk} is the OLS regression function of y on a constant, x₁, x₂, x₃, …, x_J.
• (γ₀, γ₁, …, γ_J): solution of a system of J + 1 equations with J + 1 unknowns: the derivatives of Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))² wrt c₀, c₁, …, c_J must equal 0 at (γ₀, γ₁, …, γ_J).

18

Roadmap

1. The OLS multivariate regression function.

2. Estimating the OLS multivariate regression function.

3. Advantages and pitfalls of multivariate regressions.

4. Interpreting coefficients in multivariate OLS regressions.

19

We cannot compute (γ₀, γ₁, …, γ_J)

• Our prediction for y_k based on a multivariate regression is γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk}, the OLS multivariate regression function.
• => to be able to make a prediction for a unit's y_k based on her x_{1k}, x_{2k}, x_{3k}, …, x_{Jk}, we need to know the value of (γ₀, γ₁, …, γ_J).
• Under the assumptions we have made so far, we cannot compute (γ₀, γ₁, …, γ_J). Solution of
−2 Σ_{k=1}^{N} (y_k − (γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk})) = 0
−2 Σ_{k=1}^{N} x_{1k} (y_k − (γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk})) = 0
…
−2 Σ_{k=1}^{N} x_{Jk} (y_k − (γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk})) = 0
• To solve this system, we need to know the y_k s, which we don't!
• E.g.: the bank knows the FICO score and income (x_{1k} and x_{2k}) for each applicant, but does not know the amount each applicant will fail to reimburse in April 2018 when the loan expires (y_k).

20

A method to estimate (γ₀, γ₁, …, γ_J)

• We draw n units from the population, and we measure the dependent and the independent variables of those n units.
• For i between 1 and n, (Y_i, X_{1i}, …, X_{Ji}) = value of the dependent and independent variables of the ith unit we randomly select.
• (γ₀, γ₁, …, γ_J): value of (c₀, c₁, …, c_J) minimizing Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))².
• => to estimate (γ₀, γ₁, …, γ_J), we use the (c₀, c₁, …, c_J) minimizing Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji}))².
• (γ̂₀, γ̂₁, …, γ̂_J) denotes that value.
• Instead of the (c₀, c₁, …, c_J) minimizing the sum of squared errors in the population, use the (c₀, c₁, …, c_J) minimizing the sum of squared errors in the sample.
• If we find a good prediction function in the sample, it should also work well in the entire population: the sample is representative of the population.

21

The OLS regression function in the sample.

• Let (γ̂₀, γ̂₁, …, γ̂_J) = argmin_{(c₀, c₁, …, c_J) ∈ R^{J+1}} Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji}))².
• We call γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji} the OLS regression function of Y on a constant, X₁, …, and X_J in the sample.
• (γ̂₀, γ̂₁, …, γ̂_J): coefficients of the constant, X₁, …, and X_J.
• Let Ŷ_i = γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji}. Ŷ_i is the predicted value for Y_i according to the OLS regression function of Y on a constant, X₁, …, and X_J in the sample.
• Let ê_i = Y_i − Ŷ_i: error we make when we use the OLS regression in the sample to predict Y_i.
• We have Y_i = Ŷ_i + ê_i.

22

How can we find (γ̂₀, γ̂₁, …, γ̂_J)?

• (γ̂₀, γ̂₁, …, γ̂_J): value of (c₀, c₁, …, c_J) minimizing Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji}))².
• To minimize a function of several variables, we differentiate it wrt each of those variables, and we find the value of (c₀, c₁, …, c_J) for which all those derivatives are equal to 0. No need to worry about second derivatives because the objective function is convex.

23

The derivatives of the objective function

• Derivative of Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji}))² wrt c₀:
−2 Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji})): P4Sum + chain rule + P2Sum.
• Derivative of Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji}))² wrt c₁:
−2 Σ_{i=1}^{n} X_{1i} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji})).
• …
• Derivative of Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji}))² wrt c_J:
−2 Σ_{i=1}^{n} X_{Ji} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji})).

24

(γ̂₀, γ̂₁, …, γ̂_J) = solution of a system of J + 1 equations with J + 1 unknowns.

• (γ̂₀, γ̂₁, …, γ̂_J): value of (c₀, c₁, …, c_J) for which all derivatives = 0.
−2 Σ_{i=1}^{n} (Y_i − (γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji})) = 0
−2 Σ_{i=1}^{n} X_{1i} (Y_i − (γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji})) = 0
…
−2 Σ_{i=1}^{n} X_{Ji} (Y_i − (γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji})) = 0
• To compute (γ̂₀, γ̂₁, …, γ̂_J), we:
– draw n units from the population, measure their Y_i s and (X_{1i}, …, X_{Ji})s
– set up the above system plugging in the actual values of the Y_i s and (X_{1i}, …, X_{Ji})s
– This yields a system of J + 1 equations with J + 1 unknowns, the γ̂₀, γ̂₁, …, γ̂_J: all the remaining quantities are real numbers.
– Ask a computer to solve that system.

25

The ls command in E-views solves for you that system of J + 1 equations with J + 1 unknowns.

• We have:
−2 Σ_{i=1}^{n} (Y_i − (γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji})) = 0
−2 Σ_{i=1}^{n} X_{1i} (Y_i − (γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji})) = 0
…
−2 Σ_{i=1}^{n} X_{Ji} (Y_i − (γ̂₀ + γ̂₁X_{1i} + ⋯ + γ̂_J X_{Ji})) = 0
• To compute (γ̂₀, γ̂₁, …, γ̂_J), we:
– draw n units from the population, measure their Y_i s and (X_{1i}, …, X_{Ji})s
– set up the system plugging in the values of the Y_i s and (X_{1i}, …, X_{Ji})s
– Ask a computer to solve that system.
• That is what the "ls" command in Eviews does, where the Y_i s are the values of the first variable after the "ls" command, and the (X_{1i}, …, X_{Ji})s are the values of the variables after "c".

26

Doing E-views' job once in our life.

• Gmail example. Assume we sample 4 emails (n = 4).
• For each, we measure Y_i: whether spam, X_{1i}: whether it has the word "free" in it, and X_{2i}: whether it has the word "buy" in it.
• E.g.: the 1st email we sample is a spam, and has the words "free" and "buy" in it. The 2nd email is not a spam, and has "free" in it but not "buy", etc.
• If you regress Y on a constant, X₁, and X₂, what is the value of (γ̂₀, γ̂₁, γ̂₂)? Hint: you need to write the system of 3 equations and three unknowns solved by (γ̂₀, γ̂₁, γ̂₂), plug the values of Y_i, X_{1i}, and X_{2i} given in the table into the system, and then solve the system. You have 4 minutes to find the answer.

27

Email Y_i X_{1i} X_{2i}
1 1 1 1
2 0 1 0
3 1 1 0
4 0 0 0

iClicker time

• If you regress Y on a constant, X₁, and X₂, what will be the value of (γ̂₀, γ̂₁, γ̂₂)?

Email Y_i X_{1i} X_{2i}
1 1 1 1
2 0 1 0
3 1 1 0
4 0 0 0

28

(γ̂₀, γ̂₁, γ̂₂) = (0, 0.5, 0.5)

• n = 4 and J = 2, so we have (we can forget the −2):
Σ_{i=1}^{4} (Y_i − (γ̂₀ + γ̂₁X_{1i} + γ̂₂X_{2i})) = 0
Σ_{i=1}^{4} X_{1i} (Y_i − (γ̂₀ + γ̂₁X_{1i} + γ̂₂X_{2i})) = 0
Σ_{i=1}^{4} X_{2i} (Y_i − (γ̂₀ + γ̂₁X_{1i} + γ̂₂X_{2i})) = 0
• Plugging in the values of Y_i, X_{1i}, and X_{2i} in the table yields:
2 − 4γ̂₀ − 3γ̂₁ − γ̂₂ = 0
2 − 3γ̂₀ − 3γ̂₁ − γ̂₂ = 0
1 − γ̂₀ − γ̂₁ − γ̂₂ = 0
• Subtracting equation 1 from equation 2 yields γ̂₀ = 0.
• Plugging in γ̂₀ = 0 yields a system of 2 equations and 2 unknowns:
2 − 3γ̂₁ − γ̂₂ = 0
1 − γ̂₁ − γ̂₂ = 0
• Subtracting equation 2 from equation 1 yields 1 − 2γ̂₁ = 0, which is equivalent to γ̂₁ = 0.5.
• Plugging γ̂₁ = 0.5 into 1 − γ̂₁ − γ̂₂ = 0 yields γ̂₂ = 0.5.

29
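The same system can be handed to a computer, as the slides suggest. A sketch using NumPy: the three first-order conditions are exactly the normal equations X'X γ̂ = X'Y, so solving that linear system reproduces (0, 0.5, 0.5).

```python
import numpy as np

# The four sampled emails from the table: columns are constant, X1 ("free"), X2 ("buy")
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
Y = np.array([1.0, 0.0, 1.0, 0.0])

# The J+1 first-order conditions are the normal equations X'X g = X'Y
g = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(g, [0.0, 0.5, 0.5]))  # True
```

This is the computation "ls" performs behind the scenes, just written out for this 4-email example.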

(γ̂₀, γ̂₁, …, γ̂_J) converges towards (γ₀, γ₁, …, γ_J)

• One can show that, as for univariate regressions, the estimators of the multivariate regression coefficients converge towards the true multivariate regression coefficients (those for the full population).
• lim_{n→+∞} (γ̂₀, γ̂₁, …, γ̂_J) = (γ₀, γ₁, …, γ_J).
• Intuition: when the sample size becomes large, the sample becomes similar to the population.

30

Using the central limit theorem for the γ̂_j s to construct tests and confidence intervals.

• The formula of the variance of multivariate regression coefficients is complicated. No need to know it, E-views computes it for you.
• If n ≥ 100, (γ̂_j − γ_j)/√V̂(γ̂_j) follows a normal distribution with mean 0 and variance 1.
• We can use this to test null hypotheses. Often, we want to test γ_j = 0.
• If we want to have 5% chances of wrongly rejecting γ_j = 0, the test is:
Reject γ_j = 0 if γ̂_j/√V̂(γ̂_j) > 1.96 or γ̂_j/√V̂(γ̂_j) < −1.96.
Otherwise, do not reject γ_j = 0.
• We can also construct a 95% confidence interval for γ_j:
[γ̂_j − 1.96 √V̂(γ̂_j), γ̂_j + 1.96 √V̂(γ̂_j)].

31

Assessing the quality of our predictions: the R².

• To assess the quality of our predictions, we use the same measure as with the OLS affine regression:
R² = 1 − [(1/n) Σ_{i=1}^{n} ê_i²] / [(1/n) Σ_{i=1}^{n} (Y_i − Ȳ)²]
• R² = 1 − MSE / sample variance of the Y_i s.
• As in the previous lectures, R² is included between 0 and 1.

32

What you need to remember

• The prediction for y_k based on a multivariate regression is γ₀ + γ₁x_{1k} + ⋯ + γ_J x_{Jk}, with (γ₀, γ₁, …, γ_J): the value of (c₀, c₁, …, c_J) minimizing Σ_{k=1}^{N} (y_k − (c₀ + c₁x_{1k} + ⋯ + c_J x_{Jk}))².
• We can estimate (γ₀, γ₁, …, γ_J) if we measure the y_k s for a random sample of the population.
• For every i between 1 and n, (Y_i, X_{1i}, …, X_{Ji}) = value of the dependent and independent variables of the ith unit we randomly select.
• To estimate (γ₀, γ₁, …, γ_J), find the (c₀, c₁, …, c_J) minimizing Σ_{i=1}^{n} (Y_i − (c₀ + c₁X_{1i} + ⋯ + c_J X_{Ji}))².
• Differentiating this function wrt (c₀, c₁, …, c_J) yields a system of J + 1 equations with J + 1 unknowns.
• We solved the system in a simple example; you should know how to do that.
• We used the central limit theorem to propose a 5% level test of γ_j = 0, and to derive a 95% confidence interval for γ_j.

33

Roadmap

1. The OLS multivariate regression function.

2. Estimating the OLS multivariate regression function.

3. Advantages and pitfalls of multivariate regressions.

4. Interpreting coefficients in multivariate OLS regressions.

34

Adding variables to a regression always improves the R²

• Assume you regress a variable Y on a constant and on a variable X₁.
• Then, you regress Y on a constant and on two variables X₁ and X₂.
• The R² of your second regression will be at least as high as the R² of the first regression.
• Adding variables to a regression never decreases its R².
• => a regression with many variables gives at least as good predictions for the Y_i s in the sample as a regression with few variables.

35
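This "R² never falls" fact can be illustrated with simulated data. A sketch with NumPy; the variable names and data-generating numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.3 + 0.5 * x1 + 0.2 * x2 + rng.normal(size=n)

def r_squared(y, regressors):
    # R^2 = 1 - mean squared residual / sample variance of y
    X = np.column_stack([np.ones(len(y))] + regressors)
    g, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ g
    return 1.0 - (e ** 2).mean() / y.var()

r2_one = r_squared(y, [x1])      # regression on a constant and x1
r2_two = r_squared(y, [x1, x2])  # add x2
print(r2_two >= r2_one)  # True: adding a regressor never lowers R^2
```

The inequality holds for any data set, because the larger regression can always reproduce the smaller one by setting the extra coefficient to 0.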

Example

• Sample of 4601 emails, for which you observe whether they are spam or not.

• You regress spam on a constant and a variable equal to the percentage of the words of the email that are the word “free”.

• Eviews command: ls spam c word_freq_free.

• Is the R² of that regression low or high?

36

Dependent Variable: SPAM

Method: Least Squares

Date: 05/16/17 Time: 18:22

Sample: 1 4601

Included observations: 4601

Variable Coefficient Std. Error t-Statistic Prob.

C 0.372927 0.007555 49.35958 0.0000

WORD_FREQ_FREE 0.201984 0.023411 8.627873 0.0000

R-squared 0.015928 Mean dependent var 0.394045

Adjusted R-squared 0.015714 S.D. dependent var 0.488698

S.E. of regression 0.484843 Akaike info criterion 1.390450

Sum squared resid 1081.098 Schwarz criterion 1.393247

Log likelihood -3196.730 Hannan-Quinn criter. 1.391434

F-statistic 74.44020 Durbin-Watson stat 0.032029

Prob(F-statistic) 0.000000

Example

• You regress the spam variable on a constant, a variable equal to the % of words of the email that are the word “free”, and a variable equal to the % of words of the email that are the word “money”.

• Eviews command: ls spam c word_freq_free word_freq_money.

• R² higher in that regression than in the previous one. R² = 1 − average of squared prediction errors / variance of the spam variable. => higher R² means lower sum of squared prediction errors => better predictions.

37

Dependent Variable: SPAM

Method: Least Squares

Date: 05/16/17 Time: 18:23

Sample: 1 4601

Included observations: 4601

Variable Coefficient Std. Error t-Statistic Prob.

C 0.358449 0.007483 47.90281 0.0000

WORD_FREQ_FREE 0.141932 0.023370 6.073346 0.0000

WORD_FREQ_MONEY 0.220177 0.016122 13.65706 0.0000

R-squared 0.054291 Mean dependent var 0.394045

Adjusted R-squared 0.053879 S.D. dependent var 0.488698

S.E. of regression 0.475350 Akaike info criterion 1.351121

Sum squared resid 1038.953 Schwarz criterion 1.355317

Log likelihood -3105.255 Hannan-Quinn criter. 1.352598

F-statistic 131.9791 Durbin-Watson stat 0.100016

Prob(F-statistic) 0.000000

Should we include all variables in regression?

• Sometimes we have many potential variables we can include in our

regression.

• E.g.: the Gmail example. We could use whether the words “free”, “buy”, “money” appear in the email, the number of exclamation marks, etc. to predict whether the email is spam.

• Previous slides suggest we should include as many variables as possible in the regression, to get the highest R².

• If we do this, we run into a problem called overfitting: we will make excellent predictions within the sample we use to run the regression (high R²), but bad predictions when we use the regression to predict the dependent variables of units outside of our sample.

• Issue: we do not care about in-sample prediction: for the units in the sample, we already know their Yᵢ, no need to predict them. It's for the units not in the sample, for which we do not know the value of their dependent variable, that we want to make good predictions.

38

Introduction to overfitting, through an example

• Assume that in your data, you only have 3 emails.

• Assume also that for each email, you measure 3 variables:

– Yᵢ: whether the email is spam

– X₁ᵢ: whether the minute when the email was sent is an odd number

– X₂ᵢ: whether the second when the email was sent is an odd number.

• X₁ᵢ and X₂ᵢ should be poor predictors of whether an email is spam: no reason why spams are more likely to be sent on odd minutes/seconds.

• Assume that the values of Yᵢ, X₁ᵢ, X₂ᵢ are as in the table below.

• Find (c₀, c₁, c₂) such that Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))² = 0.

39

Email Yᵢ X₁ᵢ X₂ᵢ

1 1 1 1

2 0 1 0

3 0 0 0

iClicker time

• Assume that the values of Yᵢ, X₁ᵢ, X₂ᵢ are as in the table below.

• The (c₀, c₁, c₂) such that Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))² = 0 is:

a) (c₀, c₁, c₂) = …

b) (c₀, c₁, c₂) = …

c) (c₀, c₁, c₂) = …

d) (c₀, c₁, c₂) = … 40

Email Yᵢ X₁ᵢ X₂ᵢ

1 1 1 1

2 0 1 0

3 0 0 0

(c₀, c₁, c₂) = (0, 0, 1).

• The (c₀, c₁, c₂) such that Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))² = 0 is the solution of this system:

1 − (c₀ + c₁ + c₂) = 0

0 − (c₀ + c₁) = 0

0 − c₀ = 0

• You can check that the solution is c₀ = 0, c₁ = 0, c₂ = 1.

• Now assume that in this example, you regress Yᵢ on a constant, X₁ᵢ and X₂ᵢ. Let γ̂₀, γ̂₁, γ̂₂ respectively denote the coefficient of the constant, of X₁ᵢ, and of X₂ᵢ in this regression. What will be the value of γ̂₀, γ̂₁, γ̂₂? Discuss this question during 2 minutes with your neighbor.

41

Email Yᵢ X₁ᵢ X₂ᵢ

1 1 1 1

2 0 1 0

3 0 0 0

iClicker time

• Values of Yᵢ, X₁ᵢ, X₂ᵢ are as in the table below.

• You regress Yᵢ on a constant, X₁ᵢ and X₂ᵢ. γ̂₀, γ̂₁, γ̂₂ denote the coefficients of the constant, of X₁ᵢ, and of X₂ᵢ in this regression. What will be the value of γ̂₀, γ̂₁, γ̂₂?

a) (γ̂₀, γ̂₁, γ̂₂) = …

b) (γ̂₀, γ̂₁, γ̂₂) = …

c) (γ̂₀, γ̂₁, γ̂₂) = …

d) (γ̂₀, γ̂₁, γ̂₂) = …

42

Email Yᵢ X₁ᵢ X₂ᵢ

1 1 1 1

2 0 1 0

3 0 0 0

γ̂₀ = 0, γ̂₁ = 0, γ̂₂ = 1.

• (γ̂₀, γ̂₁, γ̂₂): minimizer of Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))².

• For any (c₀, c₁, c₂), Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))² ≥ 0.

• If for a (c₀, c₁, c₂), Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))² = 0, this (c₀, c₁, c₂) is the one minimizing Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))², so (γ̂₀, γ̂₁, γ̂₂) is equal to that (c₀, c₁, c₂).

• Σᵢ₌₁³ (Yᵢ − (c₀ + c₁X₁ᵢ + c₂X₂ᵢ))² = 0 if c₀ = 0, c₁ = 0, c₂ = 1.

• Therefore, γ̂₀ = 0, γ̂₁ = 0, γ̂₂ = 1.

• Prediction function for whether an email is spam: 0 + 0 × X₁ᵢ + 1 × X₂ᵢ.

• => you predict that all emails sent on an odd second are spams while all emails sent on an even second are not spams.

• This regression has an R² = 1: the regression predicts perfectly whether emails are spams, in your sample of 3 observations.

• Do you think that this regression will give good predictions, when you use it to make predictions for emails outside of the sample of 3 emails in the regression? 43

iClicker time

• => in this example, the prediction function for whether the email is spam is 0 + 0 × X₁ᵢ + 1 × X₂ᵢ.

• => you predict that all emails sent on an odd second are spams while all emails sent on an even second are not spams.

• This regression has an R² = 1: the regression model predicts perfectly whether emails are spams, in your sample of 3 observations.

• Do you think that this regression will give good predictions, when you use it to make predictions for emails outside of the sample of 3 emails you use in the regression?

a) Yes

b) No

44

No!

• There is no reason why emails sent on odd

seconds would be more likely to be spams than

emails sent on even seconds.

• => when you use the regression to make

predictions for whether emails out of your

sample are spams or not, you will get very bad

predictions.

• This is despite the fact that in your sample, your regression yields perfect predictions: R² = 1.

• So what is going on?

45

R² of reg. with as many variables as units = 1…

• You have n observations, and for each observation you measure a dependent variable Yᵢ and n − 1 independent variables X₁ᵢ, …, X_{n−1,i}.

• You regress Yᵢ on a constant, X₁ᵢ, …, X_{n−1,i}.

• γ̂₀, γ̂₁, …, γ̂_{n−1}, the coefficients of the constant, X₁ᵢ, …, X_{n−1,i}: value of (c₀, c₁, …, c_{n−1}) minimizing Σᵢ₌₁ⁿ (Yᵢ − (c₀ + c₁X₁ᵢ + ⋯ + c_{n−1}X_{n−1,i}))².

• We can make each term in the summation = 0. Equivalent to solving:

Y₁ − (c₀ + c₁X₁₁ + ⋯ + c_{n−1}X_{n−1,1}) = 0

Y₂ − (c₀ + c₁X₁₂ + ⋯ + c_{n−1}X_{n−1,2}) = 0

… Yₙ − (c₀ + c₁X₁ₙ + ⋯ + c_{n−1}X_{n−1,n}) = 0

• System of n equations with n unknowns => has a solution. (γ̂₀, γ̂₁, …, γ̂_{n−1}) is the solution of this system, and

Σᵢ₌₁ⁿ (Yᵢ − (γ̂₀ + γ̂₁X₁ᵢ + ⋯ + γ̂_{n−1}X_{n−1,i}))² = 0

MSE in this regression = 0

R² = 1: R² = 1 − MSE / variance of the dependent variable.

46

… Even if independent variables in regression

are actually really bad predictors of

• R² of a reg. with as many independent variables as units = 1.

• Mechanical property, just comes from the fact that a system of n equations with n unknowns has a solution.

• True even if the independent variables are actually bad predictors of Yᵢ.

• E.g.: the previous example, where we regressed whether an email is spam or not on stupid variables (whether it was sent on an odd second…), still had an R² of 1.

• R² = 1 means that the regression predicts Yᵢ perfectly well in sample, but it will probably yield bad predictions outside of the sample.

• Overfitting: we give ourselves so many parameters we can play with (the coefficients of all the variables in the regression) that we end up fitting perfectly the variable Yᵢ in our sample, but we will make very large prediction errors outside of our sample.

47

• Figure below: 11 units, with their values of a variable X₁ᵢ and of a variable Yᵢ.

• Black line: regression function you obtain when you regress Yᵢ on a constant and X₁ᵢ.

• Blue line: regression function you obtain when you regress Yᵢ on a constant, X₁ᵢ, X₁ᵢ², X₁ᵢ³, X₁ᵢ⁴, …, X₁ᵢ¹⁰.

• Which of these two regressions will have the highest R²?

Another example of overfitting

48

• Which of these two regressions will have the highest R²?

a) The regression of Yᵢ on a constant and X₁ᵢ.

b) The regression of Yᵢ on a constant, X₁ᵢ, X₁ᵢ², X₁ᵢ³, X₁ᵢ⁴, …, X₁ᵢ¹⁰.

iClicker time

49

• Regression of Yᵢ on a constant, X₁ᵢ, X₁ᵢ², X₁ᵢ³, X₁ᵢ⁴, …, X₁ᵢ¹⁰ has 11 observations and 11 coefficients (the constant plus 10 powers of X₁ᵢ). R² = 1. The blue line fits the black dots perfectly.

• The black line does not perfectly fit the black dots => the regression of Yᵢ on a constant and X₁ᵢ has R² < 1.

• The goal of regression is to make predictions for the value of the dependent variable of units not in your sample, for which you observe the xs but not y.

• Assume that one of these units has x = −4.5. Do you think you will get a better prediction for the y of that unit using the regression of Yᵢ on a constant and X₁ᵢ, or the regression of Yᵢ on a constant, X₁ᵢ, X₁ᵢ², X₁ᵢ³, X₁ᵢ⁴, …, X₁ᵢ¹⁰?

Regression of Yᵢ on constant, X₁ᵢ, X₁ᵢ², …, X₁ᵢ¹⁰.

50

• The goal of a regression is to make a prediction for the value of the dependent variable of units not in the sample, for which you observe the xs but not y.

• Assume that one of these units has x = −4.5. Do you think you will get a better prediction for the y of that unit using the regression of Yᵢ on a constant and X₁ᵢ, or the regression of Yᵢ on a constant, X₁ᵢ, X₁ᵢ², X₁ᵢ³, X₁ᵢ⁴, …, X₁ᵢ¹⁰?

a) We will get a better prediction using the regression of Yᵢ on a constant and X₁ᵢ.

b) We will get a better prediction using the regression of Yᵢ on a constant, X₁ᵢ, X₁ᵢ², X₁ᵢ³, X₁ᵢ⁴, …, X₁ᵢ¹⁰.

iClicker time

51

• Prediction of the y of the unit with x = −4.5:

– according to the reg. of Yᵢ on a constant, X₁ᵢ, X₁ᵢ², …, X₁ᵢ¹⁰: 13.

– according to the reg. of Yᵢ on a constant and X₁ᵢ: −12.

• In sample, units with x close to −4.5 have y much closer to −12 than to 13 => the regression of Yᵢ on a constant and X₁ᵢ will give the better prediction.

• Again, a regression with many independent variables might give very good in-sample predictions but very bad out-of-sample predictions.

• But making good out-of-sample predictions is the goal of regression.

• => Comparing the R² of 2 regs. is not the right way to assess which will give the best out-of-sample predictions. A reg. with many variables always has a very high R² but might end up making poor out-of-sample predictions.

Better prediction using reg. of Yᵢ on constant & X₁ᵢ

52
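The mechanics of this example can be reproduced with simulated data (the numbers below are made up; only the structure mirrors the slides): a degree-10 polynomial through 11 points fits the sample exactly, while the simple linear fit does not.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 11)                  # 11 in-sample units
y = 2 * x + rng.normal(scale=2.0, size=11)  # true relation is linear plus noise

# Degree-10 fit: 11 coefficients for 11 points -> exact interpolation, R^2 = 1.
V10 = np.vander(x, 11)
c10 = np.linalg.solve(V10, y)

# Degree-1 fit: constant and x, by ordinary least squares.
c1, *_ = np.linalg.lstsq(np.vander(x, 2), y, rcond=None)

x_new = -4.5                                # an out-of-sample point
pred10 = np.polyval(c10, x_new)
pred1 = np.polyval(c1, x_new)
# The simple fit is usually much closer to the true value 2 * x_new than the
# interpolating polynomial, even though the latter fits the sample perfectly.
print(pred1, pred10, 2 * x_new)
```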

Instead, use a training and validation sample

53

• You start from a sample of n units for which you measure Yᵢ, the dependent variable, and X₁ᵢ, …, X_Jᵢ, the independent variables.

• Randomly divide the sample into two subsamples of n/2 units. Subsample 1: training sample. Subsample 2: validation sample.

• In the training sample, you estimate the regressions you are interested in.

• For instance, in the training sample:

– Regression 1: Yᵢ on a constant and X₁ᵢ, …, X_Jᵢ. Coefficients γ̂₀, γ̂₁, …, γ̂_J.

– Regression 2: Yᵢ on a constant and X₁ᵢ. Coefficients β̂₀, β̂₁.

• Then, compute the squared prediction error according to each regression for units in the validation sample.

• For instance, for each unit in the validation sample, compute:

– (Yᵢ − (γ̂₀ + γ̂₁X₁ᵢ + ⋯ + γ̂_J X_Jᵢ))²: squared pred. error with Reg. 1.

– (Yᵢ − (β̂₀ + β̂₁X₁ᵢ))²: squared pred. error with Reg. 2.

• Finally, choose the regression for which the sum of squared prediction errors for units in the validation sample is lowest.

• Intuition: you want to use the reg. that gives the best out-of-sample predictions. By choosing the reg. that gives the best predictions in the validation sample, you ensure that your regression will give good out-of-sample predictions, because you did not use the validation sample to compute your reg. coefficients.

Machine learning in 2 minutes

54

• Using a training and a validation sample = the key idea underlying machine learning methods (statistical methods more sophisticated than, but inspired by, multivariate regressions, that are used by tech companies to do image recognition, spam detection, etc.)

• Goal: teach a computer to recognize whether an email is spam, whether a picture of a letter is an “a”, a “b”, etc.

• Train the computer on a sample of emails for which the computer knows whether the email is spam, plus many other variables (all the words in the email, etc.).

• The computer finds the model that best predicts whether the email is spam given all these variables, in the training sample.

• Then, check whether the prediction model works well in the validation sample, where you also know which emails are spams or not.

• If the statistical model also works well in the validation sample, implement the method in real life to predict whether new emails reaching Gmail accounts are spams or not. If an email is predicted to be spam, send it to the junk box. Otherwise, send it to the regular mailbox.

Machine learning often works, but not always

55

What you need to remember

• Great advantage of multivariate regression over univariate regression: it improves the quality of our predictions.

• However, putting too many variables in a regression might result in overfitting: the regression fits the yᵢs in the sample very well, but gives poor out-of-sample predictions.

• For instance, a regression with as many independent variables as units will automatically have an R² = 1, even if those independent variables are actually poor predictors of the dependent variable.

• => comparing R²s is not a good way to choose between several regs.

• Instead, you should:

– randomly divide the sample into a training and a validation sample

– estimate your regressions in the training sample only

– compute the squared prediction errors according to each regression in the validation sample

– choose the regression for which the MSE in the validation sample is smallest.

• The training / validation sample idea underlies the machine learning models used for spam detection / image recognition, etc. by tech companies.

56

Roadmap

1. The OLS multivariate regression function.

2. Estimating the OLS multivariate regression function.

3. Advantages and pitfalls of multivariate regressions.

4. Interpreting coefficients in multivariate OLS regressions.

57

Interpreting coeff. of multivariate regs. An example.

• 6 units (n = 6). 3 variables: Yᵢ, Dᵢ, and Xᵢ. Dᵢ and Xᵢ: binary.

• If you regress Yᵢ on a constant and Dᵢ, what will be the coeff. of Dᵢ? If you regress Yᵢ on a constant, Dᵢ, and Xᵢ, what will be the coeff. of Dᵢ? Hint: to answer the first question, you can use a result you saw during sessions. To answer the second question, write the system of 3 equations and three unknowns solved by (γ̂₀, γ̂₁, γ̂₂), the coefficients of the constant, Dᵢ, and Xᵢ, plug in the values of Yᵢ, Dᵢ, and Xᵢ in the table, and then solve the system.

58

Unit Yᵢ Dᵢ Xᵢ

1 5 1 1

2 3 1 1

3 4 0 1

4 1 1 0

5 0 0 0

6 2 0 0

iClicker time

If you regress 𝑌 on constant and 𝐷, what will be coeff. of 𝐷? If you

regress 𝑌 on a constant, 𝐷, and 𝑋, what will be coeff. of 𝐷?

a) In reg. of 𝑌 on constant and 𝐷, coeff. of 𝐷 is 2. In reg. of 𝑌 on

a constant, 𝐷, and 𝑋, coeff. of 𝐷 is 0.5.

b) In reg. of 𝑌 on constant and 𝐷, coeff. of 𝐷 is 1. In reg. of 𝑌 on

a constant, 𝐷, and 𝑋, coeff. of 𝐷 is 0.5.

c) In reg. of 𝑌 on constant and 𝐷, coeff. of 𝐷 is 1. In reg. of 𝑌 on

a constant, 𝐷, and 𝑋, coeff. of 𝐷 is 0.

d) In reg. of 𝑌 on constant and 𝐷, coeff. of 𝐷 is 1. In reg. of 𝑌 on

a constant, 𝐷, and 𝑋, coeff. of 𝐷 is ‐0.5.

59

Unit Yᵢ Dᵢ Xᵢ

1 5 1 1

2 3 1 1

3 4 0 1

4 1 1 0

5 0 0 0

6 2 0 0

In the regression of Yᵢ on a constant and Dᵢ, the coeff of Dᵢ is 1.

In the regression of Yᵢ on a constant, Dᵢ, and Xᵢ, the coeff of Dᵢ is 0.

• Coeff of Dᵢ in the reg. of Yᵢ on a constant and Dᵢ. Result from sessions: (Average Yᵢ for Dᵢ = 1) − (Average Yᵢ for Dᵢ = 0) = 1/3(5+3+1) − 1/3(4+2+0) = 1.

• Coeff of Dᵢ in the reg. of Yᵢ on a constant, Dᵢ, Xᵢ: 3 eqs. with 3 unknowns.

• n = 6 and J = 2, so we have (we can forget the −2):

Σᵢ₌₁⁶ (Yᵢ − (γ̂₀ + γ̂₁Dᵢ + γ̂₂Xᵢ)) = 0

Σᵢ₌₁⁶ Dᵢ(Yᵢ − (γ̂₀ + γ̂₁Dᵢ + γ̂₂Xᵢ)) = 0

Σᵢ₌₁⁶ Xᵢ(Yᵢ − (γ̂₀ + γ̂₁Dᵢ + γ̂₂Xᵢ)) = 0

• Plugging in the values of Yᵢ, Dᵢ, and Xᵢ yields:

15 − 6γ̂₀ − 3γ̂₁ − 3γ̂₂ = 0

9 − 3γ̂₀ − 3γ̂₁ − 2γ̂₂ = 0

12 − 3γ̂₀ − 2γ̂₁ − 3γ̂₂ = 0

• Subtracting eq 2 from eq 3: 3 + γ̂₁ − γ̂₂ = 0

• Multiplying eq 3 by 2 and subtracting eq 1: 9 − γ̂₁ − 3γ̂₂ = 0

• Adding the two preceding equations: 12 − 4γ̂₂ = 0, so γ̂₂ = 3.

• Plugging γ̂₂ = 3 into 3 + γ̂₁ − γ̂₂ = 0: γ̂₁ = 0. 60

A general formula for the coefficient of a binary variable in a

regression of Yᵢ on a constant and 2 binary variables.

• Let Dᵢ and Xᵢ be 2 binary variables.

• n₀₀: number of units with Dᵢ = 0, Xᵢ = 0. n₁₀: number of units with Dᵢ = 1, Xᵢ = 0. n₀₁: number of units with Dᵢ = 0, Xᵢ = 1. n₁₁: number of units with Dᵢ = 1, Xᵢ = 1.

• The coeff of Dᵢ in the regression of Yᵢ on constant, Dᵢ, and Xᵢ is:

w × [ (1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ ]

+ (1 − w) × [ (1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ ]

w: a number included between 0 and 1, no need to know its formula.

• (1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ: difference between the average Yᵢ of units with Dᵢ = 1 and of units with Dᵢ = 0, among units with Xᵢ = 0.

• (1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ: difference between the average Yᵢ of units with Dᵢ = 1 and of units with Dᵢ = 0, among units with Xᵢ = 1.

• The coeff of Dᵢ measures the difference between the average of Yᵢ across subgroups whose Dᵢ differs by one, but that have the same Xᵢ! 61
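The two within-group differences, and the fact that the coefficient of Dᵢ is a weighted average of them, can be checked on the same 6-unit table:

```python
import numpy as np

Y = np.array([5.0, 3.0, 4.0, 1.0, 0.0, 2.0])
D = np.array([1, 1, 0, 1, 0, 0])
X = np.array([1, 1, 1, 0, 0, 0])

# Difference in average Y between D=1 and D=0 units, within each X group.
diff_x0 = Y[(D == 1) & (X == 0)].mean() - Y[(D == 0) & (X == 0)].mean()
diff_x1 = Y[(D == 1) & (X == 1)].mean() - Y[(D == 0) & (X == 1)].mean()
print(diff_x0, diff_x1)   # both are 0 in this example

# Coefficient of D in the regression of Y on a constant, D and X:
Z = np.column_stack([np.ones(6), D, X])
coef_D = np.linalg.lstsq(Z, Y, rcond=None)[0][1]
# A weighted average of diff_x0 and diff_x1, hence also 0 here.
print(round(coef_D, 6))
```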

Applying the formula in the example.

• Sample with 6 units. 3 variables: Yᵢ, Dᵢ, Xᵢ. Dᵢ and Xᵢ: binary variables.

• What is the value of (1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ? Of (1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ?

62

Unit Yᵢ Dᵢ Xᵢ

1 5 1 1

2 3 1 1

3 4 0 1

4 1 1 0

5 0 0 0

6 2 0 0

iClicker time

a) (1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ = 1,

(1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ = −1

b) (1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ = 0,

(1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ = 0

c) (1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ = −1,

(1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ = 1

63

Unit Yᵢ Dᵢ Xᵢ

1 5 1 1

2 3 1 1

3 4 0 1

4 1 1 0

5 0 0 0

6 2 0 0

(1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ = 0, and

(1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ = 0

• (1/n₁₀) Σ_{i: Dᵢ=1, Xᵢ=0} Yᵢ − (1/n₀₀) Σ_{i: Dᵢ=0, Xᵢ=0} Yᵢ = 1 − (1/2)(0 + 2) = 0

• (1/n₁₁) Σ_{i: Dᵢ=1, Xᵢ=1} Yᵢ − (1/n₀₁) Σ_{i: Dᵢ=0, Xᵢ=1} Yᵢ = (1/2)(5 + 3) − 4 = 0

• The coeff of Dᵢ in the regression of Yᵢ on constant, Dᵢ, and Xᵢ is a weighted average of these two numbers, so that's why it's equal to 0, as we have shown earlier. 64

Unit Yᵢ Dᵢ Xᵢ

1 5 1 1

2 3 1 1

3 4 0 1

4 1 1 0

5 0 0 0

6 2 0 0

Interpreting coefficients in multivariate regressions.

• Previous slides: in the reg. of Yᵢ on constant, Dᵢ, and Xᵢ, where Dᵢ and Xᵢ are binary, the coeff of Dᵢ = difference between the average of Yᵢ across groups whose Dᵢ differs by one, but that have the same Xᵢ.

• Extends to all multivariate regressions.

• In a multivariate regression of Yᵢ on constant, Dᵢ, X₁ᵢ, …, X_Jᵢ, the coeff. of Dᵢ, γ̂₁, measures the difference between the average of Yᵢ across subgroups whose Dᵢ differs by one, but that have the same X₁ᵢ, …, X_Jᵢ.

• If γ̂₁ < 0, that means that if you compare the average Yᵢ across units whose Dᵢ differs by one but that have the same value of X₁ᵢ, …, X_Jᵢ, the average of Yᵢ is smaller among units whose Dᵢ is 1 unit larger.

• In a multivariate regression of ln(Yᵢ) on constant, Dᵢ, X₁ᵢ, …, X_Jᵢ, if γ̂₁ < 0, that means that if you compare the average Yᵢ across units whose Dᵢ differs by one but that have the same value of X₁ᵢ, …, X_Jᵢ, the average of Yᵢ is approximately 100·|γ̂₁|% smaller among units whose Dᵢ is 1 unit larger. 65

Women earn less than men

• Same representative sample of 14086 US wage earners as in

Homework 3.

• Regression of ln(weekly wage) on constant and binary variable

equal to 1 for females in Stata.

• Women earn 32% less than men, a difference that is very significant.

• From that regression, can we conclude that women are

discriminated against in the labor market? Why? 66

. reg ln_weekly_wage female, r

Linear regression Number of obs = 14,086

F(1, 14084) = 516.54

Prob > F = 0.0000

R-squared = 0.0354

Root MSE = .84461

Robust

ln_weekly_~e Coef. Std. Err. t P>|t| [95% Conf. Interval]

female -.3235403 .0142357 -22.73 0.000 -.3514442 -.2956365

_cons 6.642133 .0099315 668.80 0.000 6.622666 6.6616

iClicker time

• Women earn 32% less than men, a difference that is very significant.

• Can we conclude that women are discriminated against in the

labor market? Why?

a) Yes, we can conclude that women are discriminated against in

the labor market, this 32% difference in wages must reflect

discrimination.

b) No, we cannot conclude that women are discriminated against

in the labor market, because the R2 of the regression is too low.

c) No, we cannot conclude that women are discriminated against

in the labor market. Maybe women earn less than men for

reasons that have nothing to do with their gender.

67

Maybe women earn less for reasons that have

nothing to do with their gender.

• Women earn less than men.

• But that difference could for instance come from the fact they

work less hours per week outside of the home.

• Maybe women are not discriminated against by their employer; maybe they just work fewer hours for their employer => get paid less.

• (Aside: women indeed tend to work fewer hours a week

outside of the home than men, but that may be because they

also tend to spend more time taking care of children in

households with children, another form of gender imbalance,

though that imbalance is taking place in the family, not in the

labor market).

68

A more complicated regression

• Regression of ln(weekly wage) on constant, variable for

females + years of schooling, age, hours worked per week.

• Interpret coeff. of female variable in that regression.

69

. reg ln_weekly_wage female age hours_worked years_schooling, r

Linear regression Number of obs = 14,086

F(4, 14081) = 1449.64

Prob > F = 0.0000

R-squared = 0.3883

Root MSE = .67267

Robust

ln_weekly_wage Coef. Std. Err. t P>|t| [95% Conf. Interval]

female -.2731097 .0117557 -23.23 0.000 -.2961524 -.250067

age .0104635 .0004288 24.40 0.000 .0096231 .011304

hours_worked .0231192 .0005812 39.78 0.000 .02198 .0242583

years_schooling .1024575 .002269 45.15 0.000 .0980099 .1069052

_cons 3.931402 .0398147 98.74 0.000 3.85336 4.009444

iClicker time

• Interpret coeff. of female variable reg. on previous slide.

a) On average, women earn 0.27 dollars less than men per week.

b) When we compare women and men that have the same

number of years of schooling, the same age, and that work the

same number of hours per week, we find that on average,

women earn 0.27 dollars less than men per week.

c) When we compare women and men that have the same

number of years of schooling, the same age, and that work the

same number of hours per week, we find that on average

women earn 27% less than men per week.

70

Answer c) !

• Remember: in a multivariate reg. of ln(Yᵢ) on constant, Dᵢ, X₁ᵢ, …, X_Jᵢ, if γ̂₁ < 0, that means that if you compare the average Yᵢ across units whose Dᵢ differs by one but that have the same value of X₁ᵢ, …, X_Jᵢ, the average of Yᵢ is approximately 100·|γ̂₁|% smaller among units whose Dᵢ is 1 unit larger.

• Here: Dᵢ is the female variable. Females have Dᵢ = 1, males have Dᵢ = 0.

• The other variables in the regression are years of schooling, age, and number of hours worked / week.

• => γ̂₁ = −0.27 means that when we compare women and men that have the same number of years of schooling, the same age, and that work the same number of hours per week, we find that on average women earn 27% less than men per week.

71

Complicated reg. is stronger, though still imperfect,

evidence of gender discrimination in the labor market.

• The difference between men's and women's earnings cannot be explained by differences in education, hours worked per week, and professional experience.

• Even when we compare men and women with the same education, hours worked per week, and professional experience, women earn substantially less (27%).

• This is still not definitive evidence of discrimination. Maybe women tend to go into lower-paying jobs and industries than men.

• E.g.: fewer women in finance and engineering.

• But is this because women do not like those types of jobs (if so, no discrimination), or is it because those industries do not want to hire women (if so, discrimination), or because women would like to go into those jobs but do not do so because it is frowned upon due to social norms (if so, discrimination)?

• Overall, even though there are limits even with the complicated regression, the fact that women earn less even when we compare men and women with the same education, hours worked per week, and professional experience suggests that women are discriminated against in the labor market.

72

What is econometrics?

• Econometrics is a set of statistical techniques that we can use to

study economic questions empirically.

• The tools we use in econometrics are statistical techniques, which is why the beginning of an intro to econometrics class looks more like a stats class than an econ class: before we can apply the statistical tools to study economic questions, we need to master the tools!

• Why do we want to study economic questions empirically? Isn’t

economic theory enough?

• The issue with economic theory is that on a number of issues,

different theories lead to different conclusions.

• E.g.: a neo-classical economist will tell you that increasing the minimum wage will reduce employment, while a neo-Keynesian will tell you that increasing the minimum wage will increase employment.

• Conflicting theories => we need to study these questions

empirically (with data) to say which theory is true.

• The wage regressions in homework 3 and in these slides are a first example of how to use statistical tools to study an economic question, “are women discriminated against in the labor market?”, empirically (with data).

• Other examples coming in the next slides.

73

What you need to remember

• In a multivariate regression of Yᵢ on constant, Dᵢ, X₁ᵢ, …, X_Jᵢ, if γ̂₁, the coeff of Dᵢ, is equal to x, that means that if you compare the average Yᵢ across units whose Dᵢ differs by one but that have the same value of X₁ᵢ, …, X_Jᵢ, the average of Yᵢ is x larger (if x > 0) / smaller (if x < 0) among units whose Dᵢ is 1 unit larger.

• In a multivariate regression of ln(Yᵢ) on constant, Dᵢ, X₁ᵢ, …, X_Jᵢ, if γ̂₁, the coeff of Dᵢ, is equal to x, that means that if you compare the average Yᵢ across units whose Dᵢ differs by one but that have the same value of X₁ᵢ, …, X_Jᵢ, the average of Yᵢ is approximately 100·|x|% larger (if x > 0) / smaller (if x < 0) among units whose Dᵢ is 1 unit larger.

74
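One caveat worth knowing (not in the slides): reading a log coefficient x as "100·x percent" is a first-order approximation, exp(x) ≈ 1 + x. The exact proportional change implied by x is exp(x) − 1, which matters for coefficients that are not small:

```python
import math

# The slides read a log-wage coefficient of -0.27 as "women earn 27% less".
# The exact proportional change implied by that coefficient:
x = -0.27
exact_change = math.exp(x) - 1
print(round(exact_change, 3))   # about -0.237, i.e. roughly 24% lower
```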

Take-home examination for Econ 140A, Fall 2020

Your answers should be submitted in a single pdf document on GauchoSpace.

You may either type or handwrite your answers, or some combination if you like.

You can take pictures of your handwritten responses, then include them in the single

pdf document that you submit. Regardless of what you decide, it is important that your

answers are as clear as possible. Your answers should appear in the order in which the

questions are asked. Please review your answers before submitting them to confirm that

they are easily readable.

Be sure that you explicitly answer each question and explain each step, as if you

were writing solutions so that another student in the class would be able to follow your

thoughts. Part of your grade will depend on explaining each step of your answers.

1. Suppose you want to know about wage discrimination by gender in Santa Barbara. The goal of this problem is to understand regression with a dummy variable.

(a) Suppose you run the following regression:

wage_i = β + ε_i

where wage_i is the monthly wage of individual i, and ε_i is an error term. Show the objective function to get β and derive β in terms of wage_i from the objective function. Interpret β based on the derivation. (3pt)

(b) In order to investigate wage discrimination in Santa Barbara, one of your friends suggests to run

wage_i = α_f · female_i + α_m · male_i + ε_i,   (1)

where female_i = 1 if i is female, 0 if i is male, and male_i = 1 if i is male, 0 if i is female.

Show the objective function to get α_f and α_m and derive them. Finally, interpret α_f and α_m in two sentences. (4pt)

(c) Suppose you estimate α_f and α_m with the sample of 1000 people in this city. The result of these estimates is the following:

wage_i = 3700(542) · female_i + 5300(1024) · male_i + ε̂_i   (2)

The numbers in brackets are the standard errors for α̂_f and α̂_m. Derive the 95% confidence interval of the monthly wage of males. (2pt)

(d) To figure out the wage discrimination between males and females, one of your friends suggests to run

wage_i = γ₁ + γ₂ · female_i + ε_i.   (3)

If you run regression (3) with the same sample as in (c), what will γ̂₁ and γ̂₂ be? In order to test the hypothesis that there is no wage discrimination between males and females, which coefficient would you use to test this hypothesis, γ̂₁ or γ̂₂? Set up the null hypothesis for this test explicitly. (3pt)

2. You realize that it is insufficient to compare the wage differential between the two genders to know about wage discrimination, because this wage differential could come from an educational gap between males and females. So, you decide to obtain data on years of education from all workers in addition to wage. The goal of this problem is to understand regressions with an interaction term.

(a) Consider

wage_i = β₀ + β₁ · edu_i + ε_i,   (4)

where edu_i is years of education of worker i. Show the objective function to obtain β₀ and β₁. Derive the first order conditions. And, finally, interpret β₀ and β₁ in (4) in two sentences. (3pt)


(You don't need to derive β₀ and β₁ explicitly.)

(b) One of your friends suggests that you need to run the following regression to learn the gender wage gap controlling for education:

wage_i = α_f + α_m · male_i + β_f · edu_i + β_m · male_i · edu_i + ε_i   (5)

where male_i = 1 if i is male and 0 otherwise. Since you do not know the meaning of each coefficient, you decide to derive each coefficient in terms of wage_i, male_i and edu_i. What is the objective function to get α_f, α_m, β_f, and β_m? Derive the first order conditions for α_f, α_m, β_f, and β_m respectively. (3pt)

(You don't need to derive α_f, α_m, β_f and β_m explicitly.)

(c) From the first order conditions in (b), show that α_f and β_f depend only on wage_i and edu_i of the female population. Are these conditions the same as the first order conditions of the regression y_i = α_f + β_f · edu_i + ε_i where individual i belongs to the female population? (3pt)

Hint: Σᵢ₌₁ⁿ xᵢ² = Σ_{i: female} xᵢ² + Σ_{i: male} xᵢ².

(d) For now let γ₀ = α_f + α_m and γ₁ = β_f + β_m. From the first order conditions in (b), show that γ₀ and γ₁ depend only on wage_i and edu_i of the male population. Are these conditions the same as the first order conditions of the regression y_i = γ₀ + γ₁ · edu_i + ε_i where individual i belongs to the male population? (3pt)

Hint: Use the first order conditions with respect to α_m and β_m.

(e) Based on (c) and (d), interpret α_f, β_f, γ₀, and γ₁ in one sentence each. Given these interpretations, interpret α_m and β_m in one sentence each. (3pt) Hint: We defined γ₀ = α_f + α_m and γ₁ = β_f + β_m in (d). Then, α_m = γ₀ − α_f, and β_f = γ₁ − β_m.


(f) From the sample of 1,000 individuals, you estimate (5) and the result is the following:

wage_i = 1600(542) + 400(172) · male_i + 382(99) · edu_i + 132(49) · male_i · edu_i,

where the numbers in brackets are the standard errors for each coefficient. You want to test the hypothesis that there is no gender gap in the returns to education. Write down the null hypothesis for this test and whether you can reject this null hypothesis at 95% or not. (3pt)


Ordinary least squares regression I:

The univariate linear regression.

Clement de Chaisemartin and Doug Steigerwald

UCSB

1

Traders make predictions

• Traders, say oil traders, speculate on the price of oil.

• When they think the price of oil will go up, they buy oil.

• When they think the price will go down, they sell oil.

• To inform their buying / selling decisions, they need to

predict whether the price will go up or down.

• To make their predictions, they can use the state of the

economy today. E.g.: if world GDP is growing fast today,

the price of oil should increase tomorrow.

• => traders need to use variables available to them to

make predictions on a variable they do not observe:

the price of oil tomorrow.

2

Banks make predictions

• When someone applies for a loan, the bank needs to decide:

– Whether they should give the loan to that person.

– And if so, which interest rate they should charge that person.

• To answer these questions, the bank needs to predict the amount

of the loan that this person will fail to reimburse. They will charge a

high interest rate to people who are predicted to fail to reimburse a

large amount.

• To do so, they can use all the variables contained in the application:

gender, age, income, ZIP code…

• They can also use the credit score of that person: the FICO score, created by

the FICO company. All banks in the US share information about their

customers with FICO. Therefore, for each person FICO knows: total

amount of debt, history of loan repayments… People with lots of

debt who often defaulted on their loans in the past get a low

score, while people with little debt and no defaults get a high score.

• Here as well, banks try to predict a variable they do not observe

(amount of the loan the person will fail to reimburse) using

variables that they observe (the variables in her application + FICO).

3

Tech companies make predictions

• A reason why people prefer Gmail over other mailboxes is that

Gmail is better than many mailboxes at sending spam emails

directly to your trash folder.

• They could ask a human to read the email and say whether it's a

spam or not. But that would be very costly and slow!

• Automated process: when a new email reaches your mailbox, Gmail

needs to decide whether it should go into your trash because it's a

spam, or whether it should go into your regular mailbox.

• To do so, the computer can extract a number of variables from that

email: number of words, email address of the sender, the specific

words used in the email and how many times they occur…

• Based on these variables, it can try to predict whether the email is a

real email or a spam.

• Here as well, Gmail tries to predict a variable they do not observe

(whether that email is a spam or not) using variables that they

observe (number of words, email address of the sender, the specific

words used in the email…).

4

Using variables we observe to make predictions

on variables we do not observe.

• Many real world problems can be cast as using

variables we observe to make predictions on

variables we do not observe:

– either because they will be realized in the future

(e.g.: the amount that someone applying today for a

one-year loan will fail to reimburse will only be

known one year from now)

– or because observing them would be too costly

(e.g.: assessing whether all the emails reaching all

Gmail accounts everyday are spams or not).

5

We will study a variety of models one can use to

make predictions.

• In all the following lectures, we are going to study

how we can construct statistical models to make

predictions.

• We will start by studying the simplest prediction

model: the ordinary least squares (OLS) univariate

linear regression.

6

Roadmap

1. The OLS univariate linear regression function.

2. Estimating the OLS univariate linear regression function.

3. OLS univariate linear regression in practice.

7

Set up and notation.

• We consider a population of 𝑁 units.

– N could be the number of people who apply for a one-year loan with bank

A during April 2018.

– Or N could be the number of emails reaching all Gmail accounts in April

2018.

• Each unit k has a variable y_k attached to it that we do not observe.

We call this variable the dependent variable.

– In the loan example, y_k is a variable equal to the amount of her loan

applicant k will fail to reimburse when her loan expires in April 2019.

– In the email example, y_k is equal to 1 if email k is a spam and 0 otherwise.

• Each unit k also has one variable x_k attached to it that we do observe.

We call this variable the independent variable.

– In the loan example, x_k could be the FICO score of applicant k.

– In the email example, x_k could be a variable equal to 1 if the word

"free" appears in the email.

8

Are units with different values of x_k likely

to have the same value of y_k?

• Based on the value of x_k of each unit, we want to

predict her y_k.

• E.g.: in the loan example, we want to predict the

amount that unit k will fail to reimburse based on

her FICO score.

• Assume that applicant 1 has a very high (good) credit

score, while applicant 2 has a very low (bad) credit

score.

• Do you think that applicant 1 and 2 will fail to

reimburse the same amount on their loan?

9

No!

• Based on the value of x_k of each unit, we want to

predict her y_k.

• E.g.: in the loan example, we want to predict the

amount that unit k will default on her loan based on

her FICO score.

• Assume that applicant 1 has a very high (good) credit

score, while applicant 2 has a very low (bad) credit

score.

• Do you think that applicant 1 and 2 will fail to

reimburse the same amount on their loan?

• No, applicant 2 is more likely to fail to reimburse a

larger amount than applicant 1.

• Should you predict the same value of y_k for applicants

1 and 2?

10

No! Your prediction should be a function of x_k

• Based on the value of 𝑥

of each unit, we want to predict her 𝑦.

• E.g.: in the loan example, we want to predict the amount that unit

𝑘 will default on her loan based on her FICO score.

• Assume that applicant 1 has a very high (good) credit score, while

applicant 2 has a very low (bad) credit score.

• Should you predict the same value of 𝑦 for applicants 1 and 2?

• No! If you want your prediction to be accurate, you should predict a

higher value of 𝑦 for applicant 2 than for applicant 1.

• Your prediction should be a function of x_k, f(x_k).

• In these lectures, we focus on predictions which are a linear

function of x_k: f(x_k) = a·x_k, for some real number a.

• Which measure can you use to assess whether a·x_k is a good

prediction of y_k? Discuss this question with your neighbor for 1

minute.

11

iClicker time

• To assess whether a·x_k is a good prediction of y_k,

we should use:

12

y_k − a·x_k !

• Based on the value of x_k of each unit, we want to predict her y_k.

• Our prediction should be a function of x_k, f(x_k). We focus on

predictions which are a linear function of x_k: f(x_k) = a·x_k, for

some real number a.

• Which measure can you use to assess whether a·x_k is a good

prediction?

• y_k − a·x_k, the difference between your prediction and y_k.

• In the loan example, if y_k − a·x_k is large and positive, our prediction

is much below the amount applicant k will fail to reimburse.

• If y_k − a·x_k is large and negative, our prediction is much above the

amount person k will fail to reimburse.

• Large positive or negative values of y_k − a·x_k mean a bad prediction.

• y_k − a·x_k close to 0 means a good prediction.

13

iClicker time

• Which of the following 3 possible values of a should we

choose to ensure that a·x_k predicts y_k well

in the population?

a) The value of a that maximizes Σ_{k=1}^N (y_k − a·x_k).

b) The value of a that minimizes Σ_{k=1}^N (y_k − a·x_k).

c) The value of a that minimizes Σ_{k=1}^N (y_k − a·x_k)².

14

Minimizing Σ_{k=1}^N (y_k − a·x_k) won't work!

• Minimizing Σ_{k=1}^N (y_k − a·x_k) means we try to avoid

positive prediction errors, but we also try to make

the largest possible negative prediction errors!

• Not a good idea: we will systematically overestimate

y_k.

• We want a criterion that deals symmetrically with

positive and negative errors: we want to avoid both

positive and negative errors.

15

Answer: find the value of a that minimizes

Σ_{k=1}^N (y_k − a·x_k)²

• Σ_{k=1}^N (y_k − a·x_k)² is positive. => minimizing it is the

same thing as making it as close to 0 as

possible.

• If Σ_{k=1}^N (y_k − a·x_k)² is as close to 0 as possible, it

means that the sum of the squared values of our

prediction errors is as small as possible.

• => we make small errors. That's good, that's

what we want!

16

Which prediction function is the best?

• The population has 11 units. The x_k and y_k of those 11 units are

shown on the graph: blue dots.

• Two linear prediction functions for y_k: 0.5x_k and 0.8x_k.

• Which one is the best? Discuss this for 1 minute with your neighbor.

17

iClicker time

• On the previous slide, which function of x_k

gives the best prediction for y_k:

a) 0.5x_k

b) 0.8x_k

18

0.5x_k is the best prediction function!

• It is the function for which the sum of the squared

prediction errors is the smallest.

19

[Graph: prediction error of 0.5x for the person with x=6; prediction error of 0.8x for the person with x=8]
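The comparison on this slide can be reproduced in a few lines. The 11 (x, y) points below are made up for the illustration, not the slide's blue dots; the point is only that the candidate slope with the smaller sum of squared errors wins:

```python
# Comparing the two candidate prediction functions 0.5x and 0.8x by their
# sum of squared prediction errors. The 11 points are illustrative.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0.2, 0.7, 0.9, 1.8, 2.1, 2.4, 3.2, 3.3, 4.1, 4.4, 5.2]

def sse(a):
    """Sum of squared prediction errors of the prediction function a*x."""
    return sum((y - a * x) ** 2 for x, y in zip(xs, ys))

# On these made-up points 0.5x wins: its squared errors sum to far less.
assert sse(0.5) < sse(0.8)
print(sse(0.5), sse(0.8))
```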

The OLS univariate linear regression function

in the population.

• Let α = argmin_{a∈R} Σ_{k=1}^N (y_k − a·x_k)².

• We call α·x_k the ordinary least squares (OLS) univariate linear

regression function of y_k on x_k in the population.

• Least squares: because α·x_k minimizes the sum of the squared

differences between y_k and a·x_k.

• Ordinary: because there are fancier ways of doing least squares.

• Univariate: because there is only one independent variable in the

regression, x_k.

• Linear: because the regression function is a linear function of x_k.

• In the population: because we use the y_k's and x_k's of all the N units

in the population.

• Shortcut: OLS regression of y_k on x_k in the population.

20

Decomposing y_k between predicted value

and residual.

• α: coefficient of x_k in the OLS regression of y_k on x_k in the population.

• Let ŷ_k = α·x_k. ŷ_k is the predicted value for y_k according to the OLS

regression of y_k on x_k in the population.

• Let e_k = y_k − ŷ_k. e_k: the error we make when we use the OLS regression

in the population to predict y_k.

• We have y_k = ŷ_k + e_k:

y_k = predicted value + error.

21

[Graph: the 11 (x, y) points and the line 0.5x; the predicted value of y for the person with x=8, and the prediction error for the person with x=8]

Finding a formula for α when N = 2.

• Assume for a minute that N = 2: there are

only two units in the population.

• Then α is the value of a that minimizes

(y_1 − a·x_1)² + (y_2 − a·x_2)².

• Find a formula for α, as a function of x_1, y_1,

x_2, and y_2. You have 3 minutes to try to find

the answer. Hint: you need to compute the

derivative of (y_1 − a·x_1)² + (y_2 − a·x_2)² with

respect to a, and then α is the value of a for

which that derivative is equal to 0.

22

iClicker time

• If N = 2, α is equal to:

a) (x_1·y_1 + x_2·y_2) / (x_1 + x_2)

b) (x_1·y_1 + x_2·y_2) / (x_1² + x_2²)

c) (x_1²·y_1 + x_2²·y_2) / (x_1 + x_2)

23

When N = 2, α = (x_1·y_1 + x_2·y_2) / (x_1² + x_2²).

• If N = 2, α is the value of a that minimizes (y_1 − a·x_1)² + (y_2 − a·x_2)².

• The derivative of that function wrt a is:

−2x_1(y_1 − a·x_1) − 2x_2(y_2 − a·x_2).

• Let's find the value of a for which the derivative = 0.

−2x_1(y_1 − a·x_1) − 2x_2(y_2 − a·x_2) = 0

iff −2x_1·y_1 + 2a·x_1² − 2x_2·y_2 + 2a·x_2² = 0

iff 2a(x_1² + x_2²) = 2(x_1·y_1 + x_2·y_2)

iff a = (x_1·y_1 + x_2·y_2) / (x_1² + x_2²).

• The second line of the derivation shows that the derivative is increasing in a. => if

a < (x_1·y_1 + x_2·y_2) / (x_1² + x_2²), the derivative is negative. If a > (x_1·y_1 + x_2·y_2) / (x_1² + x_2²), the derivative is

positive.

• The function reaches its minimum at (x_1·y_1 + x_2·y_2) / (x_1² + x_2²). => α = (x_1·y_1 + x_2·y_2) / (x_1² + x_2²).

24
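The N = 2 derivation above can be sanity-checked numerically. The sketch below uses made-up values for (x_1, y_1) and (x_2, y_2); it verifies that the closed-form slope beats nearby slopes:

```python
# Sanity check of the N = 2 formula: alpha = (x1*y1 + x2*y2) / (x1**2 + x2**2).
# The four numbers below are made up for the illustration.
x1, y1 = 2.0, 1.5
x2, y2 = 5.0, 2.0

alpha = (x1 * y1 + x2 * y2) / (x1**2 + x2**2)

def sse(a):
    """Sum of squared prediction errors (y1 - a*x1)^2 + (y2 - a*x2)^2."""
    return (y1 - a * x1) ** 2 + (y2 - a * x2) ** 2

# The closed-form slope should beat any nearby slope.
assert all(sse(alpha) <= sse(alpha + d) for d in (-0.1, -0.01, 0.01, 0.1))
print(round(alpha, 4))  # → 0.4483
```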

Reminder: P4Sum

• P4Sum: let f_1(a), f_2(a), …, f_N(a) be

functions of a which are all differentiable wrt

a. Let f_1'(a), f_2'(a), …, f_N'(a) denote their

derivatives. Then, Σ_{k=1}^N f_k(a) is differentiable

wrt a, and its derivative is Σ_{k=1}^N f_k'(a).

• In words: the derivative of a sum is the sum of

its derivatives.

25

Finding a formula for α for any value of N.

• Let's get back to the general case where N is

left unspecified.

• Remember, α is the value of a that minimizes

Σ_{k=1}^N (y_k − a·x_k)².

• Find a formula for α, as a function of x_1,…,x_N

and y_1,…,y_N. You have 3 minutes to find the

answer. Hint: you need to compute the

derivative of Σ_{k=1}^N (y_k − a·x_k)² with respect to

a, and then α is the value of a for which that

derivative is equal to 0.

26

iClicker time

• α is equal to:

a) (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k)

b) (x_1·y_1 + x_2·y_2) / (x_1² + x_2²)

c) (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²)

27

α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²).

• α minimizes Σ_{k=1}^N (y_k − a·x_k)². The derivative wrt a

is: Σ_{k=1}^N [−2x_k(y_k − a·x_k)]. Why?

• Let's find the value of a for which the derivative = 0.

Σ_{k=1}^N −2x_k(y_k − a·x_k) = 0

iff Σ_{k=1}^N (−2x_k·y_k + 2a·x_k²) = 0

iff Σ_{k=1}^N −2x_k·y_k + Σ_{k=1}^N 2a·x_k² = 0

iff −2 Σ_{k=1}^N x_k·y_k + 2a Σ_{k=1}^N x_k² = 0

iff 2a Σ_{k=1}^N x_k² = 2 Σ_{k=1}^N x_k·y_k

iff a = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²).

• The function reaches its minimum at (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²). =>

α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²).

28
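The general formula can be checked the same way as the N = 2 case. The tiny "population" below is illustrative, not the lecture's data; a brute-force grid search confirms the closed-form slope is the minimizer:

```python
# The population OLS slope through the origin: alpha = sum(x*y) / sum(x**2).
# The five (x, y) pairs below are made up for the illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.8, 1.1, 1.9, 2.3, 2.4]

alpha = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def sse(a):
    """Sum of squared prediction errors over the whole population."""
    return sum((y - a * x) ** 2 for x, y in zip(xs, ys))

# Brute-force check: no slope on a fine grid should do better.
best = min((i / 1000 for i in range(-2000, 2001)), key=sse)
assert abs(best - alpha) < 1e-3
print(round(alpha, 4))  # → 0.5436
```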

What you need to remember

• Population of N units. Each unit k has 2 variables attached to it:

y_k is a variable we do not observe, x_k is a variable we observe.

• We want to predict the y_k of each unit based on her x_k.

• E.g.: a bank wants to predict the amount an applicant will fail to

reimburse on her loan based on her FICO score.

• Our prediction should be a function of x_k, f(x_k).

• For now, focus on linear functions of x_k: a·x_k for some number a.

• A good prediction should be such that y_k − a·x_k, the difference

between the prediction and y_k, is as small as possible for most units.

• The best value of a is the one that minimizes Σ_{k=1}^N (y_k − a·x_k)².

• We call that value α, and we call α·x_k the OLS univariate linear

regression function of y_k on x_k.

• If N = 2, α = (x_1·y_1 + x_2·y_2) / (x_1² + x_2²). You should know how to prove that.

• In general, α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²). You should know how to prove that.

29

Roadmap

1. The OLS univariate linear regression function.

2. Estimating the OLS univariate linear regression function.

3. OLS univariate linear regression in practice.

30

Can we compute α?

• Our prediction for y_k based on a univariate

linear regression is α·x_k, the univariate linear

regression function.

• => to be able to make a prediction for a unit's

y_k based on her x_k, we need to know the

value of α.

• Under the assumptions we have made so far,

can we compute α? Discuss this question with

your neighbor during 1 minute.

31

iClicker time

• Under the assumptions we have made so far,

can we compute α?

a) Yes

b) No

32

We do not observe the y_k's, => we cannot

compute α

• Remember, we have assumed that we observe

the x_k's of everybody in the population (e.g.

applicants' FICO scores) but not the y_k's (e.g.

the amount that a person applying for a one-

year loan in April 2018 will fail to reimburse in

April 2019 when that loan expires).

• => we cannot compute

α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²).

33

But we can estimate α if we observe the

y_k's of a sample of the population.

• But we can estimate α if we observe the y_k's of a

sample of the population.

• E.g.: in the Gmail example, we could select a

random sample of emails, and ask a human to

determine whether those emails are spams or

not.

34

Randomly sampling one unit.

• Assume we randomly select one unit in the population, and we

measure the dependent and the independent variable of that unit.

• E.g.: we randomly select one email out of all the emails reaching

Gmail accounts on May 1st, 2018, and we check whether it is a

spam or not, and whether it contains the word "free" or not.

• Let Y_1 and X_1 respectively denote the values of the dependent and

of the independent variable for that randomly selected unit.

• Y_1 and X_1 are random variables, because their values depend on

which unit of the population we randomly select.

• If we select the 34th unit in the population, Y_1 = y_34 and X_1 = x_34.

• Each unit in the population has the same probability, 1/N, of being

selected.

• What is the value of E(X_1·Y_1)? Hint: E(X_1·Y_1) is a function of all the

y_k's and of all the x_k's. Discuss this question with your neighbor

during 2 minutes.

35

iClicker time

• Assume we randomly select one unit in the population, and we

measure the dependent and the independent variable of that unit.

• Let Y_1 and X_1 respectively denote the values of the dependent and

of the independent variable for that randomly selected unit.

• Y_1 and X_1 are random variables, because their values depend on

which unit of the population we randomly select.

• Each unit in the population has a probability 1/N of being selected.

• What is the value of E(X_1·Y_1)?

a) E(X_1·Y_1) = x_k·y_k

b) E(X_1·Y_1) = (1/N) Σ_{k=1}^N x_k·y_k

c) E(X_1·Y_1) = Σ_{k=1}^N x_k·y_k

36

• X_1·Y_1 is equal to:

– x_1·y_1 if the first individual in the population is

selected, which has a probability 1/N of happening

– x_2·y_2 if the second individual in the population is

selected, which has a probability 1/N of happening

– …

– x_N·y_N if the Nth individual in the population is

selected, which has a probability 1/N of happening

• Therefore, E(X_1·Y_1) = Σ_{k=1}^N (1/N)·x_k·y_k = (1/N) Σ_{k=1}^N x_k·y_k.

• What is the value of E(X_1²)? Discuss this

question with your neighbor during 1 minute.

37

iClicker time

• We randomly select one unit, and we measure the dependent

and the independent variable of that unit.

• Let Y_1 and X_1 respectively denote the values of the dependent

and of the independent variable for that randomly selected

unit.

• Y_1 and X_1 are random variables, because their values depend

on which unit of the population we randomly select.

• Each unit in the population has a probability 1/N of being

selected.

• What is the value of E(X_1²)?

a) E(X_1²) = (1/N) Σ_{k=1}^N x_k²

b) E(X_1²) = (1/N) Σ_{k=1}^N x_k

38

• X_1² is equal to:

– x_1² if the first individual in the population is

selected, which has a probability 1/N of happening

– x_2² if the second individual in the population is

selected, which has a probability 1/N of happening

– …

– x_N² if the Nth individual in the population is

selected, which has a probability 1/N of happening

• Therefore, E(X_1²) = Σ_{k=1}^N (1/N)·x_k² = (1/N) Σ_{k=1}^N x_k².

39
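The two expectations above can be illustrated numerically. The sketch below uses a made-up four-unit population and a Monte Carlo check (drawing one unit uniformly at random, many times over):

```python
import random

# E(X1*Y1) and E(X1^2) under uniform sampling of one unit are the population
# averages (1/N)*sum(x_k*y_k) and (1/N)*sum(x_k^2). Illustrative data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5, 1.0, 2.5, 3.0]
N = len(xs)

e_xy = sum(x * y for x, y in zip(xs, ys)) / N   # = 5.5
e_x2 = sum(x * x for x in xs) / N               # = 7.5

# Monte Carlo check: draw one unit uniformly at random, many times over.
random.seed(0)
draws = [random.randrange(N) for _ in range(200_000)]
mc_xy = sum(xs[k] * ys[k] for k in draws) / len(draws)
mc_x2 = sum(xs[k] ** 2 for k in draws) / len(draws)

assert abs(mc_xy - e_xy) < 0.1 and abs(mc_x2 - e_x2) < 0.1
```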

Randomly sampling n units.

• We randomly draw n units with replacement from the

population, and we measure the dependent and the

independent variable of those n units.

• For every i between 1 and n, Y_i and X_i = the values of the

dependent and of the independent variable of the ith unit we

randomly select.

• The Y_i's and X_i's are independent and identically distributed.

• For every i between 1 and n, E(X_i·Y_i) = (1/N) Σ_{k=1}^N x_k·y_k

and E(X_i²) = (1/N) Σ_{k=1}^N x_k².

40

A method to estimate α.

• We want to use the Y_i's and the X_i's to estimate α.

• Remember: α is the value of a that minimizes Σ_{k=1}^N (y_k − a·x_k)².

• => to estimate α, we could use α̂, the value of a that minimizes

Σ_{i=1}^n (Y_i − a·X_i)².

• Instead of finding the value of a that minimizes the sum of squared

prediction errors in the population, find the value of a that minimizes

the sum of squared prediction errors in the sample.

• Intuition: if we find a method to predict the dependent

variable well in the sample, that method should also work well in the full

population, as our sample is representative of the population.

41

The OLS regression function in the sample.

• Let α̂ = argmin_{a∈R} Σ_{i=1}^n (Y_i − a·X_i)².

• We call α̂·X_i the OLS regression function of Y_i on X_i in the sample.

• In the sample: because we only use the Y_i's and X_i's of the n units in

the sample we randomly draw from the population.

• α̂: coefficient of X_i in the OLS regression of Y_i on X_i in the sample.

• Let Ŷ_i = α̂·X_i. Ŷ_i is the predicted value for Y_i according to the OLS

regression of Y_i on X_i in the sample.

• Let ê_i = Y_i − Ŷ_i. ê_i: the error we make when we use the OLS regression in

the sample to predict Y_i.

• We have Y_i = Ŷ_i + ê_i.

• Find a formula for α̂, the value of a that minimizes Σ_{i=1}^n (Y_i − a·X_i)².

Hint: differentiate this function wrt a and find the value of a that

cancels the derivative.

42

iClicker time

• The value of a that minimizes Σ_{i=1}^n (Y_i − a·X_i)² is:

a) (Σ_{i=1}^n X_i²) / (Σ_{i=1}^n X_i·Y_i)

b) (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²)

c) (Σ_{i=1}^n X_i) / (Σ_{i=1}^n Y_i)

43

The value of a minimizing Σ_{i=1}^n (Y_i − a·X_i)² is (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²)

• The derivative wrt a of Σ_{i=1}^n (Y_i − a·X_i)² is: Σ_{i=1}^n [−2X_i(Y_i − a·X_i)]. Why?

• Let's find the value of a for which the derivative = 0.

Σ_{i=1}^n −2X_i(Y_i − a·X_i) = 0

iff Σ_{i=1}^n (−2X_i·Y_i + 2a·X_i²) = 0

iff Σ_{i=1}^n −2X_i·Y_i + Σ_{i=1}^n 2a·X_i² = 0

iff −2 Σ_{i=1}^n X_i·Y_i + 2a Σ_{i=1}^n X_i² = 0

iff 2a Σ_{i=1}^n X_i² = 2 Σ_{i=1}^n X_i·Y_i

iff a = (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²).

α̂ = (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²)

44
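The estimator above is easy to compute on a sample drawn with replacement. The population below is simulated for illustration (the true slope is set near 0.5, so we can see the estimate land nearby):

```python
import random

# alpha_hat = sum(X_i*Y_i) / sum(X_i^2) from a sample drawn with replacement.
# The population below is simulated for illustration (true slope near 0.5).
random.seed(1)
xs = [random.uniform(1, 10) for _ in range(1_000)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]

# Population coefficient (computable here only because we simulated the y's).
alpha = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Draw n = 200 units with replacement, as in the sampling scheme above.
n = 200
idx = [random.randrange(len(xs)) for _ in range(n)]
alpha_hat = sum(xs[k] * ys[k] for k in idx) / sum(xs[k] ** 2 for k in idx)

# With n = 200, alpha_hat should already sit close to alpha.
assert abs(alpha_hat - alpha) < 0.1
```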

Reminder: the law of large numbers.

• LLN: let Z_1, …, Z_n be iid random variables,

and let E(Z_1) denote their expectation.

lim_{n→+∞} (1/n) Σ_{i=1}^n Z_i = E(Z_1).

• When the sample size grows, the average of n

iid random variables converges towards their

expectation.

45
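A minimal illustration of the law of large numbers (the fair die is my example, not the slides'; its expectation is 3.5):

```python
import random

# Law of large numbers: the average of n iid draws converges to the
# expectation. Here each Z_i is a fair die roll, so E(Z) = 3.5.
random.seed(4)

def mean_of(n):
    """Average of n iid die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# A large sample's average sits close to 3.5 (up to randomness).
assert abs(mean_of(100_000) - 3.5) < 0.05
```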

α̂ converges towards α when the sample size

grows.

• We randomly draw n units with replacement from the population,

and we measure the dependent and the independent variable of

those n units.

• For every i between 1 and n, Y_i and X_i = the values of the

dependent and of the independent variable of the ith unit we

randomly select.

• Because the n units are drawn with replacement, the (Y_i, X_i)'s are iid,

and therefore the X_i·Y_i's are iid, and the X_i²'s are also iid.

• For every i between 1 and n, E(X_i·Y_i) = (1/N) Σ_{k=1}^N x_k·y_k and

E(X_i²) = (1/N) Σ_{k=1}^N x_k².

• α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²) and α̂ = (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²).

• Use the law of large numbers to show that lim_{n→+∞} α̂ = α. Hint: you

need to use the fact that E(X_i·Y_i) = (1/N) Σ_{k=1}^N x_k·y_k and E(X_i²) =

(1/N) Σ_{k=1}^N x_k².

46

iClicker time

• Which of the following two arguments is correct:

a) The law of large numbers implies that

lim_{n→+∞} Σ_{i=1}^n X_i·Y_i = E(X_1·Y_1) and

lim_{n→+∞} Σ_{i=1}^n X_i² = E(X_1²). We have

E(X_1·Y_1) = (1/N) Σ_{k=1}^N x_k·y_k and E(X_1²) = (1/N) Σ_{k=1}^N x_k².

Therefore, lim_{n→+∞} α̂ = α.

b) The law of large numbers implies

that lim_{n→+∞} (1/n) Σ_{i=1}^n X_i·Y_i = E(X_1·Y_1) and

lim_{n→+∞} (1/n) Σ_{i=1}^n X_i² = E(X_1²). We have

E(X_1·Y_1) = (1/N) Σ_{k=1}^N x_k·y_k and E(X_1²) = (1/N) Σ_{k=1}^N x_k². Therefore,

lim_{n→+∞} α̂ = α.

47

The second argument is correct

• We randomly draw n units with replacement from the population, and we

measure the dependent and the independent variable of those n units.

• For every i between 1 and n, Y_i and X_i = the values of the dependent

and of the independent variable of the ith unit we randomly select.

• Because the n units are drawn with replacement, the (Y_i, X_i)'s are iid.

• For every i between 1 and n, E(X_i·Y_i) = (1/N) Σ_{k=1}^N x_k·y_k and

E(X_i²) = (1/N) Σ_{k=1}^N x_k².

• The law of large numbers implies that lim_{n→+∞} (1/n) Σ_{i=1}^n X_i·Y_i = E(X_1·Y_1) and

lim_{n→+∞} (1/n) Σ_{i=1}^n X_i² = E(X_1²).

• Moreover, we have E(X_1·Y_1) = (1/N) Σ_{k=1}^N x_k·y_k and E(X_1²) = (1/N) Σ_{k=1}^N x_k².

• Therefore:

lim_{n→+∞} α̂

= lim_{n→+∞} (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²)

= lim_{n→+∞} [(1/n) Σ_{i=1}^n X_i·Y_i] / [(1/n) Σ_{i=1}^n X_i²]

= [lim_{n→+∞} (1/n) Σ_{i=1}^n X_i·Y_i] / [lim_{n→+∞} (1/n) Σ_{i=1}^n X_i²]

= E(X_1·Y_1) / E(X_1²)

= [(1/N) Σ_{k=1}^N x_k·y_k] / [(1/N) Σ_{k=1}^N x_k²]

= (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²)

= α.

48
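The convergence on this slide can be watched happening in code. The population is simulated for illustration (true slope set near 0.5); with a large n, α̂ sits very close to α:

```python
import random

# alpha_hat converging to alpha as n grows (law of large numbers).
# The population is simulated for illustration; true slope near 0.5.
random.seed(2)
xs = [random.uniform(1, 10) for _ in range(5_000)]
ys = [0.5 * x + random.gauss(0, 2) for x in xs]
alpha = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def alpha_hat(n):
    """Slope estimate from n units drawn with replacement."""
    idx = [random.randrange(len(xs)) for _ in range(n)]
    return sum(xs[k] * ys[k] for k in idx) / sum(xs[k] ** 2 for k in idx)

# The error at n = 10_000 is small; at n = 10 it is typically much larger.
assert abs(alpha_hat(10_000) - alpha) < 0.05
```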

What you need to remember

• The prediction for y_k based on the OLS regression in the population is α·x_k,

with α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²).

• We would like to compute α but we cannot, because we do not observe

the y_k's of everybody in the population.

• => we randomly draw n units with replacement from the population, and

measure the dependent and the independent variable of those n units.

• For every i between 1 and n, Y_i and X_i = the values of the dependent and

independent variables of the ith unit we randomly select.

• Given that α is the value of a that minimizes Σ_{k=1}^N (y_k − a·x_k)², we use α̂, the

value of a that minimizes Σ_{i=1}^n (Y_i − a·X_i)², to estimate α.

• We have α̂ = (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²).

• The law of large numbers implies that lim_{n→+∞} α̂ = α.

• When the sample we randomly draw gets large, α̂, the sample coefficient

of the regression, gets close to α, the population coefficient.

• Therefore, α̂ is a good proxy for α when the sample size is large enough.

49

Roadmap

1. The OLS univariate linear regression function.

2. Estimating the OLS univariate linear regression function.

3. OLS univariate linear regression in practice.

50

How Gmail uses univariate linear regression (1/2)

• Gmail would like to predict y_k, a variable equal to 1 if email k is a

spam and 0 otherwise.

• To do so they use a variable x_k, equal to 1 if the word

"free" appears in the email and 0 otherwise.

• x_k is easy to measure (a computer can do it automatically, by

searching for "free" in the email), but y_k is hard to measure: only a

human can know for sure whether an email is a spam or not. =>

they cannot observe y_k for all emails reaching Gmail.

• To make good predictions, they would like to compute α, the value

of a that minimizes Σ_{k=1}^N (y_k − a·x_k)², and then use α·x_k to predict

y_k. α·x_k: the best univariate linear prediction of y_k given x_k.

• Issue: α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²) => they cannot compute it unless they observe

all the y_k's, but that would be very costly to do (a human would have to

read all the emails reaching Gmail accounts), plus once it's done we

no longer need to predict the y_k's because we know them.

51

How Gmail uses univariate linear regression (2/2)

• Instead Gmail can draw a random sample of, say, 5000

emails, and ask humans to read them and determine whether

they are spams or not.

• For every i between 1 and 5000, let Y_i denote whether the

ith randomly drawn email is a spam or not, and let X_i

denote whether the ith randomly drawn email has the

word "free" in it.

• Then, people at Gmail can compute α̂ = (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²).

• For all the emails they have not randomly drawn, and for

which they do not observe y_k, they can use α̂·x_k as their

prediction of whether the email is a spam or not.

• Because their random sample of emails is large, α̂ should

be close to α, and therefore α̂·x_k should be close to α·x_k,

the best univariate linear prediction of y_k given x_k.

52
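A sketch of the Gmail computation with simulated data (all the probabilities below are made up). One useful observation: because x_k is 0 or 1 here, x_k² = x_k, so α̂ reduces to the share of spam among sampled emails that contain "free":

```python
import random

# Gmail sketch with a binary regressor. When x is 0 or 1, x**2 == x, so
# alpha_hat = sum(X*Y)/sum(X**2) is just the share of spam among the
# sampled emails that contain "free". All numbers below are made up.
random.seed(3)
n = 5_000
X = [random.random() < 0.3 for _ in range(n)]           # contains "free"?
Y = [random.random() < (0.7 if x else 0.1) for x in X]  # is spam?

alpha_hat = sum(x and y for x, y in zip(X, Y)) / sum(X)

free_spam_share = sum(y for x, y in zip(X, Y) if x) / sum(X)
assert alpha_hat == free_spam_share
```

So for an incoming email containing "free", the regression's prediction α̂·1 is that sampled spam share, while an email without "free" gets a prediction of 0.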

How banks use univariate linear regression (1/2)

• A bank would like to predict y_k, a variable equal to the

amount that a person applying in April 2018 for a one-year

loan will fail to reimburse in April 2019 when her loan expires.

• To do so they use a variable x_k, equal to the FICO score of that

applicant.

• x_k is easy to measure (the bank has access to the FICO scores

of all applicants), but y_k is impossible to measure today: it's

only in April 2019 that the bank will know the amount the

applicant fails to reimburse.

• To make good predictions, they would like to compute α, the

value of a that minimizes Σ_{k=1}^N (y_k − a·x_k)², and then use

α·x_k to predict y_k. α·x_k: the best univariate linear prediction of

y_k given x_k.

• Issue: α = (Σ_{k=1}^N x_k·y_k) / (Σ_{k=1}^N x_k²) => they cannot compute it because they do

not observe the y_k's.

53

How banks use univariate linear regression (2/2)

• Instead, the bank can use data on people who applied in April 2017

for a one‐year loan. For those people, they know how much they

failed to reimburse on their loan. Let’s assume that the bank has

1000 applicants in April 2018, and 1000 applicants in April 2017.

• For every i between 1 and 1000, let Y_i denote the amount that the

ith April 2017 applicant failed to reimburse on her loan, and let X_i

denote the FICO score of that applicant.

• Then, people at the bank can compute α̂ = (Σ_{i=1}^n X_i·Y_i) / (Σ_{i=1}^n X_i²).

• For their April 2018 applicants, for whom they do not observe y_k,

they can use α̂·x_k as their prediction of the amount each applicant

will fail to reimburse.

• Which condition should be satisfied to ensure α̂ is close to α? Hint:

look again at the Gmail example. There is one difference in the way

we select the observations for which we measure Y_i in the bank and

in the Gmail examples. Discuss this question with your neighbor for

one minute.

54

iClicker time

• Which condition should be satisfied to ensure α̂ is

close to α?

55

April 2017 and 2018 applicants should look similar

• Previous section: α̂ converges towards α if the sample of units for which we

observe Y_i is randomly drawn from the population.

• The bank cannot draw a random sample of April 2018 applicants and observe

today the amount this sample will fail to reimburse.

• Instead, it can use the April 2017 applicants, for whom it can measure both Y_i,

the amount that each applicant failed to reimburse, and X_i, the FICO score.

• Then, it can compute α̂, and for each April 2018 applicant it can use α̂·x_k as

its prediction of y_k, the amount each applicant will fail to reimburse.

• If April 2017 applicants are "as good as" a random sample from the combined

population of April 2017 and April 2018 applicants, then all our theoretical

results apply: α̂ should be close to α, and our predictions should be good.

• To assume that April 2017 applicants are almost a random sample from the

population of April 2017 and April 2018 applicants, the April 2017 and April 2018

applicants should look very similar. E.g.: they should have similar FICO scores, demographics…

• => if the two groups look similar, α̂·x_k should be a good prediction of y_k for 2018

applicants. Otherwise, we have to be careful.

56

What you need to remember, and what’s next

• In practice, there are many instances where we can measure the

y_k's, the variable we do not observe for everyone, for a subsample

of the population.

• We can use that subsample to compute α̂, and then use α̂·x_k as

our prediction of the y_k's we do not observe.

• If that subsample is a random sample from the population (Gmail

example), α̂·x_k should be close to α·x_k, the best linear prediction for y_k.

• On the other hand, if that subsample is not a random sample from

the population (bank example), α̂·x_k will be close to α·x_k only if the

subsample looks pretty similar to the entire population (almost a

random sample).

• Even when we have a random sample, univariate linear regression

might still not give great predictions.

• There are better prediction methods available. Next lectures: we

will see one of them.

57
