# Introduction to Artificial Intelligence

CS188 Intro to AI Problem Set
Please see the problem set in attachment. Coding part is optional.Course material: https://inst.eecs.berkeley.edu/~cs188/fa20/

CS 188
Fall 2020
Introduction to
Artificial Intelligence Written HW 4
Due: Wednesday 12/02/2020 at 11:59pm (submit via Gradescope).
Policy: Can be solved in groups (acknowledge collaborators) but must be written up individually
Submission: Your submission should be a PDF that matches this template. Each page of the PDF should
align with the corresponding page of the template (page 1 has name/collaborators, question 1 begins on page
2, etc.). Do not reorder, split, combine, or add extra pages. The intention is that you print out the
template, write on the page in pen/pencil, and then scan or take pictures of the pages to make your submission.
You may also fill out this template digitally (e.g. using a tablet.)
First name
Last name
SID
Collaborators
For staff use only:
Q1. Probabilistic Language Modeling /40
Q2. Machine Learning /40
Q3. Optional: Programming Naive Bayes /0
Total /80
1
Q1. [40 pts] Probabilistic Language Modeling
In lecture, you saw an example of supervised learning where we used Naive Bayes for a binary classification problem:
To predict whether an email was ham or spam. To do so, we needed a labeled (i.e., ham or spam) dataset of emails.
To avoid this requirement for labeled datasets, let’s instead explore the area of unsupervised learning, where we don’t
need a labeled dataset. In this problem, let’s consider the setting of language modeling.
Language modeling is a field of Natural Language Processing (NLP) that tries to model the probability of the next
word, given the previous words. Here, instead of predicting a binary label of “yes” or “no,” we instead need to
predict a multiclass label, where the label is the word (from all possible words of the vocabulary) that is the correct
word for the blank that we want to fill in.
One possible way to model this problem is with Naive Bayes. Recall that in Naive Bayes, the features X1, …, Xm
are assumed to be pairwise independent when given the label Y . For this problem, let Y be the word we are trying
to predict, and our features be Xi for i = n, …, 1,1, …, n, where Xi = ith word i places from Y . (For example,
X
2 would be the word 2 places in front of Y . Again, recall that we assume each feature Xi to be independent of
each other, given the word Y . For example, in the sequence Neural networks ____ a lot, X2 = Neural, X1 =
networks, Y = the blank word (our label), X1 = a, and X2 = lot.
(a) First, let’s examine the problem of language modeling with Naive Bayes.
(i) [1 pt] Draw the Bayes Net structure for the Naive Bayes formulation of modeling the middle word of a
sequence given two preceding words and two succeeding words. You may think of the example sequence
listed above:
Neural networks ____ a lot.
(ii) [1 pt] Write the joint probability P(X2, X1, Y, X1, X2) in terms of the relevant Conditional Probability
Tables (CPTs) that describe the Bayes Net.
(iii) [1 pt] What is the size of the largest CPT involved in calculating the joint probability? Assume a
vocabulary size of V , so each variable can take on one of possible V values.
(iv) [1 pt] Write an expression of what label y that Naive Bayes would predict for Y (Hint: Your answer
should involve some kind of arg max and CPTs.)
(v) [3 pts] Describe 2 problems with the Naive Bayes Approach for the general problem of language modeling.
Hint: do you see any problems with the assumptions that this approach makes?
2
Now, let’s change our setting a bit. Instead of trying to fill in a blank given surrounding words, we are now only
given the preceding words. Say that we have a sequence of words: X1, …, Xm1, Xm. We know {Xi}m i=0 1 but we
don’t know X
m.
(b) For this part, assume that every word is conditioned on all previous words. We will call this the Sequence
Model.
(i) [1 pt] Draw the Bayes Net (of only X1, X2, X3, X4, X5) for a 5-word sequence, where we want to predict
the fifth word in a sequence X5 given the previous 4 words X1, X2, X3, X4. Again, we are assuming here
that each word depends on all previous words.
(ii) [1 pt] Write an expression for the joint distribution of a general sequence of length m: P(X1, …, Xm).
(iii) [1 pt] What is the size of the largest CPT involved in calculating the joint probability? Assume a
vocabulary size of V , so each variable can take on one of possible V values.
(c) You should have gotten a very large number for the previous part, which shows how infeasible the sequence
model is. Instead of the model above, let’s now examing another modeling option: N-grams. In N-gram
language modeling, we add back some conditional assumptions to bound the size of the CPTs that we consider.
We limit the tokens of consideration from “all previous words” to instead using only “the previous N 1 words.”
This creates the conditional assumption that, given the previous N 1 words, the current word is independent
of any word before the previous N 1 words. For example, for N = 3, if we are trying to predict the 100th
word, then given the previous N 1 = 2 words (98th and 99th words), then the 100th word is independent of
words 1, . . . , 97 of the sequence.
(i) [1 pt] Making these additional conditional independence assumption changes our Bayes Net. Redraw the
Bayes Net from part (ci) to represent this new N-gram modeling of our 5-word sequence: X1, X2, X3, X4, X5.
Use N = 3.
3
(ii) [2 pts] Write an expression for the N-gram representation of the joint distribution of a general sequence
of length m: P (X1, …, Xm). Please use set notation (for example: For tokens Xi, …, Xj, please write
something of the form {Xk}j k=i). Your answer should express the joint distribution P ({Xi}m i=1), in terms
of m and N.
Hint: If you find it helpful, try it for the 5 word graph above first before going to a general m length
sequence.
(iii) [1 pt] What is the size of the largest CPT involved in calculating the joint probability above? Again,
assume a vocabulary size of V , and m > N.
(iv) [2 pts] Describe one disadvantage of using N-gram over Naive Bayes.
(v) [4 pts] Describe an advantage and disadvantage of using N-gram over the Sequence Model above.
4
(d) In this question, we see a real-world application of smoothing in the context of language modeling.
Say we have the following training corpus from Ted Geisel:
i am sam . sam i am . i do not like green eggs and ham .
Consider the counts given in the tables below, as calculated from the sentence above.
1-gram
Token Count
i 3
am 2
sam 2
. 3
do 1
not 1
like 1
green 1
eggs 1
and 1
ham 1
TOTAL 17
2-gram phrases starting with i
Token1 Token2 Count
i am 2
i do 1
TOTAL 3
2-gram phrases starting with am
Token1 Token2 Count
am sam 1
am . 1
TOTAL 2
(i) [1 pt] Based on the above dataset and counts, what is the N-gram estimate for N = 1, for the sequence
of 3 tokens i am ham? In other words, what is P(i, am, ham) for N = 1?
(ii) [1 pt] Based on the above dataset and counts, what is the N-gram estimate for N = 2, for the sequence
of 3 tokens i am ham? In other words, what is P(i, am, ham) for N = 2?
(iii) [3 pts] What is the importance of smoothing in the context of language modeling?
5
(iv) [5 pts] Perform Laplace k-smoothing on the above problem and re-compute P (i, am, ham) with the
smoothed distribution, for N = 2. In order to calculate this, complete the pseudocount column for
each entry in the probability tables. Note we add a new <unk> entry, which represents any token not in
the table.
Hint: the count for the new <unk> row in each table would be 0.
1-gram
Token Count Pseudocount
i 3
am 2
sam 2
. 3
do 1
not 1
like 1
green 1
eggs 1
and 1
ham 1
<unk> 0
TOTAL 17
2-gram phrases starting with “i”
Token1 Token2 Count Pseudocount
i am 2
i do 1
i <unk> 0
TOTAL 3
2-gram phrases starting with “am”
Token1 Token2 Count Pseudocount
am sam 1
am . 1
am <unk> 0
TOTAL 2
6
(v) [4 pts] What is a potential problem with Laplace smoothing? Propose a solution. (Assume that you have
chosen the best k, so finding the best k is not a problem.)
Hint: Consider the effect of smoothing on a small CPT.
(vi) [2 pts] Let the likelihood L(k) = P (i, am, sam), give an expression for the log likelihood ln L(k) of this
sequence after k-smoothing. Continue to assume N = 2.
(vii) [4 pts] Describe a procedure we could do to find a reasonable value of k. No mathematical computations
needed.
Hint: you might want to maximize the log likelihood ln L(k) on something.
7
Q2. [40 pts] Machine Learning
In this question, we will attempt to develop more intuition about how Neural Networks work. In parts A and B, we
will discuss gradient descent, and in part C we look at backprop.
(a) Gradient descent is a procedure which allows you to minimize any loss function. As as example, let’s consider
a simple function Loss(w) = w2 and let’s assume that we want to minimize this function. Perform gradient
descent on this loss function by using the update rule w w α dLoss dw , where α is the learning rate.
(i) [1 pt] What is dLoss dw ? Write your answer in terms of w.
(ii) [1 pt] What is the optimal w that minimizes this loss function? We denote this value of w as w.
(iii) [2 pts] Carry out one iteration of gradient descent (i.e., weight update). What is the resulting weight
(and corresponding post-update loss) for the scenarios below? Plot the loss function (w2) by hand and,
for each of the two scenarios below, draw the direction in which w is updated (an arrow on the w axis
from wold to wnew).
1. α = 0.1, w = 2
2. α = 1, w = 2
8
(iv) [4 pts] Assume w is initialized to some nonzero value. Assume we are still working with Loss(w) = w2
1. Which value of α allows gradient descent to make w converge to win the least amount of steps?
2. For what range of α does w never converge?
Hint: You may consider for which γ where wt = γwt1 will never converge.
(v) [2 pts] Why must α always be positive when performing gradient descent?
(b) It is unlikely that we have a loss function as nice as w2. Say we instead want to minimize some more complex
loss Loss(w) = w 24 + w 33 3w2 + 7, a polynomial with local minima at w = 2, 1.5, a global minimum at
w = 2, a local maximum at w = 0, and limits that go to infinity for both w → ∞ and w → −∞.
3 2 1 0 1 2 3
0
10
20
30
w
Loss(w)
9
(i) [3 pts] Why do Neural Networks use gradient descent for optimization instead of just taking the derivative
and setting it equal to 0? Explain in 1-2 sentences. You may use the example error function from above
(ii) [1 pt] What is the optimal w, given the loss above?
(iii) [3 pts] Let α and w take on the values below. For each case, perform some update steps and report
whether or not gradient descent will converge to the optimum wafter an infinite number of steps. If not,
report whether it converges to some other value, or does not converge to any value.
1. α = 1, w = 0
2. α = 1, w = 2
3. α = 1, w = 1
4. α = 0.1, w = 3
10
5. α = 0.1, w = 2
6. α = 0.1, w = 10
(iv) [3 pts] From the subquestion above, explain in 1-2 sentences the effect of learning rate being (a) too high
and (b) too low.
(c) Let’s now look at some basic neural networks drawn as computation graphs.
(i) [1 pt] Consider the following computation graph, which represents a 1 layer Neural Network.
x m
× b
+ f(x)
1. Write the equation for the network’s output (y = f(x)) in terms of m, x, b.
2. Describe the types of functions that can be encoded by such a function (given that the variables that
it can control are m and b).
11
(ii) [2 pts] Let’s stack two of the above graphs together to (potentially) represent a 2 layer “Neural Network.”
x m
1
× b1
+
×
m2 b2
+ f(x)
1. Write the equation for y = f(x) in terms of m1, m2, b1, b2, x.
2. Describe the types of functions that can be encoded by such a function (given that the variables that
it can control are the 4 learnable parameters m1, m2, b1, b2). Compare this with the previous neural
network’s expressive power.
3. Is this actually a 2-layer network? If it is, explain in 1-2 sentences. If not, rewrite it (algebraically)
as a 1-layer network with only 2 learnable weights. Why do neural networks need nonlinearities?
(iii) [4 pts] Now, let’s go back to the first NN and add a nonlinearity node. Recall ReLU(x) = max (0, x).
Also consider a loss function Loss(y, y) which represents the error between our network’s output (y) and
the true label (y) from the dataset. We will perform an abbreviated version of backpropagation on this
network.
x m
×
a
b
+ ReLU
y
Loss
y
1. Compute ∂Loss ∂a using Chain Rule. Use Mean Squared Error as the loss function, which is defined as
MSE(y, y) = (y y)2 where y= true label and y is the predicted output from the neural network.
2. Find ∂Loss
∂b . Note that since we are doing backprop, we can reuse calculations from part 1.
12
3. Find ∂Loss
∂m . Note that since we are doing backprop, we can reuse calculations from part 1.
4. What is the gradient descent update rule for updating m? What is the update rule for b?
For the next few parts, we analyze the Perceptron algorithm. In the perceptron algorithm, we predict +1 if ~wT f ~(x)
0, and predict 1 else, where f ~(x) is a feature vector.
(d) [3 pts] When implementing the perceptron algorithm with a neural network, the following function might be
of use: sign(x) = (1 if 1 if x x < 0 0. If we added this sign(x) node to our neural network drawings, what would
happen during backpropagation through this node?
Hint: what does the gradient look like for various x values?
13
(e) [2 pts] Draw the binary perceptron prediction function as a “neural-network”-styled computation graph. Assume 3 dimensional weight and feature vectors: that is, [w0, w1, w2] is the weight vector and [f0(x), f1(x), f2(x)]
is the feature vector. Recall that in the perceptron algorithm, we take the dot product of the weight vector
and the feature vector. In addition to the addition and multiplication nodes, add a loss node at the end, to
represent the prediction error which we would like to minimize. Label the edge which represents the perceptron
model’s output as y.
Hint: y = sign(w0 f0(x) + w1 f1(x) + w2 f2(x))
(f) [2 pts] Using Mean Squared Error (y y)2 as the loss function, compute ∂Loss ∂wi . Because of the problem you
noticed in the previous part with including the sign node, as we are doing chain rule below, use the custom
gradient ∂sign ∂x(x) = h ∂sign ∂x(x) icustom = 1.
14
(g) In this part, we will derive the gradient update rule for the perceptron using our graph above.
(i) [1 pt] The loss gradient is defined as wLoss =

∂Loss
∂w0
∂Loss
∂w1
∂Loss
∂w2

(ii) [4 pts] What is the gradient update rule ( ~w ~w αwLoss) for the cases below?
Hint: your answers will be in terms of f ~(x) and α.
1. y = 1, y= 1
2. y = 1, y= 1
3. y = y
(iii) [1 pt] For α = 1 4, compare the update rules you derived for the 3 cases above with the Perceptron update
formula in the notes and lecture. Briefly describe your observations.
15
Q3. [0 pts] Optional: Programming Naive Bayes
Now we will implement these ideas in code (in a Google Colab Notebook, link posted on piazza) to perform language
modeling, and then use the language model to generate some novel text!
(a) You will implement some of the math you computed earlier to complete the following functions in the provided
N-gram class. Note that if you follow our hints, this should not require more than 15 total lines of code for all
the functions below.
You can find the pdf under
Piazza/resource/CS 188 Fall 2020 Written Homework 4 Colab Instructions (Optional).pdf
You will need to implement the following functions:
• count_words
1. This function returns a dictionary with the count of each word in self.text_list.
2. HINT: You can do this in one line by using collections.Counter.
• calc_word_probs
1. This function converts a dictionary of counts from count_words and normalizes the counts into
probabilities.
2. HINT: You can do this in 1-2 lines by using self.normalize_dict(…)
• probs_to_neg_log_probs
1. This function converts an inputted dictionary of probabilities probs_dict into a dictionary of negative
log probabilities.
2. HINT: Use np.log.
1. This function is a little more complicated. Given a length N 1 tuple of tokens and their associated counts (frequencies), this function searches through all the length N phrases it has stored in
self.adj_counter (or is passed in via the argument adj_counter) and returns a dictionary with
only the length N phrases with the same first N 1 words as word_tuple, plus their associated counts
(frequencies). See the docstring for a concrete example.
2. HINT1: Use phrase[:len(word_tuple)] to get a tuple of the first N 1 words of each N-length
phrase in the adj_counter to compare with word_tuple.
3. HINT2: We are returning the filtered dictionary which is stored in the variable subset_word_adj_counter,
so you need to modify this dictionary in some way.
• p_naive
1. This function calculates the non-smoothed empirical probability of a length n phrase occurring given
length n1 tuple of tokens prev_phrase. In other words, it calculates P(current token|previous N – 1 tokens).
The probability is based on counts, exactly like how we calculated probabilities in the green eggs and
ham example earlier in this problem without smoothing.
2. HINT1: You need to define prob because it is being returned.
3. HINT2: You need to normalize filtered_adj_counter which is already defined for you.
• calc_neg_log_prob_of_sentence
1. This function calculates and returns the negative log probability of the entire sequence of tokens
sentence_list given a probability function p_func (which is either the smoothed or the nonsmoothed probabilities).
2. HINT1: curr_word_prob is defined for you, and is P(currToken|previous N – 1 tokens).
3. HINT2: cum_neg_log_prob is what the function returns. For each iteration of the for-loop, what
must we do to update cum_neg_log_prob?
4. HINT3: Think about how we combine log probabilities for each word.
• calc_prob_of_sentence
1. This function calculates and returns the probability of a sequence of tokens.
16
2. HINT1: Use the function you just wrote, calc_neg_log_prob_of_sentence.
3. HINT2: Use np.exp.
(b) After writing the above functions and mounting the corpus on your google drive, you should be able to run the
text generation algorithm. This algorithm works by first using the N-gram model to construct CPTs (as we
saw earlier in this homework). Then, it uses the CPTs to generate a sequence of words that our model thinks
can occur with relatively “high probability.” Our hope is that the “high probability” sequences are sequences
of words that make some kind of sense.
Run the text generation algorithm and record (in the spaces below) some of your N-gram model-generated
sentences with the following parameters. Please do these in order or else you will be very disappointed by the
mediocrity of the text generated. What are the effects of increasing N and k on the quality of the generated
text? Modify the N, k variables in the “play with params in code here” section.
1. N = 1, k = 1
2. N = 2, k = 1
3. N = 2, k = 5
4. N = 3, k = 1
5. N = 3, k = 5
(c) Now, perform language modeling on a dataset/corpus of your choosing. Select a corpus, put it in a text
file, and upload it to the google drive folder with all the other .txt files you uploaded earlier. Then redefine
training_corpora_path_list under the comment “REDEFINE training_corpora_path_list here if you
wish to use your own corpus” and use the model to perform text generation (as you did above). For good
results, select a corpus at least 50,000 words long. If you are not feeling creative, feel free to use the other files
in the cs188whw4public folder.
Below, write a sentence that your N-gram model generated on your custom corpus.
(d) So far, we have used N-gram to do language modeling and text generation. We can also use N-gram, with some
modifications (that staff has already coded for you) to capitalize a sequence of words correctly. This is done
using probability maximization; in other words, which options of capitalization look most like things we have
seen in the training dataset. The current implementation is slow, but if you were to make a spin-off project,
you can look into the Viterbi Algorithm (not in scope for 188 this semester) to speed it up.
Run the capitalization script on the strings provided (you do not need to make any changes for this part). If
you wrote your code correctly, you should get capitalizations of the inputted sentences that make sense. In
other words, your model is smart enough to know when to capitalize tricky words like “united!”
17

Don't use plagiarized sources. Get Your Custom Essay on
Introduction to Artificial Intelligence
Just from \$10/Page

 Laralex Case Study Data Hospital Acquired Infections Cesarean Section Procedures Discrepant X-Rays Unscheduled Readmissions Patients Who Leave the ED Prior to Treatment Month Patient-Days No. Births No. Patients No. Patients No. Patients No. 1 5225 22 119 32 488 8 310 6 604 6 2 5515 20 111 27 573 3 294 5 575 10 3 5872 15 111 32 489 6 337 14 593 7 4 5398 22 125 28 420 4 253 10 641 6 5 5017 26 99 27 503 6 293 10 601 11 6 5273 17 127 27 580 7 300 4 649 9 7 4824 20 121 25 419 8 319 10 658 11 8 5340 21 117 32 442 4 199 7 552 11 9 5307 14 133 30 407 3 263 11 536 9 10 5507 20 106 23 553 9 259 5 554 11 11 4189 22 120 27 466 3 285 14 708 11 12 4378 17 123 33 551 4 275 11 547 12 13 4620 20 114 29 485 10 320 13 589 16 14 5869 27 128 19 427 7 329 12 596 12 15 4975 21 117 19 540 9 243 11 685 18 16 4969 19 115 21 568 3 278 8 640 15 17 5792 17 104 22 531 9 365 6 659 17 18 4939 22 128 20 558 5 348 11 609 16 19 5616 16 120 24 474 4 290 8 438 14 20 5061 11 121 25 594 9 321 7 522 13 21 5262 20 102 21 540 2 253 9 574 16 22 4808 26 107 18 553 9 266 10 539 18 23 5280 20 118 24 556 11 301 11 634 21 24 5491 24 116 22 541 7 348 9 610 22

## Get professional assignment help cheaply

Are you busy and do not have time to handle your assignment? Are you scared that your paper will not make the grade? Do you have responsibilities that may hinder you from turning in your assignment on time? Are you tired and can barely handle your assignment? Are your grades inconsistent?

Whichever your reason may is, it is valid! You can get professional academic help from our service at affordable rates. We have a team of professional academic writers who can handle all your assignments.

Our essay writers are graduates with diplomas, bachelor, masters, Ph.D., and doctorate degrees in various subjects. The minimum requirement to be an essay writer with our essay writing service is to have a college diploma. When assigning your order, we match the paper subject with the area of specialization of the writer.

## Why choose our academic writing service?

• Plagiarism free papers
• Timely delivery
• Skilled, Experienced Native English Writers
• Ability to tackle bulk assignments
• Reasonable prices

Basic features
• Free title page and bibliography
• Unlimited revisions
• Plagiarism-free guarantee
• Money-back guarantee
On-demand options
• Writer’s samples
• Part-by-part delivery
• Overnight delivery
• Copies of used sources
Paper format
• 275 words per page
• 12 pt Arial/Times New Roman
• Double line spacing
• Any citation style (APA, MLA, Chicago/Turabian, Harvard)

# Our guarantees

We value our customers and so we ensure that what we do is 100% original..
With us you are guaranteed of quality work done by our qualified experts.Your information and everything that you do with us is kept completely confidential.

### Money-back guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

### Zero-plagiarism guarantee

The Product ordered is guaranteed to be original. Orders are checked by the most advanced anti-plagiarism software in the market to assure that the Product is 100% original. The Company has a zero tolerance policy for plagiarism.

### Free-revision policy

The Free Revision policy is a courtesy service that the Company provides to help ensure Customer’s total satisfaction with the completed Order. To receive free revision the Company requires that the Customer provide the request within fourteen (14) days from the first completion date and within a period of thirty (30) days for dissertations.

The Company is committed to protect the privacy of the Customer and it will never resell or share any of Customer’s personal information, including credit card data, with any third party. All the online transactions are processed through the secure and reliable online payment systems.

### Fair-cooperation guarantee

By placing an order with us, you agree to the service we provide. We will endear to do all that it takes to deliver a comprehensive paper as per your requirements. We also count on your cooperation to ensure that we deliver on this mandate.

## Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
\$26
The price is based on these factors: