18 changes: 13 additions & 5 deletions learning_curve.py
@@ -21,14 +21,14 @@ def display_digits():

 def train_model():
     """Train a model on pictures of digits.

     Read in 8x8 pictures of numbers and evaluate the accuracy of the model
     when different percentages of the data are used as training data. This function
     plots the average accuracy of the model as a function of the percent of data
     used to train it.
     """
     data = load_digits()
-    num_trials = 10
+    num_trials = 100
     train_percentages = range(5, 95, 5)
     test_accuracies = numpy.zeros(len(train_percentages))

@@ -39,8 +39,16 @@ def train_model():
     # For consistency with the previous example use
     # model = LogisticRegression(C=10**-10) for your learner

-    # TODO: your code here
+    for percentage in range(len(train_percentages)):
+        accuracy = []
+        for trial in range(num_trials):
+            x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, train_size=train_percentages[percentage]/100)
+            model = LogisticRegression(C=10**-5)
+            model.fit(x_train, y_train)
+            accuracy.append(model.score(x_test, y_test))
+        test_accuracies[percentage] = numpy.mean(accuracy)

+    # plot figure
     fig = plt.figure()
     plt.plot(train_percentages, test_accuracies)
     plt.xlabel('Percentage of Data Used for Training')
@@ -50,5 +50,5 @@ def train_model():


 if __name__ == "__main__":
     # Feel free to comment/uncomment as needed
-    display_digits()
-    # train_model()
+    # display_digits()
+    train_model()
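The merged loop can also be exercised on its own, outside the plotting code. A minimal standalone sketch of the average-over-trials pattern (the helper name `mean_accuracy` is ours, and it assumes scikit-learn is installed):

```python
import numpy
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def mean_accuracy(train_fraction, num_trials=10, C=10**-5):
    """Average test accuracy over several random train/test splits."""
    data = load_digits()
    scores = []
    for _ in range(num_trials):
        # Each trial uses a fresh random split, so the scores differ.
        x_train, x_test, y_train, y_test = train_test_split(
            data.data, data.target, train_size=train_fraction)
        model = LogisticRegression(C=C)
        model.fit(x_train, y_train)
        scores.append(model.score(x_test, y_test))
    return numpy.mean(scores)
```

For example, `mean_accuracy(0.5)` returns a score between 0 and 1; raising `num_trials` gives a more stable estimate at the cost of runtime.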
15 changes: 15 additions & 0 deletions questions.txt
@@ -0,0 +1,15 @@
1. What is the general trend in the curve?

The general trend in the curve is upward: there is a positive correlation between the percentage of data used as training data and the accuracy of the model.

2. Are there parts of the curve that appear to be noisier than others? Why?

The bottom part of the curve appears noisier than the rest. When a small percentage of the data is used for training, the model's accuracy depends heavily on which particular examples happen to land in the small training set, so the score varies widely from one random split to the next. With more trials, the program can average out this variability and smooth out the curve.

3. How many trials do you need to get a smooth curve?

The curve is smooth at around 100 trials. Overall, the more trials, the smoother the curve.
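The smoothing effect of averaging can be simulated directly: the spread of a mean of N noisy measurements shrinks roughly like 1/sqrt(N). A small sketch with synthetic per-trial accuracies (the 0.8 mean and 0.05 spread are made-up numbers, not taken from the exercise):

```python
import numpy

rng = numpy.random.default_rng(0)


def spread_of_average(num_trials, repeats=2000):
    """Std. dev. of the mean of `num_trials` simulated per-trial accuracies."""
    # Each row is one simulated experiment: `num_trials` noisy accuracies
    # drawn around a "true" accuracy of 0.8.
    samples = rng.normal(0.8, 0.05, size=(repeats, num_trials))
    # Average within each experiment, then measure how much the averages vary.
    return samples.mean(axis=1).std()
```

Here `spread_of_average(100)` is roughly a tenth of `spread_of_average(1)`, which is why 100 trials give a visibly smoother curve than 10.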

4. Try different values for C (by changing LogisticRegression(C=10**-10)). What happens? If you want to know why this happens, see this Wikipedia page as well as the documentation for LogisticRegression in scikit-learn.

Increasing C reduces the percentage of training data needed to reach a given accuracy, but increases the processing time. This is because C is the inverse of the regularization strength. With a higher C (weaker regularization), the model fits the training data more closely, which raises accuracy here but can eventually overfit; with a lower C (stronger regularization), the model underfits, resulting in lower accuracy.
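This claim can be checked with a quick comparison on a single fixed split (a sketch, assuming scikit-learn is available; the particular C values and the fixed `random_state` are our choices, not part of the exercise):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_digits()
# One fixed split so that only C varies between runs.
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.5, random_state=0)

scores = {}
for C in (10**-10, 10**-5, 1.0):
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(x_train, y_train)
    scores[C] = model.score(x_test, y_test)
    print(C, scores[C])
```

With strong regularization (C=10**-10) the test accuracy is noticeably lower than with weaker regularization (C=1.0) on the same split.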