18 changes: 13 additions & 5 deletions learning_curve.py
@@ -21,14 +21,14 @@ def display_digits():

 def train_model():
     """Train a model on pictures of digits.

     Read in 8x8 pictures of numbers and evaluate the accuracy of the model
     when different percentages of the data are used as training data. This function
     plots the average accuracy of the model as a function of the percent of data
     used to train it.
     """
     data = load_digits()
-    num_trials = 10
+    num_trials = 100
     train_percentages = range(5, 95, 5)
     test_accuracies = numpy.zeros(len(train_percentages))

@@ -39,8 +39,16 @@ def train_model():
     # For consistency with the previous example use
     # model = LogisticRegression(C=10**-10) for your learner

-    # TODO: your code here
+    for percentage in range(len(train_percentages)):
+        accuracy = []
+        for trial in range(num_trials):
+            x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, train_size=train_percentages[percentage]/100)
+            model = LogisticRegression(C=10**-5)
+            model.fit(x_train, y_train)
+            accuracy.append(model.score(x_test, y_test))
+        test_accuracies[percentage] = numpy.mean(accuracy)

+    # plot figure
     fig = plt.figure()
     plt.plot(train_percentages, test_accuracies)
     plt.xlabel('Percentage of Data Used for Training')
@@ -50,5 +50,5 @@ def train_model():


 if __name__ == "__main__":
     # Feel free to comment/uncomment as needed
-    display_digits()
-    # train_model()
+    # display_digits()
+    train_model()
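The merged loop can also be exercised on its own, outside the plotting code. A minimal standalone sketch of the average-over-trials pattern (the helper name `mean_accuracy` is ours, and it assumes scikit-learn is installed):

```python
import numpy
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def mean_accuracy(train_fraction, num_trials=10, C=10**-5):
    """Average test accuracy over several random train/test splits."""
    data = load_digits()
    scores = []
    for _ in range(num_trials):
        # Each trial uses a fresh random split, so the scores differ.
        x_train, x_test, y_train, y_test = train_test_split(
            data.data, data.target, train_size=train_fraction)
        model = LogisticRegression(C=C)
        model.fit(x_train, y_train)
        scores.append(model.score(x_test, y_test))
    return numpy.mean(scores)
```

For example, `mean_accuracy(0.5)` returns a score between 0 and 1; raising `num_trials` gives a more stable estimate at the cost of runtime.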
15 changes: 15 additions & 0 deletions questions.txt
@@ -0,0 +1,15 @@
1. What is the general trend in the curve?

The general trend in the curve is upward: there is a positive correlation between the percentage of data used as training data and the accuracy of the model.

2. Are there parts of the curve that appear to be noisier than others? Why?

The bottom part of the curve appears noisier than the rest. When a small percentage of the data is used for training, the model's accuracy depends heavily on which particular examples happen to land in the small training set, so the score varies widely from one random split to the next. With more trials, the program can average out this variability and smooth out the curve.

3. How many trials do you need to get a smooth curve?

The curve is smooth at around 100 trials. Overall, the more trials, the smoother the curve.
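The smoothing effect of averaging can be simulated directly: the spread of a mean of N noisy measurements shrinks roughly like 1/sqrt(N). A small sketch with synthetic per-trial accuracies (the 0.8 mean and 0.05 spread are made-up numbers, not taken from the exercise):

```python
import numpy

rng = numpy.random.default_rng(0)


def spread_of_average(num_trials, repeats=2000):
    """Std. dev. of the mean of `num_trials` simulated per-trial accuracies."""
    # Each row is one simulated experiment: `num_trials` noisy accuracies
    # drawn around a "true" accuracy of 0.8.
    samples = rng.normal(0.8, 0.05, size=(repeats, num_trials))
    # Average within each experiment, then measure how much the averages vary.
    return samples.mean(axis=1).std()
```

Here `spread_of_average(100)` is roughly a tenth of `spread_of_average(1)`, which is why 100 trials give a visibly smoother curve than 10.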

4. Try different values for C (by changing LogisticRegression(C=10**-10)). What happens? If you want to know why this happens, see this Wikipedia page as well as the documentation for LogisticRegression in scikit-learn.

Increasing C reduces the percentage of training data needed to reach a given accuracy, but increases the processing time. This is because C is the inverse of the regularization strength. With a higher C (weaker regularization), the model fits the training data more closely, which raises accuracy here but can eventually overfit; with a lower C (stronger regularization), the model underfits, resulting in lower accuracy.
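This claim can be checked with a quick comparison on a single fixed split (a sketch, assuming scikit-learn is available; the particular C values and the fixed `random_state` are our choices, not part of the exercise):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_digits()
# One fixed split so that only C varies between runs.
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.5, random_state=0)

scores = {}
for C in (10**-10, 10**-5, 1.0):
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(x_train, y_train)
    scores[C] = model.score(x_test, y_test)
    print(C, scores[C])
```

With strong regularization (C=10**-10) the test accuracy is noticeably lower than with weaker regularization (C=1.0) on the same split.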