This post contains my fresh notes from the current offering of the Machine Learning course on Coursera, covering the material from Week 4 (Neural Networks: Representation) through Week 6 (Machine Learning System Design).

Week 4: Neural Networks Representation

  • Neural networks are multi-layer models in which each layer can be viewed, I think, as a multivariate logistic regression model. The raw activation of a layer is the output of the previous layer (including a bias unit) times the network parameters (or “weights”) in between; the final output of the layer, the activation, is the result of applying the sigmoid nonlinearity element-wise to the raw activations.
  • Neural networks borrow their inspiration from neurons, where the input weights mimic the dendrites and the output weights mimic the axons. The neuron cell body is usually thought of as computing a nonlinear function from its inputs to its outputs.
  • Neural networks can fit extremely complex functions, and because each layer extracts more complex features of the input, the classification result is often improved. Multi-class neural network classification uses a notation in the spirit of the “one-vs-all” scheme. That is, when we have multiple (>= 3) labels, each label is denoted as an “all-zeros-but-a-one” vector, e.g. class 1 = [1 0 0], class 2 = [0 1 0], class 3 = [0 0 1].
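As a rough illustration of the layer-by-layer computation and the label encoding described above, here is a minimal pure-Python sketch. The function names (`layer_forward`, `forward`, `one_hot`) and the hand-picked weights in `XNOR_LAYERS` are my own choices for illustration and do not come from the course materials; each weight row is assumed to store the bias weight first.

```python
import math


def sigmoid(z):
    """Element-wise nonlinearity applied to each raw activation."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, activations):
    """Compute one layer's activations.

    weights: one row per unit in this layer, each row being
             [bias_weight, w_1, ..., w_n] (assumed layout).
    activations: the n outputs of the previous layer.
    """
    with_bias = [1.0] + list(activations)  # prepend the bias unit
    raw = [sum(w * a for w, a in zip(row, with_bias)) for row in weights]
    return [sigmoid(z) for z in raw]


def forward(layers, inputs):
    """Forward propagation: feed the inputs through every layer in order."""
    activations = list(inputs)
    for weights in layers:
        activations = layer_forward(weights, activations)
    return activations


def one_hot(label, num_classes):
    """Encode a 1-based class label as an all-zeros-but-a-one vector."""
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]


# Hand-picked (not learned) weights: two hidden units acting roughly as
# AND and NOR, and one output unit acting roughly as OR, which together
# approximate XNOR on binary inputs.
XNOR_LAYERS = [
    [[-30.0, 20.0, 20.0], [10.0, -20.0, -20.0]],
    [[-10.0, 20.0, 20.0]],
]

if __name__ == "__main__":
    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, round(forward(XNOR_LAYERS, x)[0]))
    print(one_hot(2, 3))
```

Each hidden unit here is just a logistic regression on the inputs, and the output unit is a logistic regression on the hidden activations, which is the sense in which every layer resembles a logistic regression model.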

Week 5: Neural Networks Learning

  • The cost function of neural networks has a form similar to those of linear regression and logistic regression: an “output error” term plus a regularization term. As in logistic regression, the weights of the “1” unit (bias unit) are not taken into consideration when regularizing the parameters.
  • The back-propagation algorithm propagates the error from the output layer all the way back to the first hidden layer (the layer next to the input layer), and uses the error together with the sigmoid gradient of the lower layer to compute the gradient of the weights in between. Like forward propagation, which computes the activation of each layer and finally the output of the network, backprop is done layer by layer, only in the reverse order.
  • To make use of Matlab/Octave built-in or third-party vector-based optimization functions to obtain the network parameters, the parameters, as well as their corresponding gradients, should be unrolled into a single vector in a practical implementation.
  • To ensure that the gradients of the network parameters are correctly implemented, gradient checking should be performed to verify that the gradients computed from the analytical formulas agree with those computed by numerical approximation. The approximation can be done by choosing a small interval length epsilon and computing the two-sided approximate gradient (the difference between the cost function values at the two ends of the interval, divided by the total interval length 2 * epsilon). As this approximation is computationally expensive, the gradient checking code should be turned off before doing the actual training.
  • Identical initialization leads to identical gradients of the weights, so the parameters between a given input unit and all the identically-initialized output units always remain identical, resulting in uninformative (redundant) units. Random initialization is therefore essential to neural networks.
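The unrolling and gradient-checking steps above can be sketched roughly as follows. To keep the example short, it checks the analytic gradient of an unregularized logistic regression cost as a stand-in for the full network cost; the same check would apply to backprop gradients once the parameters are unrolled. All names (`unroll`, `logistic_cost_and_gradient`, `numerical_gradient`) and the toy data are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unroll(matrices):
    """Flatten a list of weight matrices (lists of rows) into one vector."""
    return [w for matrix in matrices for row in matrix for w in row]


def logistic_cost_and_gradient(theta, examples):
    """Unregularized logistic cost and its analytic gradient.

    theta: flat parameter vector.
    examples: list of (features, label) pairs; features already include
              the bias entry 1.0 as their first element.
    """
    m = len(examples)
    cost = 0.0
    grad = [0.0] * len(theta)
    for x, y in examples:
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        cost += -(y * math.log(h) + (1 - y) * math.log(1 - h))
        for j, xi in enumerate(x):
            grad[j] += (h - y) * xi
    return cost / m, [g / m for g in grad]


def numerical_gradient(cost_fn, theta, epsilon=1e-4):
    """Two-sided approximation: (J(theta + eps) - J(theta - eps)) / (2 * eps)."""
    approx = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += epsilon
        minus[j] -= epsilon
        approx.append((cost_fn(plus) - cost_fn(minus)) / (2 * epsilon))
    return approx


# Toy data and parameters, chosen arbitrarily for the check.
EXAMPLES = [([1.0, 0.5, -1.2], 1), ([1.0, -0.3, 0.8], 0), ([1.0, 1.5, 0.1], 1)]
THETA = [0.1, -0.2, 0.3]

if __name__ == "__main__":
    _, analytic = logistic_cost_and_gradient(THETA, EXAMPLES)
    approx = numerical_gradient(
        lambda t: logistic_cost_and_gradient(t, EXAMPLES)[0], THETA
    )
    # The two gradients should agree to several decimal places.
    print(max(abs(a - b) for a, b in zip(analytic, approx)))
```

Note that the numerical version needs two full cost evaluations per parameter, which is why it is only practical as a one-off check and not during training.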

Week 6: Advice for Applying Machine Learning and Machine Learning System Design

  • When evaluating a hypothesis, the best benchmark is the error (or the value of the unregularized cost function) on the test set. But when several models are to be selected from, a third set, the cross-validation set, should be used for the selection, so that the result on the test set stays fair. Thus the usual pipeline for training, selecting a proper learning algorithm / model, and evaluating the model is: first train the models on the training set, then choose the regularization value / model according to the performance on the cross-validation set (or validation set), and finally evaluate the chosen model on the test set. All three sets must be disjoint.
  • Learning curves are the errors on the training set and the validation set, respectively, plotted against the number of training examples used. When the training error and the validation error converge to similar, large values, it’s likely that the model has high bias (underfit), and adding features or decreasing the regularization value are worth considering. When the training error is small while the validation error remains large, it’s likely that the model has high variance (overfit), and proper actions to take include carefully selecting features, reducing the number of model parameters, and increasing the regularization penalty. A proper learning curve should have a decreasing validation error and an increasing training error, with the former converging to a value usually slightly higher than that of the latter.
  • Error analysis is an approach in which we analyze the characteristics of the misclassified examples and decide whether to devise new features to deal with the mistakes. It is also recommended to implement a “quick and dirty” learning algorithm first and run error analysis on it, to determine whether the algorithm is a good starting point.
  • When the data classes are skewed, e.g. 99% negative examples with only 1% positive examples (the rare class is conventionally taken as positive), Precision = #true positive / #predicted positive = #true positive / (#true positive + #false positive), Recall = #true positive / #all positive = #true positive / (#true positive + #false negative), and F1 Score = 2 * Precision * Recall / (Precision + Recall) are often better evaluation metrics than Accuracy = (#true positive + #true negative) / #all data.
  • A large amount of data is helpful when (i) our model has high variance (overfitting), as more data makes it harder for the model to overfit, *AND* (ii) the features are enough for a human expert to give good predictions, i.e. they carry enough information about the desired output. Otherwise more data will not help.
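To make the skewed-class metrics above concrete, here is a small sketch that compares them with accuracy on a made-up data set, assuming the rare class is labeled 1 (positive). The function names and the convention of returning 0.0 when a denominator is zero are my own assumptions, not something the course prescribes.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / len(actual)


def precision_recall_f1(actual, predicted):
    """Precision, recall and F1; a zero denominator yields 0.0 (assumed convention)."""
    tp, fp, fn, _ = confusion_counts(actual, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up skewed data: 1 positive example among 100.
    actual = [1] + [0] * 99
    always_negative = [0] * 100  # a "classifier" that ignores its input
    print(accuracy(actual, always_negative))             # high accuracy
    print(precision_recall_f1(actual, always_negative))  # all three are 0.0
```

The always-negative predictor scores 99% accuracy on this data while its recall and F1 are zero, which is exactly the failure that accuracy alone hides.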

See Also…

[Coursera] Machine Learning Notes - Week 1-3
[Coursera] Machine Learning Notes - Week 7-10