I recently enrolled in Stanford University’s Machine Learning open course on coursera.org, which is taught by esteemed Prof Andrew Ng. I’ll take some notes that are important to me (and probably many machine learning rookies), and hope this would help in later studies. (Disclosure of homework, homework solutions, and other materials are somehow a bad thing to do, for copyright problems or unfairness of prospective students of that course, but I guess notes would be fine.)

By the way, the UI of coursera.org is quite simple and informative, but I guess somehow they should try to unify the appearance of homepage and inside-course pages, and use more reliable, preferably official, subtitles.

So here are the notes from Week 1-3.

Week 1: Introduction and Linear Regression

  • Machine learning is some method or algorithm, that improves given experience E with regard to some performance measure P on a task T. (Paraphrased from Tom Mitchell, 1998. I cannot agree more!)
  • Supervised learning is learning problems where we are given the “right answers”, and asked to give the “map” from input values to prediction. Supervised learning mainly consist of regression problems (where the output is continuous) and classification problems (where the output takes only a few discrete values).
  • Unsupervised learning, on the other hand, has no predefined output value for each input datum. In such tasks, we usually need to tell the structure of the input data (which is called “clustering problem”), or separate meaningful information from somewhat mixed input (e.g. the cocktail party problem, where sound recorded by two speakers are used to recover the sole music and human voice from the scene where they were recorded).
  • Usually, in supervised learning, our knowledge of the map from input to output is given as a parametric hypothesis, with a cost function evaluating how good the hypothesis is (given certain choice of parameters).
  • Gradient descent updates the parameters with update rule -\alpha {\partial J(\theta)\over \partial \theta}, with J being the cost function, \theta the parameter(s), and \alpha the learning rate. Gradient descent can find local minima (maxima) for all functions, but global optima is guaranteed for convex functions.

I skipped the review of linear algebra there. For that part, the best help you can find would be a textbook on linear algebra.

Week 2: Multivariate Linear Regression

  • Most multivariate gradient descent problems involve multivariate calculus, so to put this into practice, a textbook is still the best suggestion I can offer.
  • Linear regression model can do much more than linear regression. Polynomial regression, for example, can be done with linear regression model by adding additional features that are powers of certain original features. Nonlinear regression like square roots works well, too.
  • Normal equation offers the possibility of solving linear regression analytically. This solution is essentially obtained by setting the derivative linear regression’s cost function to zero (since the cost function is strictly convex, such solution is guaranteed global optimal). However, since matrix inverse (pseudo inverse for non-invertible / singular matrices) operation is costly, this method cannot scale well to high dimensional problems.
  • When your gradient descent doesn’t work well, try shrinking your learning rate. When it converges too slowly, try amplifying your learning rate. A good way to do that is using a factor of about 3x instead of 10x, that is, use “0.001 0.003 0.01 0.03 0.1 0.3 1” instead of “0.001 0.01 0.1 1”. I recently benefited from this softer approach in my research myself.
  • Gradient descent methods can benefit from feature scaling (stretch all features to approximately [-1, 1]), and mean normalization (subtracting mean from the feature values). These techniques make gradient descent converge faster, without affecting the solution obtained.

I also skipped the Octave part as I use Matlab, which is almost the same language.

Week 3: Logistic Regression and Regularization

  • Logistic regression uses the logistic function (also called sigmoid function) \sigma(\theta^Tx) = {1\over 1+e^{-\theta^T x}} as the hypothesis function. To avoid non-convexity of cost function, instead of the squared difference function linear regression used, logistic regression used a cross-entropy style cost function f(x) = -y\log(\sigma(\theta^Tx)) - (1-y)\log(1-\sigma(\theta^Tx)). This cost function is convex, and thus friendly to gradient descent, for gradient descent methods are guaranteed to obtain the global optima.
  • To generalize binary logistic regression to multiple class, the common option is the “one-vs-all” algorithm. For each class, treat it as the positive class and all others as the negative class, we can train a binary logistic classifier. All these classifiers together consists of a multi-class logistic regression classifier.
  • Regularization can be used to avoid over-fitting, where the parameters fits the hypothesis too well to the training set that it cannot generalize well to new inputs. Regularization does this by minimizing the impact of unnecessary features on the cost function with regard to the regularization constant. However, when this constant is set too large, necessary features could get suppressed, too, causing the model to under-fit the data, in other words, introducing high bias towards how the data should be interpreted.

See Also…

[Coursera] Machine Learning Notes - Week 4-6
[Coursera] Machine Learning Notes - Week 7-10