This post are the fresh notes of the current offering of Machine Learning course on, which covers the courses offered in Week 7 (Support Vector Machines) through Week 10 (Large-scale Machine Learning).

So here are the notes from Week 7-10.

Week 7: Support Vector Machines

  • The basic linear SVM is obtained by replacing the log difference in the cost function with a rectified linear function (but with an offset of 1 on both sides).
  • The offset in the cost function can count for the property of large margine, which helps the classifier to generalize better by choosing a decision boundary of large margin from positive and negative examples. Such behavior can be tuned by the C factor in the cost function, leveraging between high bias and high variance.
  • Kernels can extend the application of support vector machines. Kernels are usually similarity measures between the input data vectors and chosen landmark vectors. If the similarity function is, for example, Gaussian centered at the landmarks, the resulting SVM is called a Gaussian SVM.

Week 8: Clustering and Dimensionality Reduction

  • Clustering and obtaining a good inner representation are two important forms of unsupervised learning.
  • K-means is a clustering algorithm that takes an iterative approach between “center assignment” steps and “center re-location” steps. During a center assignmnet step, each data point is assigned to the nearest of the K centroids; and during a center re-location step, each of the K centroids is re-located at the arithmetic mean of the data points that’s assigned to the centroid.
  • K-means centroids should usually be randomly initialized as random chosen data points. And one should expect to run K-means for several times to obtain stable results, with independent (different) random initializations.
  • The choice of the number of clusters should preferrably be determined by downstream purpose of the clustering, e.g. the profit of T-shirts given different number of sizes. Otherwise, one could plot the total clustering cost (sum of squared distances from data points to clustering centroids), and choose the value where the derivative of the error diminishes significantly (this is called the “Elbow method”).
  • Two apparent motivations for dimensionality reduction are data compression (for storage or computation speed-up purposes) and data visualization (for higher-dimensional data).
  • Principle Component Analysis (PCA) is one of the most commonly used algorithm for dimensionality reduction. It computes the eigen-decomposition of the covariance matrix of the data, and then takes only the eigenvectors corresponding to the largest K eigenvalues to represent the data (by projecting the data points on to the subspace spanned by such vectors).
  • The number of principle components to keep should usually retain 90 or higher percent (preferably 99%) of the variance of the data, meaning the sum of squared difference between the reconstructed data divided by the sum of squared data (dimension-wise).

Week 9: Anomaly Detection and Recommender Systems

  • Anomaly detection is a kind of unsupervised learning, in which normal data is usually modeled by a Gaussian distribution, and abnormal data points are distinguished by a lower probability under this distribution below some threshold epsilon. This algorithms is especially suitable for problems where the classes are very skewed, i.e. only a few positive examples are available, while a lot of negative ones are present. This algorithm is also useful where the properties of anomalies are not fully captured by training data.
  • Multivariate Gaussian is very useful in anomaly dectection when there’s correlation between input dimensions. However it’s more computationally costly, and takes more memory space too.
  • Collaborative filtering is an important algorithm for recommender systems. Collaborative filtering can be implemented via low rank matrix factorization, where the rankings matrix Y is decomposed into a parameter matrix theta (where each row denotes the ranking of a specific feature by a user), and a feature matrix X (where each column denotes the portion of features of some given content). Gradient descent on both matrices is the recommended algorithm to learn such factorization
  • It is worth notice that, when implementing collaborative filtering, it is necessary to do mean normalization first.

Week 10: Large Scale Machine Learning

  • Larger dataset is often the key to success for many modern learning systems.
  • Stochastic gradient descent runs through the dataset one example by one example, and computes the gradient for each data point to run descent. Random shuffling the dataset is necessary.
  • Mini-batch gradient descent lies somewhere between stochastic gradient descent and batch gradient descent (where all the data are taken into consideration for each gradient descent step). It takes a small portion of data (but more than 1), and compute the average gradient for descent. Random shuffling is also necessary, and mini-batch gradient descent is usually much more stable than stochastic gradient descent, thus the recommended algorithm to use.
  • Online learning is quite similar to stochastic gradient descent, in that it takes one sample at a time and computes the gradient. The major difference is that online learning does not keep the dataset; on the opposite, online learning discards every datum that it’s seen and moves on to new data. This allows online learning to enjoy the advantage of self-adaptation to time-changing data.
  • Map-reduce is an important method for making use of both computational and storage resources, by first splitting up task to different processing units or even different computers (map), then collect together the results from each processing unit / computer to calculate the final result (reduce).

See Also…

[Coursera] Machine Learning Notes - Week 1-3
[Coursera] Machine Learning Notes - Week 4-6