I gave a talk at Skyscanner a while ago about the various forms of regression in Machine Learning (ML) and where each might be applicable. While I have yet to get confirmation that I can post the video of the talk here, I'd like to share a selection of the techniques in a more abstract form.
Linear regression is the basic form of regression that we are most familiar with from our high school algebra classes. Remember when you were given an equation y = a x + b and asked to find the a and b such that the line fits a given set of (x, y) points? That is regression at its core. Given a set of training points, a machine learning engine attempts to identify the constants that best fit those points. By definition, a linear regression problem is any problem that can be simplified into the form f(x) = w^T phi(x), where phi is a vector of basis functions and the model is linear in the weights w (the basis functions themselves may be non-linear). Since the constants a, b, c, etc. in functions such as a x + b, a x^2 + b x + c, or (a x + b)^4 can be captured in a vector [a, b, c, ...]^T, they all describe linear regression tasks. This vector is the weight vector w in the generalized equation.
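As a minimal sketch (the data points and numpy solver below are my own illustration, not from the talk), here is how both the straight line and the quadratic reduce to the same w^T phi(x) machinery, differing only in the choice of phi:

```python
import numpy as np

# Made-up example data; any (x, y) pairs would do.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Linear model y = a*x + b      ->  phi(x) = [x, 1],      w = [a, b]
Phi_line = np.column_stack([x, np.ones_like(x)])

# Quadratic model a*x^2 + b*x + c -> phi(x) = [x^2, x, 1], w = [a, b, c]
Phi_quad = np.column_stack([x**2, x, np.ones_like(x)])

# Both are "linear" regression because the model is linear in w,
# so the same least-squares solver handles either basis.
w_line, *_ = np.linalg.lstsq(Phi_line, y, rcond=None)
w_quad, *_ = np.linalg.lstsq(Phi_quad, y, rcond=None)

print("line [a, b]   :", w_line)
print("quad [a, b, c]:", w_quad)
```

Only the design of phi changes between the two fits; the solver itself never cares whether phi is a straight line or a polynomial.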
To find a regression solution, an error function is defined and minimized. The error function measures, for a given weight vector w, the average difference between the model's predictions and the training targets (most commonly the mean squared error). The optimal weights can be identified as the point where the derivative, or slope, of the error function is 0. Linear regression is great in that its error function has only one global minimum. However, it also suffers a limitation: the types of functions it can describe are rather limited. The most commonly seen implementation of linear regression is in spreadsheet applications, when you ask the program to find and plot the best-fit line for your data points.
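To make the "derivative equals zero" step concrete, here is a small sketch (again on made-up data) that solves the condition directly for the straight-line case; setting the gradient of the squared error to zero gives the so-called normal equations:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix for y = a*x + b
Phi = np.column_stack([x, np.ones_like(x)])

# The mean squared error E(w) = (1/N) * ||Phi w - y||^2 is a convex bowl in w,
# so setting its derivative to zero yields a single global minimum:
#   Phi^T Phi w = Phi^T y   (the normal equations)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

mse = np.mean((Phi @ w - y) ** 2)
print("optimal w:", w, " MSE at the minimum:", mse)
```

Because the error surface is a single bowl, this closed-form solution is the global best; there is no risk of landing in the wrong valley.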
Multilayer Perceptron (MLP), aka Neural Network
To overcome this limitation of linear regression, ML scientists apply non-linear functions such as the sigmoid and hyperbolic tangent onto linear functions. Thus a perceptron is created, taking the form h(x) = g(f(x)), where g(z) is a non-linear transformation of a linear function f(x) = w^T x. However, researchers soon realized an important drawback of a single-perceptron solution: along any direction that is perpendicular (or orthogonal, in multidimensional space) to the weight vector w, the predictions do not change. This flaw greatly increases the error rate, since an entire direction of predictions is error-prone.
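Here is a small illustration of that blind spot, with an arbitrary 2-D weight vector of my own choosing: moving the input along any direction orthogonal to w leaves the perceptron's prediction unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single perceptron h(x) = g(w^T x) with an assumed 2-D weight vector.
w = np.array([1.0, 2.0])

def perceptron(x):
    return sigmoid(w @ x)

x0 = np.array([0.5, 0.5])
v_perp = np.array([2.0, -1.0])   # orthogonal to w, since w . v_perp = 0

# Moving any distance along v_perp leaves the prediction unchanged:
for t in [0.0, 1.0, 10.0]:
    print(t, perceptron(x0 + t * v_perp))
```

All three prints give the same value: the perceptron literally cannot tell these inputs apart.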
Ultimately, the solution is simple: use more perceptrons! While each perceptron has a blind spot, that spot is covered by other perceptrons. This divide-and-conquer technique can be seen in many other parts of Computer Science, such as quicksort in the sorting problem. To aggregate the different predictions g(f(x)) from multiple perceptrons, yet more perceptrons are used, introducing a new layer of perceptrons that treats the outputs of the previous layer as its inputs. Thus, an MLP system is born.
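As an illustration only (the weights below are made up rather than trained), here is a tiny two-layer MLP in which one output perceptron aggregates three hidden perceptrons:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy two-layer MLP: three hidden perceptrons feed one output perceptron.
W1 = np.array([[ 1.0,  2.0],
               [-1.0,  0.5],
               [ 0.3, -0.7]])      # hidden layer: 3 perceptrons, 2 inputs each
b1 = np.array([0.0, 0.1, -0.2])
w2 = np.array([0.5, -1.0, 2.0])    # output perceptron aggregates the 3 hidden outputs
b2 = 0.1

def mlp(x):
    hidden = sigmoid(W1 @ x + b1)      # each hidden unit has its own w, hence its own blind spot
    return sigmoid(w2 @ hidden + b2)   # the aggregating layer combines their views

print(mlp(np.array([0.5, 0.5])))
```

Because each hidden unit has a different w, their blind directions differ, and the aggregating layer can combine them into a prediction that varies in every direction.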
Because of the multilayer nature of the solution, calculating the derivative of an MLP error function is a difficult task. Fortunately, since each layer's output feeds the next, the chain rule lets the error's gradient be propagated backward from the output layer through the hidden layers. This backward propagation (backpropagation) technique can be used to find where the error function has a slope of 0 and greatly speeds up the optimization of an MLP system.
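Here is a minimal, illustrative backpropagation sketch for a one-hidden-layer MLP on a toy dataset; the architecture, learning rate, and data are assumptions for demonstration, not a production recipe:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed for illustration): learn y from 2-D inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))    # hidden layer weights
b1 = np.zeros(3)
w2 = rng.normal(size=3)         # output layer weights
b2 = 0.0
lr = 1.0

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1.T + b1)    # hidden activations, shape (4, 3)
    p = sigmoid(h @ w2 + b2)      # predictions, shape (4,)
    err = p - y                   # derivative of the squared error w.r.t. p (up to a constant)

    # Backward pass: the chain rule pushes the error from the output layer
    # back through the hidden layer to get every gradient.
    dp = err * p * (1 - p)                 # through the output sigmoid
    grad_w2 = h.T @ dp
    grad_b2 = dp.sum()
    dh = np.outer(dp, w2) * h * (1 - h)    # through the hidden sigmoids
    grad_W1 = dh.T @ X
    grad_b1 = dh.sum(axis=0)

    # Gradient descent step toward a point where the slope is (near) zero.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

print("predictions:", np.round(p, 2))
```

The key point is that the gradients of the lower layer are computed from quantities already produced while computing the upper layer's gradients, which is what makes the whole procedure cheap.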
MLP is not without its weaknesses, however. The error function of an MLP tends to have multiple valleys, making optimization a form of gambling: if the starting weight vector sits in a local valley instead of the global one, the "optimum" weight vector the system settles on is very likely to miss the global best.
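To see the gamble concretely, here is a made-up 1-D error surface with two valleys; plain gradient descent ends up in whichever valley the starting point happens to sit in:

```python
import numpy as np

# A toy error surface with a global valley near w = -1 and a local valley near w = +1.
def error(w):
    return (w**2 - 1.0)**2 + 0.3 * w

def d_error(w):
    return 4.0 * w * (w**2 - 1.0) + 0.3

def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * d_error(w)
    return w

w_left = descend(-2.0)    # starts on the global side
w_right = descend(2.0)    # starts on the local side
print("from -2.0 ->", round(w_left, 3), " error:", round(error(w_left), 3))
print("from +2.0 ->", round(w_right, 3), " error:", round(error(w_right), 3))
```

Both runs reach a point where the slope is zero, but only one of them reaches the global best.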
That being said, MLP is widely adopted in commercial products due to its flexibility and its efficiency at finding "second-best" solutions.
Gaussian process regression is one of the latest regression techniques. Instead of producing a single function (via an optimized weight vector), it produces a distribution over all possible functions given the training points. To do this, it leverages the definition of a Gaussian process: a collection of random variables, any finite subset of which has a joint Gaussian distribution. This does mean the technique assumes that the outputs y, evaluated at any set of inputs x, follow a joint Gaussian distribution. While that appears limiting, this assumption empowers the system to train using the covariance matrix of that joint distribution.
This covariance matrix encompasses the interrelationships between every feature and every other feature. When predicting a value, the regression system generates a Gaussian distribution whose covariance matrix is produced by a kernel function. The kernel function describes how strongly the outputs y at two points co-vary, depending on how close their inputs x are. In a Gaussian process, the system trains on the hyperparameters, i.e. the parameters of the kernel function, instead of a weight vector as in traditional regression. The squared exponential covariance and its variants are the most popular choices of kernel function.
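Here is a minimal sketch of Gaussian process prediction with a squared exponential kernel, assuming fixed (untrained) hyperparameters and made-up 1-D training points; the predicted standard deviation already hints at the widening described next:

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance between two sets of 1-D points."""
    d = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

# Made-up 1-D training points; the hyperparameters (length_scale, signal_var,
# noise_var) are fixed here rather than optimized.
x_train = np.array([-4.0, -1.0, 0.5, 2.0])
y_train = np.sin(x_train)
noise_var = 1e-4

x_test = np.array([0.0, 3.0, 8.0])   # near, farther, and far from the training points

K    = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
K_s  = sq_exp_kernel(x_test, x_train)
K_ss = sq_exp_kernel(x_test, x_test)

# GP posterior: mean and covariance of the predictive Gaussian at x_test.
K_inv = np.linalg.inv(K)
mean = K_s @ K_inv @ y_train
cov  = K_ss - K_s @ K_inv @ K_s.T

print("predicted mean:", np.round(mean, 3))
print("predicted std :", np.round(np.sqrt(np.diag(cov)), 3))   # grows as x moves away from the data
```

In a full system the hyperparameters would be tuned (typically by maximizing the marginal likelihood), but even with fixed values the prediction is a whole distribution, not just a point estimate.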
When visualizing all the possible predictions for y across all different values of x, what we get back is a tube-shaped plot in which the variance of the prediction widens as x moves away from a training point and narrows as x moves closer to one. This holistic prediction allows engineers to account for how much the training data informs a prediction, depending on how far away it is.
On the flip side of the coin, the technique suffers greatly on speed. Calculating and manipulating the covariance matrices needed to optimize the hyperparameters is complex and expensive. This cost may account for why the technique has yet to be widely adopted in commercial applications. Nonetheless, I have high hopes and excitement for Gaussian processes because of the comprehensiveness of their predictions.
And well, it's just freaking sexy.