Machine Learning: a definition

Hi everyone, welcome back. Tonight I was having dinner with some engineer friends (sadly not a plat en sauce) and the subject of Machine Learning popped up.

The discussion quickly almost turned religious, between believers and non believers, those that thought it’s over hyped, those who just use it as a tool and those who research it like myself. All in all, it made me feel that even engineers have a very vague sense of what exactly is Machine Learning. I’ve noticed people in general often mix it up with other concepts that either include it such as AI or that it includes such as the Supervised Learning paradigm. Since it’s been a long time that I’ve been wanting to address this issue, here I go.

The aim of this article is to explain ML assuming that you already have some general mathematical background, and have (very) basic knowledge of optimization i.e. you know what argmin is. I also assume that you can read equations. 

My definition of Machine Learning is the following:

Choosing a task and learning a statistical model from data that will perform well on this task.

This definition seems extremely vague, so I’ll explain each part in more detail. I’m intentionally leaving out complications due to taking into account randomness in the data. Of course we need this both in theory and in practice.

Choosing a task

Examples of tasks are recognizing handwritten digits, translating sentences from a language to another, identifying people in pictures, forecasting the weather or stock prices, grasping an object with a robot arm… Basically any clear-cut problem is a task. This part doesn’t suppose an actual mathematical expression of the task. On the other hand, it does require to be very precise as we’ll see in the following example.

A model that will perform well

To define what perform well means, you often want to define a loss or cost function for your task, that given your data and model, i.e. the specific function of your data you’re considering, outputs a measure of its performance. A simple example: suppose my (very basic) task is to always output the number 0. A good metric could be the distance of my function output to 0. Then it’s obvious that if f and g are the constant functions respectively equal to 42 and 0.2, g performs better on this task than f, for every distance function other than the 0-1 distance. All these different distance functions yield a different metric for the same task.
The problem is then to minimize your loss function over the class of functions that you will choose in the next step.

A statistical model

Choosing a class of functions that take as inputs the data you want to use for the task, i.e. often a high-dimensional vector, and outputs a real number or another vector. A class of functions is a group of functions that share a certain structure, with different parameters. An example of this is the class of linear functions that can we written : f(X) = AX + b , where matrix A and vector b are the parameters.

Learning the model

This here is the actual way that you will choose your function that minimizes your loss over your data in the class of functions that you chose. This is the learning and what makes ML so compelling to me, and makes it treated as almost magical in the media.

This is also what made the success of deep learning, where in the parameter space, the overall loss is differentiable with regard to the parameters, allowing us to simply glide towards a (near) optimal solution using Stochastic Gradient Descent.

Who does what?

ML to me is described by these 4 steps. Both researchers and engineers usually focus on a subset of these steps. For example, for a well-known task such as classification (i.e. for each data point, outputting the category it should belong to), or regression (for each data point, predicting the value of something), both the task and the loss function are well known. Most work can then focus on choosing the model family and how to compute the best model of the family. Engineers on the other hand will often have a hard time modelling their task in a nice loss function, and then apply known model families and optimization schemes.

Now for an example

Let’s consider the task (step 1) of recognizing hand written digits, supposing that we know the ground truth for a subset of our data — we’ll call this subset our training set. Our input is going to be a 784 (ie. 28×28)-vector representing an image. Each of its dimensions corresponds to a pixel in the image. Here is a sample image for a 1:

Sample image and matrix representation – From Tensorflow tutorials. Dimensions are wrong.

Now that we identified our task, we want to model it using a loss function (step 2). We want our output to tell us if the sample is a 0, 1, 2… or 9. An intuitive way to represent this is to have a 0-1 error function : if we predict the correct label, our loss is 0 and 1 if we make a mistake. Machine Learners rarely use this in practice because the discrete nature of this loss make the 4th step, the learning, harder.

A common and smart way to represent a very similar task is to have our function output a vector of probabilities over the 10 classes, i.e. a positive vector of size 10, with coordinates that sum to 1. Our new proxy task is predicting how likely a sample is to belong to a certain class, given labelled training data. We’ll call this vector \hat{y} as opposed to y which is the correct probability distribution, where y_i = 1 if and only if i is the correct class of our data point. We need a metric to tell if \hat{y} is far from y. Cross entropy is the go-to method for this. Its expression for each data sample is:

L(\hat{y}, y) = -\sum_i y_i log(\hat{y_i})

Note that if we did not know the y_i, it would be a different task, and would thus require a different loss function.

In practice, we’ll minimize the empirical average of this loss over our training data.

At this point, we want to find our model class, i.e. the structure in our functions f. An easy example is using a one-hidden-layer neural network, with a sigmoid activation function \sigma. f can be written as:

f(x) = W_1 \sigma (W_0 X + b_0) + b_1

So our model is the set of functions that can be written this way, for any weights W_0, W_1, b_0, b_1.

Finally comes the learning. We want to learn what function of our class minimizes our loss. This is equivalent to finding the weights that minimize our loss, given the preceding structure. Mathematically, our minimization problem is:

\min_{W_0, W_1, b_0, b_1} - \sum_{i,j} y_i log(W_1 \sigma (W_0 X_j+ b_0) + b_1)

where the sum over i refers to the different classes (0, 1…9) in our problem and the sum over j refers to the different samples in our training set.

NB for you who are familiar with optimization: This problem is somewhat nice, because the loss function is differentiable with respect to the weights, although it is not convex in the weights. This explains why people use Stochastic Gradient Descent, an algorithm that guarantees us to find a local minimum of our loss and the corresponding weights.

Once we find the optimal weights, we have our candidate, and have finished our learning! For this particular digit recognition dataset, common classification error runs around 5% on a validation set. The classification error used here is the intuitive 0-1 loss that we defined earlier.

A validation set is a set similar to the training set, meaning that we also know the correct labels y_i. The difference with the training set is that we never fed it to the learning algorithm. This means our model’s parameters were not optimized for outputting correct results for the validation set. The hope is that the parameters we learned generalize enough to work well on the validation set. We want this so that we are confident our model will perform well on any new data we feed to it, assuming this data was generated in the same manner as the training data. More on this in a future post.

Conclusion

I went over what makes a problem a Machine Learning problem: the 4 phases of Choosing a Task, Modelling the Task, Choosing a Model Class, and Learning i.e. Optimizing over the Model Class. I hope the example helped to clear this up. We also saw that by trying to solve a specific task, we may have to define a new task that is a good proxy for our original task. In the end though, we always validate our approach on the first task on held-out data, using the loss function that we designed for it, here the 0-1 error loss.

I hope this helps making Machine Learning less of a buzz word and more understandable! I’m looking forward to your comments, don’t hesitate to leave one below.

Going further

The Bible of ML: 
Hastie, T.; Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning , Springer New York Inc. , New York, NY, USA
.

The Bible of Optimization, basic and advanced:
Boyd, S.; and Vandenberghe, L. (2004), Convex Optimization. Cambridge University Press, New York, NY, USA.

The Bible of Deep Learning:
Goodfellow, I.; Bengio Y.; and Courville, A. (2016), Deep Learning. MIT Press

The Plat en Sauce Abstraction

Hi all! Here is my first post on food! I hope you enjoy it and am looking forward to your comments! Today, I’ll focus on the French Plat en sauce concept.

I found an old cooking book in one of my parents’ shelves a few months ago and finally got around to checking it out and trying some things in it. The book is called Le Répertoire de la Cuisine (1914 for the first edition) by Théodore Gringoire and Louis Saulnier. The Répertoire is a major guide/recipe book for people who already know some technical French cuisine vocabulary – but nowadays, you can just Google them and find out all about what monder, singer or chemiser mean. I won’t reproduce the recipe from the book here – you can probably find a PDF on internet or buy it on Amazon. I will add my own recipes in the following days, tune back in soon!

I moved to the US a few months ago,  and started missing the homey family French cuisine tastes. In my mind, nothing is more characteristic of this than a Blanquette and other plats en sauce : sauce entrées where meat and veggies are slowly cooked with herbs and wine.  I read and tried a few different recipes from the Répertoire inspired from Estouffade (traditionnelle and à la provençale) and Blanquette. 

My maths/CS background made me realize that these can all be abstracted to a generic Plat en sauce, which most of you experienced cooks probably already know. Couscous, Coq au vin, Bœuf bourguignon, Fricassée or even Curry implement this. They’re all pretty cheap and healthy dishes, and the abstraction allows for creation and variation depending on your current tastes and desires. Here is what makes up a good plat en sauce in my mind, with the usual French interpretations of each category.

  • Meat: beef stew or equivalent for other animals such as veal or lamb.
  • Fonds:  stock – traditionally either white if made from poultry, veal or vegetables or brown if made from red meat
  • Wine: same color as the fonds, i.e. red for dark fonds and white for light fonds.
  • Vegetables: usually onions, carrots and mushrooms at least – any other root can do, e.g. parsnip for some tartness
  • Carbs: usually rice or steamed potatoes, the former being my favorite to have a donburi-style Plat en sauce where the rice gets soaked in the sauce.
  • Spices, herbs and various seasonings: bouquet garni (thyme, rosemary and sage) is the most frequent for French dishes, but Ras El-Hanout would be its couscous equivalent. I also consider the egg/heavy cream in Blanquette in this category.
  • Optional: a roux to make the sauce thicker.  Always add this at the very end.

Onions have two uses in these recipes: sliced onions, usually in quarters, are for eating, and whole onions are spiked with cloves to flavor the broth and to facilitate removing the cloves in one go before serving the dish. This allows not to end up with a violent clove surprise while eating.

Roux: mix melted butter and flour (around half and half) and cook till the mixture is the color you want. Darker gives a more nutty taste, but doesn’t thicken the sauce as much. Always mix with a few ladles of sauce first before pouring in the main pot to avoid lumps.


I’ll progressively add my recipes for different implementations of Plat en sauce as soon as I’ve tweaked them well enough, come back soon for more! Check my Recipes page for the up to date version.

Let me know if this helped you make one of these dishes or if you tried deriving a new plat en sauce from the abstraction! Here are some ideas that I’d like to try:

  • a vegetarian one, using tofu and vegetable stock
  • using ichiban-dashi (kombu seaweed and fish stock) for an umami-tasting base