Tasting wine with ML.NET

If you really knew me, you'd know I like a good glass of wine once in a while. Yes, really once in a while, no structural issues! Although I do like a good glass of wine, I am not a connoisseur. But I am a software developer, and as you all know, software developers do have a solution for everything... At least, my clients somehow expect that. So I was wondering, how would a software developer pick a good wine?

Machine Learning

New wines are sold every day and the number of wines available is (almost) endless. Therefore it is impossible to keep an up-to-date list of all wines and their quality. If we want to know the quality of every wine available, we need a self-learning algorithm. This is what we call Machine Learning!

ML.NET

There are a lot of ways to implement machine learning, but one interesting project is ML.NET. This Microsoft project left preview status in May 2019. If you want to use ML.NET in a mission-critical application, please consider labeling the feature as beta in your product, since ML.NET is quite new.

Learn more about the status and features of ML.NET on the official website.

Algorithm

Before starting to ~~taste wine~~ code, it is important to understand there are many algorithms for machine learning, each suited to different scenarios. ML.NET supports several of these algorithms, so it is important to choose the right one for your project. Since I am not familiar with all algorithms and Microsoft will probably add new algorithms regularly, I will briefly introduce the most common ones:

Classification: determine a category for items based on one or more input variables. In case of binary classification, there are only two categories: true (1) or false (0). This is used for decision making models: is this a good or bad wine? Then there is also multi-class classification, which assigns each item to one of more than two categories, like: which region is this wine from?
Clustering: divide items into groups based on their properties. The most common clustering algorithm is K-Means. Example: which wines fall into the same price range?
Recommendation: recommend items based on a user's history. This could be interesting when selling wine. If you know which wines a customer bought, you will be able to recommend other wines based on the purchase history of other customers.
Transfer learning: use a model created by someone else. Recognizing objects in images requires lots of training data and hours of GPU time to process it. It would be a waste of time to create such a model yourself when others already did. Microsoft recommends using TensorFlow in these situations.
Regression: predict values based on one or more properties. To predict these values, a model is trained based on historical data. A typical scenario for regression is predicting prices. However, it is also a good candidate to... predict the quality of wine!
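
To connect these task types to code: in ML.NET each of them is exposed as its own catalog on the MLContext. The sketch below only shows where to find them and does not train anything. The class and method names are made up for illustration, and the Recommendation catalog is left out because it lives in a separate package.

```csharp
using Microsoft.ML;

public static class TaskCatalogSketch
{
    // Hypothetical helper: returns the catalog used later in this article.
    public static RegressionCatalog Run()
    {
        var mlContext = new MLContext();

        // Each task type has its own catalog with trainers and an Evaluate method.
        var binary = mlContext.BinaryClassification;          // good or bad wine?
        var multiclass = mlContext.MulticlassClassification;  // which region?
        var clustering = mlContext.Clustering;                // groups of similar wines (K-Means)
        var regression = mlContext.Regression;                // numeric quality score

        return regression;
    }
}
```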

Sample code

To demonstrate the code shown in this article, I've created a GitHub repository: https://github.com/vincentbitter/wine-ml

Getting started with ML.NET

It may come as a surprise, but setting up ML.NET only takes a few minutes! It is not necessary to install all kinds of services and SDKs. You just need your regular .NET development environment. In my case that is Visual Studio 2017 Enterprise (15.9.12 to be precise) with .NET Core 2.2.

Because I do not have an existing project, I have to create a new one. Therefore the first step of getting started with ML.NET is, of course, creating a new .NET project. In this case I'm using .NET Core 2.2 Console Application, but any other .NET Standard 2.0 based project should be fine.

The second thing we need to do is add the ML.NET NuGet package, which is called Microsoft.ML. Make sure you install version 1.3.1.
This can be done via Manage NuGet Packages... in Visual Studio.
Or with the command:

dotnet add package Microsoft.ML --version 1.3.1

Gathering data

The application should be able to tell the quality of any wine. To do so, a big set of data is required to train the application in tasting wine. A nice data set to start with is the one from Paulo Cortez, containing almost 5000 white variants of the Portuguese "Vinho Verde" wine. The red wines will be ignored, because it does not make sense to create one formula to predict the quality of both white and red wine.

In this project the .csv file is added to the project and copied to the output directory on build. This allows us to use LoadFromTextFile from ML.NET, but the data can be loaded from any source, such as a SQL database or a URL. In these cases, LoadFromEnumerable can be used.

To be able to map the data, a simple class is needed with the right properties. Please note the use of floats. It is possible to use other data types, but it will make our life a lot harder, since the values would have to be converted to floats before training.

using Microsoft.ML.Data;

namespace WineML
{
    class WineData
    {
        [LoadColumn(0)]
        public float FixedAcidity;
        [LoadColumn(1)]
        public float VolatileAcidity;
        [LoadColumn(2)]
        public float CitricAcid;
        [LoadColumn(3)]
        public float ResidualSugar;
        [LoadColumn(4)]
        public float Chlorides;
        [LoadColumn(5)]
        public float FreeSulfurDioxide;
        [LoadColumn(6)]
        public float TotalSulfurDioxide;
        [LoadColumn(7)]
        public float Density;
        [LoadColumn(8)]
        public float Ph;
        [LoadColumn(9)]
        public float Sulphates;
        [LoadColumn(10)]
        public float Alcohol;
        [LoadColumn(11)]
        public float Quality;
    }
}
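
For data that does not come from a file, here is a minimal sketch of LoadFromEnumerable. To keep it short it uses a hypothetical two-column WineRow type instead of the full WineData class, and the helper names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical two-column row type, kept small for illustration only.
public class WineRow
{
    public float Alcohol;
    public float Quality;
}

public static class InMemoryLoadingSketch
{
    // Wraps any in-memory collection (for example the result of a SQL query) in an IDataView.
    public static IDataView Load(MLContext mlContext, IEnumerable<WineRow> rows)
    {
        // No [LoadColumn] attributes are needed here: the columns are taken from the field names.
        return mlContext.Data.LoadFromEnumerable(rows);
    }

    // Reads the rows back as objects and counts them.
    public static int CountRows(MLContext mlContext, IDataView data)
    {
        return mlContext.Data.CreateEnumerable<WineRow>(data, reuseRowObject: false).Count();
    }
}
```

Usage would look like InMemoryLoadingSketch.Load(mlContext, rowsFromDatabase), after which the resulting IDataView can be used in the same way as one loaded from a text file.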

To prove our predictions are good enough, we also need to validate the trained model. A common choice is to use 70% of the data set to train the model and 30% to validate it. Before importing the data, we therefore split the original file into two parts: winequality-white-train.csv and winequality-white-validate.csv.

// Requires: using System; using System.IO; using Microsoft.ML;
var mlContext = new MLContext();
var dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-white-train.csv");
var trainingData = mlContext.Data.LoadFromTextFile<WineData>(dataPath, separatorChar: ';', hasHeader: true);
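
How the two files are created is up to you. Below is a minimal sketch of one way to do it in plain C#, assuming the original CSV has a single header line. The CsvSplitter class, its method names, and the seed value are made up for illustration and are not taken from the repository.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvSplitter
{
    // Shuffles the data rows with a fixed seed and splits them into
    // a training part (trainFraction) and a validation part (the rest).
    public static (List<string> Train, List<string> Validate) Split(
        IReadOnlyList<string> rows, double trainFraction, int seed)
    {
        var random = new Random(seed);
        var shuffled = rows.OrderBy(_ => random.Next()).ToList();
        var trainCount = (int)Math.Round(shuffled.Count * trainFraction);
        return (shuffled.Take(trainCount).ToList(), shuffled.Skip(trainCount).ToList());
    }

    // Reads the source CSV and writes two files that both keep the header line.
    public static void SplitFile(string sourcePath, string trainPath, string validatePath)
    {
        var lines = File.ReadAllLines(sourcePath);
        var header = lines[0];
        var (train, validate) = Split(lines.Skip(1).ToList(), 0.7, seed: 42);
        File.WriteAllLines(trainPath, new[] { header }.Concat(train));
        File.WriteAllLines(validatePath, new[] { header }.Concat(validate));
    }
}
```

Shuffling before splitting avoids a biased split if the rows in the source file happen to be sorted. ML.NET can also split an already loaded IDataView with mlContext.Data.TrainTestSplit(data, testFraction: 0.3), which makes the separate files unnecessary.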

Train model

Now that the data is loaded as an IDataView, we can train the model. We still need to tell ML.NET how to interpret the data and which trainer to use. In this case we use all available fields as features, except for the Quality column, which is the field we want to predict, so it is copied to the Label column. We use the FastTree regression trainer to train the model. Note that this trainer is shipped in the separate Microsoft.ML.FastTree NuGet package, so that package has to be installed as well.

var model = mlContext.Transforms
    .CopyColumns(
        outputColumnName: "Label", 
        inputColumnName: nameof(WineData.Quality))
    .Append(
        mlContext.Transforms.Concatenate(
            "Features",
            nameof(WineData.FixedAcidity),
            nameof(WineData.VolatileAcidity),
            nameof(WineData.CitricAcid),
            nameof(WineData.ResidualSugar),
            nameof(WineData.Chlorides),
            nameof(WineData.FreeSulfurDioxide),
            nameof(WineData.TotalSulfurDioxide),
            nameof(WineData.Density),
            nameof(WineData.Ph),
            nameof(WineData.Sulphates),
            nameof(WineData.Alcohol)))
    .Append(mlContext.Regression.Trainers.FastTree())
    .Fit(trainingData);

Validate model

Cool, we have a model! But is it any good? We have no clue! That's why we have to validate our model. Without validating a model, we can't tell how good it is, so it is really important. Microsoft covered this by providing the Evaluate method to calculate some metrics:

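// Load the validation set the same way as the training set
var validationPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-white-validate.csv");
var validationData = mlContext.Data.LoadFromTextFile<WineData>(validationPath, separatorChar: ';', hasHeader: true);

var predictions = model.Transform(validationData);
var metrics = mlContext.Regression.Evaluate(predictions, "Label", "Score");

Console.WriteLine($"RSquared Score: {metrics.RSquared:0.##}");
Console.WriteLine($"Root Mean Squared Error: {metrics.RootMeanSquaredError:#.##}");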

Make sure to use different data for the validation than for the training. This is the reason we split the original data into two CSV files earlier.

The metrics include the R Squared and Root Mean Squared Error. The R Squared is at most 1 and the higher the better! A value of 0 means the model is no better than always predicting the average quality, and it can even become negative when the model is worse than that. The Root Mean Squared Error should be as close as possible to 0.

# The results of our dataset:
RSquared Score: 0.22
Root Mean Squared Error: .72
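
To make these two metrics a bit more concrete, here is a minimal sketch that calculates them by hand from a list of real values and predictions. It uses the textbook definitions; I am assuming ML.NET calculates them the same way, and the class name is made up for illustration.

```csharp
using System;
using System.Linq;

public static class RegressionMetricsSketch
{
    // R Squared = 1 - (sum of squared errors / squared distance of the labels to their mean).
    public static double RSquared(double[] labels, double[] predictions)
    {
        var mean = labels.Average();
        var residual = labels.Zip(predictions, (y, p) => (y - p) * (y - p)).Sum();
        var total = labels.Sum(y => (y - mean) * (y - mean));
        return 1 - residual / total;
    }

    // Root Mean Squared Error, expressed in the same unit as the label.
    public static double RootMeanSquaredError(double[] labels, double[] predictions)
    {
        var meanSquared = labels.Zip(predictions, (y, p) => (y - p) * (y - p)).Average();
        return Math.Sqrt(meanSquared);
    }
}
```

For the labels 5, 6 and 7, perfect predictions give an R Squared of 1, always predicting 6 (the average) gives 0, and predicting 7, 6 and 5 gives -3. That last case shows how predictions that are worse than the average push the score below zero.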

Predict the quality of wine

Now, here we are, ready to roll! Our model is trained and we can start building a prediction engine for the quality of wines. After creating a PredictionEngine with CreatePredictionEngine we can finally start predicting. The results of the prediction will be captured in our WinePrediction class.

using Microsoft.ML.Data;

namespace WineML
{
    public class WinePrediction
    {
        [ColumnName("Score")]
        public float Quality;
    }
}

var predictionFunction = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(model);
var prediction = predictionFunction.Predict(new WineData {
    FixedAcidity = 7.6F,
    VolatileAcidity = 0.17F,
    CitricAcid = 0.27F,
    ResidualSugar = 4.6F,
    Chlorides = 0.05F,
    FreeSulfurDioxide = 23,
    TotalSulfurDioxide = 98,
    Density = 0.99422F,
    Ph = 3.08F,
    Sulphates = 0.47F,
    Alcohol = 9.5F,
    Quality = 0 // We are gonna predict this. The expected value is 6
});
Console.WriteLine($"{prediction.Quality:0.##}");

As you can see in the sample code, the official quality of the wine is 6. The result of our prediction... is... 5.69! That is really close! And I didn't cheat on this one, I picked a random record from the validation data set.

Please pay attention to the [ColumnName] attribute in the WinePrediction class. It is set to "Score", which should not be changed! Score is the fixed name of the column in which the regression trainer stores its prediction, so if you change it, ML.NET can no longer map the result to the Quality field.

Simplify the model

It is really impressive how well our model can predict the quality of wines, but let's get back to our mission: picking a good wine. How realistic is it to ask ourselves to collect all the required information of different wines? Not really... Therefore we need to reduce the number of parameters. The easiest way to do so is to train a model on each individual column and calculate its R Squared. Remember it is not perfect, because combining particular columns can result in far more accurate results, but we want to make our life easy.

Calculate RSquared for FixedAcidity... 0
Calculate RSquared for VolatileAcidity... 0
Calculate RSquared for CitricAcid... 0.04
Calculate RSquared for ResidualSugar... -0.03
Calculate RSquared for Chlorides... 0.06
Calculate RSquared for FreeSulfurDioxide... 0.02
Calculate RSquared for TotalSulfurDioxide... 0
Calculate RSquared for Density... -0.02
Calculate RSquared for Ph... -0.08
Calculate RSquared for Sulphates... -0.03
Calculate RSquared for Alcohol... 0.11
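
A minimal sketch of how such a per-column comparison could be implemented is shown below. It is not necessarily the code from the repository. It assumes the two CSV files from before are in the Data folder, that the Microsoft.ML.FastTree package is installed, and it loads the quality column directly as Label. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class SingleFeatureSketch
{
    // Column order of the CSV; the last column (quality) is loaded as "Label".
    public static readonly string[] FeatureNames =
    {
        "FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
        "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density",
        "Ph", "Sulphates", "Alcohol"
    };

    static IDataView Load(MLContext mlContext, string fileName)
    {
        // Build the column definitions from the names instead of using a mapped class.
        var columns = FeatureNames
            .Select((name, index) => new TextLoader.Column(name, DataKind.Single, index))
            .Concat(new[] { new TextLoader.Column("Label", DataKind.Single, FeatureNames.Length) })
            .ToArray();
        var path = Path.Combine(Environment.CurrentDirectory, "Data", fileName);
        return mlContext.Data.LoadFromTextFile(path, columns, separatorChar: ';', hasHeader: true);
    }

    // Trains a FastTree model on one feature and returns R Squared on the validation data.
    public static double RSquaredFor(MLContext mlContext, IDataView train, IDataView validate, string feature)
    {
        var model = mlContext.Transforms.Concatenate("Features", feature)
            .Append(mlContext.Regression.Trainers.FastTree())
            .Fit(train);
        return mlContext.Regression.Evaluate(model.Transform(validate), "Label", "Score").RSquared;
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var train = Load(mlContext, "winequality-white-train.csv");
        var validate = Load(mlContext, "winequality-white-validate.csv");
        foreach (var feature in FeatureNames)
        {
            var rSquared = RSquaredFor(mlContext, train, validate, feature);
            Console.WriteLine($"Calculate RSquared for {feature}... {rSquared:0.##}");
        }
    }
}
```

Because every model is trained on a single column, a score of 0 or lower means that column on its own is no better at predicting quality than always guessing the average.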

The conclusion of our research is: the amount of alcohol is the most important property to predict the quality of wine! Easy to remember and easy to check, since every bottle of wine has the alcohol percentage on its label! Just grab the one with the most alcohol in it to be safe!

Cheers!