As we saw in the previous post, in order to use ML.NET we have to configure a pipeline with all the operations needed to transform our dataset and with the algorithm we want to use.
In my opinion this syntax is rather verbose, and it has to be repeated for every training. As with everything, once you get used to the syntax you can write a pipeline very quickly, but you still have to repeat it for every algorithm you want to use.
For this reason, after I became confident with the framework, I wrote a wrapper service for ML.NET that helped me reduce the code I had to write and parametrize the pipeline instructions.
The pipeline parameters class
First, I implemented a model to store and validate the pipeline parameters. I use this class to initialize the parameters and pass them to the methods that perform the learning operations:
public class PipelineParameters<T> where T : class
{
    private readonly string[] _alphanumericColumns;
    private readonly string[] _dictionarizedLabels;
    private readonly string[] _concatenatedColumns;
    private string _predictedColumn;

    public PipelineParameters(string dataPath, char separator, string predictedColumn = null,
        string[] alphanumericColumns = null, string[] dictionarizedLabels = null,
        string[] concatenatedColumns = null, ILearningPipelineItem algorithm = null)
    {
        TextLoader = new TextLoader(dataPath).CreateFrom<T>(separator: separator);
        _predictedColumn = predictedColumn;
        _alphanumericColumns = alphanumericColumns;
        _dictionarizedLabels = dictionarizedLabels;
        _concatenatedColumns = concatenatedColumns;
        Algorithm = algorithm;
    }

    public TextLoader TextLoader { get; }

    public PredictedLabelColumnOriginalValueConverter PredictedLabelColumnOriginalValueConverter =>
        !string.IsNullOrEmpty(_predictedColumn)
            ? new PredictedLabelColumnOriginalValueConverter { PredictedLabelColumn = _predictedColumn }
            : null;

    public Dictionarizer Dictionarizer =>
        _dictionarizedLabels != null ? new Dictionarizer(_dictionarizedLabels) : null;

    public ColumnConcatenator ColumnConcatenator =>
        _concatenatedColumns != null ? new ColumnConcatenator("Features", _concatenatedColumns) : null;

    public CategoricalOneHotVectorizer CategoricalOneHotVectorizer =>
        _alphanumericColumns != null ? new CategoricalOneHotVectorizer(_alphanumericColumns) : null;

    public ILearningPipelineItem Algorithm { get; }
}
This is a generic class whose type parameter is the model of the dataset, with its feature columns and labels. It collects the main parameters/operations used in the pipeline: dataPath and separator are mandatory, while the others depend on the type of algorithm I've chosen.
In the public properties I check which parameters have been declared and instantiate the related objects (PredictedLabelColumnOriginalValueConverter, Dictionarizer and so on). We have already seen all these classes; they will be used by the service to populate the pipeline.
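To give an idea of what the type parameter T looks like, here is a hypothetical dataset model in the style of the legacy ML.NET API (the column ordinals, names and types are made up for illustration and are not necessarily the CarData class used later in the post):

using Microsoft.ML.Runtime.Api;

// Hypothetical dataset model: column ordinals, names and types are assumptions
// for illustration, not the actual class from the project.
public class CarData
{
    [Column("0")]
    public string FuelType;       // alphanumeric feature, a candidate for one-hot encoding

    [Column("1")]
    public float Horsepower;      // numeric feature

    [Column("2", "Label")]        // the value to predict is exposed as "Label"
    public float Price;
}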
The service
Then we have a service that deals with training the algorithms, evaluating them and predicting results. The first method is Train:
public async Task<string> TrainAsync<T, TPrediction>(PipelineParameters<T> pipelineParameters)
    where T : class
    where TPrediction : class, new()
{
    if (pipelineParameters.Algorithm == null)
        throw new ArgumentNullException(nameof(pipelineParameters.Algorithm));

    var pipeline = new LearningPipeline();

    if (pipelineParameters.TextLoader != null)
        pipeline.Add(pipelineParameters.TextLoader);
    if (pipelineParameters.Dictionarizer != null)
        pipeline.Add(pipelineParameters.Dictionarizer);
    if (pipelineParameters.CategoricalOneHotVectorizer != null)
        pipeline.Add(pipelineParameters.CategoricalOneHotVectorizer);
    if (pipelineParameters.ColumnConcatenator != null)
        pipeline.Add(pipelineParameters.ColumnConcatenator);

    pipeline.Add(pipelineParameters.Algorithm);

    if (pipelineParameters.PredictedLabelColumnOriginalValueConverter != null)
        pipeline.Add(pipelineParameters.PredictedLabelColumnOriginalValueConverter);

    var modelPath = $@"{_modelsRootPath}\{Guid.NewGuid()}.zip";

    var model = pipeline.Train<T, TPrediction>();
    await model.WriteAsync(modelPath);

    return modelPath;
}
As you can see, it is a generic method whose type parameters are the dataset model and the prediction model, and its only argument is a PipelineParameters object.
First of all I check whether the algorithm is null, then I build the pipeline in a similar way to what we have seen here. After that I train the model and save it to a zip file.
Wherever we need to train a pipeline, we can call this method and pass the specific parameters.
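For example, a multiclass training could be set up roughly like this (a minimal sketch: the IrisData/IrisPrediction types, the column names, the file path and the predictionService instance are placeholders, not code from the project):

// Hypothetical multiclass setup: types, column names and file path are placeholders.
var multiclassParameters = new PipelineParameters<IrisData>(
    "iris.txt", ',',
    predictedColumn: "PredictedLabel",
    dictionarizedLabels: new[] { "Label" },
    concatenatedColumns: new[] { "SepalLength", "SepalWidth", "PetalLength", "PetalWidth" },
    algorithm: new StochasticDualCoordinateAscentClassifier());

var modelPath = await predictionService.TrainAsync<IrisData, IrisPrediction>(multiclassParameters);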
The service also has methods for evaluation:
public async Task<RegressionMetrics> EvaluateRegressionAsync<T, TPrediction>(PipelineParameters<T> pipelineParameters, string modelPath)
    where T : class
    where TPrediction : class, new()
{
    var model = await PredictionModel.ReadAsync<T, TPrediction>(modelPath);
    var regressionEvaluator = new RegressionEvaluator();
    return regressionEvaluator.Evaluate(model, pipelineParameters.TextLoader);
}

public async Task<BinaryClassificationMetrics> EvaluateBinaryClassificationAsync<T, TPrediction>(PipelineParameters<T> pipelineParameters, string modelPath)
    where T : class
    where TPrediction : class, new()
{
    var model = await PredictionModel.ReadAsync<T, TPrediction>(modelPath);
    var binaryClassificationEvaluator = new BinaryClassificationEvaluator();
    return binaryClassificationEvaluator.Evaluate(model, pipelineParameters.TextLoader);
}

public async Task<ClassificationMetrics> EvaluateClassificationAsync<T, TPrediction>(PipelineParameters<T> pipelineParameters, string modelPath)
    where T : class
    where TPrediction : class, new()
{
    var model = await PredictionModel.ReadAsync<T, TPrediction>(modelPath);
    var classificationEvaluator = new ClassificationEvaluator();
    return classificationEvaluator.Evaluate(model, pipelineParameters.TextLoader);
}
These are generic methods as well, typed on the dataset model and the prediction model. They take two parameters: the object with the pipeline parameters and the path of the saved model.
Based on the type of machine learning algorithm (regression, binary classification, multiclass classification) we have three different evaluator classes (RegressionEvaluator, BinaryClassificationEvaluator, ClassificationEvaluator) with different metrics (RegressionMetrics, BinaryClassificationMetrics, ClassificationMetrics).
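As a quick sketch of how the returned metrics might be consumed (assuming a trained model path, a PipelineParameters instance pointing to a test dataset and the CarData/CarPricePrediction types used later in the post; the metric property names are those of the legacy 0.x API):

// Minimal sketch: testParameters and modelPath are assumed to exist already.
var metrics = await predictionService.EvaluateRegressionAsync<CarData, CarPricePrediction>(testParameters, modelPath);
Console.WriteLine($"RMS: {metrics.Rms}");
Console.WriteLine($"R^2: {metrics.RSquared}");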
The last methods deal with score predictions. As we have seen in the previous post, we can predict the most probable value or retrieve the entire list of values with their scores:
public async Task<TPrediction> PredictScoreAsync<T, TPrediction>(T data, string modelPath)
    where T : class
    where TPrediction : RegressionPrediction, new()
{
    var model = await PredictionModel.ReadAsync<T, TPrediction>(modelPath);
    return model.Predict(data);
}

public async Task<ScoreLabel[]> PredictScoresAsync<T, TPrediction>(T data, string modelPath)
    where T : class
    where TPrediction : MultiClassificationPrediction, new()
{
    var model = await PredictionModel.ReadAsync<T, TPrediction>(modelPath);
    var prediction = model.Predict(data);
    model.TryGetScoreLabelNames(out string[] scoresLabels);

    return scoresLabels.Select(ls => new ScoreLabel()
    {
        Label = ls,
        Score = prediction.Scores[Array.IndexOf(scoresLabels, ls)]
    }).ToArray();
}
As parameters, the methods accept an object with the feature data on which to predict the value, and the path of the model zip file. In the second method we have a more complex prediction model:
public class MultiClassificationPrediction
{
    [ColumnName("PredictedLabel")]
    public string PredictedLabel;

    [ColumnName("Score")]
    public float[] Scores;
}
We have already seen this model: the Scores property holds the scores of all the predicted values. With the TryGetScoreLabelNames method we retrieve, in the same order, the list of labels associated with those scores, so we can populate a list of labels with their associated scores.
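The ScoreLabel type returned by PredictScoresAsync is not shown here; judging by the object initializer above, it is presumably a simple pair along these lines:

// Assumed shape of the ScoreLabel DTO; the actual class in the project may differ.
public class ScoreLabel
{
    public string Label { get; set; }
    public float Score { get; set; }
}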
Now we can use the service in our code:
var pipelineParameters = new PipelineParameters<CarData>(_dataPath, _separator,
    alphanumericColumns: _alphanumericColumns,
    concatenatedColumns: _concatenatedColumns,
    algorithm: new StochasticDualCoordinateAscentRegressor());

var modelPath = await predictionService.TrainAsync<CarData, CarPricePrediction>(pipelineParameters);
var result = await predictionService.EvaluateRegressionAsync<CarData, CarPricePrediction>(pipelineTestParameters, modelPath);
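Here pipelineTestParameters is presumably a second PipelineParameters<CarData> instance built from the test dataset rather than the training one. To complete the picture, the prediction model passed to PredictScoreAsync must derive from the RegressionPrediction base class, which is not shown in this post; a hypothetical sketch and prediction call might look like this (the class members and sample values are assumptions):

// Hypothetical prediction model: RegressionPrediction is the wrapper's base class
// (not shown here) and presumably exposes the predicted "Score" column as a float.
public class CarPricePrediction : RegressionPrediction
{
}

// Hypothetical prediction call with a hand-built CarData instance (made-up fields).
var newCar = new CarData { FuelType = "diesel", Horsepower = 110f };
var prediction = await predictionService.PredictScoreAsync<CarData, CarPricePrediction>(newCar, modelPath);
Console.WriteLine($"Predicted price: {prediction.Score}");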
Now we can build an ML.NET pipeline in a very compact and easy way: we no longer need to remember which operations have to be added to the pipeline, or in which order, and our code is more readable.
Summary
We've seen how to define a class that deals with the ML.NET pipeline parameters. The class accepts a list of arguments, such as the array of columns to be concatenated, the alphanumeric columns to be converted and so on, and exposes the instantiated pipeline operations.
Then we defined the service with the Train, Evaluate and Predict methods and saw how to use them in our code. It's an example of how we can simplify the ML.NET syntax and avoid code duplication.
Note that this post refers to the current pipeline syntax, which will be deprecated in future versions of ML.NET. The development team is working to replace the current pipeline syntax and improve its usability and flexibility.
The source code for this post is available as a GitHub project.