Improving ML.NET model accuracy

Since version 0.8, ML.NET can evaluate feature importance, which helps you understand which columns matter most when predicting the final value.

Permutation Feature Importance (PFI) serves this purpose: it highlights the most important features so you can decide which ones to include. Excluding uninformative features reduces the noise in the dataset, which can improve the results.

So with PFI we can find the most important columns in our learning pipeline and use them to predict values.
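
To make the idea concrete, here is a minimal sketch of permutation importance in plain C#, independent of ML.NET. The toy data, the hard-coded predict function standing in for a trained model, and the helper names RSquared and PermutationImportance are all assumptions made for illustration; ML.NET's own implementation may differ in detail.

```csharp
using System;
using System.Linq;

// Toy data (assumption): the label depends strongly on column 0,
// weakly on column 1, and not at all on column 2.
var rng = new Random(42);
double[][] rows = Enumerable.Range(0, 200)
    .Select(_ => new[] { rng.NextDouble(), rng.NextDouble(), rng.NextDouble() })
    .ToArray();
double[] labels = rows.Select(r => 5.0 * r[0] + 0.5 * r[1]).ToArray();

// Stand-in for a trained model (assumption): it reproduces the labels exactly.
Func<double[], double> predict = r => 5.0 * r[0] + 0.5 * r[1];

// Coefficient of determination: 1 - SS_residual / SS_total.
double RSquared(double[][] data, double[] actual, Func<double[], double> model)
{
    double mean = actual.Average();
    double ssRes = data.Select((r, i) => Math.Pow(actual[i] - model(r), 2)).Sum();
    double ssTot = actual.Sum(a => Math.Pow(a - mean, 2));
    return 1.0 - ssRes / ssTot;
}

// For each column: shuffle its values across rows, re-score the model,
// and report the change in R-squared relative to the unshuffled baseline.
double[] PermutationImportance(double[][] data, double[] actual,
                               Func<double[], double> model, Random random)
{
    double baseline = RSquared(data, actual, model);
    int columnCount = data[0].Length;
    var changes = new double[columnCount];

    for (int col = 0; col < columnCount; col++)
    {
        // Copy the rows so the original data stays untouched.
        double[][] shuffled = data.Select(r => (double[])r.Clone()).ToArray();

        // Fisher-Yates shuffle of a single column.
        for (int i = shuffled.Length - 1; i > 0; i--)
        {
            int j = random.Next(i + 1);
            (shuffled[i][col], shuffled[j][col]) = (shuffled[j][col], shuffled[i][col]);
        }

        changes[col] = RSquared(shuffled, actual, model) - baseline;
    }
    return changes;
}

double[] importance = PermutationImportance(rows, labels, predict, rng);
for (int col = 0; col < importance.Length; col++)
    Console.WriteLine($"column {col}: change in R-squared = {importance[col]:F4}");
```

Shuffling a column the model relies on should lower R-squared sharply, while shuffling a column the model ignores should leave it unchanged.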

Pipeline

The first steps are the same as for value prediction: we have to build the pipeline.

For example, a standard pipeline can look like this:


var mlContext = new MLContext();
var dataView = mlContext.Data.LoadFromTextFile<T>(dataPath, separator, hasHeader: false);
var pipeline = mlContext.Transforms.CopyColumns("Label", _predictedColumn.ColumnName)
    .Append(mlContext.Transforms.Concatenate(_featureColumn, _concatenatedColumns));

This is a very simple setup: it loads data from a file, copies the label column and adds the feature column.

Now that the pipeline is configured we can build the model.

Model

Building the model means taking the pipeline, appending the chosen algorithm, fitting it and transforming the data.


var transformedDataView = pipeline
    .Append(mlContext.Regression.Trainers.LbfgsPoissonRegression())
    .Fit(dataView)
    .Transform(dataView);

The result is a data view with all the pipeline transformations applied, which we will pass to the Permutation Feature Importance method.

Metrics

To get the PFI metrics, we need a transformer in addition to the transformed data view:


var transformer = mlContext.Regression.Trainers.LbfgsPoissonRegression().Fit(transformedDataView);

Now we are able to get the metrics:


var permutationMetrics = mlContext.Regression.PermutationFeatureImportance(transformer, transformedDataView, permutationCount: 3);

The permutationCount parameter specifies how many times each feature is permuted when computing the regression metrics.
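
Each permutation yields one value per metric, and the statistics summarize them. As a rough illustration, the snippet below computes a mean, standard deviation and standard error from three made-up R-squared changes. The values are purely illustrative, and using the sample standard deviation (n - 1 in the denominator) is an assumption; ML.NET's internal computation may differ.

```csharp
using System;
using System.Linq;

// Hypothetical R-squared changes from three permutations of one feature
// (illustrative values only, not real results).
double[] deltas = { -0.42, -0.38, -0.40 };

double mean = deltas.Average();

// Sample standard deviation (assumption: n - 1 in the denominator).
double stdDev = Math.Sqrt(deltas.Sum(d => (d - mean) * (d - mean)) / (deltas.Length - 1));

// Standard error of the mean shrinks as the number of permutations grows.
double stdErr = stdDev / Math.Sqrt(deltas.Length);

Console.WriteLine($"mean={mean:F4} stdDev={stdDev:F4} stdErr={stdErr:F4}");
```

More permutations take longer to compute but give a more stable mean for each feature.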

The result is an array of regression metric statistics, and it is useful to sort it by a specific metric, such as the mean of R-squared:


var regressionMetrics = permutationMetrics
    .Select((metric, index) => new { index, metric.RSquared })
    .OrderByDescending(feature => Math.Abs(feature.RSquared.Mean));

With a loop we can now print the metrics:


// Note: this looks names up by schema index, which assumes the i-th feature
// corresponds to the i-th schema column (scalar columns, same order).
foreach (var metric in regressionMetrics)
{
    if (metric.index >= transformedDataView.Schema.Count)
        continue;

    var column = transformedDataView.Schema[metric.index];
    if (column.IsHidden || column.Name == "Label" || column.Name == "Features")
        continue;

    Console.WriteLine($"{column.Name,-20}|\t{metric.RSquared.Mean:F6}");
}

In my case the output is:

With these statistics we can see which features are the most important and adjust the pipeline accordingly.

The source code of this post is available on a GitHub project.