Since version 0.8, ML.NET can evaluate feature importance, which helps us understand which columns matter most when predicting the final value.
Permutation Feature Importance (PFI) has this purpose: it highlights the most important features so that we can decide which ones to include. Excluding uninformative features from the dataset reduces noise and can improve the results.
So with PFI we can find the most important columns in our learning pipeline and use them to predict values.
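To make the idea concrete, here is a minimal sketch of permutation importance in plain C#, independent of ML.NET. It shuffles one feature column, re-scores the data, and reports the mean change in R squared. The class name, the method names and the `predict` delegate are hypothetical and only illustrate the technique; this is not how ML.NET implements it internally.

```csharp
using System;
using System.Linq;

public static class PermutationImportanceSketch
{
    // Coefficient of determination (R squared) of predictions against labels.
    public static double RSquared(double[] labels, double[] predictions)
    {
        double mean = labels.Average();
        double ssTotal = labels.Sum(y => (y - mean) * (y - mean));
        double ssResidual = labels.Zip(predictions, (y, p) => (y - p) * (y - p)).Sum();
        return 1.0 - ssResidual / ssTotal;
    }

    // Mean change in R squared when the values of one feature column are shuffled.
    public static double MeanRSquaredChange(
        double[][] rows, double[] labels, Func<double[], double> predict,
        int featureIndex, int permutationCount, int seed = 0)
    {
        var random = new Random(seed);
        double baseline = RSquared(labels, rows.Select(predict).ToArray());
        double totalChange = 0.0;

        for (int p = 0; p < permutationCount; p++)
        {
            // Fisher-Yates shuffle of the values in the selected column.
            double[] shuffled = rows.Select(r => r[featureIndex]).ToArray();
            for (int i = shuffled.Length - 1; i > 0; i--)
            {
                int j = random.Next(i + 1);
                (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
            }

            // Score copies of the rows with only that column replaced.
            double[] predictions = rows.Select((r, i) =>
            {
                double[] copy = (double[])r.Clone();
                copy[featureIndex] = shuffled[i];
                return predict(copy);
            }).ToArray();

            totalChange += RSquared(labels, predictions) - baseline;
        }

        return totalChange / permutationCount;
    }
}
```

A feature the model ignores yields a change of exactly zero, because shuffling it cannot alter any prediction. A feature the model relies on usually yields a negative change, and the larger its absolute value, the more important the feature.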
Pipeline
The first steps are the same as for value prediction, so we have to build the pipeline.
For example, a standard pipeline can look like this:
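var mlContext = new MLContext();
// Load the data from a text file into a data view.
var dataView = mlContext.Data.LoadFromTextFile<T>(dataPath, separator, hasHeader: false);
// Copy the predicted column to "Label" and concatenate the input columns into the feature column.
var pipeline = mlContext.Transforms.CopyColumns("Label", _predictedColumn.ColumnName)
    .Append(mlContext.Transforms.Concatenate(_featureColumn, _concatenatedColumns));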
This is a very simple pipeline that loads data from a file, copies the label column and adds the feature column.
Now that the pipeline is configured, we can build the model.
Model
Building the model means taking the pipeline, appending the chosen algorithm, fitting it to the data and transforming the data with it.
var transformedDataView = pipeline
    .Append(mlContext.Regression.Trainers.LbfgsPoissonRegression())
    .Fit(dataView)
    .Transform(dataView);
The result is a transformed data view with all the pipeline transformations applied, which we will pass to the Permutation Feature Importance method.
Metrics
To get the PFI metrics, we need a transformer in addition to the transformed data view:
// Train the regression algorithm on the transformed data to obtain the prediction transformer.
var transformer = mlContext.Regression.Trainers.LbfgsPoissonRegression().Fit(transformedDataView);
Now we are able to get the metrics:
var permutationMetrics = mlContext.Regression.PermutationFeatureImportance(transformer, transformedDataView, permutationCount: 3);
The permutationCount parameter sets how many times each feature is shuffled and the regression metrics are re-evaluated; the reported statistics, such as the mean, are computed across those repetitions.
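As a rough illustration of what those statistics summarize, the sketch below computes the mean and the standard error of a handful of per-permutation metric changes. The class and method names are hypothetical, and using the sample standard deviation is an assumption made for this example; the post does not show how ML.NET computes its own statistics.

```csharp
using System;
using System.Linq;

public static class PermutationStatisticsSketch
{
    // Summarizes the metric changes observed over several permutations of one feature.
    public static (double Mean, double StandardError) Summarize(double[] changes)
    {
        if (changes.Length < 2)
            throw new ArgumentException("At least two permutations are needed to estimate the spread.");

        double mean = changes.Average();

        // Sample variance (divides by n - 1); an assumption made for this sketch.
        double variance = changes.Sum(c => (c - mean) * (c - mean)) / (changes.Length - 1);

        // Standard error of the mean: standard deviation divided by sqrt(n).
        double standardError = Math.Sqrt(variance / changes.Length);

        return (mean, standardError);
    }
}
```

A larger permutation count gives a steadier mean at the cost of more evaluation passes over the data.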
The result is an array of regression metric statistics, and it is useful to order it by a specific metric, such as the mean R squared:
var regressionMetrics = permutationMetrics
    .Select((metric, index) => new { index, metric.RSquared })
    .OrderByDescending(features => Math.Abs(features.RSquared.Mean));
With a loop we can now print the metrics:
foreach (var metric in regressionMetrics)
{
    // Skip indexes outside the schema, hidden columns, and the Label and Features columns.
    if (metric.index >= transformedDataView.Schema.Count ||
        transformedDataView.Schema[metric.index].IsHidden ||
        transformedDataView.Schema[metric.index].Name == "Label" ||
        transformedDataView.Schema[metric.index].Name == "Features")
        continue;

    Console.WriteLine($"{transformedDataView.Schema[metric.index].Name,-20}|\t{metric.RSquared.Mean:F6}");
}
In my case the output is:
With these statistics we can see which features are the most important and adjust the pipeline accordingly.
The source code of this post is available in a GitHub project.
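One detail to keep in mind when printing names: PFI returns one entry per slot of the feature vector, in the order the columns were concatenated. An alternative to looking the names up in the schema is therefore to pair the results with the list of concatenated column names. The sketch below shows that pairing in plain C#; the class and method names are hypothetical, and it assumes that every concatenated column is a single scalar value, so that slot i corresponds to name i.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FeatureRankingSketch
{
    // Pairs each feature slot with its column name and orders by absolute mean change.
    public static List<(string Name, double MeanChange)> Rank(
        string[] featureNames, double[] meanChanges)
    {
        if (featureNames.Length != meanChanges.Length)
            throw new ArgumentException("Expected one mean change per feature name.");

        return featureNames
            .Zip(meanChanges, (name, change) => (Name: name, MeanChange: change))
            .OrderByDescending(feature => Math.Abs(feature.MeanChange))
            .ToList();
    }
}
```

In the code from this post, the names would come from `_concatenatedColumns` and the mean changes from `permutationMetrics.Select(m => m.RSquared.Mean).ToArray()`.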