MLJ Pipeline

A minimal machine learning pipeline using Julia’s MLJ framework

Published July 7, 2023

Below is a minimal (yet complete) example of a machine learning pipeline that uses Julia’s MLJ framework and the Palmer Penguins dataset.

Note that the goal here isn’t necessarily to fit the best model; rather, it’s simply to demonstrate an end-to-end MLJ pipeline.

using DataFrames
using CSV
using Random
using MLJ

Random.seed!(0408)

#get penguins data
penguins = CSV.read(download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"), DataFrame, missingstring="NA")
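
#optionally preview the first few rows to confirm the parse
first(penguins, 3)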

#filter to those without missing body mass
dropmissing!(penguins, :body_mass_g)

#extract body mass as y
y, X = unpack(penguins, ==(:body_mass_g))

#coerce textual columns to Multiclass for modeling
coerce_nms = [:species, :sex, :island]

c_dict = Dict(nm => Multiclass for nm in coerce_nms)

coerce!(X, c_dict)
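
#optionally confirm the new scientific types
schema(X)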

#get training and validation indices
trn, val = partition(eachindex(y), 0.8; shuffle=true)
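
#sanity check: partition returns disjoint vectors of row indices
length(trn), length(val)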

#define pipeline components
imp = FillImputer();
stand = Standardizer();
oh = OneHotEncoder(drop_last=true);
LinearRegression = @load LinearRegressor pkg=GLM add=true
mod = LinearRegression()

#define pipeline
m = Pipeline(imp, stand, oh, mod)
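
#component hyperparameters stay accessible on the pipeline model
#(field names are auto-generated, as the training log below shows)
m.one_hot_encoder.drop_last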

#define machine
mach = machine(m, X, y);

#fit machine on training rows
fit!(mach, rows=trn)
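
#optionally inspect the learned parameters (e.g. the GLM coefficients)
fitted_params(mach)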

#predict mean response for the training rows
preds_train = MLJ.predict_mean(mach, X[trn, :])
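
#one way to score the training fit directly (measures like rmse are callable)
rmse(preds_train, y[trn])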

#evaluate model
cv = CV(nfolds=3)

MLJ.evaluate!(mach, rows=val, resampling=cv, measure=rmse)

#note -- call measures() to see all available measures
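#multiple measures can also be passed at once, e.g.
#MLJ.evaluate!(mach, rows=val, resampling=cv, measure=[rmse, mae])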
┌ Info: Trying to coerce from `Union{Missing, String7}` to `Multiclass`.
└ Coerced to `Union{Missing,Multiclass}` instead.
[ Info: For silent loading, specify `verbosity=0`. 
[ Info: Training machine(ProbabilisticPipeline(fill_imputer = FillImputer(features = Symbol[], …), …), …).
[ Info: Training machine(:fill_imputer, …).
[ Info: Training machine(:standardizer, …).
[ Info: Training machine(:one_hot_encoder, …).
[ Info: Spawning 2 sub-features to one-hot encode feature :species.
[ Info: Spawning 2 sub-features to one-hot encode feature :island.
[ Info: Spawning 1 sub-features to one-hot encode feature :sex.
[ Info: Training machine(:linear_regressor, …).
[ Info: Creating subsamples from a subset of all rows. 
Evaluating over 3 folds: 100%[=========================] Time: 0:00:02
import MLJGLMInterface ✔
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌────────────────────────┬──────────────┬─────────────┬─────────┬───────────────
│ measure                │ operation    │ measurement │ 1.96*SE │ per_fold     ⋯
├────────────────────────┼──────────────┼─────────────┼─────────┼───────────────
│ RootMeanSquaredError() │ predict_mean │ 342.0       │ 94.8    │ [277.0, 322. ⋯
└────────────────────────┴──────────────┴─────────────┴─────────┴───────────────
                                                                1 column omitted