Below is a minimal (yet complete) example of a machine learning pipeline that uses Julia’s MLJ framework and the Palmer Penguins dataset.
Note that the goal here isn’t necessarily to fit the best model; rather, it’s just to demonstrate an MLJ pipeline.
using DataFrames
using CSV
using Random
using MLJ

Random.seed!(0408)

#get penguins data
penguins = CSV.read(download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"), DataFrame, missingstring="NA")

#filter to those without missing body mass
dropmissing!(penguins, :body_mass_g)

#extract body mass as y
y, X = unpack(penguins, ==(:body_mass_g))

# coercing textual columns to multiclass for modeling
coerce_nms = [:species, :sex, :island]
c_dict = Dict(zip(coerce_nms, repeat([Multiclass], 3)))
coerce!(X, c_dict)

#get training and validation indices
trn, val = partition(eachindex(y), 0.8; shuffle=true)

#define pipeline components
imp = FillImputer();
stand = Standardizer();
oh = OneHotEncoder(drop_last=true);
LinearRegression = @load LinearRegressor pkg=GLM add=true
mod = LinearRegression()

#define pipeline
m = Pipeline(imp, stand, oh, mod)

#define machine
mach = machine(m, X, y);

#fit machine on training rows
fit!(mach, rows=trn)

#predicting training y's
ŷ = MLJ.predict_mean(mach, X[trn, :])

#evaluate model
cv = CV(nfolds=3)
MLJ.evaluate!(mach, rows=val, resampling=cv, measure=rmse)
#note -- call measures() to see all available measures
┌ Info: Trying to coerce from `Union{Missing, String7}` to `Multiclass`.
└ Coerced to `Union{Missing,Multiclass}` instead.
[ Info: For silent loading, specify `verbosity=0`.
[ Info: Training machine(ProbabilisticPipeline(fill_imputer = FillImputer(features = Symbol[], …), …), …).
[ Info: Training machine(:fill_imputer, …).
[ Info: Training machine(:standardizer, …).
[ Info: Training machine(:one_hot_encoder, …).
[ Info: Spawning 2 sub-features to one-hot encode feature :species.
[ Info: Spawning 2 sub-features to one-hot encode feature :island.
[ Info: Spawning 1 sub-features to one-hot encode feature :sex.
[ Info: Training machine(:linear_regressor, …).
[ Info: Creating subsamples from a subset of all rows.
Evaluating over 3 folds: 100%[=========================] Time: 0:00:02
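
One thing worth noticing in the log is the line "Creating subsamples from a subset of all rows." Passing `rows=val` together with `resampling=cv` tells `evaluate!` to run the 3-fold cross-validation inside the validation rows, refitting the pipeline on each fold's training portion. The reported RMSE therefore doesn't come from the machine that was fit on `trn` earlier. Here's a rough sketch of that index bookkeeping in plain Julia. It only illustrates the idea and is not MLJ's implementation: `kfold_pairs` is a hypothetical helper, `n = 100` is a made-up row count, and MLJ's exact fold boundaries may differ.

```julia
using Random

# Hypothetical helper (not part of MLJ): split `rows` into `nfolds` contiguous
# folds and return (train, test) pairs, each fold serving once as the test set.
function kfold_pairs(rows::AbstractVector{Int}, nfolds::Int)
    n = length(rows)
    # fold sizes; the first folds absorb any remainder
    sizes = [div(n, nfolds) + (i <= rem(n, nfolds) ? 1 : 0) for i in 1:nfolds]
    stops = cumsum(sizes)
    starts = stops .- sizes .+ 1
    return [(vcat(rows[1:a-1], rows[b+1:end]), rows[a:b]) for (a, b) in zip(starts, stops)]
end

rng = MersenneTwister(408)
n = 100                                  # assumed number of rows, for illustration
shuffled = shuffle(rng, collect(1:n))
cut = round(Int, 0.8 * n)
trn, val = shuffled[1:cut], shuffled[cut+1:end]   # 80/20 shuffled split

# Cross-validation restricted to the validation rows only
folds = kfold_pairs(val, 3)
for (k, (train, test)) in enumerate(folds)
    println("fold $k: train on $(length(train)) rows, test on $(length(test)) rows")
end
```

In this sketch none of the `trn` rows ever appear in a fold, which is why the cross-validated score reflects models trained on fairly small subsets of `val`.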