DataFrames.jl Cheat Sheet

A WIP cheat sheet for working with DataFrames.jl

Published

February 14, 2024

A cheat sheet for using DataFrames.jl. This is a work in progress and will (hopefully) be continuously updated/modified.

Load Pkgs and Read in data

using CSV
using DataFrames
using Statistics
using Chain

#note that the missingstring arg will read in any "NA" strings as missings
penguins = CSV.read(download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"), DataFrame, missingstring="NA"
);

Check Properties of data

size(penguins)
(344, 8)

see the first 3 rows

first(penguins, 3)
3×8 DataFrame
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64? Float64? Int64? Int64? String7? Int64
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18.0 195 3250 female 2007

get the column names of a dataframe

names(penguins)
8-element Vector{String}:
 "species"
 "island"
 "bill_length_mm"
 "bill_depth_mm"
 "flipper_length_mm"
 "body_mass_g"
 "sex"
 "year"

nb we could also convert these to Symbol via:

Symbol.(names(penguins))
8-element Vector{Symbol}:
 :species
 :island
 :bill_length_mm
 :bill_depth_mm
 :flipper_length_mm
 :body_mass_g
 :sex
 :year

If we want to get the column types, we can do the following:

eltype.(eachcol(penguins))
8-element Vector{Type}:
 String15
 String15
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Int64}
 Union{Missing, Int64}
 Union{Missing, String7}
 Int64

Subsetting Rows

In general, we’ll want to use the subset() or subset!() functions to subset rows in a dataframe (subset!() is the in-place version)

The syntax is a little wonky, and it tends to involve passing a symbol into an anonymous function that is then broadcast. For instance, the following will keep only the Chinstrap penguins:

subset(penguins, :species => s -> s .== "Chinstrap")
68×8 DataFrame
43 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64? Float64? Int64? Int64? String7? Int64
1 Chinstrap Dream 46.5 17.9 192 3500 female 2007
2 Chinstrap Dream 50.0 19.5 196 3900 male 2007
3 Chinstrap Dream 51.3 19.2 193 3650 male 2007
4 Chinstrap Dream 45.4 18.7 188 3525 female 2007
5 Chinstrap Dream 52.7 19.8 197 3725 male 2007
6 Chinstrap Dream 45.2 17.8 198 3950 female 2007
7 Chinstrap Dream 46.1 18.2 178 3250 female 2007
8 Chinstrap Dream 51.3 18.2 197 3750 male 2007
9 Chinstrap Dream 46.0 18.9 195 4150 female 2007
10 Chinstrap Dream 51.3 19.9 198 3700 male 2007
11 Chinstrap Dream 46.6 17.8 193 3800 female 2007
12 Chinstrap Dream 51.7 20.3 194 3775 male 2007
13 Chinstrap Dream 47.0 17.3 185 3700 female 2007
57 Chinstrap Dream 45.2 16.6 191 3250 female 2009
58 Chinstrap Dream 49.3 19.9 203 4050 male 2009
59 Chinstrap Dream 50.2 18.8 202 3800 male 2009
60 Chinstrap Dream 45.6 19.4 194 3525 female 2009
61 Chinstrap Dream 51.9 19.5 206 3950 male 2009
62 Chinstrap Dream 46.8 16.5 189 3650 female 2009
63 Chinstrap Dream 45.7 17.0 195 3650 female 2009
64 Chinstrap Dream 55.8 19.8 207 4000 male 2009
65 Chinstrap Dream 43.5 18.1 202 3400 female 2009
66 Chinstrap Dream 49.6 18.2 193 3775 male 2009
67 Chinstrap Dream 50.8 19.0 210 4100 male 2009
68 Chinstrap Dream 50.2 18.7 198 3775 female 2009

And it’s straightforward to pass multiple conditions to the function:

subset(penguins, :species => s -> s .== "Chinstrap", :bill_length_mm => b -> b .> 50, skipmissing=true)
30×8 DataFrame
5 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64? Float64? Int64? Int64? String7? Int64
1 Chinstrap Dream 51.3 19.2 193 3650 male 2007
2 Chinstrap Dream 52.7 19.8 197 3725 male 2007
3 Chinstrap Dream 51.3 18.2 197 3750 male 2007
4 Chinstrap Dream 51.3 19.9 198 3700 male 2007
5 Chinstrap Dream 51.7 20.3 194 3775 male 2007
6 Chinstrap Dream 52.0 18.1 201 4050 male 2007
7 Chinstrap Dream 50.5 19.6 201 4050 male 2007
8 Chinstrap Dream 50.3 20.0 197 3300 male 2007
9 Chinstrap Dream 58.0 17.8 181 3700 female 2007
10 Chinstrap Dream 50.6 19.4 193 3800 male 2007
11 Chinstrap Dream 52.0 19.0 197 4150 male 2007
12 Chinstrap Dream 50.5 18.4 200 3400 female 2008
13 Chinstrap Dream 52.8 20.0 205 4550 male 2008
19 Chinstrap Dream 50.9 17.9 196 3675 female 2009
20 Chinstrap Dream 50.8 18.5 201 4450 male 2009
21 Chinstrap Dream 50.1 17.9 190 3400 female 2009
22 Chinstrap Dream 51.5 18.7 187 3250 male 2009
23 Chinstrap Dream 51.4 19.0 201 3950 male 2009
24 Chinstrap Dream 50.7 19.7 203 4050 male 2009
25 Chinstrap Dream 52.2 18.8 197 3450 male 2009
26 Chinstrap Dream 50.2 18.8 202 3800 male 2009
27 Chinstrap Dream 51.9 19.5 206 3950 male 2009
28 Chinstrap Dream 55.8 19.8 207 4000 male 2009
29 Chinstrap Dream 50.8 19.0 210 4100 male 2009
30 Chinstrap Dream 50.2 18.7 198 3775 female 2009

Note the use of the skipmissing argument. If there are missing values in a column that’s being filtered on, this will throw an error

We can also drop rows with any missing data via dropmissing(df)

dropmissing(penguins)
333×8 DataFrame
308 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64 Float64 Int64 Int64 String7 Int64
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
4 Adelie Torgersen 36.7 19.3 193 3450 female 2007
5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
6 Adelie Torgersen 38.9 17.8 181 3625 female 2007
7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
8 Adelie Torgersen 41.1 17.6 182 3200 female 2007
9 Adelie Torgersen 38.6 21.2 191 3800 male 2007
10 Adelie Torgersen 34.6 21.1 198 4400 male 2007
11 Adelie Torgersen 36.6 17.8 185 3700 female 2007
12 Adelie Torgersen 38.7 19.0 195 3450 female 2007
13 Adelie Torgersen 42.5 20.7 197 4500 male 2007
322 Chinstrap Dream 45.2 16.6 191 3250 female 2009
323 Chinstrap Dream 49.3 19.9 203 4050 male 2009
324 Chinstrap Dream 50.2 18.8 202 3800 male 2009
325 Chinstrap Dream 45.6 19.4 194 3525 female 2009
326 Chinstrap Dream 51.9 19.5 206 3950 male 2009
327 Chinstrap Dream 46.8 16.5 189 3650 female 2009
328 Chinstrap Dream 45.7 17.0 195 3650 female 2009
329 Chinstrap Dream 55.8 19.8 207 4000 male 2009
330 Chinstrap Dream 43.5 18.1 202 3400 female 2009
331 Chinstrap Dream 49.6 18.2 193 3775 male 2009
332 Chinstrap Dream 50.8 19.0 210 4100 male 2009
333 Chinstrap Dream 50.2 18.7 198 3775 female 2009

Or we can choose to drop rows with missing values on a specific column:

dropmissing(penguins, :bill_length_mm)
342×8 DataFrame
317 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64 Float64? Int64? Int64? String7? Int64
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
4 Adelie Torgersen 36.7 19.3 193 3450 female 2007
5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
6 Adelie Torgersen 38.9 17.8 181 3625 female 2007
7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
8 Adelie Torgersen 34.1 18.1 193 3475 missing 2007
9 Adelie Torgersen 42.0 20.2 190 4250 missing 2007
10 Adelie Torgersen 37.8 17.1 186 3300 missing 2007
11 Adelie Torgersen 37.8 17.3 180 3700 missing 2007
12 Adelie Torgersen 41.1 17.6 182 3200 female 2007
13 Adelie Torgersen 38.6 21.2 191 3800 male 2007
331 Chinstrap Dream 45.2 16.6 191 3250 female 2009
332 Chinstrap Dream 49.3 19.9 203 4050 male 2009
333 Chinstrap Dream 50.2 18.8 202 3800 male 2009
334 Chinstrap Dream 45.6 19.4 194 3525 female 2009
335 Chinstrap Dream 51.9 19.5 206 3950 male 2009
336 Chinstrap Dream 46.8 16.5 189 3650 female 2009
337 Chinstrap Dream 45.7 17.0 195 3650 female 2009
338 Chinstrap Dream 55.8 19.8 207 4000 male 2009
339 Chinstrap Dream 43.5 18.1 202 3400 female 2009
340 Chinstrap Dream 49.6 18.2 193 3775 male 2009
341 Chinstrap Dream 50.8 19.0 210 4100 male 2009
342 Chinstrap Dream 50.2 18.7 198 3775 female 2009

We can also subset rows using regex:

subset(penguins, :species => ByRow(x -> occursin(r"^Chin", x)))
68×8 DataFrame
43 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64? Float64? Int64? Int64? String7? Int64
1 Chinstrap Dream 46.5 17.9 192 3500 female 2007
2 Chinstrap Dream 50.0 19.5 196 3900 male 2007
3 Chinstrap Dream 51.3 19.2 193 3650 male 2007
4 Chinstrap Dream 45.4 18.7 188 3525 female 2007
5 Chinstrap Dream 52.7 19.8 197 3725 male 2007
6 Chinstrap Dream 45.2 17.8 198 3950 female 2007
7 Chinstrap Dream 46.1 18.2 178 3250 female 2007
8 Chinstrap Dream 51.3 18.2 197 3750 male 2007
9 Chinstrap Dream 46.0 18.9 195 4150 female 2007
10 Chinstrap Dream 51.3 19.9 198 3700 male 2007
11 Chinstrap Dream 46.6 17.8 193 3800 female 2007
12 Chinstrap Dream 51.7 20.3 194 3775 male 2007
13 Chinstrap Dream 47.0 17.3 185 3700 female 2007
57 Chinstrap Dream 45.2 16.6 191 3250 female 2009
58 Chinstrap Dream 49.3 19.9 203 4050 male 2009
59 Chinstrap Dream 50.2 18.8 202 3800 male 2009
60 Chinstrap Dream 45.6 19.4 194 3525 female 2009
61 Chinstrap Dream 51.9 19.5 206 3950 male 2009
62 Chinstrap Dream 46.8 16.5 189 3650 female 2009
63 Chinstrap Dream 45.7 17.0 195 3650 female 2009
64 Chinstrap Dream 55.8 19.8 207 4000 male 2009
65 Chinstrap Dream 43.5 18.1 202 3400 female 2009
66 Chinstrap Dream 49.6 18.2 193 3775 male 2009
67 Chinstrap Dream 50.8 19.0 210 4100 male 2009
68 Chinstrap Dream 50.2 18.7 198 3775 female 2009

This uses the ByRow() function to apply a function to each row as opposed to broadcasting, but the broadcasting notation also works

subset(penguins, :species => x -> occursin.(r"^Chin", x))
68×8 DataFrame
43 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64? Float64? Int64? Int64? String7? Int64
1 Chinstrap Dream 46.5 17.9 192 3500 female 2007
2 Chinstrap Dream 50.0 19.5 196 3900 male 2007
3 Chinstrap Dream 51.3 19.2 193 3650 male 2007
4 Chinstrap Dream 45.4 18.7 188 3525 female 2007
5 Chinstrap Dream 52.7 19.8 197 3725 male 2007
6 Chinstrap Dream 45.2 17.8 198 3950 female 2007
7 Chinstrap Dream 46.1 18.2 178 3250 female 2007
8 Chinstrap Dream 51.3 18.2 197 3750 male 2007
9 Chinstrap Dream 46.0 18.9 195 4150 female 2007
10 Chinstrap Dream 51.3 19.9 198 3700 male 2007
11 Chinstrap Dream 46.6 17.8 193 3800 female 2007
12 Chinstrap Dream 51.7 20.3 194 3775 male 2007
13 Chinstrap Dream 47.0 17.3 185 3700 female 2007
57 Chinstrap Dream 45.2 16.6 191 3250 female 2009
58 Chinstrap Dream 49.3 19.9 203 4050 male 2009
59 Chinstrap Dream 50.2 18.8 202 3800 male 2009
60 Chinstrap Dream 45.6 19.4 194 3525 female 2009
61 Chinstrap Dream 51.9 19.5 206 3950 male 2009
62 Chinstrap Dream 46.8 16.5 189 3650 female 2009
63 Chinstrap Dream 45.7 17.0 195 3650 female 2009
64 Chinstrap Dream 55.8 19.8 207 4000 male 2009
65 Chinstrap Dream 43.5 18.1 202 3400 female 2009
66 Chinstrap Dream 49.6 18.2 193 3775 male 2009
67 Chinstrap Dream 50.8 19.0 210 4100 male 2009
68 Chinstrap Dream 50.2 18.7 198 3775 female 2009

Finally, note that we can also use the bracket syntax with row indices to filter, but this feels like it’s not the most useful for rows:

#select rows 1, 10, 15
penguins[[1, 10, 15], :]
3×8 DataFrame
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
String15 String15 Float64? Float64? Int64? Int64? String7? Int64
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 42.0 20.2 190 4250 missing 2007
3 Adelie Torgersen 34.6 21.1 198 4400 male 2007

Subsetting Columns

Using bracket syntax to subset columns makes more sense:

penguins[:, 1:3]
344×3 DataFrame
319 rows omitted
Row species island bill_length_mm
String15 String15 Float64?
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen missing
5 Adelie Torgersen 36.7
6 Adelie Torgersen 39.3
7 Adelie Torgersen 38.9
8 Adelie Torgersen 39.2
9 Adelie Torgersen 34.1
10 Adelie Torgersen 42.0
11 Adelie Torgersen 37.8
12 Adelie Torgersen 37.8
13 Adelie Torgersen 41.1
333 Chinstrap Dream 45.2
334 Chinstrap Dream 49.3
335 Chinstrap Dream 50.2
336 Chinstrap Dream 45.6
337 Chinstrap Dream 51.9
338 Chinstrap Dream 46.8
339 Chinstrap Dream 45.7
340 Chinstrap Dream 55.8
341 Chinstrap Dream 43.5
342 Chinstrap Dream 49.6
343 Chinstrap Dream 50.8
344 Chinstrap Dream 50.2

We can also pass strings or symbols when using the bracket subsetting syntax:

penguins[:, [:species, :island]]
344×2 DataFrame
319 rows omitted
Row species island
String15 String15
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
11 Adelie Torgersen
12 Adelie Torgersen
13 Adelie Torgersen
333 Chinstrap Dream
334 Chinstrap Dream
335 Chinstrap Dream
336 Chinstrap Dream
337 Chinstrap Dream
338 Chinstrap Dream
339 Chinstrap Dream
340 Chinstrap Dream
341 Chinstrap Dream
342 Chinstrap Dream
343 Chinstrap Dream
344 Chinstrap Dream
penguins[:, ["species", "island"]]
344×2 DataFrame
319 rows omitted
Row species island
String15 String15
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
11 Adelie Torgersen
12 Adelie Torgersen
13 Adelie Torgersen
333 Chinstrap Dream
334 Chinstrap Dream
335 Chinstrap Dream
336 Chinstrap Dream
337 Chinstrap Dream
338 Chinstrap Dream
339 Chinstrap Dream
340 Chinstrap Dream
341 Chinstrap Dream
342 Chinstrap Dream
343 Chinstrap Dream
344 Chinstrap Dream

An alternative way to subset columns is to use the select() function from DataFrames.jl

select(penguins, [:species, :island])
344×2 DataFrame
319 rows omitted
Row species island
String15 String15
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
11 Adelie Torgersen
12 Adelie Torgersen
13 Adelie Torgersen
333 Chinstrap Dream
334 Chinstrap Dream
335 Chinstrap Dream
336 Chinstrap Dream
337 Chinstrap Dream
338 Chinstrap Dream
339 Chinstrap Dream
340 Chinstrap Dream
341 Chinstrap Dream
342 Chinstrap Dream
343 Chinstrap Dream
344 Chinstrap Dream

Just as with subsetting rows, we can also subset columns using regex. And the column subsetting is way easier than row subsetting:

select(penguins, r"bill")
344×2 DataFrame
319 rows omitted
Row bill_length_mm bill_depth_mm
Float64? Float64?
1 39.1 18.7
2 39.5 17.4
3 40.3 18.0
4 missing missing
5 36.7 19.3
6 39.3 20.6
7 38.9 17.8
8 39.2 19.6
9 34.1 18.1
10 42.0 20.2
11 37.8 17.1
12 37.8 17.3
13 41.1 17.6
333 45.2 16.6
334 49.3 19.9
335 50.2 18.8
336 45.6 19.4
337 51.9 19.5
338 46.8 16.5
339 45.7 17.0
340 55.8 19.8
341 43.5 18.1
342 49.6 18.2
343 50.8 19.0
344 50.2 18.7

Creating New Columns

We can create new columns using the transform() function (or transform!() for in-place).

The general pattern here is that creating a column takes 3 components:

  1. an input column,
  2. a function to apply to that column (usually wrapped in ByRow()), and
  3. the name of the output column
transform(penguins, :species => ByRow(x -> x == "Chinstrap") => :is_chinstrap)
344×9 DataFrame
319 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year is_chinstrap
String15 String15 Float64? Float64? Int64? Int64? String7? Int64 Bool
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 false
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 false
3 Adelie Torgersen 40.3 18.0 195 3250 female 2007 false
4 Adelie Torgersen missing missing missing missing missing 2007 false
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 false
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 false
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007 false
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 false
9 Adelie Torgersen 34.1 18.1 193 3475 missing 2007 false
10 Adelie Torgersen 42.0 20.2 190 4250 missing 2007 false
11 Adelie Torgersen 37.8 17.1 186 3300 missing 2007 false
12 Adelie Torgersen 37.8 17.3 180 3700 missing 2007 false
13 Adelie Torgersen 41.1 17.6 182 3200 female 2007 false
333 Chinstrap Dream 45.2 16.6 191 3250 female 2009 true
334 Chinstrap Dream 49.3 19.9 203 4050 male 2009 true
335 Chinstrap Dream 50.2 18.8 202 3800 male 2009 true
336 Chinstrap Dream 45.6 19.4 194 3525 female 2009 true
337 Chinstrap Dream 51.9 19.5 206 3950 male 2009 true
338 Chinstrap Dream 46.8 16.5 189 3650 female 2009 true
339 Chinstrap Dream 45.7 17.0 195 3650 female 2009 true
340 Chinstrap Dream 55.8 19.8 207 4000 male 2009 true
341 Chinstrap Dream 43.5 18.1 202 3400 female 2009 true
342 Chinstrap Dream 49.6 18.2 193 3775 male 2009 true
343 Chinstrap Dream 50.8 19.0 210 4100 male 2009 true
344 Chinstrap Dream 50.2 18.7 198 3775 female 2009 true

We can also create multiple columns at the same time:

transform(
    penguins,
    :species => ByRow(x -> x == "Chinstrap") => :is_chinstrap,
    :body_mass_g => ByRow(x -> x > 4000) => :chonk
)
344×10 DataFrame
319 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year is_chinstrap chonk
String15 String15 Float64? Float64? Int64? Int64? String7? Int64 Bool Bool?
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 false false
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 false false
3 Adelie Torgersen 40.3 18.0 195 3250 female 2007 false false
4 Adelie Torgersen missing missing missing missing missing 2007 false missing
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 false false
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 false false
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007 false false
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 false true
9 Adelie Torgersen 34.1 18.1 193 3475 missing 2007 false false
10 Adelie Torgersen 42.0 20.2 190 4250 missing 2007 false true
11 Adelie Torgersen 37.8 17.1 186 3300 missing 2007 false false
12 Adelie Torgersen 37.8 17.3 180 3700 missing 2007 false false
13 Adelie Torgersen 41.1 17.6 182 3200 female 2007 false false
333 Chinstrap Dream 45.2 16.6 191 3250 female 2009 true false
334 Chinstrap Dream 49.3 19.9 203 4050 male 2009 true true
335 Chinstrap Dream 50.2 18.8 202 3800 male 2009 true false
336 Chinstrap Dream 45.6 19.4 194 3525 female 2009 true false
337 Chinstrap Dream 51.9 19.5 206 3950 male 2009 true false
338 Chinstrap Dream 46.8 16.5 189 3650 female 2009 true false
339 Chinstrap Dream 45.7 17.0 195 3650 female 2009 true false
340 Chinstrap Dream 55.8 19.8 207 4000 male 2009 true false
341 Chinstrap Dream 43.5 18.1 202 3400 female 2009 true false
342 Chinstrap Dream 49.6 18.2 193 3775 male 2009 true false
343 Chinstrap Dream 50.8 19.0 210 4100 male 2009 true true
344 Chinstrap Dream 50.2 18.7 198 3775 female 2009 true false

We can pass multiple columns into an operation as well:

transform(
    penguins,
    [:bill_length_mm, :bill_depth_mm] => (+) => :bill_stuff
)
344×9 DataFrame
319 rows omitted
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year bill_stuff
String15 String15 Float64? Float64? Int64? Int64? String7? Int64 Float64?
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 57.8
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 56.9
3 Adelie Torgersen 40.3 18.0 195 3250 female 2007 58.3
4 Adelie Torgersen missing missing missing missing missing 2007 missing
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 56.0
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 59.9
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007 56.7
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 58.8
9 Adelie Torgersen 34.1 18.1 193 3475 missing 2007 52.2
10 Adelie Torgersen 42.0 20.2 190 4250 missing 2007 62.2
11 Adelie Torgersen 37.8 17.1 186 3300 missing 2007 54.9
12 Adelie Torgersen 37.8 17.3 180 3700 missing 2007 55.1
13 Adelie Torgersen 41.1 17.6 182 3200 female 2007 58.7
333 Chinstrap Dream 45.2 16.6 191 3250 female 2009 61.8
334 Chinstrap Dream 49.3 19.9 203 4050 male 2009 69.2
335 Chinstrap Dream 50.2 18.8 202 3800 male 2009 69.0
336 Chinstrap Dream 45.6 19.4 194 3525 female 2009 65.0
337 Chinstrap Dream 51.9 19.5 206 3950 male 2009 71.4
338 Chinstrap Dream 46.8 16.5 189 3650 female 2009 63.3
339 Chinstrap Dream 45.7 17.0 195 3650 female 2009 62.7
340 Chinstrap Dream 55.8 19.8 207 4000 male 2009 75.6
341 Chinstrap Dream 43.5 18.1 202 3400 female 2009 61.6
342 Chinstrap Dream 49.6 18.2 193 3775 male 2009 67.8
343 Chinstrap Dream 50.8 19.0 210 4100 male 2009 69.8
344 Chinstrap Dream 50.2 18.7 198 3775 female 2009 68.9

This syntax isn’t my favorite. Another way to do the same, though, if we wanted:

penguins.:bill_stuff = penguins.:bill_length_mm .+ penguins.:bill_depth_mm
344-element Vector{Union{Missing, Float64}}:
 57.8
 56.9
 58.3
   missing
 56.0
 59.9
 56.7
 58.800000000000004
 52.2
 62.2
 54.9
 55.099999999999994
 58.7
  ⋮
 61.800000000000004
 69.19999999999999
 69.0
 65.0
 71.4
 63.3
 62.7
 75.6
 61.6
 67.8
 69.8
 68.9

Summarizing DataFrames

Often, summarizing will involve combining the groupby() method with the combine() method:

# we need to drop missings here on the body mass column first

grouped_penguins = groupby(dropmissing(penguins, :body_mass_g), :species)

combine(grouped_penguins, :body_mass_g => mean)
3×2 DataFrame
Row species body_mass_g_mean
String15 Float64
1 Adelie 3700.66
2 Gentoo 5076.02
3 Chinstrap 3733.09

Counting is a common operation:

combine(grouped_penguins, nrow)
3×2 DataFrame
Row species nrow
String15 Int64
1 Adelie 151
2 Gentoo 123
3 Chinstrap 68

The problem here is that we’ve already dropped missing values on body mass, so we’re not counting all of the observations here. I’m not sure if there’s a great way around this inside of combine(), but something to explore more…

Anyway, say we want to take the mean of all of our numeric columns. We can do that as follows. Note that I don’t really like this syntax, especially compared to some of the way you’d do this in tidyverse or even in Python’s polars. It also has the issue that I’m having to keep complete observations rather than column-wise complete data.

numeric_cols = names(penguins, Union{Missing,Number})

nomiss_penguins = dropmissing(penguins)

grouped_penguins = groupby(nomiss_penguins, :species)

species_means = combine(grouped_penguins, numeric_cols .=> mean)
3×7 DataFrame
Row species bill_length_mm_mean bill_depth_mm_mean flipper_length_mm_mean body_mass_g_mean year_mean bill_stuff_mean
String15 Float64 Float64 Float64 Float64 Float64 Float64
1 Adelie 38.824 18.3473 190.103 3706.16 2008.05 57.1712
2 Gentoo 47.5681 14.9966 217.235 5092.44 2008.07 62.5647
3 Chinstrap 48.8338 18.4206 195.824 3733.09 2007.97 67.2544

We can also perform multiple aggregations on a grouped dataframe, just like we did when we applied multiple transformations via transform():

combine(
    grouped_penguins,
    :bill_length_mm => mean,
    nrow => :n
)
3×3 DataFrame
Row species bill_length_mm_mean n
String15 Float64 Int64
1 Adelie 38.824 146
2 Gentoo 47.5681 119
3 Chinstrap 48.8338 68

Reshaping data

Let’s go back to our species_means data from before. Say we want to pivot this longer, so that we have a 3-column dataframe where our columns are species, metric, and value. We can do this with the stack() function:

species_means_long = stack(species_means, Not(:species), variable_name=:metric, value_name=:value)
18×3 DataFrame
Row species metric value
String15 String Float64
1 Adelie bill_length_mm_mean 38.824
2 Gentoo bill_length_mm_mean 47.5681
3 Chinstrap bill_length_mm_mean 48.8338
4 Adelie bill_depth_mm_mean 18.3473
5 Gentoo bill_depth_mm_mean 14.9966
6 Chinstrap bill_depth_mm_mean 18.4206
7 Adelie flipper_length_mm_mean 190.103
8 Gentoo flipper_length_mm_mean 217.235
9 Chinstrap flipper_length_mm_mean 195.824
10 Adelie body_mass_g_mean 3706.16
11 Gentoo body_mass_g_mean 5092.44
12 Chinstrap body_mass_g_mean 3733.09
13 Adelie year_mean 2008.05
14 Gentoo year_mean 2008.07
15 Chinstrap year_mean 2007.97
16 Adelie bill_stuff_mean 57.1712
17 Gentoo bill_stuff_mean 62.5647
18 Chinstrap bill_stuff_mean 67.2544

To reshape back to wide, we can use the (appropriately named!) unstack() function. This function has the following signature:

unstack(df, id_col, variable_names, variable_values)

So in the example below, species is the id variable, metric is the column with variable names, and value is the column with values.

unstack(species_means_long, :species, :metric, :value)
3×7 DataFrame
Row species bill_length_mm_mean bill_depth_mm_mean flipper_length_mm_mean body_mass_g_mean year_mean bill_stuff_mean
String15 Float64? Float64? Float64? Float64? Float64? Float64?
1 Adelie 38.824 18.3473 190.103 3706.16 2008.05 57.1712
2 Gentoo 47.5681 14.9966 217.235 5092.44 2008.07 62.5647
3 Chinstrap 48.8338 18.4206 195.824 3733.09 2007.97 67.2544

Chaining Operations

You can chain dataframe operations together in Julia using the Chain.jl package. This is basically the same as using the pipe in tidyverse R workflows to set up data processing pipelines.

For instance, we can do the following to summarize our data:

@chain penguins begin
    dropmissing
    groupby(:species)
    combine(:body_mass_g => mean)
end
3×2 DataFrame
Row species body_mass_g_mean
String15 Float64
1 Adelie 3706.16
2 Gentoo 5092.44
3 Chinstrap 3733.09

The “default” specification of a chain (like above) is to assume the output of the previous function call is the first argument to the subsequent function call. So the output of dropmissing – a dataframe – is the first argument to groupby(), etc.

Another way to specify a chain is to use an underscore _ as a placeholder for the output of the previous expression. This isn’t super useful in cases where a dataframe will be the first argument, like below:

@chain penguins begin
    dropmissing(_)
    groupby(_, :species)
    combine(_, :body_mass_g => mean)
end
3×2 DataFrame
Row species body_mass_g_mean
String15 Float64
1 Adelie 3706.16
2 Gentoo 5092.44
3 Chinstrap 3733.09

A not-very-useful but nevertheless illustrative version of using an underscore to pipe a value into the not-first argument of a function is below:

@chain :body_mass_g begin
    combine(dropmissing(penguins), _ => mean)
end
1×1 DataFrame
Row body_mass_g_mean
Float64
1 4207.06