Load Pkgs and Read in data

using CSV
using DataFrames
using Statistics
using Chain

#note that the missingstring arg will read in any "NA" strings as missings
penguins = CSV.read(download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"), DataFrame, missingstring="NA"
);

Check Properties of data

size(penguins)

(344, 8)

see the first 3 rows

first(penguins, 3)

3×8 DataFrame

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007
2	Adelie	Torgersen	39.5	17.4	186	3800	female	2007
3	Adelie	Torgersen	40.3	18.0	195	3250	female	2007

get the column names of a dataframe

names(penguins)

8-element Vector{String}:
 "species"
 "island"
 "bill_length_mm"
 "bill_depth_mm"
 "flipper_length_mm"
 "body_mass_g"
 "sex"
 "year"

nb we could also convert these to Symbol via:

Symbol.(names(penguins))

8-element Vector{Symbol}:
 :species
 :island
 :bill_length_mm
 :bill_depth_mm
 :flipper_length_mm
 :body_mass_g
 :sex
 :year

If we want to get the column types, we can do the following:

eltype.(eachcol(penguins))

8-element Vector{Type}:
 String15
 String15
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Int64}
 Union{Missing, Int64}
 Union{Missing, String7}
 Int64

Subsetting Rows

In general, we’ll want to use the subset() or subset!() functions to subset rows in a dataframe (subset!() is the in-place version)

The syntax is a little wonky, and it tends to involve passing a symbol into an anonymous function that is then broadcast. For instance, the following will keep only the Chinstrap penguins:

subset(penguins, :species => s -> s .== "Chinstrap")

68×8 DataFrame

43 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64
1	Chinstrap	Dream	46.5	17.9	192	3500	female	2007
2	Chinstrap	Dream	50.0	19.5	196	3900	male	2007
3	Chinstrap	Dream	51.3	19.2	193	3650	male	2007
4	Chinstrap	Dream	45.4	18.7	188	3525	female	2007
5	Chinstrap	Dream	52.7	19.8	197	3725	male	2007
6	Chinstrap	Dream	45.2	17.8	198	3950	female	2007
7	Chinstrap	Dream	46.1	18.2	178	3250	female	2007
8	Chinstrap	Dream	51.3	18.2	197	3750	male	2007
9	Chinstrap	Dream	46.0	18.9	195	4150	female	2007
10	Chinstrap	Dream	51.3	19.9	198	3700	male	2007
11	Chinstrap	Dream	46.6	17.8	193	3800	female	2007
12	Chinstrap	Dream	51.7	20.3	194	3775	male	2007
13	Chinstrap	Dream	47.0	17.3	185	3700	female	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
57	Chinstrap	Dream	45.2	16.6	191	3250	female	2009
58	Chinstrap	Dream	49.3	19.9	203	4050	male	2009
59	Chinstrap	Dream	50.2	18.8	202	3800	male	2009
60	Chinstrap	Dream	45.6	19.4	194	3525	female	2009
61	Chinstrap	Dream	51.9	19.5	206	3950	male	2009
62	Chinstrap	Dream	46.8	16.5	189	3650	female	2009
63	Chinstrap	Dream	45.7	17.0	195	3650	female	2009
64	Chinstrap	Dream	55.8	19.8	207	4000	male	2009
65	Chinstrap	Dream	43.5	18.1	202	3400	female	2009
66	Chinstrap	Dream	49.6	18.2	193	3775	male	2009
67	Chinstrap	Dream	50.8	19.0	210	4100	male	2009
68	Chinstrap	Dream	50.2	18.7	198	3775	female	2009

And it’s straightforward to pass multiple conditions to the function:

subset(penguins, :species => s -> s .== "Chinstrap", :bill_length_mm => b -> b .> 50, skipmissing=true)

30×8 DataFrame

5 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64
1	Chinstrap	Dream	51.3	19.2	193	3650	male	2007
2	Chinstrap	Dream	52.7	19.8	197	3725	male	2007
3	Chinstrap	Dream	51.3	18.2	197	3750	male	2007
4	Chinstrap	Dream	51.3	19.9	198	3700	male	2007
5	Chinstrap	Dream	51.7	20.3	194	3775	male	2007
6	Chinstrap	Dream	52.0	18.1	201	4050	male	2007
7	Chinstrap	Dream	50.5	19.6	201	4050	male	2007
8	Chinstrap	Dream	50.3	20.0	197	3300	male	2007
9	Chinstrap	Dream	58.0	17.8	181	3700	female	2007
10	Chinstrap	Dream	50.6	19.4	193	3800	male	2007
11	Chinstrap	Dream	52.0	19.0	197	4150	male	2007
12	Chinstrap	Dream	50.5	18.4	200	3400	female	2008
13	Chinstrap	Dream	52.8	20.0	205	4550	male	2008
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
19	Chinstrap	Dream	50.9	17.9	196	3675	female	2009
20	Chinstrap	Dream	50.8	18.5	201	4450	male	2009
21	Chinstrap	Dream	50.1	17.9	190	3400	female	2009
22	Chinstrap	Dream	51.5	18.7	187	3250	male	2009
23	Chinstrap	Dream	51.4	19.0	201	3950	male	2009
24	Chinstrap	Dream	50.7	19.7	203	4050	male	2009
25	Chinstrap	Dream	52.2	18.8	197	3450	male	2009
26	Chinstrap	Dream	50.2	18.8	202	3800	male	2009
27	Chinstrap	Dream	51.9	19.5	206	3950	male	2009
28	Chinstrap	Dream	55.8	19.8	207	4000	male	2009
29	Chinstrap	Dream	50.8	19.0	210	4100	male	2009
30	Chinstrap	Dream	50.2	18.7	198	3775	female	2009

Note the use of the skipmissing argument. If there are missing values in a column that’s being filtered on, this will throw an error

We can also drop rows with any missing data via dropmissing(df)

dropmissing(penguins)

333×8 DataFrame

308 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64	Float64	Int64	Int64	String7	Int64
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007
2	Adelie	Torgersen	39.5	17.4	186	3800	female	2007
3	Adelie	Torgersen	40.3	18.0	195	3250	female	2007
4	Adelie	Torgersen	36.7	19.3	193	3450	female	2007
5	Adelie	Torgersen	39.3	20.6	190	3650	male	2007
6	Adelie	Torgersen	38.9	17.8	181	3625	female	2007
7	Adelie	Torgersen	39.2	19.6	195	4675	male	2007
8	Adelie	Torgersen	41.1	17.6	182	3200	female	2007
9	Adelie	Torgersen	38.6	21.2	191	3800	male	2007
10	Adelie	Torgersen	34.6	21.1	198	4400	male	2007
11	Adelie	Torgersen	36.6	17.8	185	3700	female	2007
12	Adelie	Torgersen	38.7	19.0	195	3450	female	2007
13	Adelie	Torgersen	42.5	20.7	197	4500	male	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
322	Chinstrap	Dream	45.2	16.6	191	3250	female	2009
323	Chinstrap	Dream	49.3	19.9	203	4050	male	2009
324	Chinstrap	Dream	50.2	18.8	202	3800	male	2009
325	Chinstrap	Dream	45.6	19.4	194	3525	female	2009
326	Chinstrap	Dream	51.9	19.5	206	3950	male	2009
327	Chinstrap	Dream	46.8	16.5	189	3650	female	2009
328	Chinstrap	Dream	45.7	17.0	195	3650	female	2009
329	Chinstrap	Dream	55.8	19.8	207	4000	male	2009
330	Chinstrap	Dream	43.5	18.1	202	3400	female	2009
331	Chinstrap	Dream	49.6	18.2	193	3775	male	2009
332	Chinstrap	Dream	50.8	19.0	210	4100	male	2009
333	Chinstrap	Dream	50.2	18.7	198	3775	female	2009

Or we can choose to drop rows with missing values on a specific column:

dropmissing(penguins, :bill_length_mm)

342×8 DataFrame

317 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64	Float64?	Int64?	Int64?	String7?	Int64
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007
2	Adelie	Torgersen	39.5	17.4	186	3800	female	2007
3	Adelie	Torgersen	40.3	18.0	195	3250	female	2007
4	Adelie	Torgersen	36.7	19.3	193	3450	female	2007
5	Adelie	Torgersen	39.3	20.6	190	3650	male	2007
6	Adelie	Torgersen	38.9	17.8	181	3625	female	2007
7	Adelie	Torgersen	39.2	19.6	195	4675	male	2007
8	Adelie	Torgersen	34.1	18.1	193	3475	missing	2007
9	Adelie	Torgersen	42.0	20.2	190	4250	missing	2007
10	Adelie	Torgersen	37.8	17.1	186	3300	missing	2007
11	Adelie	Torgersen	37.8	17.3	180	3700	missing	2007
12	Adelie	Torgersen	41.1	17.6	182	3200	female	2007
13	Adelie	Torgersen	38.6	21.2	191	3800	male	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
331	Chinstrap	Dream	45.2	16.6	191	3250	female	2009
332	Chinstrap	Dream	49.3	19.9	203	4050	male	2009
333	Chinstrap	Dream	50.2	18.8	202	3800	male	2009
334	Chinstrap	Dream	45.6	19.4	194	3525	female	2009
335	Chinstrap	Dream	51.9	19.5	206	3950	male	2009
336	Chinstrap	Dream	46.8	16.5	189	3650	female	2009
337	Chinstrap	Dream	45.7	17.0	195	3650	female	2009
338	Chinstrap	Dream	55.8	19.8	207	4000	male	2009
339	Chinstrap	Dream	43.5	18.1	202	3400	female	2009
340	Chinstrap	Dream	49.6	18.2	193	3775	male	2009
341	Chinstrap	Dream	50.8	19.0	210	4100	male	2009
342	Chinstrap	Dream	50.2	18.7	198	3775	female	2009

We can also subset rows using regex:

subset(penguins, :species => ByRow(x -> occursin(r"^Chin", x)))

68×8 DataFrame

43 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64
1	Chinstrap	Dream	46.5	17.9	192	3500	female	2007
2	Chinstrap	Dream	50.0	19.5	196	3900	male	2007
3	Chinstrap	Dream	51.3	19.2	193	3650	male	2007
4	Chinstrap	Dream	45.4	18.7	188	3525	female	2007
5	Chinstrap	Dream	52.7	19.8	197	3725	male	2007
6	Chinstrap	Dream	45.2	17.8	198	3950	female	2007
7	Chinstrap	Dream	46.1	18.2	178	3250	female	2007
8	Chinstrap	Dream	51.3	18.2	197	3750	male	2007
9	Chinstrap	Dream	46.0	18.9	195	4150	female	2007
10	Chinstrap	Dream	51.3	19.9	198	3700	male	2007
11	Chinstrap	Dream	46.6	17.8	193	3800	female	2007
12	Chinstrap	Dream	51.7	20.3	194	3775	male	2007
13	Chinstrap	Dream	47.0	17.3	185	3700	female	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
57	Chinstrap	Dream	45.2	16.6	191	3250	female	2009
58	Chinstrap	Dream	49.3	19.9	203	4050	male	2009
59	Chinstrap	Dream	50.2	18.8	202	3800	male	2009
60	Chinstrap	Dream	45.6	19.4	194	3525	female	2009
61	Chinstrap	Dream	51.9	19.5	206	3950	male	2009
62	Chinstrap	Dream	46.8	16.5	189	3650	female	2009
63	Chinstrap	Dream	45.7	17.0	195	3650	female	2009
64	Chinstrap	Dream	55.8	19.8	207	4000	male	2009
65	Chinstrap	Dream	43.5	18.1	202	3400	female	2009
66	Chinstrap	Dream	49.6	18.2	193	3775	male	2009
67	Chinstrap	Dream	50.8	19.0	210	4100	male	2009
68	Chinstrap	Dream	50.2	18.7	198	3775	female	2009

This uses the ByRow() function to apply a function to each row as opposed to broadcasting, but the broadcasting notation also works

subset(penguins, :species => x -> occursin.(r"^Chin", x))

68×8 DataFrame

43 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64
1	Chinstrap	Dream	46.5	17.9	192	3500	female	2007
2	Chinstrap	Dream	50.0	19.5	196	3900	male	2007
3	Chinstrap	Dream	51.3	19.2	193	3650	male	2007
4	Chinstrap	Dream	45.4	18.7	188	3525	female	2007
5	Chinstrap	Dream	52.7	19.8	197	3725	male	2007
6	Chinstrap	Dream	45.2	17.8	198	3950	female	2007
7	Chinstrap	Dream	46.1	18.2	178	3250	female	2007
8	Chinstrap	Dream	51.3	18.2	197	3750	male	2007
9	Chinstrap	Dream	46.0	18.9	195	4150	female	2007
10	Chinstrap	Dream	51.3	19.9	198	3700	male	2007
11	Chinstrap	Dream	46.6	17.8	193	3800	female	2007
12	Chinstrap	Dream	51.7	20.3	194	3775	male	2007
13	Chinstrap	Dream	47.0	17.3	185	3700	female	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
57	Chinstrap	Dream	45.2	16.6	191	3250	female	2009
58	Chinstrap	Dream	49.3	19.9	203	4050	male	2009
59	Chinstrap	Dream	50.2	18.8	202	3800	male	2009
60	Chinstrap	Dream	45.6	19.4	194	3525	female	2009
61	Chinstrap	Dream	51.9	19.5	206	3950	male	2009
62	Chinstrap	Dream	46.8	16.5	189	3650	female	2009
63	Chinstrap	Dream	45.7	17.0	195	3650	female	2009
64	Chinstrap	Dream	55.8	19.8	207	4000	male	2009
65	Chinstrap	Dream	43.5	18.1	202	3400	female	2009
66	Chinstrap	Dream	49.6	18.2	193	3775	male	2009
67	Chinstrap	Dream	50.8	19.0	210	4100	male	2009
68	Chinstrap	Dream	50.2	18.7	198	3775	female	2009

Finally, note that we can also use the bracket syntax with row indices to filter, but this feels like it’s not the most useful for rows:

#select rows 1, 10, 15
penguins[[1, 10, 15], :]

3×8 DataFrame

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007
2	Adelie	Torgersen	42.0	20.2	190	4250	missing	2007
3	Adelie	Torgersen	34.6	21.1	198	4400	male	2007

Subsetting Columns

Using bracket syntax to subset columns makes more sense:

penguins[:, 1:3]

344×3 DataFrame

319 rows omitted

Row	species	island	bill_length_mm
	String15	String15	Float64?
1	Adelie	Torgersen	39.1
2	Adelie	Torgersen	39.5
3	Adelie	Torgersen	40.3
4	Adelie	Torgersen	missing
5	Adelie	Torgersen	36.7
6	Adelie	Torgersen	39.3
7	Adelie	Torgersen	38.9
8	Adelie	Torgersen	39.2
9	Adelie	Torgersen	34.1
10	Adelie	Torgersen	42.0
11	Adelie	Torgersen	37.8
12	Adelie	Torgersen	37.8
13	Adelie	Torgersen	41.1
⋮	⋮	⋮	⋮
333	Chinstrap	Dream	45.2
334	Chinstrap	Dream	49.3
335	Chinstrap	Dream	50.2
336	Chinstrap	Dream	45.6
337	Chinstrap	Dream	51.9
338	Chinstrap	Dream	46.8
339	Chinstrap	Dream	45.7
340	Chinstrap	Dream	55.8
341	Chinstrap	Dream	43.5
342	Chinstrap	Dream	49.6
343	Chinstrap	Dream	50.8
344	Chinstrap	Dream	50.2

We can also pass strings or symbols when using the bracket subsetting syntax:

penguins[:, [:species, :island]]

344×2 DataFrame

319 rows omitted

Row	species	island
	String15	String15
1	Adelie	Torgersen
2	Adelie	Torgersen
3	Adelie	Torgersen
4	Adelie	Torgersen
5	Adelie	Torgersen
6	Adelie	Torgersen
7	Adelie	Torgersen
8	Adelie	Torgersen
9	Adelie	Torgersen
10	Adelie	Torgersen
11	Adelie	Torgersen
12	Adelie	Torgersen
13	Adelie	Torgersen
⋮	⋮	⋮
333	Chinstrap	Dream
334	Chinstrap	Dream
335	Chinstrap	Dream
336	Chinstrap	Dream
337	Chinstrap	Dream
338	Chinstrap	Dream
339	Chinstrap	Dream
340	Chinstrap	Dream
341	Chinstrap	Dream
342	Chinstrap	Dream
343	Chinstrap	Dream
344	Chinstrap	Dream

penguins[:, ["species", "island"]]

344×2 DataFrame

319 rows omitted

Row	species	island
	String15	String15
1	Adelie	Torgersen
2	Adelie	Torgersen
3	Adelie	Torgersen
4	Adelie	Torgersen
5	Adelie	Torgersen
6	Adelie	Torgersen
7	Adelie	Torgersen
8	Adelie	Torgersen
9	Adelie	Torgersen
10	Adelie	Torgersen
11	Adelie	Torgersen
12	Adelie	Torgersen
13	Adelie	Torgersen
⋮	⋮	⋮
333	Chinstrap	Dream
334	Chinstrap	Dream
335	Chinstrap	Dream
336	Chinstrap	Dream
337	Chinstrap	Dream
338	Chinstrap	Dream
339	Chinstrap	Dream
340	Chinstrap	Dream
341	Chinstrap	Dream
342	Chinstrap	Dream
343	Chinstrap	Dream
344	Chinstrap	Dream

An alternative way to subset columns is to use the select() function from DataFrames.jl

select(penguins, [:species, :island])

344×2 DataFrame

319 rows omitted

Row	species	island
	String15	String15
1	Adelie	Torgersen
2	Adelie	Torgersen
3	Adelie	Torgersen
4	Adelie	Torgersen
5	Adelie	Torgersen
6	Adelie	Torgersen
7	Adelie	Torgersen
8	Adelie	Torgersen
9	Adelie	Torgersen
10	Adelie	Torgersen
11	Adelie	Torgersen
12	Adelie	Torgersen
13	Adelie	Torgersen
⋮	⋮	⋮
333	Chinstrap	Dream
334	Chinstrap	Dream
335	Chinstrap	Dream
336	Chinstrap	Dream
337	Chinstrap	Dream
338	Chinstrap	Dream
339	Chinstrap	Dream
340	Chinstrap	Dream
341	Chinstrap	Dream
342	Chinstrap	Dream
343	Chinstrap	Dream
344	Chinstrap	Dream

Just as with subsetting rows, we can also subset columns using regex. And the column subsetting is way easier than row subsetting:

select(penguins, r"bill")

344×2 DataFrame

319 rows omitted

Row	bill_length_mm	bill_depth_mm
	Float64?	Float64?
1	39.1	18.7
2	39.5	17.4
3	40.3	18.0
4	missing	missing
5	36.7	19.3
6	39.3	20.6
7	38.9	17.8
8	39.2	19.6
9	34.1	18.1
10	42.0	20.2
11	37.8	17.1
12	37.8	17.3
13	41.1	17.6
⋮	⋮	⋮
333	45.2	16.6
334	49.3	19.9
335	50.2	18.8
336	45.6	19.4
337	51.9	19.5
338	46.8	16.5
339	45.7	17.0
340	55.8	19.8
341	43.5	18.1
342	49.6	18.2
343	50.8	19.0
344	50.2	18.7

Creating New Columns

We can create new columns using the transform() function (or transform!() for in-place).

The general pattern here is that creating a column takes 3 components:

an input column,
a function to apply to that column (usually wrapped in ByRow()), and
the name of the output column

transform(penguins, :species => ByRow(x -> x == "Chinstrap") => :is_chinstrap)

344×9 DataFrame

319 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	is_chinstrap
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64	Bool
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007	false
2	Adelie	Torgersen	39.5	17.4	186	3800	female	2007	false
3	Adelie	Torgersen	40.3	18.0	195	3250	female	2007	false
4	Adelie	Torgersen	missing	missing	missing	missing	missing	2007	false
5	Adelie	Torgersen	36.7	19.3	193	3450	female	2007	false
6	Adelie	Torgersen	39.3	20.6	190	3650	male	2007	false
7	Adelie	Torgersen	38.9	17.8	181	3625	female	2007	false
8	Adelie	Torgersen	39.2	19.6	195	4675	male	2007	false
9	Adelie	Torgersen	34.1	18.1	193	3475	missing	2007	false
10	Adelie	Torgersen	42.0	20.2	190	4250	missing	2007	false
11	Adelie	Torgersen	37.8	17.1	186	3300	missing	2007	false
12	Adelie	Torgersen	37.8	17.3	180	3700	missing	2007	false
13	Adelie	Torgersen	41.1	17.6	182	3200	female	2007	false
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
333	Chinstrap	Dream	45.2	16.6	191	3250	female	2009	true
334	Chinstrap	Dream	49.3	19.9	203	4050	male	2009	true
335	Chinstrap	Dream	50.2	18.8	202	3800	male	2009	true
336	Chinstrap	Dream	45.6	19.4	194	3525	female	2009	true
337	Chinstrap	Dream	51.9	19.5	206	3950	male	2009	true
338	Chinstrap	Dream	46.8	16.5	189	3650	female	2009	true
339	Chinstrap	Dream	45.7	17.0	195	3650	female	2009	true
340	Chinstrap	Dream	55.8	19.8	207	4000	male	2009	true
341	Chinstrap	Dream	43.5	18.1	202	3400	female	2009	true
342	Chinstrap	Dream	49.6	18.2	193	3775	male	2009	true
343	Chinstrap	Dream	50.8	19.0	210	4100	male	2009	true
344	Chinstrap	Dream	50.2	18.7	198	3775	female	2009	true

We can also create multiple columns at the same time:

transform(
    penguins,
    :species => ByRow(x -> x == "Chinstrap") => :is_chinstrap,
    :body_mass_g => ByRow(x -> x > 4000) => :chonk
)

344×10 DataFrame

319 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	is_chinstrap	chonk
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64	Bool	Bool?
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007	false	false
2	Adelie	Torgersen	39.5	17.4	186	3800	female	2007	false	false
3	Adelie	Torgersen	40.3	18.0	195	3250	female	2007	false	false
4	Adelie	Torgersen	missing	missing	missing	missing	missing	2007	false	missing
5	Adelie	Torgersen	36.7	19.3	193	3450	female	2007	false	false
6	Adelie	Torgersen	39.3	20.6	190	3650	male	2007	false	false
7	Adelie	Torgersen	38.9	17.8	181	3625	female	2007	false	false
8	Adelie	Torgersen	39.2	19.6	195	4675	male	2007	false	true
9	Adelie	Torgersen	34.1	18.1	193	3475	missing	2007	false	false
10	Adelie	Torgersen	42.0	20.2	190	4250	missing	2007	false	true
11	Adelie	Torgersen	37.8	17.1	186	3300	missing	2007	false	false
12	Adelie	Torgersen	37.8	17.3	180	3700	missing	2007	false	false
13	Adelie	Torgersen	41.1	17.6	182	3200	female	2007	false	false
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
333	Chinstrap	Dream	45.2	16.6	191	3250	female	2009	true	false
334	Chinstrap	Dream	49.3	19.9	203	4050	male	2009	true	true
335	Chinstrap	Dream	50.2	18.8	202	3800	male	2009	true	false
336	Chinstrap	Dream	45.6	19.4	194	3525	female	2009	true	false
337	Chinstrap	Dream	51.9	19.5	206	3950	male	2009	true	false
338	Chinstrap	Dream	46.8	16.5	189	3650	female	2009	true	false
339	Chinstrap	Dream	45.7	17.0	195	3650	female	2009	true	false
340	Chinstrap	Dream	55.8	19.8	207	4000	male	2009	true	false
341	Chinstrap	Dream	43.5	18.1	202	3400	female	2009	true	false
342	Chinstrap	Dream	49.6	18.2	193	3775	male	2009	true	false
343	Chinstrap	Dream	50.8	19.0	210	4100	male	2009	true	true
344	Chinstrap	Dream	50.2	18.7	198	3775	female	2009	true	false

We can pass multiple columns into an operation as well:

transform(
    penguins,
    [:bill_length_mm, :bill_depth_mm] => (+) => :bill_stuff
)

344×9 DataFrame

319 rows omitted

Row	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	bill_stuff
	String15	String15	Float64?	Float64?	Int64?	Int64?	String7?	Int64	Float64?
1	Adelie	Torgersen	39.1	18.7	181	3750	male	2007	57.8
2	Adelie	Torgersen	39.5	17.4	186	3800	female	2007	56.9
3	Adelie	Torgersen	40.3	18.0	195	3250	female	2007	58.3
4	Adelie	Torgersen	missing	missing	missing	missing	missing	2007	missing
5	Adelie	Torgersen	36.7	19.3	193	3450	female	2007	56.0
6	Adelie	Torgersen	39.3	20.6	190	3650	male	2007	59.9
7	Adelie	Torgersen	38.9	17.8	181	3625	female	2007	56.7
8	Adelie	Torgersen	39.2	19.6	195	4675	male	2007	58.8
9	Adelie	Torgersen	34.1	18.1	193	3475	missing	2007	52.2
10	Adelie	Torgersen	42.0	20.2	190	4250	missing	2007	62.2
11	Adelie	Torgersen	37.8	17.1	186	3300	missing	2007	54.9
12	Adelie	Torgersen	37.8	17.3	180	3700	missing	2007	55.1
13	Adelie	Torgersen	41.1	17.6	182	3200	female	2007	58.7
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
333	Chinstrap	Dream	45.2	16.6	191	3250	female	2009	61.8
334	Chinstrap	Dream	49.3	19.9	203	4050	male	2009	69.2
335	Chinstrap	Dream	50.2	18.8	202	3800	male	2009	69.0
336	Chinstrap	Dream	45.6	19.4	194	3525	female	2009	65.0
337	Chinstrap	Dream	51.9	19.5	206	3950	male	2009	71.4
338	Chinstrap	Dream	46.8	16.5	189	3650	female	2009	63.3
339	Chinstrap	Dream	45.7	17.0	195	3650	female	2009	62.7
340	Chinstrap	Dream	55.8	19.8	207	4000	male	2009	75.6
341	Chinstrap	Dream	43.5	18.1	202	3400	female	2009	61.6
342	Chinstrap	Dream	49.6	18.2	193	3775	male	2009	67.8
343	Chinstrap	Dream	50.8	19.0	210	4100	male	2009	69.8
344	Chinstrap	Dream	50.2	18.7	198	3775	female	2009	68.9

This syntax isn’t my favorite. Another way to do the same, though, if we wanted:

penguins.:bill_stuff = penguins.:bill_length_mm .+ penguins.:bill_depth_mm

344-element Vector{Union{Missing, Float64}}:
 57.8
 56.9
 58.3
   missing
 56.0
 59.9
 56.7
 58.800000000000004
 52.2
 62.2
 54.9
 55.099999999999994
 58.7
  ⋮
 61.800000000000004
 69.19999999999999
 69.0
 65.0
 71.4
 63.3
 62.7
 75.6
 61.6
 67.8
 69.8
 68.9

Summarizing DataFrames

Often, summarizing will involve combining the groupby() method with the combine() method:

# we need to drop missings here on the body mass column first

grouped_penguins = groupby(dropmissing(penguins, :body_mass_g), :species)

combine(grouped_penguins, :body_mass_g => mean)

3×2 DataFrame

Row	species	body_mass_g_mean
	String15	Float64
1	Adelie	3700.66
2	Gentoo	5076.02
3	Chinstrap	3733.09

Counting is a common operation:

combine(grouped_penguins, nrow)

3×2 DataFrame

Row	species	nrow
	String15	Int64
1	Adelie	151
2	Gentoo	123
3	Chinstrap	68

The problem here is that we’ve already dropped missing values on body mass, so we’re not counting all of the observations here. I’m not sure if there’s a great way around this inside of combine(), but something to explore more…

Anyway, say we want to take the mean of all of our numeric columns. We can do that as follows. Note that I don’t really like this syntax, especially compared to some of the way you’d do this in tidyverse or even in Python’s polars. It also has the issue that I’m having to keep complete observations rather than column-wise complete data.

numeric_cols = names(penguins, Union{Missing,Number})

nomiss_penguins = dropmissing(penguins)

grouped_penguins = groupby(nomiss_penguins, :species)

species_means = combine(grouped_penguins, numeric_cols .=> mean)

3×7 DataFrame

Row	species	bill_length_mm_mean	bill_depth_mm_mean	flipper_length_mm_mean	body_mass_g_mean	year_mean	bill_stuff_mean
	String15	Float64	Float64	Float64	Float64	Float64	Float64
1	Adelie	38.824	18.3473	190.103	3706.16	2008.05	57.1712
2	Gentoo	47.5681	14.9966	217.235	5092.44	2008.07	62.5647
3	Chinstrap	48.8338	18.4206	195.824	3733.09	2007.97	67.2544

We can also perform multiple aggregations on a grouped dataframe, just like we did when we applied multiple transformations via transform():

combine(
    grouped_penguins,
    :bill_length_mm => mean,
    nrow => :n
)

3×3 DataFrame

Row	species	bill_length_mm_mean	n
	String15	Float64	Int64
1	Adelie	38.824	146
2	Gentoo	47.5681	119
3	Chinstrap	48.8338	68

Reshaping data

Let’s go back to our species_means data from before. Say we want to pivot this longer, so that we have a 3-column dataframe where our columns are species, metric, and value. We can do this with the stack() function:

species_means_long = stack(species_means, Not(:species), variable_name=:metric, value_name=:value)

18×3 DataFrame

Row	species	metric	value
	String15	String	Float64
1	Adelie	bill_length_mm_mean	38.824
2	Gentoo	bill_length_mm_mean	47.5681
3	Chinstrap	bill_length_mm_mean	48.8338
4	Adelie	bill_depth_mm_mean	18.3473
5	Gentoo	bill_depth_mm_mean	14.9966
6	Chinstrap	bill_depth_mm_mean	18.4206
7	Adelie	flipper_length_mm_mean	190.103
8	Gentoo	flipper_length_mm_mean	217.235
9	Chinstrap	flipper_length_mm_mean	195.824
10	Adelie	body_mass_g_mean	3706.16
11	Gentoo	body_mass_g_mean	5092.44
12	Chinstrap	body_mass_g_mean	3733.09
13	Adelie	year_mean	2008.05
14	Gentoo	year_mean	2008.07
15	Chinstrap	year_mean	2007.97
16	Adelie	bill_stuff_mean	57.1712
17	Gentoo	bill_stuff_mean	62.5647
18	Chinstrap	bill_stuff_mean	67.2544

To reshape back to wide, we can use the (appropriately named!) unstack() function. This function has the following signature:

unstack(df, id_col, variable_names, variable_values)

So in the example below, species is the id variable, metric is the column with variable names, and value is the column with values.

unstack(species_means_long, :species, :metric, :value)

3×7 DataFrame

Row	species	bill_length_mm_mean	bill_depth_mm_mean	flipper_length_mm_mean	body_mass_g_mean	year_mean	bill_stuff_mean
	String15	Float64?	Float64?	Float64?	Float64?	Float64?	Float64?
1	Adelie	38.824	18.3473	190.103	3706.16	2008.05	57.1712
2	Gentoo	47.5681	14.9966	217.235	5092.44	2008.07	62.5647
3	Chinstrap	48.8338	18.4206	195.824	3733.09	2007.97	67.2544

Chaining Operations

You can chain dataframe operations together in Julia using the Chain.jl package. This is basically the same as using the pipe in tidyverse R workflows to set up data processing pipelines.

For instance, we can do the following to summarize our data:

@chain penguins begin
    dropmissing
    groupby(:species)
    combine(:body_mass_g => mean)
end

3×2 DataFrame

Row	species	body_mass_g_mean
	String15	Float64
1	Adelie	3706.16
2	Gentoo	5092.44
3	Chinstrap	3733.09

The “default” specification of a chain (like above) is to assume the output of the previous function call is the first argument to the subsequent function call. So the output of dropmissing – a dataframe – is the first argument to groupby(), etc.

Another way to specify a chain is to use an underscore _ as a placeholder for the output of the previous expression. This isn’t super useful in cases where a dataframe will be the first argument, like below:

@chain penguins begin
    dropmissing(_)
    groupby(_, :species)
    combine(_, :body_mass_g => mean)
end

3×2 DataFrame

Row	species	body_mass_g_mean
	String15	Float64
1	Adelie	3706.16
2	Gentoo	5092.44
3	Chinstrap	3733.09

A not-very-useful but nevertheless illustrative version of using an underscore to pipe a value into the not-first argument of a function is below:

@chain :body_mass_g begin
    combine(dropmissing(penguins), _ => mean)
end

1×1 DataFrame

Row	body_mass_g_mean
	Float64
1	4207.06