using CSV
using DataFrames
using Statistics
using Chain
#note that the missingstring arg will read in any "NA" strings as missings
penguins = CSV.read(download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"), DataFrame, missingstring="NA"
);DataFrames.jl Cheat Sheet
A WIP cheat sheet for working with DataFrames.jl
A cheat sheet for using DataFrames.jl. This is a work in progress and will (hopefully) be continuously updated/modified.
Load Pkgs and Read in data
Check Properties of data
size(penguins)(344, 8)
see the first 3 rows
first(penguins, 3)| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
get the column names of a dataframe
names(penguins)8-element Vector{String}:
"species"
"island"
"bill_length_mm"
"bill_depth_mm"
"flipper_length_mm"
"body_mass_g"
"sex"
"year"
nb we could also convert these to Symbol via:
Symbol.(names(penguins))8-element Vector{Symbol}:
:species
:island
:bill_length_mm
:bill_depth_mm
:flipper_length_mm
:body_mass_g
:sex
:year
If we want to get the column types, we can do the following:
eltype.(eachcol(penguins))8-element Vector{Type}:
String15
String15
Union{Missing, Float64}
Union{Missing, Float64}
Union{Missing, Int64}
Union{Missing, Int64}
Union{Missing, String7}
Int64
Subsetting Rows
In general, we’ll want to use the subset() or subset!() functions to subset rows in a dataframe (subset!() is the in-place version)
The syntax is a little wonky, and it tends to involve passing a symbol into an anonymous function that is then broadcast. For instance, the following will keep only the Chinstrap penguins:
subset(penguins, :species => s -> s .== "Chinstrap")| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
| 1 | Chinstrap | Dream | 46.5 | 17.9 | 192 | 3500 | female | 2007 |
| 2 | Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
| 3 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
| 4 | Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
| 5 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
| 6 | Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
| 7 | Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
| 8 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
| 9 | Chinstrap | Dream | 46.0 | 18.9 | 195 | 4150 | female | 2007 |
| 10 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
| 11 | Chinstrap | Dream | 46.6 | 17.8 | 193 | 3800 | female | 2007 |
| 12 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
| 13 | Chinstrap | Dream | 47.0 | 17.3 | 185 | 3700 | female | 2007 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 57 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
| 58 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
| 59 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| 60 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
| 61 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| 62 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
| 63 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
| 64 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 65 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| 66 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
| 67 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| 68 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
And it’s straightforward to pass multiple conditions to the function:
subset(penguins, :species => s -> s .== "Chinstrap", :bill_length_mm => b -> b .> 50, skipmissing=true)| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
| 1 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
| 2 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
| 3 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
| 4 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
| 5 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
| 6 | Chinstrap | Dream | 52.0 | 18.1 | 201 | 4050 | male | 2007 |
| 7 | Chinstrap | Dream | 50.5 | 19.6 | 201 | 4050 | male | 2007 |
| 8 | Chinstrap | Dream | 50.3 | 20.0 | 197 | 3300 | male | 2007 |
| 9 | Chinstrap | Dream | 58.0 | 17.8 | 181 | 3700 | female | 2007 |
| 10 | Chinstrap | Dream | 50.6 | 19.4 | 193 | 3800 | male | 2007 |
| 11 | Chinstrap | Dream | 52.0 | 19.0 | 197 | 4150 | male | 2007 |
| 12 | Chinstrap | Dream | 50.5 | 18.4 | 200 | 3400 | female | 2008 |
| 13 | Chinstrap | Dream | 52.8 | 20.0 | 205 | 4550 | male | 2008 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 19 | Chinstrap | Dream | 50.9 | 17.9 | 196 | 3675 | female | 2009 |
| 20 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male | 2009 |
| 21 | Chinstrap | Dream | 50.1 | 17.9 | 190 | 3400 | female | 2009 |
| 22 | Chinstrap | Dream | 51.5 | 18.7 | 187 | 3250 | male | 2009 |
| 23 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male | 2009 |
| 24 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male | 2009 |
| 25 | Chinstrap | Dream | 52.2 | 18.8 | 197 | 3450 | male | 2009 |
| 26 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| 27 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| 28 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 29 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| 30 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Note the use of the skipmissing argument. If there are missing values in a column that’s being filtered on, this will throw an error
We can also drop rows with any missing data via dropmissing(df)
dropmissing(penguins)| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64 | Float64 | Int64 | Int64 | String7 | Int64 | |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
| 8 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
| 9 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male | 2007 |
| 10 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male | 2007 |
| 11 | Adelie | Torgersen | 36.6 | 17.8 | 185 | 3700 | female | 2007 |
| 12 | Adelie | Torgersen | 38.7 | 19.0 | 195 | 3450 | female | 2007 |
| 13 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male | 2007 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 322 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
| 323 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
| 324 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| 325 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
| 326 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| 327 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
| 328 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
| 329 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 330 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| 331 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
| 332 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| 333 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Or we can choose to drop rows with missing values on a specific column:
dropmissing(penguins, :bill_length_mm)| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64 | Float64? | Int64? | Int64? | String7? | Int64 | |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
| 8 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 |
| 9 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 |
| 10 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 |
| 11 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 |
| 12 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
| 13 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male | 2007 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 331 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
| 332 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
| 333 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| 334 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
| 335 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| 336 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
| 337 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
| 338 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 339 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| 340 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
| 341 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| 342 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
We can also subset rows using regex:
subset(penguins, :species => ByRow(x -> occursin(r"^Chin", x)))| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
| 1 | Chinstrap | Dream | 46.5 | 17.9 | 192 | 3500 | female | 2007 |
| 2 | Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
| 3 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
| 4 | Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
| 5 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
| 6 | Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
| 7 | Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
| 8 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
| 9 | Chinstrap | Dream | 46.0 | 18.9 | 195 | 4150 | female | 2007 |
| 10 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
| 11 | Chinstrap | Dream | 46.6 | 17.8 | 193 | 3800 | female | 2007 |
| 12 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
| 13 | Chinstrap | Dream | 47.0 | 17.3 | 185 | 3700 | female | 2007 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 57 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
| 58 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
| 59 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| 60 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
| 61 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| 62 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
| 63 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
| 64 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 65 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| 66 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
| 67 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| 68 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
This uses the ByRow() function to apply a function to each row as opposed to broadcasting, but the broadcasting notation also works
subset(penguins, :species => x -> occursin.(r"^Chin", x))| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
| 1 | Chinstrap | Dream | 46.5 | 17.9 | 192 | 3500 | female | 2007 |
| 2 | Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
| 3 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
| 4 | Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
| 5 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
| 6 | Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
| 7 | Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
| 8 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
| 9 | Chinstrap | Dream | 46.0 | 18.9 | 195 | 4150 | female | 2007 |
| 10 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
| 11 | Chinstrap | Dream | 46.6 | 17.8 | 193 | 3800 | female | 2007 |
| 12 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
| 13 | Chinstrap | Dream | 47.0 | 17.3 | 185 | 3700 | female | 2007 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 57 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
| 58 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
| 59 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| 60 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
| 61 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| 62 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
| 63 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
| 64 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 65 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| 66 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
| 67 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| 68 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Finally, note that we can also use the bracket syntax with row indices to filter, but this feels like it’s not the most useful for rows:
#select rows 1, 10, 15
penguins[[1, 10, 15], :]| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| 2 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 |
| 3 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male | 2007 |
Subsetting Columns
Using bracket syntax to subset columns makes more sense:
penguins[:, 1:3]| Row | species | island | bill_length_mm |
|---|---|---|---|
| String15 | String15 | Float64? | |
| 1 | Adelie | Torgersen | 39.1 |
| 2 | Adelie | Torgersen | 39.5 |
| 3 | Adelie | Torgersen | 40.3 |
| 4 | Adelie | Torgersen | missing |
| 5 | Adelie | Torgersen | 36.7 |
| 6 | Adelie | Torgersen | 39.3 |
| 7 | Adelie | Torgersen | 38.9 |
| 8 | Adelie | Torgersen | 39.2 |
| 9 | Adelie | Torgersen | 34.1 |
| 10 | Adelie | Torgersen | 42.0 |
| 11 | Adelie | Torgersen | 37.8 |
| 12 | Adelie | Torgersen | 37.8 |
| 13 | Adelie | Torgersen | 41.1 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream | 45.2 |
| 334 | Chinstrap | Dream | 49.3 |
| 335 | Chinstrap | Dream | 50.2 |
| 336 | Chinstrap | Dream | 45.6 |
| 337 | Chinstrap | Dream | 51.9 |
| 338 | Chinstrap | Dream | 46.8 |
| 339 | Chinstrap | Dream | 45.7 |
| 340 | Chinstrap | Dream | 55.8 |
| 341 | Chinstrap | Dream | 43.5 |
| 342 | Chinstrap | Dream | 49.6 |
| 343 | Chinstrap | Dream | 50.8 |
| 344 | Chinstrap | Dream | 50.2 |
We can also pass strings or symbols when using the bracket subsetting syntax:
penguins[:, [:species, :island]]| Row | species | island |
|---|---|---|
| String15 | String15 | |
| 1 | Adelie | Torgersen |
| 2 | Adelie | Torgersen |
| 3 | Adelie | Torgersen |
| 4 | Adelie | Torgersen |
| 5 | Adelie | Torgersen |
| 6 | Adelie | Torgersen |
| 7 | Adelie | Torgersen |
| 8 | Adelie | Torgersen |
| 9 | Adelie | Torgersen |
| 10 | Adelie | Torgersen |
| 11 | Adelie | Torgersen |
| 12 | Adelie | Torgersen |
| 13 | Adelie | Torgersen |
| ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream |
| 334 | Chinstrap | Dream |
| 335 | Chinstrap | Dream |
| 336 | Chinstrap | Dream |
| 337 | Chinstrap | Dream |
| 338 | Chinstrap | Dream |
| 339 | Chinstrap | Dream |
| 340 | Chinstrap | Dream |
| 341 | Chinstrap | Dream |
| 342 | Chinstrap | Dream |
| 343 | Chinstrap | Dream |
| 344 | Chinstrap | Dream |
penguins[:, ["species", "island"]]| Row | species | island |
|---|---|---|
| String15 | String15 | |
| 1 | Adelie | Torgersen |
| 2 | Adelie | Torgersen |
| 3 | Adelie | Torgersen |
| 4 | Adelie | Torgersen |
| 5 | Adelie | Torgersen |
| 6 | Adelie | Torgersen |
| 7 | Adelie | Torgersen |
| 8 | Adelie | Torgersen |
| 9 | Adelie | Torgersen |
| 10 | Adelie | Torgersen |
| 11 | Adelie | Torgersen |
| 12 | Adelie | Torgersen |
| 13 | Adelie | Torgersen |
| ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream |
| 334 | Chinstrap | Dream |
| 335 | Chinstrap | Dream |
| 336 | Chinstrap | Dream |
| 337 | Chinstrap | Dream |
| 338 | Chinstrap | Dream |
| 339 | Chinstrap | Dream |
| 340 | Chinstrap | Dream |
| 341 | Chinstrap | Dream |
| 342 | Chinstrap | Dream |
| 343 | Chinstrap | Dream |
| 344 | Chinstrap | Dream |
An alternative way to subset columns is to use the select() function from DataFrames.jl
select(penguins, [:species, :island])| Row | species | island |
|---|---|---|
| String15 | String15 | |
| 1 | Adelie | Torgersen |
| 2 | Adelie | Torgersen |
| 3 | Adelie | Torgersen |
| 4 | Adelie | Torgersen |
| 5 | Adelie | Torgersen |
| 6 | Adelie | Torgersen |
| 7 | Adelie | Torgersen |
| 8 | Adelie | Torgersen |
| 9 | Adelie | Torgersen |
| 10 | Adelie | Torgersen |
| 11 | Adelie | Torgersen |
| 12 | Adelie | Torgersen |
| 13 | Adelie | Torgersen |
| ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream |
| 334 | Chinstrap | Dream |
| 335 | Chinstrap | Dream |
| 336 | Chinstrap | Dream |
| 337 | Chinstrap | Dream |
| 338 | Chinstrap | Dream |
| 339 | Chinstrap | Dream |
| 340 | Chinstrap | Dream |
| 341 | Chinstrap | Dream |
| 342 | Chinstrap | Dream |
| 343 | Chinstrap | Dream |
| 344 | Chinstrap | Dream |
Just as with subsetting rows, we can also subset columns using regex. And the column subsetting is way easier than row subsetting:
select(penguins, r"bill")| Row | bill_length_mm | bill_depth_mm |
|---|---|---|
| Float64? | Float64? | |
| 1 | 39.1 | 18.7 |
| 2 | 39.5 | 17.4 |
| 3 | 40.3 | 18.0 |
| 4 | missing | missing |
| 5 | 36.7 | 19.3 |
| 6 | 39.3 | 20.6 |
| 7 | 38.9 | 17.8 |
| 8 | 39.2 | 19.6 |
| 9 | 34.1 | 18.1 |
| 10 | 42.0 | 20.2 |
| 11 | 37.8 | 17.1 |
| 12 | 37.8 | 17.3 |
| 13 | 41.1 | 17.6 |
| ⋮ | ⋮ | ⋮ |
| 333 | 45.2 | 16.6 |
| 334 | 49.3 | 19.9 |
| 335 | 50.2 | 18.8 |
| 336 | 45.6 | 19.4 |
| 337 | 51.9 | 19.5 |
| 338 | 46.8 | 16.5 |
| 339 | 45.7 | 17.0 |
| 340 | 55.8 | 19.8 |
| 341 | 43.5 | 18.1 |
| 342 | 49.6 | 18.2 |
| 343 | 50.8 | 19.0 |
| 344 | 50.2 | 18.7 |
Creating New Columns
We can create new columns using the transform() function (or transform!() for in-place).
The general pattern here is that creating a column takes 3 components:
- an input column,
- a function to apply to that column (usually wrapped in
ByRow()), and - the name of the output column
transform(penguins, :species => ByRow(x -> x == "Chinstrap") => :is_chinstrap)| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap |
|---|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | Bool | |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 | false |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 | false |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 | false |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing | 2007 | false |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 | false |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 | false |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 | false |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 | false |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 | false |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 | false |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 | false |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 | false |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 | false |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 | true |
| 334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 | true |
| 335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 | true |
| 336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 | true |
| 337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 | true |
| 338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 | true |
| 339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 | true |
| 340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 | true |
| 341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 | true |
| 342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 | true |
| 343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 | true |
| 344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 | true |
We can also create multiple columns at the same time:
transform(
penguins,
:species => ByRow(x -> x == "Chinstrap") => :is_chinstrap,
:body_mass_g => ByRow(x -> x > 4000) => :chonk
)| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap | chonk |
|---|---|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | Bool | Bool? | |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 | false | false |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 | false | false |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 | false | false |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing | 2007 | false | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 | false | false |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 | false | false |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 | false | false |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 | false | true |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 | false | false |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 | false | true |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 | false | false |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 | false | false |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 | false | false |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 | true | false |
| 334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 | true | true |
| 335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 | true | false |
| 336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 | true | false |
| 337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 | true | false |
| 338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 | true | false |
| 339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 | true | false |
| 340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 | true | false |
| 341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 | true | false |
| 342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 | true | false |
| 343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 | true | true |
| 344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 | true | false |
We can pass multiple columns into an operation as well:
transform(
penguins,
[:bill_length_mm, :bill_depth_mm] => (+) => :bill_stuff
)| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | bill_stuff |
|---|---|---|---|---|---|---|---|---|---|
| String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | Float64? | |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 | 57.8 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 | 56.9 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 | 58.3 |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing | 2007 | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 | 56.0 |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 | 59.9 |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 | 56.7 |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 | 58.8 |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 | 52.2 |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 | 62.2 |
| 11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 | 54.9 |
| 12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 | 55.1 |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 | 58.7 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 | 61.8 |
| 334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 | 69.2 |
| 335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 | 69.0 |
| 336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 | 65.0 |
| 337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 | 71.4 |
| 338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 | 63.3 |
| 339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 | 62.7 |
| 340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 | 75.6 |
| 341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 | 61.6 |
| 342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 | 67.8 |
| 343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 | 69.8 |
| 344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 | 68.9 |
This syntax isn’t my favorite. Another way to do the same, though, if we wanted:
penguins.:bill_stuff = penguins.:bill_length_mm .+ penguins.:bill_depth_mm344-element Vector{Union{Missing, Float64}}:
57.8
56.9
58.3
missing
56.0
59.9
56.7
58.800000000000004
52.2
62.2
54.9
55.099999999999994
58.7
⋮
61.800000000000004
69.19999999999999
69.0
65.0
71.4
63.3
62.7
75.6
61.6
67.8
69.8
68.9
Summarizing DataFrames
Often, summarizing will involve combining the groupby() method with the combine() method:
# we need to drop missings here on the body mass column first
grouped_penguins = groupby(dropmissing(penguins, :body_mass_g), :species)
combine(grouped_penguins, :body_mass_g => mean)| Row | species | body_mass_g_mean |
|---|---|---|
| String15 | Float64 | |
| 1 | Adelie | 3700.66 |
| 2 | Gentoo | 5076.02 |
| 3 | Chinstrap | 3733.09 |
Counting is a common operation:
combine(grouped_penguins, nrow)| Row | species | nrow |
|---|---|---|
| String15 | Int64 | |
| 1 | Adelie | 151 |
| 2 | Gentoo | 123 |
| 3 | Chinstrap | 68 |
The problem here is that we’ve already dropped missing values on body mass, so we’re not counting all of the observations here. I’m not sure if there’s a great way around this inside of combine(), but something to explore more…
Anyway, say we want to take the mean of all of our numeric columns. We can do that as follows. Note that I don’t really like this syntax, especially compared to some of the way you’d do this in tidyverse or even in Python’s polars. It also has the issue that I’m having to keep complete observations rather than column-wise complete data.
numeric_cols = names(penguins, Union{Missing,Number})
nomiss_penguins = dropmissing(penguins)
grouped_penguins = groupby(nomiss_penguins, :species)
species_means = combine(grouped_penguins, numeric_cols .=> mean)| Row | species | bill_length_mm_mean | bill_depth_mm_mean | flipper_length_mm_mean | body_mass_g_mean | year_mean | bill_stuff_mean |
|---|---|---|---|---|---|---|---|
| String15 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | Adelie | 38.824 | 18.3473 | 190.103 | 3706.16 | 2008.05 | 57.1712 |
| 2 | Gentoo | 47.5681 | 14.9966 | 217.235 | 5092.44 | 2008.07 | 62.5647 |
| 3 | Chinstrap | 48.8338 | 18.4206 | 195.824 | 3733.09 | 2007.97 | 67.2544 |
We can also perform multiple aggregations on a grouped dataframe, just like we did when we applied multiple transformations via transform():
combine(
grouped_penguins,
:bill_length_mm => mean,
nrow => :n
)| Row | species | bill_length_mm_mean | n |
|---|---|---|---|
| String15 | Float64 | Int64 | |
| 1 | Adelie | 38.824 | 146 |
| 2 | Gentoo | 47.5681 | 119 |
| 3 | Chinstrap | 48.8338 | 68 |
Reshaping data
Let’s go back to our species_means data from before. Say we want to pivot this longer, so that we have a 3-column dataframe where our columns are species, metric, and value. We can do this with the stack() function:
species_means_long = stack(species_means, Not(:species), variable_name=:metric, value_name=:value)| Row | species | metric | value |
|---|---|---|---|
| String15 | String | Float64 | |
| 1 | Adelie | bill_length_mm_mean | 38.824 |
| 2 | Gentoo | bill_length_mm_mean | 47.5681 |
| 3 | Chinstrap | bill_length_mm_mean | 48.8338 |
| 4 | Adelie | bill_depth_mm_mean | 18.3473 |
| 5 | Gentoo | bill_depth_mm_mean | 14.9966 |
| 6 | Chinstrap | bill_depth_mm_mean | 18.4206 |
| 7 | Adelie | flipper_length_mm_mean | 190.103 |
| 8 | Gentoo | flipper_length_mm_mean | 217.235 |
| 9 | Chinstrap | flipper_length_mm_mean | 195.824 |
| 10 | Adelie | body_mass_g_mean | 3706.16 |
| 11 | Gentoo | body_mass_g_mean | 5092.44 |
| 12 | Chinstrap | body_mass_g_mean | 3733.09 |
| 13 | Adelie | year_mean | 2008.05 |
| 14 | Gentoo | year_mean | 2008.07 |
| 15 | Chinstrap | year_mean | 2007.97 |
| 16 | Adelie | bill_stuff_mean | 57.1712 |
| 17 | Gentoo | bill_stuff_mean | 62.5647 |
| 18 | Chinstrap | bill_stuff_mean | 67.2544 |
To reshape back to wide, we can use the (appropriately named!) unstack() function. This function has the following signature:
unstack(df, id_col, variable_names, variable_values)
So in the example below, species is the id variable, metric is the column with variable names, and value is the column with values.
unstack(species_means_long, :species, :metric, :value)| Row | species | bill_length_mm_mean | bill_depth_mm_mean | flipper_length_mm_mean | body_mass_g_mean | year_mean | bill_stuff_mean |
|---|---|---|---|---|---|---|---|
| String15 | Float64? | Float64? | Float64? | Float64? | Float64? | Float64? | |
| 1 | Adelie | 38.824 | 18.3473 | 190.103 | 3706.16 | 2008.05 | 57.1712 |
| 2 | Gentoo | 47.5681 | 14.9966 | 217.235 | 5092.44 | 2008.07 | 62.5647 |
| 3 | Chinstrap | 48.8338 | 18.4206 | 195.824 | 3733.09 | 2007.97 | 67.2544 |
Chaining Operations
You can chain dataframe operations together in Julia using the Chain.jl package. This is basically the same as using the pipe in tidyverse R workflows to set up data processing pipelines.
For instance, we can do the following to summarize our data:
@chain penguins begin
dropmissing
groupby(:species)
combine(:body_mass_g => mean)
end| Row | species | body_mass_g_mean |
|---|---|---|
| String15 | Float64 | |
| 1 | Adelie | 3706.16 |
| 2 | Gentoo | 5092.44 |
| 3 | Chinstrap | 3733.09 |
The “default” specification of a chain (like above) is to assume the output of the previous function call is the first argument to the subsequent function call. So the output of dropmissing – a dataframe – is the first argument to groupby(), etc.
Another way to specify a chain is to use an underscore _ as a placeholder for the output of the previous expression. This isn’t super useful in cases where a dataframe will be the first argument, like below:
@chain penguins begin
dropmissing(_)
groupby(_, :species)
combine(_, :body_mass_g => mean)
end| Row | species | body_mass_g_mean |
|---|---|---|
| String15 | Float64 | |
| 1 | Adelie | 3706.16 |
| 2 | Gentoo | 5092.44 |
| 3 | Chinstrap | 3733.09 |
A not-very-useful but nevertheless illustrative version of using an underscore to pipe a value into the not-first argument of a function is below:
@chain :body_mass_g begin
combine(dropmissing(penguins), _ => mean)
end| Row | body_mass_g_mean |
|---|---|
| Float64 | |
| 1 | 4207.06 |