import polars.selectors as cs
import polars as pl
# read in data
= pl.read_csv(
penguins "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv",
="NA",
null_values )
Polars Cheat Sheet
A work-in-progress cheat sheet for working with polars dataframes
Setup & Read in Data
Check Properties of a Dataframe
use the shape
attribute to check the dimensions (rows, cols)
penguins.shape
(344, 8)
use the head()
method to see the first few rows
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
str | str | f64 | f64 | i64 | i64 | str | i64 |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 |
"Adelie" | "Torgersen" | null | null | null | null | null | 2007 |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 |
to see the column names, we can use the columns
attribute
penguins.columns
['species',
'island',
'bill_length_mm',
'bill_depth_mm',
'flipper_length_mm',
'body_mass_g',
'sex',
'year']
Subset Rows
For the most part, we want to subset rows using the filter()
method.
For example, we might want to filter out rows that aren’t missing data for the body mass variable
filter(pl.col("body_mass_g").is_not_null()) penguins.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
str | str | f64 | f64 | i64 | i64 | str | i64 |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 |
"Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 |
"Adelie" | "Torgersen" | 38.9 | 17.8 | 181 | 3625 | "female" | 2007 |
"Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 |
"Adelie" | "Torgersen" | 34.1 | 18.1 | 193 | 3475 | null | 2007 |
"Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 |
"Adelie" | "Torgersen" | 37.8 | 17.1 | 186 | 3300 | null | 2007 |
"Adelie" | "Torgersen" | 37.8 | 17.3 | 180 | 3700 | null | 2007 |
"Adelie" | "Torgersen" | 41.1 | 17.6 | 182 | 3200 | "female" | 2007 |
… | … | … | … | … | … | … | … |
"Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 |
"Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 |
"Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 |
"Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 |
"Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 |
"Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 |
"Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 |
"Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 |
"Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 |
Basically, filter keeps rows where the expression evaluates to True
. And so we can use any predicate expression that results in a boolean. We can also use multiple expressions.
filter((pl.col("species") == "Adelie") & (pl.col("bill_length_mm") >= 39.0)) penguins.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
str | str | f64 | f64 | i64 | i64 | str | i64 |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 |
"Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 |
"Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 |
"Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 |
"Adelie" | "Torgersen" | 41.1 | 17.6 | 182 | 3200 | "female" | 2007 |
"Adelie" | "Torgersen" | 42.5 | 20.7 | 197 | 4500 | "male" | 2007 |
"Adelie" | "Torgersen" | 46.0 | 21.5 | 194 | 4200 | "male" | 2007 |
"Adelie" | "Biscoe" | 40.6 | 18.6 | 183 | 3550 | "male" | 2007 |
"Adelie" | "Biscoe" | 40.5 | 17.9 | 187 | 3200 | "female" | 2007 |
"Adelie" | "Biscoe" | 40.5 | 18.9 | 180 | 3950 | "male" | 2007 |
… | … | … | … | … | … | … | … |
"Adelie" | "Torgersen" | 39.0 | 17.1 | 191 | 3050 | "female" | 2009 |
"Adelie" | "Torgersen" | 44.1 | 18.0 | 210 | 4000 | "male" | 2009 |
"Adelie" | "Torgersen" | 43.1 | 19.2 | 197 | 3500 | "male" | 2009 |
"Adelie" | "Dream" | 41.1 | 17.5 | 190 | 3900 | "male" | 2009 |
"Adelie" | "Dream" | 40.2 | 20.1 | 200 | 3975 | "male" | 2009 |
"Adelie" | "Dream" | 39.7 | 17.9 | 193 | 4250 | "male" | 2009 |
"Adelie" | "Dream" | 40.2 | 17.1 | 193 | 3400 | "female" | 2009 |
"Adelie" | "Dream" | 40.6 | 17.2 | 187 | 3475 | "male" | 2009 |
"Adelie" | "Dream" | 40.7 | 17.0 | 190 | 3725 | "male" | 2009 |
"Adelie" | "Dream" | 39.0 | 18.7 | 185 | 3650 | "male" | 2009 |
"Adelie" | "Dream" | 39.2 | 18.6 | 190 | 4250 | "male" | 2009 |
"Adelie" | "Dream" | 41.5 | 18.5 | 201 | 4000 | "male" | 2009 |
We might also want to filter based on string matches. For example, strings that start with a certain substring
filter(pl.col("species").str.starts_with("Chin")) penguins.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
str | str | f64 | f64 | i64 | i64 | str | i64 |
"Chinstrap" | "Dream" | 46.5 | 17.9 | 192 | 3500 | "female" | 2007 |
"Chinstrap" | "Dream" | 50.0 | 19.5 | 196 | 3900 | "male" | 2007 |
"Chinstrap" | "Dream" | 51.3 | 19.2 | 193 | 3650 | "male" | 2007 |
"Chinstrap" | "Dream" | 45.4 | 18.7 | 188 | 3525 | "female" | 2007 |
"Chinstrap" | "Dream" | 52.7 | 19.8 | 197 | 3725 | "male" | 2007 |
"Chinstrap" | "Dream" | 45.2 | 17.8 | 198 | 3950 | "female" | 2007 |
"Chinstrap" | "Dream" | 46.1 | 18.2 | 178 | 3250 | "female" | 2007 |
"Chinstrap" | "Dream" | 51.3 | 18.2 | 197 | 3750 | "male" | 2007 |
"Chinstrap" | "Dream" | 46.0 | 18.9 | 195 | 4150 | "female" | 2007 |
"Chinstrap" | "Dream" | 51.3 | 19.9 | 198 | 3700 | "male" | 2007 |
"Chinstrap" | "Dream" | 46.6 | 17.8 | 193 | 3800 | "female" | 2007 |
"Chinstrap" | "Dream" | 51.7 | 20.3 | 194 | 3775 | "male" | 2007 |
… | … | … | … | … | … | … | … |
"Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 |
"Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 |
"Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 |
"Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 |
"Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 |
"Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 |
"Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 |
"Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 |
"Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 |
"Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 |
Subset Columns
To subset columns, we want to use the select()
method.
"species") penguins.select(
species |
---|
str |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
"Adelie" |
… |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
"Chinstrap" |
To select multiple columns, we pass in a list
"species", "island"]) penguins.select([
species | island |
---|---|
str | str |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
"Adelie" | "Torgersen" |
… | … |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
"Chinstrap" | "Dream" |
To select columns with regex, we can pass regex to pl.col()
. Note that we need to indicate this is regex by passing both ^ and $.
"^bill_.*$")) penguins.select(pl.col(
bill_length_mm | bill_depth_mm |
---|---|
f64 | f64 |
39.1 | 18.7 |
39.5 | 17.4 |
40.3 | 18.0 |
null | null |
36.7 | 19.3 |
39.3 | 20.6 |
38.9 | 17.8 |
39.2 | 19.6 |
34.1 | 18.1 |
42.0 | 20.2 |
37.8 | 17.1 |
37.8 | 17.3 |
… | … |
45.2 | 16.6 |
49.3 | 19.9 |
50.2 | 18.8 |
45.6 | 19.4 |
51.9 | 19.5 |
46.8 | 16.5 |
45.7 | 17.0 |
55.8 | 19.8 |
43.5 | 18.1 |
49.6 | 18.2 |
50.8 | 19.0 |
50.2 | 18.7 |
and we can also exclude by regex
"^bill_.*$")) penguins.select(pl.exclude(
species | island | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|
str | str | i64 | i64 | str | i64 |
"Adelie" | "Torgersen" | 181 | 3750 | "male" | 2007 |
"Adelie" | "Torgersen" | 186 | 3800 | "female" | 2007 |
"Adelie" | "Torgersen" | 195 | 3250 | "female" | 2007 |
"Adelie" | "Torgersen" | null | null | null | 2007 |
"Adelie" | "Torgersen" | 193 | 3450 | "female" | 2007 |
"Adelie" | "Torgersen" | 190 | 3650 | "male" | 2007 |
"Adelie" | "Torgersen" | 181 | 3625 | "female" | 2007 |
"Adelie" | "Torgersen" | 195 | 4675 | "male" | 2007 |
"Adelie" | "Torgersen" | 193 | 3475 | null | 2007 |
"Adelie" | "Torgersen" | 190 | 4250 | null | 2007 |
"Adelie" | "Torgersen" | 186 | 3300 | null | 2007 |
"Adelie" | "Torgersen" | 180 | 3700 | null | 2007 |
… | … | … | … | … | … |
"Chinstrap" | "Dream" | 191 | 3250 | "female" | 2009 |
"Chinstrap" | "Dream" | 203 | 4050 | "male" | 2009 |
"Chinstrap" | "Dream" | 202 | 3800 | "male" | 2009 |
"Chinstrap" | "Dream" | 194 | 3525 | "female" | 2009 |
"Chinstrap" | "Dream" | 206 | 3950 | "male" | 2009 |
"Chinstrap" | "Dream" | 189 | 3650 | "female" | 2009 |
"Chinstrap" | "Dream" | 195 | 3650 | "female" | 2009 |
"Chinstrap" | "Dream" | 207 | 4000 | "male" | 2009 |
"Chinstrap" | "Dream" | 202 | 3400 | "female" | 2009 |
"Chinstrap" | "Dream" | 193 | 3775 | "male" | 2009 |
"Chinstrap" | "Dream" | 210 | 4100 | "male" | 2009 |
"Chinstrap" | "Dream" | 198 | 3775 | "female" | 2009 |
Polars (via the polars.selectors module) also provides utilities similar to tidyverse selectors, such as starts_with()
"bill")) penguins.select(cs.starts_with(
bill_length_mm | bill_depth_mm |
---|---|
f64 | f64 |
39.1 | 18.7 |
39.5 | 17.4 |
40.3 | 18.0 |
null | null |
36.7 | 19.3 |
39.3 | 20.6 |
38.9 | 17.8 |
39.2 | 19.6 |
34.1 | 18.1 |
42.0 | 20.2 |
37.8 | 17.1 |
37.8 | 17.3 |
… | … |
45.2 | 16.6 |
49.3 | 19.9 |
50.2 | 18.8 |
45.6 | 19.4 |
51.9 | 19.5 |
46.8 | 16.5 |
45.7 | 17.0 |
55.8 | 19.8 |
43.5 | 18.1 |
49.6 | 18.2 |
50.8 | 19.0 |
50.2 | 18.7 |
Another useful feature is selecting by data type. This is something that polars.selectors enables. For instance, if we wanted to select all numeric columns
penguins.select(cs.by_dtype(pl.NUMERIC_DTYPES))
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
---|---|---|---|---|
f64 | f64 | i64 | i64 | i64 |
39.1 | 18.7 | 181 | 3750 | 2007 |
39.5 | 17.4 | 186 | 3800 | 2007 |
40.3 | 18.0 | 195 | 3250 | 2007 |
null | null | null | null | 2007 |
36.7 | 19.3 | 193 | 3450 | 2007 |
39.3 | 20.6 | 190 | 3650 | 2007 |
38.9 | 17.8 | 181 | 3625 | 2007 |
39.2 | 19.6 | 195 | 4675 | 2007 |
34.1 | 18.1 | 193 | 3475 | 2007 |
42.0 | 20.2 | 190 | 4250 | 2007 |
37.8 | 17.1 | 186 | 3300 | 2007 |
37.8 | 17.3 | 180 | 3700 | 2007 |
… | … | … | … | … |
45.2 | 16.6 | 191 | 3250 | 2009 |
49.3 | 19.9 | 203 | 4050 | 2009 |
50.2 | 18.8 | 202 | 3800 | 2009 |
45.6 | 19.4 | 194 | 3525 | 2009 |
51.9 | 19.5 | 206 | 3950 | 2009 |
46.8 | 16.5 | 189 | 3650 | 2009 |
45.7 | 17.0 | 195 | 3650 | 2009 |
55.8 | 19.8 | 207 | 4000 | 2009 |
43.5 | 18.1 | 202 | 3400 | 2009 |
49.6 | 18.2 | 193 | 3775 | 2009 |
50.8 | 19.0 | 210 | 4100 | 2009 |
50.2 | 18.7 | 198 | 3775 | 2009 |
Creating New Columns
To create new columns, we’ll use the with_columns()
method, which is similar to dplyr::mutate()
=pl.col("species") == "Chinstrap") penguins.with_columns(is_chinstrap
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap |
---|---|---|---|---|---|---|---|---|
str | str | f64 | f64 | i64 | i64 | str | i64 | bool |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 | false |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 | false |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 | false |
"Adelie" | "Torgersen" | null | null | null | null | null | 2007 | false |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 | false |
"Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 | false |
"Adelie" | "Torgersen" | 38.9 | 17.8 | 181 | 3625 | "female" | 2007 | false |
"Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 | false |
"Adelie" | "Torgersen" | 34.1 | 18.1 | 193 | 3475 | null | 2007 | false |
"Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 | false |
"Adelie" | "Torgersen" | 37.8 | 17.1 | 186 | 3300 | null | 2007 | false |
"Adelie" | "Torgersen" | 37.8 | 17.3 | 180 | 3700 | null | 2007 | false |
… | … | … | … | … | … | … | … | … |
"Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 | true |
"Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 | true |
"Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 | true |
"Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 | true |
"Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 | true |
"Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 | true |
"Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 | true |
"Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 | true |
"Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 | true |
"Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 | true |
"Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 | true |
"Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 | true |
We can create multiple columns at a time
penguins.with_columns(=pl.col("species") == "Chinstrap",
is_chinstrap=pl.col("body_mass_g") >= 4000,
chonk=pl.col("bill_length_mm") + pl.col("bill_depth_mm"),
bill_stuff=pl.col("sex").str.slice(0, 1).str.to_uppercase(),
sex_initial )
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap | chonk | bill_stuff | sex_initial |
---|---|---|---|---|---|---|---|---|---|---|---|
str | str | f64 | f64 | i64 | i64 | str | i64 | bool | bool | f64 | str |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 | false | false | 57.8 | "M" |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 | false | false | 56.9 | "F" |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 | false | false | 58.3 | "F" |
"Adelie" | "Torgersen" | null | null | null | null | null | 2007 | false | null | null | null |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 | false | false | 56.0 | "F" |
"Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 | false | false | 59.9 | "M" |
"Adelie" | "Torgersen" | 38.9 | 17.8 | 181 | 3625 | "female" | 2007 | false | false | 56.7 | "F" |
"Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 | false | true | 58.8 | "M" |
"Adelie" | "Torgersen" | 34.1 | 18.1 | 193 | 3475 | null | 2007 | false | false | 52.2 | null |
"Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 | false | true | 62.2 | null |
"Adelie" | "Torgersen" | 37.8 | 17.1 | 186 | 3300 | null | 2007 | false | false | 54.9 | null |
"Adelie" | "Torgersen" | 37.8 | 17.3 | 180 | 3700 | null | 2007 | false | false | 55.1 | null |
… | … | … | … | … | … | … | … | … | … | … | … |
"Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 | true | false | 61.8 | "F" |
"Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 | true | true | 69.2 | "M" |
"Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 | true | false | 69.0 | "M" |
"Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 | true | false | 65.0 | "F" |
"Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 | true | false | 71.4 | "M" |
"Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 | true | false | 63.3 | "F" |
"Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 | true | false | 62.7 | "F" |
"Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 | true | true | 75.6 | "M" |
"Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 | true | false | 61.6 | "F" |
"Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 | true | false | 67.8 | "M" |
"Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 | true | true | 69.8 | "M" |
"Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 | true | false | 68.9 | "F" |
I’ll likely add more examples here as I encounter common use cases.
Summarizing DataFrames
Most summarizing workflows will involve the groupby()
method, followed by some other operation.
Probably the most common thing we’ll want to do is count by group
"island").count() penguins.groupby(
island | count |
---|---|
str | u32 |
"Dream" | 124 |
"Biscoe" | 168 |
"Torgersen" | 52 |
We can also group by multiple columns
"island", "species"]).count() penguins.groupby([
island | species | count |
---|---|---|
str | str | u32 |
"Torgersen" | "Adelie" | 52 |
"Biscoe" | "Gentoo" | 124 |
"Dream" | "Chinstrap" | 68 |
"Dream" | "Adelie" | 56 |
"Biscoe" | "Adelie" | 44 |
Another common thing might be to take the average of a numeric column by group. We can do this via the agg()
method and passing various aggregation/summarization functions to this method.
"species").agg(pl.mean("bill_length_mm").alias("bill_length_mean")) penguins.groupby(
species | bill_length_mean |
---|---|
str | f64 |
"Chinstrap" | 48.833824 |
"Adelie" | 38.791391 |
"Gentoo" | 47.504878 |
We might also want to get the average of all of our numeric types by group. This is another situation where the polars.selectors
module is helpful.
"species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean().suffix("_mean")) penguins.groupby(
species | bill_length_mm_mean | bill_depth_mm_mean | flipper_length_mm_mean | body_mass_g_mean | year_mean |
---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 |
"Adelie" | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
"Chinstrap" | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
"Gentoo" | 47.504878 | 14.982114 | 217.186992 | 5076.01626 | 2008.080645 |
Or we might want to apply multiple aggregation functions to a column. This works basically like with_columns()
in that you can pass multiple expressions
"species").agg(
penguins.groupby("bill_length_mm").alias("bill_length_mean"),
pl.mean("bill_length_mm").count().alias("n"),
pl.col( )
species | bill_length_mean | n |
---|---|---|
str | f64 | u32 |
"Chinstrap" | 48.833824 | 68 |
"Gentoo" | 47.504878 | 124 |
"Adelie" | 38.791391 | 152 |
Reshaping Long and Wide
Let’s take our summary from before where we calculated the average of all of the numeric columns by species
= penguins.groupby("species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean()) species_means
Now suppose we want to pivot this longer so that it’s in “tidy” format. To do this, we want the melt()
method.
= species_means.melt(id_vars="species")
species_long
species_long.head()
species | variable | value |
---|---|---|
str | str | f64 |
"Chinstrap" | "bill_length_mm… | 48.833824 |
"Adelie" | "bill_length_mm… | 38.791391 |
"Gentoo" | "bill_length_mm… | 47.504878 |
"Chinstrap" | "bill_depth_mm" | 18.420588 |
"Adelie" | "bill_depth_mm" | 18.346358 |
Note that in the above, there’s also a value_vars
parameter. By leaving it empty, it defaults to every column not in id_vars
, but this might not always be the right choice.
And if we want to reshape back to wide, we use pivot()
. Pivot can optionally perform aggregations while reshaping, so we need to tell it that we don’t want it to aggregate anything (by passing aggregate_function=None
)
="value", index="species", columns="variable", aggregate_function=None) species_long.pivot(values
species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 |
"Chinstrap" | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
"Adelie" | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
"Gentoo" | 47.504878 | 14.982114 | 217.186992 | 5076.01626 | 2008.080645 |