import polars.selectors as cs
import polars as pl
# read in data
penguins = pl.read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv",
null_values="NA",
)Polars Cheat Sheet
A work-in-progress cheat sheet for working with polars dataframes
Setup & Read in Data
Check Properties of a Dataframe
use the shape attribute to check the dimensions (rows, cols)
penguins.shape(344, 8)
use the head() method to see the first few rows
penguins.head()| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | i64 | i64 | str | i64 |
| "Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 |
| "Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 |
| "Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 |
| "Adelie" | "Torgersen" | null | null | null | null | null | 2007 |
| "Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 |
to see the column names, we can use the columns attribute
penguins.columns['species',
'island',
'bill_length_mm',
'bill_depth_mm',
'flipper_length_mm',
'body_mass_g',
'sex',
'year']
Subset Rows
For the most part, we want to subset rows using the filter() method.
For example, we might want to filter out rows that aren’t missing data for the body mass variable
penguins.filter(pl.col("body_mass_g").is_not_null())| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | i64 | i64 | str | i64 |
| "Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 |
| "Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 |
| "Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 |
| "Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 |
| "Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 |
| "Adelie" | "Torgersen" | 38.9 | 17.8 | 181 | 3625 | "female" | 2007 |
| "Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 |
| "Adelie" | "Torgersen" | 34.1 | 18.1 | 193 | 3475 | null | 2007 |
| "Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 |
| "Adelie" | "Torgersen" | 37.8 | 17.1 | 186 | 3300 | null | 2007 |
| "Adelie" | "Torgersen" | 37.8 | 17.3 | 180 | 3700 | null | 2007 |
| "Adelie" | "Torgersen" | 41.1 | 17.6 | 182 | 3200 | "female" | 2007 |
| … | … | … | … | … | … | … | … |
| "Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 |
| "Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 |
| "Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 |
| "Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 |
| "Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 |
| "Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 |
| "Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 |
| "Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 |
| "Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 |
| "Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 |
| "Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 |
| "Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 |
Basically, filter keeps rows where the expression evaluates to True. And so we can use any predicate expression that results in a boolean. We can also use multiple expressions.
penguins.filter((pl.col("species") == "Adelie") & (pl.col("bill_length_mm") >= 39.0))| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | i64 | i64 | str | i64 |
| "Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 |
| "Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 |
| "Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 |
| "Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 |
| "Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 |
| "Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 |
| "Adelie" | "Torgersen" | 41.1 | 17.6 | 182 | 3200 | "female" | 2007 |
| "Adelie" | "Torgersen" | 42.5 | 20.7 | 197 | 4500 | "male" | 2007 |
| "Adelie" | "Torgersen" | 46.0 | 21.5 | 194 | 4200 | "male" | 2007 |
| "Adelie" | "Biscoe" | 40.6 | 18.6 | 183 | 3550 | "male" | 2007 |
| "Adelie" | "Biscoe" | 40.5 | 17.9 | 187 | 3200 | "female" | 2007 |
| "Adelie" | "Biscoe" | 40.5 | 18.9 | 180 | 3950 | "male" | 2007 |
| … | … | … | … | … | … | … | … |
| "Adelie" | "Torgersen" | 39.0 | 17.1 | 191 | 3050 | "female" | 2009 |
| "Adelie" | "Torgersen" | 44.1 | 18.0 | 210 | 4000 | "male" | 2009 |
| "Adelie" | "Torgersen" | 43.1 | 19.2 | 197 | 3500 | "male" | 2009 |
| "Adelie" | "Dream" | 41.1 | 17.5 | 190 | 3900 | "male" | 2009 |
| "Adelie" | "Dream" | 40.2 | 20.1 | 200 | 3975 | "male" | 2009 |
| "Adelie" | "Dream" | 39.7 | 17.9 | 193 | 4250 | "male" | 2009 |
| "Adelie" | "Dream" | 40.2 | 17.1 | 193 | 3400 | "female" | 2009 |
| "Adelie" | "Dream" | 40.6 | 17.2 | 187 | 3475 | "male" | 2009 |
| "Adelie" | "Dream" | 40.7 | 17.0 | 190 | 3725 | "male" | 2009 |
| "Adelie" | "Dream" | 39.0 | 18.7 | 185 | 3650 | "male" | 2009 |
| "Adelie" | "Dream" | 39.2 | 18.6 | 190 | 4250 | "male" | 2009 |
| "Adelie" | "Dream" | 41.5 | 18.5 | 201 | 4000 | "male" | 2009 |
We might also want to filter based on string matches. For example, strings that start with a certain substring
penguins.filter(pl.col("species").str.starts_with("Chin"))| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | i64 | i64 | str | i64 |
| "Chinstrap" | "Dream" | 46.5 | 17.9 | 192 | 3500 | "female" | 2007 |
| "Chinstrap" | "Dream" | 50.0 | 19.5 | 196 | 3900 | "male" | 2007 |
| "Chinstrap" | "Dream" | 51.3 | 19.2 | 193 | 3650 | "male" | 2007 |
| "Chinstrap" | "Dream" | 45.4 | 18.7 | 188 | 3525 | "female" | 2007 |
| "Chinstrap" | "Dream" | 52.7 | 19.8 | 197 | 3725 | "male" | 2007 |
| "Chinstrap" | "Dream" | 45.2 | 17.8 | 198 | 3950 | "female" | 2007 |
| "Chinstrap" | "Dream" | 46.1 | 18.2 | 178 | 3250 | "female" | 2007 |
| "Chinstrap" | "Dream" | 51.3 | 18.2 | 197 | 3750 | "male" | 2007 |
| "Chinstrap" | "Dream" | 46.0 | 18.9 | 195 | 4150 | "female" | 2007 |
| "Chinstrap" | "Dream" | 51.3 | 19.9 | 198 | 3700 | "male" | 2007 |
| "Chinstrap" | "Dream" | 46.6 | 17.8 | 193 | 3800 | "female" | 2007 |
| "Chinstrap" | "Dream" | 51.7 | 20.3 | 194 | 3775 | "male" | 2007 |
| … | … | … | … | … | … | … | … |
| "Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 |
| "Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 |
| "Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 |
| "Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 |
| "Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 |
| "Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 |
| "Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 |
| "Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 |
| "Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 |
| "Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 |
| "Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 |
| "Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 |
Subset Columns
To subset columns, we want to use the select() method.
penguins.select("species")| species |
|---|
| str |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| "Adelie" |
| … |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
| "Chinstrap" |
To select multiple columns, we pass in a list
penguins.select(["species", "island"])| species | island |
|---|---|
| str | str |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| "Adelie" | "Torgersen" |
| … | … |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
| "Chinstrap" | "Dream" |
To select columns with regex, we can pass regex to pl.col(). Note that we need to indicate this is regex by passing both ^ and $.
penguins.select(pl.col("^bill_.*$"))| bill_length_mm | bill_depth_mm |
|---|---|
| f64 | f64 |
| 39.1 | 18.7 |
| 39.5 | 17.4 |
| 40.3 | 18.0 |
| null | null |
| 36.7 | 19.3 |
| 39.3 | 20.6 |
| 38.9 | 17.8 |
| 39.2 | 19.6 |
| 34.1 | 18.1 |
| 42.0 | 20.2 |
| 37.8 | 17.1 |
| 37.8 | 17.3 |
| … | … |
| 45.2 | 16.6 |
| 49.3 | 19.9 |
| 50.2 | 18.8 |
| 45.6 | 19.4 |
| 51.9 | 19.5 |
| 46.8 | 16.5 |
| 45.7 | 17.0 |
| 55.8 | 19.8 |
| 43.5 | 18.1 |
| 49.6 | 18.2 |
| 50.8 | 19.0 |
| 50.2 | 18.7 |
and we can also exclude by regex
penguins.select(pl.exclude("^bill_.*$"))| species | island | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|
| str | str | i64 | i64 | str | i64 |
| "Adelie" | "Torgersen" | 181 | 3750 | "male" | 2007 |
| "Adelie" | "Torgersen" | 186 | 3800 | "female" | 2007 |
| "Adelie" | "Torgersen" | 195 | 3250 | "female" | 2007 |
| "Adelie" | "Torgersen" | null | null | null | 2007 |
| "Adelie" | "Torgersen" | 193 | 3450 | "female" | 2007 |
| "Adelie" | "Torgersen" | 190 | 3650 | "male" | 2007 |
| "Adelie" | "Torgersen" | 181 | 3625 | "female" | 2007 |
| "Adelie" | "Torgersen" | 195 | 4675 | "male" | 2007 |
| "Adelie" | "Torgersen" | 193 | 3475 | null | 2007 |
| "Adelie" | "Torgersen" | 190 | 4250 | null | 2007 |
| "Adelie" | "Torgersen" | 186 | 3300 | null | 2007 |
| "Adelie" | "Torgersen" | 180 | 3700 | null | 2007 |
| … | … | … | … | … | … |
| "Chinstrap" | "Dream" | 191 | 3250 | "female" | 2009 |
| "Chinstrap" | "Dream" | 203 | 4050 | "male" | 2009 |
| "Chinstrap" | "Dream" | 202 | 3800 | "male" | 2009 |
| "Chinstrap" | "Dream" | 194 | 3525 | "female" | 2009 |
| "Chinstrap" | "Dream" | 206 | 3950 | "male" | 2009 |
| "Chinstrap" | "Dream" | 189 | 3650 | "female" | 2009 |
| "Chinstrap" | "Dream" | 195 | 3650 | "female" | 2009 |
| "Chinstrap" | "Dream" | 207 | 4000 | "male" | 2009 |
| "Chinstrap" | "Dream" | 202 | 3400 | "female" | 2009 |
| "Chinstrap" | "Dream" | 193 | 3775 | "male" | 2009 |
| "Chinstrap" | "Dream" | 210 | 4100 | "male" | 2009 |
| "Chinstrap" | "Dream" | 198 | 3775 | "female" | 2009 |
Polars (via the polars.selectors module) also provides utilities similar to tidyverse selectors, such as starts_with()
penguins.select(cs.starts_with("bill"))| bill_length_mm | bill_depth_mm |
|---|---|
| f64 | f64 |
| 39.1 | 18.7 |
| 39.5 | 17.4 |
| 40.3 | 18.0 |
| null | null |
| 36.7 | 19.3 |
| 39.3 | 20.6 |
| 38.9 | 17.8 |
| 39.2 | 19.6 |
| 34.1 | 18.1 |
| 42.0 | 20.2 |
| 37.8 | 17.1 |
| 37.8 | 17.3 |
| … | … |
| 45.2 | 16.6 |
| 49.3 | 19.9 |
| 50.2 | 18.8 |
| 45.6 | 19.4 |
| 51.9 | 19.5 |
| 46.8 | 16.5 |
| 45.7 | 17.0 |
| 55.8 | 19.8 |
| 43.5 | 18.1 |
| 49.6 | 18.2 |
| 50.8 | 19.0 |
| 50.2 | 18.7 |
Another useful feature is selecting by data type. This is something that polars.selectors enables. For instance, if we wanted to select all numeric columns
penguins.select(cs.by_dtype(pl.NUMERIC_DTYPES))| bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|
| f64 | f64 | i64 | i64 | i64 |
| 39.1 | 18.7 | 181 | 3750 | 2007 |
| 39.5 | 17.4 | 186 | 3800 | 2007 |
| 40.3 | 18.0 | 195 | 3250 | 2007 |
| null | null | null | null | 2007 |
| 36.7 | 19.3 | 193 | 3450 | 2007 |
| 39.3 | 20.6 | 190 | 3650 | 2007 |
| 38.9 | 17.8 | 181 | 3625 | 2007 |
| 39.2 | 19.6 | 195 | 4675 | 2007 |
| 34.1 | 18.1 | 193 | 3475 | 2007 |
| 42.0 | 20.2 | 190 | 4250 | 2007 |
| 37.8 | 17.1 | 186 | 3300 | 2007 |
| 37.8 | 17.3 | 180 | 3700 | 2007 |
| … | … | … | … | … |
| 45.2 | 16.6 | 191 | 3250 | 2009 |
| 49.3 | 19.9 | 203 | 4050 | 2009 |
| 50.2 | 18.8 | 202 | 3800 | 2009 |
| 45.6 | 19.4 | 194 | 3525 | 2009 |
| 51.9 | 19.5 | 206 | 3950 | 2009 |
| 46.8 | 16.5 | 189 | 3650 | 2009 |
| 45.7 | 17.0 | 195 | 3650 | 2009 |
| 55.8 | 19.8 | 207 | 4000 | 2009 |
| 43.5 | 18.1 | 202 | 3400 | 2009 |
| 49.6 | 18.2 | 193 | 3775 | 2009 |
| 50.8 | 19.0 | 210 | 4100 | 2009 |
| 50.2 | 18.7 | 198 | 3775 | 2009 |
Creating New Columns
To create new columns, we’ll use the with_columns() method, which is similar to dplyr::mutate()
penguins.with_columns(is_chinstrap=pl.col("species") == "Chinstrap")| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap |
|---|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | i64 | i64 | str | i64 | bool |
| "Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 | false |
| "Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 | false |
| "Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 | false |
| "Adelie" | "Torgersen" | null | null | null | null | null | 2007 | false |
| "Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 | false |
| "Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 | false |
| "Adelie" | "Torgersen" | 38.9 | 17.8 | 181 | 3625 | "female" | 2007 | false |
| "Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 | false |
| "Adelie" | "Torgersen" | 34.1 | 18.1 | 193 | 3475 | null | 2007 | false |
| "Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 | false |
| "Adelie" | "Torgersen" | 37.8 | 17.1 | 186 | 3300 | null | 2007 | false |
| "Adelie" | "Torgersen" | 37.8 | 17.3 | 180 | 3700 | null | 2007 | false |
| … | … | … | … | … | … | … | … | … |
| "Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 | true |
| "Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 | true |
| "Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 | true |
| "Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 | true |
| "Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 | true |
| "Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 | true |
| "Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 | true |
| "Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 | true |
| "Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 | true |
| "Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 | true |
| "Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 | true |
| "Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 | true |
We can create multiple columns at a time
penguins.with_columns(
is_chinstrap=pl.col("species") == "Chinstrap",
chonk=pl.col("body_mass_g") >= 4000,
bill_stuff=pl.col("bill_length_mm") + pl.col("bill_depth_mm"),
sex_initial=pl.col("sex").str.slice(0, 1).str.to_uppercase(),
)| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap | chonk | bill_stuff | sex_initial |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | i64 | i64 | str | i64 | bool | bool | f64 | str |
| "Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "male" | 2007 | false | false | 57.8 | "M" |
| "Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "female" | 2007 | false | false | 56.9 | "F" |
| "Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "female" | 2007 | false | false | 58.3 | "F" |
| "Adelie" | "Torgersen" | null | null | null | null | null | 2007 | false | null | null | null |
| "Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "female" | 2007 | false | false | 56.0 | "F" |
| "Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | 3650 | "male" | 2007 | false | false | 59.9 | "M" |
| "Adelie" | "Torgersen" | 38.9 | 17.8 | 181 | 3625 | "female" | 2007 | false | false | 56.7 | "F" |
| "Adelie" | "Torgersen" | 39.2 | 19.6 | 195 | 4675 | "male" | 2007 | false | true | 58.8 | "M" |
| "Adelie" | "Torgersen" | 34.1 | 18.1 | 193 | 3475 | null | 2007 | false | false | 52.2 | null |
| "Adelie" | "Torgersen" | 42.0 | 20.2 | 190 | 4250 | null | 2007 | false | true | 62.2 | null |
| "Adelie" | "Torgersen" | 37.8 | 17.1 | 186 | 3300 | null | 2007 | false | false | 54.9 | null |
| "Adelie" | "Torgersen" | 37.8 | 17.3 | 180 | 3700 | null | 2007 | false | false | 55.1 | null |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "Chinstrap" | "Dream" | 45.2 | 16.6 | 191 | 3250 | "female" | 2009 | true | false | 61.8 | "F" |
| "Chinstrap" | "Dream" | 49.3 | 19.9 | 203 | 4050 | "male" | 2009 | true | true | 69.2 | "M" |
| "Chinstrap" | "Dream" | 50.2 | 18.8 | 202 | 3800 | "male" | 2009 | true | false | 69.0 | "M" |
| "Chinstrap" | "Dream" | 45.6 | 19.4 | 194 | 3525 | "female" | 2009 | true | false | 65.0 | "F" |
| "Chinstrap" | "Dream" | 51.9 | 19.5 | 206 | 3950 | "male" | 2009 | true | false | 71.4 | "M" |
| "Chinstrap" | "Dream" | 46.8 | 16.5 | 189 | 3650 | "female" | 2009 | true | false | 63.3 | "F" |
| "Chinstrap" | "Dream" | 45.7 | 17.0 | 195 | 3650 | "female" | 2009 | true | false | 62.7 | "F" |
| "Chinstrap" | "Dream" | 55.8 | 19.8 | 207 | 4000 | "male" | 2009 | true | true | 75.6 | "M" |
| "Chinstrap" | "Dream" | 43.5 | 18.1 | 202 | 3400 | "female" | 2009 | true | false | 61.6 | "F" |
| "Chinstrap" | "Dream" | 49.6 | 18.2 | 193 | 3775 | "male" | 2009 | true | false | 67.8 | "M" |
| "Chinstrap" | "Dream" | 50.8 | 19.0 | 210 | 4100 | "male" | 2009 | true | true | 69.8 | "M" |
| "Chinstrap" | "Dream" | 50.2 | 18.7 | 198 | 3775 | "female" | 2009 | true | false | 68.9 | "F" |
I’ll likely add more examples here as I encounter common use cases.
Summarizing DataFrames
Most summarizing workflows will involve the groupby() method, followed by some other operation.
Probably the most common thing we’ll want to do is count by group
penguins.groupby("island").count()| island | count |
|---|---|
| str | u32 |
| "Dream" | 124 |
| "Biscoe" | 168 |
| "Torgersen" | 52 |
We can also group by multiple columns
penguins.groupby(["island", "species"]).count()| island | species | count |
|---|---|---|
| str | str | u32 |
| "Torgersen" | "Adelie" | 52 |
| "Biscoe" | "Gentoo" | 124 |
| "Dream" | "Chinstrap" | 68 |
| "Dream" | "Adelie" | 56 |
| "Biscoe" | "Adelie" | 44 |
Another common thing might be to take the average of a numeric column by group. We can do this via the agg() method and passing various aggregation/summarization functions to this method.
penguins.groupby("species").agg(pl.mean("bill_length_mm").alias("bill_length_mean"))| species | bill_length_mean |
|---|---|
| str | f64 |
| "Chinstrap" | 48.833824 |
| "Adelie" | 38.791391 |
| "Gentoo" | 47.504878 |
We might also want to get the average of all of our numeric types by group. This is another situation where the polars.selectors module is helpful.
penguins.groupby("species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean().suffix("_mean"))| species | bill_length_mm_mean | bill_depth_mm_mean | flipper_length_mm_mean | body_mass_g_mean | year_mean |
|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 |
| "Adelie" | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
| "Chinstrap" | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
| "Gentoo" | 47.504878 | 14.982114 | 217.186992 | 5076.01626 | 2008.080645 |
Or we might want to apply multiple aggregation functions to a column. This works basically like with_columns() in that you can pass multiple expressions
penguins.groupby("species").agg(
pl.mean("bill_length_mm").alias("bill_length_mean"),
pl.col("bill_length_mm").count().alias("n"),
)| species | bill_length_mean | n |
|---|---|---|
| str | f64 | u32 |
| "Chinstrap" | 48.833824 | 68 |
| "Gentoo" | 47.504878 | 124 |
| "Adelie" | 38.791391 | 152 |
Reshaping Long and Wide
Let’s take our summary from before where we calculated the average of all of the numeric columns by species
species_means = penguins.groupby("species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean())Now suppose we want to pivot this longer so that it’s in “tidy” format. To do this, we want the melt() method.
species_long = species_means.melt(id_vars="species")
species_long.head()| species | variable | value |
|---|---|---|
| str | str | f64 |
| "Chinstrap" | "bill_length_mm… | 48.833824 |
| "Adelie" | "bill_length_mm… | 38.791391 |
| "Gentoo" | "bill_length_mm… | 47.504878 |
| "Chinstrap" | "bill_depth_mm" | 18.420588 |
| "Adelie" | "bill_depth_mm" | 18.346358 |
Note that in the above, there’s also a value_vars parameter. By leaving it empty, it defaults to every column not in id_vars, but this might not always be the right choice.
And if we want to reshape back to wide, we use pivot(). Pivot can optionally perform aggregations while reshaping, so we need to tell it that we don’t want it to aggregate anything (by passing aggregate_function=None)
species_long.pivot(values="value", index="species", columns="variable", aggregate_function=None)| species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 |
| "Chinstrap" | 48.833824 | 18.420588 | 195.823529 | 3733.088235 | 2007.970588 |
| "Adelie" | 38.791391 | 18.346358 | 189.953642 | 3700.662252 | 2008.013158 |
| "Gentoo" | 47.504878 | 14.982114 | 217.186992 | 5076.01626 | 2008.080645 |