Polars Cheat Sheet

A work-in-progress cheat sheet for working with polars dataframes

Author
Published

May 22, 2024

Setup & Read in Data

import polars.selectors as cs
import polars as pl

# read in data
penguins = pl.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv",
    null_values="NA",
)

Check Properties of a Dataframe

use the shape attribute to check the dimensions (rows, cols)

penguins.shape
(344, 8)

use the head() method to see the first few rows

penguins.head()
shape: (5, 8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
str str f64 f64 i64 i64 str i64
"Adelie" "Torgersen" 39.1 18.7 181 3750 "male" 2007
"Adelie" "Torgersen" 39.5 17.4 186 3800 "female" 2007
"Adelie" "Torgersen" 40.3 18.0 195 3250 "female" 2007
"Adelie" "Torgersen" null null null null null 2007
"Adelie" "Torgersen" 36.7 19.3 193 3450 "female" 2007

to see the column names, we can use the columns attribute

penguins.columns
['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex',
 'year']

Subset Rows

For the most part, we want to subset rows using the filter() method.

For example, we might want to filter out rows that aren’t missing data for the body mass variable

penguins.filter(pl.col("body_mass_g").is_not_null())
shape: (342, 8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
str str f64 f64 i64 i64 str i64
"Adelie" "Torgersen" 39.1 18.7 181 3750 "male" 2007
"Adelie" "Torgersen" 39.5 17.4 186 3800 "female" 2007
"Adelie" "Torgersen" 40.3 18.0 195 3250 "female" 2007
"Adelie" "Torgersen" 36.7 19.3 193 3450 "female" 2007
"Adelie" "Torgersen" 39.3 20.6 190 3650 "male" 2007
"Adelie" "Torgersen" 38.9 17.8 181 3625 "female" 2007
"Adelie" "Torgersen" 39.2 19.6 195 4675 "male" 2007
"Adelie" "Torgersen" 34.1 18.1 193 3475 null 2007
"Adelie" "Torgersen" 42.0 20.2 190 4250 null 2007
"Adelie" "Torgersen" 37.8 17.1 186 3300 null 2007
"Adelie" "Torgersen" 37.8 17.3 180 3700 null 2007
"Adelie" "Torgersen" 41.1 17.6 182 3200 "female" 2007
"Chinstrap" "Dream" 45.2 16.6 191 3250 "female" 2009
"Chinstrap" "Dream" 49.3 19.9 203 4050 "male" 2009
"Chinstrap" "Dream" 50.2 18.8 202 3800 "male" 2009
"Chinstrap" "Dream" 45.6 19.4 194 3525 "female" 2009
"Chinstrap" "Dream" 51.9 19.5 206 3950 "male" 2009
"Chinstrap" "Dream" 46.8 16.5 189 3650 "female" 2009
"Chinstrap" "Dream" 45.7 17.0 195 3650 "female" 2009
"Chinstrap" "Dream" 55.8 19.8 207 4000 "male" 2009
"Chinstrap" "Dream" 43.5 18.1 202 3400 "female" 2009
"Chinstrap" "Dream" 49.6 18.2 193 3775 "male" 2009
"Chinstrap" "Dream" 50.8 19.0 210 4100 "male" 2009
"Chinstrap" "Dream" 50.2 18.7 198 3775 "female" 2009

Basically, filter keeps rows where the expression evaluates to True. And so we can use any predicate expression that results in a boolean. We can also use multiple expressions.

penguins.filter((pl.col("species") == "Adelie") & (pl.col("bill_length_mm") >= 39.0))
shape: (72, 8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
str str f64 f64 i64 i64 str i64
"Adelie" "Torgersen" 39.1 18.7 181 3750 "male" 2007
"Adelie" "Torgersen" 39.5 17.4 186 3800 "female" 2007
"Adelie" "Torgersen" 40.3 18.0 195 3250 "female" 2007
"Adelie" "Torgersen" 39.3 20.6 190 3650 "male" 2007
"Adelie" "Torgersen" 39.2 19.6 195 4675 "male" 2007
"Adelie" "Torgersen" 42.0 20.2 190 4250 null 2007
"Adelie" "Torgersen" 41.1 17.6 182 3200 "female" 2007
"Adelie" "Torgersen" 42.5 20.7 197 4500 "male" 2007
"Adelie" "Torgersen" 46.0 21.5 194 4200 "male" 2007
"Adelie" "Biscoe" 40.6 18.6 183 3550 "male" 2007
"Adelie" "Biscoe" 40.5 17.9 187 3200 "female" 2007
"Adelie" "Biscoe" 40.5 18.9 180 3950 "male" 2007
"Adelie" "Torgersen" 39.0 17.1 191 3050 "female" 2009
"Adelie" "Torgersen" 44.1 18.0 210 4000 "male" 2009
"Adelie" "Torgersen" 43.1 19.2 197 3500 "male" 2009
"Adelie" "Dream" 41.1 17.5 190 3900 "male" 2009
"Adelie" "Dream" 40.2 20.1 200 3975 "male" 2009
"Adelie" "Dream" 39.7 17.9 193 4250 "male" 2009
"Adelie" "Dream" 40.2 17.1 193 3400 "female" 2009
"Adelie" "Dream" 40.6 17.2 187 3475 "male" 2009
"Adelie" "Dream" 40.7 17.0 190 3725 "male" 2009
"Adelie" "Dream" 39.0 18.7 185 3650 "male" 2009
"Adelie" "Dream" 39.2 18.6 190 4250 "male" 2009
"Adelie" "Dream" 41.5 18.5 201 4000 "male" 2009

We might also want to filter based on string matches. For example, strings that start with a certain substring

penguins.filter(pl.col("species").str.starts_with("Chin"))
shape: (68, 8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
str str f64 f64 i64 i64 str i64
"Chinstrap" "Dream" 46.5 17.9 192 3500 "female" 2007
"Chinstrap" "Dream" 50.0 19.5 196 3900 "male" 2007
"Chinstrap" "Dream" 51.3 19.2 193 3650 "male" 2007
"Chinstrap" "Dream" 45.4 18.7 188 3525 "female" 2007
"Chinstrap" "Dream" 52.7 19.8 197 3725 "male" 2007
"Chinstrap" "Dream" 45.2 17.8 198 3950 "female" 2007
"Chinstrap" "Dream" 46.1 18.2 178 3250 "female" 2007
"Chinstrap" "Dream" 51.3 18.2 197 3750 "male" 2007
"Chinstrap" "Dream" 46.0 18.9 195 4150 "female" 2007
"Chinstrap" "Dream" 51.3 19.9 198 3700 "male" 2007
"Chinstrap" "Dream" 46.6 17.8 193 3800 "female" 2007
"Chinstrap" "Dream" 51.7 20.3 194 3775 "male" 2007
"Chinstrap" "Dream" 45.2 16.6 191 3250 "female" 2009
"Chinstrap" "Dream" 49.3 19.9 203 4050 "male" 2009
"Chinstrap" "Dream" 50.2 18.8 202 3800 "male" 2009
"Chinstrap" "Dream" 45.6 19.4 194 3525 "female" 2009
"Chinstrap" "Dream" 51.9 19.5 206 3950 "male" 2009
"Chinstrap" "Dream" 46.8 16.5 189 3650 "female" 2009
"Chinstrap" "Dream" 45.7 17.0 195 3650 "female" 2009
"Chinstrap" "Dream" 55.8 19.8 207 4000 "male" 2009
"Chinstrap" "Dream" 43.5 18.1 202 3400 "female" 2009
"Chinstrap" "Dream" 49.6 18.2 193 3775 "male" 2009
"Chinstrap" "Dream" 50.8 19.0 210 4100 "male" 2009
"Chinstrap" "Dream" 50.2 18.7 198 3775 "female" 2009

Subset Columns

To subset columns, we want to use the select() method.

penguins.select("species")
shape: (344, 1)
species
str
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"

To select multiple columns, we pass in a list

penguins.select(["species", "island"])
shape: (344, 2)
species island
str str
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Adelie" "Torgersen"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"
"Chinstrap" "Dream"

To select columns with regex, we can pass regex to pl.col(). Note that we need to indicate this is regex by passing both ^ and $.

penguins.select(pl.col("^bill_.*$"))
shape: (344, 2)
bill_length_mm bill_depth_mm
f64 f64
39.1 18.7
39.5 17.4
40.3 18.0
null null
36.7 19.3
39.3 20.6
38.9 17.8
39.2 19.6
34.1 18.1
42.0 20.2
37.8 17.1
37.8 17.3
45.2 16.6
49.3 19.9
50.2 18.8
45.6 19.4
51.9 19.5
46.8 16.5
45.7 17.0
55.8 19.8
43.5 18.1
49.6 18.2
50.8 19.0
50.2 18.7

and we can also exclude by regex

penguins.select(pl.exclude("^bill_.*$"))
shape: (344, 6)
species island flipper_length_mm body_mass_g sex year
str str i64 i64 str i64
"Adelie" "Torgersen" 181 3750 "male" 2007
"Adelie" "Torgersen" 186 3800 "female" 2007
"Adelie" "Torgersen" 195 3250 "female" 2007
"Adelie" "Torgersen" null null null 2007
"Adelie" "Torgersen" 193 3450 "female" 2007
"Adelie" "Torgersen" 190 3650 "male" 2007
"Adelie" "Torgersen" 181 3625 "female" 2007
"Adelie" "Torgersen" 195 4675 "male" 2007
"Adelie" "Torgersen" 193 3475 null 2007
"Adelie" "Torgersen" 190 4250 null 2007
"Adelie" "Torgersen" 186 3300 null 2007
"Adelie" "Torgersen" 180 3700 null 2007
"Chinstrap" "Dream" 191 3250 "female" 2009
"Chinstrap" "Dream" 203 4050 "male" 2009
"Chinstrap" "Dream" 202 3800 "male" 2009
"Chinstrap" "Dream" 194 3525 "female" 2009
"Chinstrap" "Dream" 206 3950 "male" 2009
"Chinstrap" "Dream" 189 3650 "female" 2009
"Chinstrap" "Dream" 195 3650 "female" 2009
"Chinstrap" "Dream" 207 4000 "male" 2009
"Chinstrap" "Dream" 202 3400 "female" 2009
"Chinstrap" "Dream" 193 3775 "male" 2009
"Chinstrap" "Dream" 210 4100 "male" 2009
"Chinstrap" "Dream" 198 3775 "female" 2009

Polars (via the polars.selectors module) also provides utilities similar to tidyverse selectors, such as starts_with()

penguins.select(cs.starts_with("bill"))
shape: (344, 2)
bill_length_mm bill_depth_mm
f64 f64
39.1 18.7
39.5 17.4
40.3 18.0
null null
36.7 19.3
39.3 20.6
38.9 17.8
39.2 19.6
34.1 18.1
42.0 20.2
37.8 17.1
37.8 17.3
45.2 16.6
49.3 19.9
50.2 18.8
45.6 19.4
51.9 19.5
46.8 16.5
45.7 17.0
55.8 19.8
43.5 18.1
49.6 18.2
50.8 19.0
50.2 18.7

Another useful feature is selecting by data type. This is something that polars.selectors enables. For instance, if we wanted to select all numeric columns

penguins.select(cs.by_dtype(pl.NUMERIC_DTYPES))
shape: (344, 5)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
f64 f64 i64 i64 i64
39.1 18.7 181 3750 2007
39.5 17.4 186 3800 2007
40.3 18.0 195 3250 2007
null null null null 2007
36.7 19.3 193 3450 2007
39.3 20.6 190 3650 2007
38.9 17.8 181 3625 2007
39.2 19.6 195 4675 2007
34.1 18.1 193 3475 2007
42.0 20.2 190 4250 2007
37.8 17.1 186 3300 2007
37.8 17.3 180 3700 2007
45.2 16.6 191 3250 2009
49.3 19.9 203 4050 2009
50.2 18.8 202 3800 2009
45.6 19.4 194 3525 2009
51.9 19.5 206 3950 2009
46.8 16.5 189 3650 2009
45.7 17.0 195 3650 2009
55.8 19.8 207 4000 2009
43.5 18.1 202 3400 2009
49.6 18.2 193 3775 2009
50.8 19.0 210 4100 2009
50.2 18.7 198 3775 2009

Creating New Columns

To create new columns, we’ll use the with_columns() method, which is similar to dplyr::mutate()

penguins.with_columns(is_chinstrap=pl.col("species") == "Chinstrap")
shape: (344, 9)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year is_chinstrap
str str f64 f64 i64 i64 str i64 bool
"Adelie" "Torgersen" 39.1 18.7 181 3750 "male" 2007 false
"Adelie" "Torgersen" 39.5 17.4 186 3800 "female" 2007 false
"Adelie" "Torgersen" 40.3 18.0 195 3250 "female" 2007 false
"Adelie" "Torgersen" null null null null null 2007 false
"Adelie" "Torgersen" 36.7 19.3 193 3450 "female" 2007 false
"Adelie" "Torgersen" 39.3 20.6 190 3650 "male" 2007 false
"Adelie" "Torgersen" 38.9 17.8 181 3625 "female" 2007 false
"Adelie" "Torgersen" 39.2 19.6 195 4675 "male" 2007 false
"Adelie" "Torgersen" 34.1 18.1 193 3475 null 2007 false
"Adelie" "Torgersen" 42.0 20.2 190 4250 null 2007 false
"Adelie" "Torgersen" 37.8 17.1 186 3300 null 2007 false
"Adelie" "Torgersen" 37.8 17.3 180 3700 null 2007 false
"Chinstrap" "Dream" 45.2 16.6 191 3250 "female" 2009 true
"Chinstrap" "Dream" 49.3 19.9 203 4050 "male" 2009 true
"Chinstrap" "Dream" 50.2 18.8 202 3800 "male" 2009 true
"Chinstrap" "Dream" 45.6 19.4 194 3525 "female" 2009 true
"Chinstrap" "Dream" 51.9 19.5 206 3950 "male" 2009 true
"Chinstrap" "Dream" 46.8 16.5 189 3650 "female" 2009 true
"Chinstrap" "Dream" 45.7 17.0 195 3650 "female" 2009 true
"Chinstrap" "Dream" 55.8 19.8 207 4000 "male" 2009 true
"Chinstrap" "Dream" 43.5 18.1 202 3400 "female" 2009 true
"Chinstrap" "Dream" 49.6 18.2 193 3775 "male" 2009 true
"Chinstrap" "Dream" 50.8 19.0 210 4100 "male" 2009 true
"Chinstrap" "Dream" 50.2 18.7 198 3775 "female" 2009 true

We can create multiple columns at a time

penguins.with_columns(
    is_chinstrap=pl.col("species") == "Chinstrap",
    chonk=pl.col("body_mass_g") >= 4000,
    bill_stuff=pl.col("bill_length_mm") + pl.col("bill_depth_mm"),
    sex_initial=pl.col("sex").str.slice(0, 1).str.to_uppercase(),
)
shape: (344, 12)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year is_chinstrap chonk bill_stuff sex_initial
str str f64 f64 i64 i64 str i64 bool bool f64 str
"Adelie" "Torgersen" 39.1 18.7 181 3750 "male" 2007 false false 57.8 "M"
"Adelie" "Torgersen" 39.5 17.4 186 3800 "female" 2007 false false 56.9 "F"
"Adelie" "Torgersen" 40.3 18.0 195 3250 "female" 2007 false false 58.3 "F"
"Adelie" "Torgersen" null null null null null 2007 false null null null
"Adelie" "Torgersen" 36.7 19.3 193 3450 "female" 2007 false false 56.0 "F"
"Adelie" "Torgersen" 39.3 20.6 190 3650 "male" 2007 false false 59.9 "M"
"Adelie" "Torgersen" 38.9 17.8 181 3625 "female" 2007 false false 56.7 "F"
"Adelie" "Torgersen" 39.2 19.6 195 4675 "male" 2007 false true 58.8 "M"
"Adelie" "Torgersen" 34.1 18.1 193 3475 null 2007 false false 52.2 null
"Adelie" "Torgersen" 42.0 20.2 190 4250 null 2007 false true 62.2 null
"Adelie" "Torgersen" 37.8 17.1 186 3300 null 2007 false false 54.9 null
"Adelie" "Torgersen" 37.8 17.3 180 3700 null 2007 false false 55.1 null
"Chinstrap" "Dream" 45.2 16.6 191 3250 "female" 2009 true false 61.8 "F"
"Chinstrap" "Dream" 49.3 19.9 203 4050 "male" 2009 true true 69.2 "M"
"Chinstrap" "Dream" 50.2 18.8 202 3800 "male" 2009 true false 69.0 "M"
"Chinstrap" "Dream" 45.6 19.4 194 3525 "female" 2009 true false 65.0 "F"
"Chinstrap" "Dream" 51.9 19.5 206 3950 "male" 2009 true false 71.4 "M"
"Chinstrap" "Dream" 46.8 16.5 189 3650 "female" 2009 true false 63.3 "F"
"Chinstrap" "Dream" 45.7 17.0 195 3650 "female" 2009 true false 62.7 "F"
"Chinstrap" "Dream" 55.8 19.8 207 4000 "male" 2009 true true 75.6 "M"
"Chinstrap" "Dream" 43.5 18.1 202 3400 "female" 2009 true false 61.6 "F"
"Chinstrap" "Dream" 49.6 18.2 193 3775 "male" 2009 true false 67.8 "M"
"Chinstrap" "Dream" 50.8 19.0 210 4100 "male" 2009 true true 69.8 "M"
"Chinstrap" "Dream" 50.2 18.7 198 3775 "female" 2009 true false 68.9 "F"

I’ll likely add more examples here as I encounter common use cases.

Summarizing DataFrames

Most summarizing workflows will involve the groupby() method, followed by some other operation.

Probably the most common thing we’ll want to do is count by group

penguins.groupby("island").count()
shape: (3, 2)
island count
str u32
"Dream" 124
"Biscoe" 168
"Torgersen" 52

We can also group by multiple columns

penguins.groupby(["island", "species"]).count()
shape: (5, 3)
island species count
str str u32
"Torgersen" "Adelie" 52
"Biscoe" "Gentoo" 124
"Dream" "Chinstrap" 68
"Dream" "Adelie" 56
"Biscoe" "Adelie" 44

Another common thing might be to take the average of a numeric column by group. We can do this via the agg() method and passing various aggregation/summarization functions to this method.

penguins.groupby("species").agg(pl.mean("bill_length_mm").alias("bill_length_mean"))
shape: (3, 2)
species bill_length_mean
str f64
"Chinstrap" 48.833824
"Adelie" 38.791391
"Gentoo" 47.504878

We might also want to get the average of all of our numeric types by group. This is another situation where the polars.selectors module is helpful.

penguins.groupby("species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean().suffix("_mean"))
shape: (3, 6)
species bill_length_mm_mean bill_depth_mm_mean flipper_length_mm_mean body_mass_g_mean year_mean
str f64 f64 f64 f64 f64
"Adelie" 38.791391 18.346358 189.953642 3700.662252 2008.013158
"Chinstrap" 48.833824 18.420588 195.823529 3733.088235 2007.970588
"Gentoo" 47.504878 14.982114 217.186992 5076.01626 2008.080645

Or we might want to apply multiple aggregation functions to a column. This works basically like with_columns() in that you can pass multiple expressions

penguins.groupby("species").agg(
    pl.mean("bill_length_mm").alias("bill_length_mean"),
    pl.col("bill_length_mm").count().alias("n"),
)
shape: (3, 3)
species bill_length_mean n
str f64 u32
"Chinstrap" 48.833824 68
"Gentoo" 47.504878 124
"Adelie" 38.791391 152

Reshaping Long and Wide

Let’s take our summary from before where we calculated the average of all of the numeric columns by species

species_means = penguins.groupby("species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean())

Now suppose we want to pivot this longer so that it’s in “tidy” format. To do this, we want the melt() method.

species_long = species_means.melt(id_vars="species")

species_long.head()
shape: (5, 3)
species variable value
str str f64
"Chinstrap" "bill_length_mm… 48.833824
"Adelie" "bill_length_mm… 38.791391
"Gentoo" "bill_length_mm… 47.504878
"Chinstrap" "bill_depth_mm" 18.420588
"Adelie" "bill_depth_mm" 18.346358

Note that in the above, there’s also a value_vars parameter. By leaving it empty, it defaults to every column not in id_vars, but this might not always be the right choice.

And if we want to reshape back to wide, we use pivot(). Pivot can optionally perform aggregations while reshaping, so we need to tell it that we don’t want it to aggregate anything (by passing aggregate_function=None)

 species_long.pivot(values="value", index="species", columns="variable", aggregate_function=None)
shape: (3, 6)
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
str f64 f64 f64 f64 f64
"Chinstrap" 48.833824 18.420588 195.823529 3733.088235 2007.970588
"Adelie" 38.791391 18.346358 189.953642 3700.662252 2008.013158
"Gentoo" 47.504878 14.982114 217.186992 5076.01626 2008.080645