Setup & Read in Data

import polars.selectors as cs
import polars as pl

# read in data
penguins = pl.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv",
    null_values="NA",
)

Check Properties of a Dataframe

use the shape attribute to check the dimensions (rows, cols)

penguins.shape

(344, 8)

use the head() method to see the first few rows

penguins.head()

shape: (5, 8)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
str	str	f64	f64	i64	i64	str	i64
"Adelie"	"Torgersen"	39.1	18.7	181	3750	"male"	2007
"Adelie"	"Torgersen"	39.5	17.4	186	3800	"female"	2007
"Adelie"	"Torgersen"	40.3	18.0	195	3250	"female"	2007
"Adelie"	"Torgersen"	null	null	null	null	null	2007
"Adelie"	"Torgersen"	36.7	19.3	193	3450	"female"	2007

to see the column names, we can use the columns attribute

penguins.columns

['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex',
 'year']

Subset Rows

For the most part, we want to subset rows using the filter() method.

For example, we might want to filter out rows that aren’t missing data for the body mass variable

penguins.filter(pl.col("body_mass_g").is_not_null())

shape: (342, 8)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
str	str	f64	f64	i64	i64	str	i64
"Adelie"	"Torgersen"	39.1	18.7	181	3750	"male"	2007
"Adelie"	"Torgersen"	39.5	17.4	186	3800	"female"	2007
"Adelie"	"Torgersen"	40.3	18.0	195	3250	"female"	2007
"Adelie"	"Torgersen"	36.7	19.3	193	3450	"female"	2007
"Adelie"	"Torgersen"	39.3	20.6	190	3650	"male"	2007
"Adelie"	"Torgersen"	38.9	17.8	181	3625	"female"	2007
"Adelie"	"Torgersen"	39.2	19.6	195	4675	"male"	2007
"Adelie"	"Torgersen"	34.1	18.1	193	3475	null	2007
"Adelie"	"Torgersen"	42.0	20.2	190	4250	null	2007
"Adelie"	"Torgersen"	37.8	17.1	186	3300	null	2007
"Adelie"	"Torgersen"	37.8	17.3	180	3700	null	2007
"Adelie"	"Torgersen"	41.1	17.6	182	3200	"female"	2007
…	…	…	…	…	…	…	…
"Chinstrap"	"Dream"	45.2	16.6	191	3250	"female"	2009
"Chinstrap"	"Dream"	49.3	19.9	203	4050	"male"	2009
"Chinstrap"	"Dream"	50.2	18.8	202	3800	"male"	2009
"Chinstrap"	"Dream"	45.6	19.4	194	3525	"female"	2009
"Chinstrap"	"Dream"	51.9	19.5	206	3950	"male"	2009
"Chinstrap"	"Dream"	46.8	16.5	189	3650	"female"	2009
"Chinstrap"	"Dream"	45.7	17.0	195	3650	"female"	2009
"Chinstrap"	"Dream"	55.8	19.8	207	4000	"male"	2009
"Chinstrap"	"Dream"	43.5	18.1	202	3400	"female"	2009
"Chinstrap"	"Dream"	49.6	18.2	193	3775	"male"	2009
"Chinstrap"	"Dream"	50.8	19.0	210	4100	"male"	2009
"Chinstrap"	"Dream"	50.2	18.7	198	3775	"female"	2009

Basically, filter keeps rows where the expression evaluates to True. And so we can use any predicate expression that results in a boolean. We can also use multiple expressions.

penguins.filter((pl.col("species") == "Adelie") & (pl.col("bill_length_mm") >= 39.0))

shape: (72, 8)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
str	str	f64	f64	i64	i64	str	i64
"Adelie"	"Torgersen"	39.1	18.7	181	3750	"male"	2007
"Adelie"	"Torgersen"	39.5	17.4	186	3800	"female"	2007
"Adelie"	"Torgersen"	40.3	18.0	195	3250	"female"	2007
"Adelie"	"Torgersen"	39.3	20.6	190	3650	"male"	2007
"Adelie"	"Torgersen"	39.2	19.6	195	4675	"male"	2007
"Adelie"	"Torgersen"	42.0	20.2	190	4250	null	2007
"Adelie"	"Torgersen"	41.1	17.6	182	3200	"female"	2007
"Adelie"	"Torgersen"	42.5	20.7	197	4500	"male"	2007
"Adelie"	"Torgersen"	46.0	21.5	194	4200	"male"	2007
"Adelie"	"Biscoe"	40.6	18.6	183	3550	"male"	2007
"Adelie"	"Biscoe"	40.5	17.9	187	3200	"female"	2007
"Adelie"	"Biscoe"	40.5	18.9	180	3950	"male"	2007
…	…	…	…	…	…	…	…
"Adelie"	"Torgersen"	39.0	17.1	191	3050	"female"	2009
"Adelie"	"Torgersen"	44.1	18.0	210	4000	"male"	2009
"Adelie"	"Torgersen"	43.1	19.2	197	3500	"male"	2009
"Adelie"	"Dream"	41.1	17.5	190	3900	"male"	2009
"Adelie"	"Dream"	40.2	20.1	200	3975	"male"	2009
"Adelie"	"Dream"	39.7	17.9	193	4250	"male"	2009
"Adelie"	"Dream"	40.2	17.1	193	3400	"female"	2009
"Adelie"	"Dream"	40.6	17.2	187	3475	"male"	2009
"Adelie"	"Dream"	40.7	17.0	190	3725	"male"	2009
"Adelie"	"Dream"	39.0	18.7	185	3650	"male"	2009
"Adelie"	"Dream"	39.2	18.6	190	4250	"male"	2009
"Adelie"	"Dream"	41.5	18.5	201	4000	"male"	2009

We might also want to filter based on string matches. For example, strings that start with a certain substring

penguins.filter(pl.col("species").str.starts_with("Chin"))

shape: (68, 8)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
str	str	f64	f64	i64	i64	str	i64
"Chinstrap"	"Dream"	46.5	17.9	192	3500	"female"	2007
"Chinstrap"	"Dream"	50.0	19.5	196	3900	"male"	2007
"Chinstrap"	"Dream"	51.3	19.2	193	3650	"male"	2007
"Chinstrap"	"Dream"	45.4	18.7	188	3525	"female"	2007
"Chinstrap"	"Dream"	52.7	19.8	197	3725	"male"	2007
"Chinstrap"	"Dream"	45.2	17.8	198	3950	"female"	2007
"Chinstrap"	"Dream"	46.1	18.2	178	3250	"female"	2007
"Chinstrap"	"Dream"	51.3	18.2	197	3750	"male"	2007
"Chinstrap"	"Dream"	46.0	18.9	195	4150	"female"	2007
"Chinstrap"	"Dream"	51.3	19.9	198	3700	"male"	2007
"Chinstrap"	"Dream"	46.6	17.8	193	3800	"female"	2007
"Chinstrap"	"Dream"	51.7	20.3	194	3775	"male"	2007
…	…	…	…	…	…	…	…
"Chinstrap"	"Dream"	45.2	16.6	191	3250	"female"	2009
"Chinstrap"	"Dream"	49.3	19.9	203	4050	"male"	2009
"Chinstrap"	"Dream"	50.2	18.8	202	3800	"male"	2009
"Chinstrap"	"Dream"	45.6	19.4	194	3525	"female"	2009
"Chinstrap"	"Dream"	51.9	19.5	206	3950	"male"	2009
"Chinstrap"	"Dream"	46.8	16.5	189	3650	"female"	2009
"Chinstrap"	"Dream"	45.7	17.0	195	3650	"female"	2009
"Chinstrap"	"Dream"	55.8	19.8	207	4000	"male"	2009
"Chinstrap"	"Dream"	43.5	18.1	202	3400	"female"	2009
"Chinstrap"	"Dream"	49.6	18.2	193	3775	"male"	2009
"Chinstrap"	"Dream"	50.8	19.0	210	4100	"male"	2009
"Chinstrap"	"Dream"	50.2	18.7	198	3775	"female"	2009

Subset Columns

To subset columns, we want to use the select() method.

penguins.select("species")

shape: (344, 1)

species
str
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
"Adelie"
…
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"
"Chinstrap"

To select multiple columns, we pass in a list

penguins.select(["species", "island"])

shape: (344, 2)

species	island
str	str
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
"Adelie"	"Torgersen"
…	…
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"
"Chinstrap"	"Dream"

To select columns with regex, we can pass regex to pl.col(). Note that we need to indicate this is regex by passing both ^ and $.

penguins.select(pl.col("^bill_.*$"))

shape: (344, 2)

bill_length_mm	bill_depth_mm
f64	f64
39.1	18.7
39.5	17.4
40.3	18.0
null	null
36.7	19.3
39.3	20.6
38.9	17.8
39.2	19.6
34.1	18.1
42.0	20.2
37.8	17.1
37.8	17.3
…	…
45.2	16.6
49.3	19.9
50.2	18.8
45.6	19.4
51.9	19.5
46.8	16.5
45.7	17.0
55.8	19.8
43.5	18.1
49.6	18.2
50.8	19.0
50.2	18.7

and we can also exclude by regex

penguins.select(pl.exclude("^bill_.*$"))

shape: (344, 6)

species	island	flipper_length_mm	body_mass_g	sex	year
str	str	i64	i64	str	i64
"Adelie"	"Torgersen"	181	3750	"male"	2007
"Adelie"	"Torgersen"	186	3800	"female"	2007
"Adelie"	"Torgersen"	195	3250	"female"	2007
"Adelie"	"Torgersen"	null	null	null	2007
"Adelie"	"Torgersen"	193	3450	"female"	2007
"Adelie"	"Torgersen"	190	3650	"male"	2007
"Adelie"	"Torgersen"	181	3625	"female"	2007
"Adelie"	"Torgersen"	195	4675	"male"	2007
"Adelie"	"Torgersen"	193	3475	null	2007
"Adelie"	"Torgersen"	190	4250	null	2007
"Adelie"	"Torgersen"	186	3300	null	2007
"Adelie"	"Torgersen"	180	3700	null	2007
…	…	…	…	…	…
"Chinstrap"	"Dream"	191	3250	"female"	2009
"Chinstrap"	"Dream"	203	4050	"male"	2009
"Chinstrap"	"Dream"	202	3800	"male"	2009
"Chinstrap"	"Dream"	194	3525	"female"	2009
"Chinstrap"	"Dream"	206	3950	"male"	2009
"Chinstrap"	"Dream"	189	3650	"female"	2009
"Chinstrap"	"Dream"	195	3650	"female"	2009
"Chinstrap"	"Dream"	207	4000	"male"	2009
"Chinstrap"	"Dream"	202	3400	"female"	2009
"Chinstrap"	"Dream"	193	3775	"male"	2009
"Chinstrap"	"Dream"	210	4100	"male"	2009
"Chinstrap"	"Dream"	198	3775	"female"	2009

Polars (via the polars.selectors module) also provides utilities similar to tidyverse selectors, such as starts_with()

penguins.select(cs.starts_with("bill"))

shape: (344, 2)

bill_length_mm	bill_depth_mm
f64	f64
39.1	18.7
39.5	17.4
40.3	18.0
null	null
36.7	19.3
39.3	20.6
38.9	17.8
39.2	19.6
34.1	18.1
42.0	20.2
37.8	17.1
37.8	17.3
…	…
45.2	16.6
49.3	19.9
50.2	18.8
45.6	19.4
51.9	19.5
46.8	16.5
45.7	17.0
55.8	19.8
43.5	18.1
49.6	18.2
50.8	19.0
50.2	18.7

Another useful feature is selecting by data type. This is something that polars.selectors enables. For instance, if we wanted to select all numeric columns

penguins.select(cs.by_dtype(pl.NUMERIC_DTYPES))

shape: (344, 5)

bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	year
f64	f64	i64	i64	i64
39.1	18.7	181	3750	2007
39.5	17.4	186	3800	2007
40.3	18.0	195	3250	2007
null	null	null	null	2007
36.7	19.3	193	3450	2007
39.3	20.6	190	3650	2007
38.9	17.8	181	3625	2007
39.2	19.6	195	4675	2007
34.1	18.1	193	3475	2007
42.0	20.2	190	4250	2007
37.8	17.1	186	3300	2007
37.8	17.3	180	3700	2007
…	…	…	…	…
45.2	16.6	191	3250	2009
49.3	19.9	203	4050	2009
50.2	18.8	202	3800	2009
45.6	19.4	194	3525	2009
51.9	19.5	206	3950	2009
46.8	16.5	189	3650	2009
45.7	17.0	195	3650	2009
55.8	19.8	207	4000	2009
43.5	18.1	202	3400	2009
49.6	18.2	193	3775	2009
50.8	19.0	210	4100	2009
50.2	18.7	198	3775	2009

Creating New Columns

To create new columns, we’ll use the with_columns() method, which is similar to dplyr::mutate()

penguins.with_columns(is_chinstrap=pl.col("species") == "Chinstrap")

shape: (344, 9)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	is_chinstrap
str	str	f64	f64	i64	i64	str	i64	bool
"Adelie"	"Torgersen"	39.1	18.7	181	3750	"male"	2007	false
"Adelie"	"Torgersen"	39.5	17.4	186	3800	"female"	2007	false
"Adelie"	"Torgersen"	40.3	18.0	195	3250	"female"	2007	false
"Adelie"	"Torgersen"	null	null	null	null	null	2007	false
"Adelie"	"Torgersen"	36.7	19.3	193	3450	"female"	2007	false
"Adelie"	"Torgersen"	39.3	20.6	190	3650	"male"	2007	false
"Adelie"	"Torgersen"	38.9	17.8	181	3625	"female"	2007	false
"Adelie"	"Torgersen"	39.2	19.6	195	4675	"male"	2007	false
"Adelie"	"Torgersen"	34.1	18.1	193	3475	null	2007	false
"Adelie"	"Torgersen"	42.0	20.2	190	4250	null	2007	false
"Adelie"	"Torgersen"	37.8	17.1	186	3300	null	2007	false
"Adelie"	"Torgersen"	37.8	17.3	180	3700	null	2007	false
…	…	…	…	…	…	…	…	…
"Chinstrap"	"Dream"	45.2	16.6	191	3250	"female"	2009	true
"Chinstrap"	"Dream"	49.3	19.9	203	4050	"male"	2009	true
"Chinstrap"	"Dream"	50.2	18.8	202	3800	"male"	2009	true
"Chinstrap"	"Dream"	45.6	19.4	194	3525	"female"	2009	true
"Chinstrap"	"Dream"	51.9	19.5	206	3950	"male"	2009	true
"Chinstrap"	"Dream"	46.8	16.5	189	3650	"female"	2009	true
"Chinstrap"	"Dream"	45.7	17.0	195	3650	"female"	2009	true
"Chinstrap"	"Dream"	55.8	19.8	207	4000	"male"	2009	true
"Chinstrap"	"Dream"	43.5	18.1	202	3400	"female"	2009	true
"Chinstrap"	"Dream"	49.6	18.2	193	3775	"male"	2009	true
"Chinstrap"	"Dream"	50.8	19.0	210	4100	"male"	2009	true
"Chinstrap"	"Dream"	50.2	18.7	198	3775	"female"	2009	true

We can create multiple columns at a time

penguins.with_columns(
    is_chinstrap=pl.col("species") == "Chinstrap",
    chonk=pl.col("body_mass_g") >= 4000,
    bill_stuff=pl.col("bill_length_mm") + pl.col("bill_depth_mm"),
    sex_initial=pl.col("sex").str.slice(0, 1).str.to_uppercase(),
)

shape: (344, 12)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	is_chinstrap	chonk	bill_stuff	sex_initial
str	str	f64	f64	i64	i64	str	i64	bool	bool	f64	str
"Adelie"	"Torgersen"	39.1	18.7	181	3750	"male"	2007	false	false	57.8	"M"
"Adelie"	"Torgersen"	39.5	17.4	186	3800	"female"	2007	false	false	56.9	"F"
"Adelie"	"Torgersen"	40.3	18.0	195	3250	"female"	2007	false	false	58.3	"F"
"Adelie"	"Torgersen"	null	null	null	null	null	2007	false	null	null	null
"Adelie"	"Torgersen"	36.7	19.3	193	3450	"female"	2007	false	false	56.0	"F"
"Adelie"	"Torgersen"	39.3	20.6	190	3650	"male"	2007	false	false	59.9	"M"
"Adelie"	"Torgersen"	38.9	17.8	181	3625	"female"	2007	false	false	56.7	"F"
"Adelie"	"Torgersen"	39.2	19.6	195	4675	"male"	2007	false	true	58.8	"M"
"Adelie"	"Torgersen"	34.1	18.1	193	3475	null	2007	false	false	52.2	null
"Adelie"	"Torgersen"	42.0	20.2	190	4250	null	2007	false	true	62.2	null
"Adelie"	"Torgersen"	37.8	17.1	186	3300	null	2007	false	false	54.9	null
"Adelie"	"Torgersen"	37.8	17.3	180	3700	null	2007	false	false	55.1	null
…	…	…	…	…	…	…	…	…	…	…	…
"Chinstrap"	"Dream"	45.2	16.6	191	3250	"female"	2009	true	false	61.8	"F"
"Chinstrap"	"Dream"	49.3	19.9	203	4050	"male"	2009	true	true	69.2	"M"
"Chinstrap"	"Dream"	50.2	18.8	202	3800	"male"	2009	true	false	69.0	"M"
"Chinstrap"	"Dream"	45.6	19.4	194	3525	"female"	2009	true	false	65.0	"F"
"Chinstrap"	"Dream"	51.9	19.5	206	3950	"male"	2009	true	false	71.4	"M"
"Chinstrap"	"Dream"	46.8	16.5	189	3650	"female"	2009	true	false	63.3	"F"
"Chinstrap"	"Dream"	45.7	17.0	195	3650	"female"	2009	true	false	62.7	"F"
"Chinstrap"	"Dream"	55.8	19.8	207	4000	"male"	2009	true	true	75.6	"M"
"Chinstrap"	"Dream"	43.5	18.1	202	3400	"female"	2009	true	false	61.6	"F"
"Chinstrap"	"Dream"	49.6	18.2	193	3775	"male"	2009	true	false	67.8	"M"
"Chinstrap"	"Dream"	50.8	19.0	210	4100	"male"	2009	true	true	69.8	"M"
"Chinstrap"	"Dream"	50.2	18.7	198	3775	"female"	2009	true	false	68.9	"F"

I’ll likely add more examples here as I encounter common use cases.

Summarizing DataFrames

Most summarizing workflows will involve the groupby() method, followed by some other operation.

Probably the most common thing we’ll want to do is count by group

penguins.groupby("island").count()

shape: (3, 2)

island	count
str	u32
"Dream"	124
"Biscoe"	168
"Torgersen"	52

We can also group by multiple columns

penguins.groupby(["island", "species"]).count()

shape: (5, 3)

island	species	count
str	str	u32
"Torgersen"	"Adelie"	52
"Biscoe"	"Gentoo"	124
"Dream"	"Chinstrap"	68
"Dream"	"Adelie"	56
"Biscoe"	"Adelie"	44

Another common thing might be to take the average of a numeric column by group. We can do this via the agg() method and passing various aggregation/summarization functions to this method.

penguins.groupby("species").agg(pl.mean("bill_length_mm").alias("bill_length_mean"))

shape: (3, 2)

species	bill_length_mean
str	f64
"Chinstrap"	48.833824
"Adelie"	38.791391
"Gentoo"	47.504878

We might also want to get the average of all of our numeric types by group. This is another situation where the polars.selectors module is helpful.

penguins.groupby("species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean().suffix("_mean"))

shape: (3, 6)

species	bill_length_mm_mean	bill_depth_mm_mean	flipper_length_mm_mean	body_mass_g_mean	year_mean
str	f64	f64	f64	f64	f64
"Adelie"	38.791391	18.346358	189.953642	3700.662252	2008.013158
"Chinstrap"	48.833824	18.420588	195.823529	3733.088235	2007.970588
"Gentoo"	47.504878	14.982114	217.186992	5076.01626	2008.080645

Or we might want to apply multiple aggregation functions to a column. This works basically like with_columns() in that you can pass multiple expressions

penguins.groupby("species").agg(
    pl.mean("bill_length_mm").alias("bill_length_mean"),
    pl.col("bill_length_mm").count().alias("n"),
)

shape: (3, 3)

species	bill_length_mean	n
str	f64	u32
"Chinstrap"	48.833824	68
"Gentoo"	47.504878	124
"Adelie"	38.791391	152

Reshaping Long and Wide

Let’s take our summary from before where we calculated the average of all of the numeric columns by species

species_means = penguins.groupby("species").agg(cs.by_dtype(pl.NUMERIC_DTYPES).mean())

Now suppose we want to pivot this longer so that it’s in “tidy” format. To do this, we want the melt() method.

species_long = species_means.melt(id_vars="species")

species_long.head()

shape: (5, 3)

species	variable	value
str	str	f64
"Chinstrap"	"bill_length_mm…	48.833824
"Adelie"	"bill_length_mm…	38.791391
"Gentoo"	"bill_length_mm…	47.504878
"Chinstrap"	"bill_depth_mm"	18.420588
"Adelie"	"bill_depth_mm"	18.346358

Note that in the above, there’s also a value_vars parameter. By leaving it empty, it defaults to every column not in id_vars, but this might not always be the right choice.

And if we want to reshape back to wide, we use pivot(). Pivot can optionally perform aggregations while reshaping, so we need to tell it that we don’t want it to aggregate anything (by passing aggregate_function=None)

 species_long.pivot(values="value", index="species", columns="variable", aggregate_function=None)

shape: (3, 6)

species	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	year
str	f64	f64	f64	f64	f64
"Chinstrap"	48.833824	18.420588	195.823529	3733.088235	2007.970588
"Adelie"	38.791391	18.346358	189.953642	3700.662252	2008.013158
"Gentoo"	47.504878	14.982114	217.186992	5076.01626	2008.080645