Liam Brannigan


Polars & Huggingface datasets

Published on: 1st November 2022

This post was created while writing my Data Analysis with Polars course. Check it out on Udemy.

One consequence of the Apache Arrow era is that libraries built on Arrow's shared in-memory format integrate with each other more easily.

For example, here we load a Huggingface dataset and pass its underlying Arrow table to Polars to get a dataframe with zero-copy.

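from datasets import load_dataset
import polars as pl

# Load the training split; datasets keeps it as an Arrow table
dataset = load_dataset("rotten_tomatoes", split="train")
# Pass the underlying pyarrow.Table to Polars
df = pl.from_arrow(dataset.data.table)
# Show the first three rows
print(df.head(3))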

shape: (3, 2)
┌───────────────────────────────────────────────────────┬───────┐
│ text                                                  ┆ label │
│ ---                                                   ┆ ---   │
│ str                                                   ┆ i64   │
╞═══════════════════════════════════════════════════════╪═══════╡
│ the rock is destined to be the 21st century's new ... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ the gorgeously elaborate continuation of " the lor... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ effective but too-tepid biopic                        ┆ 1     │
└───────────────────────────────────────────────────────┴───────┘

Hopefully the datasets library will add an explicit to_polars() method.

I’ll be digging into this in more detail: can we combine the memory-mapped files that the datasets library produces with Polars’ new out-of-core capabilities?

Also: please don’t call libraries datasets😂

Learn more

Want to know more about Polars for high performance data science and ML? Let me know if you would like a Polars workshop for your organisation.