Published on: 11th October 2022
This post was created while writing my Data Analysis with Polars course. Check it out on Udemy
Polars is backed by Apache Arrow rather than Numpy. One argument you hear against working in Polars is that you’ll have to convert back to Numpy to fit ML models.
Does this argument against using Polars and Apache Arrow libraries hold water?
Nope - it’s not true now and will be more invalid over time.
Let’s take a Polars dataframe of the Titanic data for an example.
import polars as pl
import xgboost as xgb
df = pl.read_csv(csvPath)
X = (
df
.select(["Pclass"])
.to_dummies()
.to_arrow()
)
y = df["Survived"]
model = xgb.XGBClassifier(objective='binary:logistic')
model.fit(X, y)
df = pl.concat([
df,
pl.DataFrame(model.predict_proba(X)[:,1],columns=["pos"])
],
how="horizontal"
)
No Numpy or Pandas required.
We can do this because XGBoost introduced support for Arrow in recent months. Other ML and feature engineering libraries are working on Arrow support as well.
In addition, if your library does need a Numpy array then it’s often quicker to load and pre-process your data in Polars and then convert to a Numpy array at the last minute rather than using Pandas.
Want to know more about Polars for high performance data science and ML? Then you can:
or let me know if you would like a Polars workshop for your organisation.