pandera: Statistical Data Validation of Pandas Dataframes
Video: https://youtu.be/PxTLD-ueNd4
Abstract
pandas is an essential tool in the data scientist’s toolkit for modern
data engineering, analysis, and modeling in the Python ecosystem. However,
dataframes can often be difficult to reason about in terms of their data
types and statistical properties as data is reshaped from its raw form to
one that’s ready for analysis. Here, I introduce pandera, an open source
package that provides a flexible and expressive data validation API designed
to make it easy for data wranglers to define dataframe schemas. These
schemas execute logical and statistical assertions at runtime so that
analysts can spend less time worrying about the correctness of their
dataframes and more time obtaining insights and training models.
Keywords: data validation, data engineering
DOI: 10.25080/Majora-342d178e-010