Pandera: Going Beyond Pandas Data Validation
Niels Bantilan
Data quality remains a core concern for practitioners in machine learning,
data science, and data engineering, and many specialized packages have emerged
to meet the need for validating and monitoring data and models. However, as
the open source community creates new data processing frameworks, notably
highly performant entrants such as Polars, existing data quality frameworks
must catch up to support them, and in some cases the broader Python community
creates entirely new data validation libraries for these frameworks.
This paper outlines the motivation behind pandera and the challenges involved
in taking it from a pandas-only data validation framework [niels_bantilan-proc-scipy-2020]
to one that is extensible to other non-pandas-compliant dataframe-like libraries.
It also provides an informative case study of the technical and organizational
challenges associated with expanding the scope of a library beyond its original
boundaries.
Keywords: data validation, data testing, data science, machine learning, data engineering
DOI: 10.25080/gerudo-f2bc6f59-010