Pandera: Going Beyond Pandas Data Validation

Niels Bantilan


Data quality remains a core concern for practitioners in machine learning, data science, and data engineering, and many specialized packages have emerged to meet the need for validating and monitoring data and models. However, as the open source community creates new data processing frameworks (notably, highly performant new entrants such as Polars), existing data quality frameworks need to catch up to support them, and in some cases the broader Python community creates new data validation libraries for these frameworks. This paper outlines the motivation and challenges that took pandera from being a pandas-only data validation framework (Bantilan, 2020) to one that is extensible to other dataframe-like libraries that are not pandas-compliant. It also provides a case study of the technical and organizational challenges associated with expanding the scope of a library beyond its original boundaries.


data validation, data testing, data science, machine learning, data engineering


