Python in Data Science Research and Education
Randy Paffenroth
Xiangnan Kong
Abstract
In this paper we demonstrate how Python can be used throughout the
entire life cycle of a graduate program in Data Science. In
interdisciplinary fields such as Data Science, students often come
from a variety of backgrounds; for example, some may have strong
mathematical training but less experience in programming. Python’s
ease of use, open source license, and access to a vast array of
libraries make it well suited for such students. We will
discuss how Python, IPython notebooks, scikit-learn, NumPy, SciPy,
and pandas can be used in several phases of graduate Data Science
education, starting from introductory classes (covering topics such
as data gathering, data cleaning, statistics, regression,
classification, machine learning, etc.) and culminating in degree
capstone research projects using more advanced ideas such as convex
optimization, non-linear dimension reduction, and compressed
sensing. Of particular note is the scikit-learn library,
which provides numerous routines for machine learning. Access to
such a library allows interesting problems to be addressed early in
the educational process, and the experience gained with such
“black box” routines provides a firm foundation for the students’ own
software development, analysis, and research later in their
studies.
Keywords: data science, education, machine learning
DOI: 10.25080/Majora-7b98e3ed-019