Katie Malone is a Director of Data Science at Civis Analytics, a data science software and services company. She leads a team of diverse data scientists who serve as technical and methodological advisors to the Civis consulting team and who write the core machine learning and data science software that underpins the Civis Data Science Platform. Before working at Civis, she completed a Ph.D. in physics at Stanford, working at CERN on Higgs boson searches. She was also the instructor of Udacity’s Introduction to Machine Learning course and hosts Linear Digressions, a weekly podcast on data science and machine learning.
Tools like Jupyter notebooks are great for getting started with data science and doing exploratory analysis, but they don’t make great reusable software. If you want to rebuild a model, change parameters and compare results, configure your model for different settings, or generally write data science software, you need to expand your toolkit beyond notebooks.
This talk takes you through the process of formalizing a quick notebook-based data analysis and turning it into something more like modularized, tested, reusable data science software. We will start with a quick sprint through a data science problem from a popular online competition website, using primarily pandas and scikit-learn to do some data exploration, modeling, and validation.
Then we’ll take this code and begin to break it apart and rebuild it, this time as a more formal set of Python scripts. We’ll introduce some simple best practices for writing reusable Python code, rewriting the stream-of-consciousness exploratory data science code as a series of functions that are modular and configurable for reuse. We’ll also add some bells and whistles like light data governance, a command-line interface, and unit testing of the data science code.
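As a rough illustration of the kind of refactoring described above, here is a minimal sketch that uses only the Python standard library. It is not code from the talk: the names `SplitConfig`, `split_rows`, `build_parser`, and `main`, along with the `--test-fraction` and `--seed` flags, are hypothetical, and a simple train/test split stands in for whatever a real notebook would do. It shows a hard-coded notebook step turned into a configurable function, a command-line entry point, and a unit test.

```python
import argparse
import random
from dataclasses import dataclass


@dataclass
class SplitConfig:
    """Hypothetical settings that a notebook cell would typically hard-code."""
    test_fraction: float = 0.25
    seed: int = 0


def split_rows(rows, config):
    """Shuffle rows reproducibly and return (train, test) lists."""
    if not 0.0 < config.test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(config.seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * config.test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def build_parser():
    """Expose the configuration as command-line flags (illustrative names)."""
    parser = argparse.ArgumentParser(description="Split a toy dataset.")
    parser.add_argument("--test-fraction", type=float, default=0.25)
    parser.add_argument("--seed", type=int, default=0)
    return parser


def main(argv=None):
    """Parse flags, run the split on placeholder data, and report sizes."""
    args = build_parser().parse_args(argv)
    config = SplitConfig(test_fraction=args.test_fraction, seed=args.seed)
    train, test = split_rows(range(100), config)  # placeholder data
    print(f"train={len(train)} test={len(test)}")
    return train, test


def test_split_rows_is_reproducible_and_complete():
    """A unit test that a runner such as pytest could discover."""
    config = SplitConfig(test_fraction=0.2, seed=42)
    train_a, test_a = split_rows(range(10), config)
    train_b, test_b = split_rows(range(10), config)
    assert (train_a, test_a) == (train_b, test_b)  # same seed, same split
    assert len(test_a) == 2
    assert sorted(train_a + test_a) == list(range(10))  # no rows lost
```

Calling `main(["--test-fraction", "0.1", "--seed", "1"])` runs the same code path a shell invocation would; a real script would typically end with the usual `if __name__ == "__main__": main()` guard. Keeping the logic in `split_rows` rather than in `main` is what makes it straightforward to test in isolation.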
This talk is ideal for intermediate data scientists who have some experience using Python in a notebook and are looking to write more mature and professional data science code. Familiarity with Python and the Python data science stack (pandas, scikit-learn) is assumed. Some familiarity with machine learning algorithms and best practices is also assumed; we will briefly cover several algorithms, metrics, and validation methods, but the talk is aimed mainly at participants who already know those basic ideas and want to build more robust and user-friendly tools with them.