The Science of Bad Data
CEO, Stochastic Solutions Limited; Visiting Professor, Department of Mathematics, University of Edinburgh; Organizer, PyData Edinburgh
Abstract: Despite the ever-growing focus on machine learning, AI and data science, there is no agreed and widely adopted way of testing analytical algorithms and data pipelines. Yet everyone knows that bad inputs, incorrect outputs and problems arising from changes to the format or content of data can break processes as surely as an unexpected item in the bagging area.
This talk will outline an approach to testing data, data processes and data pipelines that we call test-driven data analysis. It is both a methodology and a set of open-source Python libraries that can be used to check data and processes during development and deployment. It is inspired by test-driven development, the practice of writing, maintaining and running software tests continually, including before functionality is implemented.
The talk will explain AI approaches that assist in developing constraints for checking data, in order to detect problems with range, distribution, type, formatting, string content, date ranges, relative date ranges, duplicates, foreign-key consistency and more. We will discuss both functionality already available in the library, including automatic generation of regular expressions to characterise text fields, and further functionality under development in the lab, including automatic detection of likely outliers.
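As a rough illustration of the idea of constraint discovery and verification, the sketch below infers a few simple constraints (type, nullability, numeric range, absence of duplicates) from example records and then checks new records against them. It uses only the Python standard library and is not the API of the libraries described in the talk; the function names discover_constraints and verify, the constraint keys and the sample records are all hypothetical, chosen for this example.

```python
def discover_constraints(rows):
    """Infer simple per-field constraints from example records (a list of dicts).

    Hypothetical helper for illustration; assumes scalar, hashable field values.
    """
    constraints = {}
    fields = sorted({name for row in rows for name in row})
    for name in fields:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        types = {type(v).__name__ for v in present}
        c = {
            "type": types.pop() if len(types) == 1 else "mixed",
            "allow_null": len(present) < len(values),
            "no_duplicates": len(set(present)) == len(present),
        }
        # Record a range only for fields that are consistently numeric.
        if present and c["type"] in ("int", "float"):
            c["min"] = min(present)
            c["max"] = max(present)
        constraints[name] = c
    return constraints


def verify(rows, constraints):
    """Return a list of (field, constraint) failures; an empty list means all passed."""
    failures = []
    for name, c in constraints.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        if not c["allow_null"] and len(present) < len(values):
            failures.append((name, "allow_null"))
        if c["type"] != "mixed" and any(type(v).__name__ != c["type"] for v in present):
            failures.append((name, "type"))
        elif "min" in c and present:
            # Range checks run only when every value has the expected type.
            if min(present) < c["min"]:
                failures.append((name, "min"))
            if max(present) > c["max"]:
                failures.append((name, "max"))
        if c["no_duplicates"] and len(set(present)) != len(present):
            failures.append((name, "no_duplicates"))
    return failures


# Made-up records: constraints are discovered from known-good data,
# then used to check a later batch.
good = [
    {"id": 1, "age": 34, "code": "AB"},
    {"id": 2, "age": 51, "code": "CD"},
    {"id": 3, "age": 27, "code": "AB"},
]
bad = [
    {"id": 4, "age": -5, "code": "EF"},
    {"id": 4, "age": None, "code": 7},
]

constraints = discover_constraints(good)
failures = verify(bad, constraints)
for field, constraint in failures:
    print(f"{field}: failed {constraint}")
```

In this sketch the discovered constraints are deliberately tight (for example, the observed minimum and maximum become hard limits), so in practice one would expect to review and relax them before using them to monitor a pipeline.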
As the scope of data-driven decision-making expands ever further, the importance of self-monitoring and of detecting change, bad inputs and incorrect outputs becomes ever greater. This talk will offer practical tools and methodologies to support that monitoring and detection.
Bio: Nick is a practising data scientist with over 30 years' experience, from neural networks and genetic algorithms on parallel systems in the late 1980s, through parallel machine learning and 3D visualisation software as a founder of Quadstone, from 1995, to novel modelling methods (e.g. uplift modelling) in the early 2000s. Since 2007, he has run Edinburgh data science specialists Stochastic Solutions.
Nick enjoys using his deep knowledge of underlying algorithms to fashion tailored solutions to practical business problems for clients including Barclays, Sainsbury's, T-Mobile and Skyscanner, and has a particular interest in testing and correctness in data science.