Training Course - Workshop
Charged Event - Fee TBC
Registration open soon
This training course aims to bring the ideas and benefits of test-driven development to the arena of data analysis, augmenting and adapting those ideas as appropriate. We will introduce attendees to testing both the data at all stages of processing and the data processes themselves.
We will use the open source Python TDDA (test-driven data analysis) library, available with pip from PyPI and in source form on GitHub, to support this. We will work primarily with data in CSV files and Pandas DataFrames for hands-on work, but will also illustrate testing data in relational databases such as Postgres. The methods and tools are applicable to structured data and data pipelines using any software, not just Python.
There will be two core sections to the course, both supported by the TDDA library.
1. Testing Data Processes and Pipelines
We will introduce the idea of a reference test, which is a lot like a system or integration test for an analytical process. We will show how reference tests can be written for various kinds of analytical processes over different data types, and then run using either Python's built-in unittest module or the popular pytest library. Topics covered will include:
- Motivation for and introduction to testing
- Special considerations for testing analytical software and processes
- Testing outputs that are not always identical (because of version numbers, dates, host names, randomized identifiers, permuted outputs, etc.)
- Automatically saving outputs from failing tests to file
- Support for comparing failing outputs and expected ("reference") outputs with "diff" and similar tools
- Automatic regeneration of reference outputs after verified changes
- Running subsets of tests easily using tagging.
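To make the idea concrete, here is a minimal sketch of a reference test written with plain unittest and the standard library only. It is not the TDDA library's API; the names generate_report, normalise, check_against_reference and VARIABLE_PATTERNS are hypothetical and exist only for illustration. The sketch compares output with a stored reference while ignoring dates and version numbers, and saves failing output to a file so that it can be inspected with diff.

```python
import re
import tempfile
import unittest
from pathlib import Path

# Hypothetical patterns for content that legitimately varies between runs.
VARIABLE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),      # ISO dates
    re.compile(r"version \d+(\.\d+)*"),    # version numbers
]


def normalise(text):
    """Replace run-dependent substrings with a fixed placeholder."""
    for pattern in VARIABLE_PATTERNS:
        text = pattern.sub("<VARIES>", text)
    return text


def generate_report(total, run_date, version):
    """Stand-in for an analytical process that produces text output."""
    return (
        f"Report generated on {run_date} by version {version}\n"
        f"Total sales: {total}\n"
    )


def check_against_reference(actual, ref_path, failure_dir):
    """Return True if actual matches the reference after normalisation.

    On a mismatch, the actual output is written to failure_dir so it can
    be compared with the reference using diff or a similar tool.
    """
    expected = Path(ref_path).read_text()
    if normalise(actual) == normalise(expected):
        return True
    Path(failure_dir, Path(ref_path).name).write_text(actual)
    return False


class TestReport(unittest.TestCase):
    def setUp(self):
        # A real suite would keep reference files under version control;
        # here one is created in a temporary directory so that the sketch
        # is self-contained.
        tmp = tempfile.TemporaryDirectory()
        self.addCleanup(tmp.cleanup)
        self.failure_dir = Path(tmp.name, "failures")
        self.failure_dir.mkdir()
        self.ref_path = Path(tmp.name, "report.txt")
        self.ref_path.write_text(generate_report(100, "2019-01-01", "1.0.0"))

    def test_same_result_different_run_details(self):
        # Only the date and version differ, so the test should pass.
        actual = generate_report(100, "2019-06-30", "1.2.0")
        self.assertTrue(
            check_against_reference(actual, self.ref_path, self.failure_dir))

    def test_changed_result_is_saved_for_diffing(self):
        # The total differs, so the check fails and the output is saved.
        actual = generate_report(999, "2019-06-30", "1.2.0")
        self.assertFalse(
            check_against_reference(actual, self.ref_path, self.failure_dir))
        saved = (self.failure_dir / "report.txt").read_text()
        self.assertEqual(saved, actual)
```

Saved in a file, these tests can be run with either python -m unittest or pytest. Normalising both sides before comparing is only one possible way to handle variable content; it was chosen here because it keeps the sketch short.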
2. Using AI to Generate Constraints from Data and Applying Them to Detect Bad Data
We will introduce the idea of using constraints to check ("verify") data in various ways, including identification of change, outliers, duplicates, missing values, disallowed values, out-of-sequence dates and strings that fail to conform to an expected structure. Advanced string verification methods covered will include automatic generation of regular expressions to characterise patterns in text data using the rexpy module, which is integrated as part of the TDDA library.
We will then show how the TDDA library can be used not only to verify data against constraints and detect failing records and values, but also to generate suitable constraints directly from data.
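The generate-then-verify cycle can be sketched in a few lines of standard-library Python. This is a deliberately simplified illustration and not the TDDA library's API or constraint format; the functions discover_constraints and verify, the constraint names, and the sample records are all hypothetical.

```python
def discover_constraints(records):
    """Derive simple per-field constraints from a list of dict records.

    Assumes each field holds values of a single type (a simplification).
    """
    constraints = {}
    fields = sorted({key for record in records for key in record})
    for field in fields:
        values = [record.get(field) for record in records]
        present = [v for v in values if v is not None]
        c = {
            "type": type(present[0]).__name__ if present else None,
            "allow_nulls": len(present) < len(values),
        }
        if present and isinstance(present[0], (int, float)):
            c["min"] = min(present)
            c["max"] = max(present)
        if present and isinstance(present[0], str):
            c["allowed_values"] = sorted(set(present))
        constraints[field] = c
    return constraints


def verify(records, constraints):
    """Return (row_index, field, reason) for every failing value."""
    failures = []
    for i, record in enumerate(records):
        for field, c in constraints.items():
            value = record.get(field)
            if value is None:
                if not c["allow_nulls"]:
                    failures.append((i, field, "missing value"))
                continue
            if type(value).__name__ != c["type"]:
                failures.append((i, field, "wrong type"))
                continue
            if "min" in c and value < c["min"]:
                failures.append((i, field, "below min"))
            if "max" in c and value > c["max"]:
                failures.append((i, field, "above max"))
            if "allowed_values" in c and value not in c["allowed_values"]:
                failures.append((i, field, "disallowed value"))
    return failures


# Constraints are generated from data believed to be good...
known_good = [
    {"age": 34, "country": "UK"},
    {"age": 51, "country": "FR"},
    {"age": 27, "country": "UK"},
]
constraints = discover_constraints(known_good)

# ...and then used to detect failing records and values in new data.
new_data = [
    {"age": 40, "country": "FR"},    # satisfies all constraints
    {"age": 130, "country": "UK"},   # age outside the observed range
    {"age": 30, "country": "ZZ"},    # country not seen before
    {"age": None, "country": "UK"},  # missing value where none was seen
]
failures = verify(new_data, constraints)
for row, field, reason in failures:
    print(f"row {row}: {field}: {reason}")
```

Constraints generated this way reflect only the data they were generated from, so they would normally be reviewed and loosened or tightened by hand before being relied upon.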
Attendees will need to provide a laptop (Mac, Linux or Windows) with a suitable working Python installation, and with NumPy, Pandas and the TDDA library installed. People are welcome to work in pairs.
- We recommend Python 3.7; Python 2.7 and 3.6 will also work.
- The Pandas library for processing with DataFrames should be installed (pip install pandas); this will also install NumPy if you don't already have it.
- The TDDA library should be installed (pip install tdda). This includes the example data required for the training.
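A quick way to check that the packages listed above can be found by your Python is sketched below. It is an informal sanity check using only the standard library (Python 3), not the official installation test for the course; the helper name missing_modules is hypothetical.

```python
import importlib.util

# Top-level module names for the packages the course requires.
REQUIRED_MODULES = ["numpy", "pandas", "tdda"]


def missing_modules(names):
    """Return the module names that cannot be found by this Python."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed: " + ", ".join(missing))
else:
    print("All required modules were found.")
```

This confirms only that each module can be located, not that it imports and runs correctly.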
Detailed instructions on system configuration will be supplied to registered attendees before the session, as well as instructions for how to test the installation.
The course is primarily aimed at practising data scientists with some familiarity with Python, or programmers coming to data science. Previous experience of testing and Pandas will be advantageous but is not required. As noted above, although the library used is written in Python, the data testing is almost entirely language-neutral, and even the testing of data processes can be applied to other languages from within a Python test script.
Non-programmers with an interest in QA for data and data processes will also benefit from some of the overview material, and are welcome to attend, but may struggle with the hands-on parts of the course.
TARGET AUDIENCE SIZE
max 50; viable at 10+