

You are part of a data science team at a product company. Your team has a number of machine learning models in place. Their outputs guide critical business decisions and feed a couple of dashboards displaying important KPIs that your executives watch closely day and night.

On that fateful day, you had just brewed yourself a cup of coffee and were about to begin your workday when the universe collapsed. Everyone at the company went crazy. The business metrics dashboard was displaying what seemed to be random numbers (except every full hour, when the KPIs looked okay for a short while), and the models were predicting the company’s insolvency looming fast. What was worse, every attempt to resolve this madness resulted in your data engineering and research teams reporting new broken services and models.

That was the debt collection day and the unpaid debt was of the worst kind: pipeline debt. How did it accumulate? Let’s go back a few months.

What is pipeline debt?

You were just about to start a new exciting machine learning project. You had located some useful data scattered around your company’s databases, feature stores, and spreadsheets. To make the data usable, you constructed a data pipeline: a set of jobs and Python functions that ingest, process, clean and combine all these data. Finally, your pipeline feeds the data into a machine learning model. We can depict the entire process schematically as follows.

Simple data pipelines are manageable | Source: Author

Your data pipeline worked well: it kept populating the downstream machine learning model with data, which the model was turning into accurate predictions. However, the model, deployed as a service in the cloud, was not very actionable on its own. To make it more useful, you built a set of dashboards presenting the model’s output, as well as important KPIs, to the business stakeholders, thus deepening your pipeline.

Existing pipelines are likely to be extended | Source: Author

One day, you were telling a colleague from the research team about your project over lunch. They got quite excited and decided to do something similar with their data, making the company’s overall data pipeline wider and crossing team borders.

The more pipelines there are, the more complex the system gets | Source: Author

A few weeks later, the two of you got to chat again. As you learned what the research team was up to, you both noticed how useful and valuable it would be if your two teams used each other’s data to power your respective models and analyses. Upon implementing this idea, the company’s data pipeline looked like this.

If many pipelines exist, they will inevitably blend | Source: Author

The pictures above should have set alarm bells ringing already: what they show is pipeline debt accumulating.

Pipeline debt is technical debt in data pipelines. It arises when your data pipeline is triple-U: 

  1. Undocumented
  2. Untested
  3. Unstable

Pipeline debt comes in many flavors, but all of them share one characteristic: the system is entangled, meaning that a change in one place can derail a different process elsewhere. This makes code refactoring and debugging exceptionally hard.

For a software engineer, this will sound like a solved problem, and the solution is called automated testing. However, testing software is very different from testing data in two major ways:

  1. First, while you have full control over your code and can change it when it doesn’t work, you can’t always change your data; in many cases, you are merely an observer watching the data as it comes, generated by some real-world process.
  2. Second, software code is always right or wrong: either it does what it is designed to do, or it doesn’t. Data is never right or wrong. It can only be suitable or unsuitable for a particular purpose. This is why automated testing needs a special approach when data is involved.

Automated testing: expectations to the rescue 

Automated testing tailored for data pipelines is the premise of Great Expectations, a widely used open-source Python package for data validation.

Always know what to expect from your data | Source: Author

Developed by Superconductive and first published in 2018, Great Expectations comes with the tagline “Always know what to expect from your data” and this is exactly what it offers.

The package is built around the concept of an expectation. An expectation can be thought of as a unit test for data: a declarative statement that describes a property of a dataset in simple, human-readable language.

For example, to assert that the values of the column “num_complaints” in some table are between one and five, you can write:

expect_column_values_to_be_between(
    column="num_complaints",
    min_value=1,
    max_value=5,
)

This statement will validate your data and return a success or a failure result. 

As we have already mentioned, you do not always control your data but rather passively observe it flowing. It is often the case that an atypical value pops up in your data from time to time without necessarily being a reason for distress. Great Expectations accommodates this via the “mostly” keyword, which describes how often the expectation has to be met.

expect_column_values_to_be_between(
    column="num_complaints",
    min_value=1,
    max_value=5,
    mostly=0.95,
)

The above statement will return success if at least 95% of “num_complaints” values are between one and five.
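To make the semantics concrete, here is a minimal pure-Python sketch of how a “mostly” threshold behaves. This is an illustration only, not Great Expectations’ actual implementation, and the function name and sample data are made up for the example:

```python
def expect_values_between(values, min_value, max_value, mostly=1.0):
    """Illustrative sketch: succeed when at least a `mostly` fraction
    of the values falls within [min_value, max_value]."""
    in_range = [min_value <= v <= max_value for v in values]
    fraction = sum(in_range) / len(in_range)
    return {"success": fraction >= mostly, "fraction_in_range": fraction}

# Nine of these ten complaint counts are between one and five,
# so a mostly=0.95 check fails (0.9 < 0.95).
complaints = [1, 2, 3, 2, 1, 4, 5, 2, 3, 9]
result = expect_values_between(complaints, 1, 5, mostly=0.95)
```

With mostly=0.9 the same batch would pass; the knob lets you tolerate a known level of noise without silencing the check entirely.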

In order to understand the data well, it is crucial to have some context about why we expect certain properties of it. We can add this context by passing the “meta” parameter to the expectation with any relevant information about how it came to be. Our colleagues, and even our future selves, will thank us for it.

expect_column_values_to_be_between(
    column="num_complaints",
    min_value=1,
    max_value=5,
    mostly=0.95,
    meta={
        "created_by": "Michal",
        "created_on": "28.03.2022",
        "notes": "number of client complaints; more than 5 is unusual "
                 "and likely means something broke",
    }
)

These metadata notes will also form a basis for the data documentation which Great Expectations can just generate out of thin air – but more on this later!

The package contains several dozen expectations to use out of the box, all of them with wordy, human-readable names such as “expect_column_distinct_values_to_be_in_set”, “expect_column_sum_to_be_between”, or “expect_column_kl_divergence_to_be_less_than”. This syntax allows one to clearly state what is expected of the data and why. 

Some expectations apply to individual column values, others to aggregate statistics or entire density distributions. Naturally, the package also makes it easy to create custom expectations when a tailored solution is needed.
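The difference between value-level and aggregate expectations can be sketched in a few lines of plain Python (again, just an illustration of the idea, not GE’s code; the fares are invented):

```python
import statistics

def expect_column_median_to_be_between(values, min_value, max_value):
    """Illustrative aggregate expectation: a single check over one
    statistic of the whole column, not over individual values."""
    median = statistics.median(values)
    return {"success": min_value <= median <= max_value,
            "observed_value": median}

fares = [12.5, 15.0, 8.75, 22.0, 15.0]
result = expect_column_median_to_be_between(fares, 10, 20)
# result["observed_value"] is 15.0, so the check succeeds
```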

Great Expectations works with many different backends. You can evaluate your expectations locally on a Pandas data frame just as easily as on a SQL database (via SQLAlchemy) or on an Apache Spark cluster.

So, how do expectations help to reduce pipeline debt? The answer is manifold.

  1. First, the process of crafting the expectations forces us to sit down and think about our data: its nature, its sources, and what can go wrong with it. This creates a deeper understanding and improves data-related communication within the team.
  2. Second, by clearly stating what we expect from the data, we can detect unusual situations, such as system outages, early on.
  3. Third, by validating new data against a set of pre-existing expectations, we can be sure we don’t feed our machine learning models garbage.
  4. Finally, having the expectations defined brings us very close to having well-maintained data documentation in place.

The list goes on and on.

We will shortly discuss all of the above benefits of knowing one’s expectations and more, but first, let’s set up the GE package!

Getting started with Great Expectations

In the remainder of this article, we will go through GE’s most useful features as well as a couple of clever use cases for the package. To keep it practical, the examples feature a real-life dataset, so before we can delve deep into Great Expectations functionalities, let’s spend a short while discussing the problem setting and the dataset, and how you can install and set up GE on your machine.

Problem setting

We will be looking at the Taxi Trips dataset published by the City of Chicago. The data contain information about each taxi trip reported to the city authorities, such as the trip’s start and end time, cab ID, distance, fare, and pickup and dropoff locations, among others. The original dataset is huge (it has been updated monthly since 2013), so for the purposes of this demonstration, we will limit it to just two days: February 27 and 28, 2022. This amounts to over 13,000 trips.

We will treat the data from February 27 as our company’s existing data, which we will automatically profile to craft the expectations suite. We will then treat the February 28 data as new, incoming data that we will validate against our expectations to make sure there is nothing funky going on there and that we can safely add these new data to our company’s database and utilize them for training machine learning models, for instance. 

Setting up Great Expectations

Let’s start by installing the package. Great Expectations requires Python 3, and you can install it with pip.

pip install great_expectations

The above command installs not only the Python package itself but also the accompanying CLI (command line interface) that offers convenient utilities available from the terminal. We will now use one of them, the init command, to set up the Great Expectations project.

great_expectations init

Having run this command, you should see the following prompt in your terminal window:

Setting up Great Expectations

Press Enter to proceed, and a directory named “great_expectations” will appear in your project’s directory with all the contents displayed on the screenshot above.

All of this is referred to as a Data Context by the package authors. The Data Context contains all the files Great Expectations needs to serve your project properly: various configurations and metadata, plus access to data sources, expectations, and other GE objects. No need to worry about it too much; for now, we can trust the setup wizard that the Data Context has been properly initialized and move on to connecting some data.

Connecting data

To connect a new data source to our data context, we simply need to run the following command in the terminal.

great_expectations datasource new

This will generate three prompts. First, we are asked whether we would like to connect to a filesystem or a database. Since we have our taxi trips data locally as CSV files, we choose the former. The second question is about the processing engine we would like to use: pandas or spark. We go for pandas. Finally, we are asked for the path to our data files, which we need to type.

Connecting data

Providing all the necessary inputs results in a Jupyter notebook being opened. The notebook, called “datasource_new”, provides some boilerplate Python code for configuring our data source. The default settings are good to go, so we don’t need to change anything, except perhaps the data source name in the second code cell. I’ve called mine “trips”.

Create a new pandas Datasource

Having changed the name, we need to run all the notebook cells, which will effectively create our data source. The last cell’s printout should confirm that our “trips” data source exists. Now, with the data source created, we can safely close and delete the notebook.

With the package set up and the data connected, we can dive deep into Great Expectations’ key features!

Key features of Great Expectations

Great Expectations offers three very useful features:

  1. Automated data profiling, to create an expectations suite from the data at hand.
  2. Automatic generation of data documentation.
  3. Validation of new data, to guard against letting funky new data into our databases and machine learning models.

Let’s go through them one by one.

Automated data profiling

Where do the expectations come from? You could draft them manually one by one based on your familiarity with the data and any relevant domain knowledge. A more commonly used approach, however, is to let GE create them automatically by profiling the existing data. It is a quick way to produce a basic set of expectations that we can extend and build upon later.

The automatic profiler considers a couple of basic properties of the data: columns’ types, aggregate statistics such as min, max, or mean, distinct value counts and the number of missing values, among others.
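Conceptually, the profiler first summarizes each column and then turns the summary into candidate expectations. A simplified sketch of the first step (the real profiler is considerably more sophisticated, and the sample column is invented):

```python
import statistics

def profile_column(values):
    """Compute the kind of per-column summary from which candidate
    expectations (ranges, null rates, etc.) can be generated."""
    non_null = [v for v in values if v is not None]
    return {
        "type": type(non_null[0]).__name__,
        "min": min(non_null),
        "max": max(non_null),
        "mean": statistics.mean(non_null),
        "distinct_count": len(set(non_null)),
        "null_count": len(values) - len(non_null),
    }

trip_seconds = [300, 540, None, 300, 1200]
summary = profile_column(trip_seconds)
# e.g. the observed min of 300 and max of 1200 could seed a candidate
# range expectation for the column
```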

In order to run automated profiling, we need to navigate, in the terminal, to the directory where our data context resides and run the following command.

great_expectations suite new

A prompt will pop up asking us to choose the way in which we want our expectations to be created. We go for the last option, the automatic profiler.

Automated data profiling

GE then asks which data to profile. It has detected CSV files available in our data context. As explained earlier, we choose the data from February 27 for profiling.

Automated data profiling

Next, we get two more prompts to go through. First, we need to type the name of our expectations suite, and then confirm all the choices made so far. This will open, as you might have guessed, a Jupyter notebook full of boilerplate code that allows us to customize our expectations.

In the second code cell of the notebook, we can see a variable called “ignored_columns”, defined as a list of all the columns in our data. What we need to do here is comment out the columns that we do want to profile, so that they are no longer ignored. Let’s comment out the trip’s time, distance, and fare columns.

Key features of Great Expectations: select columns

Then, we just need to run the entire notebook to create the expectations suite. Our expectation suite has been saved inside our data context, in the expectations directory, as a JSON file. While we could browse this pretty readable JSON file, it is far more convenient to look at the Data Docs, which should have opened in the browser just as we ran the notebook. This brings us to the second great feature of Great Expectations.

Data documentation 

The package automatically renders an expectation suite into an HTML page that can serve as data documentation: a source of truth about what data is there and what it should look like.

Screenshot from Data Docs | Source: Author

Data Docs feature summary statistics of the data and the expectations that were created based on them. The yellow button in the Action panel on the left-hand side guides us through editing the expectations so that we can fix those that might have been generated incorrectly or add entirely new ones. Feel free to click around and explore this wonderland! Once you’re back, we will move on to the validation of new data.

Validation of new data 

In order to prevent pipeline debt from accumulating, each new portion of data eager to enter your databases, analyses and models ought to be validated. This means that we want to make sure new data meets all the expectations we have generated based on our existing data and/or domain knowledge. To do this with GE, we need to set up what’s called a Checkpoint in the package’s jargon. 

A Checkpoint runs an expectations suite against a batch of data. We can create one by running the great_expectations CLI with the “checkpoint new” command, followed by a checkpoint name of our choice. Here, I’ve called mine “feb_28_checkpoint”.

great_expectations checkpoint new feb_28_checkpoint

This opens yet another configuration notebook. The only cell of importance to us is the second code cell, which defines the “yaml_config” variable. There, we can choose which dataset should be validated (“data_asset_name”) and which expectation suite it should be evaluated against (“expectation_suite_name”). This time, we can leave all the default values as they are: GE has deduced that since we only have two data files and one was used for profiling, we likely want to validate the other.

Key features of Great Expectations: validation of new data 

In order to run our checkpoint, that is, evaluate our expectations on the new data, we just need to uncomment the last two lines of code in the final “Run Your Checkpoint” part of the notebook and run it. This opens Data Docs again, this time showing us the validation results.

For our taxi trips data, many expectations have failed. Some of these failures could have been expected: for example, based on the February 27 data, we’ve created an expectation stating the median fare should be greater than or equal to 21, while for the February 28 data the median fare is 15. It’s not so surprising to see the median fare differ between days.

Some expectations have failed | Source: Author

This example stresses the importance of carefully analyzing and drafting the expectation suite. The values generated by the automated profiler should be treated as a starting point rather than a ready-made suite.

Working with expectations manually

In the previous sections, we have made use of the CLI, aided by Jupyter notebooks. However, it is possible to create and edit expectations manually. 

As mentioned earlier, the expectation suite is merely a JSON file containing expectations in the format that we saw at the beginning of this article, for example:

{
  "expectation_type": "expect_column_values_to_not_be_null",
  "kwargs": {
    "column": "Trip Seconds",
    "mostly": 0.99
  },
  "meta": {}
}

We can use any text editor to modify these expectations or add new ones. This way of interacting with GE is pretty handy, especially when you have a large expectation suite or a couple of them in place.
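Since the suite is plain JSON, the standard library is enough to script bulk edits. A sketch with a simplified suite structure (the real file carries additional metadata keys, and the “Fare” expectation is invented for the example):

```python
import json

# A minimal suite as it might appear on disk (structure simplified).
suite_json = """
{
  "expectation_suite_name": "trips_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "Trip Seconds", "mostly": 0.99},
      "meta": {}
    }
  ]
}
"""

suite = json.loads(suite_json)

# Append a new expectation; the edited suite can then be written back
# to great_expectations/expectations/<suite_name>.json with json.dump.
suite["expectations"].append({
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "Fare", "min_value": 0, "max_value": 1000},
    "meta": {"notes": "negative fares indicate a data entry error"},
})
```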

Use cases for Great Expectations

In the previous sections, we have gone through a pretty standard workflow for building up a data validation process. Let’s revisit it briefly:

  1. We use the automated data profiler to create an expectations suite based on our existing data.
  2. We carefully analyze the expectations, fixing them and adding more as needed (we didn’t really do this in this tutorial, as it is a very data- and domain-specific process, but it is a crucial step in practice!). At this step, we might discover some interesting or dangerous properties of our data. This is the right time to clarify them.
  3. We run each new incoming batch of data against our expectation suite and only allow it further into our data pipelines if it passes the validation. If it fails, we try to understand why: is the new data skewed, or did our expectations not accommodate some corner case?
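The last step amounts to a gate in the pipeline. Here is a sketch of what such a gate could look like; the result dictionary is a simplified stand-in for a checkpoint result, not GE’s exact schema, and load_fn/quarantine_fn are hypothetical handlers you would supply:

```python
def admit_batch(validation_result, load_fn, quarantine_fn):
    """Pass the new batch downstream only when validation succeeded;
    otherwise hand the failed checks over for inspection."""
    if validation_result["success"]:
        return load_fn()
    failed = [r for r in validation_result["results"] if not r["success"]]
    return quarantine_fn(failed)

result = {"success": False,
          "results": [{"success": True, "name": "fare_range"},
                      {"success": False, "name": "median_fare"}]}
outcome = admit_batch(result,
                      load_fn=lambda: "loaded",
                      quarantine_fn=lambda failed: ("quarantined", failed))
```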

Now that we know how to work with Great Expectations to validate data, let’s discuss a couple of specific use cases in which investing time in GE pays back a great deal.

Detecting data drift

A notorious danger to machine learning models deployed in production is data drift: a situation in which the distribution of model inputs changes. This can happen for a multitude of reasons: data-collecting devices tend to break or have their software updated, which impacts the way data is recorded. If the data is produced by humans, it is even more volatile, as fashions and demographics evolve quickly.

Data drift constitutes a serious problem for machine learning models. It can make the decision boundaries learned by the algorithm invalid for the new-regime data, which has a detrimental impact on the model’s performance.

Enter data validation.

In situations where data drift could be of concern, just create expectations about the model input features that validate their long-term trend, average values, or historic range and volatility. As soon as the world changes and your incoming data starts to look different, GE will alert you by spitting out an array of failed tests!
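A drift check of this kind boils down to comparing a batch statistic against its historic distribution. An illustrative pure-Python sketch (the three-sigma rule and all numbers are arbitrary choices for the example):

```python
import statistics

def expect_mean_within_historic_range(values, historic_mean,
                                      historic_std, n_sigmas=3.0):
    """Flag a batch whose mean drifts more than n_sigmas historic
    standard deviations away from the historic mean."""
    batch_mean = statistics.mean(values)
    drift = abs(batch_mean - historic_mean)
    return {"success": drift <= n_sigmas * historic_std,
            "batch_mean": batch_mean}

# Historic fare mean of 18.0 with std 1.5; a batch averaging 30.0
# is far outside the expected range, so the check fails.
new_fares = [29.0, 31.0, 30.5, 29.5]
result = expect_mean_within_historic_range(new_fares, 18.0, 1.5)
```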

Preventing outliers from distorting model outputs

Another threat to models deployed in production, somewhat similar to data drift, is outliers. What happens to a model’s output when it gets an unusual value as input, typically a very high or very low one? If the model has not seen such an extreme value during training, an honest answer would be: I don’t know what the prediction should be!

Unfortunately, machine learning models are not this honest. Much to the contrary: the model will likely produce some output that will be highly unreliable without any warning.

Fortunately, one can easily prevent it with a proper expectations suite! Just set allowed ranges for the model’s input features based on what it has seen in training to make sure you are not making predictions based on outliers.
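In its simplest form, such a guard is just a range check against what the model saw during training. A sketch with a made-up feature and range:

```python
def expect_values_in_training_range(values, train_min, train_max):
    """Reject inputs outside the range the model saw during training
    instead of letting it predict on them blindly."""
    outliers = [v for v in values if not train_min <= v <= train_max]
    return {"success": not outliers, "outliers": outliers}

# Trip distances (in miles) seen during training ranged from 0 to 40,
# so a 350-mile trip is flagged before it reaches the model.
check = expect_values_in_training_range([1.2, 5.0, 350.0], 0, 40)
```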

Preventing pipeline failures from spilling over

Data pipelines do fail sometimes. You might have missed a corner case. Or the power might have gone off for a moment in your server room. Whatever the reason, it happens that a data processing job expecting new files to appear somewhere suddenly finds none.

If this makes the code fail, that’s not necessarily bad. But often it doesn’t: the job succeeds, happily announcing to the downstream systems that your website had 0 visits on the previous day. These data points are then shown on KPI dashboards or, even worse, fed into models that retrain automatically. How can we prevent such a scenario?

With expectations, of course. Simply expect the recent data – for instance, with a fresh enough timestamp – to be there.
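The essence of such a freshness expectation fits in a few lines of standard-library Python; this is a sketch, with a 24-hour window chosen arbitrarily for the example:

```python
from datetime import datetime, timedelta

def expect_fresh_data(timestamps, max_age_hours=24):
    """Fail when no record is recent enough, catching the
    'empty but successful' pipeline run."""
    if not timestamps:
        return {"success": False, "fresh_count": 0}
    cutoff = datetime.utcnow() - timedelta(hours=max_age_hours)
    fresh = [t for t in timestamps if t >= cutoff]
    return {"success": bool(fresh), "fresh_count": len(fresh)}

# A batch whose newest record is two days old fails the check.
stale = [datetime.utcnow() - timedelta(days=2)]
result = expect_fresh_data(stale)
```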

Detecting harmful biases

Bias in machine learning models is a topic that has seen increasing awareness and interest recently. This is crucial, considering how profoundly the models can impact people’s lives. The open question is how to detect these biases and prevent them from doing harm.

While by no means do they provide an ultimate answer, Great Expectations can at least help us in detecting dangerous biases. Fairness in machine learning is a vast and complex topic, so let us focus on two small parts of the big picture: the training data that goes into the model, and the predictions produced by it for different test inputs.

When it comes to the training data, we want it to be fair and unbiased, whatever that means in our particular case. If the data is about users, for instance, you might want to include users from various geographies in appropriate proportions, matching their global population. Whether or not this is the case can be checked by validating each batch of training data against an appropriate expectations suite before the data is allowed to be used for training.

As for the model’s output, we might want it, for instance, to produce the same predictions for both women and men if their remaining characteristics are the same. To ensure this, just test the model on a hold-out test set and run the results against a pre-crafted suite of expectations.
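One simple instance of such a check is demographic parity on a hold-out set: compare positive-prediction rates across groups and fail when the gap is too wide. A sketch with made-up predictions and an arbitrary five-percentage-point threshold (real fairness auditing involves much more than this single metric):

```python
def expect_demographic_parity(predictions_a, predictions_b, max_gap=0.05):
    """Compare positive-prediction rates of two groups and fail
    when they differ by more than max_gap."""
    rate_a = sum(predictions_a) / len(predictions_a)
    rate_b = sum(predictions_b) / len(predictions_b)
    gap = abs(rate_a - rate_b)
    return {"success": gap <= max_gap, "gap": gap}

# 1 = positive decision (e.g. loan approved) on matched test inputs.
women = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]   # 60% approved
men = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]     # 90% approved
result = expect_demographic_parity(women, men)  # the 30-point gap fails
```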

Improving team communication and data understanding

Last but not least, let me give you an example of a very creative usage of Great Expectations that I heard about from James Campbell, one of the package’s authors, when he was interviewed on the Data Engineering Podcast.

Namely, you could start off by creating an empty expectations suite, that is: list all the columns, but don’t impose any checks on their values yet. Then, get together the people who know the data or the business processes involved and ask them: What is the maximal monthly churn rate that would be worrisome? How low does the website stickiness have to fall to trigger an alert? Such conversations can improve data-related communication between teams and deepen the company’s understanding of the data itself.

Additional resources

Thanks for reading! To learn more about the Great Expectations package, including how to use it with Apache Spark or relational databases, or how to write custom expectations, do check out the package’s official documentation. It is really well-written and pleasant to read. You might also enjoy listening to the already mentioned interview with one of GE’s authors, and if you are looking for a shorter resource, check out this presentation from the Head of Product at Superconductive, the company behind Great Expectations. Finally, may you always know what to expect from your data!

