We all want our models to generalize well so that they perform at their best on unseen data. To meet that demand, we often rely on cross-validation in our machine learning projects, a resampling procedure used to evaluate machine learning models on limited data samples. It can be a nightmare to realize that something is wrong with your cross-validation strategy after you have spent all the time in the world tuning your model with it.

In this article, we will cover the seven most common mistakes people commit when using cross-validation and how you can avoid them. Let’s get started with a brief introduction to cross-validation.

What is cross-validation?

Cross-validation is a statistical method used to evaluate the performance of machine learning models before they are put to use. It involves the following steps:

1. First, we divide the dataset into k folds.
2. One of the k folds is used for testing while the remaining k-1 folds are used for training.
3. This procedure is repeated k times, so that each fold serves once as the test set, and the k scores are averaged to estimate the mean performance of the model.

This method results in a performance estimate with less bias.
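The steps above can be sketched in a few lines. The toy dataset and the logistic regression model below are placeholder choices for illustration:

```python
# A minimal sketch of the k-fold procedure described above,
# using scikit-learn's KFold on a synthetic classification dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, random_state=0)
kfold = KFold(n_splits=5)

scores = []
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                 # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold

print(f"Mean accuracy over {kfold.get_n_splits()} folds: {np.mean(scores):.3f}")
```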

To know more about cross-validation you can refer to this article 👉 Cross-Validation in Machine Learning: How to Do It Right

Common mistakes while doing cross-validation

1. Randomly choosing the number of splits

The key configuration parameter for k-fold cross-validation is k, which defines the number of folds the dataset will be split into. Choosing it is the first dilemma when using k-fold cross-validation. We generally stick to the most commonly used value, k=5, which is computationally less expensive.

There is no hard and fast rule for choosing k. If the value of k is too small (like 2 or 3), the pessimistic bias of the performance estimate will be high. Increasing k reduces this bias but increases the variance of the resulting estimate (and the computational cost), with k=10 often offering a decent tradeoff.

However, while determining the value of k, you must ensure that the chosen value makes sense for your dataset. One tried and tested approach is to perform a sensitivity analysis for different values of k, i.e. estimate the performance of the model with different values of k and see how the estimates fare.
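A sensitivity analysis over k can be sketched as below. The dataset and model are placeholders; the idea is simply to compare the mean and spread of the scores across fold counts:

```python
# Estimate model performance for several values of k and compare
# the mean and standard deviation of the cross-validation scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

for k in (2, 3, 5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"k={k:2d}  mean={scores.mean():.3f}  std={scores.std():.3f}")
```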

2. Always choosing target stratified cross-validation for classification

When dealing with classification problems, our natural tendency is to always go for the popular stratified k-fold, which is stratified on the target labels. This assumes that the underlying data consists of independent and identically distributed samples. However, this might not hold in every case. Let’s understand this with the help of an example:

Consider medical data collected from multiple patients, where each patient may contribute multiple samples. Such samples are likely to be dependent within each patient group. In this case, we would like to know whether our model, trained on a particular set of patients, generalizes well to unseen patients. To achieve this, we keep the train and validation groups mutually exclusive: none of the training patients appear in the validation dataset.

Group k-fold is a variation of k-fold that helps us deal with the above situation. It ensures that the same group is not represented in both the training and validation sets. You can find the scikit-learn implementation here

Most classification problems have class-imbalanced data. In this situation, you might want to combine group k-fold with stratification; scikit-learn provides a stratified group k-fold implementation that you can find here.

In the above images, you can see the difference between GroupKFold and StratifiedGroupKFold. Stratified group k-fold tries to keep the group constraint of group k-fold while also attempting to return stratified folds.

3. Choosing the wrong cross-validation technique for a regression problem

When selecting a cross-validation scheme for a regression problem, most people go for plain k-fold because the target values are continuous. This produces a random split into train and validation sets and fails to ensure similar distributions of target values between them.

To avoid this common mistake, we can bin the target values into n bins and then do a stratified k-fold using these bin labels. Let’s take a look at how this is done:

import pandas as pd
from sklearn.model_selection import StratifiedKFold

def create_folds(df, n_grp, n_s=5):
    # assumes df has a continuous column named 'target'
    df = df.reset_index(drop=True)  # align positional split indices with .loc labels
    df['Fold'] = -1
    skf = StratifiedKFold(n_splits=n_s)
    # bin the continuous target into n_grp bins and stratify on the bins
    df['grp'] = pd.cut(df.target, n_grp, labels=False)
    target = df.grp
    for fold_no, (_, v) in enumerate(skf.split(target, target)):
        df.loc[v, 'Fold'] = fold_no
    return df

This will ensure similar target distributions in your train and validation sets. The method is also known as continuous target stratification.

4. Oversampling before cross-validation

This is a very common mistake that I have observed in cross-validation for classification problems. Most classification datasets have a class imbalance issue. One way to address it is oversampling: randomly duplicating samples from the minority classes so that the number of samples per label is balanced.

Consider that you are working on a fraud detection dataset. You observe an extreme class imbalance and decide to address it first by oversampling. After this, you do cross-validation. Now the train and validation sets formed by k-fold cross-validation will share a number of identical samples, which leads to overestimating the model's performance. So, avoid oversampling followed by cross-validation.

If oversampling is absolutely critical to your task, you should first split the data into folds and then do the upsampling inside the training and validation loop. Here is pseudo-code for this:

kfold = KFold(n_splits=n_splits)
scores = []
for train, valid in kfold.split(data):
    # oversample the training fold only; the validation fold stays untouched
    train_oversampled = oversample_function(train)
    score = train_and_validate(train_oversampled, valid)
    scores.append(score)
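A runnable version of this idea might look as follows. The dataset, model, and the use of sklearn.utils.resample for the oversampling step are placeholder choices for illustration:

```python
# Split first, then oversample the minority class inside each
# training fold only; the validation fold is never resampled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# synthetic dataset with a roughly 90/10 class imbalance
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, valid_idx in kfold.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # upsample the minority class of the *training* fold only
    minority = y_tr == 1
    X_min, y_min = resample(X_tr[minority], y_tr[minority],
                            n_samples=int((~minority).sum()), random_state=0)
    X_bal = np.vstack([X_tr[~minority], X_min])
    y_bal = np.concatenate([y_tr[~minority], y_min])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # scoring on the untouched validation fold keeps the estimate honest
    scores.append(model.score(X[valid_idx], y[valid_idx]))

print(f"Mean validation accuracy: {np.mean(scores):.3f}")
```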

5. Knowledge leak

Knowledge leak, or data leakage, is a major problem in machine learning and is common when doing cross-validation. It occurs when information from outside the training set is used to build the model. This additional information allows the model to learn something about the data that it would otherwise not know, leading to overly optimistic predictions and an overestimation of model performance. Let’s understand this by taking a popular mistake:

Consider that you are working with tabular data for drug classification containing hundreds of features. You decide to do some feature engineering through principal component analysis. After running PCA, you add the first 5 principal components to the data as features. After this, you run cross-validation and feed the data into the CV loop. You observe that your CV results have improved more than expected. Happy with your model, you proceed to test its performance on the holdout test set. But to your surprise, your model performs poorly!

Here you have committed the mistake of data leakage because you fit PCA on the whole dataset rather than on the training data alone. In general, any feature engineering you apply should be fitted on the training split only, inside the CV loop, to avoid data leakage.
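One way to enforce this is to wrap the feature engineering in a scikit-learn Pipeline, so PCA is re-fit on the training portion of every fold. The dataset and model below are placeholders:

```python
# Wrapping PCA in a Pipeline ensures it is fitted only on the
# training portion of each fold, so the validation samples never
# influence the learned components.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

leak_free = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))
scores = cross_val_score(leak_free, X, y, cv=5)  # PCA is re-fit inside each fold
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```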

6. Cross-validation for time series data

We know that time-series data is different. In time series forecasting, our objective is to build a model capable of predicting the future from past observations. When validating time series models, one should always keep only out-of-time samples in the validation/test set. This ensures we measure how well the model generalizes beyond the historical data it was trained on.

It is a common mistake to use the normal k-fold method available in scikit-learn to do cross-validation when modeling time series data. It splits train and validation sets randomly and fails to ensure the out-of-time condition. So, when doing time series forecasting, one should always use the time series split method available in scikit-learn to do cross-validation.

Lag features are used extensively to model the response in time series forecasting, and they can also cause a data leak. When using lag features, you can use a blocked time series split. Here is an image depicting the difference between the time series split and the blocked time series split method.

It works by adding margins at two positions. The first is between the training and validation folds, to prevent the model from observing lag values that are used twice: once as a regressor and once as the response. The second is between the folds used at successive iterations, to prevent the model from memorizing patterns from one iteration to the next.
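As a sketch, scikit-learn's TimeSeriesSplit always places validation samples strictly after training samples, and in recent versions its gap parameter inserts a margin between them, similar in spirit to the first margin of a blocked split:

```python
# TimeSeriesSplit keeps validation strictly in the future; gap=2
# leaves a 2-step margin between train and validation to protect
# lag features from leaking across the boundary.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations
tscv = TimeSeriesSplit(n_splits=4, gap=2)

for train_idx, valid_idx in tscv.split(X):
    # every training index precedes every validation index,
    # with a 2-step margin in between
    assert train_idx.max() + 2 < valid_idx.min()
    print(f"train up to t={train_idx.max()}, validate t={valid_idx.min()}..{valid_idx.max()}")
```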

7. Randomness in cross-validation estimation

Almost all machine learning algorithms are influenced by the random seed. However, the effect of the seed is particularly strong for complex deep neural networks such as transformers. It can be very evident when training large models like BERT on small datasets. With such models, a single run of normal cross-validation is unlikely to give you an accurate estimate of model performance.

To avoid any surprises when testing your model on your hold-out test dataset, you should run cross-validation with multiple seeds and average the model performance. This gives you a far better understanding of your model's performance. Here is the pseudo-code for this:

SEEDS = [1, 2, 3, 4, 5]
ScoreMetric = []

for seed in SEEDS:
    seed_all(seed)
    # shuffle=True is required when passing random_state to KFold
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train, valid in kfold.split(data):
        score = train_and_validate(train, valid)
        scores.append(score)
    ScoreMetric.append(scores)

print(f"Average Model Performance {np.mean(ScoreMetric)}")

Conclusion

Throughout this article, we discussed some common mistakes people commit while doing cross-validation, along with solutions for each. I hope this article leaves you better equipped to perform cross-validation.

At the same time, I would urge you to refer to more research papers on experiments done using different model selection and validation techniques. I have attached some great references to get you started with this.