deep dive into: cross validation

what happens if we test models on the same data we trained on?

The first thing we must keep in mind is that the models we create need to perform well in production environments, not only in our development environments. To do so, a model needs to generalize as much as possible. Put another way, if a model overfits, it predicts very well on the train data but can't give good predictions on unseen data. The issue with testing a model on the same data we used to train it is that it will report very high accuracy, but when the model sees new data its accuracy will be much lower, or worse. To prevent this blind spot we must test models on new data that wasn't used to train them.

split the dataset into two parts

One solution for this is to split the data into two parts randomly. The most common ratio is 4:1, i.e. 80% of the data to train and 20% to test the model. scikit-learn provides a function to do exactly this: train_test_split. There are two parameters in this function, test_size and train_size, but we only need to provide one of them. By default, scikit-learn shuffles the data before the split; we can control that with the shuffle parameter.
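As a quick illustration, here is a minimal sketch of such a split (the placeholder X and y arrays are made up for this example):

from sklearn.model_selection import train_test_split
import numpy as np

# placeholder data: 100 samples, 5 features (made up for the example)
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# 80% train / 20% test; shuffle=True is the default,
# random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)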

But there is still a weakness in this method: how do we know this split was the best one? What if there is another combination of the data that is better than this one?

cross-validation

Cross-validation gives an answer to that question. Rather than worrying about which combination of data is better, it uses all of them. First, it divides the data into k folds, with the same amount of data in each fold. Then it uses one fold at a time as the test set and all the other folds as the training set.


The test set we split off in the first place is still held out for final validation. In every iteration, only k - 1 folds (green) are used to train the model and the remaining fold (purple) is used to test it. The special case where k equals the number of samples is famously called "leave one out".

scikit-learn implementation:
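The embedded snippet is missing here; as a stand-in, below is a minimal sketch of 5-fold cross-validation with scikit-learn (the LogisticRegression model and the placeholder X and y are assumptions made up for this example):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
import numpy as np

# placeholder data (made up for the example)
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

model = LogisticRegression()

# 5 folds: each fold takes one turn as the test set,
# while the other 4 folds are used for training
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)

print(scores)         # one score per fold
print(scores.mean())  # the averaged cross-validation score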

Types of cross-validation offered by scikit-learn

scikit-learn offers a few cross-validation types in the model_selection module: KFold, RepeatedKFold, GroupKFold, and StratifiedKFold. Each of them has different functionality and different uses.

  • RepeatedKFold - repeats k-fold multiple times, producing different splits with a different random state on each repetition.

  • GroupKFold - gives approximately balanced folds with non-overlapping groups, so the same group never appears in both the train and test folds.

  • StratifiedKFold - each fold preserves approximately the same percentage of samples of each target class as the complete set (see the sketch after this list).
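To make the differences concrete, here is a small sketch of how StratifiedKFold and GroupKFold generate their splits (the toy labels and group assignments are made up for this example):

from sklearn.model_selection import GroupKFold, StratifiedKFold
import numpy as np

X = np.random.rand(12, 3)
y = np.array([0] * 8 + [1] * 4)            # imbalanced target: 8 zeros, 4 ones
groups = np.repeat([0, 1, 2, 3, 4, 5], 2)  # e.g. two samples per patient

# StratifiedKFold: every test fold keeps the 2:1 class ratio
skf = StratifiedKFold(n_splits=4)
for train_idx, test_idx in skf.split(X, y):
    print("stratified test classes:", y[test_idx])

# GroupKFold: a group never appears in both train and test
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups):
    print("test groups:", np.unique(groups[test_idx]))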

learn more:
