Deep Dive Into: Bias and Variance

what decides whether our model performs well or performs horribly? hmm... well, there are a few things that affect models: data quality, model complexity, hyperparameters, etc. so what about model complexity? before talking about model complexity, we need a few concepts first: train set, validation set, bias, variance, overfitting, and underfitting. let's talk about them first.

Train and validation datasets

the whole data set split into train and validation sets

first, we need to understand why we use two separate data sets for training and validation. think about it: if we train our model on the full data set, we can't check whether it generalizes, because there is no unseen data left to observe how our model works on. so we randomly split the data set into two parts: generally 80% of the data goes into the train set and 20% into the validation set. this ratio may change according to the size of the data set we have.
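as a rough sketch, an 80/20 random split might look like this with scikit-learn (the toy data here is made up just for illustration, it's not the data from the figures):

```python
# minimal sketch of an 80/20 train/validation split (toy data assumed)
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(100, 1))                # 100 samples, 1 feature
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 100)  # noisy arc-like target

# 80% of the rows go to training, 20% are held out for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_val))  # 80 20
```

the `random_state` argument just makes the shuffle reproducible; in practice any random split works.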


Training models on training data

let's try to train two models on the same data: one simple and one much more complex than the first. first, try a simple model (least squares) on the train data.

simple linear model fits training data

because of the linear model's inflexible nature, it can't represent the true relationship between input and output. no matter how long we try to fit a line, it can't follow the arc that the data lie on. in this case, we can say the model isn't complex enough to replicate the true relationship between input and output. next, look at a model much more complex than the linear one.

complex model fits training data

with this model, we can capture the true relationship between input and output. this model is complex and flexible enough to fit the arc-like data points. now we have all the necessary background knowledge to move on to our main topics.
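the two fits above could be sketched like this, with a plain least-squares line as the simple model and a high-degree polynomial standing in for the complex one (the arc-like data is simulated, not the data from the figures):

```python
# sketch: a simple least-squares line vs. a much more flexible
# polynomial model, fitted on the same simulated arc-like data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 80)

# simple model: a straight line, can't bend along the arc
simple = LinearRegression().fit(X, y)

# complex model: degree-10 polynomial, flexible enough to follow the arc
complex_model = make_pipeline(PolynomialFeatures(degree=10),
                              LinearRegression()).fit(X, y)

# the flexible model tracks the training data far better
print(simple.score(X, y), complex_model.score(X, y))
```

the exact degree doesn't matter much here; anything flexible enough to bend along the arc will show the same contrast.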

Bias

simply put, bias is the difference between predictions and true values. if the model has high bias, we can say the model can't capture the complexity of the data, or in other words, the model's complexity is not enough to capture the relationship between inputs and outputs. so how do we measure this bias? we can use the sum of squares. that means we get the distance between each prediction and the real point, square it, and finally sum all the squared distances. in mathematical terms,

SSR = Σᵢ (yᵢ − ŷᵢ)²

where yᵢ is the true value and ŷᵢ is the prediction for the i-th point.
you can understand this more easily with this graph,

the sum of square distances


so what about the other model?

the sum of square distances

there is no bias here: an almost perfect fit. now let's add some more terminology. the simple model (linear) has a higher bias than the complex one, so the complex model definitely identifies the relationship between inputs and outputs better than the linear model. we say that the simple model is Underfitting because it can't capture the true relationship of the data. remember, a higher sum of squares means higher bias. so far we have worked on the training data set. now let's evaluate our two models with the validation data, which is unseen by the models, and see how they predict with it.
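the sum-of-squares measure described above is simple enough to write directly; here's a small sketch with made-up numbers (the predictions below are invented for illustration, not taken from the figures):

```python
# sketch: measuring bias as the sum of squared distances between
# predictions and real points (toy numbers, assumed for illustration)
import numpy as np

def sum_of_squares(y_true, y_pred):
    """Sum of squared distances between predictions and true values."""
    return float(np.sum((y_true - y_pred) ** 2))

y_true       = np.array([1.0, 2.0, 3.0, 4.0])
simple_pred  = np.array([1.5, 1.8, 3.4, 3.5])  # a linear-ish fit, misses the points
complex_pred = np.array([1.0, 2.0, 3.0, 4.0])  # a near-perfect fit

print(sum_of_squares(y_true, simple_pred))   # ≈ 0.7 -> higher SSR, higher bias
print(sum_of_squares(y_true, complex_pred))  # 0.0  -> almost no bias
```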

Variance

first, we evaluate the simple model (linear) with the validation data.

validation of a simple model

not much change on the validation data: the simple model still has bias, but the important thing to notice is that this model is consistent. its fit didn't change much on the validation data. now it's time to look at the complex model and its behaviour on the validation data.

validation of the complex model

the complex model clearly struggles with the validation data. the model shows a huge change from the training data to the validation data, and it has lost its consistency. so we say that the complex model has higher variance than the simple model. even though the complex model did a great job on the training data, on the validation data it is almost beaten by the simple model.

variance is the difference between fits across different data sets. you can see here that the complex model fits the validation data much worse than the training data. we call this kind of situation Overfitting: the model is more complex than the relationship between the input and output data, so it fits the random noise and outliers of the training set. as a result, it is unable to generalize, so it does terrible work on unseen data. in contrast, the simple model has a high bias but low variance, and the complex model has a low bias but a high variance. so it's hard to predict how the complex model will work on different data sets: maybe it will work well on some data, and maybe not. because of this uncertainty, the simple model can be trusted more than the complex one. mind you, the simple model is doing a good job, not a great one.
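this overfitting pattern can be put in numbers by comparing each model's sum of squared residuals on the train set versus the validation set (simulated data again; the degree-15 polynomial is just a stand-in for an over-flexible model):

```python
# sketch: overfitting in numbers -- train vs. validation SSR for a
# simple line and an over-flexible polynomial (simulated data assumed)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 60)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=1)

def ssr(model, X, y):
    """Sum of squared residuals of the model's predictions."""
    return float(np.sum((y - model.predict(X)) ** 2))

simple = LinearRegression().fit(X_tr, y_tr)
flexible = make_pipeline(PolynomialFeatures(degree=15),
                         LinearRegression()).fit(X_tr, y_tr)

# typical pattern: the flexible model wins on training data, but its
# error tends to jump on validation data (high variance), while the
# line is worse on training data yet changes little between the sets
print("simple   train/val SSR:",
      ssr(simple, X_tr, y_tr), ssr(simple, X_val, y_val))
print("flexible train/val SSR:",
      ssr(flexible, X_tr, y_tr), ssr(flexible, X_val, y_val))
```

the exact numbers depend on the random draw, but the train-set gap between the two models is the low-bias side of the story, and the jump from train to validation error is the high-variance side.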

these are two extreme situations. what we need to do is find an intermediate model for better performance, which means neither the simplest model we can have nor the most complex model we can build: we should take the middle ground. we can use techniques like adding more features, getting more data, dimensionality reduction, regularization, bagging, boosting, and stacking to overcome this. that's a whole new story, so I'll explain them in another post.

Summary

in this post I discussed a very important topic in machine learning: bias and variance. first, we discussed training and validation data sets, and then we compared how two models work on each data set.
