Deep Into: Logistic Classification

will it rain tomorrow?

supervised learning can be divided into two parts: regression and classification. we use regression to predict numerical values from historical data, while classification techniques are used to group data using their shared qualities or characteristics. in machine learning we can do this classification using several algorithms. in this post I hope to build a logistic regression model to use in binary classification. binary means there are only two possible outcomes, like head or tail, right or left, or yes/no.




for example, if we consider a wine problem, predicting next month's wine price is a regression question, while deciding wine quality (good or bad) is a classification problem. now we have a clear idea about classification, so let's dive deeper. how do we do this classification thing? we have several strategies for the task. the simplest approach is a regression line: we fit the line using the data we have, then use it as a boundary to group our data. but this is not a good idea, because only very low accuracy can be achieved by this method. so I decided to use another strategy, the logistic classification method, which uses conditional probability. with this, I can get the probability of an event happening or not.

classification with linear regression

as I said before, the linear regression method is not very accurate; it misses most of the data points. the logistic method, on the other hand, gives a probability, so to get a prediction we need to set a threshold (normally 0.5). we then group the data according to the threshold: greater values go to one class and lower values go to the other.

probability graph of a logistic classification model

the same features used in the linear model are used here as well, and no changes were made to the data. as you can see in the graph, the x and y axes show the features used in the model and the z-axis shows the probability. we take the threshold as 0.5, so values greater than it are treated as "yes" and the others as "no". now let's compare the predictions of each model through graphs.

predictions of the linear regression-based model
predictions of the logistic classification model

Logistic classification

the logistic model is built on the standard logistic function, which looks like this,

$$y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}$$

this function gives the probability of a certain event. we can write its exponential part like this,

$$\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$

here the beta values are the weights of the features we use in the model and the X values are our features. y lies between 0 and 1, while each X can take any value from negative to positive infinity. that mismatch in range is significant, so to improve the function we take the odds ratio,


$$P = \frac{y}{1 - y} = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n}$$

now the odds ratio P lies in the range zero to positive infinity. we can make a further improvement by taking the natural logarithm of P (the odds ratio),

$$\ln(P) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$


now ln(P) ranges from negative infinity (as the odds ratio approaches 0, ln(P) → −∞) to positive infinity (as the odds ratio grows without bound, ln(P) → +∞), which is exactly the unbounded range we want: the log odds now match the range of the linear combination of our features.

Model building

let's move on to our question: "will it rain tomorrow?". now we are going to build a model to answer this question, so we need a weather dataset. Kaggle provides lots of publicly available datasets on many topics, so I chose the Australian weather dataset, which provides time-series data on rain from 2008 to 2017. a lot of data 😍...



throughout this work, I use a few libraries:
  • NumPy
  • pandas
  • Feature-engine
  • SciPy
  • Seaborn
  • Plotly
  • scikit-learn
first, we need to examine a few things in our dataset:
  1. are there any missing values?
  2. the distribution of the features
  3. possible outliers
I'm not going to show all of these details here, but you can see them in this Jupyter notebook. after cleaning the dataset, we are good to go...
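for a rough idea, the checks could look something like this sketch (the file name weatherAUS.csv is an assumption based on the Kaggle dataset, and the 1.5 × IQR rule is just one common way to flag outliers):

```python
import pandas as pd

# load the Australian weather dataset (file name assumed from Kaggle)
df = pd.read_csv("weatherAUS.csv")

# 1. are there any missing values?
print(df.isnull().sum().sort_values(ascending=False))

# 2. distribution of the features
print(df.describe())

# 3. possible outliers: counts of values outside 1.5 * IQR per column
numeric = df.select_dtypes(include="number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
print(((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum())
```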

we have two options here: we can use a pre-built library like scikit-learn, or we can build our own. I decided to do both. first, we build our own model using only NumPy, then we use the scikit-learn LogisticRegression model, and finally we compare the performance of the two models.

logistic function,
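a minimal NumPy sketch of what this function could look like (the names logistic and theta follow the text; X is assumed to carry a bias column of ones):

```python
import numpy as np

def logistic(theta, X):
    """Standard logistic (sigmoid) function applied row-wise to X."""
    z = X @ theta  # beta_0 + beta_1*X_1 + ... + beta_n*X_n for every sample
    return 1.0 / (1.0 + np.exp(-z))
```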

theta is our weight vector for the features.

note:

I used a normalization function to normalize the data, because otherwise overflow errors can occur during calculation and ruin our model. this is needed because exp() overflows on large inputs, and invalid-value errors can occur because of 0s and inf values. with normalization we can avoid both.
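as a sketch, a simple min-max scaler does the job (the notebook may use a different scaler; this version assumes no constant columns):

```python
import numpy as np

def normalize(X):
    """Scale each feature column into [0, 1] so exp() never sees huge values."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```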

we can use the Pearson correlation coefficient to select the features used to train this model.
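with pandas this is short. the 0.1 cutoff below is an illustrative choice, not a value from the original notebook, and it assumes the target RainTomorrow has already been encoded as 0/1:

```python
# Pearson correlation of every numeric feature with the target
corr = df.corr(numeric_only=True)["RainTomorrow"].drop("RainTomorrow")

# keep features whose absolute correlation passes the cutoff
selected_features = corr[corr.abs() > 0.1].index.tolist()
print(selected_features)
```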

as I said, I used 0.5 as the threshold,
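in code, prediction is just a comparison against that threshold (using the logistic helper sketched above; theta is assumed to be the weight vector learned in the notebook's training step):

```python
def predict(theta, X, threshold=0.5):
    """Class 1 ("yes, it will rain") when the probability reaches the threshold."""
    return (logistic(theta, X) >= threshold).astype(int)

# probabilities and class predictions of our own model on the test set
own_probs = logistic(theta, X_test)
own_preds = predict(theta, X_test)
```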

now let's build a model using scikit-learn,
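a minimal sketch (X_train, X_test, y_train are assumed to come from the earlier cleaning and train/test split):

```python
from sklearn.linear_model import LogisticRegression

# fit scikit-learn's logistic regression on the same normalized features
sk_model = LogisticRegression(max_iter=1000)
sk_model.fit(X_train, y_train)

sk_probs = sk_model.predict_proba(X_test)[:, 1]  # probability of "yes, it will rain"
sk_preds = sk_model.predict(X_test)
```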

let's compare the two models with the help of this graph,
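one way to draw such a graph, sketched with Plotly (own_probs/own_preds and sk_probs/sk_preds come from the snippets above; 1 − p gives the probability of "no"):

```python
import numpy as np
import plotly.express as px

# scatter the two models' P(no) against each other,
# colored by whether the two models agree on the predicted class
agreement = np.where(own_preds == sk_preds, "same prediction", "different prediction")
fig = px.scatter(
    x=1 - own_probs, y=1 - sk_probs, color=agreement,
    labels={"x": "own model: P(no)", "y": "scikit-learn model: P(no)"},
)
fig.show()
```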

comparing the probabilities of "no" in both models

we can see how the probabilities of the two models relate to each other. it's clear that both models work fairly well, because the data points lie in a roughly symmetric shape. blue points indicate places where the two models' predictions differ, and orange points indicate places where the predictions are the same. you can get a better idea of which model is better with these results,
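for concrete numbers, scikit-learn's metrics work for both models (a sketch; y_test as before):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

print("own model accuracy    :", accuracy_score(y_test, own_preds))
print("scikit-learn accuracy :", accuracy_score(y_test, sk_preds))
print(confusion_matrix(y_test, sk_preds))
```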


so, what about the class balance in both models?
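counting the predicted classes shows the balance, for example:

```python
import pandas as pd

# share of "yes" and "no" predictions from each model
balance = pd.DataFrame({
    "own model": pd.Series(own_preds).value_counts(normalize=True),
    "scikit-learn": pd.Series(sk_preds).value_counts(normalize=True),
})
print(balance)
```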


they are also very close. 


Summary

in this post, we discussed why logistic classification is better than linear-regression-based classification, looked at a little of the math behind the logistic method, and built a model from scratch. then we compared our model with the scikit-learn logistic regression model. I hope you learned something new from this post.
