Bias And Variance

Machine learning, and deep learning in particular, is all about the model and the data we feed to it. There are two things to consider:
* Model complexity
  * Number of hidden layers
  * Number of nodes in the hidden layers
* Data provided to the model
  * Number of features in the data (columns)
  * Number of rows of data (data points)
There are a few other things, such as the learning rate, regularisation and dropout, which mostly relate to how the model learns. To build a good model, we hold the dataset fixed and tune the model on it so that the final error is as small as possible. When tuning, we often focus on bias and variance, so we need to understand these two concepts before we can fine-tune the model.

Let us first walk through a simple story.

Suppose you start learning an instrument, and you begin with a simple song, A. Initially you will not be able to play that song correctly, but after practicing many times you will. Suppose a week has passed and you can play song A perfectly. Does this mean you have learned to play?
Now your instructor asks you to play another song, B. When you start playing song B, you cannot play it correctly, because you have practiced only song A. You have not studied how music is played in general; you have only perfected a single song.
Now let’s compare this story with BIAS and VARIANCE.

BIAS
When you start learning the instrument, the mistakes (error) you make while learning that particular song are the measure of BIAS. In simple terms, bias is the error on the specific dataset the model is trained on.
The higher the error, the higher the bias. Bias tells us how well our model performs when the dataset is static and we only have to predict for that given dataset.

VARIANCE
Once you have learned to play song A, the mistakes (error) you make while playing song B are the measure of VARIANCE. In simple terms, variance is the error that appears when the dataset changes.
Variance tells us how well our model predicts when it sees a new dataset that it has not seen earlier.

So whenever we start with a model, we usually have a high error: the model is not able to predict the correct values (HIGH BIAS). Once we add more layers or more features, the model is able to predict the correct values for the given dataset (LOW BIAS).
Now we feed some new data to the model. It will have a high error on the new dataset (HIGH VARIANCE), but if we make the model generic, so that it does not depend too much on the initial data, it will predict the correct values for the new dataset as well (LOW VARIANCE).
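To make the two failure modes concrete, here is a minimal sketch in plain Python. It is not from the original post: the toy data (a noisy straight line) and the two hypothetical models are assumptions chosen only for illustration. `fit_constant` always predicts the mean training label, and `fit_memorizer` returns the label of the nearest training point, much like perfecting song A without learning music in general.

```python
import random

def make_data(n, rng):
    """Noisy samples of y = 2x on [0, 1] (an assumed toy problem)."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.3)))
    return data

def fit_constant(train):
    """Too simple: ignores x and always predicts the mean training label."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_memorizer(train):
    """Too flexible: returns the label of the closest training point."""
    def predict(x):
        nearest = min(train, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def mse(model, data):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(30, rng)
valid = make_data(30, rng)

constant = fit_constant(train)
memorizer = fit_memorizer(train)

constant_train, constant_valid = mse(constant, train), mse(constant, valid)
memorizer_train, memorizer_valid = mse(memorizer, train), mse(memorizer, valid)

print("constant  train/valid:", constant_train, constant_valid)
print("memorizer train/valid:", memorizer_train, memorizer_valid)
```

The constant model cannot even fit its own training data, which is the high-bias case. The memorizer reproduces the training set exactly but still makes errors on the validation set, and that gap between the two errors is the symptom of high variance.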

So our end goal is low bias and low variance. We still have to sort out which metrics to observe and which actions to take to achieve this.
To start with, one should always plot an error graph for both the training set and the validation set against training time.

High Bias
* In the graph of error vs time, both the training error and the validation error are high.
* High bias is often known as underfitting.
* To solve high bias, one can do the following:
  * Add more features.
  * Increase the number of hidden layers or the number of neurons in the layers.
  * Train the model for a longer time.

High Variance
* In the graph of error vs time, the training error decreases but the validation error stays high.
* High variance is often known as overfitting.
* To solve high variance, one can do the following:
  * Add more data to the training set.
  * Use regularization.
  * Decrease the learning rate.
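The reading of the error graph described here can be sketched as a small helper. This is only an illustration: the function name `diagnose`, the thresholds `acceptable_error` and `max_gap`, and the error curves are all made up, and sensible values depend on your task and error metric.

```python
def diagnose(train_errors, valid_errors, acceptable_error=0.1, max_gap=0.05):
    """Classify a training run from its final training and validation errors.

    acceptable_error and max_gap are hypothetical thresholds, not standard values.
    """
    final_train = train_errors[-1]
    final_valid = valid_errors[-1]
    high_bias = final_train > acceptable_error
    high_variance = (final_valid - final_train) > max_gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias and low variance"

# Made-up error curves, one value per epoch.
underfit = diagnose([0.50, 0.42, 0.40], [0.52, 0.45, 0.43])
overfit = diagnose([0.50, 0.10, 0.02], [0.52, 0.30, 0.28])
good = diagnose([0.50, 0.12, 0.05], [0.52, 0.15, 0.08])

print(underfit)
print(overfit)
print(good)
```

In practice you would look at the whole curve rather than only the last value, but the final errors are usually enough to tell which of the two problems to work on first.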

Note
High Bias, Low Variance: Models are consistent but inaccurate on average.
High Bias, High Variance: Models are inaccurate and also inconsistent.
Low Bias, High Variance: Models are somewhat accurate on average but inconsistent. A small change in the data can cause a large error.

Different Dropout Functions in TensorFlow

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on the training data. It is a very efficient way of performing model averaging with neural networks. Why dropout? It helps prevent weights from converging to identical positions, and it does this by randomly turning nodes off during forward propagation.

In simple terms some of your neurons will not participate in the calculation.

In TensorFlow we have two dropout functions (these are the TensorFlow 1.x names).

• tf.nn.dropout
  • It has a parameter keep_prob, which states the probability that a neuron is kept (not dropped). If we give keep_prob a value of 0.6, about 60% of the neurons will remain and 40% will be dropped.
• tf.layers.dropout
  • It has two main parameters, rate and training. rate is the fraction of neurons that will be dropped. If rate is 0.6, about 60% of the neurons will be dropped and 40% will be used.
  • We can see that keep_prob = 1 - rate.
  • The training parameter tells the function whether the network is running for training or for getting results. We need dropout only while training the neural network, not while testing it (inference).

So what is the difference between these two functions?

tf.layers.dropout is a wrapper over tf.nn.dropout that lets us choose whether to return the output in training mode (apply dropout) or in inference mode (return the input untouched).
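The relationship between the two functions can be sketched in plain Python. This is not TensorFlow code: `nn_dropout` and `layers_dropout` are hypothetical stand-ins that work on a list of floats, written only to show how keep_prob, rate and training relate. Like tf.nn.dropout, the sketch scales the kept values by 1 / keep_prob so that the expected sum stays the same.

```python
import random

def nn_dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and scale it by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

def layers_dropout(values, rate, training, rng):
    """Wrapper: drop with probability `rate` when training, else pass through."""
    if not training:
        return list(values)  # inference mode: input returned untouched
    return nn_dropout(values, keep_prob=1.0 - rate, rng=rng)

rng = random.Random(0)
activations = [1.0] * 10000

train_out = layers_dropout(activations, rate=0.4, training=True, rng=rng)
infer_out = layers_dropout(activations, rate=0.4, training=False, rng=rng)

# With rate = 0.4 (keep_prob = 0.6), roughly 60% of the values should survive.
kept_fraction = sum(1 for v in train_out if v != 0.0) / len(train_out)
print("fraction kept during training:", kept_fraction)
print("inference output unchanged:", infer_out == activations)
```

The wrapper adds only the training switch and the rate convention; the actual dropping is delegated to the lower-level function.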

Difference Between Generative and Discriminative Machine Learning

To understand these two kinds of model, we first have to see the difference between joint probability [P(x,y)] and conditional probability [P(x|y)].

Joint probability: p(A and B). The probability of event A and event B both occurring. It is the probability of the intersection of two or more events. The probability of the intersection of A and B may be written p(A ∩ B). Example: the probability that a card is a four and red = p(four and red) = 2/52 = 1/26. (There are two red fours in a deck of 52, the 4 of hearts and the 4 of diamonds.)

Conditional probability: p(A|B) is the probability of event A occurring, given that event B occurs. Example: given that you drew a red card, what’s the probability that it’s a four? p(four|red) = 2/26 = 1/13. Out of the 26 red cards (given a red card), there are two fours, so 2/26 = 1/13.
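Both card examples can be checked by counting over a deck. The sketch below is only an illustration; the helper names (`probability`, `is_four`, `is_red`) are made up for this example. It also confirms the usual identity p(four|red) = p(four and red) / p(red).

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = [(rank, suit) for rank, suit in product(ranks, suits)]

def probability(event, cards):
    """Fraction of `cards` for which `event` is true."""
    return Fraction(sum(1 for card in cards if event(card)), len(cards))

def is_four(card):
    return card[0] == "4"

def is_red(card):
    return suits[card[1]] == "red"

# Joint probability: count over the whole deck.
p_four_and_red = probability(lambda card: is_four(card) and is_red(card), deck)

# Conditional probability: count only within the red cards.
red_cards = [card for card in deck if is_red(card)]
p_four_given_red = probability(is_four, red_cards)

p_red = probability(is_red, deck)

print(p_four_and_red)    # 1/26
print(p_four_given_red)  # 1/13
```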

A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal? Let’s say you have input data x and you want to classify the data into labels y. A generative model learns the joint probability distribution p(x,y). You ask it “what’s the likelihood that this or that class generated this instance?” and pick the class with the higher probability.

A discriminative algorithm does not care about how the data was generated; it simply categorizes a given signal. A discriminative model learns the conditional probability distribution p(y|x), which you should read as the probability of y given x. A discriminative algorithm uses the data to create a decision boundary, so you ask it “what side of the decision boundary is this instance on?”

The fundamental difference between discriminative models and generative models is:

• Discriminative models learn the (hard or soft) boundary between classes
• Generative models model the distribution of individual classes

Given an input data point x, the aim is to predict a continuous (regression) or discrete (classification) output y. That is, given x, we are interested in modeling p(y|x). Two approaches to this are:

1. Generative Models:
One way is to model p(x, y) directly. Once we do that, we can obtain p(y|x) by simply conditioning on x. We can then use decision theory to determine class membership, i.e. we can use a loss matrix etc. to decide which class the point belongs to (such an assignment would minimize the expected loss). For example, in the Naive Bayes model you can learn p(y), the prior class probabilities, from the data. You can also learn p(x|y) from the data using, say, maximum likelihood estimation (or a Bayes estimator, if you prefer). Once you have p(y) and p(x|y), p(x, y) follows directly as p(x, y) = p(x|y) p(y).
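The generative route can be sketched with a tiny Naive Bayes classifier. Everything here is an assumption made for illustration: the toy dataset, the two categorical features, and the function names. The probabilities are plain maximum likelihood estimates (counts divided by totals), with no smoothing.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy data: ((sound, coat), label).
data = [
    (("meow", "striped"), "cat"),
    (("meow", "striped"), "cat"),
    (("meow", "plain"), "cat"),
    (("meow", "striped"), "cat"),
    (("bark", "plain"), "dog"),
    (("bark", "plain"), "dog"),
    (("bark", "striped"), "dog"),
    (("meow", "plain"), "dog"),
]

label_counts = Counter(label for _, label in data)
# feature_counts[(label, position, value)] = number of rows that match.
feature_counts = Counter(
    (label, i, value) for features, label in data for i, value in enumerate(features)
)

def prior(label):
    """p(y): fraction of rows with this label."""
    return Fraction(label_counts[label], len(data))

def likelihood(features, label):
    """p(x | y), assuming the features are independent given the label."""
    result = Fraction(1)
    for i, value in enumerate(features):
        result *= Fraction(feature_counts[(label, i, value)], label_counts[label])
    return result

def joint(features, label):
    """p(x, y) = p(x | y) * p(y)."""
    return likelihood(features, label) * prior(label)

def posterior(features):
    """p(y | x): condition the joint on x by normalising over the labels."""
    joints = {label: joint(features, label) for label in label_counts}
    evidence = sum(joints.values())
    return {label: value / evidence for label, value in joints.items()}

query = ("meow", "striped")
p_joint_cat = joint(query, "cat")
p_post = posterior(query)

print(p_joint_cat)    # 3/8
print(p_post["cat"])  # 12/13
```

The classifier never learns a boundary directly. It estimates how each class produces its features and then picks the class with the larger posterior.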

2. Discriminative Models:
Instead of modeling p(x, y), we can model p(y|x) directly. For example, in logistic regression p(y = 1|x) is assumed to be of the form 1 / (1 + exp(-sum_i(w_i x_i))). All we have to do in such a case is learn the weights that minimize the log loss (the negative log-likelihood).
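A minimal sketch of the discriminative route, assuming plain batch gradient descent on the log loss and a made-up one-dimensional dataset. The function names, the learning rate and the number of steps are illustrative choices, and there is no bias term, to match the form of p(y|x) given here.

```python
import math

def predict_proba(weights, x):
    """p(y = 1 | x) = 1 / (1 + exp(-sum_i w_i * x_i))."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, xs, ys):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(weights, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def train(xs, ys, learning_rate=0.5, steps=200):
    """Batch gradient descent on the log loss, starting from zero weights."""
    weights = [0.0] * len(xs[0])
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            error = predict_proba(weights, x) - y
            for i, xi in enumerate(x):
                gradient[i] += error * xi / len(xs)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical toy data: negative inputs are class 0, positive inputs are class 1.
xs = [[-2.0], [-1.0], [1.0], [2.0]]
ys = [0, 0, 1, 1]

initial_loss = log_loss([0.0], xs, ys)
weights = train(xs, ys)
final_loss = log_loss(weights, xs, ys)

print("loss before and after training:", initial_loss, final_loss)
print("p(y=1 | x=2):", predict_proba(weights, [2.0]))
```

Nothing in this code models how x itself is distributed. The weights only describe where the boundary between the two classes lies.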

Generative models often outperform discriminative models on smaller datasets because their generative assumptions place some structure on your model that prevents overfitting. For example, let’s consider Naive Bayes vs. Logistic Regression. The Naive Bayes assumption is of course rarely satisfied, so logistic regression will tend to outperform Naive Bayes as your dataset grows (since it can capture dependencies that Naive Bayes can’t). But when you only have a small dataset, logistic regression might pick up on spurious patterns that don’t really exist, so the Naive Bayes assumption acts as a kind of regularizer on your model that prevents overfitting. There’s a paper by Andrew Ng and Michael Jordan on discriminative vs. generative classifiers that talks about this more.

Whenever an algorithm involves assuming, calculating or estimating the distribution of the inputs x, it is generative. Simply put, if the algorithm cares about how x is distributed, it is generative; if not, it is discriminative.

Now a small story to tell your 12-year-old kid, so that they can also understand the difference between these two models.

Let’s say you have two kids, “Gen” and “Dis”, who have never opened their eyes since birth. Today is the first day they will open their eyes, and you want to celebrate the occasion by teaching them the difference between a cat and a dog. You take them to a nearby pet store.

Before showing them around, you tell Gen and Dis to pay special attention to the color, size, eye color, fur length, voice etc. (the feature set) of the pets they are going to see. At the end of the visit, you want to check whether they understood the difference between a cat and a dog.

Now you give Dis two photos, one of a cat and one of a dog, and ask which one is which. Dis has meticulously written down several conditions, such as: if the voice sounds like a meow, the eyes are blue or green, and it has brown or black stripes, then the animal is a cat. Thanks to her relatively simple rules, she quickly detects which one is a cat and which one is a dog.

Now, instead of giving her two photos, you give Gen two pieces of blank paper and ask her to draw what a cat and a dog look like.

Given any photo, Gen can also tell which one is a cat and which one is a dog, based on the drawings she created. In most cases, drawing the cat and the dog was unnecessary and time-consuming for the task of detecting which one is a cat.

But suppose there were only a few dogs and cats for Gen and Dis to look at (little training data). In that case, if you show a photo of a brown dog with stripes and blue eyes, there is a chance that Dis will mark it as a cat, while Gen has her drawing and can better detect that the photo is of a dog.

If you ask Gen to pay attention to more things (features), she will create a better sketch. But if you show more examples (a larger dataset) of cats and dogs, Dis will mostly be better than Gen.

Since Dis is very meticulous in her observations, if you ask her to pay attention to more features she will create more complicated rules (overfitting), and the chance of wrongly identifying a cat or a dog will increase. That would not happen as easily with Gen.

What if, before going to the pet store, you don’t tell them that there are only two types of animal (no labeled data)? Dis would fail completely because she would not know what to look for, while Gen would be able to draw the sketches anyway. This is sometimes a huge advantage (semi-supervised learning).

Now let me reveal the suspense which you might already know: Dis is for discriminative and Gen is for generative.