Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks.
Why Dropout: Dropout helps prevent weights from converging to identical positions. It does this by randomly turning nodes off when forward propagating.
In simple terms some of your neurons will not participate in the calculation.
In Tensorflow we have two dropout functions.
- It has parameter keep_prop which state the probability of neuron which will not be drop. If we dive keep_prop value to 0.6, it means 60% of neurons will remain and 40% will be drop.
- It has two main parameters, rate, and training. Rate means the number of neurons which will be drop.If the rate is 0.6 then 60% neurons will be dropped and 40% will be used.
- We can see keep_prop = 1- rate.
- Training parameter is to differentiate if the network is running for training or to get the result.We need dropout only when we are training the neural network, not while we are testing(inferencing) it.
So what is the difference between these two functions?
tf.layers.dropout is a wrapper over the tf.nn.droput, which give us the demarcation of whether to return the output in training mode (apply dropout) or in inference mode (return the input untouched).”
To understand these two models we first have to see what is the difference between joint probability [P(x,y)] and conditional probability[P(x|y)].
Joint probability: p(A and B). The probability of event A and event B occurring. It is the probability of the intersection of two or more events. The probability of the intersection of A and B may be written p(A ∩ B). Example: the probability that a card is a four and red =p(four and red) = 2/52=1/26. (There are two red fours in a deck of 52, the 4 of hearts and the 4 of diamonds).
Conditional probability: p(A|B) is the probability of event An occurring, given that event B occurs. Example: given that you drew a red card, what’s the probability that it’s a four (p(four|red))=2/26=1/13. So out of the 26 red cards (given a red card), there are two fours so 2/26=1/13.
For better understanding, click here for more on probability.
A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal? Let’s say you have input data x and you want to classify the data into labels y. A generative model learns the joint probability distribution p(x,y). A generative algorithm models how the data was “generated”, so you ask it “what’s the likelihood this or that class generated this instance?” and pick the one with the better probability.
A discriminative algorithm does not care about how the data was generated, it simply categorizes a given signal. Discriminative model learns the conditional probability distribution p(y|x) – which you should read as the probability of y given x. A discriminative algorithm uses the data to create a decision boundary, so you ask it “what side of the decision boundary is this instance on?
The fundamental difference between discriminative models and generative models is:
- Discriminative models learn the (hard or soft) boundary between classes
- Generative models model the distribution of individual classes
Given input data point x, the aim is to predict continuous (regression) or discrete (classification) output. That is given x, we are interested in modeling p(y|x). There are three approaches to this:
1. Generative Models:
One way is to model p(x, y) directly. Once we do that, we can obtain p(y|x) by simply conditioning on x. And we can then use decision theory to determine class membership i.e. we can use loss matrix, etc. to determine which class the point belongs to (such an assignment would minimize the expected loss). For e.g. in Naive Bayes model, you can learn p(y), the prior class probabilities from the data. You can also learn p(x|y) from the data using said maximum likelihood estimation (or you can Bayes estimator if you will). Once you have p(y) and p(x|y), p(x, y) is not difficult to find out.
2. Discriminative Models:
Instead of modeling p(x, y), we can directly model p(y|x), for e.g. in logistic regression p(y|x) is assumed to be of the form 1 / (1 + exp(-sigma(wi. xi))). All we have to do in such a case is to learn weights that would minimize the squared loss.
Generative models often outperform discriminative models on smaller datasets because their generative assumptions place some structure on your model that prevent overfitting. For example, let’s consider Naive Bayes vs. Logistic Regression. The Naive Bayes assumption is of course rarely satisfied, so logistic regression will tend to outperform Naive Bayes as your dataset grows (since it can capture dependencies that Naive Bayes can’t). But when you only have a small data set, logistic regression might pick up on spurious patterns that don’t really exist, so the Naive Bayes acts as a kind of regularizer on your model that prevents overfitting. There’s a paper by Andrew Ng and Michael Jordan on discriminative vs. generative classifiers that talks about this more.
Whenever an algorithm involves assuming, calculating or estimating the distribution of Y, it is generative, or simply put, if the algorithm cares about the distribution of Y, it is generative, if not, then it is discriminative.
Now a Small story to tell your 12-year-old kid, so that they can also understand the difference between these two models
Let’s say you have two kids “Gen” and “Dis”, and since their birth, they never opened their eyes. Today is the first day they will open their eyes, and you want to celebrate this occasion by teaching them the difference between Cat and Dog. You take them to pet store nearby.
Before showing around, you tell Gen and Dis to pay special attention to color, size, eye color, fur size, their voice etc.(feature set) of the pets they are going to see. After the end of this visit, you want to check if they understood the difference between cat and dog.
Now you give two photos one of a cat and one of a dog to Dis and ask which one is which. Dis has meticulously written down several conditions like if the voice sounds like meow and eyes are blue or green and has stripes with color brown or black then the animal is a cat. Thanks to her relatively simple rules, she quickly detected which one is a cat and which one is a dog.
Now instead of giving two photos you gave Gen two pieces of blank paper and ask her to draw what a cat and a dog looks like.
Well now, given any photo Gen can also tell which one is cat and which one is dog based on the drawing she created. In most cases drawing of cat and dog was unnecessary and time consuming for the task of detection which one is a cat.
But if there were only a few dogs and cats to look for Gen and Dis (low training data). In such cases if you give a photo of a brown dog with stripes with blue eyes, there is a chance that Dis would mark it as a cat. While Gen has her drawing and she can better detect that this photo is of a dog.
If you ask Gen to pay attention to more things(features), it will create a better sketch. But, if you show more examples(data-set) of cat and dog, Dis would mostly be better than Gen.
Since Dis is very meticulous in her observations if you ask her to pay attention to more features it will create more complicated rules(overfitting) and the chance of wrongly identifying a cat and a dog will increase, but that would not happen easily with Gen.
What if before going to pet store I don’t tell them that there are only two types of animal(no labeled data). Dis would fail completely because she will not know what to look for while Gen would be able to draw the sketch anyway. This is a huge advantage sometimes(semi-supervised).
Now let me reveal the suspense which you might already know: Dis is for discriminative and Gen is for generative.