- Why are bias nodes used in neural networks?
- How many should you use?
- In which layers should you use them: all hidden layers and the output layer?
Answers:
The bias node in a neural network is a node that is always 'on'. That is, its value is set to 1 without regard for the data in a given pattern. It is analogous to the intercept in a regression model, and serves the same function. If a neural network does not have a bias node in a given layer, it will not be able to produce output in the next layer that differs from 0 (on the linear scale, or the value that corresponds to the transformation of 0 when passed through the activation function) when the feature values are 0.
Consider a simple example: You have a feed-forward perceptron with 2 input nodes x1 and x2, and 1 output node y. x1 and x2 are binary features set at their reference level, x1 = x2 = 0. Multiply those two 0's by whatever weights you like, w1 and w2, sum the products, and pass the result through whatever activation function you prefer. Without a bias node, only one output value is possible, which may yield a very poor fit. For instance, using a logistic activation function, y must be 0.5, which would be awful for classifying rare events.
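A minimal NumPy sketch of that example (the weight and bias values are arbitrary, chosen just for illustration): with both inputs at 0, the logistic output is pinned at 0.5 unless a bias term is added.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 0.0])    # both binary features at their reference level
w = np.array([0.7, -1.3])   # arbitrary weights; any values give the same result at x = 0

# Without a bias node: the pre-activation is w . x = 0, so the output is fixed at 0.5
print(logistic(w @ x))      # 0.5, regardless of w

# With a bias node (constant input 1 with its own weight b), the output can be anything in (0, 1)
b = -3.0                    # e.g. a strongly negative bias for a rare positive class
print(logistic(w @ x + b))  # ~0.047
```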
A bias node provides considerable flexibility to a neural network model. In the example given above, the only predicted proportion possible without a bias node was 0.5, but with a bias node, any proportion in (0, 1) can be fit for the patterns where x1 = x2 = 0. For each layer j in which a bias node is added, the bias node will add N_{j+1} additional parameters / weights to be estimated (where N_{j+1} is the number of nodes in layer j+1). More parameters to be fitted means it will take proportionately longer for the neural network to be trained. It also increases the chance of overfitting, if you don't have considerably more data than weights to be learned.
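To make that parameter count concrete, here is a quick tally for a hypothetical [2, 4, 1] feed-forward network (the layer sizes are made up for illustration): each layer with a bias node contributes one extra weight per node in the following layer.

```python
# Hypothetical layer sizes: 2 inputs, one hidden layer of 4 units, 1 output
sizes = [2, 4, 1]

weights_without_bias = sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))
bias_weights = sum(sizes[1:])  # one extra weight per node in each following layer

print(weights_without_bias)                 # 2*4 + 4*1 = 12
print(weights_without_bias + bias_weights)  # 12 + (4 + 1) = 17
```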
With this understanding in mind, we can answer your explicit questions with simple, short answers: bias nodes are used for the same reason an intercept is used in regression, so that a layer's output can differ from a fixed value when its inputs are 0; a single bias node per layer is enough, since one always-on node can connect to every neuron in the next layer; and one can be added to any layer that feeds a following layer, which gives every hidden and output neuron its own bias term.
In a couple of experiments in my master's thesis (e.g. page 59), I found that the bias might be important for the first layer(s), but especially at the fully connected layers at the end it seems not to play a big role. Hence one can have bias nodes in the first few layers and not in the last ones. Simply train a network, plot the distribution of the bias nodes' weights, and prune them if the weights seem to be too close to zero.
This might be highly dependent on the network architecture / dataset.
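As a rough sketch of that pruning heuristic (the bias vectors and the threshold below are made-up stand-ins for whatever a trained network actually contains): collect each layer's bias weights, look at their magnitudes, and zero them out when they are all close to zero.

```python
import numpy as np

# Stand-ins for the trained bias vectors of each layer (replace with your network's values)
biases = {
    "conv1": np.array([0.9, -1.2, 0.7]),
    "fc_out": np.array([0.01, -0.003, 0.02]),
}

threshold = 0.05  # arbitrary cut-off; in practice pick it from the plotted distribution

for name, b in biases.items():
    print(name, "mean |bias| =", np.abs(b).mean())
    if np.all(np.abs(b) < threshold):
        biases[name] = np.zeros_like(b)  # prune: this layer's biases are effectively unused
        print(f"pruned biases of layer {name}")
```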
A bias is typically implemented as an extra node with a constant value of 1 in the previous layer, plus one weight (one bias value) for each of the next layer's neurons.
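One common way to implement exactly that, sketched here in plain NumPy rather than in any particular library's API, is to append a constant 1 to the layer's input vector and store the bias values as an extra column of the weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 3, 2
W = rng.normal(size=(n_out, n_in))  # ordinary weights
b = rng.normal(size=n_out)          # one bias value per neuron in the next layer

x = rng.normal(size=n_in)

# Explicit bias: z = W x + b
z_explicit = W @ x + b

# Equivalent "bias node" form: append a constant 1 to x and the column b to W
W_aug = np.hstack([W, b[:, None]])
x_aug = np.append(x, 1.0)
z_bias_node = W_aug @ x_aug

print(np.allclose(z_explicit, z_bias_node))  # True
```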
In the context of neural networks, Batch Normalization is currently the gold standard for making smart "bias nodes." Instead of clamping a neuron's bias value, you adjust for the covariance of the neuron's input. So in a CNN, you would apply batch normalization just between the convolutional layer and the next fully connected layer (of, say, ReLUs). In theory, all fully connected layers could benefit from Batch Normalization, but in practice this becomes very expensive to implement, since each batch normalization carries its own parameters.
Concerning why: most of the answers have already explained that, in particular, neurons are susceptible to saturated gradients when the input pushes the activation to an extreme. In the case of ReLUs, the activation would be pushed to the left, into the flat region, giving a gradient of 0. In general, when you train a model, you first normalize the inputs to the neural network. Batch Normalization is a way of normalizing the inputs inside the neural network, between layers.
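For intuition, here is a bare-bones sketch of what a batch-normalization step between layers computes at training time (it omits the running statistics and the gradient updates for the learned parameters that a real implementation also handles):

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize a batch of pre-activations per feature, then rescale and shift."""
    mean = h.mean(axis=0)                    # per-feature mean over the batch
    var = h.var(axis=0)                      # per-feature variance over the batch
    h_hat = (h - mean) / np.sqrt(var + eps)  # zero-mean, unit-variance features
    return gamma * h_hat + beta              # learned scale (gamma) and shift (beta)

# Toy batch of 4 examples with 3 features coming out of the previous layer
h = np.array([[1.0,  2.0, -1.0],
              [0.5, -0.5,  3.0],
              [2.0,  1.0,  0.0],
              [1.5,  0.5,  1.0]])

gamma = np.ones(3)  # learnable parameters, one pair per feature
beta = np.zeros(3)  # beta plays the role of a per-feature bias

print(batch_norm(h, gamma, beta))
```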