If the relationship between the9 input and the correct output is complicated and the network has enough hidden units to model it accurately, there will typically be many different settings of the weights that can model the training set almost perfectly, especially if there is only a limited amount of labeled training data. Each of these weight vectors will make different predictions on held-out test data and almost all of them will do worse on the test data than on the training data because the feature detectors have been tuned to work well together on the training data but not on the test data.

The standard way to reduce error on the test set is to train many separate networks and then apply each of these networks to the test data.

Overfitting also can be reduced by using “dropout” to prevent complex co-adaptations( 参数之间的关联性 ) on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. It’s also can prevent the weight from growing too large.

At test time, they use the “mean network” that contains all of the hidden units but with their outgoing weights halved to compensate for the fact that twice as many of them are active.

Dropout is kind of like “bagging” in which different models are trained on different random selections of cases from the training set and all models are given equal weight in the combination(Like what did in distributed ML). Dropout can be seen as an extreme form of bagging in which each model is trained on a single case and each parameter of the model is very strongly regularized by sharing it with the corresponding parameter in all the other models.

Useful references