I haven't seen an answer from a trusted source, but I'll try to answer it myself, with a simple example and to the best of my current knowledge.
In general, note that training an MLP using back-propagation is usually implemented with matrices.
### Time complexity of matrix multiplication
The time complexity of the matrix multiplication $M_{ij} * M_{jk}$ (an $i \times j$ matrix times a $j \times k$ matrix) is simply $\mathcal{O}(i*j*k)$.
Notice that we are assuming the simplest multiplication algorithm here: there exist other algorithms with somewhat better time complexity.
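To make the cubic cost concrete, here is a minimal sketch of the naive algorithm (not part of the derivation below; I'm using Python, with NumPy only for array storage, since the three nested loops are the point):

```python
import numpy as np

def naive_matmul(A, B):
    """Multiply an i x j matrix A by a j x k matrix B the naive way."""
    i, j = A.shape
    j2, k = B.shape
    assert j == j2, "inner dimensions must match"
    C = np.zeros((i, k))
    for a in range(i):          # i iterations
        for b in range(k):      # k iterations
            for c in range(j):  # j iterations, so i*j*k steps in total
                C[a, b] += A[a, c] * B[c, b]
    return C
```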
### Feedforward pass algorithm
The feedforward propagation algorithm proceeds as follows.
First, to go from layer $i$ to layer $j$, you do

$$S_j = W_{ji} * Z_i$$

Then you apply the activation function

$$Z_j = f(S_j)$$
If we have $N$ layers (including the input and output layers), this will run $N-1$ times.
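As a sketch of how little is involved, the whole pass is a single loop over the weight matrices; this is just an illustration, assuming NumPy and `np.tanh` as a placeholder activation:

```python
import numpy as np

def forward(weights, Z, f=np.tanh):
    """Forward pass. `weights` is a list of N-1 matrices, where
    weights[n] maps layer n to layer n+1; `Z` holds the inputs,
    one column per training example."""
    for W in weights:  # runs N-1 times for N layers
        S = W @ Z      # linear step: S_j = W_ji * Z_i
        Z = f(S)       # element-wise activation: Z_j = f(S_j)
    return Z
```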
As an example, let's compute the time complexity of the forward pass for an MLP with $4$ layers, where $i$ denotes the number of nodes in the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer, and $l$ the number of nodes in the output layer.
Since there are $4$ layers, you need $3$ weight matrices, $W_{ji}$, $W_{kj}$ and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).
Assume you have $t$ training examples. For propagating from layer $i$ to layer $j$, we first have

$$S_{jt} = W_{ji} * Z_{it}$$

and this operation (i.e. matrix multiplication) has $\mathcal{O}(j*i*t)$ time complexity. Then we apply the activation function

$$Z_{jt} = f(S_{jt})$$

and this has $\mathcal{O}(j*t)$ time complexity, because it is an element-wise operation.
So, in total, we have

$$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(i+1)) = \mathcal{O}(j*i*t)$$
Using the same logic, for going $j \to k$, we have $\mathcal{O}(k*j*t)$, and, for $k \to l$, we have $\mathcal{O}(l*k*t)$.
In total, the time complexity for feedforward propagation will be

$$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$$
I'm not sure this can be simplified much further without assumptions about the relative layer sizes; $\mathcal{O}(t*i*j*k*l)$ is an upper bound, but a loose one.
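Here is the same counting done numerically, with made-up layer sizes (the values of $i, j, k, l, t$ below are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hypothetical sizes for the 4-layer example.
i, j, k, l, t = 784, 128, 64, 10, 1000

W_ji = np.random.randn(j, i)  # weights from layer i to layer j
W_kj = np.random.randn(k, j)  # weights from layer j to layer k
W_lk = np.random.randn(l, k)  # weights from layer k to layer l
Z_it = np.random.randn(i, t)  # inputs, one column per training example

S_jt = W_ji @ Z_it; Z_jt = np.tanh(S_jt)  # O(j*i*t) + O(j*t)
S_kt = W_kj @ Z_jt; Z_kt = np.tanh(S_kt)  # O(k*j*t) + O(k*t)
S_lt = W_lk @ Z_kt; Z_lt = np.tanh(S_lt)  # O(l*k*t) + O(l*t)

# Scalar multiplications done by the three matrix products,
# matching O(t*(ij + jk + kl)):
print(t * (i*j + j*k + k*l))  # 109184000 for these sizes
```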
### Back-propagation pass algorithm

The back-propagation algorithm proceeds as follows. Starting from the output layer $l \to k$, we compute the error signal $E_{lt}$, a matrix containing the error signals for the nodes at layer $l$:

$$E_{lt} = f'(S_{lt}) \odot (Z_{lt} - O_{lt})$$

where $\odot$ means element-wise multiplication and $O_{lt}$ is the matrix of target outputs. Note that $E_{lt}$ has $l$ rows and $t$ columns: it simply means each column is the error signal for one training example.
We then compute the "delta weights" $D_{lk}$ (between layer $l$ and layer $k$)

$$D_{lk} = E_{lt} * Z_{tk}$$

where $Z_{tk}$ is the transpose of $Z_{kt}$.

We then adjust the weights

$$W_{lk} = W_{lk} - D_{lk}$$
For $l \to k$, we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$.
Now, going back from $k \to j$. We first have

$$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$$

where $W_{kl}$ is the transpose of $W_{lk}$. For $k \to j$, we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t*(l+j))$.
And finally, for $j \to i$, we have $\mathcal{O}(j*t*(k+i))$. In total, we have

$$\mathcal{O}(ltk + tk(l+j) + tj(k+i)) = \mathcal{O}(t*(lk + kj + ji))$$
which is the same as the feedforward pass algorithm. Since they are the same, the total time complexity for one epoch will be

$$\mathcal{O}(t*(ij + jk + kl))$$
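Continuing the NumPy sketch from the forward-pass example (reusing `W_ji`, `Z_it`, `S_jt`, etc., and introducing a hypothetical target matrix `O_lt`), a backward pass with this cost might look like the following; it is an illustration of the equations above, not a reference implementation:

```python
def dtanh(S):
    """Derivative of the placeholder tanh activation, element-wise."""
    return 1.0 - np.tanh(S) ** 2

def backward(W_ji, W_kj, W_lk, Z_it, Z_jt, Z_kt, Z_lt,
             S_jt, S_kt, S_lt, O_lt):
    """One backward pass over the 4-layer example. A learning rate
    would normally scale the deltas; it is omitted, as in the text."""
    # Output layer, l -> k: error signal and delta weights.
    E_lt = dtanh(S_lt) * (Z_lt - O_lt)    # element-wise: O(l*t)
    D_lk = E_lt @ Z_kt.T                  # O(l*t*k)
    # Hidden layer, k -> j: propagate the error back through W_lk.
    E_kt = dtanh(S_kt) * (W_lk.T @ E_lt)  # O(k*l*t) + O(k*t)
    D_kj = E_kt @ Z_jt.T                  # O(k*t*j)
    # Hidden layer, j -> i.
    E_jt = dtanh(S_jt) * (W_kj.T @ E_kt)  # O(j*k*t) + O(j*t)
    D_ji = E_jt @ Z_it.T                  # O(j*t*i)
    # The weight updates themselves are element-wise.
    return W_ji - D_ji, W_kj - D_kj, W_lk - D_lk
```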
This per-epoch time complexity is then multiplied by the number of iterations (epochs). So, we have

$$\mathcal{O}(n*t*(ij + jk + kl))$$

where $n$ is the number of iterations.
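Putting the two passes together, a bare training loop might look like this (again reusing the sketch above; `n` and `O_lt` are placeholders, and this illustrates the cost only, since there is no learning rate or loss tracking):

```python
n = 100                        # number of iterations (epochs)
O_lt = np.random.randn(l, t)   # hypothetical targets

for epoch in range(n):                        # n iterations
    # Forward pass, keeping pre-activations: O(t*(ij + jk + kl))
    S_jt = W_ji @ Z_it; Z_jt = np.tanh(S_jt)
    S_kt = W_kj @ Z_jt; Z_kt = np.tanh(S_kt)
    S_lt = W_lk @ Z_kt; Z_lt = np.tanh(S_lt)
    # Backward pass: also O(t*(ij + jk + kl))
    W_ji, W_kj, W_lk = backward(W_ji, W_kj, W_lk, Z_it, Z_jt, Z_kt,
                                Z_lt, S_jt, S_kt, S_lt, O_lt)
# Total work: O(n*t*(ij + jk + kl))
```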
Note that these matrix operations can be greatly parallelized by GPUs.
We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$, $j$, $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(n*t*(ij + jk + kl))$.
We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: note that batch gradient descent is the general form; with little modification, it becomes stochastic or mini-batch.)
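One way to see this: mini-batch gradient descent only slices the $t$ columns into chunks and updates once per chunk, so each epoch still touches all $t$ examples. A sketch of the modification, with a made-up batch size `b` (setting `b = 1` gives stochastic gradient descent):

```python
b = 32  # hypothetical mini-batch size
for epoch in range(n):
    for start in range(0, t, b):
        X = Z_it[:, start:start + b]  # b columns instead of all t
        Y = O_lt[:, start:start + b]
        # ...same forward/backward as above, with t replaced by b.
        # There are t/b batches of cost O(b*(ij + jk + kl)) each,
        # so the per-epoch total is still O(t*(ij + jk + kl)).
```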
Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise, hence they do not affect the time complexity of the algorithm.
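For instance, the classical momentum update replaces each plain weight update with two element-wise operations per weight matrix (a sketch with a made-up coefficient `mu`; `D_lk` stands in for the deltas computed during the backward pass):

```python
mu = 0.9                      # hypothetical momentum coefficient
V_lk = np.zeros_like(W_lk)    # velocity, same shape as the weights
D_lk = np.zeros_like(W_lk)    # stands in for the backward-pass deltas

# The plain update W_lk = W_lk - D_lk becomes:
V_lk = mu * V_lk + D_lk       # element-wise: O(l*k)
W_lk = W_lk - V_lk            # element-wise: O(l*k)
```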
I'm not sure what the time complexity would be with other optimizers, such as RMSprop.
The following article, http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5, describes an implementation using matrices. Although that implementation is "row-major", the time complexity is not affected by this.
If you're not familiar with back-propagation, check this article: