*I haven't seen an answer from a trusted source, but I'll try to answer this myself, with a simple example (with my current knowledge).*

In general, note that training an MLP using back-propagation is usually implemented with matrices.

### Time complexity of matrix multiplication

The time complexity of the matrix multiplication $M_{ij} \ast M_{jk}$ (where $M_{ij}$ has $i$ rows and $j$ columns) is simply $\mathcal{O}(i \ast j \ast k)$.

Notice that we are assuming the simplest (schoolbook) multiplication algorithm here; there exist other algorithms with somewhat better time complexity.
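To make the $\mathcal{O}(i \ast j \ast k)$ count concrete, here is a minimal sketch of schoolbook matrix multiplication in NumPy (the function name and test shapes are illustrative, not from any particular library):

```python
import numpy as np

def naive_matmul(A, B):
    """Schoolbook multiplication of A (i x j) by B (j x k).

    The three nested loops run i, k and j times respectively,
    giving i * j * k scalar multiply-adds in total."""
    i, j = A.shape
    j2, k = B.shape
    assert j == j2, "inner dimensions must match"
    C = np.zeros((i, k))
    for a in range(i):          # i iterations
        for c in range(k):      # k iterations
            for b in range(j):  # j iterations -> i * j * k total
                C[a, c] += A[a, b] * B[b, c]
    return C

# Sanity check against NumPy's optimized implementation
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
assert np.allclose(naive_matmul(A, B), A @ B)
```

In practice you would of course call `A @ B` and let the BLAS backend do the work, but the asymptotic count is the same.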

### Feedforward pass algorithm

The feedforward propagation algorithm proceeds as follows.

First, to go from layer $i$ to layer $j$, you do

$$S_j = W_{ji} \ast Z_i$$

Then you apply the activation function

$$Z_j = f(S_j)$$

If we have $N$ layers (including the input and output layers), this will run $N-1$ times.
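The two steps above can be sketched as a short NumPy loop. The sigmoid activation and the layer sizes are assumptions for illustration; any element-wise $f$ would do:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(weights, Z0):
    """Feedforward pass: for each of the N-1 weight matrices,
    compute S_j = W_ji @ Z_i, then Z_j = f(S_j)."""
    activations = [Z0]
    for W in weights:                 # runs N-1 times for N layers
        S = W @ activations[-1]       # S_j = W_ji * Z_i
        activations.append(sigmoid(S))
    return activations

# Hypothetical layer sizes i=4, j=5, k=3, l=2 and a single example
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)),
           rng.standard_normal((3, 5)),
           rng.standard_normal((2, 3))]
Z0 = rng.standard_normal((4, 1))
acts = forward(weights, Z0)
```

Each column vector `Z` has as many rows as its layer has nodes; with a batch, `Z0` simply gets one column per training example.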

### Example

As an example, let's compute the time complexity for the forward pass algorithm for an MLP with $4$ layers, where $i$ denotes the number of nodes of the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer and $l$ the number of nodes in the output layer.

There are thus $3$ weight matrices: $W_{ji}$, $W_{kj}$ and $W_{lk}$. Here $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).

Assume you have $t$ training examples. For propagating from layer $i$ to $j$, we first have

$$S_{jt} = W_{ji} \ast Z_{it}$$

and this operation has $\mathcal{O}(j \ast i \ast t)$ time complexity. Then we apply the activation function

$$Z_{jt} = f(S_{jt})$$

and this has $\mathcal{O}(j \ast t)$ time complexity, because it is an element-wise operation.

So, in total, we have

$$\mathcal{O}(j \ast i \ast t + j \ast t) = \mathcal{O}(j \ast t \ast (i + 1)) = \mathcal{O}(j \ast i \ast t)$$

Using the same logic, for going $j \to k$, we have $\mathcal{O}(k \ast j \ast t)$, and, for $k \to l$, we have $\mathcal{O}(l \ast k \ast t)$.

In total, the time complexity for feedforward propagation will be

$$\mathcal{O}(j \ast i \ast t + k \ast j \ast t + l \ast k \ast t) = \mathcal{O}(t \ast (ij + jk + kl))$$

This cannot be simplified further in any meaningful way. In particular, it is not $\mathcal{O}(t \ast i \ast j \ast k \ast l)$: the cost is a sum of per-layer products, and the single product of all layer sizes would grossly overstate it.
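A quick numeric check of that last point, with some hypothetical layer sizes (these numbers are only for illustration):

```python
# Hypothetical layer sizes and number of training examples
i, j, k, l, t = 10, 8, 6, 4, 100

# Multiplications performed by the three matrix products
# of the forward pass, per the derivation above
cost = t * (i * j + j * k + k * l)

# The single product of all sizes, for comparison
product = t * i * j * k * l

assert cost == 15200
assert product == 192000
assert cost < product  # a sum of products, far smaller than one big product
```

So the sum-of-products form $t \ast (ij + jk + kl)$ is the right way to state the bound.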

### Back-propagation algorithm

The back-propagation algorithm proceeds as follows. Starting from the output layer $l \to k$, we compute the error signal $E_{lt}$, a matrix containing the error signals for the nodes at layer $l$

$$E_{lt} = f'(S_{lt}) \odot (Z_{lt} - O_{lt})$$

where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns: each column is simply the error signal for one training example.

We then compute the "delta weights" $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$)

$$D_{lk} = E_{lt} \ast Z_{tk}$$

where $Z_{tk}$ is the transpose of $Z_{kt}$.

We then adjust the weights

$$W_{lk} = W_{lk} - D_{lk}$$

For $l \to k$, we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l \ast t \ast k)$.

Now, going back from $k \to j$, we first have

$$E_{kt} = f'(S_{kt}) \odot (W_{kl} \ast E_{lt})$$

Then

$$D_{kj} = E_{kt} \ast Z_{tj}$$

And then

$$W_{kj} = W_{kj} - D_{kj}$$

where $W_{kl}$ is the transpose of $W_{lk}$. For $k \to j$, we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k \ast t \ast (l + j))$.
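The output-layer step of back-propagation can be sketched in a few NumPy lines. The sigmoid derivative, layer sizes and random targets below are assumptions for illustration:

```python
import numpy as np

def dsigmoid(s):
    """Derivative of the sigmoid: f'(s) = f(s) * (1 - f(s))."""
    f = 1.0 / (1.0 + np.exp(-s))
    return f * (1.0 - f)

rng = np.random.default_rng(1)
k_, l_, t_ = 6, 4, 5                  # hypothetical layer sizes, batch size

S_l = rng.standard_normal((l_, t_))   # pre-activations at the output layer
Z_l = 1.0 / (1.0 + np.exp(-S_l))      # activations Z_lt = f(S_lt)
O_l = rng.random((l_, t_))            # target outputs O_lt
Z_k = rng.standard_normal((k_, t_))   # activations at layer k
W_lk = rng.standard_normal((l_, k_))  # weights from layer k to layer l

E_l = dsigmoid(S_l) * (Z_l - O_l)     # error signal: element-wise, O(l*t)
D_lk = E_l @ Z_k.T                    # delta weights: O(l*t*k), dominant term
W_lk = W_lk - D_lk                    # weight update: element-wise, O(l*k)
```

Note that the one matrix product (`E_l @ Z_k.T`) dominates, which is why the per-layer cost collapses to $\mathcal{O}(l \ast t \ast k)$.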

And finally, for $j \to i$, we have $\mathcal{O}(j \ast t \ast (k + i))$. In total, we have

$$\mathcal{O}(ltk + tk(l + j) + tj(k + i)) = \mathcal{O}(t \ast (lk + kj + ji))$$

which is the same as the feedforward pass algorithm. Since they are the same, the total time complexity for one epoch will be

$$\mathcal{O}(t \ast (ij + jk + kl)).$$

This time complexity is then multiplied by the number of iterations (epochs). So, we have

$$\mathcal{O}(n \ast t \ast (ij + jk + kl)),$$

where $n$ is the number of iterations.
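Putting the forward and backward passes together, one epoch of batch gradient descent for the 4-layer example looks like this in NumPy. The sigmoid activation, layer sizes, and the learning rate `lr` (a standard scaling of the delta weights, not shown in the equations above) are all assumptions for illustration:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train(W1, W2, W3, X, Y, n_epochs, lr=0.1):
    """Batch gradient descent for a 4-layer MLP.

    Each epoch does one forward pass, O(t*(ij+jk+kl)), and one
    backward pass of the same order, so n epochs cost
    O(n * t * (ij + jk + kl))."""
    for _ in range(n_epochs):              # n iterations
        # Forward pass
        Z1 = sigmoid(W1 @ X)               # O(j*i*t)
        Z2 = sigmoid(W2 @ Z1)              # O(k*j*t)
        Z3 = sigmoid(W3 @ Z2)              # O(l*k*t)
        # Backward pass; for the sigmoid, f'(S) = Z * (1 - Z)
        E3 = Z3 * (1 - Z3) * (Z3 - Y)      # O(l*t)
        E2 = Z2 * (1 - Z2) * (W3.T @ E3)   # O(k*l*t)
        E1 = Z1 * (1 - Z1) * (W2.T @ E2)   # O(j*k*t)
        W3 -= lr * (E3 @ Z2.T)             # O(l*t*k)
        W2 -= lr * (E2 @ Z1.T)             # O(k*t*j)
        W1 -= lr * (E1 @ X.T)              # O(j*t*i)
    return W1, W2, W3

# Hypothetical sizes: i=4, j=5, k=3, l=2 nodes, t=20 examples
rng = np.random.default_rng(2)
i_, j_, k_, l_, t_ = 4, 5, 3, 2, 20
W1 = 0.5 * rng.standard_normal((j_, i_))
W2 = 0.5 * rng.standard_normal((k_, j_))
W3 = 0.5 * rng.standard_normal((l_, k_))
X = rng.standard_normal((i_, t_))
Y = rng.random((l_, t_))
W1, W2, W3 = train(W1, W2, W3, X, Y, n_epochs=10)
```

Counting the matrix products in the loop confirms the bound: three products of cost $tij$, $tjk$, $tkl$ forward, and the same orders again backward.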

### Notes

Note that these matrix operations can be greatly parallelized by GPUs.

### Conclusion

We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$, $j$, $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt \ast (ij + jk + kl))$.

We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise; note that batch gradient descent is the general form, and with little modification it becomes stochastic or mini-batch gradient descent.)

Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise operations, and hence they will not affect the time complexity of the algorithm.
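To see why momentum is free asymptotically, here is a sketch of the standard momentum update (the shapes, `beta` and `lr` values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 6))  # hypothetical weight matrix W_lk
D = rng.standard_normal((4, 6))  # "delta weights" from back-propagation
V = np.zeros_like(W)             # velocity, same shape as W
beta, lr = 0.9, 0.1

# Both momentum operations are element-wise, O(l*k) per matrix,
# so they are dominated by the O(l*t*k) cost of computing D itself.
V = beta * V + lr * D
W = W - V
```

The extra scalar multiply and add per weight change the constant factor only, not the asymptotic bound.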

I'm not sure what the results would be using other optimizers such as RMSprop.

### Sources

The following article, http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5, describes an implementation using matrices. Although that implementation uses a "row major" convention, the time complexity is not affected by this.

If you're not familiar with back-propagation, check this article:

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4