Thursday, March 14, 2019

Derive the Forward and Backward Formulas of a 3-Layer MLP, and Use It to Judge the Population Count


※ The code corresponding to this post is here.

      Deep learning is based on the multi-layer perceptron (MLP), a structure composed of multiple connected layers with activation functions. Thoroughly understanding how a simple MLP works is very profitable when designing deep learning models for practical applications.
      In this post, I will show how to use the simplest MLP to judge the population count (popcount) and explain the mathematics behind the procedures.
      To simplify the matrix operations, I use Python with the NumPy package. Benefiting from their convenience and simplicity, we can get rid of the pain of coding and focus on the problem itself and the mathematics.
      I chose popcount as the learning problem because we can generate as much input and label data (the answers) as we wish; it is not necessary to import anything from external sources. Besides, the problem is quite simple, so we can concentrate entirely on the MLP.

     The simplest MLP is shown in the figure below:


    There are three layers: the input layer has 3 nodes; the hidden layer has 8 nodes (also called neurons) with a rectified linear unit (ReLU) activation; the output layer has 6 nodes, to which a softmax function is applied. This MLP structure will be used for our popcount problem, with 64 neurons in the hidden layer.
 
    A deep learning training process feeds input data through a forward pass to get a prediction, then puts the true answer together with the prediction into a backward pass to update the parameters between the layers. This process is repeated many times. Once the predictions match the true answers at a high ratio, the training is considered complete. Thus, we need to understand the input and output format, how the forward pass works, and how the parameters are updated in the backward pass.



   零. popcount:
 
    The popcount of a specific value is the number of set bits, i.e., how many bits are '1' in that value. For example, 5 in binary is 0b101, so the popcount is 2; 11 is 0b1011, thus the popcount is 3. The popcount function is given below; it generates the label data as the answers for training.

import copy

def Popcount(value):
    # count how many bits are set to 1 in the given value
    val = copy.deepcopy(value)
    count = 0

    while val > 0:
        if val & 0x01:
            count += 1
        val >>= 1

    return count

    To feed the MLP, we use an array to represent the input bits; for the output, we use one-hot encoding. Below, the input is 0b01010100 and the output is 3.
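A minimal sketch of this encoding is given below. The helper names ToBitArray and ToOneHot, the least-significant-bit-first ordering, and the assumption that the output has num_bits + 1 classes (popcount 0 through num_bits) are mine, not taken from the post's code.

import numpy as np

NUM_BITS = 8

def ToBitArray(value, num_bits=NUM_BITS):
    # e.g. 0b01010100 -> [0, 0, 1, 0, 1, 0, 1, 0]  (least significant bit first)
    return np.array([(value >> i) & 0x01 for i in range(num_bits)], dtype=np.float64)

def ToOneHot(count, num_classes=NUM_BITS + 1):
    # e.g. popcount 3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0]
    label = np.zeros(num_classes)
    label[count] = 1.0
    return label

x = ToBitArray(0b01010100)          # input vector for the MLP
y = ToOneHot(Popcount(0b01010100))  # one-hot label; the popcount is 3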



一. Forward pass, to get the prediction

The process in mathematics is :

  \[
O_{j}^{in} = \sum_iW_{ji}^{1}O_{i} + b_{j}^{1}\\
O_{j}^{out}= ReLU(O_{j}^{in}) \\
O_{k}^{in} = \sum_jW_{kj}^{2}O_{j}^{out} + b_{k}^{2} \\
O_{k}^{out}= softmax(O_{k}^{in})
\]
where \(softmax(x_{k})\) is defined as \(\frac{e^{x_{k}}}{\sum_{k'}e^{x_{k'}}}\), and \(ReLU(x) = \begin{cases}x & x \ge 0\\ 0 & x \lt 0\end{cases}\)

      The weights (\(W\), a matrix) and the biases (\(b\), a vector) are the values we want to compute. Initially, we set them to random values; they keep being updated through the forward/backward iterations.
   
※ The purpose of applying the activation functions (\(ReLU(x)\) and \(softmax(x_{k})\)) is to make the relation between the input and output nonlinear: \(f(c_1x_1+c_2x_2+...+c_nx_n+b) \neq c_1f(x_1)+c_2f(x_2)+...+c_nf(x_n)+b\). Nonlinearity is ubiquitous in physical phenomena, but matrix operations are purely linear. Therefore, for a neural network to approximate nature, it is requisite to place a nonlinear operation after each matrix operation. These nonlinear operations are called activation functions in deep learning. For more discussion about the purpose of activation, please refer to here.
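As a tiny numeric check that ReLU indeed breaks linearity (the sample numbers are arbitrary and the scalar helper below is mine):

def ReLU(x):
    # scalar version of the same definition: pass positive values, zero out the rest
    return x if x > 0 else 0.0

x1, x2 = -1.0, 2.0
print(ReLU(x1 + x2))          # 1.0
print(ReLU(x1) + ReLU(x2))    # 2.0  -> ReLU(x1 + x2) != ReLU(x1) + ReLU(x2)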
  
    Below is the code corresponding to the forward pass.
def Softmax(self, val):
    # subtract the maximum before exponentiating to avert overflow
    exps = np.exp(val - np.max(val))
    return exps / np.sum(exps, axis=0)

def ReLU(self, val):
    # keep positive values, zero out the rest
    return (val > 0) * val

def Forward(self, x):
    self.x = x

    # hidden layer: O_j_in = W1·x + b1, then O_j_out = ReLU(O_j_in)
    self.z1 = np.dot(self.W1, self.x)
    self.z1 += self.b1
    self.z2 = self.ReLU(self.z1)

    # output layer: O_k_in = W2·O_j_out + b2, then O_k_out = softmax(O_k_in)
    self.z3 = np.dot(self.W2, self.z2)
    self.z3 += self.b2

    self.out = self.Softmax(self.z3)

    return self.out

Beware the softmax function: the exponentials, in both the numerator and the denominator, have the maximum value of the input subtracted. This is to avert exponential arithmetic overflow.
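The listing above shows the methods of a network class but not how the class itself is set up. Below is a minimal sketch of the missing construction; the class name SimpleMLP, the column-vector shapes, and the random/zero initialization scheme are my assumptions, not the post's original code.

import numpy as np

class SimpleMLP:
    def __init__(self, num_input, num_hidden, num_output):
        # weights start as small random values, biases as zeros;
        # all vectors are treated as columns of shape (n, 1)
        self.W1 = np.random.randn(num_hidden, num_input) * 0.1
        self.b1 = np.zeros((num_hidden, 1))
        self.W2 = np.random.randn(num_output, num_hidden) * 0.1
        self.b2 = np.zeros((num_output, 1))

    # ... Softmax, ReLU, Forward (above) and Backward (below) are placed here ...

# for the 8-bit popcount problem: 8 input bits, 64 hidden neurons, 9 classes (popcount 0..8)
mlp = SimpleMLP(num_input=8, num_hidden=64, num_output=9)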



二. Backward pass, to update the weights and biases

    Consider an N-dimensional potential function \(U(p,q,..)\) in a Cartesian coordinate system. The gradient of this potential with a negative sign, \( -\triangledown U  =-(\frac{\partial U}{\partial p}\hat{p} +\frac{\partial U}{\partial q}\hat{q} +  ..) \), is the direction that decreases the potential fastest. Assume we are at position \( \vec{r_0} = \vec{r}|_{p_0, q_0,..}\) at the beginning and we would like to move to the lowest location. We could move our position step by step:

\[ \vec{r_1}=\vec{r_0} - \triangle \vec{r_0} , \ \ \ \triangle \vec{r_0} = \gamma \cdot \triangledown U|_{r=r_0}
\\ \Rightarrow  \vec{r_2}=\vec{r_1} - \triangle \vec{r_1} , \ \ \ \triangle \vec{r_1}=\gamma \cdot \triangledown U|_{r=r_1}
 \\
\Rightarrow  \vec{r_3}=\vec{r_2} - \triangle \vec{r_2} , \ \ \ \triangle \vec{r_2}=\gamma \cdot \triangledown U|_{r=r_2}  \\ :
\] where \(\gamma\) is a small positive value (the learning rate).


※ This figure is originally from here.

    Obviously, it is an iterative process to seek the lowest location:
\(\vec{r_{n+1}}= \vec{r_n} - \triangle \vec{r_n} \), where \(  \triangle \vec{r_n} = \gamma \cdot  {\triangledown U}|_{r = r_n}\). This method is called gradient descent.

※ If you are not familiar with the operations of vector analysis, it reads, for a specific direction \(\hat {q}\): \(q_{n+1} = q_n - \triangle q_n = q_n - \gamma \cdot {\frac{\partial U}{\partial q}}|_{q_n} \)
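To make the iteration concrete, here is a minimal 1-D sketch; the potential \(U(q) = (q-2)^2\), whose minimum sits at \(q = 2\), and the step count are my own choices for illustration.

def dU_dq(q):
    # derivative of U(q) = (q - 2)^2
    return 2.0 * (q - 2.0)

gamma = 0.1   # the small step size
q = 0.0       # starting position q_0
for _ in range(50):
    q = q - gamma * dU_dq(q)   # q_{n+1} = q_n - gamma * dU/dq |_{q_n}

print(q)      # converges toward 2.0, the minimum of U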


    The loss function corresponding to \(O_{k}^{out}\) (the output of \(softmax(O_{k}^{in})\)) is:
\[ E \equiv -\sum_ky_{k}\cdot ln(O_{k}^{out}) \]This loss function is also called the cross-entropy function.

    Where \(O_{k}^{out}\) is the output of the forward pass (the prediction) and \(y_{k}\) is the label data (the true answer). There are some interesting characteristics in this function \(E\): while the prediction perfectly hits the true answer (\(O_{k}^{out}|_{y_k=1}=1\)), \(E\) is zero; when the prediction is totally wrong (\(O_{k}^{out}|_{y_k=1}=0\)), \(E\) is \(\infty\). We treat \(E\) as a potential function and use the gradient descent method to seek the parameters corresponding to the minimum (which is \(0\), where the prediction perfectly hits the true answer). Since \(E\) is a function of the \(W_{ji}^{1}, W_{kj}^{2}, b_{j}^1, b_{k}^2\) sets, we need to update them in the backward pass:

 \[  \rightarrow {W_{ji}^{1}}|_{n+1}   = [W_{ji}^{1} -\gamma \cdot \frac{\partial E}{ \partial W_{ji}^{1} }]|_{n} \\
 \rightarrow b^1_{j}|_{n+1} = [b^1_{j} - \gamma \cdot \frac{\partial E}{\partial b^1_{j}}]|_{n} \\
\rightarrow {W_{kj}^{2}}|_{n+1}   = [W_{kj}^{2} -\gamma \cdot \frac{\partial E}{ \partial W_{kj}^{2} }]|_{n}  \\
 \rightarrow b^2_{k}|_{n+1} = {[ b^2_{k} - \gamma \cdot \frac{\partial E}{\partial b^2_{k}} ]}|_{n}
\]
※ Do not confuse the ordinal number with a power. Here, \( W^2_{kj} \) means the second weight set with indices \(k,j\), not \(W_{kj} \cdot W_{kj} \).

    So the backward pass requires the derivatives \( \frac{\partial E}{ \partial W_{ji}^{1} } \), \(\frac{\partial E}{\partial b^1_{j}}\), \( \frac{\partial E}{ \partial W_{kj}^{2} } \), and \( \frac{\partial E}{\partial b^2_{k}} \).


1. Derive \(\frac{\partial E}{\partial W_{kj}^{2}} \) and \(\frac{\partial E}{\partial b_{k}^{2}} \), for updating the weights and biases between the hidden layer and the output layer.


From the chain rule, we have :
 \[ \frac{\partial E}{\partial W_{kj}^{2}} = \frac{\partial E}{\partial O_{k}^{in}} \cdot \frac{\partial O_{k}^{in}}{\partial W_{kj}^{2}} \\ \]

For \( \frac{\partial O_{k}^{in}}{\partial W_{kj}^{2}} \) : \[ \frac{\partial O_{k}^{in}}{\partial W_{kj}^{2}} = \frac{\partial }{\partial W_{kj}^{2}}(\sum_jW_{kj}^{2}\cdot O_j^{out} + b_k^{2}) = \color{orange} { (O_{j}^{out})^T} \]
 For \( \frac{\partial E}{\partial O_{k}^{in}} \) :
\[
\frac{\partial E}{\partial O_{k}^{in}} = \frac{\partial }{\partial O_{k}^{in}}(-\sum_{k'} y_{k'} \cdot ln(O_{k'}^{out})) = -\sum_{k'} y_{k'}\frac{\partial ln(O_{k'}^{out})}{\partial O_{k}^{in}}
\]
Now we define \( \Omega \equiv \sum_k e^{O_k^{in}}\), so \( O_k^{out}= \frac{ e^{O_k^{in}}}{\Omega}\)
 To derive \( \frac{\partial}{\partial O_k^{in}} (ln(O_{k'}^{out})) \), beware that each term \( O_{k'}^{out}\) does not depend on only one term of \( O_k^{in}\); it is necessary to take all terms of \( O_k^{in}\) into account:
\[\frac{\partial}{\partial O_k^{in}} (ln(O_{k'}^{out})) = \frac{\partial }{\partial O_k^{in}}ln(\frac{ e^{O^{in}_{k'}}}{\Omega}) =\frac{\partial}{\partial O_k^{in}}(O^{in}_{k'} -ln\Omega) = \delta_{k,k'} - \frac{\partial}{\partial O_k^{in}}(ln\Omega) \\

\because \frac{\partial}{\partial O_k^{in}}(ln\Omega) =\frac{e^{O_k^{in}}}{\Omega} = O_k^{out} \\ \therefore \frac{\partial E}{\partial O_k^{in}}= -\sum_{k'} y_{k'} \cdot \frac{\partial}{\partial O_k^{in}}( ln(O_{k'}^{out})) = -\sum_{k'}y_{k'}[\delta_{k, k'} - O_{k}^{out}] \\

\rightarrow \frac{\partial E}{\partial O_k^{in}} = -(y_{k} - O_{k}^{out} \cdot \sum_{k'}y_{k'}) \]
And \( \sum_{k'}y_{k'} = 1\), by the one-hot encoding assumption.
\[ \rightarrow \frac{\partial E}{\partial O_k^{in}}= \color{orange} { O_{k}^{out} - y_{k}} \]
 Finally :
 \[  \triangle W_{kj}^2 = \gamma \cdot \frac{\partial E}{\partial W^2_{kj}} = \color{yellow} {\gamma \cdot (O_{k}^{out} - y_{k} ) \cdot (O_{j}^{out})^T } \]
For \(\triangle b_k^2 \), a similar derivation gives \( \triangle b_k^2 = \gamma \cdot \frac{\partial E}{\partial b^2_{k}} = \color{yellow} { \gamma \cdot (O_{k}^{out} - y_{k} )} \)
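The \(O_{k}^{out} - y_{k}\) result can also be sanity-checked against a central finite difference of the loss; in the sketch below the helper names and test vectors are my own, arbitrary choices.

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def cross_entropy_of_logits(v, y):
    # E = -sum_k y_k * ln(softmax(v)_k)
    return -np.sum(y * np.log(softmax(v)))

o_in = np.array([0.5, -1.0, 2.0, 0.3])
y    = np.array([0.0,  0.0, 1.0, 0.0])

analytic = softmax(o_in) - y               # the derived dE/dO_in

eps = 1e-6
numeric = np.zeros_like(o_in)
for k in range(o_in.size):
    d = np.zeros_like(o_in)
    d[k] = eps
    numeric[k] = (cross_entropy_of_logits(o_in + d, y) -
                  cross_entropy_of_logits(o_in - d, y)) / (2.0 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True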


2.  Updating the weights and biases between the input layer and the hidden layer.


Write \(\frac{\partial E}{\partial W_{ji}^1}\) as :
\[ \frac{\partial E}{\partial W_{ji}^1} = \frac{\partial E}{\partial O_{k}^{in}} \cdot \frac{\partial O_{k}^{in}}{\partial O_{j}^{out}}\cdot \frac{\partial O_{j}^{out}}{\partial O_{j}^{in}}\cdot \frac{\partial O_{j}^{in}}{\partial W_{ji}^1} \]
 For \(\frac{\partial E}{\partial O_{k}^{in}}\), we already know it is \( \color{orange} { O_{k}^{out} - y_{k}} \).

For \( \frac{\partial O_{k}^{in}}{\partial O_{j}^{out}} \):
\[ \frac{\partial O_{k}^{in}}{\partial O_{j}^{out}} = \frac{\partial }{\partial O_{j}^{out}}(\sum_{j}W_{kj}^2 O_{j}^{out} + b_k^{2}) = \color{orange} { (W_{kj}^2)^T} \]

For \( \frac{\partial O_{j}^{out}}{\partial O_{j}^{in}} \):
\[ \frac{\partial O_{j}^{out}}{\partial O_{j}^{in}} = \frac{\partial ReLU(O_{j}^{in})}{\partial O_{j}^{in}} = \frac{d }{d O_{j}^{in}}ReLU(O_{j}^{in}) =\color{orange} { \frac{d ReLU(x)}{dx} |_{x=O_{j}^{in}}} \] where \(\frac{d ReLU(x)}{dx}=\begin{cases}0 & x < 0\\1 & x\geq 0\end{cases}\)

For \(\frac{\partial O_{j}^{in}}{\partial W_{ji}^1} \):
\[ \frac{\partial O_{j}^{in}}{\partial W_{ji}^1} = \frac{\partial }{\partial W_{ji}^1}(\sum_iW_{ji}^1O_i + b^1_j) = \color{orange} {(O_i)^T} \]

 Put them together: \[ \triangle W_{ji}^1 = \gamma \cdot \frac{\partial E}{\partial W_{ji}^1} = \color{yellow} {\gamma \cdot (W_{kj}^2)^T \cdot (O_{k}^{out} - y_{k}) \cdot \frac{d ReLU(x)}{dx} |_{x=O_{j}^{in}} \cdot (O_i)^T} \]
Similarly, we get \( \triangle b_j^1 = \gamma \cdot \frac{\partial E}{\partial b_{j}^1} = \color{yellow} { \gamma \cdot (W_{kj}^2)^T \cdot (O_{k}^{out} - y_{k}) \cdot \frac{d ReLU(x)}{dx} |_{x=O_{j}^{in}}} \)

※ The formula for computing \(\triangle W \) and \( \triangle b \) is called the delta rule.

    Though the derivations are a little bit complicated, the code corresponding to the backward pass is pretty simple:
def cross_entropy(self, y):
    # general form: -np.sum(y * np.log(self.out));
    # with a one-hot y, only the term where y[i] == 1 contributes
    for i in range(y.size):
        if 0 != y[i]:
            return -np.log(self.out[i])

def deRelu(self, z):
    # derivative of ReLU: 1 where the input is positive, 0 elsewhere
    return (z > 0) * 1

def Backward(self, y):

    loss = self.cross_entropy(y)

    # dE/dO_k_in = O_k_out - y_k
    out_error = self.out - y

    z3_error = out_error
    # propagate the error back through W2 and the ReLU derivative
    z2_error = self.W2.T.dot(z3_error)
    z1_error = z2_error * self.deRelu(self.z2)

    # delta rule: outer products of the errors with the layer inputs
    z3_W_delta = z3_error.dot(self.z2.T)
    z3_b_delta = z3_error

    z1_W_delta = z1_error.dot(self.x.T)
    z1_b_delta = z1_error

    # gradient descent update; lr plays the role of gamma in the formulas
    lr = 5e-3
    self.W2 -= z3_W_delta * lr
    self.W1 -= z1_W_delta * lr

    self.b2 -= z3_b_delta * lr
    self.b1 -= z1_b_delta * lr

    return loss
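Assuming the Forward and Backward methods above are collected into a class such as the SimpleMLP sketch shown earlier (that assembly is my assumption), one training step on a single sample looks like this:

import numpy as np

value = 0b01010100
x = np.array([(value >> i) & 0x01 for i in range(8)], dtype=np.float64).reshape(-1, 1)
y = np.zeros((9, 1))
y[3] = 1.0                      # popcount of 0b01010100 is 3

prediction = mlp.Forward(x)     # forward pass: the predicted class is np.argmax(prediction)
loss = mlp.Backward(y)          # backward pass: updates W1, b1, W2 and b2 in place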



三. Training and Prediction

      We use a fixed dataset for the training. When the entire dataset has been passed forward and backward once, it is called an epoch. Because the gradient descent method updates the weights and biases slowly, it is necessary to run multiple epochs to achieve an acceptable learning outcome.
      Once the error rate on the training data is less than a criterion value (I set it to 0.2%), or the loss function value is less than another criterion (\(10^{-4}\) for me), the training is done. We use another 100 samples for testing to evaluate the prediction error rate.
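Below is a sketch of such a training loop for the 8-bit case with the two stopping criteria above. The dataset size, the epoch limit, the interpretation of the \(10^{-4}\) threshold as a mean per-sample loss, and the mlp object (from the earlier SimpleMLP sketch) are my assumptions.

import numpy as np

NUM_BITS = 8
train_values = np.random.randint(0, 2 ** NUM_BITS, size=2000)   # fixed training dataset

for epoch in range(10000):
    errors = 0
    total_loss = 0.0

    for value in train_values:
        # encode the input bits and the one-hot popcount label
        x = np.array([(int(value) >> i) & 0x01 for i in range(NUM_BITS)],
                     dtype=np.float64).reshape(-1, 1)
        answer = Popcount(int(value))
        y = np.zeros((NUM_BITS + 1, 1))
        y[answer] = 1.0

        prediction = mlp.Forward(x)
        total_loss += float(mlp.Backward(y))
        if np.argmax(prediction) != answer:
            errors += 1

    # stop once the training error rate < 0.2% or the mean loss < 1e-4
    if errors / train_values.size < 0.002 or total_loss / train_values.size < 1e-4:
        break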

For 8 bits:

The prediction error rate of 8 bits is 1%.



For 16 bits:


The prediction error rate of 16 bits is 3%.


For 31 bits:


The prediction error rate of  31 bits is 9%.

     The results are acceptable. Beware that the more bits there are, the more diverse the bit combinations become. To represent this diversity, it is necessary to use more data for the training.



四. Summary:
     Hand-making a neural network is immensely helpful for analyzing actual deep learning issues, even when we use existing frameworks (TensorFlow, Keras, Caffe, and other similar tools). Before coding, we must understand what we are going to do, or we will end up suspecting a bug where there is none.


Reference:
   Make Your Own Neural Network, Tariq Rashid, 2016. (Simplified Chinese edition 2018, translator: 林賜)
   https://stats.stackexchange.com/questions/235528/backpropagation-with-softmax-cross-entropy
   https://medium.com/@14prakash/back-propagation-is-very-simple-who-made-it-complicated-97b794c97e5c
   
