
Friday, March 15, 2019

Derive the Forward and Backward Formulas of a 3-Layer MLP, and Use It to Judge the Population Count


※ The code corresponding to this post is here.

      Deep learning is based on the multi-layer perceptron (MLP), a structure composed of multiple fully connected layers with activation functions. Thoroughly understanding how a simple MLP works is very valuable when designing deep learning models for practical applications.
      In this post, I will show how to use the simplest MLP to judge the population count (popcount) and explain the mathematics behind the procedures.
     To simplify the matrix operations, I use Python with the NumPy package. Benefiting from their convenience and simplicity, we are freed from the pain of low-level coding and can focus on the problem itself and the mathematics.
     I choose popcount as the learning problem because we can generate as much input and label data (the answers) as we wish; it is not necessary to import data from an external source. Besides, the problem is quite simple, so we can concentrate entirely on the MLP.

     The simplest MLP is shown in the figure below:


    There are three layers: the input layer has 3 nodes; the hidden layer has 8 nodes (also called neurons) with a rectified linear unit (ReLU) activation; the output layer has 6 nodes, to which a softmax function is applied. This MLP structure, with 64 neurons in the hidden layer, is what we will use for our popcount problem.
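    As a rough sketch of what this structure amounts to in NumPy, the parameters could be allocated as below. The sizes here assume the 8-bit popcount case (8 input bits, 64 hidden neurons, 9 output classes for popcount 0 through 8) and a small random initialization; the exact sizes and initialization in the full code may differ.

import numpy as np

# A sketch of the parameter shapes for the popcount MLP (8-bit case assumed).
input_nodes, hidden_nodes, output_nodes = 8, 64, 9

W1 = np.random.randn(hidden_nodes, input_nodes) * 0.1   # weights: input layer -> hidden layer
b1 = np.zeros((hidden_nodes, 1))                        # biases of the hidden layer
W2 = np.random.randn(output_nodes, hidden_nodes) * 0.1  # weights: hidden layer -> output layer
b2 = np.zeros((output_nodes, 1))                        # biases of the output layer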
 
    A deep learning training process inputs data in a forward pass to get a prediction, then feeds the true answer together with the prediction into a backward pass to update the parameters between the layers. This process is repeated many times. Once the predictions match the true answers at a high ratio, the training is considered complete. Thus, we need to understand the input and output format, how the forward pass works, and how the parameters are updated in the backward pass.



   零. popcount:
 
    The popcount of a specific value is the number of set bits, i.e. how many bits are '1' in that value. For example: 5 in binary is 0b101, so the popcount is 2; 11 is 0b1011, thus the popcount is 3. The popcount function is given below; it is used to generate the label data as the answers for training.

import copy

def Popcount(value):
    # Count the set bits by checking the lowest bit and shifting right.
    val = copy.deepcopy(value)
    count = 0

    while val > 0:
        if (val & 0x01):
            count += 1
        val >>= 1

    return count
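    As a quick sanity check, the function reproduces the examples above:

# Quick checks matching the examples above.
assert Popcount(5) == 2    # 0b101  -> two set bits
assert Popcount(11) == 3   # 0b1011 -> three set bits
assert Popcount(0) == 0    # no set bits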

    To feed the MLP, we use an array to represent the input bits; for the output, we use one-hot encoding. The figure below shows the input being 0b01010100 and the output being 3.



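    In code, such an input array and one-hot label could be produced roughly as follows; the helper names ToBitArray and ToOneHot are only illustrative, not necessarily those used in the full code:

import numpy as np

def ToBitArray(value, n_bits):
    # Column vector of the bits of value, least significant bit first.
    return np.array([[(value >> i) & 1] for i in range(n_bits)], dtype=float)

def ToOneHot(count, n_classes):
    # One-hot column vector: index 'count' is 1, all other entries are 0.
    label = np.zeros((n_classes, 1))
    label[count] = 1.0
    return label

x = ToBitArray(0b01010100, 8)          # the 8 input bits
y = ToOneHot(Popcount(0b01010100), 9)  # one-hot label for popcount 3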
一. Forward passing, to get the prediction

The process in mathematics is :

$$O^{in}_j = \sum_i W^1_{ji}\,O_i + b^1_j$$
$$O^{out}_j = \mathrm{ReLU}(O^{in}_j)$$
$$O^{in}_k = \sum_j W^2_{kj}\,O^{out}_j + b^2_k$$
$$O^{out}_k = \mathrm{softmax}(O^{in}_k)$$
Where $\mathrm{softmax}(x_k)$ is defined as $\dfrac{e^{x_k}}{\sum_{k'} e^{x_{k'}}}$, and $\mathrm{ReLU}(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases}$

      The weight matrices ($W$) and bias vectors ($b$) are the values we want to compute. Initially, we set them to random values; they keep being updated through the forward/backward iterations.
   
※ The purpose of applying the activation functions ($\mathrm{ReLU}(x)$ and $\mathrm{softmax}(x_k)$) is to make the relation between the input and output nonlinear: $f(c_1 x_1 + c_2 x_2 + \dots + c_n x_n + b) \neq c_1 f(x_1) + c_2 f(x_2) + \dots + c_n f(x_n) + b$. Nonlinearity is ubiquitous in physical phenomena, but matrix operations are purely linear. Therefore, for neural networks to approximate nature, it is requisite to place a nonlinear operation after each matrix operation. These nonlinear operations are called activation functions in deep learning. For more discussion about the purpose of activation, please refer to here.
  
    Below is the code corresponding to the forward pass.
def Softmax(self, val):
    # Subtract the max before exponentiating, to avoid overflow (see the note below).
    exps = np.exp(val - np.max(val))
    return exps / np.sum(exps, axis = 0)

def ReLU(self, val):
    # Element-wise ReLU: keep positive values, zero out the rest.
    return (val > 0) * val

def Forward(self, x):
    self.x = x

    # Hidden layer: O_in_j = W1.x + b1, then O_out_j = ReLU(O_in_j)
    self.z1 = np.dot(self.W1, self.x)
    self.z1 += self.b1
    self.z2 = self.ReLU(self.z1)

    # Output layer: O_in_k = W2.O_out_j + b2, then O_out_k = softmax(O_in_k)
    self.z3 = np.dot(self.W2, self.z2)
    self.z3 += self.b2

    self.out = self.Softmax(self.z3)

    return self.out

Beware of the softmax function: the exponentials in both the numerator and the denominator have the maximum value of the input subtracted first. This is to avert arithmetic overflow of the exponential.
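A small illustration of the overflow issue (the values are arbitrary, chosen only to trigger it):

import numpy as np

val = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: np.exp(1000.0) overflows to inf, and inf/inf gives nan.
naive = np.exp(val) / np.sum(np.exp(val))     # -> [nan, nan, nan]

# Stabilized softmax: subtracting the max keeps every exponent <= 0.
exps = np.exp(val - np.max(val))
stable = exps / np.sum(exps)                  # -> approximately [0.090, 0.245, 0.665]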



二. Backward passing, to update the weights and biases

    Consider an N-dimensional potential function $U(p, q, \dots)$ in the Cartesian coordinate system. The gradient of this potential with a negative sign, $-\nabla U = -\left(\dfrac{\partial U}{\partial p}\hat{p} + \dfrac{\partial U}{\partial q}\hat{q} + \dots\right)$, is the direction that decreases the potential the fastest. Assume we are at position $\vec{r}_0 = \vec{r}\,|_{p_0, q_0, \dots}$ at the beginning, and we would like to move to the lowest location. We can move our position step by step:

$$\vec{r}_1 = \vec{r}_0 - \Delta\vec{r}_0, \quad \Delta\vec{r}_0 = \gamma\,\nabla U\,\big|_{\vec{r}=\vec{r}_0}$$
$$\vec{r}_2 = \vec{r}_1 - \Delta\vec{r}_1, \quad \Delta\vec{r}_1 = \gamma\,\nabla U\,\big|_{\vec{r}=\vec{r}_1}$$
$$\vec{r}_3 = \vec{r}_2 - \Delta\vec{r}_2, \quad \Delta\vec{r}_2 = \gamma\,\nabla U\,\big|_{\vec{r}=\vec{r}_2}$$
$$\vdots$$
where $\gamma$ is a small value.


※ The original of this figure is here.

    Obviously, seeking the lowest location is an iterative process:
$$\vec{r}_{n+1} = \vec{r}_n - \Delta\vec{r}_n, \quad \text{where } \Delta\vec{r}_n = \gamma\,\nabla U\,\big|_{\vec{r}=\vec{r}_n}$$
This method is called gradient descent.

※ If you are not familiar with the operations of vector analysis: for a specific direction $\hat{q}$, it is $q_{n+1} = q_n - \Delta q_n = q_n - \gamma\,\dfrac{\partial U(\dots, q)}{\partial q}\bigg|_{q_n}$.
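    A minimal one-dimensional sketch of this iteration, using a toy potential $F(q) = (q-3)^2$ invented here only for illustration:

# Gradient descent on the toy potential F(q) = (q - 3)^2, whose minimum is at q = 3.
def dF_dq(q):
    return 2.0 * (q - 3.0)          # analytic derivative of the toy potential

gamma = 0.1                         # the small step size
q = 0.0                             # starting position q_0
for n in range(100):
    q = q - gamma * dF_dq(q)        # q_{n+1} = q_n - gamma * dF/dq |_{q_n}

# q ends up very close to 3.0, the location of the minimum.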


    The loss function corresponding to $O^{out}_k$ (the output of $\mathrm{softmax}(O^{in}_k)$) is:
$$E \equiv -\sum_k y_k \ln(O^{out}_k)$$
This loss function is also called the cross-entropy function.

    Here $O^{out}_k$ is the output of the forward pass (the prediction) and $y_k$ is the label data (the true answer). This function $E$ has some interesting characteristics: when the prediction perfectly hits the true answer ($O^{out}_k|_{y_k=1} = 1$), $E$ is zero; when the prediction is totally wrong ($O^{out}_k|_{y_k=1} = 0$), $E$ is $\infty$. We treat $E$ as a potential function and use the gradient descent method to seek the parameters corresponding to its minimum (which is 0, where the prediction perfectly hits the true answer). Since $E$ is a function of the $W^1_{ji}, W^2_{kj}, b^1_j, b^2_k$ sets, we need to update them in the backward pass:

$$W^1_{ji}\big|_{n+1} = \left[W^1_{ji} - \gamma\frac{\partial E}{\partial W^1_{ji}}\right]\bigg|_{n} \qquad b^1_j\big|_{n+1} = \left[b^1_j - \gamma\frac{\partial E}{\partial b^1_j}\right]\bigg|_{n}$$
$$W^2_{kj}\big|_{n+1} = \left[W^2_{kj} - \gamma\frac{\partial E}{\partial W^2_{kj}}\right]\bigg|_{n} \qquad b^2_k\big|_{n+1} = \left[b^2_k - \gamma\frac{\partial E}{\partial b^2_k}\right]\bigg|_{n}$$
※ Do not confuse the ordinal number with a power. Here, $W^2_{kj}$ means the second weight set with indices $k, j$, not $W_{kj}\cdot W_{kj}$.
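    As an aside, the two extreme cases of $E$ described above can be checked numerically; a tiny example with made-up prediction vectors:

import numpy as np

def CrossEntropyExample(pred, y):
    # E = -sum_k y_k * ln(pred_k); for a one-hot y only the hot term survives.
    k = np.argmax(y)
    return -np.log(pred[k])

y = np.array([0.0, 1.0, 0.0])                                      # true answer, one-hot at index 1

E_good = CrossEntropyExample(np.array([0.01, 0.98, 0.01]), y)      # ~0.02: prediction nearly hits the answer
E_bad  = CrossEntropyExample(np.array([0.50, 0.0001, 0.4999]), y)  # ~9.2: E grows toward infinity as the hit probability -> 0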

    So, the backward pass requires the derivation of $\frac{\partial E}{\partial W^1_{ji}}$, $\frac{\partial E}{\partial b^1_j}$, $\frac{\partial E}{\partial W^2_{kj}}$, and $\frac{\partial E}{\partial b^2_k}$.


1. Derive $\frac{\partial E}{\partial W^2_{kj}}$ and $\frac{\partial E}{\partial b^2_k}$, for updating the weights and biases between the hidden layer and the output layer.


From the chain rule, we have :
$$\frac{\partial E}{\partial W^2_{kj}} = \frac{\partial E}{\partial O^{in}_k}\cdot\frac{\partial O^{in}_k}{\partial W^2_{kj}}$$

For $\frac{\partial O^{in}_k}{\partial W^2_{kj}}$:
$$\frac{\partial O^{in}_k}{\partial W^2_{kj}} = \frac{\partial}{\partial W^2_{kj}}\left(\sum_j W^2_{kj}\,O^{out}_j + b^2_k\right) = (O^{out}_j)^T$$

For $\frac{\partial E}{\partial O^{in}_k}$:
$$\frac{\partial E}{\partial O^{in}_k} = \frac{\partial}{\partial O^{in}_k}\left(-\sum_{k'} y_{k'}\ln(O^{out}_{k'})\right) = -\sum_{k'} y_{k'}\,\frac{\partial \ln(O^{out}_{k'})}{\partial O^{in}_k}$$

Now we define $\Omega \equiv \sum_{k'} e^{O^{in}_{k'}}$, so $O^{out}_k = \dfrac{e^{O^{in}_k}}{\Omega}$.

To derive $\frac{\partial}{\partial O^{in}_k}\ln(O^{out}_{k'})$, beware that each term of $O^{out}_{k'}$ does not depend on only one term of $O^{in}_k$; it is necessary to take all terms of $O^{in}_k$ into account:
$$\frac{\partial}{\partial O^{in}_k}\ln(O^{out}_{k'}) = \frac{\partial}{\partial O^{in}_k}\ln\left(\frac{e^{O^{in}_{k'}}}{\Omega}\right) = \frac{\partial}{\partial O^{in}_k}\left(O^{in}_{k'} - \ln\Omega\right) = \delta_{k,k'} - \frac{\partial}{\partial O^{in}_k}(\ln\Omega)$$
$$\frac{\partial}{\partial O^{in}_k}(\ln\Omega) = \frac{e^{O^{in}_k}}{\Omega} = O^{out}_k$$
$$\frac{\partial E}{\partial O^{in}_k} = -\sum_{k'} y_{k'}\,\frac{\partial}{\partial O^{in}_k}\ln(O^{out}_{k'}) = -\sum_{k'} y_{k'}\left[\delta_{k,k'} - O^{out}_k\right]$$
$$\frac{\partial E}{\partial O^{in}_k} = -\left(y_k - O^{out}_k\sum_{k'} y_{k'}\right)$$

And $\sum_{k'} y_{k'} = 1$, by the one-hot encoding assumption, so:
$$\frac{\partial E}{\partial O^{in}_k} = O^{out}_k - y_k$$

Finally:
$$\Delta W^2_{kj} = -\gamma\,\frac{\partial E}{\partial W^2_{kj}} = -\gamma\,(O^{out}_k - y_k)\,(O^{out}_j)^T$$

For $b^2_k$, by a similar derivation, we get $\Delta b^2_k = -\gamma\,\dfrac{\partial E}{\partial b^2_k} = -\gamma\,(O^{out}_k - y_k)$.
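    The key result $\frac{\partial E}{\partial O^{in}_k} = O^{out}_k - y_k$ can also be verified numerically with finite differences; a small self-contained check (random values, only for illustration):

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.random.randn(6)            # pretend these are the O_in_k of the output layer
y = np.zeros(6); y[2] = 1.0       # a one-hot label

analytic = softmax(z) - y         # the derived gradient: O_out_k - y_k

eps = 1e-6
numeric = np.zeros(6)
for k in range(6):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (loss(zp, y) - loss(zm, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be tiny, around 1e-9 or less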


2.  Updating the weights and biases between the input layer and the hidden layer.


Write $\frac{\partial E}{\partial W^1_{ji}}$ as:
$$\frac{\partial E}{\partial W^1_{ji}} = \frac{\partial E}{\partial O^{in}_k}\cdot\frac{\partial O^{in}_k}{\partial O^{out}_j}\cdot\frac{\partial O^{out}_j}{\partial O^{in}_j}\cdot\frac{\partial O^{in}_j}{\partial W^1_{ji}}$$

$\frac{\partial E}{\partial O^{in}_k} = O^{out}_k - y_k$ is already known.

For $\frac{\partial O^{in}_k}{\partial O^{out}_j}$:
$$\frac{\partial O^{in}_k}{\partial O^{out}_j} = \frac{\partial}{\partial O^{out}_j}\left(\sum_j W^2_{kj}\,O^{out}_j + b^2_k\right) = (W^2_{kj})^T$$

For $\frac{\partial O^{out}_j}{\partial O^{in}_j}$:
$$\frac{\partial O^{out}_j}{\partial O^{in}_j} = \frac{\partial\,\mathrm{ReLU}(O^{in}_j)}{\partial O^{in}_j} = \frac{d\,\mathrm{ReLU}(x)}{dx}\bigg|_{x=O^{in}_j}, \quad\text{where } \frac{d\,\mathrm{ReLU}(x)}{dx} = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$$

For $\frac{\partial O^{in}_j}{\partial W^1_{ji}}$:
$$\frac{\partial O^{in}_j}{\partial W^1_{ji}} = \frac{\partial}{\partial W^1_{ji}}\left(\sum_i W^1_{ji}\,O_i + b^1_j\right) = (O_i)^T$$

Put them together:
$$\Delta W^1_{ji} = -\gamma\,\frac{\partial E}{\partial W^1_{ji}} = -\gamma\,(W^2_{kj})^T(O^{out}_k - y_k)\,\frac{d\,\mathrm{ReLU}(x)}{dx}\bigg|_{x=O^{in}_j}\,(O_i)^T$$
Similarly, we get
$$\Delta b^1_j = -\gamma\,\frac{\partial E}{\partial b^1_j} = -\gamma\,(W^2_{kj})^T(O^{out}_k - y_k)\,\frac{d\,\mathrm{ReLU}(x)}{dx}\bigg|_{x=O^{in}_j}$$

※ The formulas for obtaining $\Delta W$ and $\Delta b$ are called the delta rule.

    Though the derivations are a bit complicated, the code corresponding to the backward pass is pretty simple:
def cross_entropy(self, y):
    # E = -sum_k y_k * ln(out_k); for a one-hot y, only the hot term contributes.
    #return -np.sum( y * np.log(self.out))
    for i in range(y.size):
        if (0 != y[i]):
            return -np.log(self.out[i])

def deRelu(self, z):
    # Derivative of ReLU: 1 where the activation is positive, 0 elsewhere.
    return (z > 0) * 1

def Backward(self, y):

    loss = self.cross_entropy(y)
    # dE/dO_in_k = O_out_k - y_k
    out_error = self.out - y

    z3_error = out_error
    # Propagate the error back through W2 and the ReLU derivative.
    z2_error = self.W2.T.dot(z3_error)
    z1_error = z2_error * self.deRelu(self.z2)

    # Delta rule: gradients of the weights and biases.
    z3_W_delta = z3_error.dot(self.z2.T)
    z3_b_delta = z3_error

    z1_W_delta = z1_error.dot(self.x.T)
    z1_b_delta = z1_error

    # Gradient descent update with learning rate gamma = 5e-3.
    lr = 5e-3
    self.W2 -= z3_W_delta * lr
    self.W1 -= z1_W_delta * lr

    self.b2 -= z3_b_delta * lr
    self.b1 -= z1_b_delta * lr

    return loss



三. Training and Prediction

      We use a fixed dataset for the training. When the entire dataset has been passed forward and backward once, it is called an epoch. Because the gradient descent method updates the weights and biases slowly, it is necessary to run multiple epochs to achieve an acceptable learning outcome.
      Once the error rate on the training data is less than a criterion value (I set it as 0.2%) or the loss function value is less than the other criterion ($10^{-4}$ for me), the training is done. We use another 100 data points for testing, to evaluate the prediction error rate.
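      A minimal sketch of such a training loop, assuming a class exposing the Forward/Backward methods shown above; MLP, GenerateData, and ErrorRate are hypothetical names used only to illustrate the flow:

# Sketch of the epoch loop with the two stopping criteria described above.
# MLP, GenerateData and ErrorRate are assumed helpers, not from the original code.
mlp = MLP()
inputs, labels = GenerateData(n_bits=8, n_samples=2000)   # a fixed training dataset

for epoch in range(10000):
    total_loss = 0.0
    for x, y in zip(inputs, labels):
        mlp.Forward(x)                  # forward pass: get the prediction
        total_loss += mlp.Backward(y)   # backward pass: update W1, b1, W2, b2

    avg_loss = total_loss / len(inputs)
    if ErrorRate(mlp, inputs, labels) < 0.002 or avg_loss < 1e-4:
        break   # training is done once either criterion is met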

For 8 bits:

The prediction error rate of 8 bits is 1%.



16 bits:


The prediction error rate of 16 bits is 3%.


31 bits:


The prediction error rate of  31 bits is 9%.

     The results are acceptable. Beware that the more bits there are, the more diverse the combinations of bits become; to represent that diversity, it is necessary to use more data for the training.



四. Summary:
     Hand-making a neural network is immensely helpful for analyzing actual deep learning issues, even when we use existing frameworks (TensorFlow, Keras, Caffe, and other similar tools). Before coding, we must understand what we are going to do, or we will end up suspecting a bug where there is none.


Reference:
   Make Your Own Neural Network, Tariq Rashid, 2016; simplified Chinese edition 2018, translator: 林賜
   https://stats.stackexchange.com/questions/235528/backpropagation-with-softmax-cross-entropy
   https://medium.com/@14prakash/back-propagation-is-very-simple-who-made-it-complicated-97b794c97e5c
   
