
[Deep Learning for NLP] Neural network basics

림밤빵 2021. 4. 29. 15:09

Relationships between DL, representation learning, ML, AI

 Machine learning is a subfield of artificial intelligence, and deep learning is in turn a subfield of machine learning. In ML, researchers assume that features have already been extracted from the input by human effort. In DL, we do not need to hand-craft such features, because every feature is learned automatically: we only need an input and an output, and the neural network links them. This is the critical difference between ML and DL. Most models in NLP today are driven by DL; classical ML has become somewhat dated in the NLP field.

 

 

Model in Neural Networks

 A neural network model is represented by its architecture, which shows how two or more inputs are transformed into an output. The architecture contains learnable parameters W; all we have to do is find the best values of W.

 

 

Artificial Neuron (Perceptron)

 The fundamental building block of a neural network is the perceptron, a simplified model of a biological neuron. First, we take the dot product of the transpose of the input vector X and the weight vector W. Then we add a bias term and apply a non-linear activation function, denoted f here.
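 A minimal NumPy sketch of this computation, y = f(W^T X + b), is shown below. The particular weight values, and the choice of sigmoid as f, are illustrative assumptions only.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, f=sigmoid):
    # y = f(w^T x + b): dot product of the weights and the input,
    # plus a bias term, passed through a non-linear activation f.
    return f(np.dot(w, x) + b)

# Toy example with made-up numbers (illustrative only).
x = np.array([1.0, 2.0, 3.0])   # input vector X
w = np.array([0.4, -0.2, 0.1])  # weight vector W
b = 0.5                         # bias term
print(perceptron(x, w, b))      # a scalar between 0 and 1
```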

 

 

Activation function

 The function that determines the output value of the neurons in the hidden and output layers is called the activation function. Its purpose is to introduce non-linearity into the network. The sigmoid and softmax functions are commonly used as activation functions. The sigmoid function takes any real number as input on the x-axis and transforms it into a scalar output between 0 and 1. The softmax function takes a k-dimensional vector as input and estimates a probability for each class, where k is the total number of classes to be predicted.
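 A small sketch of both functions in NumPy follows; the example input values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a scalar output between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    # Turns a k-dimensional score vector into a probability
    # distribution over k classes (non-negative, sums to 1).
    shifted = v - np.max(v)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]
```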

 

 

Loss optimization

 

 We want to find the network weights that achieve the lowest loss. First, we randomly pick an initial point (w0, w1). Then we take a small step in the opposite direction of the gradient, and we repeat until convergence.
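 The loop below is a minimal gradient-descent sketch of those three steps. The quadratic toy loss, the learning rate, and the stopping threshold are assumptions made for illustration; in a real network the loss and gradient come from the training data and backpropagation.

```python
import numpy as np

def loss(w):
    # Toy loss surface over (w0, w1); stands in for the network's loss.
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    # Analytic gradient of the toy loss above.
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.random.randn(2)        # 1. randomly pick an initial (w0, w1)
lr = 0.1                      # learning rate (assumed)
for _ in range(1000):
    g = gradient(w)
    w = w - lr * g            # 2. take a small step opposite the gradient
    if np.linalg.norm(g) < 1e-6:
        break                 # 3. repeat until convergence
print(w)                      # close to the minimum at (3, -1)
```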

 

 

Gradient vanishing

 During backpropagation, the gradient may shrink progressively as it propagates toward the input layer. If the weights of the layers close to the input are not updated properly, the best model will not be found. This is called gradient vanishing (the vanishing gradient problem).
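 A quick demonstration of why this happens with sigmoid activations: the sigmoid derivative is at most 0.25, and backpropagation multiplies by it once per layer, so the gradient shrinks exponentially toward the input layer. The depth of 10 and the input value of 0 are chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum value is 0.25 (at z = 0).
    s = sigmoid(z)
    return s * (1.0 - s)

# Backpropagation multiplies the upstream gradient by the local
# derivative at every layer; stacking many sigmoid layers shrinks it.
grad = 1.0
for layer in range(10):
    grad *= sigmoid_grad(0.0)   # 0.25 at z = 0, even smaller elsewhere
    print(f"layer {layer + 1}: gradient magnitude = {grad:.2e}")
# By layer 10 the gradient is about 1e-6, so the weights near the
# input layer barely get updated.
```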
