This tutorial is the fourth in a series of tutorials that help you build an abstractive text summarizer using TensorFlow. Today we discuss some useful modifications to the core RNN seq2seq model covered in the last tutorial.

These modifications are:
RNN modifications (GRU & LSTM)
Bidirectional networks
Multilayer networks

About Series

This is a series of tutorials that help you build an abstractive text summarizer using TensorFlow with multiple approaches. You don't need to download the data, nor do you need to run the code locally on your device: the data is found on Google Drive (you can simply copy it to your own Google Drive), and the code for this series is written in Jupyter notebooks to run on Google Colab.

We have covered so far:
0. Overview on the free ecosystem for deep learning (how to use Google Colab with Google Drive)
1. Overview on the text summarization task and the different techniques for the task
2. Data used and how it could be represented for our task
3. What is seq2seq for text summarization and why

So let's get started.

Quick Recap

Our task is text summarization. We call it abstractive because we teach the neural network to generate words, not just copy words.
The data used is news articles and their headlines. It can be found on my Google Drive, so you can just copy it to your own Google Drive without needing to download it. We represent the data using word embeddings, which simply means converting each word to a specific vector, and we create a dictionary for our words (both points are covered in an earlier tutorial of this series).

There are different approaches for this task. They are built on a cornerstone concept and keep developing and building up. They start from a type of network called an RNN, arranged in an encoder/decoder architecture called seq2seq (also covered in an earlier tutorial). The code for these different approaches accompanies this series.

This tutorial is based on the amazing work of Andrew Ng. His course on RNNs has been truly useful, and I recommend you see it.

Today we go through some modifications made to the core component of the encoder/decoder model. These modifications occur on the RNN block itself, to increase its efficiency in the whole model.

1. RNN modifications (LSTM & GRU)

There are 2 main problems with the RNN unit.

Exploding gradients: this occurs with deep networks (i.e. networks with many layers, like in our case, where the RNN is unrolled over many time steps). When we apply backpropagation, the gradients get too large. This problem can be solved rather easily using gradient clipping, which simply sets a threshold: when the gradients exceed it, we clip them to a certain value.

Vanishing gradients: this proves a much harder problem to solve. It also occurs due to the large number of layers, and it shows up as the inability of the normal RNN unit to remember old values that appeared early in the sequence. This is quite important when dealing with an NLP problem, as some words depend on words that appeared very early in the sentence. For example, the word cat/cats, which appears early in the sentence, directly affects choosing either was/were later in the sentence.
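Both problems can be seen numerically with a toy example. The sketch below is only an illustration under simplifying assumptions: it uses a hypothetical scalar recurrence h_t = w * h_(t-1), for which the gradient across the sequence is just w multiplied by itself once per time step. It also shows norm-based clipping, which is one common variant of gradient clipping (clipping each value separately is another). The function names are made up for this example.

```python
import math

def gradient_through_time(w, steps):
    """Gradient of the last state w.r.t. the first state for the toy
    recurrence h_t = w * h_(t-1): the chain rule gives w ** steps."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per time step
    return grad

def clip_by_norm(grads, threshold):
    """Rescale grads so that their L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)  # already small enough, leave unchanged
    scale = threshold / norm
    return [g * scale for g in grads]

vanishing = gradient_through_time(0.5, 50)  # |w| < 1: shrinks towards 0
exploding = gradient_through_time(1.5, 50)  # |w| > 1: grows very large
clipped = clip_by_norm([exploding, 3.0], threshold=5.0)

print("vanishing:", vanishing)
print("exploding:", exploding)
print("clipped:  ", clipped)
```

Clipping only rescales a gradient that is too large, so it helps with exploding gradients. It cannot bring back a gradient that has already shrunk to almost nothing, which is why the vanishing case needs a different kind of fix.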
To solve the vanishing gradients problem we need a new RNN architecture. Here we discuss 2 main approaches:
GRU (Gated Recurrent Unit)
LSTM (Long Short Term Memory)

1.A) GRU (Gated Recurrent Unit)

Both GRU & LSTM solve the problem of vanishing gradients that the normal RNN unit suffers from. They do it by implementing a memory cell within their network, which enables them to store data from early in the sequence to be used later in the sequence.

Here we talk about the GRU. We begin with the activation equation of the RNN (covered in an earlier tutorial), then apply some simple modifications to it until we arrive at the equation for the candidate value.

Here c denotes the memory cell, which is also the output of the GRU cell. The subscript N denotes that it is the newly proposed c value (we use it later to generate the real c output of the GRU). So the newly proposed value c_N (the candidate) depends on the old output c (the old memory value) and the current input X at that time.

To remember the value of c, we use another parameter called F (the update gate), which controls whether we update the value of c or not. For F we use a sigmoid function that takes into consideration the old c and the current input X.

So to update the value of c we blend the two values according to the gate: where F is close to 1 the candidate c_N replaces the old value, and where F is close to 0 the old c is kept.

Let's assume that c is a vector whose elements remember important features within the sentence, and that its first element records whether the word is cat or cats. At first the c vector is empty, until we see the word cat. Then F is set to 1 to remember that it is a singular word. After that F stays close to 0, so c keeps its value until it is used later in the sentence (to generate 'was' not 'were').

There is just one more modification needed to build our full GRU unit. It occurs in the function that creates the new candidate c_N.
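Before adding that last modification, the simplified unit described so far can be sketched in plain Python. This is a minimal sketch and not the code of this series: it assumes the standard formulation (a tanh candidate, a sigmoid update gate, and a gate-weighted blend), and the names W_c, b_c, W_u, b_u and simplified_gru_step are hypothetical names chosen for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def simplified_gru_step(c_old, x, params):
    """One step of the simplified GRU (no relevance gate yet).

    c_N = tanh(W_c [c_old, x] + b_c)      candidate
    F   = sigmoid(W_u [c_old, x] + b_u)   update gate
    c   = F * c_N + (1 - F) * c_old       new memory value
    """
    concat = c_old + x  # list concatenation: [c_old, x]
    c_new = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], concat)]
    gate = [sigmoid(z) for z in affine(params["W_u"], params["b_u"], concat)]
    c = [f * cn + (1.0 - f) * co for f, cn, co in zip(gate, c_new, c_old)]
    return c, c_new, gate

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
hidden, inputs = 3, 2
params = {
    "W_c": random_matrix(hidden, hidden + inputs),
    "b_c": [0.0] * hidden,
    # Zero gate weights so the bias alone sets the gate (for illustration).
    "W_u": [[0.0] * (hidden + inputs) for _ in range(hidden)],
    "b_u": [-50.0] * hidden,
}
c_old = [0.9, -0.4, 0.1]
x = [1.0, -1.0]

# Gate close to 0: the old memory is carried forward almost unchanged.
c_kept, _, _ = simplified_gru_step(c_old, x, params)

# Gate close to 1: the memory is overwritten by the candidate.
params["b_u"] = [50.0] * hidden
c_overwritten, candidate, _ = simplified_gru_step(c_old, x, params)

print("kept:       ", c_kept)
print("overwritten:", c_overwritten)
```

Forcing the gate through its bias is only a way to show the two extremes. In a trained network the gate weights are learned, so each element of F decides per time step whether its element of c is kept or replaced.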
Here we have a learnable parameter, the relevance gate Fr, which learns how relevant the old c is when computing the new candidate c_N.

To sum it all up, we have 4 main equations that govern the GRU: the relevance gate Fr, the candidate c_N, the update gate F, and the final update of c.

1.B) LSTM (Long Short Term Memory)

LSTM is another modification to the RNN. It is also built using the same concept of memory, to remember long sequences of data. It was proposed before GRU, so GRU is actually a simplification of LSTM.

In LSTM we use activation values, not just c. We also have 2 outputs from the cell: a new activation and a new memory cell value c. The new candidate is calculated from the previous activation and the current input.

In LSTM we control the memory cell through 3 different gates: an update gate, a forget gate and an output gate.

As we said before, we have 2 outputs from the LSTM, the new memory cell value and the new activation, and both are computed using the previous gates. The update and forget gates decide how much of the candidate and of the old c go into the new c, and the output gate decides how much of c is exposed as the new activation.

Combining all of these gives the full LSTM cell. We can also output a prediction y from the LSTM by passing the activation to a softmax.

When we connect multiple LSTM cells together, we can see that if the network has correctly learned the gate parameters, it can pass the memory cell values from early in the sequence to the very end of the sequence, so we can model long dependencies with high accuracy.

2. Bidirectional networks

This is a modification to the normal RNN network that addresses an important need in NLP problems: sometimes, to understand a word, we need to look not just at the previous words but also at the coming words.

For example, to tell apart the 2 different meanings of the word teddy (one time it is part of a person's name, while the other time it is part of the phrase teddy bear), we need to look at the coming word. This is why we need bidirectional networks.

A bidirectional network is a general architecture that can use any RNN model (normal RNN, GRU, LSTM).

Forward propagation for the 2 directions of cells: here we apply forward propagation 2 times, once for the forward cells and once for the backward cells. Both activations (forward, backward) are considered when calculating the output y^ at time t.

3. Multilayer networks

To achieve even better results, we can stack multiple RNNs (LSTM, GRU or normal RNN) on top of each other, but we must take into consideration that they work with time.
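As a rough sketch of what stacking means when the cells also unroll over time, the code below runs a small stack of plain tanh RNN layers over a short sequence. It is an illustration under assumptions, not the code of this series: every cell takes the previous time step of its own layer and the current time step of the layer below (the raw input for the first layer), and the names cell, stacked_rnn_forward, W_time and W_below are hypothetical.

```python
import math
import random

def matvec(W, v):
    """Multiply a list-of-rows matrix W by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def cell(a_prev_time, a_below, p):
    """One plain tanh RNN cell: mixes the previous time step of this layer
    with the current time step of the layer below."""
    from_time = matvec(p["W_time"], a_prev_time)
    from_below = matvec(p["W_below"], a_below)
    return [math.tanh(t + d + b) for t, d, b in zip(from_time, from_below, p["b"])]

def stacked_rnn_forward(xs, layers, hidden):
    """Return a[k][t], the activation of layer k at time t (both 0-based)."""
    a = []
    for k, p in enumerate(layers):
        a_prev_time = [0.0] * hidden  # initial activation of this layer
        row = []
        for t, x in enumerate(xs):
            a_below = x if k == 0 else a[k - 1][t]  # the input feeds layer 0
            a_prev_time = cell(a_prev_time, a_below, p)
            row.append(a_prev_time)
        a.append(row)
    return a

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
hidden, input_size, num_layers, steps = 3, 2, 2, 4
layers = [
    {"W_time": random_matrix(hidden, hidden),
     "W_below": random_matrix(hidden, input_size if k == 0 else hidden),
     "b": [0.0] * hidden}
    for k in range(num_layers)
]
xs = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(steps)]
a = stacked_rnn_forward(xs, layers, hidden)
print(len(a), "layers x", len(a[0]), "time steps")
```

The plain tanh cell is used only to keep the sketch short. The same loop structure applies if the cell is swapped for a GRU or LSTM step.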
To get started, consider a normal deep network: it contains multiple layers (50 in this case). When we apply the same concept to RNNs, we tend to choose a much smaller number of layers, as that is enough and because more layers would be computationally expensive.

Now let's see how we apply the concept of deep networks to RNNs. Since we are working with an RNN or one of its variations, we must take the time factor into consideration: each vertical column of cells represents the stack of layers at one time step, and we repeat this column for each step in time. So our notation is [layer] <time>.

To get the value of any activation, for example the one at layer 2 and time 3, we use both:
the previous activation in time (time 2) from the same layer (layer 2) 💚 green
the activation at the same time (time 3) from the previous layer (layer 1) 🔵 blue

Next Time

If GOD wills it, we will go through how to enhance our architecture even more using the concepts of:
Beam Search
Attention Model

I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. You can simply use Google Colab to run all the code for this series of tutorials. Please review the tutorial and tell me what you think about it. Hope to see you again.