Build an Abstractive Text Summarizer in 94 Lines of Tensorflow !! (Tutorial 6)

This tutorial is the sixth in a series that helps you build an abstractive text summarizer using tensorflow. Today we will build the summarizer in tensorflow in an optimized way.

We will go through one of the most optimized models that has been built for this task. This model was written by dongjun-Lee (this is the link to his model). I have used his model on different datasets (in different languages) and it gave truly amazing results, so I would truly like to thank him for his effort.

I have made multiple modifications to the model to enable it to run seamlessly on google colab (link to my model), and I have hosted the data on google drive (more on how to link google drive to google colab). There is no need to download either the code or the data: you only need a google colab session to run the code, copy the data from my google drive to yours (more on this), and connect google drive to your google colab notebook.

0- Intro

0-A About Series

This is a series of tutorials that helps you build an abstractive text summarizer using tensorflow, using multiple approaches. We call it abstractive because we teach the neural network to generate words, not merely to copy words.

We have covered so far (code for this series can be found here):

0. Overview on the free ecosystem for deep learning (how to use google colab with google drive)
1. Overview on the text summarization task and the different techniques for the task
2. Data used and how it could be represented for our task (prerequisites for this tutorial)
3. What is seq2seq for text summarization and why
4. Multilayer Bidirectional LSTM/GRU
5. Beam Search & Attention for text summarization

0-B About the Data Used

The data consists of news articles and their headlines. It can be found on my google drive, so you just copy it to your google drive without the need to download it (more on this).

We represent the data using word embeddings, which simply means converting each word to a specific vector. To do this we create a dictionary for our words (more on this) (prerequisites for this tutorial).

0-C About the Model Used

There are different approaches for this task. They are built over a cornerstone concept, and they keep on developing and building up.

Today we build this cornerstone implementation. It is a type of network called RNN, arranged in an Encoder/Decoder architecture called seq2seq (more on this). We then build the seq2seq in a multilayer bidirectional structure, where the RNN cell is an LSTM cell (more on this). Then we add an attention mechanism to better interface the encoder with the decoder (more on this), and finally, to generate better output, we use the ingenious concept of beam search (more on this).

The code for all these different approaches can be found here.

So let's get started !!
Model Structure

Our model can be seen as structured into different blocks. These blocks are:

Initialization Block : Here we initialize the needed tensorflow placeholders & variables, and here we define the RNN cell that is used throughout the model.

Embedding Block : Here we define the embedding matrix used in both the encoder & the decoder.

Encoder Block : Here we define the multilayer bidirectional RNN (more on this) that forms the encoder part of our model, and we output the encoder state as an input to the decoder part.

Decoder Block : Here the decoder is actually partitioned into 2 distinct parts:
- Attention Mechanism (more on this), which is used to better interface the encoder with the decoder. This is used in the training phase.
- BeamSearch (more on this), which is used to generate better output from our model. This is used in the testing phase.

Loss Block : This block is only used in the training phase. Here we apply clipping to our gradients, and we actually run our optimizer (Adam Optimizer is used here). This is also the place where we apply our gradients through the optimizer.
1- Initialization Block

First we need to import the libs that we will use

    import tensorflow as tf
    from tensorflow.contrib import rnn  # cell that we would use

Before building our Model class we need to define some tensorflow concepts first.

We define placeholders like this

    X = tf.placeholder(tf.int32, [None, article_max_len])
    # here we define the input X as int32, with a promise to provide its
    # data at runtime
    #
    # we also provide its shape, where None is used for a dimension of
    # any size

and we define variables like this

    global_step = tf.Variable(0, trainable=False)
    # a variable must be initialized,
    # and we can set it to either be trainable or not

Then let's build our Model class

    class Model(object):
        def __init__(self, reversed_dict, article_max_len, summary_max_len, args, forward_only=False):
            self.vocabulary_size = len(reversed_dict)
            self.embedding_size = args.embedding_size
            self.num_hidden = args.num_hidden
            self.num_layers = args.num_layers
            self.learning_rate = args.learning_rate
            self.beam_width = args.beam_width

We pass an object called args that contains multiple parameters:
- embedding size (size of word2vector)
- num_hidden (size of RNN)
- num_layers (layers of RNN) (more on this)
- Learning Rate
- BeamWidth (more on this)
- Keep Prob

We also need to initialize the model with other parameters:
- reversed dict (dict of keys, where each key is a number that points to a specific word) (more on how to build your reversed dict)
- article_max_len & summary_max_len (max length of the article sentence as input and max length of the summary sentence as output)
- Forward Only (bool value to indicate training or testing phase) (Forward Only = False → training phase)

Then, to continue the initialization

    if not forward_only:  # training phase
        # keep_prob is taken from args in the training phase
        self.keep_prob = args.keep_prob
    else:  # testing phase
        # keep_prob is constant in the testing phase (no dropout)
        self.keep_prob = 1.0

    # here we would use LSTM as our cell
    self.cell = tf.nn.rnn_cell.BasicLSTMCell

    # projection layer that would be used in the decoder in both
    # training and testing phase
    with tf.variable_scope("decoder/projection"):
        self.projection_layer = tf.layers.Dense(self.vocabulary_size, use_bias=False)

    # define batch size (our data would be provided in batches)
    self.batch_size = tf.placeholder(tf.int32, (), name="batch_size")

    # X as input (word ids), X_len as the length of each article
    self.X = tf.placeholder(tf.int32, [None, article_max_len])
    self.X_len = tf.placeholder(tf.int32, [None])

    # define decoder (input, target, length)
    # using the summary length
    self.decoder_input = tf.placeholder(tf.int32, [None, summary_max_len])
    self.decoder_len = tf.placeholder(tf.int32, [None])
    self.decoder_target = tf.placeholder(tf.int32, [None, summary_max_len])

    # define global step beginning from zero
    self.global_step = tf.Variable(0, trainable=False)

2- Embedding Block :

Here we represent both our input articles (that would be the embedded inputs) and the decoder inputs using word2vector (more on this).

We define our embedding variables in a name scope that we call "embedding"

    with tf.name_scope("embedding"):
        # if training,
        # and you set the args.glove variable to true
        if not forward_only and args.glove:
            # here we use tf.constant as we won't change it
            # get_init_embedding is a function
            # that returns the vector for each word in our dict
            init_embeddings = tf.constant(
                get_init_embedding(reversed_dict, self.embedding_size),
                dtype=tf.float32)
        else:
            # otherwise (testing, or training without glove)
            # start from random word vectors
            init_embeddings = tf.random_uniform(
                [self.vocabulary_size, self.embedding_size], -1.0, 1.0)

        self.embeddings = tf.get_variable("embeddings", initializer=init_embeddings)

        # then look up the encoder input and make it time-major
        self.encoder_emb_inp = tf.transpose(
            tf.nn.embedding_lookup(self.embeddings, self.X), perm=[1, 0, 2])

        # and do the same for the decoder input
        self.decoder_emb_inp = tf.transpose(
            tf.nn.embedding_lookup(self.embeddings, self.decoder_input), perm=[1, 0, 2])

3- Encoder Block :

Here we actually define the multilayer bidirectional LSTM for the encoder part of our seq2seq (more on this). We define our variables here in a name scope that we call "encoder".

Here we use the concept of Dropout after each cell in our architecture. It randomly deactivates a subset of our net, and is used during training for regularization. Note that DropoutWrapper keeps everything by default (its keep probabilities default to 1.0), so we pass it the keep_prob that we defined in the initialization block.

    with tf.name_scope("encoder"):
        fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
        bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
        # without output_keep_prob the wrapper would apply no dropout at all
        fw_cells = [rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) for cell in fw_cells]
        bw_cells = [rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) for cell in bw_cells]

Now after defining the forward and backward cells, we need to actually connect them together to form the bidirectional structure. For this we use stack_bidirectional_dynamic_rnn, which takes all of the following parameters as its inputs:
- forward cells
- backward cells
- encoder emb input (input articles in word2vector format)
- X_len (length of articles)

Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation.
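To make the time-major layout concrete, here is a small pure-Python sketch. It is not part of the model, and the helper name to_time_major is hypothetical. It swaps the first two axes of a nested list, which is the same reordering that tf.transpose(..., perm=[1, 0, 2]) applies to the embedded inputs in the embedding block.

```python
def to_time_major(batch_major):
    """Swap the batch and time axes of a [batch, time, emb] nested list.

    Hypothetical helper for illustration only; it assumes every example
    has the same number of time steps (i.e. the batch is already padded).
    """
    # zip(*rows) groups the t-th time step of every example together
    return [list(step) for step in zip(*batch_major)]


# 2 examples, 3 time steps, embedding size 1
batch_major = [
    [[1.0], [2.0], [3.0]],  # example 0
    [[4.0], [5.0], [6.0]],  # example 1
]

time_major = to_time_major(batch_major)

# the first axis is now time (3 steps), the second is batch (2 examples)
print(len(time_major), len(time_major[0]))
# first time step of both examples
print(time_major[0])
```

Applying the helper twice gives back the original batch-major layout, which is why the model transposes once on the way in and once more on the way out (for the logits and the predictions).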
    encoder_outputs, encoder_state_fw, encoder_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
        fw_cells, bw_cells, self.encoder_emb_inp,
        sequence_length=self.X_len, time_major=True, dtype=tf.float32)

Now we need to actually use the output from stack_bidirectional_dynamic_rnn. We mainly need 2 outputs:
- encoder_output (would be used in the attention calculation) (more on attention)
- encoder_state (would be used for the initial state of the decoder)

So to get encoder_output we simply

    self.encoder_output = tf.concat(encoder_outputs, 2)

Then to get encoder_state, we combine both (encoder_state_c) & (encoder_state_h) of both the forward & backward cells using LSTMStateTuple

    encoder_state_c = tf.concat((encoder_state_fw[0].c, encoder_state_bw[0].c), 1)
    encoder_state_h = tf.concat((encoder_state_fw[0].h, encoder_state_bw[0].h), 1)
    self.encoder_state = rnn.LSTMStateTuple(c=encoder_state_c, h=encoder_state_h)

4- Decoder Block :

Here the decoder is divided into 2 parts:
- Training part (to train the attention model) (more on attention model)
- Testing/running part (for attention & beam search) (more on beam search)

So let's first define our (name scope) & (variable scope) for both parts. We also define the decoder cell that is used for both parts: an LSTM cell with num_hidden * 2 units, so that it matches the concatenated forward & backward encoder state.

    with tf.name_scope("decoder"), tf.variable_scope("decoder") as decoder_scope:
        decoder_cell = self.cell(self.num_hidden * 2)

4.a Training Part (Attention Model)

First we need to prepare our attention structure. Here we use BahdanauAttention, and encoder_output is used inside the attention calculation (more on attention model)

    # attention memory is expected batch-major, so transpose back
    attention_states = tf.transpose(self.encoder_output, [1, 0, 2])
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
        self.num_hidden * 2, attention_states,
        memory_sequence_length=self.X_len, normalize=True)

Then we further define the decoder cell (in the first step of the decoder we just defined the decoder cell as a plain LSTM cell, now we add attention). To do this we use AttentionWrapper, which combines attention_mechanism with the decoder cell

    decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_cell, attention_mechanism,
        attention_layer_size=self.num_hidden * 2)

Now we need to define the inputs to the decoder cell. This input actually comes from 2 sources (more on seq2seq):
- encoder output (used within the initial step)
- decoder input (summary sentence in the training phase)

So let's first define the initial state that comes from the encoder

    initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=self.batch_size)
    initial_state = initial_state.clone(cell_state=self.encoder_state)

Now we combine the initial state with the decoder input (summary sentence). To use the BasicDecoder, we need to provide the decoder input through a helper, which combines (decoder_emb_inp, decoder_len) together

    helper = tf.contrib.seq2seq.TrainingHelper(
        self.decoder_emb_inp, self.decoder_len, time_major=True)
    decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, initial_state)
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, output_time_major=True, scope=decoder_scope)

Now for the last step of the training phase, we need to define the outputs (logits) from the decoder, to be used within the loss block for training

    # just use the rnn_output from all the outputs
    self.decoder_output = outputs.rnn_output

    # then get the logits: project to vocabulary size,
    # and transpose from time-major back to batch-major
    self.logits = tf.transpose(
        self.projection_layer(self.decoder_output), perm=[1, 0, 2])

    # then pad the logits with zeros along the time axis
    # up to summary_max_len, so they match decoder_target
    self.logits_reshape = tf.concat(
        [self.logits,
         tf.zeros([self.batch_size,
                   summary_max_len - tf.shape(self.logits)[1],
                   self.vocabulary_size])],
        axis=1)

4.b Testing/Running part (Attention & Beam search)

Here in this phase, there are 2 main goals:
- tile the encoder output & encoder states & X_len (article length) into parts, to actually perform the beam search methodology (more on beam search)
- build a decoder that is independent of the decoder input. In the testing phase we don't have the summary sentence as our input, so we need to build the decoder in a different way than above

First let's tile the encoder output & encoder states & X_len (article length). Here we use the beam_width variable that was already defined above

    tiled_encoder_output = tf.contrib.seq2seq.tile_batch(
        tf.transpose(self.encoder_output, perm=[1, 0, 2]),
        multiplier=self.beam_width)
    tiled_encoder_final_state = tf.contrib.seq2seq.tile_batch(
        self.encoder_state, multiplier=self.beam_width)
    tiled_seq_len = tf.contrib.seq2seq.tile_batch(
        self.X_len, multiplier=self.beam_width)

Then let's define the attention mechanism (just like before, but taking the tiled variables into consideration)

    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
        self.num_hidden * 2, tiled_encoder_output,
        memory_sequence_length=tiled_seq_len, normalize=True)
    decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_cell, attention_mechanism,
        attention_layer_size=self.num_hidden * 2)
    initial_state = decoder_cell.zero_state(
        dtype=tf.float32, batch_size=self.batch_size * self.beam_width)
    initial_state = initial_state.clone(cell_state=tiled_encoder_final_state)

Then let's define our decoder, but here we use the BeamSearchDecoder, which takes into consideration all of:
- Decoder cell (previously defined)
- Embedding word2vector (defined in the embedding part)
- projection layer (defined in the beginning of the class)
- decoder initial state (previously defined)
- beam_width (user defined)
- start token & end token

    decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=self.embeddings,
        start_tokens=tf.fill([self.batch_size], tf.constant(2)),
        end_token=tf.constant(3),
        initial_state=initial_state,
        beam_width=self.beam_width,
        output_layer=self.projection_layer
    )

Then all that is left to do is to define the outputs. They directly reflect the real output of the whole seq2seq architecture, as this phase is where the prediction is actually computed

    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, output_time_major=True,
        maximum_iterations=summary_max_len, scope=decoder_scope)
    self.prediction = tf.transpose(outputs.predicted_ids, perm=[1, 2, 0])

5- Loss Block :

This block is where training actually occurs, through multiple steps:
- calculating the loss (more on loss calculation)
- calculating the gradients and applying clipping on the gradients (more on exploding gradients)
- applying the optimizer (here we use Adam optimizer)

First we define our name scope, and we specify that this block only works in the training phase

    with tf.name_scope("loss"):
        if not forward_only:

Second we calculate the loss (more on loss calculation). The weights mask out the padded positions, so only the real tokens of each summary contribute to the loss

            crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
                logits=self.logits_reshape, labels=self.decoder_target)
            weights = tf.sequence_mask(self.decoder_len, summary_max_len, dtype=tf.float32)
            self.loss = tf.reduce_sum(crossent * weights / tf.to_float(self.batch_size))

Third we calculate our gradients, and apply clipping on the gradients to solve the problem of exploding gradients (more on exploding gradients).

(from tutorial 4) Exploding Gradients : occur with deep networks (i.e. networks with many layers, like in our case). When we apply back propagation, the gradients can get too large. This problem can actually be solved rather easily, using the concept of gradient clipping, which simply means setting a specific threshold: when the gradients exceed it, we clip them to a certain value.
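The clipping rule can be sketched in plain Python. This is a minimal sketch and not the code the model runs: the function name clip_by_global_norm_sketch is hypothetical, and gradients are modeled as plain lists of floats instead of tensors. It mirrors the rescaling rule documented for tf.clip_by_global_norm, where every gradient is multiplied by clip_norm / max(global_norm, clip_norm).

```python
import math


def clip_by_global_norm_sketch(gradients, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm.

    Hypothetical helper for illustration; `gradients` is a list of
    lists of floats, one inner list per trainable variable.
    """
    # global norm = L2 norm over all gradient values of all variables
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # scale is 1.0 when the norm is already below the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm


# two "variables" whose joint norm is sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm_sketch(grads, 5.0)

clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm, clipped_norm)
```

Note that all gradients are scaled by the same factor, so the direction of the update is kept and only its size is limited. Gradients whose global norm is already below the threshold pass through unchanged.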
    params = tf.trainable_variables()
    gradients = tf.gradients(self.loss, params)
    # rescale all gradients together so their global norm is at most 5.0
    clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

Fourth we apply our optimizer. Here we use Adam optimizer with the previously defined learning_rate

    optimizer = tf.train.AdamOptimizer(self.learning_rate)
    self.update = optimizer.apply_gradients(
        zip(clipped_gradients, params), global_step=self.global_step)

Next Time

If GOD wills it, we will go through:
- the code needed to divide our data into batches
- the code needed to use this model for training

Then after we are done with this core model implementation, if GOD wills it, we will go through other modern implementations for text summarization, like:
- pointer generator
- using reinforcement learning with seq2seq

(more on different implementations for seq2seq for text summarization)

All the code for this tutorial is available as open source here.

I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials is found here, and you can simply use google colab to run it. Please review the tutorial and tell me what you think about it. Hope to see you again.