Build an Abstractive Text Summarizer in 94 Lines of Tensorflow !! (Tutorial 6)

Written by theamrzaki | Published 2019/04/16
Tech Story Tags: tensorflow | machine-learning | nlp | ai | deep-learning


This tutorial is the sixth in a series of tutorials that help you build an abstractive text summarizer using TensorFlow; today we build the summarizer in an optimized way.

Today we go through one of the most optimized models built for this task. The model was written by dongjun-Lee (this is the link to his model). I have used his model on different datasets (in different languages) and it gave truly amazing results, so I would truly like to thank him for his effort.

I have made multiple modifications to the model to enable it to run seamlessly on Google Colab (link to my model), and I have hosted the data on Google Drive (more on how to link Google Drive to Google Colab). So there is no need to download either the code or the data: you only need a Google Colab session to run the code, copy the data from my Google Drive to yours (more on this), and connect Google Drive to your Colab notebook.

0- Intro

0-A About Series

This is a series of tutorials that help you build an abstractive text summarizer using TensorFlow with multiple approaches. We call it abstractive because we teach the neural network to generate words, not to merely copy words.

We have covered so far (the code for this series can be found here):

  0. Overview of the free ecosystem for deep learning (how to use Google Colab with Google Drive)
  1. Overview of the text summarization task and the different techniques for the task
  2. Data used and how it can be represented for our task (prerequisites for this tutorial)
  3. What seq2seq is for text summarization and why
  4. Multilayer Bidirectional LSTM/GRU
  5. Beam Search & Attention for text summarization

0-B About the Data Used

The data we use consists of news articles and their headlines. It can be found on my Google Drive, so you can just copy it to your Google Drive without needing to download it (more on this).

We represent the data using word embeddings, which simply means converting each word to a specific vector. We also create a dictionary for our words (more on this) (prerequisites for this tutorial).

0-C About the Model Used

There are different approaches to this task. They are all built on a cornerstone concept, and they keep developing and building on it.

Today we build this cornerstone implementation, which is a type of network called an RNN arranged in an Encoder/Decoder architecture called seq2seq (more on this). We build the seq2seq in a multilayer bidirectional structure, where the RNN cell is an LSTM cell (more on this). We then add an attention mechanism to better interface the encoder with the decoder (more on this), and to generate better output we use the ingenious concept of beam search (more on this).

The code for all these different approaches can be found here.

So let's get started!!

Model Structure

Our model can be seen as structured into different blocks. These blocks are:

Initialization Block :

Here we initialize the needed TensorFlow placeholders & variables, and we define the RNN cell that is used throughout the model.

Embedding Block :

Here we would define the embedding matrix used in both the encoder & the decoder

Encoder Block :

Here we define the multilayer bidirectional RNN (more on this) that forms the encoder part of our model, and we output the encoder state as an input to the decoder part.

Decoder Block :

Here the decoder is actually partitioned into 2 distinct parts

  1. Attention Mechanism (more on this), which is used to better interface the encoder with the decoder; this is used in the training phase
  2. Beam Search (more on this), which is used to generate better output from our model; this is used in the testing phase

Loss Block :

This block is only used in the training phase. Here we apply clipping to our gradients, run our optimizer (an Adam optimizer is used here), and apply the clipped gradients through the optimizer.
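Before we dive in, here is a minimal outline (just a roadmap, not the actual implementation) of how these blocks map onto the Model class that we build step by step below:

class Model(object):
    def __init__(self, reversed_dict, article_max_len, summary_max_len, args, forward_only=False):
        # 1- Initialization Block : placeholders, variables and the RNN cell
        # 2- Embedding Block : embedding matrix shared by the encoder & the decoder
        # 3- Encoder Block : multilayer bidirectional RNN -> encoder_output, encoder_state
        # 4- Decoder Block : attention (training) / beam search (testing)
        # 5- Loss Block : loss, gradient clipping and the Adam optimizer (training only)
        pass  # each block is filled in throughout the rest of this tutorial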

1- Initialization Block

First we need to import the libraries that we will use:

import tensorflow as tf
from tensorflow.contrib import rnn  #cell that we would use

Before building our Model class, we need to define some TensorFlow concepts first.

We tend to define placeholders like this:

X = tf.placeholder(tf.int32, [None, article_max_len])
# here we define the input X as int32, with a promise to provide its
# data at runtime
#
# we also provide its shape, where None is used for a dimension of
# any size

and for variables, we tend to define them as:

global_step = tf.Variable(0, trainable=False)
# a variable must be initialized,
# and we can set it to either be trainable or not
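As a quick standalone illustration (the toy value 50 and the session code below are just for this example, they are not part of the model), a placeholder only receives its data when the graph is run, while a variable must be explicitly initialized first:

import numpy as np
import tensorflow as tf

article_max_len = 50  # toy value for this illustration
X = tf.placeholder(tf.int32, [None, article_max_len])
global_step = tf.Variable(0, trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # variables need explicit initialization
    batch = np.zeros((4, article_max_len), dtype=np.int32)  # a fake batch of 4 articles
    # the placeholder is fed its data at run time through feed_dict
    print(sess.run(tf.shape(X), feed_dict={X: batch}))  # -> [ 4 50]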

Then let's build our Model class:

class Model(object):
    def __init__(self, reversed_dict, article_max_len, summary_max_len, args, forward_only=False):
        self.vocabulary_size = len(reversed_dict)
        self.embedding_size = args.embedding_size
        self.num_hidden = args.num_hidden
        self.num_layers = args.num_layers
        self.learning_rate = args.learning_rate
        self.beam_width = args.beam_width

We pass an object called args that actually contains multiple parameters:

  1. embedding size (size of word2vector)
  2. num_hidden (size of RNN)
  3. num_layers (layers of RNN) (more on this)
  4. Learning Rate
  5. BeamWidth (more on this)
  6. Keep Prob

We also need to initialize the model with some other parameters (a short sketch of how all of these fit together follows this list):

  1. reversed dict (a dictionary of keys, where each key is a number that points to a specific word) (more on how to build your reversed dict)
  2. article_max_len & summary_max_len (max length of an article sentence as input and max length of a summary sentence as output)
  3. Forward Only (bool value to indicate training or testing phase) (forward_only = False → training phase)
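Just to make the constructor arguments concrete, here is a minimal sketch of how they could be put together (the argparse defaults below are only illustrative values, not necessarily the settings used in the original repository):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--embedding_size", type=int, default=300)   # size of word2vector
parser.add_argument("--num_hidden", type=int, default=150)       # size of RNN
parser.add_argument("--num_layers", type=int, default=2)         # layers of RNN
parser.add_argument("--learning_rate", type=float, default=1e-3)
parser.add_argument("--beam_width", type=int, default=10)
parser.add_argument("--keep_prob", type=float, default=0.8)
parser.add_argument("--glove", action="store_true")              # use pre-trained word vectors
args = parser.parse_args([])  # empty list so this also works inside a notebook

# reversed_dict, article_max_len and summary_max_len come from the data
# preparation step (see the previous tutorial); forward_only=False → training phase
# model = Model(reversed_dict, article_max_len, summary_max_len, args, forward_only=False)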

Then, to continue the initialization:

        if not forward_only: #training phase
            #keep_prob as variable in training phase
            self.keep_prob = args.keep_prob
        else: #testing phase
            #keep_prob constant in testing phase
            self.keep_prob = 1.0
        #here we would use LSTM as our cell
        self.cell = tf.nn.rnn_cell.BasicLSTMCell 
  
        #projection layer that would be used in decoder in both 
        #training and testing phase
        with tf.variable_scope("decoder/projection"):
              self.projection_layer =  tf.layers.Dense(self.vocabulary_size, use_bias=False)
   
        #define batch size(our data would be provided in batches)
        self.batch_size = tf.placeholder(tf.int32, (), name="batch_size")
  
        #X as input , define as length of articles 
        self.X = tf.placeholder(tf.int32, [None, article_max_len])
        self.X_len = tf.placeholder(tf.int32, [None])
  
        #define decoder (input , target , length) 
        #using the summary length 
        self.decoder_input = tf.placeholder(tf.int32, [None, summary_max_len])
        self.decoder_len = tf.placeholder(tf.int32, [None])
        self.decoder_target = tf.placeholder(tf.int32, [None, summary_max_len])
  
        #define global step beginning from zero 
        self.global_step = tf.Variable(0, trainable=False)

2- Embedding Block :

Here we represent both our input articles (the encoder inputs) and the decoder inputs as embeddings using word2vector (more on this).

We define our embedding variables in a name scope that we name embedding:

with tf.name_scope("embedding"):
            #if training , 
            #and you enable args.glove variable to true
            if not forward_only and args.glove:
                #here we use tf.constant as we won't change it
                #get_init_embedding is a function 
                #that returns the vector for each word in our dict
                init_embeddings = tf.constant(get_init_embedding(reversed_dict, self.embedding_size), dtype=tf.float32)
            else: #otherwise (testing phase, or training without glove) initialize the word2vector randomly
                init_embeddings = tf.random_uniform([self.vocabulary_size, self.embedding_size], -1.0, 1.0)
            self.embeddings = tf.get_variable("embeddings", initializer=init_embeddings)
            #then define for both encoder input
            self.encoder_emb_inp = tf.transpose(tf.nn.embedding_lookup(self.embeddings, self.X), perm=[1, 0, 2])
            #and define for decoder input
            self.decoder_emb_inp = tf.transpose(tf.nn.embedding_lookup(self.embeddings, self.decoder_input), perm=[1, 0, 2])
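get_init_embedding is not defined inside the model; it lives with the data preparation code. As a rough, hedged sketch of what such a function could look like (the file name and the use of gensim here are assumptions for illustration, not necessarily what the original repository does):

import numpy as np
from gensim.models import KeyedVectors

def get_init_embedding(reversed_dict, embedding_size):
    # load pre-trained word vectors (the file name is an assumption for this sketch)
    word_vectors = KeyedVectors.load_word2vec_format("glove_300d_word2vec.txt", binary=False)
    embedding_list = []
    for _, word in sorted(reversed_dict.items()):  # iterate over words in id order
        try:
            embedding_list.append(word_vectors[word])
        except KeyError:
            # words not found in the pre-trained vocabulary get a random vector
            embedding_list.append(np.random.uniform(-1.0, 1.0, embedding_size))
    return np.array(embedding_list, dtype=np.float32)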

3- Encoder Block :

Here we actually define the multilayer bidirectional LSTM for the encoder part of our seq2seq (more on this). We define our variables here in a name scope that we call “encoder”.

Here we use the concept of Dropout after each cell in our architecture. Dropout randomly drops (deactivates) a subset of the units in our net, and is used during training for regularization.

with tf.name_scope("encoder"):

            fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]

            bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]

            fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]

            bw_cells = [rnn.DropoutWrapper(cell) for cell in bw_cells]

Now, after defining the forward and backward cells, we need to actually connect them together to form the bidirectional structure, so we use stack_bidirectional_dynamic_rnn, which takes all of the following parameters as its inputs:

  1. forward cells

  2. Backward Cells

  3. Encoder emb input (input articles in word2vector format)

  4. X_len (length of articles)

  5. Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation.

encoder_outputs, encoder_state_fw, encoder_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    fw_cells, bw_cells, self.encoder_emb_inp,
    sequence_length=self.X_len, time_major=True, dtype=tf.float32)

Now we need to actually use the outputs from this stack_bidirectional_dynamic_rnn function. We mainly need 2 outputs:

  1. encoder_output (would be used in attention calculation) (more on attention)
  2. encoder_state (would be used for the initial state of the decoder)

so to get encoder_output we simply

self.encoder_output = tf.concat(encoder_outputs, 2)

Then to get encoder_state, we combine the (encoder_state_c) & (encoder_state_h) of both the forward & backward directions using LSTMStateTuple:

encoder_state_c = tf.concat((encoder_state_fw[0].c, encoder_state_bw[0].c), 1)

encoder_state_h = tf.concat((encoder_state_fw[0].h, encoder_state_bw[0].h), 1)

self.encoder_state = rnn.LSTMStateTuple(c=encoder_state_c, h=encoder_state_h)

4- Decoder Block :

Here the decoder is divided into 2 parts

  1. Training part (to train attention model) (more on attention model)
  2. testing/running part (for attention & beam search) (more on beam search)

So let's first define our (name scope) & (variable scope) for both parts. We also define the decoder cell here (a single LSTM cell of size num_hidden * 2, so it matches the concatenated forward & backward encoder states); it is used in both parts:

with tf.name_scope("decoder"), tf.variable_scope("decoder") as decoder_scope:
            decoder_cell = self.cell(self.num_hidden * 2)

4.a Training Part (Attention Model)

First we need to prepare our attention structure. Here we use BahdanauAttention.

encoder_output would be used inside the attention calculation (more on attention model)

attention_states = tf.transpose(self.encoder_output, [1, 0, 2])
attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
                    self.num_hidden * 2, attention_states, memory_sequence_length=self.X_len, normalize=True)

Then we further wrap the decoder cell (in the first decoder step we just defined the decoder cell as a plain LSTM cell; now we add attention to it). To do this we use AttentionWrapper, which combines the attention_mechanism with the decoder cell:

decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism, attention_layer_size=self.num_hidden * 2)

Now we need to define the inputs to the decoder cell. This input actually comes from 2 sources (more on seq2seq):

  1. encoder state (used as the initial state of the decoder)
  2. decoder input (summary sentence in the training phase)

So let's first define the initial state that comes from the encoder:

initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=self.batch_size)

initial_state = initial_state.clone(cell_state=self.encoder_state)

Now we combine the initial state with the decoder input (the summary sentence). Here, to use the BasicDecoder, we need to provide the decoder input through a helper; this helper combines (decoder_emb_inp, decoder_len) together:

helper = tf.contrib.seq2seq.TrainingHelper(self.decoder_emb_inp, self.decoder_len, time_major=True)

decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, initial_state)

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, output_time_major=True, scope=decoder_scope)

Now, for the last step of the training part, we need to define the outputs (logits) from the decoder, to be used within the loss block for training:

#just use the rnn_output from all the outputs
self.decoder_output = outputs.rnn_output

#then get the logits by applying the projection layer and transposing the decoder output
self.logits = tf.transpose(self.projection_layer(self.decoder_output), perm=[1, 0, 2])

#then pad the logits with zeros up to summary_max_len so they line up with the targets
self.logits_reshape = tf.concat(
    [self.logits,
     tf.zeros([self.batch_size, summary_max_len - tf.shape(self.logits)[1], self.vocabulary_size])],
    axis=1)

4.b Testing/Running part (Attention & Beam search)

Here in this phase, there are 2 main goals:

  1. tile (replicate) the encoder output & encoder state & X_len (article length) beam_width times, which is what the beam search methodology expects (more on beam search)
  2. build a decoder that is independent of the decoder input, as in the testing phase we don't have the summary sentence as our input, so we need to build the decoder in a different way than above

First, let's tile the encoder output & encoder state & X_len (article length) for beam search; here we use the beam_width variable that was already defined above:

tiled_encoder_output = tf.contrib.seq2seq.tile_batch(
    tf.transpose(self.encoder_output, perm=[1, 0, 2]), multiplier=self.beam_width)

tiled_encoder_final_state = tf.contrib.seq2seq.tile_batch(
    self.encoder_state, multiplier=self.beam_width)

tiled_seq_len = tf.contrib.seq2seq.tile_batch(self.X_len, multiplier=self.beam_width)

Then let's define the attention mechanism (just like before, but using the tiled tensors):

attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
    self.num_hidden * 2, tiled_encoder_output,
    memory_sequence_length=tiled_seq_len, normalize=True)

decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism, attention_layer_size=self.num_hidden * 2)

initial_state = decoder_cell.zero_state(
    dtype=tf.float32, batch_size=self.batch_size * self.beam_width)

initial_state = initial_state.clone(cell_state=tiled_encoder_final_state)

Then let's define our decoder, but here we use the BeamSearchDecoder, which takes all of the following into consideration:

  1. Decoder cell (previously defined)

  2. Embedding word2vector (defined in embedding part)

  3. projection layer (defined in the beginning of class)

  4. decoder initial state (previously defined)

  5. beam_width (user defined)

  6. start token & end token

decoder = tf.contrib.seq2seq.BeamSearchDecoder(
    cell=decoder_cell,
    embedding=self.embeddings,
    # start_tokens / end_token are the dictionary ids of the start and end symbols
    start_tokens=tf.fill([self.batch_size], tf.constant(2)),
    end_token=tf.constant(3),
    initial_state=initial_state,
    beam_width=self.beam_width,
    output_layer=self.projection_layer)

Then all that is left to do is to define the outputs, which directly reflect the real output of the whole seq2seq architecture, as this phase is where the prediction is actually computed:

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
                    decoder, output_time_major=True, maximum_iterations=summary_max_len, scope=decoder_scope)

self.prediction = tf.transpose(outputs.predicted_ids, perm=[1, 2, 0])
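At run time self.prediction holds word ids, not words. As a hedged sketch of how it could be turned back into text (sess, model, batch_x and batch_x_len are hypothetical names for a running session, the built model, and one batch of padded article ids with their true lengths; treating "</s>" as the end-of-summary token is also an assumption of this sketch):

prediction = sess.run(model.prediction, feed_dict={
    model.batch_size: len(batch_x),
    model.X: batch_x,
    model.X_len: batch_x_len,
})

# prediction has shape [batch, beam_width, time]; take the best beam (index 0)
# of the first article and map the ids back to words with reversed_dict
summary_words = []
for word_id in prediction[0][0]:
    if reversed_dict[word_id] == "</s>":  # assumed end-of-summary token
        break
    summary_words.append(reversed_dict[word_id])
print(" ".join(summary_words))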

5- Loss Block :

This block is where training actually occurs, through multiple steps:

  1. calculating loss (more on loss calculation)
  2. calculating gradients and applying clipping on gradients (more on exploding gradients)
  3. applying optimizer (here we would use Adam optimizer)

First we define our name scope, and we specify that this block only runs in the training phase:

with tf.name_scope("loss"):
            if not forward_only:

Second, we calculate the loss (more on loss calculation):

crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
                    logits=self.logits_reshape, labels=self.decoder_target)

weights = tf.sequence_mask(self.decoder_len, summary_max_len, dtype=tf.float32)

self.loss = tf.reduce_sum(crossent * weights / tf.to_float(self.batch_size))

Third, we calculate our gradients and apply clipping to them to solve the problem of exploding gradients (more on exploding gradients).

(from tutorial 4)

Exploding Gradients : this occurs with deep networks (i.e. networks with many layers, like in our case); when we apply back propagation, the gradients can get too large. This problem can be solved rather easily using the concept of gradient clipping, which simply means setting a specific threshold, so that when the gradients exceed it we clip them to a certain value.

params = tf.trainable_variables()
gradients = tf.gradients(self.loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

Fourth, we apply our optimizer. Here we use the Adam optimizer with the previously defined learning_rate:

optimizer = tf.train.AdamOptimizer(self.learning_rate)

self.update = optimizer.apply_gradients(zip(clipped_gradients, params), global_step=self.global_step)
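To see how these pieces are driven together, here is a minimal sketch of a single training step (batch_x, batch_x_len, batch_decoder_input, batch_decoder_len and batch_decoder_target are hypothetical names for one prepared batch; how to actually build these batches is covered in the next tutorial):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_feed_dict = {
        model.batch_size: len(batch_x),
        model.X: batch_x,
        model.X_len: batch_x_len,
        model.decoder_input: batch_decoder_input,
        model.decoder_len: batch_decoder_len,
        model.decoder_target: batch_decoder_target,
    }
    # one optimization step: run the update op (clipped gradients + Adam) and read back the loss
    _, step, loss = sess.run([model.update, model.global_step, model.loss],
                             feed_dict=train_feed_dict)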

Next time, if GOD wills it, we will go through:

  1. the code needed to divide our data into batches
  2. needed code to use this model for training

Then, after we are done with this core model implementation, if GOD wills it, we will go through other modern implementations for text summarization like:

  1. pointer generator
  2. Using reinforcement learning with seq2seq

(more on different implementations for seq2seq for text summarization)

All the code for this tutorial can be found as open source here.

I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials can be found here; you can simply use Google Colab to run it. Please review the tutorial and tell me what you think about it. Hope to see you again!

Published by HackerNoon on 2019/04/16