# SEQ2SEQ LEARNING PART D: Encoder-Decoder with Teacher Forcing

Welcome to Part D of the Seq2Seq Learning Tutorial Series. In this tutorial, we will design an Encoder-Decoder model to be trained with “Teacher Forcing” to solve the sample Seq2Seq problem introduced in Part A.

We will use the LSTM layer in Keras as the Recurrent Neural Network.

You can access all my SEQ2SEQ Learning videos on Murat Karakaya Akademi Youtube channel in ENGLISH or in TURKISH

You can access all the tutorials in this series from my blog at www.muratkarakaya.net. If you would like to follow up on Deep Learning tutorials, please subscribe to my YouTube Channel or follow my blog on muratkarakaya.net. You can also access this Colab Notebook using the link.

If you are ready, let’s get started!

Photo by Vedrana Filipović on Unsplash

REMINDER:

# A Simple Seq2Seq Problem: The reversed sequence problem

Assume that:

• We are given a parallel data set including X (input) and y (output) such that X[i] and y[i] have some relationship

In this tutorial, I will generate X and y parallel datasets such that the y sequence will be the reverse of the given X sequence. For example,

• Given sequence X[i] length of 4:

X[i]=[3, 2, 9, 1]

• Output sequence (y[i]) is the reversed input sequence (X[i])

y[i]=[1, 9, 2, 3]

I will call this parallel dataset “the reversed sequence problem”.

In real life (like Machine Language Translation, Image Captioning, etc.), we are given (or build) a parallel dataset: X sequences and corresponding y sequences

• However, to set up an easily traceable example, I opt to set the y sequences as the reverse of the X sequences
• You can create the X and y parallel datasets as you wish: sorted, reverse sorted, odd or even numbers selected, etc.
• We use the parallel data set to train a seq2seq model that learns how to convert/transform an input sequence from X to an output sequence in y

IMPORTANT:

• In the reversed sequence problem, the input & output sequence lengths are fixed and the same.
• In PART E, we will change the problem and the solution such that we will be dealing with variable-length sequences after we build the encoder-decoder model.

## Configure the problem

• Number of Input Timesteps: how many tokens / distinct events / numbers / words etc. are in the input sequence
• Number of Features: how many features/dimensions are used to represent one token / distinct event / number / word etc.
• Here, we use one-hot encoding to represent the integers.
• The length of the one-hot coding vector is the Number of Features
• Thus, the greatest integer will be the Number of Features-1
• When the Number of Features=10 the greatest integer will be 9 and will be represented as [0 0 0 0 0 0 0 0 0 1]
```python
#@title Configure problem
# get_reversed_pairs and create_dataset are helper functions from this
# series' notebook; they are not shown in this section.

n_timesteps_in = 4  #@param {type:"integer"}
# each input sample has 4 values

n_features = 10  #@param {type:"integer"}
# each value is one-hot encoded with 10 0/1 entries

# n_timesteps_out = 2  #@param {type:"integer"}
# each output sample has 2 values padded with 0

# generate a random sequence pair
X, y = get_reversed_pairs(n_timesteps_in, n_features, verbose=True)

# generate datasets
train_size = 2000  #@param {type:"integer"}
test_size = 200  #@param {type:"integer"}

X_train, y_train, X_test, y_test = create_dataset(
    train_size, test_size, n_timesteps_in, n_features, verbose=True)
```

```
Sample X and y
In raw format:
X=[7, 3, 6, 5], y=[5, 6, 3, 7]
In one_hot_encoded format:
X=[[0 0 0 0 0 0 0 1 0 0]
 [0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]]
y=[[0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0]]
Generated sequence datasets as follows
X_train.shape:  (2000, 4, 10) y_train.shape:  (2000, 4, 10)
X_test.shape:  (200, 4, 10) y_test.shape:  (200, 4, 10)
```
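To make the one-hot representation concrete, here is a minimal sketch in plain Python. The name `one_hot_decode` matches a helper used later in this tutorial, and `one_hot_encode` is a hypothetical counterpart; the bodies below are an assumed illustration, not the notebook's actual code.

```python
def one_hot_encode(sequence, n_features):
    """Turn a list of integers into a list of one-hot vectors."""
    encoding = []
    for value in sequence:
        vector = [0] * n_features
        vector[value] = 1  # set the position matching the integer
        encoding.append(vector)
    return encoding


def one_hot_decode(encoded_seq):
    """Recover the integers by taking the index of the largest entry."""
    return [vector.index(max(vector)) for vector in encoded_seq]


encoded = one_hot_encode([7, 3, 6, 5], n_features=10)
print(encoded[0])               # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(one_hot_decode(encoded))  # [7, 3, 6, 5]
```

With `n_features=10`, the largest integer that can be encoded is 9, which lands in the last position of the vector.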

# Before starting, you need to know:

• Python
• Keras/TF
• Deep Neural Networks
• Recurrent Neural Network concepts
• LSTM parameters and outputs
• Keras Functional API
• Basics of Encoder-Decoder approach

## A Quick Reminder: Encoder & Decoder

• The encoder encodes the input into a new representation
• The decoder decodes the encoded representation of the input into the output

Note: There are other proposed methods to solve seq2seq problems such as Conv models or Reinforcement methods. In this tutorial, we focus on the Encoder-Decoder architecture.


## Key Concepts

• Training: During training, we train the encoder and decoder such that they work together to create a context (representation) between input and output
• Inference (Prediction): After learning how to create the context (representation), they can work together to predict the output
• Encode all, decode one at a time: Mostly, the encoder reads the whole input sequence and creates a context (representation) vector. The decoder uses this context (representation) vector and the previously decoded results to create the new output step by step.
• Teacher forcing: During training, the decoder receives the correct output from the training set as the previously decoded result to predict the next output. During inference, however, the correct output is unknown, so the decoder receives its own previously decoded result to predict the next output. Teacher forcing usually speeds up the convergence of training.

DO NOT WORRY! WE WILL SEE ALL THE ABOVE CONCEPTS IN ACTION BELOW!

`LSTMoutputDimension = 16 `

## HOW TO TRAIN A GENERIC ENCODER-DECODER

• The decoder produces the output sequence one by one
• For each output, the decoder consumes a context vector and an input
• The initial context vector is created by the encoder
• The initial input is a special symbol for the decoder to make it start, e.g. ‘start’
• Using initial context and initial input, the decoder will generate the first output
• For the next output, the decoder will use its current state as a context vector and generated (predicted) output as input
• The decoder will work in such a loop using its state and output as the next step context vector and input until the generated output is a special symbol ‘stop’ or the pre-defined maximum steps (length of output) are reached.
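The loop described above can be sketched in plain Python. Here `toy_decoder_step` is a hypothetical stand-in for a trained decoder (it simply pops values from its state), and the `START` / `STOP` values are assumed symbols chosen for illustration only.

```python
START, STOP = "start", "stop"  # assumed special symbols


def toy_decoder_step(state, token):
    """Hypothetical stand-in for one decoder step.

    The state is the list of tokens still to be emitted; a real decoder
    would instead update its LSTM states and predict a token from `token`.
    """
    if not state:
        return STOP, state
    return state[-1], state[:-1]


def free_running_decode(context, step_fn, max_steps):
    """Generate outputs one by one, feeding each prediction back as input."""
    state, token = context, START  # initial context and initial input
    outputs = []
    while len(outputs) < max_steps:
        token, state = step_fn(state, token)  # predict and update the state
        if token == STOP:
            break
        outputs.append(token)  # this prediction is the next step's input
    return outputs


# The toy "encoder" context is just the input sequence itself.
print(free_running_decode([3, 2, 9, 1], toy_decoder_step, max_steps=4))
# [1, 9, 2, 3]
```

The two exit conditions mirror the description: the loop ends either when the step function returns the stop symbol or when the maximum number of steps is reached.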


## HOW TO USE A GENERIC ENCODER-DECODER FOR INFERENCE (PREDICTION)

• We provide an input sequence (X_test) to the trained encoder-decoder model
• The trained encoder-decoder model outputs the predicted sequence

## HOW TO TRAIN AN ENCODER-DECODER WITH TEACHER FORCING

The initial steps are the same:

• The decoder produces the output sequence one by one
• For each output, the decoder consumes a context vector and an input
• The initial context vector is created by the encoder
• The initial input is a special symbol for the decoder to make it start, e.g. ‘start’
• Using initial context and initial input, the decoder will generate the first output


However, the input to the decoder during the loop is different

• For the next output,
• the decoder will still use its current state as the context vector
• but instead of its own generated (predicted) output, we (the teacher!) provide the correct output to the decoder as input
• The difference is: the decoder uses the context vector and the correct input to generate the next output rather than using its prediction from the previous cycle
• The decoder will work in such a loop, using its state as the next-step context vector and the provided correct output as the next-step input, until the generated output is a special symbol ‘stop’ or the pre-defined maximum number of steps (length of output) is reached.

Therefore, we need to provide 2 input sequences to train an Encoder-Decoder with Teacher Forcing, such as:

1. input to encoder: [4 7 2 8]
2. input to decoder: [0 8 2 7]

Note that:

• The expected output is [8 2 7 4], the reverse of the input to the encoder
• 0 (zero) is selected as a special symbol for ‘start’
• The input to the decoder is created by shifting the expected output by one time step and adding a ‘start’ token as the first token:
• [8 2 7 4] → [0 8 2 7]
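The shifting step can be written in a couple of lines. This is a minimal sketch on raw integers; the function name `make_decoder_input` is hypothetical and one-hot encoding is left out for brevity.

```python
START_TOKEN = 0  # 0 is reserved as the 'start' symbol, as in the text


def make_decoder_input(target_seq, start_token=START_TOKEN):
    """Shift the target right by one step and prepend the start token."""
    return [start_token] + list(target_seq[:-1])


encoder_input = [4, 7, 2, 8]
target = encoder_input[::-1]                # expected output: reversed input
decoder_input = make_decoder_input(target)

print(target)         # [8, 2, 7, 4]
print(decoder_input)  # [0, 8, 2, 7]
```

Note that the last target token (4 here) never appears in the decoder input: it only has to be predicted, not fed back.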

During training:

• In the first cycle, the decoder will use the encoder’s state and its first input which is 0 from the decoder input sequence [0 8 2 7] to generate the first output which is expected to be 8 from [8 2 7 4]
• Assume that the decoder predicts 7
• In the generic Encoder-Decoder model, the decoder will use 7 to generate/predict the next token
• In teacher forcing, we (the teacher!) provide the second input from [0 8 2 7] which is 8 to the decoder to generate/predict the next token

Thus, during training, the teacher enforces the decoder to condition itself to generate/predict the next token according to the given correct input!
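To see the difference between the two regimes side by side, the sketch below lists which inputs the decoder would receive at each step. The predictions are made up for illustration (the first one is wrong, as in the example above), and the helper name `decoder_inputs_seen` is hypothetical.

```python
def decoder_inputs_seen(targets, predictions, teacher_forcing, start_token=0):
    """Return the input fed to the decoder at each time step.

    With teacher forcing the previous *correct* token is fed in;
    without it the decoder's own previous *prediction* is fed in.
    """
    previous = targets if teacher_forcing else predictions
    return [start_token] + list(previous[:-1])


targets = [8, 2, 7, 4]
predictions = [7, 2, 7, 4]  # hypothetical: the first prediction is wrong

print(decoder_inputs_seen(targets, predictions, teacher_forcing=True))
# [0, 8, 2, 7]
print(decoder_inputs_seen(targets, predictions, teacher_forcing=False))
# [0, 7, 2, 7]
```

In the generic model the wrong first prediction (7) is carried into the second step, whereas with teacher forcing the decoder always continues from the correct token (8).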

## HOW TO USE AN ENCODER-DECODER MODEL TRAINED WITH TEACHER FORCING FOR INFERENCE (PREDICTION)

• We need 2 input sequences:
1. Input for encoder: encoder_inputs
2. Input for decoder: decoder_inputs
• The encoder_inputs is given
• However, this time we do not have the correct outputs
• Therefore, we will provide the predicted output as the input.
• The first input is ‘start’ and the other inputs will be the outputs from the previous cycle

## CREATE AN ENCODER-DECODER MODEL WITH TEACHER FORCING TO TRAIN

• Define the model that will turn `encoder_input_data` & `decoder_input_data` into `decoder_predicted_data`
• Complete the decoder model by adding a Dense layer with a Softmax activation function to predict the next output
• The dense layer will output one-hot encoded representation as we did for the input
• Therefore, we will use n_features number of neurons
```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# TRAINING WITH TEACHER FORCING
# Define an input sequence and process it.
encoder_inputs = Input(shape=(n_timesteps_in, n_features))
encoder_lstm = LSTM(LSTMoutputDimension, return_state=True)
LSTM_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

# We discard `LSTM_outputs` and only keep the other states.
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, n_features), name='decoder_inputs')
decoder_lstm = LSTM(LSTMoutputDimension, return_sequences=True,
                    return_state=True, name='decoder_lstm')

# Set up the decoder, using `context vector` as initial state.
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

# Complete the decoder model by adding a Dense layer with Softmax
# activation function for prediction of the next output.
# Dense layer will output one-hot encoded representation as we did for input.
# Therefore, we will use n_features number of neurons.
decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)

# put together
model_encoder_training = Model([encoder_inputs, decoder_inputs],
                               decoder_outputs,
                               name='model_encoder_training')
```
• Compile the model:

```python
from tensorflow.keras.utils import plot_model

model_encoder_training.compile(optimizer='adam',
                               loss='categorical_crossentropy',
                               metrics=['accuracy'])
model_encoder_training.summary()
plot_model(model_encoder_training, show_shapes=True)
```

```
Model: "model_encoder_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_8 (InputLayer)            [(None, 4, 10)]      0
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     [(None, None, 10)]   0
__________________________________________________________________________________________________
lstm_3 (LSTM)                   [(None, 16), (None,  1728        input_8[0][0]
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 16), ( 1728        decoder_inputs[0][0]
                                                                 lstm_3[0][1]
                                                                 lstm_3[0][2]
__________________________________________________________________________________________________
decoder_dense (Dense)           (None, None, 10)     170         decoder_lstm[0][0]
==================================================================================================
Total params: 3,626
Trainable params: 3,626
Non-trainable params: 0
__________________________________________________________________________________________________
```
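The parameter counts in the summary above can be checked by hand. The sketch below applies the standard parameter-count formulas for an LSTM layer and a Dense layer; the helper names are made up for illustration.

```python
def lstm_param_count(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_param_count(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


n_features, lstm_units = 10, 16
encoder = lstm_param_count(n_features, lstm_units)
decoder = lstm_param_count(n_features, lstm_units)
dense = dense_param_count(lstm_units, n_features)

print(encoder, decoder, dense, encoder + decoder + dense)
# 1728 1728 170 3626
```

Both LSTM layers have the same size because each reads 10-dimensional one-hot vectors and has 16 units; the Dense layer maps the 16 LSTM outputs back to 10 classes.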

## PREPARE TRAINING DATASETS

• To train the Encoder-Decoder model, we need to prepare 3 data sets from the training data:
1. Input for the encoder (encoder_inputs): the whole input sequence (X) of each training sample
2. Input for the decoder (decoder_inputs): the tokens of each sample's target sequence (y), one per time step, for teacher forcing [should start with the ‘start’ symbol]
3. The target for the decoder (decoder_predicted_data): the tokens of each sample's target sequence (y), one per time step [should end with the ‘end’ symbol]
• We can modify the code such that the first number (0) of the input domain numbers is reserved for the ‘start’ and ‘end’ symbols. The modified code generates the following datasets:
```
Generated sequence datasets as follows
X_encoder_in.shape:  (5000, 4, 10)
X_decoder_in.shape:  (5000, 4, 10)
y_decoder_out.shape:  (5000, 4, 10)
Sample sequences in raw format:
X_encoder_in:  [8, 8, 9, 3]
X_decoder_in:  [0, 3, 9, 8]
y_decoder_out:  [3, 9, 8, 8]
Sample sequences in one-hot encoded format:
X_encoder_in:  [[0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 0 0 0]]
X_decoder_in:  [[1 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 1 0]]
y_decoder_out:  [[0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 1 0]]
```
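The code that produced this output is not shown here. As a rough sketch of the idea, assuming that 0 is reserved and that data tokens are drawn from 1 to n_features-1, the three datasets could be built as follows. The function names are hypothetical, and plain Python lists are used so the sketch runs without dependencies.

```python
import random


def one_hot(sequence, n_features):
    """One-hot encode a list of integers as nested lists."""
    return [[1 if i == v else 0 for i in range(n_features)] for v in sequence]


def make_teacher_forcing_dataset(n_samples, n_timesteps, n_features, seed=0):
    """Build encoder inputs, decoder inputs and decoder targets."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    X_encoder_in, X_decoder_in, y_decoder_out = [], [], []
    for _ in range(n_samples):
        # 0 is reserved, so data tokens come from 1 .. n_features - 1
        source = [rng.randint(1, n_features - 1) for _ in range(n_timesteps)]
        target = source[::-1]         # expected output: reversed input
        shifted = [0] + target[:-1]   # 'start' token + target shifted by one
        X_encoder_in.append(one_hot(source, n_features))
        X_decoder_in.append(one_hot(shifted, n_features))
        y_decoder_out.append(one_hot(target, n_features))
    return X_encoder_in, X_decoder_in, y_decoder_out


X_enc, X_dec, y_dec = make_teacher_forcing_dataset(5, 4, 10)
print(len(X_enc), len(X_enc[0]), len(X_enc[0][0]))  # 5 4 10
print(X_dec[0][0])  # the first decoder input is the one-hot 'start' token
```

Before passing such lists to Keras, they would typically be converted with `np.array(...)` to get arrays of shape (samples, timesteps, features).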
• We train the model while monitoring the loss on a held-out set of 20% of the samples as below.
• However, I prepared a custom `train_test` function for getting a detailed report of training and testing.
```python
# Run training
model_encoder_training.fit([encoder_input_data, decoder_input_data],
                           decoder_predicted_data,
                           batch_size=32,
                           epochs=50,
                           validation_split=0.2)

# train_test is a custom helper from this series' notebook (not shown here)
train_test(model_encoder_training,
           [encoder_input_data, decoder_input_data], decoder_predicted_data,
           [encoder_input_data, decoder_input_data], decoder_predicted_data,
           epochs=50, batch_size=32, patience=5, verbose=2)
```

```
training for  50  epochs begins with EarlyStopping(monitor= val_loss, patience= 5 )....
Epoch 1/50
141/141 - 1s - loss: 2.1932 - accuracy: 0.1816 - val_loss: 1.9553 - val_accuracy: 0.3305
...
Epoch 49/50
141/141 - 1s - loss: 0.0023 - accuracy: 1.0000 - val_loss: 0.0026 - val_accuracy: 1.0000
Epoch 50/50
141/141 - 1s - loss: 0.0022 - accuracy: 1.0000 - val_loss: 0.0024 - val_accuracy: 1.0000
50  epoch training finished...

PREDICTION ACCURACY (%):
Train: 100.000, Test: 100.000

10 examples from test data...
Input          Expected       Predicted      T/F
[8, 8, 9, 3]   [3, 9, 8, 8]   [3, 9, 8, 8]   True
[1, 7, 5, 9]   [9, 5, 7, 1]   [9, 5, 7, 1]   True
[3, 5, 6, 5]   [5, 6, 5, 3]   [5, 6, 5, 3]   True
[2, 3, 3, 9]   [9, 3, 3, 2]   [9, 3, 3, 2]   True
[5, 4, 3, 5]   [5, 3, 4, 5]   [5, 3, 4, 5]   True
[4, 2, 1, 5]   [5, 1, 2, 4]   [5, 1, 2, 4]   True
[9, 5, 2, 4]   [4, 2, 5, 9]   [4, 2, 5, 9]   True
[9, 8, 5, 5]   [5, 5, 8, 9]   [5, 5, 8, 9]   True
[2, 9, 9, 7]   [7, 9, 9, 2]   [7, 9, 9, 2]   True
[7, 2, 9, 8]   [8, 9, 2, 7]   [8, 9, 2, 7]   True
Accuracy:  1.0
```
• We observed that training finishes with almost 100% accuracy on both the training and validation sets

IMPORTANT: Please note that we execute validation and testing by just using the training data and Teacher Forcing in place!!!

• WHY?

Because:

• Teacher Forcing needs to know the correct output beforehand
• Teacher Forcing is a method for improving the training process
• The model employing Teacher Forcing CAN NOT BE USED in inference/testing

Therefore,

• the model that we trained above CAN NOT BE DIRECTLY USED in inference/testing
• We will use some layers (with their weights) of the trained model to create a new model
• The new model will not use Teacher Forcing
• Thus, the input to the new model will NOT BE `[encoder_input_data, decoder_input_data]` as we designed in `model_encoder_training`
• Remember, in Teacher Forcing, we set `decoder_input_data` such that it begins with a special symbol `start` and continues with the target sequence data except for the last time step.
• Now, during inference (testing), we do not know the correct (expected) target data beforehand!
• We define the `decoder_input_data` as follows:
• it begins with a special symbol `start`
• it will continue with an input created by the decoder at the previous time step
• in other words, the decoder’s output at time step t will be used as the decoder’s input at time step t+1

# Let’s begin to design an encoder-decoder model for inference

## We create a separate encoder model by using the trained layers in the above model

IMPORTANT: pay attention that in this model we use `encoder_inputs, encoder_states` for encoding, which are parts of the encoder model we trained above. That is, these layers come with their **trained weights** from training with Teacher Forcing.

```python
encoder_model = Model(encoder_inputs, encoder_states)
```

## Then we create a separate decoder model by using the trained layers in the above model

• Design the decoder model by defining layers for:
• inputs
• decoding (LSTM)
• outputs

IMPORTANT: pay attention that in this model we use `decoder_lstm` for decoding, which is a part of the decoder model we trained above. That is, this layer comes with its trained weights from training with Teacher Forcing.

```python
decoder_state_input_h = Input(shape=(LSTMoutputDimension,))
decoder_state_input_c = Input(shape=(LSTMoutputDimension,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
```
• Even though we define the encoder and decoder models we still need to dynamically provide the `decoder_input_data` as follows:
• it begins with a special symbol `start`
• it will continue with an input created by the decoder at the previous time step
• in other words, the decoder’s output at time step t will be used as the decoder’s input at time step t+1
• Let’s code it as a function:
```python
import numpy as np


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, n_features))
    # Populate the first token of the target sequence with the start symbol (0).
    target_seq[0, 0, 0] = 1

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_seq = list()
    while not stop_condition:
        # Decode the input to a token/output prediction
        # plus the states required for the context vector.
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Convert the token/output prediction to a token/output.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_digit = sampled_token_index
        # Add the predicted token/output to the output sequence.
        decoded_seq.append(sampled_digit)

        # Exit condition: hit max length
        # (the output length is fixed in this problem).
        if len(decoded_seq) == n_timesteps_in:
            stop_condition = True

        # Update the input target sequence (of length 1)
        # with the predicted token/output.
        target_seq = np.zeros((1, 1, n_features))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update input states (context vector) with the output states.
        states_value = [h, c]

    # When the loop exits, return the output sequence.
    return decoded_seq
```

## We can now put the encoder model and decoder model together by using the function above for inference as below.

IMPORTANT:

• Since we used the trained layers here we do NOT need to train recently created models.
• We only provide the input sequence to the function.
```python
print('Input \t\t\t  Expected  \t   Predicted \t\tT/F')
correct = 0
sampleNo = 10

for sample in range(0, sampleNo):
    predicted = decode_sequence(
        encoder_input_data[sample].reshape(1, n_timesteps_in, n_features))
    if one_hot_decode(decoder_predicted_data[sample]) == predicted:
        correct += 1
    print(one_hot_decode(encoder_input_data[sample]), '\t\t',
          one_hot_decode(decoder_predicted_data[sample]), '\t', predicted,
          '\t\t', one_hot_decode(decoder_predicted_data[sample]) == predicted)
print('Accuracy: ', correct / sampleNo)
```

```
Input          Expected       Predicted      T/F
[8, 8, 9, 3]   [3, 9, 8, 8]   [3, 9, 8, 8]   True
[1, 7, 5, 9]   [9, 5, 7, 1]   [9, 5, 7, 1]   True
[3, 5, 6, 5]   [5, 6, 5, 3]   [5, 6, 5, 3]   True
[2, 3, 3, 9]   [9, 3, 3, 2]   [9, 3, 3, 2]   True
[5, 4, 3, 5]   [5, 3, 4, 5]   [5, 3, 4, 5]   True
[4, 2, 1, 5]   [5, 1, 2, 4]   [5, 1, 2, 4]   True
[9, 5, 2, 4]   [4, 2, 5, 9]   [4, 2, 5, 9]   True
[9, 8, 5, 5]   [5, 5, 8, 9]   [5, 5, 8, 9]   True
[2, 9, 9, 7]   [7, 9, 9, 2]   [7, 9, 9, 2]   True
[7, 2, 9, 8]   [8, 9, 2, 7]   [8, 9, 2, 7]   True
Accuracy:  1.0
```
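The evaluation loop above counts a prediction as correct only when the whole sequence matches, whereas the Keras `accuracy` metric reported during training is computed per token. The sketch below contrasts the two measures on made-up predictions; both helper names are hypothetical.

```python
def sequence_accuracy(expected, predicted):
    """Fraction of sequences that are predicted exactly right."""
    hits = sum(1 for e, p in zip(expected, predicted) if list(e) == list(p))
    return hits / len(expected)


def token_accuracy(expected, predicted):
    """Fraction of individual tokens that are predicted right."""
    pairs = [(a, b) for e, p in zip(expected, predicted) for a, b in zip(e, p)]
    return sum(1 for a, b in pairs if a == b) / len(pairs)


expected = [[3, 9, 8, 8], [9, 5, 7, 1]]
predicted = [[3, 9, 8, 8], [9, 5, 1, 1]]  # hypothetical: one wrong token

print(sequence_accuracy(expected, predicted))  # 0.5
print(token_accuracy(expected, predicted))     # 0.875
```

A single wrong token costs a whole sequence under the first measure, so sequence-level accuracy at inference can be noticeably lower than the per-token accuracy seen during training.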

# Observations:

• Teacher Forcing is a method for training the encoder-decoder models used in Seq2Seq learning that accelerates training
• Teacher Forcing can ONLY be used during training
• We need to handle how to use the model in inference
• Even though Teacher Forcing improves the training process through faster convergence, at inference the model can produce low accuracy, even on the training data.
• Therefore, you need to use Teacher Forcing with caution

# What if I don’t want to use teacher forcing for training?

• Please look at the previous part to see how to implement an Encoder-Decoder without Teacher Forcing training.

