Tuesday, November 8, 2022

Character Level Text Generation with an LSTM Model

 


This tutorial is the fifth part of the “Text Generation in Deep Learning with Tensorflow & Keras” series. In this series, we have been covering all the topics related to Text Generation with sample implementations in Python, Tensorflow & Keras.

In this tutorial, we will focus on how to build a Language Model with the Keras LSTM layer for Character Level Text Generation. First, we will download a sample corpus (a text file). After opening the file, we will apply the TensorFlow input pipeline that we developed in Part B to prepare the training dataset by preprocessing the text and splitting it into input character sequences (X) and output characters (y). Then, we will design an LSTM-based Language Model and train it on the training set. Later on, we will apply several sampling methods that we implemented in Part D to generate text and observe the effect of these sampling methods on the generated text. Thus, in the end, we will have a trained LSTM-based Language Model for character-level text generation with three sampling methods.

You can access all the parts of the Text Generation in Deep Learning with Tensorflow & Keras tutorial series on my blog at muratkarakaya.net. You can watch all these parts on the Murat Karakaya Akademi channel on YouTube in ENGLISH or TURKISH. You can access this Colab Notebook using the link.

If you would like to learn more about Deep Learning with practical coding examples, please subscribe to Murat Karakaya Akademi YouTube Channel or follow my blog at muratkarakaya.net. Do not forget to turn on notifications so that you will be notified when new parts are uploaded.

If you are ready, let’s get started!



Photo by Jan Huber on Unsplash

I assume that you have already watched all the previous parts.

Please make sure you have reviewed the previous parts to get the most out of this one.

What is Character Level Text Generation?

A Language Model can be trained to generate text character by character. In this case, each input and output token is a character. Moreover, the Language Model outputs a conditional probability distribution over the character set.
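For instance, with a toy sequence length of 5, the string "hello world" yields (input sequence, next character) training pairs like the ones below. This is a minimal illustration only, separate from the tutorial's actual pipeline; the names used here (seq_len, pairs) are for this example.

# Toy illustration: build (input sequence, next character) pairs from a string.
text = "hello world"
seq_len = 5

pairs = [(text[i:i + seq_len], text[i + seq_len])
         for i in range(len(text) - seq_len)]

for X, y in pairs[:3]:
    print(repr(X), "->", repr(y))
# 'hello' -> ' '
# 'ello ' -> 'w'
# 'llo w' -> 'o'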

For more details, please check Part A.

1. BUILD A TENSORFLOW INPUT PIPELINE

What is a Data Pipeline?

A data pipeline is an automated process that involves extracting, transforming, combining, validating, and loading data for further analysis and visualization.

It provides end-to-end speed by eliminating errors and reducing bottlenecks and latency.

It can process multiple data streams at once.

In short, it is an absolute necessity for today’s data-driven solutions.

If you are not familiar with TF data pipelines, you can check my tutorials in English or Turkish or at my blog muratkarakaya.net.

What will we do in this Text Data pipeline?

We will create a data pipeline to prepare training data for a character-level text generator.

Thus, in the pipeline, we will

  • open & load corpus (text file)
  • convert the text into a sequence of characters
  • remove unwanted characters such as punctuation, HTML tags, white spaces, etc.
  • generate input (X) and output (y) pairs as character sequences
  • concatenate input (X) and output (y) into train data
  • cache, prefetch, and batch the train data for performance

For more information, please refer to Part B: Tensorflow Data Pipeline for Character Level Text Generation on YouTube (ENGLISH / TURKISH) or on my blog at muratkarakaya.net.

Input for the TensorFlow Pipeline

# For English text
!curl -O https://s3.amazonaws.com/text-datasets/nietzsche.txt
# For Turkish text
#!curl -O https://raw.githubusercontent.com/kmkarakaya/ML_tutorials/master/data/mesnevi_Tumu.txt

TensorFlow Pipeline

For more information and the complete code, please refer to Part B: Tensorflow Data Pipeline for Character Level Text Generation on YouTube (ENGLISH / TURKISH) or on my blog at muratkarakaya.net.
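For reference, here is a minimal sketch of such a character-level pipeline. The parameter values (sequence_length = 20, batch_size = 64, max_features = 96, embedding_dim = 16) match the outputs shown in this tutorial, but the implementation details below (the char_split helper, the window-based dataset construction, and the prompt-truncating vectorize_text) are illustrative assumptions; the complete, tested version is in Part B.

# A minimal sketch of the Part B pipeline (illustrative; see Part B for the full code).
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

sequence_length = 20   # length of the input character sequence (X)
batch_size = 64
max_features = 96      # upper bound on the character vocabulary size
embedding_dim = 16     # used later by the model

def char_split(text):
    # Split each string into individual characters.
    return tf.strings.unicode_split(text, "UTF-8")

vectorize_layer = TextVectorization(max_tokens=max_features,
                                    split=char_split,
                                    output_mode="int")

raw_text = open("nietzsche.txt", encoding="utf-8").read()
vectorize_layer.adapt([raw_text])

def vectorize_text(text):
    # Map a string to integer character ids, truncated to sequence_length (used for prompts).
    ids = tf.squeeze(vectorize_layer(tf.expand_dims(text, -1)))
    return ids[:sequence_length]

# Encode the whole corpus as one long sequence of character ids.
corpus_ids = tf.squeeze(vectorize_layer(tf.expand_dims(raw_text, -1)))

# Slide a window over the corpus: 20 characters in (X), the next character out (y).
windows = (tf.data.Dataset.from_tensor_slices(corpus_ids)
           .window(sequence_length + 1, shift=1, drop_remainder=True)
           .flat_map(lambda w: w.batch(sequence_length + 1)))

train_ds = (windows.map(lambda w: (w[:-1], w[-1]))
            .shuffle(10_000)
            .batch(batch_size)
            .cache()
            .prefetch(tf.data.AUTOTUNE))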

Result of the Data Pipeline:

for sample in train_ds.take(1):
    print("input (X) dimension: ", sample[0].numpy().shape, "\noutput (y) dimension: ", sample[1].numpy().shape)

input (X) dimension:  (64, 20)
output (y) dimension:  (64,)

for sample in train_ds.take(1):
    print("input (sequence of chars): ", sample[0][0].numpy(), "\noutput (next char to complete the input): ", sample[1][0].numpy())

input (sequence of chars):  [ 8 4 6 5 8 2 4 11 6 4 2 5 4 11 6 9 2 16 6 12]
output (next char to complete the input):  12

for sample in train_ds.take(1):
    print("input (sequence of chars): ", decode_sequence(sample[0][0].numpy()), "\noutput (next char to complete the input): ", vectorize_layer.get_vocabulary()[sample[1][0].numpy()])

input (sequence of chars):  al timesuch as the
output (next char to complete the input):  s

2. PREPARE SAMPLING METHODS

In Text Generation, sampling means randomly picking the next token according to the generated conditional probability distribution.

That is, after generating the conditional probability distribution over the set of tokens (vocabulary) for the given input sequence, we need to carefully decide how to select the next token (sample) from this distribution.

There are several methods for sampling in text generation (see here and here):

  • Greedy Search (Maximization)
  • Temperature Sampling
  • Top-K Sampling
  • Top-P Sampling (Nucleus sampling)
  • Beam Search

In this tutorial, we will code Greedy Search, Temperature Sampling, and Top-K Sampling.

For more information about Sampling, please review Part D: Sampling in Text Generation on Youtube ( ENGLISH / TURKISH) or my blog at muratkarakaya.net.

import numpy as np
import tensorflow as tf

def softmax(z):
    return np.exp(z) / sum(np.exp(z))

def greedy_search(conditional_probability):
    # Pick the token with the highest probability.
    return np.argmax(conditional_probability)

def temperature_sampling(conditional_probability, temperature=1.0):
    # Rescale the log-probabilities by the temperature, re-normalize,
    # and draw one token from the reweighted distribution.
    conditional_probability = np.asarray(conditional_probability).astype("float64")
    conditional_probability = np.log(conditional_probability) / temperature
    reweighted_conditional_probability = softmax(conditional_probability)
    probas = np.random.multinomial(1, reweighted_conditional_probability, 1)
    return np.argmax(probas)

def top_k_sampling(conditional_probability, k):
    # Keep only the k most probable tokens, re-normalize, and sample among them.
    top_k_probabilities, top_k_indices = tf.math.top_k(conditional_probability, k=k, sorted=True)
    top_k_probabilities = np.asarray(top_k_probabilities).astype("float32")
    top_k_probabilities = np.squeeze(top_k_probabilities)
    top_k_indices = np.asarray(top_k_indices).astype("int32")
    top_k_redistributed_probability = softmax(top_k_probabilities)
    top_k_redistributed_probability = np.asarray(top_k_redistributed_probability).astype("float32")
    sampled_token = np.random.choice(np.squeeze(top_k_indices), p=top_k_redistributed_probability)
    return sampled_token

3. AN LSTM-BASED LANGUAGE MODEL FOR TEXT GENERATION

In this tutorial, we will use the Keras LSTM layer to create a Language Model for character-level text generation.

3.1 Define the model

As we already prepared the training dataset, we need to define the input specs accordingly.

Remember that in the training set, the length of the input (X) sequence (sequence_length) is 20 tokens (chars). Thus, we provide this information to the Input layer.

Next, we add an Embedding layer to map the vocabulary indices into a space of dimensionality embedding_dim.

After applying Dropout, we use an LSTM layer to process the sequence and learn to generate the next token with the help of a Dense layer.

Here is the complete code:

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(sequence_length,), dtype="int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)  # map character ids to dense vectors
x = layers.Dropout(0.5)(x)
x = layers.LSTM(128, return_sequences=True)(x)             # process the character sequence
x = layers.Flatten()(x)
predictions = layers.Dense(max_features, activation='softmax')(x)  # distribution over the vocabulary
model_LSTM = tf.keras.Model(inputs, predictions, name="model_LSTM")

3.2 Compile the model

Since we use integers (not one-hot encoding) to represent the output (y), we need to use the sparse_categorical_crossentropy loss function.

model_LSTM.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_LSTM.summary())
Model: "model_LSTM"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 20)] 0
_________________________________________________________________
embedding (Embedding) (None, 20, 16) 1536
_________________________________________________________________
dropout (Dropout) (None, 20, 16) 0
_________________________________________________________________
lstm (LSTM) (None, 20, 128) 74240
_________________________________________________________________
flatten (Flatten) (None, 2560) 0
_________________________________________________________________
dense (Dense) (None, 96) 245856
=================================================================
Total params: 321,632
Trainable params: 321,632
Non-trainable params: 0
_________________________________________________________________
None

3.3 Train the model

We train the Language Model for 3 epochs.

model_LSTM.fit(train_ds, epochs=3)

Epoch 1/3
3077/3077 [==============================] - 30s 10ms/step - loss: 1.8037 - accuracy: 0.4592
Epoch 2/3
3077/3077 [==============================] - 30s 10ms/step - loss: 1.7858 - accuracy: 0.4636
Epoch 3/3
3077/3077 [==============================] - 30s 10ms/step - loss: 1.7677 - accuracy: 0.4680





<tensorflow.python.keras.callbacks.History at 0x7f7f4b8f6710>

3.4 An Auxiliary Function for Decoding Token Indices to Characters

We need to convert each token index in the generated text to its corresponding character. Therefore, I prepare the decode_sequence() function as below:

def decode_sequence(encoded_sequence):
    # Map each token id back to its character and join them into a string.
    decoded_sequence = []
    for token in encoded_sequence:
        decoded_sequence.append(vectorize_layer.get_vocabulary()[token])
    sequence = ''.join(decoded_sequence)
    print("\t", sequence)
    return sequence

3.5 Another Auxiliary Function for Generating Text

To generate text with various sampling methods, I prepare the following function. The generate_text(model, prompt, step) function takes the trained Language Model, the prompt, and the length of the text to be generated as the parameters. Then, it generates text with three different sampling methods.

def generate_text(model, seed_original, step):
    seed = vectorize_text(seed_original)
    print("The prompt is")
    decode_sequence(seed.numpy().squeeze())

    # Text generated by Greedy Search sampling
    seed = vectorize_text(seed_original).numpy().reshape(1, -1)
    generated_greedy_search = seed
    for i in range(step):
        predictions = model.predict(seed)
        next_index = greedy_search(predictions.squeeze())
        generated_greedy_search = np.append(generated_greedy_search, next_index)
        seed = generated_greedy_search[-sequence_length:].reshape(1, sequence_length)
    print("Text Generated by Greedy Search Sampling:")
    decode_sequence(generated_greedy_search)

    # Text generated by Temperature sampling
    print("Text Generated by Temperature Sampling:")
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print("\ttemperature: ", temperature)
        seed = vectorize_text(seed_original).numpy().reshape(1, -1)
        generated_temperature = seed
        for i in range(step):
            predictions = model.predict(seed)
            next_index = temperature_sampling(predictions.squeeze(), temperature)
            generated_temperature = np.append(generated_temperature, next_index)
            seed = generated_temperature[-sequence_length:].reshape(1, sequence_length)
        decode_sequence(generated_temperature)

    # Text generated by Top-K sampling
    print("Text Generated by Top-K Sampling:")
    for k in [2, 3, 4, 5]:
        print("\tTop-k: ", k)
        seed = vectorize_text(seed_original).numpy().reshape(1, -1)
        generated_top_k = seed
        for i in range(step):
            predictions = model.predict(seed)
            next_index = top_k_sampling(predictions.squeeze(), k)
            generated_top_k = np.append(generated_top_k, next_index)
            seed = generated_top_k[-sequence_length:].reshape(1, sequence_length)
        decode_sequence(generated_top_k)

We can call the generate_text() function by providing the trained Language Model, a prompt, and the number of characters to be generated, as below.

You can run this method multiple times to observe the generated text with different sampling methods.

generate_text(model_LSTM,"WHO is it really that puts questions to", 100)

The output is:

The prompt is
who is it really tha
Text Generated by Greedy Search Sampling:
who is it really that the soul of the same to the berieve the sense of the soul of the sempeth of the soul of the sempet
Text Generated by Temperature Sampling:
temperature: 0.2
who is it really that the soul of the some the same of the soul of the same that the more and the soul of the sempates a
temperature: 0.5
who is it really that the secfit semelt of the soul and the sor to as a wire it is which as as its rame to the beadts
temperature: 1.0
who is it really that wath tokest oud were abere at saed af any though niff to bolevse of the names bes rose it by rvab
temperature: 1.2
who is it really that positions of jenw cisns oruaritions anither which thenringurcomsome itality his our gagacin astoo
Text Generated by Top-K Sampling:
Top-k: 2
who is it really than his dot to be and secf assemethe intentious as a cropessious of all thing one sindtence of his oth
Top-k: 3
who is it really thasthisc here ass isthaction which heare of als actiable havifin be at a sinntlys hiss hor tath
Top-k: 4
who is it really that werendivanse astains or silcussine attainss into thoug howy athort the himselv an ascimicustselved
Top-k: 5
who is it really thattisserftut isturctyd for tomliatif sace of croselyd soungher ar indepfimiessinfstialtitude it o

4. OBSERVATIONS

Info about the corpus

  • The size of the vocabulary (number of distinct characters): 34
  • The number of generated sequences: 196980
  • The maximum length of sequences: 20
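These figures can be read directly off the pipeline objects. Below is a quick check, assuming the vectorize_layer, train_ds, and sequence_length names used in the pipeline above (counting by iterating the dataset is a simple, if slow, way to get the number of sequences):

# Inspect the corpus statistics (names assumed from the pipeline above).
vocab = vectorize_layer.get_vocabulary()
print("Vocabulary size (distinct characters):", len(vocab))

num_sequences = sum(int(batch_X.shape[0]) for batch_X, _ in train_ds)
print("Number of (X, y) training sequences:", num_sequences)
print("Input sequence length:", sequence_length)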

Note that this corpus is actually not large enough to train a model that generates high-quality text. However, due to the limitations of the Colab platform (RAM and GPU), I used this corpus for demonstration purposes.

Therefore, please keep in mind these limitations when considering the generated texts.

About the Language Model

  • We implement a single-LSTM-layer model. Thus, we do not expect this simple model to create high-quality texts.
  • On the other hand, you can try to improve this model by increasing the number of LSTM layers, the output dimensions of the embedding and LSTM layers, and so on, as sketched below.
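For instance, a slightly deeper variant could look like the following sketch. The hyperparameters here (embedding dimension 64, two LSTM layers of 256 units, dropout 0.3) are illustrative choices, not values tested in this tutorial:

# A sketch of a deeper variant (illustrative hyperparameters, not trained in this tutorial).
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(sequence_length,), dtype="int64")
x = layers.Embedding(max_features, 64)(inputs)    # wider embedding
x = layers.Dropout(0.3)(x)
x = layers.LSTM(256, return_sequences=True)(x)    # first LSTM layer returns the full sequence
x = layers.LSTM(256)(x)                           # second LSTM layer returns the last hidden state
predictions = layers.Dense(max_features, activation="softmax")(x)
model_LSTM_deep = tf.keras.Model(inputs, predictions, name="model_LSTM_deep")
model_LSTM_deep.compile(loss="sparse_categorical_crossentropy",
                        optimizer="adam", metrics=["accuracy"])

Everything else (the training dataset, the sampling functions, and generate_text) can be reused unchanged with this deeper model.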

Greedy Search

  • This method simply selects the token with the highest probability as the next token (word or char).
  • However, always sampling the most likely token makes the generated text repetitive and causes it to get stuck in loops, as in the example above.

Temperature Sampling

  • If the temperature is set to very low values (or 0), Temperature Sampling becomes equivalent to Greedy Search and results in extremely repetitive and predictable text.
  • With higher temperatures, the generated text becomes more random, interesting, and surprising, even creative; it may sometimes invent completely new (misspelled) words that sound somewhat plausible.

Top-k Sampling

  • In the literature, top-k sampling is reported to improve quality by removing the tail of the distribution, making it less likely that the model goes off-topic.
  • However, in our case, there are not many tokens we could reasonably sample from (the distribution is broad).
  • Thus, in the above examples, Top-K sampling generates text that mostly looks random.
  • Therefore, the k value should be chosen carefully relative to the size of the token vocabulary.

CONCLUSION:

In this tutorial, we applied the following steps for character-level text generation:

  • download a corpus
  • apply the Tensorflow Data Pipeline
  • create a Language Model based on the Keras LSTM layer
  • use three sampling methods

In the next parts, we will see

  • Part E: Encoder-Decoder Model for Character Level Text Generation
  • Part F: Recurrent Neural Network (LSTM) Model for Word Level Text Generation
  • Part G: Encoder-Decoder Model for Word Level Text Generation

Comments or Questions?

Please share your Comments or Questions.

Thank you in advance.

Do not forget to check out the next parts!

Take care!

References

What is a Data Pipeline?

tf.data: Build TensorFlow input pipelines

Text classification from scratch

Working with Keras preprocessing layers

Character-level text generation with LSTM

Toward Controlled Generation of Text

Attention Is All You Need

Talk to Transformer

What is the difference between word-based and char-based text generation RNNs?

The survey: Text generation models in deep learning

Generative Adversarial Networks for Text Generation

FGGAN: Feature-Guiding Generative Adversarial Networks for Text Generation

How to sample from language models

How to generate text: using different decoding methods for language generation with Transformers

Hierarchical Neural Story Generation

Text generation with LSTM

A guide to language model sampling in AllenNLP

Generating text from the language model

How to Implement a Beam Search Decoder for Natural Language Processing

Controllable Neural Text Generation

You can follow me on these social networks:

YouTube

Facebook

Instagram

LinkedIn

Github

Kaggle

muratkarakaya.net