Tuesday, November 8, 2022

Building an Efficient TensorFlow Input Pipeline for Word-Level Text Generation



This tutorial is the third part of the “Text Generation in Deep Learning with Tensorflow & Keras” series.

In this series, we have been covering all the topics related to Text Generation with sample implementations in Python. This tutorial will focus on how to build an efficient TensorFlow input pipeline for word-level text generation. First, we will download a sample corpus (text file). After opening the file and reading it line by line, we will split the text into words. Then, we will generate pairs consisting of an input word sequence (X) and an output word (y).

Using the tf.data API and the Keras TextVectorization layer, we will

  • preprocess the text,
  • convert the words into integer representation,
  • prepare the training dataset from the pairs,
  • and optimize the data pipeline.

Thus, in the end, we will be ready to train a Language Model for word-level text generation.

If you would like to learn more about Deep Learning with practical coding examples, please subscribe to Murat Karakaya Akademi YouTube Channel or follow my blog on muratkarakaya.net. Do not forget to turn on notifications so that you will be notified when new parts are uploaded.

If you are ready, let’s get started!



Photo by Quinten de Graaf on Unsplash


You can read all the posts about Text Generation in Deep Learning with Tensorflow & Keras Series here. You can watch all these parts on the Murat Karakaya Akademi channel on YouTube in ENGLISH or TURKISH.

I assume that you have already watched all the previous parts. Please make sure you have reviewed them so you can get the most out of this part.

What is Word-Level Text Generation?

A Language Model can be trained to generate text word by word.

In this case, the input is a sequence of word tokens and the output is a single word token.

In training, we supply a sequence of words as input (X) and a target word (the next word that completes the input) as output (y).

After training, the Language Model generates a conditional probability distribution over the vocabulary of words for a given input sequence.

To generate text, we iterate over the following steps (a minimal code sketch follows the list):

  • Step 1: we provide a sequence of words to the Language Model as input
  • Step 2: the Language Model outputs a conditional probability distribution over the vocabulary
  • Step 3: we sample a word from the distribution
  • Step 4: we append the sampled word to the generated text
  • Step 5: we form the next input sequence by appending the sampled word to the previous input
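
Here is a minimal sketch of that loop (illustrative only; model, vectorize_layer, and vocab are hypothetical names for a trained word-level language model, an adapted TextVectorization layer, and its vocabulary list):

import numpy as np

def generate_text(model, seed_text, num_words=20):
    words = seed_text.split()
    for _ in range(num_words):
        # Steps 1-2: feed the last 4 words and get a distribution over the vocabulary
        x = vectorize_layer([" ".join(words[-4:])])
        probs = model.predict(x, verbose=0)[0]
        probs = probs / probs.sum()                      # guard against rounding drift
        # Step 3: sample a word id from the distribution
        next_id = np.random.choice(len(probs), p=probs)
        # Steps 4-5: append the sampled word; the input window slides forward
        words.append(vocab[next_id])
    return " ".join(words)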

For more details, please check Parts A and B.

Tensorflow Data Pipeline: tf.data

What is a Data Pipeline?

A Data Pipeline is an automated process that involves extracting, transforming, combining, validating, and loading data for further analysis and visualization.

It provides end-to-end velocity by eliminating errors and combatting bottlenecks or latency.

It can process multiple data streams at once.

In short, it is an absolute necessity for today’s data-driven solutions.

If you are not familiar with data pipelines, you can check my tutorials in English or Turkish.

Why Tensorflow Data Pipeline?

The tf.data API enables us

  • to build complex input pipelines from simple, reusable pieces.
  • to handle large amounts of data, read from different data formats, and perform complex transformations.

What can be done in a Text Data Pipeline?

The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.
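
As a toy illustration (not part of the pipeline we build below; the tiny vocabulary is made up), those three steps can look like this:

import tensorflow as tf

# Raw text -> tokens -> integer ids via a lookup table -> padded batches
lines = tf.data.Dataset.from_tensor_slices(["hello world", "deep learning is fun"])
tokens = lines.map(tf.strings.split)
vocab = ["hello", "world", "deep", "learning", "is", "fun"]
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(vocab, tf.range(len(vocab), dtype=tf.int64)),
    num_oov_buckets=1)
ids = tokens.map(table.lookup)
for batch in ids.padded_batch(2):
    print(batch.numpy())   # sequences of different lengths, padded together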

What will we do in this Text Data pipeline?

We will create a data pipeline to prepare training data for word-level text generators.

Thus, in the pipeline, we will

  • open & load corpus (text file)
  • convert the text into a sequence of words
  • generate input (X) and output (y) tuples as word sequences
  • remove unwanted tokens such as punctuation, HTML tags, white space, etc.
  • vectorize the input (X) and output (y) tuples
  • concatenate input (X) and output (y) into train data
  • cache, prefetch, and batch the data pipeline for performance

After this brief introduction to the TensorFlow data pipeline, let's start creating our text pipeline.

Building an Efficient TensorFlow Input Pipeline for Word-Level Text Generation

2. LOAD TEXT DATA LINE BY LINE

To load the data into our pipeline, I will use the tf.data.TextLineDataset() method.
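
This part assumes the imports and the corpus download from the previous parts. If you are starting from scratch, a minimal setup sketch would look like the following (the download URL is the Nietzsche corpus used in the Keras text-generation examples; the TextVectorization import path may differ in older TensorFlow 2 versions):

import re
import string
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Download the corpus (saved under ~/.keras/datasets by default)
path_to_file = tf.keras.utils.get_file(
    "nietzsche.txt", origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")

You can then point tf.data.TextLineDataset at path_to_file, or at a local copy named nietzsche.txt as in the next cell.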

raw_data_ds = tf.data.TextLineDataset(["nietzsche.txt"])

Let’s see some lines from the uploaded text:

for elems in raw_data_ds.take(10):
    print(elems.numpy())  # .decode("utf-8")
b'PREFACE'
b''
b''
b'SUPPOSING that Truth is a woman--what then? Is there not ground'
b'for suspecting that all philosophers, in so far as they have been'
b'dogmatists, have failed to understand women--that the terrible'
b'seriousness and clumsy importunity with which they have usually paid'
b'their addresses to Truth, have been unskilled and unseemly methods for'
b'winning a woman? Certainly she has never allowed herself to be won; and'
b'at present every kind of dogma stands with sad and discouraged mien--IF,'

3. SPLIT THE TEXT INTO WORD TOKENS

In Part A, we mentioned that we can train a language model and generate new text by using two different units:

  • character level
  • word level

That is, you can split your text into a sequence of characters or words.

In this tutorial, we will focus on word-level tokenization.

If you would like to learn how to create character-level tokenization, please take a look at Part B.

raw_data_ds = raw_data_ds.map(lambda x: tf.strings.split(x))
for elems in raw_data_ds.take(5):
    print(elems.numpy())
[b'PREFACE']
[]
[]
[b'SUPPOSING' b'that' b'Truth' b'is' b'a' b'woman--what' b'then?' b'Is'
b'there' b'not' b'ground']
[b'for' b'suspecting' b'that' b'all' b'philosophers,' b'in' b'so' b'far'
b'as' b'they' b'have' b'been']

Flatten the dataset for further processing:

raw_data_ds = raw_data_ds.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
for elems in raw_data_ds.take(5):
    print(elems.numpy())
b'PREFACE'
b'SUPPOSING'
b'that'
b'Truth'
b'is'

4. GENERATE X & y TUPLES

We can split the text into two datasets as below:

  • The first dataset (X) is the input data to the model which will hold fixed-size word sequences (partial sentences)
  • The second dataset (y) is the output data which has only one-word samples (next word)

To create these datasets (X input sequence of words & y next word), we can apply the tf.data.Dataset.window() transformation.

First, define the size of the input sequence: How many words will be in the input?

input_sequence_size= 4

Then, apply the window() transformation such that each window will have input_sequence_size+1 words (|X|+|y|)

sequence_data_ds = raw_data_ds.window(input_sequence_size + 1, drop_remainder=True)
for window in sequence_data_ds.take(3):
    print(list(window.as_numpy_iterator()))
[b'PREFACE', b'SUPPOSING', b'that', b'Truth', b'is']
[b'a', b'woman--what', b'then?', b'Is', b'there']
[b'not', b'ground', b'for', b'suspecting', b'that']

However, the window() method returns a dataset containing windows, where each window is itself represented as a dataset. Something like {{1,2,3,4,5},{6,7,8,9,10},...}, where {...} represents a dataset.

print(window)
<_VariantDataset shapes: (), types: tf.string>

But we just want a regular dataset containing tensors: {[1,2,3,4,5],[6,7,8,9,10],…}, where […] represents a tensor. The flat_map() method returns all the tensors in a nested dataset, after transforming each nested dataset.

If we didn’t batch, we would get: {1,2,3,4,5,6,7,8,9,10,…}. By batching each window to its full size, we get {[1,2,3,4,5],[6,7,8,9,10],…} as we desired.

You can find more details about the transformation here

sequence_data_ds = sequence_data_ds.flat_map(lambda window: window.batch(5))  # 5 = input_sequence_size + 1
for elem in sequence_data_ds.take(3):
    print(elem)
tf.Tensor([b'PREFACE' b'SUPPOSING' b'that' b'Truth' b'is'], shape=(5,), dtype=string)
tf.Tensor([b'a' b'woman--what' b'then?' b'Is' b'there'], shape=(5,), dtype=string)
tf.Tensor([b'not' b'ground' b'for' b'suspecting' b'that'], shape=(5,), dtype=string)

Now each item in the dataset is a tensor, so we can split it into X & y datasets:

sequence_data_ds = sequence_data_ds.map(lambda window: (window[:-1], window[-1:]))
X_train_ds_raw = sequence_data_ds.map(lambda X,y: X)
y_train_ds_raw = sequence_data_ds.map(lambda X,y: y)

Let’s see some input-output pairs:

print("Input X  (sequence) \t\t    ----->\t Output y (next word)")
for elem1, elem2 in zip(X_train_ds_raw.take(3), y_train_ds_raw.take(3)):
    print(elem1.numpy(), "\t\t----->\t", elem2.numpy())
Input X (sequence) -----> Output y (next word)
[b'PREFACE' b'SUPPOSING' b'that' b'Truth'] -----> [b'is']
[b'a' b'woman--what' b'then?' b'Is'] -----> [b'there']
[b'not' b'ground' b'for' b'suspecting'] -----> [b'that']

5. RESHAPE X DATASET

Input (X) is currently a vector of separate word strings, but we need to join the words into a single string so that we can vectorize it properly.

Below is a python function for iterating the given tensor to join all the strings into a single string:

def convert_string(X: tf.Tensor):
    str1 = ""
    for ele in X:
        str1 += ele.numpy().decode("utf-8") + " "
    str1 = tf.convert_to_tensor(str1[:-1])
    return str1

We apply the convert_string function to every element of X_train_ds_raw.

Note that to use a Python function as a mapping function, you need to wrap it with tf.py_function(). For more details, please see the documentation here

X_train_ds_raw = X_train_ds_raw.map(
    lambda x: tf.py_function(func=convert_string, inp=[x], Tout=tf.string))

After transformation, we have samples in X_train_ds_raw as desired:

print("Input X  (sequence) \t\t    ----->\t Output y (next word)")
for elem1, elem2 in zip(X_train_ds_raw.take(5), y_train_ds_raw.take(5)):
    print(elem1.numpy(), "\t\t----->\t", elem2.numpy())
Input X (sequence) -----> Output y (next word)
b'PREFACE SUPPOSING that Truth' -----> [b'is']
b'a woman--what then? Is' -----> [b'there']
b'not ground for suspecting' -----> [b'that']
b'all philosophers, in so' -----> [b'far']
b'as they have been' -----> [b'dogmatists,']

However, the shape of X is unknown:

print(X_train_ds_raw.element_spec, y_train_ds_raw.element_spec)
TensorSpec(shape=<unknown>, dtype=tf.string, name=None) TensorSpec(shape=(None,), dtype=tf.string, name=None)

To fix this, we can explicitly set the shape with another transformation:

X_train_ds_raw=X_train_ds_raw.map(lambda x: tf.reshape(x,[1]))

Now, the shape is set as expected:

X_train_ds_raw.element_spec, y_train_ds_raw.element_spec
(TensorSpec(shape=(1,), dtype=tf.string, name=None),
 TensorSpec(shape=(None,), dtype=tf.string, name=None))

6. PREPROCESS THE TEXT

We need to process these datasets before feeding them into a model.

Here, we will use the Keras preprocessing layer “TextVectorization”.

Why do we use Keras preprocessing layer?

Because:

  • The Keras preprocessing layers API allows developers to build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.
  • With Keras preprocessing layers, we can build and export truly end-to-end models: models that accept raw text, raw images, or raw structured data as input; models that handle feature normalization or feature value indexing on their own.

In the next part, we will create the end-to-end Text Generation model and we will see the benefits of using Keras preprocessing layers.
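
As a small preview (a hypothetical sketch, not the model we will build in the next part), a TextVectorization layer such as the vectorize_layer we adapt below can be placed directly inside a Keras model, so that the exported model accepts raw strings:

from tensorflow.keras import layers

# Hypothetical end-to-end wrapper: vectorize_layer and max_features are
# defined and adapted in the next steps; the layer sizes here are arbitrary.
raw_input = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorize_layer(raw_input)                   # raw string -> integer ids
x = layers.Embedding(max_features, 64)(x)        # arbitrary embedding size
x = layers.LSTM(64)(x)                           # any sequence model would do here
next_word_probs = layers.Dense(max_features, activation="softmax")(x)
end_to_end_model = tf.keras.Model(raw_input, next_word_probs)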

What are the preprocessing steps?

The processing of each sample contains the following steps:

  • standardize each sample (usually lowercasing + punctuation stripping): in this tutorial, we will write a custom standardization function to show how to strip unwanted characters and symbols with your own code.
  • split each sample into substrings (usually words): since we have already split the text into fixed-size word sequences, we do not need a custom split function.
  • recombine substrings into tokens (usually n-grams): we will leave this at 1-grams (single words).
  • index tokens (associate a unique integer value with each token).
  • transform each sample using this index, either into a vector of integers or a dense float vector.

Prepare a custom standardization function

  • Below is our custom standardization function; we will rely on TextVectorization’s default split behavior.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    stripped_num = tf.strings.regex_replace(stripped_html, r"[\d-]", " ")
    stripped_punc = tf.strings.regex_replace(
        stripped_num, "[%s]" % re.escape(string.punctuation), "")
    return stripped_punc

Set the text vectorization parameters

  • We can limit the number of distinct words by setting max_features
  • We set an explicit sequence_length, since our model needs fixed-size input sequences.
max_features = 54762    # Maximum vocabulary size (an upper bound on the number of distinct words)
sequence_length = input_sequence_size # Input sequence size
batch_size = 128 # Batch size

Create the text vectorization layer

  • The text vectorization layer is initialized below.
  • We are using this layer to normalize, split, and map strings to integers, so we set our ‘output_mode’ to ‘int’.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    # split --> DEFAULT: split each sample into substrings (usually words)
    output_mode="int",
    output_sequence_length=sequence_length,
)

Adapt the Text Vectorization layer to the training dataset

Now that the Text Vectorization layer has been created, we can call adapt on a text-only dataset to create the vocabulary with indexing.

You don’t have to batch, but for very large datasets this means you’re not keeping spare copies of the dataset in memory.

vectorize_layer.adapt(raw_data_ds.batch(batch_size))

We can take a look at the size of the vocabulary

print("The size of the vocabulary (number of distinct words): ", len(vectorize_layer.get_vocabulary()))The size of the vocabulary (number of distinct words):  9903

Let’s see the first 10 entries in the vocabulary:

print("The first 10 entries: ", vectorize_layer.get_vocabulary()[:10])
The first 10 entries:  ['', '[UNK]', 'the', 'of', 'and', 'to', 'in', 'is', 'a', 'that']

You can access the vocabulary by using an index:

vectorize_layer.get_vocabulary()[3]
'of'

After preparing the Text Vectorization layer, we need a helper function to convert a given raw text to a Tensor by using this layer:

def vectorize_text(text):
    text = tf.expand_dims(text, -1)
    return tf.squeeze(vectorize_layer(text))

A simple test of the function:
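
For example (the sample phrase below is arbitrary, so the exact integer ids depend on the adapted vocabulary):

print(vectorize_text(tf.constant("supposing that truth is")))   # a tensor of shape (4,) holding the four word ids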

Apply the Text Vectorization onto X and y datasets

Notice that we have a vector of strings before the text vectorization

for elem in X_train_ds_raw.take(3):
    print("X: ", elem.numpy())
X: [b'PREFACE SUPPOSING that Truth']
X: [b'a woman--what then? Is']
X: [b'not ground for suspecting']
# Vectorize the data.
X_train_ds = X_train_ds_raw.map(vectorize_text)
y_train_ds = y_train_ds_raw.map(vectorize_text)

After the text vectorization, we have a vector of integers

for elem in X_train_ds.take(3):
    print("X: ", elem.numpy())
X: [4041 576 9 119]
X: [ 8 147 41 143]
X: [ 15 1083 12 5783]

Convert y to a single-word representation

Notice that even though we want y to be a single word, after the text vectorization it becomes a vector of integers as well!

We need to fix this.

for elem in y_train_ds.take(2):
    print("shape: ", elem.shape, "\n next_char: ", elem.numpy())
shape: (4,)
next_char: [7 0 0 0]
shape: (4,)
next_char: [40 0 0 0]

We can solve this by selecting only the first element of the vector.

y_train_ds=y_train_ds.map(lambda x: x[:1])

Now, we have y as expected:

for elem in y_train_ds.take(2):
    print("shape: ", elem.shape, "\n next_char: ", elem.numpy())
shape: (1,)
next_char: [7]
shape: (1,)
next_char: [40]

Let’s see an example pair:

for (X, y) in zip(X_train_ds.take(5), y_train_ds.take(5)):
    print(X.numpy(), "-->", y.numpy())
[4041 576 9 119] --> [7]
[ 8 147 41 143] --> [40]
[ 15 1083 12 5783] --> [9]
[ 18 160 6 38] --> [121]
[11 30 27 59] --> [2543]

7. FINALIZE THE DATA PIPELINE

Join the input (X) and output (y) values as a single dataset

We can now zip these two sets together as a single training dataset.

However, notice that after the transformation, tensor shapes become unknown!

X_train_ds_raw.element_spec, y_train_ds_raw.element_spec
(TensorSpec(shape=(1,), dtype=tf.string, name=None),
 TensorSpec(shape=(None,), dtype=tf.string, name=None))
train_ds = tf.data.Dataset.zip((X_train_ds, y_train_ds))
train_ds.element_spec
(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=<unknown>, dtype=tf.int64, name=None))

To fix this we can apply another transformation to set the shapes explicitly.

You can find more explanation about this issue here

def _fixup_shape(X, y):
    X.set_shape([4])  # 4 = input_sequence_size
    y.set_shape([1])  # a single next word
    return X, y

Let’s check the shapes of the samples after fixing:

train_ds=train_ds.map(_fixup_shape)
train_ds.element_spec
(TensorSpec(shape=(4,), dtype=tf.int64, name=None),
TensorSpec(shape=(1,), dtype=tf.int64, name=None))

Everything seems O.K. :)

for el in train_ds.take(5):
    print(el)
(<tf.Tensor: shape=(4,), dtype=int64, numpy=array([4041, 576, 9, 119])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([7])>)
(<tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 8, 147, 41, 143])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([40])>)
(<tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 15, 1083, 12, 5783])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([9])>)
(<tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 18, 160, 6, 38])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([121])>)
(<tf.Tensor: shape=(4,), dtype=int64, numpy=array([11, 30, 27, 59])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([2543])>)

Set data pipeline optimizations

Do async prefetching / buffering of the data for best performance on GPU

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.shuffle(buffer_size=512).batch(batch_size, drop_remainder=True).cache().prefetch(buffer_size=AUTOTUNE)

Check the shape of the dataset (in batches):

train_ds.element_spec
(TensorSpec(shape=(128, 4), dtype=tf.int64, name=None),
 TensorSpec(shape=(128, 1), dtype=tf.int64, name=None))
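
To get a rough feel for the pipeline’s throughput (a simple timing loop rather than a rigorous benchmark; the numbers will vary with your hardware), you can iterate over it once:

import time

start = time.time()
for _ in train_ds:   # one full pass over the shuffled, batched, cached pipeline
    pass
print("One pass took", round(time.time() - start, 2), "seconds")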

8. COMPLETE CODE

Here is the complete code:
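
Below is a consolidated sketch assembled from the snippets above (it assumes nietzsche.txt is available locally and the imports shown in the setup cell at the start of Section 2):

# 1. Load the corpus line by line
raw_data_ds = tf.data.TextLineDataset(["nietzsche.txt"])

# 2. Split the lines into word tokens and flatten into a dataset of words
raw_data_ds = raw_data_ds.map(lambda x: tf.strings.split(x))
raw_data_ds = raw_data_ds.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))

# 3. Build (X, y) pairs: 4 input words and 1 target word per window
input_sequence_size = 4
sequence_data_ds = raw_data_ds.window(input_sequence_size + 1, drop_remainder=True)
sequence_data_ds = sequence_data_ds.flat_map(lambda w: w.batch(input_sequence_size + 1))
sequence_data_ds = sequence_data_ds.map(lambda w: (w[:-1], w[-1:]))
X_train_ds_raw = sequence_data_ds.map(lambda X, y: X)
y_train_ds_raw = sequence_data_ds.map(lambda X, y: y)

# 4. Join each input window into a single string and fix its shape
def convert_string(X):
    joined = ""
    for ele in X:
        joined += ele.numpy().decode("utf-8") + " "
    return tf.convert_to_tensor(joined[:-1])

X_train_ds_raw = X_train_ds_raw.map(
    lambda x: tf.py_function(func=convert_string, inp=[x], Tout=tf.string))
X_train_ds_raw = X_train_ds_raw.map(lambda x: tf.reshape(x, [1]))

# 5. Vectorize X and y with an adapted TextVectorization layer
max_features = 54762
sequence_length = input_sequence_size
batch_size = 128

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    stripped_num = tf.strings.regex_replace(stripped_html, r"[\d-]", " ")
    return tf.strings.regex_replace(
        stripped_num, "[%s]" % re.escape(string.punctuation), "")

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)
vectorize_layer.adapt(raw_data_ds.batch(batch_size))

def vectorize_text(text):
    return tf.squeeze(vectorize_layer(tf.expand_dims(text, -1)))

X_train_ds = X_train_ds_raw.map(vectorize_text)
y_train_ds = y_train_ds_raw.map(vectorize_text).map(lambda y: y[:1])

# 6. Zip, fix the shapes, and optimize the pipeline
def _fixup_shape(X, y):
    X.set_shape([input_sequence_size])
    y.set_shape([1])
    return X, y

train_ds = tf.data.Dataset.zip((X_train_ds, y_train_ds)).map(_fixup_shape)
AUTOTUNE = tf.data.AUTOTUNE
train_ds = (train_ds.shuffle(buffer_size=512)
            .batch(batch_size, drop_remainder=True)
            .cache()
            .prefetch(buffer_size=AUTOTUNE))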

CONCLUSION:

In this tutorial, we applied the following steps to create a TensorFlow data pipeline for word-level text generation:

  • download a corpus
  • split the text into words
  • prepare input X and output y tuples
  • vectorize the text by using the Keras preprocessing layer “TextVectorization”
  • optimize the data pipeline by batching, prefetching, and caching.

In the next parts, we will see

  • Part D: Recurrent Neural Network (LSTM) Model for Character Level Text Generation
  • Part E: Encoder-Decoder Model for Character Level Text Generation
  • Part F: Recurrent Neural Network (LSTM) Model for Word-Level Text Generation
  • Part G: Encoder-Decoder Model for Word-Level Text Generation

Comments or Questions?

Please share your Comments or Questions.

Thank you in advance.

Do not forget to check out the next parts!

Take care!

References

What is a Data Pipeline?

tf.data: Build TensorFlow input pipelines

Text classification from scratch

Working with Keras preprocessing layers

Character-level text generation with LSTM

Toward Controlled Generation of Text

Attention Is All You Need

Talk to Transformer

What is the difference between word-based and char-based text generation RNNs?

The survey: Text generation models in deep learning

Generative Adversarial Networks for Text Generation

FGGAN: Feature-Guiding Generative Adversarial Networks for Text Generation

How to sample from language models

How to generate text: using different decoding methods for language generation with Transformers

Hierarchical Neural Story Generation

You can follow me on these social networks:

YouTube

Facebook

Instagram

LinkedIn

Github

Kaggle

muratkarakaya.net