Part D: Preprocessing Text with TF Data Pipeline and Keras Text Vectorization Layer

Multi-Topic Text Classification with Various Deep Learning Models

Author: Murat Karakaya
Date created: 17 09 2021
Date published: 13 03 2022
Last modified: 23 06 2022

Description: This is Part D of the tutorial series “Multi-Topic Text Classification with Various Deep Learning Models”, which covers all the phases of text classification:

  • Exploratory Data Analysis (EDA),
  • Text preprocessing
  • TF Data Pipeline
  • Keras TextVectorization preprocessing layer
  • Multi-class (multi-topic) text classification
  • Deep Learning model design & end-to-end model implementation
  • Performance evaluation & metrics
  • Generating classification report
  • Hyper-parameter tuning
  • etc.

We will design various Deep Learning models by using

  • the Keras Embedding layer,
  • Convolutional (Conv1D) layer,
  • Recurrent (LSTM) layer,
  • Transformer Encoder block, and
  • pre-trained transformer (BERT).

We will cover all the topics related to solving Multi-Class Text Classification problems with sample implementations in Python / TensorFlow / Keras environment.

We will use a Kaggle Dataset in which there are 32 topics and more than 400K total reviews.

If you would like to learn more about Deep Learning with practical coding examples,

  • Please subscribe to the Murat Karakaya Akademi YouTube Channel or
  • Do not forget to turn on notifications so that you will be notified when new parts are uploaded.
  • Follow my blog at muratkarakaya.net

You can access all the codes, videos, and posts of this tutorial series from the links below.


PARTS

In this tutorial series, there are several parts covering Text Classification with various Deep Learning Models. You can access all the parts from this index page.



Photo by 苏 静斋 on Unsplash

PART D: PREPROCESSING TEXT WITH TF DATA PIPELINE AND KERAS TEXT VECTORIZATION LAYER

You can watch this part in English or Turkish on YouTube.


Below, we will first create a TensorFlow Data Pipeline to preprocess the data with the Keras TextVectorization layer.

A data pipeline for a text model mostly involves extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.
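To make these three steps concrete, here is a tiny, self-contained sketch (toy reviews and a toy vocabulary, not the tutorial's dataset or its TextVectorization layer) that extracts tokens, maps them to integer ids with a StringLookup table, and pads batches of different lengths:

import tensorflow as tf

# Two toy reviews (hypothetical examples, not taken from the Kaggle dataset)
texts = ["kargom hala gelmedi", "fatura cok yuksek geldi"]
ds = tf.data.Dataset.from_tensor_slices(texts)

# 1) extract tokens from the raw text
ds = ds.map(tf.strings.split)

# 2) convert tokens to integer ids with a lookup table
lookup = tf.keras.layers.StringLookup(
    vocabulary=["kargom", "hala", "gelmedi", "fatura", "cok", "yuksek", "geldi"])
ds = ds.map(lookup)

# 3) batch together sequences of different lengths (padded with 0)
ds = ds.padded_batch(batch_size=2)

for batch in ds:
    print(batch)  # shape (2, 4): padded to the longest review in the batch

In the rest of this part, the Keras TextVectorization layer will handle the tokenization and lookup steps for us.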


Build the Train TensorFlow Datasets

Observe that we have reviews in the text as input and categories (topics) in integers as target values:

train_features.values[:5]

array(['Sürat Kargo Problemi,"92110952412235 numaralı kargom 10 gündür İzmir aktarma merkezinde bekliyor. Kargom ile ilgili herhangi bir haber, bilgi bulunmamaktadır. Hala hareket görmedi. Arıyorum kimse ilgilenmiyor ilgilenmeyi bırakın cevap vermiyorlar. Acil olarak geri dönüş bekliyorum.Devamını oku"',
'Garanti BBVA Havale Sorunu Oldu,Garanti internet bankacılığı ile eşimin Garanti hesabına beş gün ara ile para transferi yaptım. Dekontları mevcut. Fakat karşı tarafın hesabında gözükmüyor 18 gün oldu müşteri temsilcisine bağlanamıyoruz sürekli bant yayını. Bant yayınıyla bu sorunu nasıl çözeceğiz.Devamını oku',
'Ösym Sınav Yeri Değişikliği,Binlerce insan bu sorundan muzdarip. Hastalık dolayısıyla sınav yeri tercihi yaptığı illerde olmayan insanlar var. ÖSYM bu sorunla ilgilenmiyormuş gibi bir tavır takınıyor. Sınav yeri tercihi yaparken müneccim olmadığımız için hastalığı öngöremedik ama şu an ki duruma bakılırsa hastalığın faturası b...Devamını oku',
'Hileli Yağ Yudum Egem\'den,"Yudum Egem\'den sızma zeytiyağını markasına güvenerek çok miktarda almıştım. Halen elimde 20 litre civarında sızma yağ var.',
"Danone Hayat İçecek Kampanyalı Suyu Getirmiyor!,Vodafone Danone Hayat İçecek kampanyası ile 1 Nisan'da söylemiş olduğumuz su henüz gelmedi. Nasıl bir hizmet anlayışı bu. Bu durumdan hiç memnun değiliz. Bir an önce çözüme kavuşturulsun. Ya da bu işten vazgeçin. İnsanları boşuna mağdur etmeyin. Yerine getiremeyeceğiniz sözler vermeyin.Devamını oku"],
dtype=object)
time: 9.31 ms (started: 2022-03-01 12:16:15 +00:00)

train_targets.values[:5]

array([17, 10, 16, 11, 14], dtype=int8)
time: 8.99 ms (started: 2022-03-01 12:16:15 +00:00)

We create 2 TF Datasets from the raw Train Dataframe for further processing:

  1. for input (text/reviews)
  2. for target (categories/topics)
# this is the input (text/reviews) dataset
train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_features.values, tf.string)
)
time: 4.11 s (started: 2022-03-01 12:16:15 +00:00)

# this is the target (categories/topics) dataset
train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_targets.values, tf.int64),
)
time: 125 ms (started: 2022-03-01 12:16:19 +00:00)

Decide the dictionary size and the review size

For preprocessing the text, we need to decide the dictionary (vocab) size and the maximum review (text) size.

As we observed in Part B, 75% of all the reviews are shorter than 50 words; thus, I set the maximum review length to max_review_size = 50 words.

For the dictionary size (vocab_size), we observed that, in the raw dataset, we have more than 431K words.

That is too much!

Therefore, I opted to use 100K words as the vocab_size.

Of course, you can try different max_len and vocab_size sizes depending on your dataset, hardware, and the Deep Learning model. Actually, you need to tune these kinds of hyper-parameters to achieve better performance.
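If you would like to derive these numbers from the data yourself, here is a quick sketch (assuming train_features is the pandas Series of raw review texts used above) of how the 75th-percentile review length and the number of unique words could be estimated:

import numpy as np

# word count per review (train_features: pandas Series of raw review strings)
review_lengths = train_features.str.split().str.len()
print("75th percentile of review length (words):", np.percentile(review_lengths, 75))

# rough count of unique words over the raw training reviews
unique_words = set()
for review in train_features:
    unique_words.update(str(review).lower().split())
print("number of unique words:", len(unique_words))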

vocab_size = 100000  # Only consider the top 100K words
max_len = max_review_size # Max review size in words
time: 1 ms (started: 2022-03-01 12:16:19 +00:00)

Prepare the Keras Text Vectorization layer

To preprocess the text, I will use the Keras TextVectorization layer. There are many advantages to using the Keras Preprocessing Layers.

The Keras Preprocessing Layers API allows developers to build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel, which you will see at the end of this tutorial as well.

With Keras Preprocessing Layers, you can build and export models that are truly end-to-end: models that accept raw images or raw structured data as input; models that handle feature normalization or feature value indexing on their own.

There are many preprocessing Keras layers. For text preprocessing, I will use the TextVectorization layer during this tutorial.
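As a quick preview of what such an end-to-end model looks like, the sketch below (with a placeholder classification head, not one of the tutorial's models) places the vectorize_layer that we will build in the next subsections directly in front of a Keras model, so the model consumes raw strings:

# Placeholder sketch: raw strings go in because the TextVectorization layer
# (vectorize_layer, built below) is the first layer of the model.
end_to_end_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,                                  # raw text -> integer ids
    tf.keras.layers.Embedding(vocab_size, 64),        # ids -> dense vectors
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="softmax"),  # 32 topics
])
end_to_end_model.predict([["Kargom 10 gündür gelmedi"]])  # accepts raw text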

Custom Standardization

As the first step of text preprocessing, we will standardize the text by using the below function.

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    no_stars = tf.strings.regex_replace(no_uppercased, "\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, "\w*\d\w*", "")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    # remove stop words (optional, left commented out)
    # no_stop_words = ' ' + no_punctuations + ' '
    # for each in tr_stop_words.values:
    #     no_stop_words = tf.strings.regex_replace(no_stop_words, ' ' + each[0] + ' ', r" ")
    no_extra_space = tf.strings.regex_replace(no_punctuations, " +", " ")
    # replace Turkish-specific characters with ASCII equivalents
    no_I = tf.strings.regex_replace(no_extra_space, "ı", "i")
    no_O = tf.strings.regex_replace(no_I, "ö", "o")
    no_C = tf.strings.regex_replace(no_O, "ç", "c")
    no_S = tf.strings.regex_replace(no_C, "ş", "s")
    no_G = tf.strings.regex_replace(no_S, "ğ", "g")
    no_U = tf.strings.regex_replace(no_G, "ü", "u")
    return no_U
time: 17.8 ms (started: 2022-03-01 12:16:19 +00:00)

Quickly verify that custom_standardization works: try it on a sample Turkish input:

input_string = "Bu Issız Öğlenleyin de;  şunu ***1 Pijamalı Hasta***, ve  Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku"
print("input: ", input_string)
output_string= custom_standardization(input_string)
print("output: ", output_string.numpy().decode("utf-8"))
input: Bu Issız Öğlenleyin de; şunu ***1 Pijamalı Hasta***, ve Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku
output: bu issiz oglenleyin de sunu pijamali hasta ve ancak i̇ste yagiz sofore cabucak guvendi
time: 15.9 ms (started: 2022-03-01 12:16:19 +00:00)

Build a TextVectorization layer

Let’s build our TextVectorization layer:

# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",  # tf-idf / int / binary / count
    output_sequence_length=max_len,
)
time: 866 ms (started: 2022-03-01 12:16:19 +00:00)

Note: Here, I opt for converting a string sequence into an integer sequence by setting output_mode="int". You can also try other encoding (representation) methods such as tf-idf, binary, or count. I leave these options to you as an exercise :)
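For example, a count-based encoding could be configured as in the sketch below (part of the exercise above; note that with count or tf_idf every review becomes one fixed-size, vocabulary-sized vector instead of an integer sequence, so output_sequence_length is not used):

# Sketch of the exercise: a count-based encoding of the reviews.
# (Depending on your TF version, the binary mode is named "binary" or "multi_hot".)
count_vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="count",   # also try "tf_idf"
)
count_vectorize_layer.adapt(train_text_ds_raw)
counts = count_vectorize_layer(["kargom hala gelmedi kargom nerede"])
print(counts.shape)        # (1, vocabulary size): per-review token counts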

Adapt the TextVectorization layer

The TextVectorization preprocessing layer has an internal state that can be computed based on a sample of the training data. That is, TextVectorization holds a mapping between string tokens and integer indices.

Thus, we will adapt the TextVectorization preprocessing layer ONLY on the training data.

Please note that, to prevent data leakage, we DO NOT adapt the TextVectorization preprocessing layer on the whole (train & test) data.

vectorize_layer.adapt(train_text_ds_raw)
vocab = vectorize_layer.get_vocabulary()
# To get words back from token indices
time: 17.5 s (started: 2022-03-01 12:16:20 +00:00)

Check the dictionary (vocab) and preprocessing

Let’s see some example conversions:

print("vocab has the ", len(vocab)," entries")
print("vocab has the following first 10 entries")
for word in range(10):
print(word, " represents the word: ", vocab[word])
print("2 sample text preprocessing:")
for X in train_text_ds_raw.take(2):
print(" Given raw data: " )
print(X.numpy().decode("utf-8") )
tokenized = vectorize_layer(tf.expand_dims(X, -1))
print(" Tokenized and Transformed to a vector of integers: " )
print (tokenized)
print(" Text after Tokenized and Transformed: ")
transformed = ""
for each in tf.squeeze(tokenized):
transformed= transformed+ " "+ vocab[each]
print(transformed)
vocab has the 35898 entries
vocab has the following first 10 entries
0 represents the word:
1 represents the word: [UNK]
2 represents the word: bir
3 represents the word: ve
4 represents the word: bu
5 represents the word: icin
6 represents the word: de
7 represents the word: da
8 represents the word: tl
9 represents the word: ama
2 sample text preprocessing:
Given raw data:
Sürat Kargo Problemi,"92110952412235 numaralı kargom 10 gündür İzmir aktarma merkezinde bekliyor. Kargom ile ilgili herhangi bir haber, bilgi bulunmamaktadır. Hala hareket görmedi. Arıyorum kimse ilgilenmiyor ilgilenmeyi bırakın cevap vermiyorlar. Acil olarak geri dönüş bekliyorum.Devamını oku"
Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]], shape=(1, 40), dtype=int64)
Text after Tokenized and Transformed:
surat kargo problemi numarali kargom gundur i̇zmir aktarma merkezinde bekliyor kargom ile ilgili herhangi bir haber bilgi bulunmamaktadir hala hareket gormedi ariyorum kimse ilgilenmiyor ilgilenmeyi birakin cevap vermiyorlar acil olarak geri donus bekliyorum
Given raw data:
Garanti BBVA Havale Sorunu Oldu,Garanti internet bankacılığı ile eşimin Garanti hesabına beş gün ara ile para transferi yaptım. Dekontları mevcut. Fakat karşı tarafın hesabında gözükmüyor 18 gün oldu müşteri temsilcisine bağlanamıyoruz sürekli bant yayını. Bant yayınıyla bu sorunu nasıl çözeceğiz.Devamını oku
Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 139 1570 2129 38 42 139 134 4958 15 1555 139 3802
924 13 1113 15 39 11594 149 15122 566 36 639 11786
9771 1782 13 42 24 1427 10671 72 3915 8358 3915 17308
4 38 76 7918]], shape=(1, 40), dtype=int64)
Text after Tokenized and Transformed:
garanti bbva havale sorunu oldu garanti internet bankaciligi ile esimin garanti hesabina bes gun ara ile para transferi yaptim dekontlari mevcut fakat karsi tarafin hesabinda gozukmuyor gun oldu musteri temsilcisine baglanamiyoruz surekli bant yayini bant yayiniyla bu sorunu nasil cozecegiz
time: 280 ms (started: 2022-03-01 12:16:37 +00:00)

Save & Load the adopted TextVectorization layer

Since adapting the TextVectorization layer may take considerable time, you may want to save it for future runs. As a simple and straightforward approach, we can save the trained TextVectorization layer by embedding it in a Keras model as follows:

%cd ../models/
%ls
/content/gdrive/MyDrive/Colab Notebooks/models
time: 438 ms (started: 2022-03-01 12:16:38 +00:00)
# Create a model to embed the trained TextVectorization layer
vectorizer_model = tf.keras.models.Sequential()
vectorizer_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
vectorizer_model.add(vectorize_layer)
vectorizer_model.summary()
# Save it
filepath = "vectorize_layer_model"
vectorizer_model.save(filepath, save_format="tf")
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
text_vectorization (TextVec (None, 40) 0
torization)

=================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
INFO:tensorflow:Assets written to: vectorize_layer_model/assets
time: 1.41 s (started: 2022-03-01 12:16:38 +00:00)
%ls
vectorize_layer_model/
time: 238 ms (started: 2022-03-01 12:16:39 +00:00)

Now, let’s load the saved model:

# Load the saved model
loaded_vectorizer_model = tf.keras.models.load_model(filepath)
# Extract the trained TextVectorization layer out of the loaded model
loaded_vectorizer_layer = loaded_vectorizer_model.layers[0]
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
time: 555 ms (started: 2022-03-01 12:16:40 +00:00)

Check the loaded Text Vectorization layer

At this point, you have two options:

  • You can use the loaded model to vectorize the text, or
  • You can use the loaded layer.

Let’s check if the loaded model and the layer have the same vocab and preprocessing behavior.

loaded_vocab = loaded_vectorizer_layer.get_vocabulary()
print("original vocab has the ", len(vocab)," entries")
print("loaded_vectorizer_layer vocab has the ", len(loaded_vocab)," entries")
print("original vocab: ", vocab[:10])
print("loaded vocab : ", loaded_vocab[:10])
original vocab has the 35898 entries
loaded_vectorizer_layer vocab has the 35898 entries
original vocab: ['', '[UNK]', 'bir', 've', 'bu', 'icin', 'de', 'da', 'tl', 'ama']
loaded vocab : ['', '[UNK]', 'bir', 've', 'bu', 'icin', 'de', 'da', 'tl', 'ama']
time: 94.6 ms (started: 2022-03-01 12:16:40 +00:00)

As you see, the vocab is the same as the original one. Let’s check the preprocessing and tokenization:

for X in train_text_ds_raw.take(1):
    print(" Given raw data: ")
    print(X.numpy().decode("utf-8"))

    tokenized = vectorize_layer(tf.expand_dims(X, -1))
    print(" original vectorizer layer: Tokenized and Transformed to a vector of integers: ")
    print(tokenized)

    tokenized = loaded_vectorizer_layer(tf.expand_dims(X, -1))
    print(" loaded_vectorizer_layer: Tokenized and Transformed to a vector of integers: ")
    #print(tokenized.to_tensor(shape=[1, max_review_size]))
    print(tokenized)

    tokenized = loaded_vectorizer_model.predict(tf.expand_dims(X, -1))
    print(" loaded_vectorizer_model: Tokenized and Transformed to a vector of integers: ")
    #print(tokenized.to_tensor(shape=[1, max_review_size]))
    print(tokenized)

    print(" Text after Tokenized and Transformed: ")
    transformed = ""
    for each in tf.squeeze(tokenized):
        transformed = transformed + " " + vocab[each]
    print(transformed)
Given raw data:
Sürat Kargo Problemi,"92110952412235 numaralı kargom 10 gündür İzmir aktarma merkezinde bekliyor. Kargom ile ilgili herhangi bir haber, bilgi bulunmamaktadır. Hala hareket görmedi. Arıyorum kimse ilgilenmiyor ilgilenmeyi bırakın cevap vermiyorlar. Acil olarak geri dönüş bekliyorum.Devamını oku"
original vectorizer layer: Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]], shape=(1, 40), dtype=int64)
loaded_vectorizer_layer: Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]], shape=(1, 40), dtype=int64)
loaded_vectorizer_model: Tokenized and Transformed to a vector of integers:
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]]
Text after Tokenized and Transformed:
surat kargo problemi numarali kargom gundur i̇zmir aktarma merkezinde bekliyor kargom ile ilgili herhangi bir haber bilgi bulunmamaktadir hala hareket gormedi ariyorum kimse ilgilenmiyor ilgilenmeyi birakin cevap vermiyorlar acil olarak geri donus bekliyorum
time: 266 ms (started: 2022-03-01 12:16:40 +00:00)

As you see above, loaded_vectorizer_layer and loaded_vectorizer_model preprocess and tokenize the text exactly the same as the original vectorizer layer.

Preprocess the Train & Test Data by the adopted TextVecorization Layer

First, let’s code a function to preprocess a given review by using the vectorize_layer or loaded_vectorizer_layer.

def prepare_lm_inputs_labels(text):
    text = tf.expand_dims(text, -1)
    return tf.squeeze(vectorize_layer(text))
    #return tf.squeeze(loaded_vectorizer_layer(text).to_tensor(shape=[1, max_review_size]))
time: 3.52 ms (started: 2022-03-01 12:16:41 +00:00)

Process the Train Data

Then, apply this function to every review in the train set:

train_text_ds = train_text_ds_raw.map(prepare_lm_inputs_labels, 
num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 94.9 ms (started: 2022-03-01 12:16:41 +00:00)

Check the output tensor shape and content

train_text_ds.element_spec

TensorSpec(shape=<unknown>, dtype=tf.int64, name=None)
time: 12.7 ms (started: 2022-03-01 12:16:41 +00:00)

for each in train_text_ds.take(1):
    print(each)
tf.Tensor(
[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0], shape=(40,), dtype=int64)
time: 65.6 ms (started: 2022-03-01 12:16:41 +00:00)

We can now create the training dataset by putting together the input (tokenized reviews) and the expected output (the topic/class id) as follows:

train_ds = tf.data.Dataset.zip(
    (train_text_ds,
     train_cat_ds_raw)
)
time: 4.25 ms (started: 2022-03-01 12:16:41 +00:00)

Check the train dataset element specs and content

train_ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))
time: 7.23 ms (started: 2022-03-01 12:16:41 +00:00)

for X, y in train_ds.take(1):
    print("X.shape: ", X.shape, "y.shape: ", y.shape)
    print("X: ", X)
    print("y: ", y)
    input = " ".join([vocab[_] for _ in np.squeeze(X)])
    output = id_to_category[y.numpy()]
    print("input (review as text): ", input)
    print("output (category as text): ", output)
X.shape: (40,) y.shape: ()
X: tf.Tensor(
[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0], shape=(40,), dtype=int64)
y: tf.Tensor(17, shape=(), dtype=int64)
input (review as text): surat kargo problemi numarali kargom gundur i̇zmir aktarma merkezinde bekliyor kargom ile ilgili herhangi bir haber bilgi bulunmamaktadir hala hareket gormedi ariyorum kimse ilgilenmiyor ilgilenmeyi birakin cevap vermiyorlar acil olarak geri donus bekliyorum
output (category as text): kargo-nakliyat
time: 45.5 ms (started: 2022-03-01 12:16:41 +00:00)
# train dataset size
train_size = train_ds.cardinality().numpy()
print("Train size: ", train_size)
Train size: 6080
time: 4.26 ms (started: 2022-03-01 12:16:41 +00:00)

Process the Validation Data

Let’s create the input (reviews) and output (topic/class id) TF Datasets:

val_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(val_features.values, tf.string)
)
time: 10.4 ms (started: 2022-03-01 12:16:41 +00:00)

val_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(val_targets.values, tf.int64),
)
time: 4.08 ms (started: 2022-03-01 12:16:41 +00:00)

Let’s apply the same function prepare_lm_inputs_labels for the reviews in the validation data as follows:

val_text_ds = val_text_ds_raw.map(prepare_lm_inputs_labels, 
num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 86.5 ms (started: 2022-03-01 12:16:41 +00:00)

We can now create the validation dataset by putting together the input (tokenized reviews) and the expected output (the topic/class id) as follows:

val_ds = tf.data.Dataset.zip(
    (val_text_ds,
     val_cat_ds_raw)
)
time: 4.02 ms (started: 2022-03-01 12:16:41 +00:00)

Check the validation dataset element specs and content:

for X, y in val_ds.take(1):
    print("X.shape: ", X.shape, "y.shape: ", y.shape)
    print("X: ", X)
    print("y: ", y)
    input = " ".join([vocab[_] for _ in np.squeeze(X)])
    output = id_to_category[y.numpy()]
    print("input (review as text): ", input)
    print("output (category as text ): ", output)
X.shape: (40,) y.shape: ()
X: tf.Tensor(
[ 567 689 17 129 31933 3452 40 119 69 2464 4 26
665 1236 57 730 9 689 17 129 3452 40 196 2513
1992 21 7 245 964 598 7438 8 40 227 62 4
21 7727 4 26], shape=(40,), dtype=int64)
y: tf.Tensor(8, shape=(), dtype=int64)
input (review as text): i̇gdas faturam cok fazla cuzi miktarda geldi aydir fatura oduyorum bu ay virus dolayisiyla mi bilmiyorum ama faturam cok fazla miktarda geldi kabul etmiyorum sonuna kadar da sikayet edecegim i̇ki kisiyiz tl geldi bize hic bu kadar gelmiyordu bu ay
output (category as text ): enerji
time: 50.6 ms (started: 2022-03-01 12:16:41 +00:00)

Process the Test Data

Let’s create the input (reviews) and output (topic/class id) TF Datasets:

test_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(test_features.values, tf.string)
)
time: 81.6 ms (started: 2022-03-01 12:16:41 +00:00)

test_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(test_targets.values, tf.int64),
)
time: 4.82 ms (started: 2022-03-01 12:16:41 +00:00)

Let’s apply the same function prepare_lm_inputs_labels for the reviews in the test data as follows:

test_text_ds = test_text_ds_raw.map(prepare_lm_inputs_labels, 
num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 64.1 ms (started: 2022-03-01 12:16:41 +00:00)

We can now create the test dataset by putting together the input (tokenized reviews) and the expected output (the topic/class id) as follows:

test_ds = tf.data.Dataset.zip(
    (test_text_ds,
     test_cat_ds_raw)
)
time: 2.56 ms (started: 2022-03-01 12:16:41 +00:00)

Check the test dataset element specs and content:

for X, y in test_ds.take(1):
    print("X.shape: ", X.shape, "y.shape: ", y.shape)
    print("X: ", X)
    print("y: ", y)
    input = " ".join([vocab[_] for _ in np.squeeze(X)])
    output = id_to_category[y.numpy()]
    print("input (review as text): ", input)
    print("output (category as text ): ", output)
X.shape: (40,) y.shape: ()
X: tf.Tensor(
[ 478 814 2874 1137 3782 1199 38 3782 1 17 129 14622
1064 112 861 3 1 3782 269 3 23310 6 82 3782
5359 930 21882 3782 1 2907 1 15 1 0 0 0
0 0 0 0], shape=(40,), dtype=int64)
y: tf.Tensor(5, shape=(), dtype=int64)
input (review as text): egitim bilisim agi eba i̇ngilizce ders sorunu i̇ngilizce [UNK] cok fazla faydalanamiyorum nedeni ise soru ve [UNK] i̇ngilizce olmasi ve ogretmenin de sadece i̇ngilizce konusmasi sizden ricamiz i̇ngilizce [UNK] turkce [UNK] ile [UNK]
output (category as text ): egitim
time: 53.9 ms (started: 2022-03-01 12:16:41 +00:00)
# test dataset size
test_size = test_ds.cardinality().numpy()
print("Test size: ", test_size)
Test size: 84457
time: 3.87 ms (started: 2022-03-01 12:16:41 +00:00)

Finalize the TensorFlow Data Pipeline

We can now configure and optimize the train, validation, and test datasets.

batch_size=64
AUTOTUNE=tf.data.experimental.AUTOTUNE
train_ds=train_ds.shuffle(buffer_size=train_size)
train_ds=train_ds.batch(batch_size=batch_size,drop_remainder=True)
train_ds=train_ds.cache()
train_ds = train_ds.prefetch(AUTOTUNE)
val_ds=val_ds.shuffle(buffer_size=train_size)
val_ds=val_ds.batch(batch_size=batch_size,drop_remainder=True)
val_ds=val_ds.cache()
val_ds = val_ds.prefetch(AUTOTUNE)
test_ds=test_ds.shuffle(buffer_size=train_size)
test_ds=test_ds.batch(batch_size=batch_size,drop_remainder=True)
test_ds=test_ds.cache()
test_ds = test_ds.prefetch(AUTOTUNE)
time: 21.8 ms (started: 2022-03-01 12:16:42 +00:00)

Notice that we have now batches of reviews and topics:

train_ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(64,), dtype=tf.int64, name=None))
time: 6.2 ms (started: 2022-03-01 12:16:42 +00:00)

Let’s take a look at two samples in the first batch:

for X, y in train_ds.take(1):
    print(X.shape, y.shape)
    print("All categories values in this batch: ", y)
    print("\nFirst sample in the batch:")
    print("\tX is: ", X[0])
    print("\ty is: ", y[0].numpy())
    input = " ".join([vocab[_] for _ in np.squeeze(X[0])])
    output = id_to_category[y[0].numpy()]
    print("\tinput (in text): ", input)
    print("\toutput (in category): ", output)
    print("\nSecond sample in the batch:")
    print("\tX is: ", X[1])
    print("\ty is: ", y[1].numpy())
    input = " ".join([vocab[_] for _ in np.squeeze(X[1])])
    output = id_to_category[y[1].numpy()]
    print("\tinput (in text): ", input)
    print("\toutput (in category): ", output)
(64, 40) (64,)
All categories values in this batch: tf.Tensor(
[16 23 14 1 25 20 9 2 18 17 1 6 15 14 13 15 25 13 3 12 9 13 14 31
16 6 17 30 26 17 18 19 25 13 10 30 2 1 23 16 0 13 27 9 25 4 24 20
30 4 9 6 23 22 17 26 1 10 10 19 25 15 27 25], shape=(64,), dtype=int64)
First sample in the batch:
X is: tf.Tensor(
[ 77 451 298 3008 1088 577 655 243 123 7 577 655
243 77 1110 4952 106 1418 129 102 381 11 1216 11
1127 41 709 10 2630 410 3359 10668 6644 1127 55 886
2 1753 1583 721], shape=(40,), dtype=int64)
y is: <bound method _EagerTensorBase.numpy of <tf.Tensor: shape=(), dtype=int64, numpy=16>>
input (in text): e devlet turkiye gov tr pandemi sosyal destek nisan da pandemi sosyal destek e basvurdum basvurumun uzerinden aydan fazla zaman gecti ne olumlu ne olumsuz hicbir haber yok i̇nsanlar buna umut baglayip bekliyorlar olumsuz bile olsa bir yanit verilmesi lazim
output (in category): kamu-hizmetleri
Second sample in the batch:
X is: tf.Tensor(
[10393 1549 26880 3572 4184 456 118 117 3572 16 2 4032
17 10598 17 14014 20824 36 1738 19854 9855 2030 28692 165
32378 3402 30 14309 232 281 313 5797 4133 249 1219 1715
2110 33863 675 156], shape=(40,), dtype=int64)
y is: <bound method _EagerTensorBase.numpy of <tf.Tensor: shape=(), dtype=int64, numpy=23>>
input (in text): ceylin gold i̇mitasyon bileklik kendilerinin magazasindan iki adet bileklik aldim bir tanesini cok begendim cok icime sindi fakat digeri taktikca gozume hos gorunmemeye basladi cevrem sahte gibi gorundugunu soyledi bende degisim yapabilir miyiz dedim kendilerine onlarda magazada begenilip alinan urunun
output (in category): mucevher-saat-gozluk
time: 1.12 s (started: 2022-03-01 12:16:42 +00:00)

Summary

In this part, we have prepared the datasets and taken several actions and decisions:

  • we built a TF data pipeline
  • we configured a Keras TextVectorization layer for text preprocessing and tokenization
  • we adapted the Keras TextVectorization layer on the training dataset
  • we applied the Keras TextVectorization layer to train, validation, and test datasets
  • we finalized the TF data pipeline by configuring it

In the end, we have the train, validation, and test datasets ready to be fed into any ML/DL model.

In the next parts, we will use these datasets for text classification with different Deep Learning models.
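As a small teaser for those parts, the sketch below (a deliberately simple placeholder classifier, not one of the models we will design) shows how these datasets plug directly into Keras training:

# Placeholder model: the datasets yield (integer sequence, topic id) pairs,
# so any Keras classifier ending in a 32-way softmax can consume them as-is.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="softmax"),  # 32 topics
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=3)
model.evaluate(test_ds)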

Do you have any questions or comments? Please share them in the comment section.

Thank you for your attention!