Part D: Preprocessing Text with TF Data Pipeline and Keras Text Vectorization Layer

Multi-Topic Text Classification with Various Deep Learning Models

Author: Murat Karakaya
Date created: 17 09 2021
Date published: 13 03 2022
Last modified: 23 06 2022

Description: This is Part D of the tutorial series “Multi-Topic Text Classification with Various Deep Learning Models”, which covers all the phases of text classification:

  • Exploratory Data Analysis (EDA),
  • Text preprocessing
  • TF Data Pipeline
  • Keras TextVectorization preprocessing layer
  • Multi-class (multi-topic) text classification
  • Deep Learning model design & end-to-end model implementation
  • Performance evaluation & metrics
  • Generating classification report
  • Hyper-parameter tuning
  • etc.

We will design various Deep Learning models by using

  • the Keras Embedding layer,
  • Convolutional (Conv1D) layer,
  • Recurrent (LSTM) layer,
  • Transformer Encoder block, and
  • pre-trained transformer (BERT).

We will cover all the topics related to solving Multi-Class Text Classification problems with sample implementations in Python / TensorFlow / Keras environment.

We will use a Kaggle Dataset in which there are 32 topics and more than 400K total reviews.

If you would like to learn more about Deep Learning with practical coding examples,

  • Please subscribe to the Murat Karakaya Akademi YouTube Channel or
  • Do not forget to turn on notifications so that you will be notified when new parts are uploaded.
  • Follow my blog at muratkarakaya.net

You can access all the codes, videos, and posts of this tutorial series from the links below.


PARTS

In this tutorial series, there are several parts covering Text Classification with various Deep Learning Models. You can access all the parts from this index page.



Photo by 苏 静斋 on Unsplash

PART D: PREPROCESSING TEXT WITH TF DATA PIPELINE AND KERAS TEXT VECTORIZATION LAYER

You can watch this part in English or Turkish on YouTube.


Below, we will first create a TensorFlow Data Pipeline to preprocess the data with the Keras TextVectorization layer.

A data pipeline for a text model mostly involves extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.
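To make these three steps concrete, here is a tiny, self-contained sketch (toy reviews and a toy vocabulary, not the tutorial's dataset or its TextVectorization layer) that extracts tokens, maps them to integer ids with a StringLookup table, and pads batches of different lengths:

import tensorflow as tf

# Two toy reviews (hypothetical examples, not taken from the Kaggle dataset)
texts = ["kargom hala gelmedi", "fatura cok yuksek geldi"]
ds = tf.data.Dataset.from_tensor_slices(texts)

# 1) extract tokens from the raw text
ds = ds.map(tf.strings.split)

# 2) convert tokens to integer ids with a lookup table
lookup = tf.keras.layers.StringLookup(
    vocabulary=["kargom", "hala", "gelmedi", "fatura", "cok", "yuksek", "geldi"])
ds = ds.map(lookup)

# 3) batch together sequences of different lengths (padded with 0)
ds = ds.padded_batch(batch_size=2)

for batch in ds:
    print(batch)  # shape (2, 4): padded to the longest review in the batch

In the rest of this part, the Keras TextVectorization layer will handle the tokenization and lookup steps for us.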


Build the Train TensorFlow Datasets

Observe that we have reviews in the text as input and categories (topics) in integers as target values:

train_features.values[:5]

array(['Sürat Kargo Problemi,"92110952412235 numaralı kargom 10 gündür İzmir aktarma merkezinde bekliyor. Kargom ile ilgili herhangi bir haber, bilgi bulunmamaktadır. Hala hareket görmedi. Arıyorum kimse ilgilenmiyor ilgilenmeyi bırakın cevap vermiyorlar. Acil olarak geri dönüş bekliyorum.Devamını oku"',
'Garanti BBVA Havale Sorunu Oldu,Garanti internet bankacılığı ile eşimin Garanti hesabına beş gün ara ile para transferi yaptım. Dekontları mevcut. Fakat karşı tarafın hesabında gözükmüyor 18 gün oldu müşteri temsilcisine bağlanamıyoruz sürekli bant yayını. Bant yayınıyla bu sorunu nasıl çözeceğiz.Devamını oku',
'Ösym Sınav Yeri Değişikliği,Binlerce insan bu sorundan muzdarip. Hastalık dolayısıyla sınav yeri tercihi yaptığı illerde olmayan insanlar var. ÖSYM bu sorunla ilgilenmiyormuş gibi bir tavır takınıyor. Sınav yeri tercihi yaparken müneccim olmadığımız için hastalığı öngöremedik ama şu an ki duruma bakılırsa hastalığın faturası b...Devamını oku',
'Hileli Yağ Yudum Egem\'den,"Yudum Egem\'den sızma zeytiyağını markasına güvenerek çok miktarda almıştım. Halen elimde 20 litre civarında sızma yağ var.',
"Danone Hayat İçecek Kampanyalı Suyu Getirmiyor!,Vodafone Danone Hayat İçecek kampanyası ile 1 Nisan'da söylemiş olduğumuz su henüz gelmedi. Nasıl bir hizmet anlayışı bu. Bu durumdan hiç memnun değiliz. Bir an önce çözüme kavuşturulsun. Ya da bu işten vazgeçin. İnsanları boşuna mağdur etmeyin. Yerine getiremeyeceğiniz sözler vermeyin.Devamını oku"],
dtype=object)
time: 9.31 ms (started: 2022-03-01 12:16:15 +00:00)

train_targets.values[:5]

array([17, 10, 16, 11, 14], dtype=int8)
time: 8.99 ms (started: 2022-03-01 12:16:15 +00:00)

We create 2 TF Datasets from the raw Train Dataframe for further processing:

  1. for input (text/reviews)
  2. for target (categories/topics)
# this is the input (text/reviews) dataset
train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_features.values, tf.string)
)
time: 4.11 s (started: 2022-03-01 12:16:15 +00:00)

# this is the target (categories/topics) dataset
train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_targets.values, tf.int64),
)
time: 125 ms (started: 2022-03-01 12:16:19 +00:00)

Decide the dictionary size and the review size

For preprocessing the text, we need to decide the dictionary (vocab) size and the maximum review (text) size.

As we observed in Part B, 75% of all the reviews are shorter than 50 words; thus, I set the maximum review length to max_review_size = 50 words.

For the dictionary size (vocab_size), we observed that, in the raw dataset, we have more than 431K words.

That is too much!

Therefore, I opted to use 100K words as the vocab_size.

Of course, you can try different max_len and vocab_size sizes depending on your dataset, hardware, and the Deep Learning model. Actually, you need to tune these kinds of hyper-parameters to achieve better performance.
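If you would like to derive these numbers from the data yourself, here is a quick sketch (assuming train_features is the pandas Series of raw review texts used above) of how the 75th-percentile review length and the number of unique words could be estimated:

import numpy as np

# word count per review (train_features: pandas Series of raw review strings)
review_lengths = train_features.str.split().str.len()
print("75th percentile of review length (words):", np.percentile(review_lengths, 75))

# rough count of unique words over the raw training reviews
unique_words = set()
for review in train_features:
    unique_words.update(str(review).lower().split())
print("number of unique words:", len(unique_words))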

vocab_size = 100000  # Only consider the top 100K words
max_len = max_review_size # Max review size in words
time: 1 ms (started: 2022-03-01 12:16:19 +00:00)

Prepare the Keras Text Vectorization layer

To preprocess the text, I will use the Keras TextVectorization layer. There are many advantages to using the Keras Preprocessing Layers.

The Keras Preprocessing Layers API allows developers to build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel, which you will see at the end of this tutorial as well.

With Keras Preprocessing Layers, you can build and export models that are truly end-to-end: models that accept raw images or raw structured data as input; models that handle feature normalization or feature value indexing on their own.

There are many preprocessing Keras layers. For text preprocessing, I will use the TextVectorization layer during this tutorial.
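As a quick preview of what such an end-to-end model looks like, the sketch below (with a placeholder classification head, not one of the tutorial's models) places the vectorize_layer that we will build in the next subsections directly in front of a Keras model, so the model consumes raw strings:

# Placeholder sketch: raw strings go in because the TextVectorization layer
# (vectorize_layer, built below) is the first layer of the model.
end_to_end_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,                                  # raw text -> integer ids
    tf.keras.layers.Embedding(vocab_size, 64),        # ids -> dense vectors
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="softmax"),  # 32 topics
])
end_to_end_model.predict([["Kargom 10 gündür gelmedi"]])  # accepts raw text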

Custom Standardization

As the first step of text preprocessing, we will standardize the text by using the below function.

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    no_stars = tf.strings.regex_replace(no_uppercased, "\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, "\w*\d\w*", "")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    # remove stop words (optional, left commented out)
    # no_stop_words = ' ' + no_punctuations + ' '
    # for each in tr_stop_words.values:
    #     no_stop_words = tf.strings.regex_replace(no_stop_words, ' ' + each[0] + ' ', r" ")
    no_extra_space = tf.strings.regex_replace(no_punctuations, " +", " ")
    # replace Turkish-specific characters with ASCII equivalents
    no_I = tf.strings.regex_replace(no_extra_space, "ı", "i")
    no_O = tf.strings.regex_replace(no_I, "ö", "o")
    no_C = tf.strings.regex_replace(no_O, "ç", "c")
    no_S = tf.strings.regex_replace(no_C, "ş", "s")
    no_G = tf.strings.regex_replace(no_S, "ğ", "g")
    no_U = tf.strings.regex_replace(no_G, "ü", "u")
    return no_U
time: 17.8 ms (started: 2022-03-01 12:16:19 +00:00)

Quickly verify that custom_standardization works: try it on a sample Turkish input:

input_string = "Bu Issız Öğlenleyin de;  şunu ***1 Pijamalı Hasta***, ve  Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku"
print("input: ", input_string)
output_string= custom_standardization(input_string)
print("output: ", output_string.numpy().decode("utf-8"))
input: Bu Issız Öğlenleyin de; şunu ***1 Pijamalı Hasta***, ve Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku
output: bu issiz oglenleyin de sunu pijamali hasta ve ancak i̇ste yagiz sofore cabucak guvendi
time: 15.9 ms (started: 2022-03-01 12:16:19 +00:00)

Build a TextVectorization layer

Let’s build our TextVectorization layer:

# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",  # tf-idf / int / binary / count
    output_sequence_length=max_len,
)
time: 866 ms (started: 2022-03-01 12:16:19 +00:00)

Note: Here, I opt for converting a string sequence into an integer sequence by setting output_mode="int". You can also try other encoding (representation) methods such as tf-idf, binary, or count. I leave these options to you as an exercise :)
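For example, a count-based encoding could be configured as in the sketch below (part of the exercise above; note that with count or tf_idf every review becomes one fixed-size, vocabulary-sized vector instead of an integer sequence, so output_sequence_length is not used):

# Sketch of the exercise: a count-based encoding of the reviews.
# (Depending on your TF version, the binary mode is named "binary" or "multi_hot".)
count_vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="count",   # also try "tf_idf"
)
count_vectorize_layer.adapt(train_text_ds_raw)
counts = count_vectorize_layer(["kargom hala gelmedi kargom nerede"])
print(counts.shape)        # (1, vocabulary size): per-review token counts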

Adapt the TextVectorization layer

The TextVectorization preprocessing layer has an internal state that can be computed based on a sample of the training data. That is, TextVectorization holds a mapping between string tokens and integer indices.

Thus, we will adapt the TextVectorization preprocessing layer ONLY on the training data.

Please note that, to prevent data leakage, we DO NOT adapt the TextVectorization preprocessing layer on the whole (train & test) data.

vectorize_layer.adapt(train_text_ds_raw)
vocab = vectorize_layer.get_vocabulary()
# To get words back from token indices
time: 17.5 s (started: 2022-03-01 12:16:20 +00:00)

Check the dictionary (vocab) and preprocessing

Let’s see some example conversions:

print("vocab has the ", len(vocab)," entries")
print("vocab has the following first 10 entries")
for word in range(10):
print(word, " represents the word: ", vocab[word])
print("2 sample text preprocessing:")
for X in train_text_ds_raw.take(2):
print(" Given raw data: " )
print(X.numpy().decode("utf-8") )
tokenized = vectorize_layer(tf.expand_dims(X, -1))
print(" Tokenized and Transformed to a vector of integers: " )
print (tokenized)
print(" Text after Tokenized and Transformed: ")
transformed = ""
for each in tf.squeeze(tokenized):
transformed= transformed+ " "+ vocab[each]
print(transformed)
vocab has the 35898 entries
vocab has the following first 10 entries
0 represents the word:
1 represents the word: [UNK]
2 represents the word: bir
3 represents the word: ve
4 represents the word: bu
5 represents the word: icin
6 represents the word: de
7 represents the word: da
8 represents the word: tl
9 represents the word: ama
2 sample text preprocessing:
Given raw data:
Sürat Kargo Problemi,"92110952412235 numaralı kargom 10 gündür İzmir aktarma merkezinde bekliyor. Kargom ile ilgili herhangi bir haber, bilgi bulunmamaktadır. Hala hareket görmedi. Arıyorum kimse ilgilenmiyor ilgilenmeyi bırakın cevap vermiyorlar. Acil olarak geri dönüş bekliyorum.Devamını oku"
Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]], shape=(1, 40), dtype=int64)
Text after Tokenized and Transformed:
surat kargo problemi numarali kargom gundur i̇zmir aktarma merkezinde bekliyor kargom ile ilgili herhangi bir haber bilgi bulunmamaktadir hala hareket gormedi ariyorum kimse ilgilenmiyor ilgilenmeyi birakin cevap vermiyorlar acil olarak geri donus bekliyorum
Given raw data:
Garanti BBVA Havale Sorunu Oldu,Garanti internet bankacılığı ile eşimin Garanti hesabına beş gün ara ile para transferi yaptım. Dekontları mevcut. Fakat karşı tarafın hesabında gözükmüyor 18 gün oldu müşteri temsilcisine bağlanamıyoruz sürekli bant yayını. Bant yayınıyla bu sorunu nasıl çözeceğiz.Devamını oku
Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 139 1570 2129 38 42 139 134 4958 15 1555 139 3802
924 13 1113 15 39 11594 149 15122 566 36 639 11786
9771 1782 13 42 24 1427 10671 72 3915 8358 3915 17308
4 38 76 7918]], shape=(1, 40), dtype=int64)
Text after Tokenized and Transformed:
garanti bbva havale sorunu oldu garanti internet bankaciligi ile esimin garanti hesabina bes gun ara ile para transferi yaptim dekontlari mevcut fakat karsi tarafin hesabinda gozukmuyor gun oldu musteri temsilcisine baglanamiyoruz surekli bant yayini bant yayiniyla bu sorunu nasil cozecegiz
time: 280 ms (started: 2022-03-01 12:16:37 +00:00)

Save & Load the adopted TextVectorization layer

Since adapting the TextVectorization layer may take considerable time, you may want to save it for future runs. As a simple and straightforward approach, we can save the trained TextVectorization layer by embedding it in a Keras model as follows:

%cd ../models/
%ls
/content/gdrive/MyDrive/Colab Notebooks/models
time: 438 ms (started: 2022-03-01 12:16:38 +00:00)
# Create a model to embed the trained TextVectorization layer
vectorizer_model = tf.keras.models.Sequential()
vectorizer_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
vectorizer_model.add(vectorize_layer)
vectorizer_model.summary()
# Save it
filepath = "vectorize_layer_model"
vectorizer_model.save(filepath, save_format="tf")
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
text_vectorization (TextVec (None, 40) 0
torization)

=================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
INFO:tensorflow:Assets written to: vectorize_layer_model/assets
time: 1.41 s (started: 2022-03-01 12:16:38 +00:00)
%ls
vectorize_layer_model/
time: 238 ms (started: 2022-03-01 12:16:39 +00:00)

Now, let’s load the saved model:

# Load the saved model
loaded_vectorizer_model = tf.keras.models.load_model(filepath)
# Extract the trained TextVectorization layer out of the loaded model
loaded_vectorizer_layer = loaded_vectorizer_model.layers[0]
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
time: 555 ms (started: 2022-03-01 12:16:40 +00:00)

Check the loaded Text Vectorization layer

At this point, you have two options:

  • You can use the loaded model to vectorize the text, or
  • You can use the loaded layer.

Let’s check if the loaded model and the layer have the same vocab and preprocessing behavior.

loaded_vocab = loaded_vectorizer_layer.get_vocabulary()
print("original vocab has the ", len(vocab)," entries")
print("loaded_vectorizer_layer vocab has the ", len(loaded_vocab)," entries")
print("original vocab: ", vocab[:10])
print("loaded vocab : ", loaded_vocab[:10])
original vocab has the 35898 entries
loaded_vectorizer_layer vocab has the 35898 entries
original vocab: ['', '[UNK]', 'bir', 've', 'bu', 'icin', 'de', 'da', 'tl', 'ama']
loaded vocab : ['', '[UNK]', 'bir', 've', 'bu', 'icin', 'de', 'da', 'tl', 'ama']
time: 94.6 ms (started: 2022-03-01 12:16:40 +00:00)

As you see, the vocab is the same as the original one. Let’s check the preprocessing and tokenization:

for X in train_text_ds_raw.take(1):
    print(" Given raw data: ")
    print(X.numpy().decode("utf-8"))

    tokenized = vectorize_layer(tf.expand_dims(X, -1))
    print(" original vectorizer layer: Tokenized and Transformed to a vector of integers: ")
    print(tokenized)

    tokenized = loaded_vectorizer_layer(tf.expand_dims(X, -1))
    print(" loaded_vectorizer_layer: Tokenized and Transformed to a vector of integers: ")
    #print(tokenized.to_tensor(shape=[1, max_review_size]))
    print(tokenized)

    tokenized = loaded_vectorizer_model.predict(tf.expand_dims(X, -1))
    print(" loaded_vectorizer_model: Tokenized and Transformed to a vector of integers: ")
    #print(tokenized.to_tensor(shape=[1, max_review_size]))
    print(tokenized)

    print(" Text after Tokenized and Transformed: ")
    transformed = ""
    for each in tf.squeeze(tokenized):
        transformed = transformed + " " + vocab[each]
    print(transformed)
Given raw data:
Sürat Kargo Problemi,"92110952412235 numaralı kargom 10 gündür İzmir aktarma merkezinde bekliyor. Kargom ile ilgili herhangi bir haber, bilgi bulunmamaktadır. Hala hareket görmedi. Arıyorum kimse ilgilenmiyor ilgilenmeyi bırakın cevap vermiyorlar. Acil olarak geri dönüş bekliyorum.Devamını oku"
original vectorizer layer: Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]], shape=(1, 40), dtype=int64)
loaded_vectorizer_layer: Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]], shape=(1, 40), dtype=int64)
loaded_vectorizer_model: Tokenized and Transformed to a vector of integers:
[[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0]]
Text after Tokenized and Transformed:
surat kargo problemi numarali kargom gundur i̇zmir aktarma merkezinde bekliyor kargom ile ilgili herhangi bir haber bilgi bulunmamaktadir hala hareket gormedi ariyorum kimse ilgilenmiyor ilgilenmeyi birakin cevap vermiyorlar acil olarak geri donus bekliyorum
time: 266 ms (started: 2022-03-01 12:16:40 +00:00)

As you see above, loaded_vectorizer_layer and loaded_vectorizer_model preprocess and tokenize the text exactly the same as the original vectorizer layer.

Preprocess the Train & Test Data by the adopted TextVecorization Layer

First, let’s code a function to preprocess a given review by using the vectorize_layer or loaded_vectorizer_layer.

def prepare_lm_inputs_labels(text):
    text = tf.expand_dims(text, -1)
    return tf.squeeze(vectorize_layer(text))
    #return tf.squeeze(loaded_vectorizer_layer(text).to_tensor(shape=[1, max_review_size]))
time: 3.52 ms (started: 2022-03-01 12:16:41 +00:00)

Process the Train Data

Then, apply this function to every review in the train set:

train_text_ds = train_text_ds_raw.map(prepare_lm_inputs_labels, 
num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 94.9 ms (started: 2022-03-01 12:16:41 +00:00)

Check the output tensor shape and content

train_text_ds.element_spec

TensorSpec(shape=<unknown>, dtype=tf.int64, name=None)
time: 12.7 ms (started: 2022-03-01 12:16:41 +00:00)

for each in train_text_ds.take(1):
    print(each)
tf.Tensor(
[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0], shape=(40,), dtype=int64)
time: 65.6 ms (started: 2022-03-01 12:16:41 +00:00)

We can now create the training dataset by putting together the input (tokenized reviews) and the expected output (the topic/class id) as follows:

train_ds = tf.data.Dataset.zip(
    (train_text_ds,
     train_cat_ds_raw)
)
time: 4.25 ms (started: 2022-03-01 12:16:41 +00:00)

Check the train dataset element specs and content

train_ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))
time: 7.23 ms (started: 2022-03-01 12:16:41 +00:00)

for X, y in train_ds.take(1):
    print("X.shape: ", X.shape, "y.shape: ", y.shape)
    print("X: ", X)
    print("y: ", y)
    input = " ".join([vocab[_] for _ in np.squeeze(X)])
    output = id_to_category[y.numpy()]
    print("input (review as text): ", input)
    print("output (category as text): ", output)
X.shape: (40,) y.shape: ()
X: tf.Tensor(
[ 301 25 398 93 235 108 323 1320 2010 741 235 15
116 226 2 709 121 3297 31 865 14325 187 329 1663
13960 1175 105 1006 292 34 45 73 284 0 0 0
0 0 0 0], shape=(40,), dtype=int64)
y: tf.Tensor(17, shape=(), dtype=int64)
input (review as text): surat kargo problemi numarali kargom gundur i̇zmir aktarma merkezinde bekliyor kargom ile ilgili herhangi bir haber bilgi bulunmamaktadir hala hareket gormedi ariyorum kimse ilgilenmiyor ilgilenmeyi birakin cevap vermiyorlar acil olarak geri donus bekliyorum
output (category as text): kargo-nakliyat
time: 45.5 ms (started: 2022-03-01 12:16:41 +00:00)
# train dataset size
train_size = train_ds.cardinality().numpy()
print("Train size: ", train_size)
Train size: 6080
time: 4.26 ms (started: 2022-03-01 12:16:41 +00:00)

Process the Validation Data

Let’s create the input (reviews) and output (topic/class id) TF Datasets:

val_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(val_features.values, tf.string)
)
time: 10.4 ms (started: 2022-03-01 12:16:41 +00:00)

val_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(val_targets.values, tf.int64),
)
time: 4.08 ms (started: 2022-03-01 12:16:41 +00:00)

Let’s apply the same function prepare_lm_inputs_labels for the reviews in the validation data as follows:

val_text_ds = val_text_ds_raw.map(prepare_lm_inputs_labels, 
num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 86.5 ms (started: 2022-03-01 12:16:41 +00:00)

We can now create the validation dataset by putting together the input (tokenized reviews) and the expected output (the topic/class id) as follows:

val_ds = tf.data.Dataset.zip(
    (val_text_ds,
     val_cat_ds_raw)
)
time: 4.02 ms (started: 2022-03-01 12:16:41 +00:00)

Check the validation dataset element specs and content:

for X, y in val_ds.take(1):
    print("X.shape: ", X.shape, "y.shape: ", y.shape)
    print("X: ", X)
    print("y: ", y)
    input = " ".join([vocab[_] for _ in np.squeeze(X)])
    output = id_to_category[y.numpy()]
    print("input (review as text): ", input)
    print("output (category as text ): ", output)
X.shape: (40,) y.shape: ()
X: tf.Tensor(
[ 567 689 17 129 31933 3452 40 119 69 2464 4 26
665 1236 57 730 9 689 17 129 3452 40 196 2513
1992 21 7 245 964 598 7438 8 40 227 62 4
21 7727 4 26], shape=(40,), dtype=int64)
y: tf.Tensor(8, shape=(), dtype=int64)
input (review as text): i̇gdas faturam cok fazla cuzi miktarda geldi aydir fatura oduyorum bu ay virus dolayisiyla mi bilmiyorum ama faturam cok fazla miktarda geldi kabul etmiyorum sonuna kadar da sikayet edecegim i̇ki kisiyiz tl geldi bize hic bu kadar gelmiyordu bu ay
output (category as text ): enerji
time: 50.6 ms (started: 2022-03-01 12:16:41 +00:00)

Process the Test Data

Let’s create the input (reviews) and output (topic/class id) TF Datasets:

test_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(test_features.values, tf.string)
)
time: 81.6 ms (started: 2022-03-01 12:16:41 +00:00)

test_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(test_targets.values, tf.int64),
)
time: 4.82 ms (started: 2022-03-01 12:16:41 +00:00)

Let’s apply the same function prepare_lm_inputs_labels for the reviews in the test data as follows:

test_text_ds = test_text_ds_raw.map(prepare_lm_inputs_labels, 
num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 64.1 ms (started: 2022-03-01 12:16:41 +00:00)

We can now create the test dataset by putting together the input (tokenized reviews) and the expected output (the topic/class id) as follows:

test_ds = tf.data.Dataset.zip(
    (test_text_ds,
     test_cat_ds_raw)
)
time: 2.56 ms (started: 2022-03-01 12:16:41 +00:00)

Check the test dataset element specs and content:

for X, y in test_ds.take(1):
    print("X.shape: ", X.shape, "y.shape: ", y.shape)
    print("X: ", X)
    print("y: ", y)
    input = " ".join([vocab[_] for _ in np.squeeze(X)])
    output = id_to_category[y.numpy()]
    print("input (review as text): ", input)
    print("output (category as text ): ", output)
X.shape: (40,) y.shape: ()
X: tf.Tensor(
[ 478 814 2874 1137 3782 1199 38 3782 1 17 129 14622
1064 112 861 3 1 3782 269 3 23310 6 82 3782
5359 930 21882 3782 1 2907 1 15 1 0 0 0
0 0 0 0], shape=(40,), dtype=int64)
y: tf.Tensor(5, shape=(), dtype=int64)
input (review as text): egitim bilisim agi eba i̇ngilizce ders sorunu i̇ngilizce [UNK] cok fazla faydalanamiyorum nedeni ise soru ve [UNK] i̇ngilizce olmasi ve ogretmenin de sadece i̇ngilizce konusmasi sizden ricamiz i̇ngilizce [UNK] turkce [UNK] ile [UNK]
output (category as text ): egitim
time: 53.9 ms (started: 2022-03-01 12:16:41 +00:00)
# test dataset size
test_size = test_ds.cardinality().numpy()
print("Test size: ", test_size)
Test size: 84457
time: 3.87 ms (started: 2022-03-01 12:16:41 +00:00)

Finalize the TensorFlow Data Pipeline

We can now configure and optimize the train, validation, and test datasets.

batch_size=64
AUTOTUNE=tf.data.experimental.AUTOTUNE
train_ds=train_ds.shuffle(buffer_size=train_size)
train_ds=train_ds.batch(batch_size=batch_size,drop_remainder=True)
train_ds=train_ds.cache()
train_ds = train_ds.prefetch(AUTOTUNE)
val_ds=val_ds.shuffle(buffer_size=train_size)
val_ds=val_ds.batch(batch_size=batch_size,drop_remainder=True)
val_ds=val_ds.cache()
val_ds = val_ds.prefetch(AUTOTUNE)
test_ds=test_ds.shuffle(buffer_size=train_size)
test_ds=test_ds.batch(batch_size=batch_size,drop_remainder=True)
test_ds=test_ds.cache()
test_ds = test_ds.prefetch(AUTOTUNE)
time: 21.8 ms (started: 2022-03-01 12:16:42 +00:00)

Notice that we have now batches of reviews and topics:

train_ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(64,), dtype=tf.int64, name=None))
time: 6.2 ms (started: 2022-03-01 12:16:42 +00:00)

Let’s take a look at two samples in the first batch:

for X, y in train_ds.take(1):
    print(X.shape, y.shape)
    print("All categories values in this batch: ", y)
    print("\nFirst sample in the batch:")
    print("\tX is: ", X[0])
    print("\ty is: ", y[0].numpy())
    input = " ".join([vocab[_] for _ in np.squeeze(X[0])])
    output = id_to_category[y[0].numpy()]
    print("\tinput (in text): ", input)
    print("\toutput (in category): ", output)
    print("\nSecond sample in the batch:")
    print("\tX is: ", X[1])
    print("\ty is: ", y[1].numpy())
    input = " ".join([vocab[_] for _ in np.squeeze(X[1])])
    output = id_to_category[y[1].numpy()]
    print("\tinput (in text): ", input)
    print("\toutput (in category): ", output)
(64, 40) (64,)
All categories values in this batch: tf.Tensor(
[16 23 14 1 25 20 9 2 18 17 1 6 15 14 13 15 25 13 3 12 9 13 14 31
16 6 17 30 26 17 18 19 25 13 10 30 2 1 23 16 0 13 27 9 25 4 24 20
30 4 9 6 23 22 17 26 1 10 10 19 25 15 27 25], shape=(64,), dtype=int64)
First sample in the batch:
X is: tf.Tensor(
[ 77 451 298 3008 1088 577 655 243 123 7 577 655
243 77 1110 4952 106 1418 129 102 381 11 1216 11
1127 41 709 10 2630 410 3359 10668 6644 1127 55 886
2 1753 1583 721], shape=(40,), dtype=int64)
y is: <bound method _EagerTensorBase.numpy of <tf.Tensor: shape=(), dtype=int64, numpy=16>>
input (in text): e devlet turkiye gov tr pandemi sosyal destek nisan da pandemi sosyal destek e basvurdum basvurumun uzerinden aydan fazla zaman gecti ne olumlu ne olumsuz hicbir haber yok i̇nsanlar buna umut baglayip bekliyorlar olumsuz bile olsa bir yanit verilmesi lazim
output (in category): kamu-hizmetleri
Second sample in the batch:
X is: tf.Tensor(
[10393 1549 26880 3572 4184 456 118 117 3572 16 2 4032
17 10598 17 14014 20824 36 1738 19854 9855 2030 28692 165
32378 3402 30 14309 232 281 313 5797 4133 249 1219 1715
2110 33863 675 156], shape=(40,), dtype=int64)
y is: <bound method _EagerTensorBase.numpy of <tf.Tensor: shape=(), dtype=int64, numpy=23>>
input (in text): ceylin gold i̇mitasyon bileklik kendilerinin magazasindan iki adet bileklik aldim bir tanesini cok begendim cok icime sindi fakat digeri taktikca gozume hos gorunmemeye basladi cevrem sahte gibi gorundugunu soyledi bende degisim yapabilir miyiz dedim kendilerine onlarda magazada begenilip alinan urunun
output (in category): mucevher-saat-gozluk
time: 1.12 s (started: 2022-03-01 12:16:42 +00:00)

Summary

In this part, we have prepared the datasets and taken several actions and decisions:

  • we built a TF data pipeline
  • we configured a Keras TextVectorization layer for text preprocessing and tokenization
  • we adapted the Keras TextVectorization layer on the training dataset
  • we applied the Keras TextVectorization layer to train, validation, and test datasets
  • we finalized the TF data pipeline by configuring it

In the end, we have the train, validation, and test datasets ready to be fed into any ML/DL model.

In the next parts, we will use these datasets for text classification with different Deep Learning models.
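As a small teaser for those parts, the sketch below (a deliberately simple placeholder classifier, not one of the models we will design) shows how these datasets plug directly into Keras training:

# Placeholder model: the datasets yield (integer sequence, topic id) pairs,
# so any Keras classifier ending in a 32-way softmax can consume them as-is.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="softmax"),  # 32 topics
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=3)
model.evaluate(test_ds)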

Do you have any questions or comments? Please share them in the comment section.

Thank you for your attention!