Keras Text Vectorization Layer: Configure, Adapt, Use, Save, Load, and Deploy

Author: Murat Karakaya
Date created: 05 Oct 2021
Last modified: 18 March 2023
Description: This is a tutorial about how to build, adapt, use, save, load, and deploy the Keras TextVectorization layer. You can access this tutorial on YouTube in English and Turkish. “TensorFlow Keras Text Vectorization Katmanı” / “TensorFlow Keras Text Vectorization Layer”.

In this tutorial, we will download a Kaggle Dataset in which there are 32 topics and more than 400K total reviews. We will use this dataset for a multi-class text classification task.

Our main aim is to learn how to effectively use the Keras TextVectorization layer in Text Processing and Text Classification.

The tutorial has 5 parts:

PART A: BACKGROUND
PART B: KNOW THE DATA
PART C: USE KERAS TEXT VECTORIZATION LAYER
PART D: BUILD AN END-TO-END MODEL
PART E: DEPLOY END-TO-END MODEL TO HUGGINGFACE SPACES USING GRADIO
SUMMARY

By the end of this tutorial, we will have covered:

What a Keras TextVectorization layer is
Why we need to use a Keras TextVectorization layer in Natural Language Processing (NLP) tasks
How to employ a Keras TextVectorization layer in Text Preprocessing
How to integrate a Keras TextVectorization layer into a trained model
How to save and load a Keras TextVectorization layer and a model with a Keras TextVectorization layer
How to integrate a Keras TextVectorization layer with TensorFlow Data Pipeline API (tf.data)
How to design, train, save, and load an End-to-End model using Keras TextVectorization layer
How to deploy the End-to-End model with a Keras TextVectorization layer implemented with a custom standardize (custom_standardization) function using the Gradio library and the HuggingFace Spaces

PART A: BACKGROUND

You can watch this part on YouTube in Turkish or English.

1. TERMINOLOGY & CONCEPTS

1.1. What is Text Vectorization?

Text Vectorization is the process of converting text into a numerical representation.

There are many different techniques proposed to convert text to a numerical form such as:

One-hot Encoding (OHE)
Count Vectorizer
Bag-of-Words (BOW)
N-grams
Term Frequency
Term Frequency-Inverse Document Frequency (TF-IDF)
Embedding
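
To make a few of these techniques concrete, here is a minimal sketch in plain Python. The toy corpus and the helper names (one_hot, bag_of_words, term_frequency) are made up for illustration and are not part of the tutorial's pipeline.

```python
from collections import Counter

# Toy corpus (hypothetical); each document is already lowercased.
docs = ["the cat sat", "the cat saw the dog"]

# Build a sorted vocabulary and a token -> index lookup table.
vocab = sorted({token for doc in docs for token in doc.split()})
index = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    """One-hot encoding: a vector with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[index[token]] = 1
    return vector

def bag_of_words(doc):
    """Count vector (bag-of-words): how often each vocabulary token occurs."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]

def term_frequency(doc):
    """Term frequency: counts normalized by the document length."""
    counts = bag_of_words(doc)
    total = sum(counts)
    return [count / total for count in counts]

print(vocab)                  # ['cat', 'dog', 'sat', 'saw', 'the']
print(one_hot("dog"))         # [0, 1, 0, 0, 0]
print(bag_of_words(docs[1]))  # [1, 1, 0, 1, 2]
print(term_frequency(docs[1]))
```

All three representations share the same vocabulary index; they differ only in what value is stored at each position.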

1.2. What is Text Preprocessing?

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more suitable form so that Machine Learning or Deep Learning algorithms can perform better.

The main phases of Text preprocessing:

Noise Removal (cleaning) — Removing unnecessary characters and formatting
Tokenization — break multi-word strings into smaller components
Normalization — a catch-all term for processing data; this includes stemming and lemmatization

Some of the common Noise Removal (cleaning) steps are:

Removal of Punctuations
Removal of Frequent words
Removal of Rare words
Removal of emojis
Removal of emoticons
Conversion of emoticons to words
Conversion of emojis to words
Removal of URLs
Removal of HTML tags
Chat words conversion
Spelling correction

Tokenization is about splitting strings of text into smaller pieces, or “tokens”. Paragraphs can be tokenized into sentences and sentences can be tokenized into words.

Noise Removal and Tokenization are staples of almost all text preprocessing pipelines. However, some data may require further processing through text normalization. Some of the common normalization steps are:

Upper or lowercasing
Stopword removal
Stemming — bluntly removing prefixes and suffixes from a word
Lemmatization — replacing a single-word token with its root
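
The three phases above can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the stop-word set, the regular expressions, and the function names are hypothetical, and stemming and lemmatization are left out.

```python
import re
import string

# Hypothetical stop-word list, just for this example.
STOP_WORDS = {"a", "the", "is", "are"}

def remove_noise(text):
    """Noise removal: drop HTML tags, URLs, and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)               # punctuation -> spaces

def tokenize(text):
    """Tokenization: split the cleaned string on whitespace."""
    return text.split()

def normalize(tokens):
    """Normalization: lowercase and remove stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]

raw = "The delivery is <b>LATE</b>! See https://example.com for details."
tokens = normalize(tokenize(remove_noise(raw)))
print(tokens)  # ['delivery', 'late', 'see', 'for', 'details']
```

The order matters here: URLs are removed before punctuation, otherwise the URL would be broken into several meaningless tokens.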

1.3. What is Keras Text Vectorization layer?

The tf.keras.layers.TextVectorization layer is one of the Keras Preprocessing layers.

We can preprocess the input by using different libraries, such as Python's built-in string utilities or scikit-learn.

However, there are very important advantages to using the Keras Preprocessing layers:

You can build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.
You can build and export models that are truly end-to-end: models that accept raw data (images or raw structured data) as input; models that handle feature normalization or feature value indexing on their own.

Today, we will deal with the tf.keras.layers.TextVectorization layer which:

turns raw strings into an encoded representation that can be read by an Embedding layer or a Dense layer.

That is, the tf.keras.layers.TextVectorization layer can be used in

Text Preprocessing and
Text Vectorization
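
As a quick illustration of both roles, the sketch below builds a layer from a tiny hand-written vocabulary (a made-up one, not the vocabulary used later in this tutorial) and applies it to two raw strings. With the default settings the layer lowercases the text, strips punctuation, splits on whitespace, and maps each token to an integer id.

```python
import tensorflow as tf

# A tiny, hand-written vocabulary (hypothetical). In "int" mode, index 0 is
# reserved for padding and index 1 for out-of-vocabulary (OOV) tokens.
layer = tf.keras.layers.TextVectorization(
    output_mode="int",
    vocabulary=["keras", "text", "layer"],
)

# Text preprocessing and text vectorization happen in a single call.
encoded = layer(tf.constant(["Keras TEXT layer!", "unknown text"]))
print(encoded.numpy())
# [[2 3 4]
#  [1 3 0]]
```

The second row shows both special ids: "unknown" is not in the vocabulary, so it becomes 1, and the shorter sentence is padded with 0.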

2. IMPORT LIBRARIES

IMPORTANT: When I prepared this tutorial on 05 Oct 2021, the then-current version (2.6.0) of TF and Keras generated errors when saving and loading the tf.keras.layers.TextVectorization layer.

However, the nightly version has no problem handling these operations.

For more information about the bug, please see here

import tensorflow as tf

from tensorflow import keras

print("tf version:",tf.__version__)

print("keras version:", keras.__version__)

tf version: 2.6.0

keras version: 2.6.0

Therefore, below I first install the TF nightly version.

tf version: 2.8.0-dev20211005
keras version: 2.7.0

pip install tf-nightly --quiet --upgrade

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import re
import string
import random
from sklearn.model_selection import train_test_split

print("tf version:",tf.__version__)
print("keras version:", keras.__version__)

tf version: 2.8.0-dev20211008
keras version: 2.8.0

#@title Record Each Cell's Execution Time
!pip install ipython-autotime

%load_ext autotime

3. DOWNLOAD A KAGGLE DATASET INTO GOOGLE COLAB

The Multi-Class Classification Dataset for Turkish is a benchmark dataset for the Turkish text classification task.

It contains 430K comments/reviews for a total of 32 categories of products or services.

Each category roughly has 13K comments.

A baseline algorithm, Naive Bayes, achieves an 84% F1 score.

My blog post explaining how to download Kaggle Datasets is here.

My video tutorial explaining how to download Kaggle Datasets is here: Turkish/English

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive
time: 3min 42s (started: 2021-10-08 14:36:24 +00:00)

os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/Colab Notebooks/input"

time: 2.88 ms (started: 2021-10-08 14:40:07 +00:00)

#changing the working directory
%cd "/content/gdrive/MyDrive/Colab Notebooks/input"

/content/gdrive/MyDrive/Colab Notebooks/input
time: 1.29 s (started: 2021-10-08 14:40:07 +00:00)

#get the api command from kaggle dataset page
#!kaggle datasets download -d savasy/multiclass-classification-data-for-turkish-tc32

time: 649 µs (started: 2021-10-08 14:40:08 +00:00)

# check the downloaded zip file
!ls

120001_PH1.csv	generatedReviews.csv	    kaggle.json        tr_stop_word.txt
320d.csv	generatedReviews_final.csv  model.png	       vocabPickle
corona.csv	generatedReviews_plus.csv   ticaret-yorum.csv
time: 328 ms (started: 2021-10-08 14:40:08 +00:00)

# unzipping the zip files and deleting the zip files
!unzip \*.zip  && rm *.zip

unzip:  cannot find or open *.zip, *.zip.zip or *.zip.ZIP.

No zipfiles found.
time: 131 ms (started: 2021-10-08 14:40:09 +00:00)

# check the downloaded csv file
!ls

120001_PH1.csv	generatedReviews.csv	    kaggle.json        tr_stop_word.txt
320d.csv	generatedReviews_final.csv  model.png	       vocabPickle
corona.csv	generatedReviews_plus.csv   ticaret-yorum.csv
time: 125 ms (started: 2021-10-08 14:40:09 +00:00)

4. LOAD STOP WORDS IN TURKISH

As you might know, “stop words” are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. Stop words are commonly removed in Text Mining and Natural Language Processing (NLP) because they are so common that they carry very little useful information.

I begin by loading an existing list of stop words in Turkish below:

tr_stop_words = pd.read_csv('tr_stop_word.txt',header=None)
for each in tr_stop_words.values[:5]:
  print(each[0])

ama
amma
anca
ancak
bu
time: 302 ms (started: 2021-10-08 14:40:09 +00:00)

5. LOAD THE DATASET

After downloading the dataset from the Kaggle website, we can load it by using the Pandas library read_csv() function:

data = pd.read_csv('ticaret-yorum.csv')
pd.set_option('max_colwidth', 400)

time: 5.97 s (started: 2021-10-08 14:40:09 +00:00)

PART B: KNOW THE DATA

You can watch this part on YouTube in Turkish or English.

6. EXPLORE THE DATASET

Before getting into the details of how to use the tf.keras.layers.TextVectorization layer, let me introduce the dataset briefly.

Shuffle Data

It is a good habit to shuffle the data as the very first preprocessing step, before doing anything else!

Actually, I will shuffle the data at the last step of the pipeline. But it does not hurt shuffling it twice :))

data = data.sample(frac=1)

time: 103 ms (started: 2021-10-08 14:40:15 +00:00)

Summary Information about the dataset

Get the initial information about the dataset:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 431306 entries, 60837 to 242258
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   category  431306 non-null  object
 1   text      431306 non-null  object
dtypes: object(2)
memory usage: 9.9+ MB
time: 112 ms (started: 2021-10-08 14:40:15 +00:00)

We have a total of 431306 rows and 2 columns: category & text.

According to data.info(), there are no null values in the dataset. If there were any, we could drop them and then verify that none remain as follows:

data.dropna(inplace=True)

data.isnull().sum()

Sample Reviews and their categories:

data.head()

time: 19.6 ms (started: 2021-10-08 14:40:15 +00:00)

7. CREATE A TENSORFLOW DATA PIPELINE FOR TEXT PREPROCESSING & VECTORIZATION

So far, we have just observed some properties of the raw data. Using these observations, we are ready to preprocess the text data for a classifier model.

Below, we will begin to create a TensorFlow data pipeline that includes the Keras Text Vectorization layer for preprocessing the data and preparing it for a classifier.

A pipeline for a text model mostly involves extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.

In this tutorial, I will use the TensorFlow “tf.data” API. If you are not familiar with the TF data pipeline “tf.data” API, you can consult the resources below:

Official TensorFlow blog: tf.data: Build TensorFlow input pipelines
The Murat Karakaya Akademi YouTube playlist in Turkish: tf.data: TensorFlow Data Pipeline Anlamak ve Kullanmak
The Murat Karakaya Akademi YouTube playlist in English: TensorFlow Data Pipeline: How to Design Code Use TensorFlow Data Pipelines with Python & Keras
The Murat Karakaya Akademi Medium blog: tf.data: Tensorflow Data Pipelines

Convert Categories From Strings to Integer Ids

Observe that the categories (topics/classes) of the reviews are strings:

data["category"]

60837         cep-telefon-kategori
218953             kamu-hizmetleri
325173           mutfak-arac-gerec
188348                      icecek
183962              hizmet-sektoru
                    ...           
21408                   anne-bebek
152087                        gida
130392    etkinlik-ve-organizasyon
51513                   bilgisayar
242258              kargo-nakliyat
Name: category, Length: 431306, dtype: object



time: 8.4 ms (started: 2021-10-08 14:40:15 +00:00)

We need to create integer category ids from string category names by adding a new column to the data frame “category_id”:

data["category"] = data["category"].astype('category')
data["category_id"] = data["category"].cat.codes
data.head()

time: 86 ms (started: 2021-10-08 14:40:15 +00:00)

Lastly, we can check the number of categories. Note that it should be 32:

data['category']

60837         cep-telefon-kategori
218953             kamu-hizmetleri
325173           mutfak-arac-gerec
188348                      icecek
183962              hizmet-sektoru
                    ...           
21408                   anne-bebek
152087                        gida
130392    etkinlik-ve-organizasyon
51513                   bilgisayar
242258              kargo-nakliyat
Name: category, Length: 431306, dtype: category
Categories (32, object): ['alisveris', 'anne-bebek', 'beyaz-esya', 'bilgisayar', ..., 'spor',
                          'temizlik', 'turizm', 'ulasim']



time: 14.2 ms (started: 2021-10-08 14:40:15 +00:00)

Build a Dictionary for id to text category (topic) look-up:

id_to_category = pd.Series(data.category.values,index=data.category_id).to_dict()
id_to_category

{0: 'alisveris',
 1: 'anne-bebek',
 2: 'beyaz-esya',
 3: 'bilgisayar',
 4: 'cep-telefon-kategori',
 5: 'egitim',
 6: 'elektronik',
 7: 'emlak-ve-insaat',
 8: 'enerji',
 9: 'etkinlik-ve-organizasyon',
 10: 'finans',
 11: 'gida',
 12: 'giyim',
 13: 'hizmet-sektoru',
 14: 'icecek',
 15: 'internet',
 16: 'kamu-hizmetleri',
 17: 'kargo-nakliyat',
 18: 'kisisel-bakim-ve-kozmetik',
 19: 'kucuk-ev-aletleri',
 20: 'medya',
 21: 'mekan-ve-eglence',
 22: 'mobilya-ev-tekstili',
 23: 'mucevher-saat-gozluk',
 24: 'mutfak-arac-gerec',
 25: 'otomotiv',
 26: 'saglik',
 27: 'sigortacilik',
 28: 'spor',
 29: 'temizlik',
 30: 'turizm',
 31: 'ulasim'}



time: 74 ms (started: 2021-10-08 14:40:16 +00:00)
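
Because cat.codes are simply positions in cat.categories, the same lookup can also be derived from the category list itself, without passing over every row. The snippet below is a sketch on a toy frame; the toy values and the extra category_to_id mapping are assumptions added for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({"category": ["gida", "spor", "gida", "enerji"]})
toy["category"] = toy["category"].astype("category")
toy["category_id"] = toy["category"].cat.codes

# cat.codes are positions in cat.categories, so enumerating the categories
# gives the same id -> name mapping without scanning every row.
id_to_category = dict(enumerate(toy["category"].cat.categories))
category_to_id = {name: idx for idx, name in id_to_category.items()}

print(id_to_category)               # {0: 'enerji', 1: 'gida', 2: 'spor'}
print(toy["category_id"].tolist())  # [1, 2, 1, 0]
```

The reverse mapping (category_to_id) is handy when you need to encode labels for new data in exactly the same way.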

Reduce the Size of the Dataset

Since testing your pipeline on a large dataset takes more time, you may prefer to work with a portion of the raw dataset, as below:

#limit the number of samples to be used in testing the pipeline
#data_size= 1000 #instead of 431306 
#data= data[:data_size]
#data.info()

time: 1.55 ms (started: 2021-10-08 14:40:16 +00:00)

Split the Raw Dataset into Train and Test Datasets

To prevent data leakage during preprocessing the text data, we need to split the text into Train and Test data sets.

Data leakage refers to a mistake made by the creator of a machine learning model in which they accidentally share information between the test and training data sets. Typically, when splitting a data set into testing and training sets, the goal is to ensure that no data is shared between the two. This is because the test set’s purpose is to simulate real-world, unseen data. However, when evaluating a model, we do have full access to both our train and test sets, so it is up to us to ensure that no data in the training set is present in the test set.

In our case, since we want to classify reviews, we must not use the test reviews when adapting the text vectorization layer.

# save features and targets from the 'data'
features, targets = data['text'], data['category_id']

train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        random_state=42,
        shuffle = True,
        stratify=targets
    )

time: 286 ms (started: 2021-10-08 14:40:16 +00:00)
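
To see what the stratify argument does, here is a small self-contained check on made-up data (20 toy reviews, two balanced classes). It is a sketch for illustration, not a step of the tutorial's pipeline.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 reviews of class 0 and 10 of class 1.
texts = [f"review {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42, shuffle=True, stratify=labels
)

# Stratification keeps the class ratio identical in both splits.
print(Counter(train_y))  # 8 samples per class
print(Counter(test_y))   # 2 samples per class

# No review appears in both splits.
print(set(train_x).isdisjoint(test_x))  # True
```

Without stratify, a small or unlucky split could leave some of the 32 categories under-represented in the test set.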

Build the Train & Test TensorFlow Datasets

First, we create TensorFlow Datasets from the raw Train Dataframe for further processing.

Note that:

X: input (text/reviews)
y: target value (categories/topics/class)

Observe that we have reviews in the text as input and categories (topics) in integer as target values:

train_features.values[:5]

array(['İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku',
       'Philips TV İnternet Bağlantı Sorunu!,"Philips 32PFS5803/62 model Smart televizyonumu Vatan markete henüz 1 ay oldu alalı 1 ay her yere bağlanan TV internete bağlı olmasına rağmen YouTube.com, Smart TV, uygulama galerisi vb... Hiçbir uygulamayı açmıyor. Girmeye çalıştığım zaman ""bu TV\'yi internete bağlayın"" sayfası açılıyor ve bağlamaya ...Devamını oku"',
       'Anadolu Hastanesi (Çanakkale) Muayene Süresi Kısalığı,20 aylık çocuğum var devamlı çocuk Dr. y. A muayene oluyorum ama artık aynı sorunla karşılaşmaktan bıktım. Alel acele 5 dakikada muayene yapıor hastanın çıkmasını beklemeden yeni hasta alıyor ve onun yanında çocuk giydiriliyor belki özel konuşacaklarmış ya da özel durumumuz var düşünen yok. Paramızl...Devamını oku',
       'Digiturk Engelsiz Kampanyası Zulmü!,1014917147 numaralı aboneliğimle ilgili. Digiturk pazarlama stratejisi ile insanları örtülü olarak resmen yanıltıyor. Engelli indiriminden taahhütsüz olarak üye oldum. Sonra iptal etmek istedim 70 TL cayma bedeli talep ettiler. Taahhütsüz dedim ilk başta kurum yapıldı 1 yıl içinde iptal edilirse kur...Devamını oku',
       'Rowenta Elektrik Süpürge İyi Çekmiyor!,RO3723TA-JSO-3617 ürün kodlu Rowenta marka elektrik süpürgemi 18 aydır kullanmama rağmen iyi çekmediği için Çanakkale servisine götürdüm ve garantisi bile henüz dolmayan süpürge için filtre temizliği yapılacağından 130 TL ücret istenmektedir. Daha yeni süpürge hem çekmiyor hem de filtre temizliği iç...Devamını oku'],
      dtype=object)



time: 7.27 ms (started: 2021-10-08 14:40:16 +00:00)

train_targets.values[:5]

array([11,  6, 26, 20, 19], dtype=int8)



time: 5.68 ms (started: 2021-10-08 14:40:16 +00:00)

Prepare TensorFlow Datasets

We convert the data stored in Pandas Data Frame into data stored in TensorFlow Data Set as below:

# train X & y
train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(train_features.values, tf.string)
) 
train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(train_targets.values, tf.int64),

) 
# test X & y
test_text_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(test_features.values, tf.string)
) 
test_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(test_targets.values, tf.int64),

)

time: 1.81 s (started: 2021-10-08 14:40:16 +00:00)

Decide the dictionary size and the review size

For preprocessing the text, we need to decide the dictionary (vocabulary) size and the review (text) length.

vocab_size = 20000  # Only consider the top 20K words
max_len = 50  # Maximum review (text) size in words

time: 1.49 ms (started: 2021-10-08 14:40:18 +00:00)
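
One possible way to sanity-check these two numbers is to measure how many reviews fit within max_len tokens and how much of the token stream the most frequent words cover. The sketch below uses made-up reviews and hypothetical helper names; on the real data you would pass the training texts instead.

```python
from collections import Counter

# Toy reviews (made up); in the tutorial this would be the training texts.
reviews = [
    "urun gelmedi",
    "kargo cok gec geldi",
    "urun bozuk cikti iade istiyorum",
    "musteri hizmetleri cevap vermiyor urun hala gelmedi",
]

def share_within_length(texts, max_len):
    """Fraction of texts with at most max_len whitespace tokens."""
    lengths = [len(text.split()) for text in texts]
    return sum(length <= max_len for length in lengths) / len(lengths)

def token_coverage(texts, vocab_size):
    """Fraction of all token occurrences covered by the top vocab_size tokens."""
    counts = Counter(token for text in texts for token in text.split())
    covered = sum(count for _, count in counts.most_common(vocab_size))
    return covered / sum(counts.values())

print(share_within_length(reviews, max_len=5))  # 0.75
print(token_coverage(reviews, vocab_size=2))    # 5 of 18 tokens
```

If most reviews are longer than max_len, or the chosen vocabulary covers only a small share of the tokens, the values are probably too small for the dataset.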

PART C: USE KERAS TEXT VECTORIZATION LAYER

You can watch this part on YouTube in Turkish or English.

8. PREPROCESS THE TEXT WITH THE KERAS `TEXTVECTORIZATION` LAYER

8.1. Define your own `custom_standardization` function

First, I define a function that will preprocess the given text. The custom_standardization function will convert the given string to a standard form by applying several transformations:

convert all characters to lowercase
remove special symbols, extra spaces, HTML tags, digits, and punctuations
remove stop words
replace the special Turkish letters with the corresponding English letters.

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Lowercase, clean, remove stop words, and replace Turkish letters """
    # convert all characters to lowercase
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    # remove asterisks, the "devamını oku" ("read more") suffix, and HTML line breaks
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    # remove every token that contains a digit, then replace punctuation with spaces
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*","")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    #remove stop words (padded with spaces so that only whole words match)
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
      no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    # collapse repeated spaces
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    #replace Turkish chars with the corresponding English letters
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")

    return no_U

time: 17.1 ms (started: 2021-10-08 14:40:18 +00:00)

Quickly verify that custom_standardization works: try it on a sample Turkish input:

input_string = "Bu Issız Öğlenleyin de;  şunu ***1 Pijamalı Hasta***, ve  Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku"
print("input:  ", input_string)
output_string= custom_standardization(input_string)
print("output: ", output_string.numpy().decode("utf-8"))

input:   Bu Issız Öğlenleyin de;  şunu ***1 Pijamalı Hasta***, ve  Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku
output:   issiz oglenleyin pijamali hasta i̇ste yagiz sofore cabucak guvendi 
time: 58.8 ms (started: 2021-10-08 14:40:18 +00:00)
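
Note the token “i̇ste” in the output above: its first letter still carries a separate dot mark. Under Unicode lowercasing rules, the Turkish capital “İ” (U+0130) becomes a plain “i” followed by the combining dot above (U+0307). Judging by the output, tf.strings.lower appears to behave the same way. The plain-Python sketch below shows the effect and one possible fix; adding an equivalent regex_replace step to custom_standardization is a suggestion, not part of the original function.

```python
# Python's str.lower() follows the same Unicode rule: U+0130 lowercases to
# "i" plus the combining dot above (U+0307), i.e. two code points.
lowered = "İşte".lower()
print([hex(ord(ch)) for ch in lowered])  # ['0x69', '0x307', '0x15f', '0x74', '0x65']

# One possible fix: drop the combining mark after lowercasing.
cleaned = lowered.replace("\u0307", "")
print(cleaned)                     # işte
print(len(lowered), len(cleaned))  # 5 4
```

Without such a step, words that differ only in this invisible mark end up as separate vocabulary entries even though they look the same on screen.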

8.2. Configure the Keras `TextVectorization` layer

To preprocess the text, I will use the Keras TextVectorization layer.

tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    ngrams=None,
    output_mode="int",
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    **kwargs
)

The Keras TextVectorization layer processes each example in the dataset as follows:

Standardize each example (usually lowercasing + punctuation stripping)
Split each example into substrings (usually words)
Recombine substrings into tokens (usually ngrams)
Index tokens (associate a unique int value with each token)
Transform each example using this index, either into a vector of ints or a dense float vector.
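
The five steps above can be mimicked in plain Python as follows. This is only a conceptual sketch with hypothetical helper names: the real layer runs these steps as TensorFlow ops and orders its vocabulary by token frequency, whereas this sketch uses first-seen order for brevity.

```python
import string

def standardize(text):
    """Step 1: lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def split(text):
    """Step 2: split on whitespace."""
    return text.split()

def make_ngrams(words, n):
    """Step 3: recombine words into n-gram tokens (n=1 keeps single words)."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(corpus_tokens):
    """Step 4: give every token a unique integer; 0 = padding, 1 = OOV."""
    index = {"": 0, "[UNK]": 1}
    for tokens in corpus_tokens:
        for token in tokens:
            index.setdefault(token, len(index))
    return index

def transform(tokens, index, length):
    """Step 5: map tokens to ints, then pad or truncate to a fixed length."""
    ids = [index.get(token, 1) for token in tokens][:length]
    return ids + [0] * (length - len(ids))

corpus = ["Good product!", "Bad product."]
tokenized = [make_ngrams(split(standardize(doc)), 1) for doc in corpus]
index = build_index(tokenized)
print(index)  # {'': 0, '[UNK]': 1, 'good': 2, 'product': 3, 'bad': 4}
print(transform(split(standardize("Good, cheap product")), index, 5))  # [2, 1, 3, 0, 0]
```

In the last line, "cheap" was never indexed, so it maps to the OOV id 1, and the result is padded with zeros up to the requested length.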

Let’s build our TextVectorization layer by providing:

The custom_standardization() function as the standardize argument (a callable).
The vocab_size as the max_tokens number: The max_tokens is the maximum size of the vocabulary that will be created from the dataset. If None, there is no cap on the size of the vocabulary. Note that this vocabulary contains 1 OOV (Out Of Vocabulary) token, so the effective number of tokens is (max_tokens - 1 - (1 if output_mode == "int" else 0)).
The int keyword as the output_mode: an optional specification for the output of the layer. Values can be “int”, “multi_hot”, “count”, or “tf_idf”, which configure the layer as follows:

“int”: Outputs integer indices, one integer index per split string token. When output_mode == “int”, 0 is reserved for masked locations; this reduces the vocab size to max_tokens - 2 instead of max_tokens - 1.
“multi_hot”: Outputs a single int array per batch, of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the batch item.
“count”: Like “multi_hot”, but the int array contains a count of the number of times the token at that index appeared in the batch item.
“tf_idf”: Like “multi_hot”, but the TF-IDF algorithm is applied to find the value in each token slot.

For “int” output, any shape of input and output is supported.

For all other output modes, currently only rank 1 inputs (and rank 2 outputs after splitting) are supported.

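The max_len as the output_sequence_length: in “int” mode, the output is padded or truncated to exactly max_len tokens per example.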

# Create a vectorization layer (it is adapted to the text in the next step)
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size+2,
    output_mode="int",
    output_sequence_length=max_len,
)

time: 158 ms (started: 2021-10-08 14:40:18 +00:00)
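
The reason for passing vocab_size + 2 as max_tokens follows from the formula quoted earlier: in “int” mode two slots are reserved (padding and OOV). The plain-Python sketch below checks that arithmetic and emulates three of the output modes on a toy vocabulary; the helper name and the toy tokens are assumptions made for illustration, not calls into Keras.

```python
def effective_vocab_size(max_tokens, output_mode):
    """Number of real words the layer can keep, per the formula above."""
    return max_tokens - 1 - (1 if output_mode == "int" else 0)

vocab_size = 20000
print(effective_vocab_size(vocab_size + 2, "int"))        # 20000
print(effective_vocab_size(vocab_size + 2, "multi_hot"))  # 20001

# Emulating three output modes on a toy vocabulary (hypothetical tokens).
int_vocab = ["", "[UNK]", "urun", "kargo"]  # "int": 0 = padding, 1 = OOV
bag_vocab = ["[UNK]", "urun", "kargo"]      # other modes: 0 = OOV, no padding slot
tokens = "urun urun iade".split()

as_int = [int_vocab.index(t) if t in int_vocab else 1 for t in tokens]
as_count = [sum((t if t in bag_vocab else "[UNK]") == v for t in tokens) for v in bag_vocab]
as_multi_hot = [int(c > 0) for c in as_count]

print(as_int)        # [2, 2, 1]
print(as_count)      # [1, 2, 0]
print(as_multi_hot)  # [1, 1, 0]
```

The “int” output keeps word order and has one id per token, while “count” and “multi_hot” produce one fixed-size vector per text and discard the order.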

8.3. Adapt the Keras `TextVectorization` layer with the training data set, (not test data set!)

TextVectorization preprocessing layer has an internal state that can be computed based on a sample of the training data. That is, TextVectorization holds a mapping between string tokens and integer indices.

Thus, we will adapt the TextVectorization preprocessing layer ONLY on the training data.

Please note that: To prevent any data leak, we DO NOT adapt the TextVectorization preprocessing layer on the whole (train & test) data.

vectorize_layer.adapt(train_features)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices

time: 2min 22s (started: 2021-10-08 14:40:18 +00:00)

Let’s see some example conversions:

print("vocab has the ", len(vocab)," entries")
print("vocab has the following first 10 entries")
for word in range(10):
  print(word, " represents the word: ", vocab[word])

for X in train_features[:2]:
  print(" Given raw data: " )
  print(X)
  tokenized = vectorize_layer(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers: " )
  print (tokenized)
  print(" Text after Tokenized and Transformed: ")
  transformed = ""
  for each in tf.squeeze(tokenized):
    transformed= transformed+ " "+ vocab[each]
  print(transformed)

vocab has the  20002  entries
vocab has the following first 10 entries
0  represents the word:  
1  represents the word:  [UNK]
2  represents the word:  ne
3  represents the word:  tl
4  represents the word:  gun
5  represents the word:  urun
6  represents the word:  aldim
7  represents the word:  siparis
8  represents the word:  musteri
9  represents the word:  tarihinde
 Given raw data: 
İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku
 Tokenized and Transformed to a vector of integers: 
tf.Tensor(
[[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
   1709    64   209  2335     1  4024   853   184  1037     1    72     2
   1709   623   177     1 18408   367     1   282  2582  3586     1     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0]], shape=(1, 50), dtype=int64)
 Text after Tokenized and Transformed: 
 i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]               
 Given raw data: 
Philips TV İnternet Bağlantı Sorunu!,"Philips 32PFS5803/62 model Smart televizyonumu Vatan markete henüz 1 ay oldu alalı 1 ay her yere bağlanan TV internete bağlı olmasına rağmen YouTube.com, Smart TV, uygulama galerisi vb... Hiçbir uygulamayı açmıyor. Girmeye çalıştığım zaman ""bu TV'yi internete bağlayın"" sayfası açılıyor ve bağlamaya ...Devamını oku"
 Tokenized and Transformed to a vector of integers: 
tf.Tensor(
[[  226    44   354  1078    17   226   215   206  9049  1079  2556   460
     11   574    11    19   294 13253    44  2481  1384   124   648   141
    206    44   672     1  2262    22  2862   890  5564  2058    67    44
    469  2481     1  4955  1862 15099     0     0     0     0     0     0
      0     0]], shape=(1, 50), dtype=int64)
 Text after Tokenized and Transformed: 
 philips tv i̇nternet baglanti sorunu philips model smart televizyonumu vatan markete henuz ay alali ay her yere baglanan tv internete bagli olmasina youtube com smart tv uygulama [UNK] vb hicbir uygulamayi acmiyor girmeye calistigim zaman tv yi internete [UNK] sayfasi aciliyor baglamaya        
time: 157 ms (started: 2021-10-08 14:42:41 +00:00)

vocab[:5]

['', '[UNK]', 'ne', 'tl', 'gun']



time: 4.75 ms (started: 2021-10-08 14:42:41 +00:00)
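
The same behavior is easier to follow on a toy corpus. The sketch below uses made-up texts and the layer's default standardization rather than custom_standardization; it adapts a small layer on the “training” texts only and then encodes an unseen “test” text.

```python
import tensorflow as tf

# Toy "training" and "test" texts (made up for illustration).
train_texts = tf.constant(["red red red", "red green green", "blue"])
test_texts = tf.constant(["red yellow"])

toy_layer = tf.keras.layers.TextVectorization(output_mode="int")
toy_layer.adapt(train_texts)  # the vocabulary is learned from training data only

# Tokens are ordered by frequency after the two reserved entries.
print(toy_layer.get_vocabulary())  # ['', '[UNK]', 'red', 'green', 'blue']

# "yellow" never appeared in the training texts, so it maps to [UNK] (id 1).
print(toy_layer(test_texts).numpy())  # [[2 1]]
```

This is the situation the deployed model will face: words that were not seen during adapt() are all mapped to the single [UNK] id.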

8.4. Save & Load the TextVectorization layer

Adapting the Keras TextVectorization layer on a large text dataset takes a considerable amount of time, and you will likely need to port the adapted layer to a different deployment environment. It is therefore good to know how to save and load it.

How to save a Keras TextVectorization layer?

There are currently 2 ways of doing it:

save the Keras TextVectorization layer in a Keras Model
save the Keras TextVectorization layer as a pickle file.

In this tutorial, I will use the first approach as it is native to the TF/Keras environment.
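
For completeness, here is one possible shape of the second (pickle) approach on a toy layer. It is a sketch under stated assumptions, not the method used in the rest of this tutorial: the file name, the toy texts, and the choice to store only the vocabulary and one setting are all made up for illustration.

```python
import os
import pickle
import tempfile

import tensorflow as tf

# Adapt a toy layer (texts are made up for illustration).
toy_layer = tf.keras.layers.TextVectorization(output_mode="int", output_sequence_length=4)
toy_layer.adapt(tf.constant(["red red red", "red green green", "blue"]))

# Save only plain Python data: the setting we need and the learned words
# (without the reserved '' and '[UNK]' entries at positions 0 and 1).
state = {"output_sequence_length": 4, "vocabulary": toy_layer.get_vocabulary()[2:]}
path = os.path.join(tempfile.gettempdir(), "toy_vectorizer.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(state, f)

# Rebuild an equivalent layer from the pickle file.
with open(path, "rb") as f:
    restored = pickle.load(f)
new_layer = tf.keras.layers.TextVectorization(
    output_mode="int",
    output_sequence_length=restored["output_sequence_length"],
    vocabulary=restored["vocabulary"],
)

sample = tf.constant(["green blue yellow"])
print(toy_layer(sample).numpy())  # [[3 4 1 0]]
print(new_layer(sample).numpy())  # [[3 4 1 0]]
```

With a custom standardize function, the same callable would also have to be passed again when the layer is rebuilt, since this sketch does not store it in the pickle file.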

8.4.1. Ensure that you are on the correct directory path :)

%cd ../models/
%ls

/content/gdrive/My Drive/Colab Notebooks/models
checkpoint/                    MultiClassTextClassificationExported/
end_to_end_model/              MultitopicTextGenerator/
MultiClassTextClassification/  vectorize_layer_model/
time: 366 ms (started: 2021-10-08 14:42:41 +00:00)

8.4.2. Create a temporary Keras `model` by adding the adapted Keras `TextVectorization` layer

# Create model.
vectorize_layer_model = tf.keras.models.Sequential()
vectorize_layer_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
vectorize_layer_model.add(vectorize_layer)
vectorize_layer_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVec  (None, 50)               0         
 torization)                                                     
                                                                 
=================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________
time: 256 ms (started: 2021-10-08 14:42:41 +00:00)

8.4.3. Save the temporary model including the adapted Keras `TextVectorization` layer

filepath = "vectorize_layer_model"

time: 721 µs (started: 2021-10-08 14:42:42 +00:00)

vectorize_layer_model.save(filepath, save_format="tf")

WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
INFO:tensorflow:Assets written to: vectorize_layer_model/assets
time: 4.86 s (started: 2021-10-08 14:42:42 +00:00)

%ls

checkpoint/                    MultiClassTextClassificationExported/
end_to_end_model/              MultitopicTextGenerator/
MultiClassTextClassification/  vectorize_layer_model/
time: 153 ms (started: 2021-10-08 14:42:46 +00:00)

8.4.4. Load the `vectorize_layer_model` back to check if saving is successful

loaded_vectorize_layer_model = tf.keras.models.load_model(filepath)

WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
time: 1.93 s (started: 2021-10-08 14:42:47 +00:00)

8.4.5 Retrieve the loaded Keras `TextVectorization` layer

Here, you have 2 options:

use the loaded_vectorize_layer_model.predict() method to apply the Keras TextVectorization layer, or
get the Keras TextVectorization layer out of the loaded model as below:

loaded_vectorize_layer = loaded_vectorize_layer_model.layers[0]

time: 1.97 ms (started: 2021-10-08 14:42:49 +00:00)

8.4.6. Compare the original and loaded `TextVectorization` layers

loaded_vocab=loaded_vectorize_layer.get_vocabulary()
print("original vocab has the ", len(vocab)," entries")
print("loaded vocab has the   ", len(loaded_vocab)," entries")
print("loaded vocab has the following first 10 entries")
for word in range(10):
  print(word, " represents the word: ")
  print(vocab[word], " in original vocab")
  print(loaded_vocab[word], " in loaded vocab")
for X in train_features[:1]:
  print(" Given raw data: " )
  print(X)

  tokenized = vectorize_layer(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers by the original vectorize layer:" )
  print (tokenized)

  tokenized = loaded_vectorize_layer(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers by the loaded vectorize layer:" )
  print (tokenized)
  
  tokenized = loaded_vectorize_layer_model.predict(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers by the loaded_vectorize_layer_model:" )
  print (tokenized)

  print(" Text after Tokenized and Transformed by the original vectorize layer:: ")
  transformed = ""
  for each in tf.squeeze(tokenized):
    transformed= transformed+ " "+ vocab[each]
  print(transformed)

  print(" Text after Tokenized and Transformed by the loaded vectorize layer:")
  transformed = ""
  for each in tf.squeeze(tokenized):
    transformed= transformed+ " "+ loaded_vocab[each]
  print(transformed)

original vocab has the  20002  entries
loaded vocab has the    20002  entries
loaded vocab has the following first 10 entries
0  represents the word: 
  in original vocab
  in loaded vocab
1  represents the word: 
[UNK]  in original vocab
[UNK]  in loaded vocab
2  represents the word: 
ne  in original vocab
ne  in loaded vocab
3  represents the word: 
tl  in original vocab
tl  in loaded vocab
4  represents the word: 
gun  in original vocab
gun  in loaded vocab
5  represents the word: 
urun  in original vocab
urun  in loaded vocab
6  represents the word: 
aldim  in original vocab
aldim  in loaded vocab
7  represents the word: 
siparis  in original vocab
siparis  in loaded vocab
8  represents the word: 
musteri  in original vocab
musteri  in loaded vocab
9  represents the word: 
tarihinde  in original vocab
tarihinde  in loaded vocab
 Given raw data: 
İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku
 Tokenized and Transformed to a vector of integers by the original vectorize layer:
tf.Tensor(
[[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
   1709    64   209  2335     1  4024   853   184  1037     1    72     2
   1709   623   177     1 18408   367     1   282  2582  3586     1     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0]], shape=(1, 50), dtype=int64)
 Tokenized and Transformed to a vector of integers by the loaded vectorize layer:
tf.Tensor(
[[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
   1709    64   209  2335     1  4024   853   184  1037     1    72     2
   1709   623   177     1 18408   367     1   282  2582  3586     1     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0]], shape=(1, 50), dtype=int64)
 Tokenized and Transformed to a vector of integers by the loaded_vectorize_layer_model:
[[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
   1709    64   209  2335     1  4024   853   184  1037     1    72     2
   1709   623   177     1 18408   367     1   282  2582  3586     1     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0]]
 Text after Tokenized and Transformed by the original vectorize layer:: 
 i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]               
 Text after Tokenized and Transformed by the loaded vectorize layer:
 i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]               
time: 787 ms (started: 2021-10-08 14:42:49 +00:00)

As you see above, we successfully saved and loaded the adapted Keras TextVectorization layer!
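
Rather than reading the printout line by line, the same check could be automated. The following is a minimal sketch in plain Python. `first_vocab_mismatch` is a hypothetical helper (it is not part of Keras), and the two short lists only stand in for the results of `vectorize_layer.get_vocabulary()` and `loaded_vectorize_layer.get_vocabulary()`.

```python
def first_vocab_mismatch(original_vocab, loaded_vocab):
    """Return the first index where the two vocabularies differ, or None."""
    for index, (a, b) in enumerate(zip(original_vocab, loaded_vocab)):
        if a != b:
            return index
    if len(original_vocab) != len(loaded_vocab):
        # All shared positions match, but one list has extra entries.
        return min(len(original_vocab), len(loaded_vocab))
    return None


# Toy vocabularies standing in for the get_vocabulary() results.
vocab_demo = ["", "[UNK]", "ne", "tl", "gun"]
loaded_vocab_demo = ["", "[UNK]", "ne", "tl", "gun"]

mismatch = first_vocab_mismatch(vocab_demo, loaded_vocab_demo)
print("identical" if mismatch is None else f"first difference at index {mismatch}")
```

With the real layers, a result of `None` would indicate that the adapted vocabulary was preserved by saving and loading.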

We can continue to the TensorFlow data pipeline with the adapted Keras TextVectorization layer:

pwd

'/content/gdrive/My Drive/Colab Notebooks/models'



time: 11.9 ms (started: 2021-10-08 14:42:49 +00:00)

9. APPLY KERAS `TEXTVECTORIZATION` TO TRAIN & TEST DATA SETS

We can define a function to apply the Keras TextVectorization on a given string as follows:

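def convert_text_input(sample):
    # Add a trailing axis so that the layer receives a batch holding one string
    text = tf.expand_dims(sample, -1)
    # The reloaded layer is used here; the original vectorize_layer gives the same result
    return tf.squeeze(loaded_vectorize_layer(text))
time: 1.48 ms (started: 2021-10-08 14:42:49 +00:00)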

We use the map() function of the TensorFlow tf.data API (TF Data Pipeline) to apply convert_text_input() to every sample in the text column (reviews) of the training and test datasets.

# Train X
train_text_ds = train_text_ds_raw.map(convert_text_input,
                                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
# Test X
test_text_ds = test_text_ds_raw.map(convert_text_input,
                                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 696 ms (started: 2021-10-08 14:42:49 +00:00)

Let’s see the converted/encoded texts (reviews):

for each in train_text_ds.take(3):
  print(each)

tf.Tensor(
[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
  1709    64   209  2335     1  4024   853   184  1037     1    72     2
  1709   623   177     1 18408   367     1   282  2582  3586     1     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0], shape=(50,), dtype=int64)
tf.Tensor(
[  226    44   354  1078    17   226   215   206  9049  1079  2556   460
    11   574    11    19   294 13253    44  2481  1384   124   648   141
   206    44   672     1  2262    22  2862   890  5564  2058    67    44
   469  2481     1  4955  1862 15099     0     0     0     0     0     0
     0     0], shape=(50,), dtype=int64)
tf.Tensor(
[  465   171  3144   673   378     1   192  1280    10  1273   414  1023
   695    74   673  3805   102    25  1777     1  1706     1  6537  2406
   673     1  8569  9825  9478    79  1001   788   975   414     1   348
     1    28   348 13025    10 11558     1     0     0     0     0     0
     0     0], shape=(50,), dtype=int64)
time: 154 ms (started: 2021-10-08 14:42:50 +00:00)
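
To make the encoded tensors above easier to read, here is a toy version of the lookup-and-pad step in plain Python. It only illustrates the idea and is not how Keras implements `TextVectorization`; `encode_tokens` and the small vocabulary are hypothetical. It follows the conventions visible in the outputs above: index 0 is padding, index 1 is `[UNK]`, and every sequence is padded or truncated to a fixed length (50 in the tutorial, 8 here).

```python
def encode_tokens(tokens, vocab, max_len):
    """Map tokens to vocabulary indices, then pad or truncate to max_len."""
    index_of = {word: i for i, word in enumerate(vocab)}
    unk_id = index_of["[UNK]"]
    # Unknown words fall back to the [UNK] index; long inputs are cut off.
    ids = [index_of.get(token, unk_id) for token in tokens][:max_len]
    # Index 0 (the empty string) is used as padding.
    return ids + [0] * (max_len - len(ids))


toy_vocab = ["", "[UNK]", "ne", "tl", "gun", "urun"]
print(encode_tokens(["urun", "ne", "kargo", "gun"], toy_vocab, max_len=8))
# [5, 2, 1, 4, 0, 0, 0, 0]
```

The word "kargo" is not in the toy vocabulary, so it is mapped to 1, just like the `[UNK]` positions in the real output.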

GENERATE THE TRAIN SET BY COMBINING X & Y:

X: the preprocessed & encoded reviews
y: the encoded categories

train_ds = tf.data.Dataset.zip(
    (
            train_text_ds,
            train_cat_ds_raw
     )
)
time: 3.9 ms (started: 2021-10-08 14:42:50 +00:00)

Similarly, let’s bundle test data sets as a single data set:

test_ds = tf.data.Dataset.zip(
    (
            test_text_ds,
            test_cat_ds_raw
     )
)
time: 1.7 ms (started: 2021-10-08 14:42:50 +00:00)

We can see the result of the Text Vectorization in the Data Pipelining as follows:

for X,y in train_ds.take(1):
  print("input (review) X.shape: ", X.shape)
  print("output (category) y.shape: ", y.shape)
  print("input (review) X: ", X)
  print("output (category) y: ",y)
  # Descriptive names avoid shadowing the built-in input()
  review_text = " ".join([vocab[token_id] for token_id in np.squeeze(X)])
  category_text = id_to_category[y.numpy()]
  print("X: input (review) in text: " , review_text)
  print("y: output (category) in text: " , category_text)

input (review) X.shape:  (50,)
output (category) y.shape:  ()
input (review) X:  tf.Tensor(
[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
  1709    64   209  2335     1  4024   853   184  1037     1    72     2
  1709   623   177     1 18408   367     1   282  2582  3586     1     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0], shape=(50,), dtype=int64)
output (category) y:  tf.Tensor(11, shape=(), dtype=int64)
X: input (review) in text:  i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]               
y: output (category) in text:  gida
time: 167 ms (started: 2021-10-08 14:42:50 +00:00)
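
The decoded review above ends with a run of spaces because the padding index 0 maps to the empty string. A small variation, sketched below with a hypothetical `decode_ids` helper and a toy vocabulary, drops the padding positions before joining the words.

```python
def decode_ids(token_ids, vocab, pad_id=0):
    """Turn token ids back into text, skipping padding positions."""
    return " ".join(vocab[int(i)] for i in token_ids if int(i) != pad_id)


toy_vocab = ["", "[UNK]", "ne", "tl", "gun", "urun"]
print(decode_ids([5, 2, 1, 4, 0, 0, 0, 0], toy_vocab))
# urun ne [UNK] gun
```

In the loop above, the same helper could presumably be called as `decode_ids(X.numpy(), vocab)`.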

11. FINALIZE TENSORFLOW DATA PIPELINE

Finalize TensorFlow Data Pipeline by setting necessary parameters for batching, shuffling, and optimizing as follows:

batch_size = 64
AUTOTUNE = tf.data.experimental.AUTOTUNE
buffer_size= train_ds.cardinality().numpy()

train_ds = train_ds.shuffle(buffer_size=buffer_size)\
                   .batch(batch_size=batch_size,drop_remainder=True)\
                   .cache()\
                   .prefetch(AUTOTUNE)

test_ds = test_ds.shuffle(buffer_size=buffer_size)\
                   .batch(batch_size=batch_size,drop_remainder=True)\
                   .cache()\
                   .prefetch(AUTOTUNE)
time: 17.7 ms (started: 2021-10-08 14:42:50 +00:00)

train_ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(64,), dtype=tf.int64, name=None))



time: 5.28 ms (started: 2021-10-08 14:42:50 +00:00)
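
With `drop_remainder=True`, a final partial batch is discarded, which is why the label spec above has the fixed shape `(64,)`. The arithmetic can be sketched in plain Python. `count_batches` is a hypothetical helper, and the sample count of 1,000 is an arbitrary illustration, not the size of the tutorial's dataset.

```python
import math


def count_batches(num_samples, batch_size, drop_remainder):
    """Number of batches produced when num_samples items are batched."""
    if drop_remainder:
        # Only full batches are kept.
        return num_samples // batch_size
    # The last batch may be smaller than batch_size.
    return math.ceil(num_samples / batch_size)


print(count_batches(1000, 64, drop_remainder=True))   # 15 (40 samples are dropped)
print(count_batches(1000, 64, drop_remainder=False))  # 16 (the last batch has 40 samples)
```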

PART D: BUILD AN END-TO-END MODEL

You can watch this part on YouTube in Turkish or English.

12. Create a Classification Model

For the sake of demonstration of the Keras TextVectorization layer, let's build a very simple model:

def create_model():
    inputs_tokens = layers.Input(shape=(max_len,), dtype=tf.int32)
    embedding_layer = layers.Embedding(vocab_size, 256)
    x = embedding_layer(inputs_tokens)
    x = layers.Flatten()(x)
    outputs = layers.Dense(32)(x)
    model = keras.Model(inputs=inputs_tokens, outputs=outputs)
    
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metric_fn  = tf.keras.metrics.SparseCategoricalAccuracy()
    model.compile(optimizer="adam", loss=loss_fn, metrics=metric_fn)  
    
    return model

my_model=create_model()
my_model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 50)]              0         
                                                                 
 embedding (Embedding)       (None, 50, 256)           5120000   
                                                                 
 flatten (Flatten)           (None, 12800)             0         
                                                                 
 dense (Dense)               (None, 32)                409632    
                                                                 
=================================================================
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 54.3 ms (started: 2021-10-08 14:42:51 +00:00)
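
The parameter counts in the summary can be reproduced by hand. The sketch below assumes `vocab_size = 20000` and `max_len = 50`; both values are inferred from the summary (5,120,000 / 256 and the `(None, 50)` input shape), since their definitions lie outside this section.

```python
vocab_size = 20000   # assumed: inferred from 5,120,000 / 256
max_len = 50         # assumed: taken from the (None, 50) input shape
embed_dim = 256
num_classes = 32

embedding_params = vocab_size * embed_dim                  # one vector per token id
flatten_units = max_len * embed_dim                        # values per review after Flatten
dense_params = flatten_units * num_classes + num_classes   # weights plus biases

print(embedding_params)                  # 5120000
print(flatten_units)                     # 12800
print(dense_params)                      # 409632
print(embedding_params + dense_params)   # 5529632
```

Almost all of the parameters sit in the embedding table, which is typical for a model this shallow.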

13. Train the Classification Model

my_model.fit(train_ds, verbose=1, epochs=3)

Epoch 1/3
5391/5391 [==============================] - 251s 9ms/step - loss: 0.3019 - sparse_categorical_accuracy: 0.9310
Epoch 2/3
5391/5391 [==============================] - 46s 8ms/step - loss: 0.0424 - sparse_categorical_accuracy: 0.9897
Epoch 3/3
5391/5391 [==============================] - 46s 8ms/step - loss: 0.0049 - sparse_categorical_accuracy: 0.9993





<keras.callbacks.History at 0x7fef7fe05050>



time: 6min 30s (started: 2021-10-08 14:42:51 +00:00)

loss, accuracy = my_model.evaluate(test_ds)
print("Test accuracy: ", accuracy)

1347/1347 [==============================] - 56s 3ms/step - loss: 0.2385 - sparse_categorical_accuracy: 0.9511
Test accuracy:  0.9511414170265198
time: 55.6 s (started: 2021-10-08 14:49:21 +00:00)

14. An End-To-End Classification Model

Note that the above model expects batches of integer tensors as input:

Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 50)]              0

Thus, you can NOT supply raw data (some text) to the model for prediction. TensorFlow/Keras would raise an error such as the one below:

raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']

predictions=my_model.predict(raw_data)

ValueError: in user code: Exception encountered when calling layer "model" (type Functional).
    
    Input 0 of layer "dense" is incompatible with the layer: expected axis -1 of input shape to have value 12800, but received input with shape (None, 256)
    
    Call arguments received:
      • inputs=tf.Tensor(shape=(None,), dtype=string)
      • training=False
      • mask=None

However, it is often a big advantage to design a model that accepts raw data as input and then preprocesses the data by itself.

For example, such a model can be easily exported to different platforms/environments without the need to export the preprocessing code!

Therefore, Keras provides several Preprocessing Layers so that we can integrate preprocessing logic into a Keras model as a layer.

Afterwards, we can export such models and use them on other platforms without rewriting the preprocessing code on the target platforms/environments.

These kinds of models can be called End-To-End Models. That is, an End-To-End model can accept Raw Input Data and preprocess it by itself.

What could be Raw Data?

It could be:

text
image
structured data
etc.

Let’s create an End-To-End Classification Model by integrating the adapted Keras TextVectorization layer into the trained model as the first layer.

You can create an End-To-End Model either by:

Keras Sequential API, or
Keras Functional API

14.1. Create an End-To-End Model with Keras Sequential API

end_to_end_model = tf.keras.Sequential([
  keras.Input(shape=(1,), dtype="string"),
  vectorize_layer,
  my_model,
  layers.Activation('softmax')
])

end_to_end_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)
end_to_end_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVec  (None, 50)               0         
 torization)                                                     
                                                                 
 model (Functional)          (None, 32)                5529632   
                                                                 
 activation (Activation)     (None, 32)                0         
                                                                 
=================================================================
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 282 ms (started: 2021-10-08 14:50:16 +00:00)

14.2. Create an End-To-End Model with Keras Functional API

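inputs = keras.Input(shape=(1,), dtype="string")
x = vectorize_layer(inputs)
# Note: unlike the Sequential version, no softmax is appended here, so the
# outputs are the raw logits of my_model. With from_logits=False below, the
# reported loss is not directly comparable; accuracy and argmax predictions
# are unaffected.
outputs = my_model(x)
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)
end_to_end_model.summary()

Model: "model_1"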
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 50)               0         
 torization)                                                     
                                                                 
 model (Functional)          (None, 32)                5529632   
                                                                 
=================================================================
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 284 ms (started: 2021-10-08 14:50:17 +00:00)

14.3. Test the End-to-End model with Raw (Text) Data

raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']
predictions=end_to_end_model.predict(raw_data)
print(id_to_category[np.argmax(predictions[0])])
print(id_to_category[np.argmax(predictions[1])])

alisveris
ulasim
time: 608 ms (started: 2021-10-08 14:50:17 +00:00)

loss, accuracy = end_to_end_model.evaluate(test_features,test_targets)
print("end_to_end_model accuracy: ", accuracy)

2696/2696 [==============================] - 47s 17ms/step - loss: 2.3769 - accuracy: 0.9511
end_to_end_model accuracy:  0.9511488080024719
time: 46.8 s (started: 2021-10-08 14:50:18 +00:00)
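
The loss reported here (2.3769) differs from the 0.2385 measured earlier for `my_model`, while the accuracy is practically unchanged. A plausible explanation, offered as an assumption rather than something stated in the source, is that the Functional variant passes the raw logits of `my_model` to a loss configured with `from_logits=False`. The plain-Python sketch below uses made-up logits to show why the predicted class does not depend on the softmax step, while cross-entropy is defined on the probabilities that softmax produces.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [2.0, -1.0, 0.5]   # made-up scores for three classes
probs = softmax(logits)
true_class = 0

# Softmax is monotonic, so the winning class stays the same.
print(argmax(logits) == argmax(probs))
# Cross-entropy for one sample is the negative log-probability of the true class.
print(-math.log(probs[true_class]))
```

This is why `np.argmax(predictions[0])` gives the same category with or without a softmax layer, even when the reported loss values disagree.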

14.4. Save the End-to-End model

end_to_end_model.save("end_to_end_model")

INFO:tensorflow:Assets written to: end_to_end_model/assets
time: 5.58 s (started: 2021-10-08 14:51:04 +00:00)

14.5. Load the End-to-End model

loaded_end_to_end_model = tf.keras.models.load_model("end_to_end_model")
time: 2.65 s (started: 2021-10-08 14:51:10 +00:00)

14.6. Test the Loaded End-to-End model with Raw (Text) Data

raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']
predictions=loaded_end_to_end_model.predict(raw_data)
print(id_to_category[np.argmax(predictions[0])])
print(id_to_category[np.argmax(predictions[1])])

alisveris
ulasim
time: 573 ms (started: 2021-10-08 14:51:13 +00:00)

loss, accuracy = loaded_end_to_end_model.evaluate(test_features,test_targets)
print("loaded_end_to_end_model accuracy: ", accuracy)

2696/2696 [==============================] - 46s 17ms/step - loss: 2.3769 - accuracy: 0.9511
loaded_end_to_end_model accuracy:  0.9511488080024719
time: 46.1 s (started: 2021-10-08 14:51:13 +00:00)

PART E: DEPLOY END-TO-END MODEL TO HUGGINGFACE SPACES USING GRADIO

In this part, we will learn how to deploy the End-to-End model with a Keras TextVectorization layer to the HuggingFace Spaces. For the interface, we will use the Gradio library.

Custom Standardization Function

Please note that when we configured the Keras TextVectorization layer in this tutorial, we did not use the default standardization. Instead, we implemented a custom standardization function (custom_standardization), as shown below.

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    # Raw strings keep the regex backslashes intact and avoid invalid-escape warnings
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*", "")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    #remove stop words
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
      no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    #remove Turkish chars
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")
    return no_U

It is really important to note that when you deploy the end-to-end model with a Keras TextVectorization layer, you must also register the custom standardization function (custom_standardization) with Keras in the deployment environment. Otherwise, you will receive errors and your end-to-end model will not work.
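
To get a feeling for what the standardization steps do to a review, they can be mirrored in plain Python. This is only an illustrative sketch: `standardize_py` and the three-word stop list are hypothetical, Python's `re` module is not the regex engine TensorFlow uses, and the punctuation pattern is escaped with `re.escape`, so the result may differ from `custom_standardization` in edge cases.

```python
import re
import string

# Hypothetical stand-in for the stop-word file used in the tutorial.
STOP_WORDS = ["ama", "ancak", "bu"]

# Turkish-specific letters mapped to their ASCII look-alikes.
TURKISH_TO_ASCII = str.maketrans("ıöçşğü", "iocsgu")


def standardize_py(text, stop_words=STOP_WORDS):
    text = text.lower()
    text = text.replace("*", " ")
    text = text.replace("devamını oku", "")
    text = text.replace("<br />", "")
    text = re.sub(r"\w*\d\w*", "", text)                            # drop words containing digits
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # punctuation to spaces
    text = f" {text} "                                              # pad so stop words at the edges match
    for word in stop_words:
        text = text.replace(f" {word} ", " ")
    text = re.sub(r" +", " ", text)                                 # collapse repeated spaces
    return text.translate(TURKISH_TO_ASCII)


print(repr(standardize_py("Bu ürün 3gün ÇOK güzel!")))
# ' urun cok guzel '
```

The leading and trailing spaces come from the padding step and are harmless for whitespace tokenization.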

Gradio

As stated on the official website:

"Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!"

You can learn the details of the library from this demo section.

Since our aim is to learn how to deploy an end-to-end model, we will not get into the details of the Gradio interface library. Nevertheless, we will cover its related functionality for our purpose.

HuggingFace Spaces

HuggingFace provides us a free service to upload and deploy Machine Learning models. You can create your account for free and use the CPU-based platform for free. This service is called HuggingFace Spaces.

It is also integrated with GitHub, so you can connect your repos and have them served directly by HuggingFace Spaces.

Import Libraries

Let's import the necessary libraries to upload and run the end-to-end model:

import numpy as np 
import tensorflow as tf
import pickle
import string
import pandas as pd

Load STOP WORDS in Turkish

As you might remember we have used a "Stop words in Turkish" file for pre-processing the data. Let's load it:

path = "/content/gdrive/MyDrive/Colab Notebooks/input/"
tr_stop_words = pd.read_csv(path+'tr_stop_word.txt',header=None)
for each in tr_stop_words.values[:5]:
  print(each[0])

ama
amma
anca
ancak
bu

Load the End-to-End Model

We saved the End-to-End Model in the previous tutorial, so now we can try to load it back:

path = "/content/gdrive/MyDrive/Colab Notebooks/models/"
loaded_end_to_end_model = tf.keras.models.load_model(path+"end_to_end_model")

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-3-ce2aedddcfdb> in <module>
      1 path = "/content/gdrive/MyDrive/Colab Notebooks/models/"
----> 2 loaded_end_to_end_model = tf.keras.models.load_model(path+"end_to_end_model")


/usr/local/lib/python3.9/dist-packages/keras/saving/saving_api.py in load_model(filepath, custom_objects, compile, safe_mode, **kwargs)
    228 
    229     # Legacy case.
--> 230     return legacy_sm_saving_lib.load_model(
    231         filepath, custom_objects=custom_objects, compile=compile, **kwargs
    232     )


/usr/local/lib/python3.9/dist-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
---> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb


/usr/local/lib/python3.9/dist-packages/keras/saving/legacy/serialization.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
    533             obj = object_registration._GLOBAL_CUSTOM_OBJECTS[object_name]
    534         else:
--> 535             obj = module_objects.get(object_name)
    536             if obj is None:
    537                 raise ValueError(


AttributeError: 'NoneType' object has no attribute 'get'

As you saw above, we received an error message. The reason is that we tried to load a model that includes a Keras TextVectorization layer with a custom standardization function without providing this function.

To fix this, first let's register the custom standardization function as a Keras Serializable object by decorating it with the @tf.keras.utils.register_keras_serializable() decorator:

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    # Raw strings keep the regex backslashes intact and avoid invalid-escape warnings
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*", "")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    #remove stop words
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
      no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    #remove Turkish chars
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")

    return no_U

Now, let's re-try to load the end-to-end model:

path = "/content/gdrive/MyDrive/Colab Notebooks/models/"
loaded_end_to_end_model = tf.keras.models.load_model(path+"end_to_end_model")

This time, we successfully loaded the saved end-to-end model!
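
Conceptually, the saved configuration refers to the custom function by name, and the loader has to find a Python object registered under that name. The toy registry below illustrates this idea in plain Python. It is a deliberately simplified, hypothetical sketch (`REGISTRY`, `register`, and `lookup` are made-up names) and does not reproduce Keras internals.

```python
REGISTRY = {}


def register(func):
    """Decorator that records a function under its own name."""
    REGISTRY[func.__name__] = func
    return func


def lookup(name):
    """Find a registered function by name, as a loader would."""
    if name not in REGISTRY:
        raise KeyError(f"{name} is not registered; define and register it before loading")
    return REGISTRY[name]


try:
    lookup("custom_standardization")   # fails: nothing has been registered yet
except KeyError as err:
    print("load failed:", err)


@register
def custom_standardization(text):
    # Placeholder body; the real function is the one defined above.
    return text.lower()


print(lookup("custom_standardization")("THY Uçak"))   # thy uçak
```

The first lookup fails for the same reason the first `load_model()` call failed: the name is known, but no function is attached to it yet.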

Load the Ids of Categories

We also saved the id_to_category dictionary in the previous tutorial, so now we can load it back:

path = "/content/gdrive/MyDrive/Colab Notebooks/input/"
# A context manager closes the file once the dictionary has been loaded
with open(path+"id_to_category.pkl", "rb") as pkl_file:
    id_to_category = pickle.load(pkl_file)
print(id_to_category)

{2: 'beyaz-esya', 27: 'sigortacilik', 12: 'giyim', 4: 'cep-telefon-kategori', 26: 'saglik', 22: 'mobilya-ev-tekstili', 10: 'finans', 24: 'mutfak-arac-gerec', 20: 'medya', 29: 'temizlik', 19: 'kucuk-ev-aletleri', 21: 'mekan-ve-eglence', 18: 'kisisel-bakim-ve-kozmetik', 23: 'mucevher-saat-gozluk', 28: 'spor', 17: 'kargo-nakliyat', 13: 'hizmet-sektoru', 1: 'anne-bebek', 0: 'alisveris', 5: 'egitim', 9: 'etkinlik-ve-organizasyon', 30: 'turizm', 8: 'enerji', 3: 'bilgisayar', 7: 'emlak-ve-insaat', 31: 'ulasim', 16: 'kamu-hizmetleri', 6: 'elektronik', 11: 'gida', 14: 'icecek', 25: 'otomotiv', 15: 'internet'}

Create a Function for Classification

We now define a simple function that receives a review and returns its predicted class:

def classify (text):
  pred=loaded_end_to_end_model.predict([text])
  return id_to_category[np.argmax(pred)]

examples=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']

classify(examples[1])



'ulasim'
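
`classify()` returns only the single most likely category. If the scores of several candidates are of interest, they could be collected into a small dictionary. The sketch below is plain Python with a hypothetical `top_k_categories` helper and made-up scores. In the tutorial the scores would come from `loaded_end_to_end_model.predict([text])[0]`, and they are probabilities only if the model ends with a softmax.

```python
def top_k_categories(scores, id_to_category, k=3):
    """Return the k highest-scoring categories as {name: score}."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {id_to_category[i]: float(scores[i]) for i in ranked[:k]}


# Toy mapping and scores; the real dictionary has 32 categories.
toy_id_to_category = {0: "alisveris", 1: "anne-bebek", 2: "beyaz-esya", 3: "bilgisayar"}
toy_scores = [0.10, 0.05, 0.15, 0.70]

print(top_k_categories(toy_scores, toy_id_to_category, k=2))
# {'bilgisayar': 0.7, 'beyaz-esya': 0.15}
```

A dictionary of this shape is the kind of value that Gradio's `label` output component is designed to display, although this tutorial keeps the simpler text output.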

Add User Interface to the End-to-End Model with Gradio

We can design a very simple yet useful interface with the Gradio library.

First, install the library:

!pip install gradio

Then, import it.

import gradio as gr

The core of the Gradio library is the Interface class. It takes 3 important parameters to build an interface:

A function to call with the inputs
A list of inputs
A list of outputs

In our case, we would like to run the classify() function passing one text input and getting the result as one text output.

Moreover, if you like, you can provide some example inputs as well.

iface = gr.Interface(fn=classify, inputs="text", outputs="text", examples=examples)

After configuring the interface, we can launch and use it:

iface.launch()

Deploy to HF Spaces

Here, I will show the simplest way to deploy the above model and its interface to HF Spaces. There are other ways as well, but this is the simplest one.

First, sign in to your HF account, go to the Spaces section, and click to create a new Space.

Then, create your Space by filling in the form.

Click the FILES menu

Continue with creating a new file option. We will create 2 important files.

requirements.txt: It will include all the necessary libraries to run our code.

app.py: It will include all of the code above.

import numpy as np 
import tensorflow as tf
import pickle
import string
import pandas as pd
import gradio as gr

tr_stop_words = pd.read_csv('tr_stop_word.txt',header=None)

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    # Raw strings keep the regex backslashes intact and avoid invalid-escape warnings
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*", "")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    #remove stop words
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
      no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    #remove Turkish chars
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")

    return no_U

loaded_end_to_end_model = tf.keras.models.load_model("end_to_end_model")
# A context manager closes the file once the dictionary has been loaded
with open("id_to_category.pkl", "rb") as pkl_file:
    id_to_category = pickle.load(pkl_file)

def classify (text):
  pred=loaded_end_to_end_model.predict([text])
  return id_to_category[np.argmax(pred)]

examples=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']

iface = gr.Interface(fn=classify, inputs="text", outputs="text", examples=examples)
iface.launch()

Here is the screenshot:

Lastly, we need to upload all the necessary files:

stop words
id to category
end-to-end model.

You can drag and drop these files over the browser.

Do not forget to commit after you upload files and folders.

Notice that each time you commit a change, the system will build the service automatically:

You can inspect the logs to monitor the installation or to locate error messages, if any.

When there are no errors, you will see the Running message. You can click the App button to interact with your model via the gradio interface. Well done!

SUMMARY

In this tutorial, we have learned:

What a Keras TextVectorization layer is
Why we need to use a Keras TextVectorization layer in Natural Language Processing (NLP) tasks
How to employ a Keras TextVectorization layer in Text Preprocessing
How to integrate a Keras TextVectorization layer to a trained model
How to save and load a Keras TextVectorization layer and a model with a Keras TextVectorization layer
How to integrate a Keras TextVectorization layer with TensorFlow Data Pipeline API (tf.data)
How to design, train, save and load an End-to-End model using Keras TextVectorization layer
How to deploy the End-to-End model with a Keras TextVectorization layer implemented with a custom standardize (custom_standardization) function.
How to use the Gradio library and the HuggingFace Spaces platform.

All the above topics are presented in a multi-class text classification context.

If you like this tutorial, please follow the Murat Karakaya Akademi YouTube channel and muratkarakaya.net.

Thank you for your patience!

#Keep Deep Learning :)


Friday, November 4, 2022