Welcome to www.muratkarakaya.net

As I have recently moved my blog to www.muratkarakaya.net, I'm uploading my posts gradually. Thank you for your understanding.

Friday, November 4, 2022

Keras Text Vectorization Layer: Configure, Adapt, Use, Save, Load, and Deploy

 

Author: Murat Karakaya
Date created: 05 Oct 2021
Last modified: 18 March 2023
Description: This is a tutorial about how to build, adapt, use, save, load, and deploy the Keras TextVectorization layer. You can access this tutorial on YouTube in English and Turkish: “TensorFlow Keras Text Vectorization Katmanı” / “TensorFlow Keras Text Vectorization Layer”.

In this tutorial, we will download a Kaggle Dataset in which there are 32 topics and more than 400K total reviews. We will use this dataset for a multi-class text classification task.

Our main aim is to learn how to effectively use the Keras TextVectorization layer in Text Processing and Text Classification.

The tutorial has 5 parts:

  • PART A: BACKGROUND
  • PART B: KNOW THE DATA
  • PART C: USE KERAS TEXT VECTORIZATION LAYER
  • PART D: BUILD AN END-TO-END MODEL
  • PART E: DEPLOY END-TO-END MODEL TO HUGGINGFACE SPACES USING GRADIO
  • SUMMARY

By the end of this tutorial, we will have covered:

  • What a Keras TextVectorization layer is
  • Why we need to use a Keras TextVectorization layer in Natural Language Processing (NLP) tasks
  • How to employ a Keras TextVectorization layer in Text Preprocessing
  • How to integrate a Keras TextVectorization layer into a trained model
  • How to save and load a Keras TextVectorization layer and a model with a Keras TextVectorization layer
  • How to integrate a Keras TextVectorization layer with TensorFlow Data Pipeline API (tf.data)
  • How to design, train, save, and load an End-to-End model using Keras TextVectorization layer
  • How to deploy the End-to-End model with a Keras TextVectorization layer that uses a custom standardization function (custom_standardization), using the Gradio library and HuggingFace Spaces

PART A: BACKGROUND

You can watch this part on YouTube in Turkish or English.

1. TERMINOLOGY & CONCEPTS

1.1. What is Text Vectorization?

Text Vectorization is the process of converting text into a numerical representation.

There are many different techniques proposed to convert text to a numerical form such as:

  • One-hot Encoding (OHE)
  • Count Vectorizer
  • Bag-of-Words (BOW)
  • N-grams
  • Term Frequency
  • Term Frequency-Inverse Document Frequency (TF-IDF)
  • Embedding
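
To make a few of the techniques above concrete, here is a minimal pure-Python sketch of one-hot encoding, a count vector (bag-of-words), and TF-IDF on a toy corpus. The three sentences, the function names, and the unsmoothed TF-IDF formula are illustrative assumptions; libraries such as Keras or scikit-learn use their own, slightly different weighting formulas.

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
docs = ["the cat sat", "the dog sat", "the dog barked"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, sorted for a stable column order
vocab = sorted({tok for doc in tokenized for tok in doc})

def one_hot(token):
    """One-hot encoding: a single 1 at the token's vocabulary position."""
    return [1 if token == v else 0 for v in vocab]

def count_vector(doc_tokens):
    """Bag-of-words / count vectorizer: how often each vocabulary token occurs."""
    counts = Counter(doc_tokens)
    return [counts[v] for v in vocab]

def tf_idf_vector(doc_tokens):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    counts = Counter(doc_tokens)
    n_docs = len(tokenized)
    vector = []
    for v in vocab:
        tf = counts[v] / len(doc_tokens)
        df = sum(1 for doc in tokenized if v in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector

print(vocab)
print(one_hot("cat"))
print(count_vector(tokenized[0]))
print(tf_idf_vector(tokenized[0]))
```

Note how the word "the", which occurs in every document, receives a TF-IDF weight of zero, while the rarer "cat" receives the largest weight in the first document.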

1.2. What is Text Preprocessing?

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more suitable form so that Machine Learning or Deep Learning algorithms can perform better.

The main phases of Text preprocessing:

  • Noise Removal (cleaning) — Removing unnecessary characters and formatting
  • Tokenization — break multi-word strings into smaller components
  • Normalization — a catch-all term for processing data; this includes stemming and lemmatization

Some of the common Noise Removal (cleaning) steps are:

  • Removal of Punctuations
  • Removal of Frequent words
  • Removal of Rare words
  • Removal of emojis
  • Removal of emoticons
  • Conversion of emoticons to words
  • Conversion of emojis to words
  • Removal of URLs
  • Removal of HTML tags
  • Chat words conversion
  • Spelling correction

Tokenization is about splitting strings of text into smaller pieces, or “tokens”. Paragraphs can be tokenized into sentences and sentences can be tokenized into words.

Noise Removal and Tokenization are staples of almost all text pre-processing pipelines. However, some data may require further processing through text normalization. Some of the common normalization steps are:

  • Upper or lowercasing
  • Stopword removal
  • Stemming — bluntly removing prefixes and suffixes from a word
  • Lemmatization — replacing a single-word token with its root
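
The three phases described above can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the regular expressions, the tiny English stop word set, and the function names are hypothetical and are not the preprocessing used later in this tutorial.

```python
import re
import string

STOP_WORDS = {"a", "the", "is", "are"}  # illustrative subset only

def remove_noise(text):
    """Noise removal: drop HTML tags and URLs, then replace punctuation with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs (before punctuation is touched)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def tokenize(text):
    """Tokenization: split the string on whitespace into word tokens."""
    return text.split()

def normalize(tokens):
    """Normalization: lowercase every token and remove stop words."""
    lowered = [tok.lower() for tok in tokens]
    return [tok for tok in lowered if tok not in STOP_WORDS]

def preprocess(text):
    return normalize(tokenize(remove_noise(text)))

sample = "The <b>battery</b> is GREAT! See https://example.com for details."
print(preprocess(sample))
```

The order of the steps matters: URLs are removed before punctuation, otherwise the URL would be broken into several ordinary-looking tokens.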

1.3. What is Keras Text Vectorization layer?

The tf.keras.layers.TextVectorization layer is one of the Keras Preprocessing layers.

We can preprocess the input with other tools as well, such as Python's built-in string methods or the scikit-learn library.

However, there are very important advantages to using the Keras Preprocessing layers:

  • You can build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.
  • You can build and export models that are truly end-to-end: models that accept raw data (images or raw structured data) as input; models that handle feature normalization or feature value indexing on their own.

Today, we will deal with the tf.keras.layers.TextVectorization layer, which:

  • turns raw strings into an encoded representation, and
  • produces a representation that can be read by an Embedding layer or a Dense layer.

That is, the tf.keras.layers.TextVectorization layer can be used in

  • Text Preprocessing and
  • Text Vectorization

2. IMPORT LIBRARIES

IMPORTANT: When I prepared this tutorial on 05 Oct 2021, the then-current version (2.6.0) of TF and Keras generated errors when saving and loading the tf.keras.layers.TextVectorization layer.

However, the nightly version has no problem handling these operations.

For more information about the bug, please see here

import tensorflow as tf

from tensorflow import keras

print("tf version:",tf.__version__)

print("keras version:", keras.__version__)

tf version: 2.6.0

keras version: 2.6.0

Therefore, below I first install the TF nightly version.

tf version: 2.8.0-dev20211005
keras version: 2.7.0
pip install tf-nightly --quiet --upgrade
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import re
import string
import random
from sklearn.model_selection import train_test_split
print("tf version:",tf.__version__)
print("keras version:", keras.__version__)
tf version: 2.8.0-dev20211008
keras version: 2.8.0
#@title Record Each Cell's Execution Time
!pip install ipython-autotime

%load_ext autotime

3. DOWNLOAD A KAGGLE DATASET INTO GOOGLE COLAB

The Multi-Class Classification Dataset for Turkish is a benchmark dataset for the Turkish text classification task.

It contains 430K comments/reviews for a total of 32 categories of products or services.

Each category has roughly 13K comments.

A baseline algorithm, Naive Bayes, achieves an F1 score of 84%.

My blog post explaining how to download Kaggle Datasets is here.

My video tutorial explaining how to download Kaggle Datasets is here: Turkish/English

from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive
time: 3min 42s (started: 2021-10-08 14:36:24 +00:00)
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/Colab Notebooks/input"
time: 2.88 ms (started: 2021-10-08 14:40:07 +00:00)
#changing the working directory
%cd "/content/gdrive/MyDrive/Colab Notebooks/input"
/content/gdrive/MyDrive/Colab Notebooks/input
time: 1.29 s (started: 2021-10-08 14:40:07 +00:00)
#get the api command from kaggle dataset page
#!kaggle datasets download -d savasy/multiclass-classification-data-for-turkish-tc32
time: 649 µs (started: 2021-10-08 14:40:08 +00:00)
# check the downloaded zip file
!ls
120001_PH1.csv generatedReviews.csv kaggle.json tr_stop_word.txt
320d.csv generatedReviews_final.csv model.png vocabPickle
corona.csv generatedReviews_plus.csv ticaret-yorum.csv
time: 328 ms (started: 2021-10-08 14:40:08 +00:00)
# unzipping the zip files and deleting the zip files
!unzip \*.zip && rm *.zip
unzip: cannot find or open *.zip, *.zip.zip or *.zip.ZIP.

No zipfiles found.
time: 131 ms (started: 2021-10-08 14:40:09 +00:00)
# check the downloaded csv file
!ls
120001_PH1.csv generatedReviews.csv kaggle.json tr_stop_word.txt
320d.csv generatedReviews_final.csv model.png vocabPickle
corona.csv generatedReviews_plus.csv ticaret-yorum.csv
time: 125 ms (started: 2021-10-08 14:40:09 +00:00)

4. LOAD STOP WORDS IN TURKISH

As you might know, “stop words” are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. Stop words are commonly removed in Text Mining and Natural Language Processing (NLP) because they are so frequent that they carry very little useful information.

I begin by loading an existing list of Turkish stop words below:

tr_stop_words = pd.read_csv('tr_stop_word.txt',header=None)
for each in tr_stop_words.values[:5]:
    print(each[0])
ama
amma
anca
ancak
bu
time: 302 ms (started: 2021-10-08 14:40:09 +00:00)

5. LOAD THE DATASET

After downloading the dataset from the Kaggle website, we can load it by using the Pandas read_csv() function:

data = pd.read_csv('ticaret-yorum.csv')
pd.set_option('max_colwidth', 400)
time: 5.97 s (started: 2021-10-08 14:40:09 +00:00)

PART B: KNOW THE DATA

You can watch this part on YouTube in Turkish or English.

6. EXPLORE THE DATASET

Before getting into the details of how to use the tf.keras.layers.TextVectorization layer, let me introduce the dataset briefly.

Shuffle Data

It is a good habit to shuffle the data as the very first preprocessing step, before doing anything else!

Actually, I will shuffle the data again at the last step of the pipeline, but it does not hurt to shuffle it twice :))

data= data.sample(frac=1)
time: 103 ms (started: 2021-10-08 14:40:15 +00:00)

Summary Information about the dataset

Get the initial information about the dataset:

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 431306 entries, 60837 to 242258
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 431306 non-null object
1 text 431306 non-null object
dtypes: object(2)
memory usage: 9.9+ MB
time: 112 ms (started: 2021-10-08 14:40:15 +00:00)

We have a total of 431306 rows and 2 columns: category & text.

According to data.info(), there are no null values in the dataset. If there were any, we could drop them and verify the result as follows:

data.dropna(inplace=True)

data.isnull().sum()

Sample Reviews and their categories:

data.head()
[Output: table of the first five rows with the columns category and text]
time: 19.6 ms (started: 2021-10-08 14:40:15 +00:00)

7. CREATE A TENSORFLOW DATA PIPELINE FOR TEXT PREPROCESSING & VECTORIZATION

So far, we have just observed some properties of the raw data. Using these observations, we are ready to preprocess the text data for a classifier model.

Below, we will begin to create a TensorFlow data pipeline that includes the Keras Text Vectorization layer for preprocessing the data and preparing it for a classifier.

A pipeline for a text model mostly involves extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.

In this tutorial, I will use the TensorFlow “tf.data” API. If you are not familiar with the “tf.data” pipeline API, it is worth reviewing an introduction to it before continuing.

Convert Categories From Strings to Integer Ids

Observe that the categories (topics/classes) of the reviews are strings:

data["category"]
60837 cep-telefon-kategori
218953 kamu-hizmetleri
325173 mutfak-arac-gerec
188348 icecek
183962 hizmet-sektoru
...
21408 anne-bebek
152087 gida
130392 etkinlik-ve-organizasyon
51513 bilgisayar
242258 kargo-nakliyat
Name: category, Length: 431306, dtype: object



time: 8.4 ms (started: 2021-10-08 14:40:15 +00:00)

We need to create integer category ids from the string category names by adding a new column, “category_id”, to the data frame:

data["category"] = data["category"].astype('category')
data["category_id"] = data["category"].cat.codes
data.head()
[Output: table of the first five rows with the columns category, text, and category_id]
time: 86 ms (started: 2021-10-08 14:40:15 +00:00)

Lastly, we can check the number of categories. Note that it should be 32:

data['category']
60837 cep-telefon-kategori
218953 kamu-hizmetleri
325173 mutfak-arac-gerec
188348 icecek
183962 hizmet-sektoru
...
21408 anne-bebek
152087 gida
130392 etkinlik-ve-organizasyon
51513 bilgisayar
242258 kargo-nakliyat
Name: category, Length: 431306, dtype: category
Categories (32, object): ['alisveris', 'anne-bebek', 'beyaz-esya', 'bilgisayar', ..., 'spor',
'temizlik', 'turizm', 'ulasim']



time: 14.2 ms (started: 2021-10-08 14:40:15 +00:00)

Build a Dictionary for id to text category (topic) look-up:

id_to_category = pd.Series(data.category.values,index=data.category_id).to_dict()
id_to_category
{0: 'alisveris',
1: 'anne-bebek',
2: 'beyaz-esya',
3: 'bilgisayar',
4: 'cep-telefon-kategori',
5: 'egitim',
6: 'elektronik',
7: 'emlak-ve-insaat',
8: 'enerji',
9: 'etkinlik-ve-organizasyon',
10: 'finans',
11: 'gida',
12: 'giyim',
13: 'hizmet-sektoru',
14: 'icecek',
15: 'internet',
16: 'kamu-hizmetleri',
17: 'kargo-nakliyat',
18: 'kisisel-bakim-ve-kozmetik',
19: 'kucuk-ev-aletleri',
20: 'medya',
21: 'mekan-ve-eglence',
22: 'mobilya-ev-tekstili',
23: 'mucevher-saat-gozluk',
24: 'mutfak-arac-gerec',
25: 'otomotiv',
26: 'saglik',
27: 'sigortacilik',
28: 'spor',
29: 'temizlik',
30: 'turizm',
31: 'ulasim'}



time: 74 ms (started: 2021-10-08 14:40:16 +00:00)
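
The id assignment can also be reproduced without Pandas, which may help to see what the category codes represent. The sketch below assumes, as the alphabetically ordered dictionary above suggests, that ids follow the sorted order of the distinct category names. The short list of labels is a made-up sample, not the real dataset.

```python
# Made-up sample of category labels (the real column has 431306 entries)
categories = ["gida", "bilgisayar", "gida", "alisveris", "bilgisayar"]

# Distinct names in sorted order; the position in this list becomes the id
unique_sorted = sorted(set(categories))

category_to_id = {name: idx for idx, name in enumerate(unique_sorted)}
id_to_category = {idx: name for name, idx in category_to_id.items()}

# Integer id for every row, analogous to the category_id column
category_ids = [category_to_id[name] for name in categories]

print(category_to_id)
print(category_ids)
```

Keeping both directions of the mapping is convenient: the forward mapping produces training targets, and the reverse mapping turns a predicted id back into a readable category name.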

Reduce the Size of the Dataset

Since testing your pipeline on a large dataset takes more time, you may prefer to work with a portion of the raw dataset, as below:

#limit the number of samples to be used in testing the pipeline
#data_size= 1000 #instead of 431306
#data= data[:data_size]
#data.info()
time: 1.55 ms (started: 2021-10-08 14:40:16 +00:00)

Split the Raw Dataset into Train and Test Datasets

To prevent data leakage while preprocessing the text data, we first need to split the data into Train and Test data sets.

Data leakage refers to a mistake made by the creator of a machine learning model in which they accidentally share information between the test and training data sets. Typically, when splitting a data set into testing and training sets, the goal is to ensure that no data is shared between the two. This is because the test set’s purpose is to simulate real-world, unseen data. However, when evaluating a model, we do have full access to both our train and test sets, so it is up to us to ensure that no data in the training set is present in the test set.

In our case, since we want to classify reviews, we must not use the test reviews when adapting the text vectorization layer.

# save features and targets from the 'data'
features, targets = data['text'], data['category_id']

train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets,
    train_size=0.8,
    test_size=0.2,
    random_state=42,
    shuffle = True,
    stratify=targets
)
time: 286 ms (started: 2021-10-08 14:40:16 +00:00)
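
To illustrate what a stratified split does conceptually, here is a small standard-library sketch. It is not how scikit-learn implements train_test_split; the function name, the rounding rule, and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with each label split in the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 samples of class 0 and 20 samples of class 1
labels = [0] * 10 + [1] * 20
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(1 for i in test_idx if labels[i] == 0), "test samples of class 0")
```

Because every class is split separately, the class proportions in the train and test sets stay the same, and no sample index can end up in both sets.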

Build the Train & Test TensorFlow Datasets

First, we create TensorFlow Datasets from the raw Train Dataframe for further processing.

Note that:

  1. X: input (text/reviews)
  2. y: target value (categories/topics/class)

Observe that we have reviews in the text as input and categories (topics) in integer as target values:

train_features.values[:5]
array(['İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku',
'Philips TV İnternet Bağlantı Sorunu!,"Philips 32PFS5803/62 model Smart televizyonumu Vatan markete henüz 1 ay oldu alalı 1 ay her yere bağlanan TV internete bağlı olmasına rağmen YouTube.com, Smart TV, uygulama galerisi vb... Hiçbir uygulamayı açmıyor. Girmeye çalıştığım zaman ""bu TV\'yi internete bağlayın"" sayfası açılıyor ve bağlamaya ...Devamını oku"',
'Anadolu Hastanesi (Çanakkale) Muayene Süresi Kısalığı,20 aylık çocuğum var devamlı çocuk Dr. y. A muayene oluyorum ama artık aynı sorunla karşılaşmaktan bıktım. Alel acele 5 dakikada muayene yapıor hastanın çıkmasını beklemeden yeni hasta alıyor ve onun yanında çocuk giydiriliyor belki özel konuşacaklarmış ya da özel durumumuz var düşünen yok. Paramızl...Devamını oku',
'Digiturk Engelsiz Kampanyası Zulmü!,1014917147 numaralı aboneliğimle ilgili. Digiturk pazarlama stratejisi ile insanları örtülü olarak resmen yanıltıyor. Engelli indiriminden taahhütsüz olarak üye oldum. Sonra iptal etmek istedim 70 TL cayma bedeli talep ettiler. Taahhütsüz dedim ilk başta kurum yapıldı 1 yıl içinde iptal edilirse kur...Devamını oku',
'Rowenta Elektrik Süpürge İyi Çekmiyor!,RO3723TA-JSO-3617 ürün kodlu Rowenta marka elektrik süpürgemi 18 aydır kullanmama rağmen iyi çekmediği için Çanakkale servisine götürdüm ve garantisi bile henüz dolmayan süpürge için filtre temizliği yapılacağından 130 TL ücret istenmektedir. Daha yeni süpürge hem çekmiyor hem de filtre temizliği iç...Devamını oku'],
dtype=object)



time: 7.27 ms (started: 2021-10-08 14:40:16 +00:00)
train_targets.values[:5]
array([11, 6, 26, 20, 19], dtype=int8)



time: 5.68 ms (started: 2021-10-08 14:40:16 +00:00)

Prepare TensorFlow Datasets

We convert the data stored in Pandas Data Frame into data stored in TensorFlow Data Set as below:

# train X & y
train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_features.values, tf.string)
)
train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_targets.values, tf.int64),
)
# test X & y
test_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(test_features.values, tf.string)
)
test_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(test_targets.values, tf.int64),
)
time: 1.81 s (started: 2021-10-08 14:40:16 +00:00)

Decide the dictionary size and the review size

For preprocessing the text, we need to decide the dictionary (vocabulary) size and the review (text) length.

vocab_size = 20000  # Only consider the top 20K words
max_len = 50 # Maximum review (text) size in words
time: 1.49 ms (started: 2021-10-08 14:40:18 +00:00)
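
One possible way to sanity-check a choice of max_len is to look at the distribution of review lengths. The sketch below is a suggestion rather than the procedure used in this tutorial: it computes a nearest-rank percentile of word counts and the share of texts that a given limit would truncate. The helper names and the synthetic texts are hypothetical.

```python
def word_lengths(texts):
    """Sorted list of word counts, one per text."""
    return sorted(len(text.split()) for text in texts)

def percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    rank = max(1, (q * len(sorted_values) + 99) // 100)  # integer ceiling of q*n/100
    return sorted_values[rank - 1]

def truncated_share(lengths, max_len):
    """Fraction of texts that are longer than max_len and would be cut."""
    return sum(1 for n in lengths if n > max_len) / len(lengths)

# Synthetic texts with 1, 2, ..., 10 words (stand-in for the real reviews)
texts = [" ".join(["kelime"] * n) for n in range(1, 11)]
lengths = word_lengths(texts)

candidate_max_len = percentile(lengths, 90)
print(candidate_max_len, truncated_share(lengths, candidate_max_len))
```

On real data, the same functions could be applied to the training reviews only, so that the test set does not influence the choice.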

PART C: USE KERAS TEXT VECTORIZATION LAYER

You can watch this part on YouTube in Turkish or English.

8. PREPROCESS THE TEXT WITH THE KERAS TEXTVECTORIZATION LAYER

8.1. Define your own custom_standardization function

First, I define a function that will preprocess the given text. The custom_standardization function converts the given string to a standard form by applying several transformations:

  • convert all characters to lowercase
  • remove special symbols, extra spaces, HTML tags, digits, and punctuations
  • remove stop words
  • replace the special Turkish letters with the corresponding English letters.

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """Lowercase, strip noise and stop words, and map Turkish letters to ASCII."""
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    # drop the trailing "devamını oku" ("read more") marker of truncated reviews
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    # remove every token that contains a digit
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*","")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    # remove stop words (padded with spaces so that only whole words match)
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
        no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    # replace Turkish characters with their ASCII counterparts
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")

    return no_U
time: 17.1 ms (started: 2021-10-08 14:40:18 +00:00)

Quickly verify that custom_standardization works: try it on a sample Turkish input:

input_string = "Bu Issız Öğlenleyin de;  şunu ***1 Pijamalı Hasta***, ve  Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku"
print("input: ", input_string)
output_string= custom_standardization(input_string)
print("output: ", output_string.numpy().decode("utf-8"))
input: Bu Issız Öğlenleyin de; şunu ***1 Pijamalı Hasta***, ve Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku
output: issiz oglenleyin pijamali hasta i̇ste yagiz sofore cabucak guvendi
time: 58.8 ms (started: 2021-10-08 14:40:18 +00:00)
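
In the sample output above, “İşte” comes out with an extra dot mark over its first letter rather than as plain “iste”. This appears to be because lowercasing the dotted capital İ yields the letter i followed by a combining dot character. The plain-Python sketch below shows the same effect and one way to fold it away. It is an illustration only: the helper name ascii_fold is hypothetical, and adding a similar replacement step to custom_standardization would be an untested change to the original function.

```python
import unicodedata

DOTTED_CAPITAL_I = "\u0130"      # the Turkish capital letter I with a dot above
COMBINING_DOT_ABOVE = "\u0307"   # left behind after lowercasing that letter

lowered = DOTTED_CAPITAL_I.lower()
print([unicodedata.name(ch) for ch in lowered])

# Dotless i, o/c/s/g/u with diacritics, mapped to their plain ASCII letters
TURKISH_TO_ASCII = str.maketrans("\u0131\u00f6\u00e7\u015f\u011f\u00fc", "iocsgu")

def ascii_fold(text):
    """Lowercase, drop the combining dot, and map Turkish letters to ASCII."""
    lowered_text = text.lower().replace(COMBINING_DOT_ABOVE, "")
    return lowered_text.translate(TURKISH_TO_ASCII)

# The escaped string below spells two words from the sample input
print(ascii_fold("\u0130\u015fte Ya\u011f\u0131z"))
```

Without such a step, a word that starts with the dotted capital letter and the same word written in lowercase end up as two different vocabulary entries.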

8.2. Configure the Keras TextVectorization layer

To preprocess the text, I will use the Keras TextVectorization layer.

tf.keras.layers.TextVectorization(
max_tokens=None,
standardize="lower_and_strip_punctuation",
split="whitespace",
ngrams=None,
output_mode="int",
output_sequence_length=None,
pad_to_max_tokens=False,
vocabulary=None,
**kwargs
)

The Keras TextVectorization layer processes each example in the dataset as follows:

  1. Standardize each example (usually lowercasing + punctuation stripping)
  2. Split each example into substrings (usually words)
  3. Recombine substrings into tokens (usually ngrams)
  4. Index tokens (associate a unique int value with each token)
  5. Transform each example using this index, either into a vector of ints or a dense float vector.
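
The five steps above can be mimicked in a few lines of plain Python. This is a simplified sketch of the idea, not the layer's actual implementation: the function names, the tie-breaking rule for equally frequent tokens, and the toy corpus are assumptions.

```python
import string
from collections import Counter

def standardize(text):
    """Step 1: lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def split(text):
    """Step 2: split on whitespace."""
    return text.split()

def recombine(tokens, n=1):
    """Step 3: build n-grams (n=1 keeps single words)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(texts):
    """Step 4: give every token an integer id; 0 is padding and 1 is [UNK]."""
    counts = Counter(tok for t in texts for tok in recombine(split(standardize(t))))
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(["", "[UNK]"] + ordered)}

def transform(text, index, length):
    """Step 5: map tokens to ids, then pad with 0 or truncate to a fixed length."""
    tokens = recombine(split(standardize(text)))
    ids = [index.get(tok, index["[UNK]"]) for tok in tokens]
    return (ids + [0] * length)[:length]

corpus = ["The cat sat.", "The cat ran!", "A dog sat."]
index = build_index(corpus)
print(index)
print(transform("The bird sat", index, length=5))
```

The unseen word "bird" is mapped to the [UNK] id 1, and the short sentence is padded with zeros up to the requested length.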

Let’s build our TextVectorization layer by providing:

  1. The custom_standardization() function for the standardize method (callable).
  2. The vocab_size as the max_tokens number: The max_tokens is the maximum size of the vocabulary that will be created from the dataset. If None, there is no cap on the size of the vocabulary. Note that this vocabulary contains 1 OOV (Out Of Vocabulary) token, so the effective number of tokens is (max_tokens - 1 - (1 if output_mode == "int" else 0)).
  3. The “int” keyword as the output_mode: an optional specification for the output of the layer. Values can be “int”, “multi_hot”, “count”, or “tf_idf”.

These values configure the layer as follows:

  • “int”: Outputs integer indices, one integer index per split string token. When output_mode == “int”, 0 is reserved for masked locations; this reduces the vocab size to max_tokens - 2 instead of max_tokens - 1.
  • “multi_hot”: Outputs a single int array per batch, of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the batch item.
  • “count”: Like “multi_hot”, but the int array contains a count of the number of times the token at that index appeared in the batch item.
  • “tf_idf”: Like “multi_hot”, but the TF-IDF algorithm is applied to find the value in each token slot.

For “int” output, any shape of input and output is supported.

For all other output modes, currently only rank 1 inputs (and rank 2 outputs after splitting) are supported.

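  4. The max_len as the output_sequence_length: in “int” mode, the output is padded or truncated to exactly this many tokens.

# Create a vectorization layer (it is adapted to the text in the next step)
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size+2,  # room for the padding ('') and [UNK] entries
    output_mode="int",
    output_sequence_length=max_len,
)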
time: 158 ms (started: 2021-10-08 14:40:18 +00:00)
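
To see how the “int”, “count”, and “multi_hot” output modes described above differ, here is a plain-Python sketch that encodes the same token list in all three ways. It only imitates the behaviour described in the text: the tiny vocabulary and the helper names are hypothetical, and the “tf_idf” mode is left out.

```python
from collections import Counter

# "int" mode vocabulary: index 0 is the padding/mask slot, index 1 is [UNK]
vocab_int = ["", "[UNK]", "urun", "kargo", "iade"]
# The other modes have no padding slot, so [UNK] comes first
vocab_bag = vocab_int[1:]

def encode_int(tokens, sequence_length):
    """One id per token, padded with 0 or truncated to sequence_length."""
    lookup = {tok: i for i, tok in enumerate(vocab_int)}
    ids = [lookup.get(tok, 1) for tok in tokens]
    return (ids + [0] * sequence_length)[:sequence_length]

def encode_count(tokens):
    """One slot per vocabulary entry holding how often it occurred."""
    counts = Counter(tok if tok in vocab_bag else "[UNK]" for tok in tokens)
    return [counts[tok] for tok in vocab_bag]

def encode_multi_hot(tokens):
    """Like encode_count, but only records presence (1) or absence (0)."""
    return [1 if c > 0 else 0 for c in encode_count(tokens)]

tokens = ["urun", "kargo", "urun", "fatura"]  # "fatura" is out of vocabulary
print(encode_int(tokens, sequence_length=6))
print(encode_count(tokens))
print(encode_multi_hot(tokens))
```

The “int” output keeps word order and therefore suits an Embedding layer, while the other two produce fixed-size vectors that discard order and can feed a Dense layer directly.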

8.3. Adapt the Keras TextVectorization layer with the training data set, (not test data set!)

The TextVectorization preprocessing layer has an internal state that can be computed based on a sample of the training data. That is, TextVectorization holds a mapping between string tokens and integer indices.

Thus, we will adapt the TextVectorization preprocessing layer ONLY to the training data.

Please note that, to prevent any data leakage, we DO NOT adapt the TextVectorization preprocessing layer to the whole (train & test) data.

vectorize_layer.adapt(train_features)
vocab = vectorize_layer.get_vocabulary() # To get words back from token indices
time: 2min 22s (started: 2021-10-08 14:40:18 +00:00)
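
Conceptually, adapting the layer amounts to counting tokens in the training texts and keeping the most frequent ones. The sketch below imitates that idea in plain Python to show why adapting on the training data alone matters. It is not the Keras implementation: the function names, the alphabetical tie-breaking, and the three toy reviews are assumptions.

```python
from collections import Counter

def build_vocabulary(train_texts, max_tokens):
    """Keep the most frequent training tokens; slots 0 and 1 are '' and [UNK]."""
    counts = Counter(tok for text in train_texts for tok in text.split())
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ordered[: max_tokens - 2]

def encode(text, vocabulary):
    """Map every token to its index, or to 1 ([UNK]) if it is not in the vocabulary."""
    lookup = {tok: i for i, tok in enumerate(vocabulary)}
    return [lookup.get(tok, 1) for tok in text.split()]

train_texts = ["kargo gec geldi", "kargo hasarli geldi", "urun guzel"]
test_text = "kargo kayip"   # "kayip" never occurs in the training texts

vocabulary = build_vocabulary(train_texts, max_tokens=4)
print(vocabulary)
print(encode(test_text, vocabulary))
```

A word that appears only in the test data is encoded as [UNK], which is the situation the model will face with real unseen reviews after deployment.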

Let’s see some example conversions:

print("vocab has the ", len(vocab)," entries")
print("vocab has the following first 10 entries")
for word in range(10):
    print(word, " represents the word: ", vocab[word])

for X in train_features[:2]:
    print(" Given raw data: " )
    print(X)
    tokenized = vectorize_layer(tf.expand_dims(X, -1))
    print(" Tokenized and Transformed to a vector of integers: " )
    print (tokenized)
    print(" Text after Tokenized and Transformed: ")
    transformed = ""
    for each in tf.squeeze(tokenized):
        transformed= transformed+ " "+ vocab[each]
    print(transformed)
vocab has the 20002 entries
vocab has the following first 10 entries
0 represents the word:
1 represents the word: [UNK]
2 represents the word: ne
3 represents the word: tl
4 represents the word: gun
5 represents the word: urun
6 represents the word: aldim
7 represents the word: siparis
8 represents the word: musteri
9 represents the word: tarihinde
Given raw data:
İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku
Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 3451 3133 1770 1566 1605 1709 3133 6372 640 1 2025 1605
1709 64 209 2335 1 4024 853 184 1037 1 72 2
1709 623 177 1 18408 367 1 282 2582 3586 1 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0]], shape=(1, 50), dtype=int64)
Text after Tokenized and Transformed:
i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]
Given raw data:
Philips TV İnternet Bağlantı Sorunu!,"Philips 32PFS5803/62 model Smart televizyonumu Vatan markete henüz 1 ay oldu alalı 1 ay her yere bağlanan TV internete bağlı olmasına rağmen YouTube.com, Smart TV, uygulama galerisi vb... Hiçbir uygulamayı açmıyor. Girmeye çalıştığım zaman ""bu TV'yi internete bağlayın"" sayfası açılıyor ve bağlamaya ...Devamını oku"
Tokenized and Transformed to a vector of integers:
tf.Tensor(
[[ 226 44 354 1078 17 226 215 206 9049 1079 2556 460
11 574 11 19 294 13253 44 2481 1384 124 648 141
206 44 672 1 2262 22 2862 890 5564 2058 67 44
469 2481 1 4955 1862 15099 0 0 0 0 0 0
0 0]], shape=(1, 50), dtype=int64)
Text after Tokenized and Transformed:
philips tv i̇nternet baglanti sorunu philips model smart televizyonumu vatan markete henuz ay alali ay her yere baglanan tv internete bagli olmasina youtube com smart tv uygulama [UNK] vb hicbir uygulamayi acmiyor girmeye calistigim zaman tv yi internete [UNK] sayfasi aciliyor baglamaya
time: 157 ms (started: 2021-10-08 14:42:41 +00:00)
vocab[:5]
['', '[UNK]', 'ne', 'tl', 'gun']



time: 4.75 ms (started: 2021-10-08 14:42:41 +00:00)

8.4. Save & Load TextVectorization layer

Adapting the Keras TextVectorization layer on a large text dataset takes a considerable amount of time (more than two minutes in the run above), and the adapted layer often has to be ported to a different deployment environment. It is therefore good to know how to save and load it.

How to save a Keras TextVectorization layer?

There are currently 2 ways of doing it:

  • save the Keras TextVectorization layer in a Keras Model
  • save the Keras TextVectorization layer as a pickle file.

In this tutorial, I will use the first approach as it is native to the TF/Keras environment.
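
The pickle route is not demonstrated in this tutorial. As a rough idea of what it could look like, the sketch below pickles only a vocabulary list and reads it back. It rests on several assumptions: the short list stands in for the output of vectorize_layer.get_vocabulary(), the file name is made up, and rebuilding a layer from the restored list (for example through the vocabulary argument shown in the layer signature in section 8.2) is not covered and would need to be verified against the installed TensorFlow version.

```python
import os
import pickle
import tempfile

# Stand-in for vectorize_layer.get_vocabulary() (assumption for this sketch)
vocab = ["", "[UNK]", "ne", "tl", "gun"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.pkl")   # hypothetical file name

    # Save the vocabulary list
    with open(path, "wb") as f:
        pickle.dump(vocab, f)

    # Load it back, for example in a different environment
    with open(path, "rb") as f:
        restored_vocab = pickle.load(f)

print(restored_vocab == vocab)
```

Note that a pickled vocabulary does not carry the custom standardization function with it, so that function has to be available wherever the vocabulary is reused.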

8.4.1. Ensure that you are on the correct directory path :)

%cd ../models/
%ls
/content/gdrive/My Drive/Colab Notebooks/models
checkpoint/ MultiClassTextClassificationExported/
end_to_end_model/ MultitopicTextGenerator/
MultiClassTextClassification/ vectorize_layer_model/
time: 366 ms (started: 2021-10-08 14:42:41 +00:00)

8.4.2. Create a temporary Keras model by adding the adapted Keras TextVectorization layer

# Create model.
vectorize_layer_model = tf.keras.models.Sequential()
vectorize_layer_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
vectorize_layer_model.add(vectorize_layer)
vectorize_layer_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
text_vectorization (TextVec (None, 50) 0
torization)

=================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________
time: 256 ms (started: 2021-10-08 14:42:41 +00:00)

8.4.3. Save the temporary model including the adapted Keras TextVectorization layer

filepath = "vectorize_layer_model"
time: 721 µs (started: 2021-10-08 14:42:42 +00:00)
vectorize_layer_model.save(filepath, save_format="tf")
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
INFO:tensorflow:Assets written to: vectorize_layer_model/assets
time: 4.86 s (started: 2021-10-08 14:42:42 +00:00)
%ls
checkpoint/ MultiClassTextClassificationExported/
end_to_end_model/ MultitopicTextGenerator/
MultiClassTextClassification/ vectorize_layer_model/
time: 153 ms (started: 2021-10-08 14:42:46 +00:00)

8.4.4. Load the vectorize_layer_model back to check if saving is successful

loaded_vectorize_layer_model = tf.keras.models.load_model(filepath)
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
time: 1.93 s (started: 2021-10-08 14:42:47 +00:00)

8.4.5 Retrieve the loaded Keras TextVectorization layer

Here, you have 2 options:

  • use the loaded_vectorize_layer_model.predict() method to apply the Keras TextVectorization layer, or
  • get the Keras TextVectorization layer out of the loaded model as below:

loaded_vectorize_layer = loaded_vectorize_layer_model.layers[0]
time: 1.97 ms (started: 2021-10-08 14:42:49 +00:00)

8.4.6. Compare the original and loaded TextVectorization layers

loaded_vocab=loaded_vectorize_layer.get_vocabulary()
print("original vocab has the ", len(vocab)," entries")
print("loaded vocab has the ", len(loaded_vocab)," entries")
print("loaded vocab has the following first 10 entries")
for word in range(10):
    print(word, " represents the word: ")
    print(vocab[word], " in original vocab")
    print(loaded_vocab[word], " in loaded vocab")
for X in train_features[:1]:
    print(" Given raw data: " )
    print(X)

    tokenized = vectorize_layer(tf.expand_dims(X, -1))
    print(" Tokenized and Transformed to a vector of integers by the original vectorize layer:" )
    print (tokenized)

    tokenized = loaded_vectorize_layer(tf.expand_dims(X, -1))
    print(" Tokenized and Transformed to a vector of integers by the loaded vectorize layer:" )
    print (tokenized)

    tokenized = loaded_vectorize_layer_model.predict(tf.expand_dims(X, -1))
    print(" Tokenized and Transformed to a vector of integers by the loaded_vectorize_layer_model:" )
    print (tokenized)

    print(" Text after Tokenized and Transformed by the original vectorize layer:: ")
    transformed = ""
    for each in tf.squeeze(tokenized):
        transformed= transformed+ " "+ vocab[each]
    print(transformed)

    print(" Text after Tokenized and Transformed by the loaded vectorize layer:")
    transformed = ""
    for each in tf.squeeze(tokenized):
        transformed= transformed+ " "+ loaded_vocab[each]
    print(transformed)
original vocab has the 20002 entries
loaded vocab has the 20002 entries
loaded vocab has the following first 10 entries
0 represents the word:
in original vocab
in loaded vocab
1 represents the word:
[UNK] in original vocab
[UNK] in loaded vocab
2 represents the word:
ne in original vocab
ne in loaded vocab
3 represents the word:
tl in original vocab
tl in loaded vocab
4 represents the word:
gun in original vocab
gun in loaded vocab
5 represents the word:
urun in original vocab
urun in loaded vocab
6 represents the word:
aldim in original vocab
aldim in loaded vocab
7 represents the word:
siparis in original vocab
siparis in loaded vocab
8 represents the word:
musteri in original vocab
musteri in loaded vocab
9 represents the word:
tarihinde in original vocab
tarihinde in loaded vocab
Given raw data:
İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku
Tokenized and Transformed to a vector of integers by the original vectorize layer:
tf.Tensor(
[[ 3451 3133 1770 1566 1605 1709 3133 6372 640 1 2025 1605
1709 64 209 2335 1 4024 853 184 1037 1 72 2
1709 623 177 1 18408 367 1 282 2582 3586 1 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0]], shape=(1, 50), dtype=int64)
Tokenized and Transformed to a vector of integers by the loaded vectorize layer:
tf.Tensor(
[[ 3451 3133 1770 1566 1605 1709 3133 6372 640 1 2025 1605
1709 64 209 2335 1 4024 853 184 1037 1 72 2
1709 623 177 1 18408 367 1 282 2582 3586 1 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0]], shape=(1, 50), dtype=int64)
Tokenized and Transformed to a vector of integers by the loaded_vectorize_layer_model:
[[ 3451 3133 1770 1566 1605 1709 3133 6372 640 1 2025 1605
1709 64 209 2335 1 4024 853 184 1037 1 72 2
1709 623 177 1 18408 367 1 282 2582 3586 1 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0]]
Text after Tokenized and Transformed by the original vectorize layer::
i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]
Text after Tokenized and Transformed by the loaded vectorize layer:
i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]
time: 787 ms (started: 2021-10-08 14:42:49 +00:00)

As you see above, we successfully saved and loaded the adapted Keras TextVectorization layer!
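
If you prefer a programmatic check over reading the printout, a small helper can confirm that two vocabularies match entry by entry. The following is only a minimal sketch in plain Python: first_mismatch is a hypothetical helper name, and the two short lists stand in for vocab and loaded_vocab (their entries are copied from the first items printed above).

```python
def first_mismatch(vocab_a, vocab_b):
    """Return the first index where the vocabularies differ, or None if they are identical."""
    for index, (token_a, token_b) in enumerate(zip(vocab_a, vocab_b)):
        if token_a != token_b:
            return index
    # All shared positions match; a length difference is still a mismatch.
    if len(vocab_a) != len(vocab_b):
        return min(len(vocab_a), len(vocab_b))
    return None


# Stand-ins for vocab and loaded_vocab (assumed values for illustration)
original = ["", "[UNK]", "ne", "tl", "gun", "urun"]
loaded = ["", "[UNK]", "ne", "tl", "gun", "urun"]

print(first_mismatch(original, loaded))       # None
print(first_mismatch(original, loaded[:-1]))  # 5
```

In the notebook, you could call the same helper with vocab and loaded_vocab instead of the toy lists.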

We can continue to the TensorFlow data pipeline with the adapted Keras TextVectorization layer:

pwd
'/content/gdrive/My Drive/Colab Notebooks/models'



time: 11.9 ms (started: 2021-10-08 14:42:49 +00:00)

9. APPLY KERAS TEXTVECTORIZATION TO TRAIN & TEST DATA SETS

We can define a function to apply the Keras TextVectorization on a given string as follows:

def convert_text_input(sample):
    text = sample
    text = tf.expand_dims(text, -1)
    #return tf.squeeze(vectorize_layer(text))
    return tf.squeeze(loaded_vectorize_layer(text))
time: 1.48 ms (started: 2021-10-08 14:42:49 +00:00)

We use the map() function of the TensorFlow tf.data API (TF Data Pipeline) to apply convert_text_input() to every sample in the text column (reviews) of the training and test datasets.

# Train X
train_text_ds = train_text_ds_raw.map(convert_text_input,
                                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
# Test X
test_text_ds = test_text_ds_raw.map(convert_text_input,
                                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
time: 696 ms (started: 2021-10-08 14:42:49 +00:00)

Let’s see the converted/encoded texts (reviews)

for each in train_text_ds.take(3):
    print(each)
tf.Tensor(
[ 3451 3133 1770 1566 1605 1709 3133 6372 640 1 2025 1605
1709 64 209 2335 1 4024 853 184 1037 1 72 2
1709 623 177 1 18408 367 1 282 2582 3586 1 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0], shape=(50,), dtype=int64)
tf.Tensor(
[ 226 44 354 1078 17 226 215 206 9049 1079 2556 460
11 574 11 19 294 13253 44 2481 1384 124 648 141
206 44 672 1 2262 22 2862 890 5564 2058 67 44
469 2481 1 4955 1862 15099 0 0 0 0 0 0
0 0], shape=(50,), dtype=int64)
tf.Tensor(
[ 465 171 3144 673 378 1 192 1280 10 1273 414 1023
695 74 673 3805 102 25 1777 1 1706 1 6537 2406
673 1 8569 9825 9478 79 1001 788 975 414 1 348
1 28 348 13025 10 11558 1 0 0 0 0 0
0 0], shape=(50,), dtype=int64)
time: 154 ms (started: 2021-10-08 14:42:50 +00:00)
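
To make the integer sequences above less mysterious, here is a toy, plain-Python imitation of what the layer does to one already-standardized sample: look up each token, map unknown tokens to index 1 ([UNK]), and pad with 0 up to a fixed length. It is only an illustrative sketch, not the Keras implementation; toy_encode, toy_decode, and max_len=8 are assumptions made for the example, and the small vocabulary reuses the first entries printed earlier.

```python
# First ten vocabulary entries as printed earlier: index 0 is padding, index 1 is [UNK]
VOCAB = ["", "[UNK]", "ne", "tl", "gun", "urun", "aldim", "siparis", "musteri", "tarihinde"]


def toy_encode(text, vocab=VOCAB, max_len=8):
    """Map whitespace-separated tokens to vocabulary indices, then truncate or pad to max_len."""
    token_to_id = {token: index for index, token in enumerate(vocab)}
    ids = [token_to_id.get(token, 1) for token in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


def toy_decode(ids, vocab=VOCAB):
    """Turn indices back into tokens, skipping the padding index 0."""
    return " ".join(vocab[index] for index in ids if index != 0)


encoded = toy_encode("urun aldim siparis gelmedi")
print(encoded)              # [5, 6, 7, 1, 0, 0, 0, 0]
print(toy_decode(encoded))  # urun aldim siparis [UNK]
```

Note that decoding cannot recover an out-of-vocabulary word: "gelmedi" comes back as [UNK], just like the [UNK] entries in the decoded review above.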

10. GENERATE THE TRAIN SET BY COMBINING X & Y

  • X: the preprocessed & encoded reviews
  • y: the encoded categories
train_ds = tf.data.Dataset.zip(
(
train_text_ds,
train_cat_ds_raw
)
)
time: 3.9 ms (started: 2021-10-08 14:42:50 +00:00)

Similarly, let’s bundle test data sets as a single data set:

test_ds = tf.data.Dataset.zip(
(
test_text_ds,
test_cat_ds_raw
)
)
time: 1.7 ms (started: 2021-10-08 14:42:50 +00:00)

We can see the result of the Text Vectorization in the Data Pipelining as follows:

for X, y in train_ds.take(1):
    print("input (review) X.shape: ", X.shape)
    print("output (category) y.shape: ", y.shape)
    print("input (review) X: ", X)
    print("output (category) y: ", y)
    # Avoid shadowing the built-in input() by using descriptive names
    input_text = " ".join([vocab[_] for _ in np.squeeze(X)])
    output_text = id_to_category[y.numpy()]
    print("X: input (review) in text: ", input_text)
    print("y: output (category) in text: ", output_text)
input (review) X.shape: (50,)
output (category) y.shape: ()
input (review) X: tf.Tensor(
[ 3451 3133 1770 1566 1605 1709 3133 6372 640 1 2025 1605
1709 64 209 2335 1 4024 853 184 1037 1 72 2
1709 623 177 1 18408 367 1 282 2582 3586 1 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0], shape=(50,), dtype=int64)
output (category) y: tf.Tensor(11, shape=(), dtype=int64)
X: input (review) in text: i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]
y: output (category) in text: gida
time: 167 ms (started: 2021-10-08 14:42:50 +00:00)

11. FINALIZE TENSORFLOW DATA PIPELINE

Finalize TensorFlow Data Pipeline by setting necessary parameters for batching, shuffling, and optimizing as follows:

batch_size = 64
AUTOTUNE = tf.data.experimental.AUTOTUNE
buffer_size = train_ds.cardinality().numpy()

train_ds = train_ds.shuffle(buffer_size=buffer_size)\
                   .batch(batch_size=batch_size, drop_remainder=True)\
                   .cache()\
                   .prefetch(AUTOTUNE)

test_ds = test_ds.shuffle(buffer_size=buffer_size)\
                 .batch(batch_size=batch_size, drop_remainder=True)\
                 .cache()\
                 .prefetch(AUTOTUNE)
time: 17.7 ms (started: 2021-10-08 14:42:50 +00:00)

train_ds.element_spec
(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(64,), dtype=tf.int64, name=None))



time: 5.28 ms (started: 2021-10-08 14:42:50 +00:00)
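
The drop_remainder=True setting means that a last, incomplete batch never reaches the model. The plain-Python sketch below mimics only that batching rule on a small list; make_batches is a hypothetical helper written for this illustration and is not part of tf.data.

```python
def make_batches(samples, batch_size, drop_remainder=True):
    """Split samples into consecutive batches; optionally drop a final short batch."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


samples = list(range(10))
print(make_batches(samples, 4))                        # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(make_batches(samples, 4, drop_remainder=False))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With a batch size of 64, any samples left over beyond a multiple of 64 are skipped in the same way, so every batch has exactly 64 rows.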

PART D: BUILD AN END-TO-END MODEL

You can watch this part on YouTube in Turkish or English.

12. Create a Classification Model

For the sake of demonstration of the Keras TextVectorization layer, let's build a very simple model:

def create_model():
    inputs_tokens = layers.Input(shape=(max_len,), dtype=tf.int32)
    embedding_layer = layers.Embedding(vocab_size, 256)
    x = embedding_layer(inputs_tokens)
    x = layers.Flatten()(x)
    # No activation here: the model outputs logits for the 32 categories
    outputs = layers.Dense(32)(x)
    model = keras.Model(inputs=inputs_tokens, outputs=outputs)

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metric_fn = tf.keras.metrics.SparseCategoricalAccuracy()
    model.compile(optimizer="adam", loss=loss_fn, metrics=[metric_fn])

    return model
my_model=create_model()
my_model.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 50)] 0

embedding (Embedding) (None, 50, 256) 5120000

flatten (Flatten) (None, 12800) 0

dense (Dense) (None, 32) 409632

=================================================================
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 54.3 ms (started: 2021-10-08 14:42:51 +00:00)

13. Train the Classification Model

my_model.fit(train_ds, verbose=1, epochs=3)
Epoch 1/3
5391/5391 [==============================] - 251s 9ms/step - loss: 0.3019 - sparse_categorical_accuracy: 0.9310
Epoch 2/3
5391/5391 [==============================] - 46s 8ms/step - loss: 0.0424 - sparse_categorical_accuracy: 0.9897
Epoch 3/3
5391/5391 [==============================] - 46s 8ms/step - loss: 0.0049 - sparse_categorical_accuracy: 0.9993





<keras.callbacks.History at 0x7fef7fe05050>



time: 6min 30s (started: 2021-10-08 14:42:51 +00:00)
loss, accuracy = my_model.evaluate(test_ds)
print("Test accuracy: ", accuracy)
1347/1347 [==============================] - 56s 3ms/step - loss: 0.2385 - sparse_categorical_accuracy: 0.9511
Test accuracy: 0.9511414170265198
time: 55.6 s (started: 2021-10-08 14:49:21 +00:00)

14. An End-To-End Classification Model

Note that the above model expects batches of integer tensors as input:

Layer (type)                Output Shape              Param #   
=================================================================
input_3 (InputLayer) [(None, 50)] 0

Thus, you can NOT supply raw data (some text) to the model for prediction. TensorFlow/Keras would raise an error such as the one below:

raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
'THY Uçak biletimi değiştirmek için başvurdum. Kimse geri dönüş yapmadı!']

predictions=my_model.predict(raw_data)

ValueError: in user code: Exception encountered when calling layer "model" (type Functional).

Input 0 of layer "dense" is incompatible with the layer: expected axis -1 of input shape to have value 12800, but received input with shape (None, 256)

Call arguments received:
• inputs=tf.Tensor(shape=(None,), dtype=string)
• training=False
• mask=None

However, it is often a big advantage to design a model that accepts raw data as input and then preprocesses the data by itself.

For example, such a model can be easily exported to different platforms/environments without also exporting the preprocessing code!

Therefore, Keras provides several Preprocessing Layers so that we can integrate preprocessing logic as a layer into a Keras model.

Afterwards, we can export such models and use them on other platforms without rewriting the preprocessing code on the target platforms/environments.

These kinds of models can be called End-To-End Models. That is, an End-To-End model can accept Raw Input Data and preprocess it by itself.

What could be Raw Data?

It could be:

  • text
  • image
  • structured data
  • etc.

Let’s create an End-To-End Classification Model by integrating the adapted Keras TextVectorization layer into the trained model as the first layer.

You can create an End-To-End Model either by:

  • Keras Sequential API, or
  • Keras Functional API

14.1. Create an End-To-End Model with Keras Sequential API

end_to_end_model = tf.keras.Sequential([
    keras.Input(shape=(1,), dtype="string"),
    vectorize_layer,
    my_model,
    layers.Activation('softmax')
])

end_to_end_model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)
end_to_end_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
text_vectorization (TextVec (None, 50) 0
torization)

model (Functional) (None, 32) 5529632

activation (Activation) (None, 32) 0

=================================================================
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 282 ms (started: 2021-10-08 14:50:16 +00:00)
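
The trained my_model ends with a Dense layer that outputs raw scores (logits), which is why the Sequential version above appends a softmax activation. As a small numeric sketch in plain Python (the three logit values are made up for illustration), softmax turns the scores into probabilities that sum to 1 without changing which class has the highest score:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities; subtracting the max keeps exp() numerically stable."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, -1.0, 0.5]  # hypothetical scores for three classes
probs = softmax(logits)

print(round(sum(probs), 6))                               # 1.0
print(logits.index(max(logits)), probs.index(max(probs)))  # 0 0
```

This is why np.argmax() picks the same category whether it is applied to the logits or to the softmax output.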

14.2. Create an End-To-End Model with Keras Functional API

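inputs = keras.Input(shape=(1,), dtype="string")
x = vectorize_layer(inputs)
# NOTE: my_model outputs logits and no softmax is applied here, so with
# from_logits=False the reported loss is not reliable (accuracy is unaffected).
# Append layers.Activation('softmax') as in 14.1, or set from_logits=True,
# if you need a meaningful loss value.
outputs = my_model(x)
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)
end_to_end_model.summary()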
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 1)] 0

text_vectorization (TextVec (None, 50) 0
torization)

model (Functional) (None, 32) 5529632

=================================================================
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 284 ms (started: 2021-10-08 14:50:17 +00:00)

14.3. Test the End-to-End model with Raw (Text) Data

raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
'THY Uçak biletimi değiştirmek için başvurdum. Kimse geri dönüş yapmadı!']
predictions=end_to_end_model.predict(raw_data)
print(id_to_category[np.argmax(predictions[0])])
print(id_to_category[np.argmax(predictions[1])])
alisveris
ulasim
time: 608 ms (started: 2021-10-08 14:50:17 +00:00)
loss, accuracy = end_to_end_model.evaluate(test_features,test_targets)
print("end_to_end_model accuracy: ", accuracy)
2696/2696 [==============================] - 47s 17ms/step - loss: 2.3769 - accuracy: 0.9511
end_to_end_model accuracy: 0.9511488080024719
time: 46.8 s (started: 2021-10-08 14:50:18 +00:00)

14.4. Save the End-to-End model

end_to_end_model.save("end_to_end_model")
INFO:tensorflow:Assets written to: end_to_end_model/assets
time: 5.58 s (started: 2021-10-08 14:51:04 +00:00)

14.5. Load the End-to-End model

loaded_end_to_end_model = tf.keras.models.load_model("end_to_end_model")
time: 2.65 s (started: 2021-10-08 14:51:10 +00:00)

14.6. Test the Loaded End-to-End model with Raw (Text) Data

raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
'THY Uçak biletimi değiştirmek için başvurdum. Kimse geri dönüş yapmadı!']
predictions=loaded_end_to_end_model.predict(raw_data)
print(id_to_category[np.argmax(predictions[0])])
print(id_to_category[np.argmax(predictions[1])])
alisveris
ulasim
time: 573 ms (started: 2021-10-08 14:51:13 +00:00)
loss, accuracy = loaded_end_to_end_model.evaluate(test_features,test_targets)
print("loaded_end_to_end_model accuracy: ", accuracy)
2696/2696 [==============================] - 46s 17ms/step - loss: 2.3769 - accuracy: 0.9511
loaded_end_to_end_model accuracy: 0.9511488080024719
time: 46.1 s (started: 2021-10-08 14:51:13 +00:00)



PART E: DEPLOY END-TO-END MODEL TO HUGGINGFACE SPACES USING GRADIO

In this part, we will learn how to deploy the End-to-End model with a Keras TextVectorization layer to the HuggingFace Spaces. For the interface, we will use the Gradio library.

Custom Standardization Function

Please note that when we configured the Keras TextVectorization layer in this tutorial, we did not use the default standardization function. Instead, we implemented a custom standardization (custom_standardization) function as below.



@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")    
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*","")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    #remove stop words
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
      no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    #remove Turkish chars
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")
    return no_U
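
When debugging the cleaning rules, it can help to try them on a sentence without TensorFlow. The sketch below is a rough plain-Python approximation of the same steps using the standard re module. It is not the function used by the model: standardize_py is a hypothetical name, the stop-word list contains only the first five entries of the stop-word file, and Python's regex engine may treat \w differently from TensorFlow's for non-ASCII letters, so results can differ on some inputs.

```python
import re
import string

# Assumption: only the first five stop words from tr_stop_word.txt, for illustration
STOP_WORDS = ["ama", "amma", "anca", "ancak", "bu"]


def standardize_py(text, stop_words=STOP_WORDS):
    """Plain-Python approximation of the custom_standardization steps."""
    text = text.lower()
    text = text.replace("*", " ")
    text = text.replace("devamını oku", "")
    text = text.replace("<br />", "")
    text = re.sub(r"\w*\d\w*", "", text)                            # drop tokens containing digits
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # punctuation -> space
    text = " " + text + " "
    for word in stop_words:                                         # remove stop words
        text = text.replace(" " + word + " ", " ")
    text = re.sub(r" +", " ", text)                                 # collapse repeated spaces
    for source, target in (("ı", "i"), ("ö", "o"), ("ç", "c"), ("ş", "s"), ("ğ", "g"), ("ü", "u")):
        text = text.replace(source, target)
    return text


sample = "Bu ürün 2021de geldi, ama çok güzel!<br />Devamını oku"
print(standardize_py(sample).split())  # ['urun', 'geldi', 'cok', 'guzel']
```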



When you deploy the end-to-end model with a Keras TextVectorization layer, you must also register the custom standardization (custom_standardization) function with Keras. Otherwise, loading the model fails with an error and your end-to-end model will not work.

Gradio

As stated on the official website:

"Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!"

You can learn the details of the library from this demo section.

Since our aim is to learn how to deploy an end-to-end model, we will not get into the details of the Gradio interface library. Nevertheless, we will cover its related functionality for our purpose.

HuggingFace Spaces

HuggingFace provides us a free service to upload and deploy Machine Learning models. You can create your account for free and use the CPU-based platform for free. This service is called HuggingFace Spaces.

It is also integrated with GitHub, so you can connect your repos and have them served directly by HuggingFace Spaces.

Import Libraries

Let's import the necessary libraries to upload and run the end-to-end model:


import numpy as np 
import tensorflow as tf
import pickle
import string
import pandas as pd

Load STOP WORDS in Turkish

As you might remember we have used a "Stop words in Turkish" file for pre-processing the data. Let's load it:


path = "/content/gdrive/MyDrive/Colab Notebooks/input/"
tr_stop_words = pd.read_csv(path+'tr_stop_word.txt',header=None)
for each in tr_stop_words.values[:5]:
  print(each[0])
ama
amma
anca
ancak
bu

Load the End-to-End Model

We saved the End-to-End Model in the previous part of the tutorial, so now we can try to load it back:


path = "/content/gdrive/MyDrive/Colab Notebooks/models/"
loaded_end_to_end_model = tf.keras.models.load_model(path+"end_to_end_model")
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-3-ce2aedddcfdb> in <module>
      1 path = "/content/gdrive/MyDrive/Colab Notebooks/models/"
----> 2 loaded_end_to_end_model = tf.keras.models.load_model(path+"end_to_end_model")


/usr/local/lib/python3.9/dist-packages/keras/saving/saving_api.py in load_model(filepath, custom_objects, compile, safe_mode, **kwargs)
    228 
    229     # Legacy case.
--> 230     return legacy_sm_saving_lib.load_model(
    231         filepath, custom_objects=custom_objects, compile=compile, **kwargs
    232     )


/usr/local/lib/python3.9/dist-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     68             # To get the full stack trace, call:
     69             # `tf.debugging.disable_traceback_filtering()`
---> 70             raise e.with_traceback(filtered_tb) from None
     71         finally:
     72             del filtered_tb


/usr/local/lib/python3.9/dist-packages/keras/saving/legacy/serialization.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
    533             obj = object_registration._GLOBAL_CUSTOM_OBJECTS[object_name]
    534         else:
--> 535             obj = module_objects.get(object_name)
    536             if obj is None:
    537                 raise ValueError(


AttributeError: 'NoneType' object has no attribute 'get'

As you saw above, we received an error message. The reason is that we tried to load a model that includes a Keras TextVectorization layer with a custom standardization function without providing this function.

To fix this, first let's register the custom standardization function as a Keras Serializable object by decorating it with the @tf.keras.utils.register_keras_serializable() decorator:


@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")    
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*","")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    #remove stop words
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
      no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    #remove Turkish chars
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")

    return no_U

Now, let's re-try to load the end-to-end model:


path = "/content/gdrive/MyDrive/Colab Notebooks/models/"
loaded_end_to_end_model = tf.keras.models.load_model(path+"end_to_end_model")


This time, we successfully loaded the saved end-to-end model!
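
Why does registering the function make the difference? A saved model can only store a reference to the custom function by name, so the loader needs some way to find a function with that name. The toy sketch below imitates this idea in plain Python. It is only an analogy under that assumption, not how Keras is implemented: _REGISTRY, register, and load_from_config are hypothetical names invented for the example.

```python
_REGISTRY = {}  # hypothetical stand-in for a framework-wide registry of custom objects


def register(fn):
    """Decorator that makes a function discoverable by its name."""
    _REGISTRY[fn.__name__] = fn
    return fn


def load_from_config(config):
    """Resolve the function named in a saved config; fail if it was never registered."""
    name = config["standardize"]
    if name not in _REGISTRY:
        raise KeyError(f"'{name}' is not registered; define and register it before loading")
    return _REGISTRY[name]


saved_config = {"standardize": "custom_standardization"}

# First attempt: nothing is registered yet, so loading fails
try:
    load_from_config(saved_config)
    loaded_before_registering = True
except KeyError:
    loaded_before_registering = False


@register
def custom_standardization(text):
    return text.lower()


# Second attempt: the name can now be resolved
restored = load_from_config(saved_config)
print(loaded_before_registering, restored("ABC"))  # False abc
```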

Load the Ids of Categories

We saved the id_to_category dictionary in the previous part of the tutorial, so now we can load it back:


path = "/content/gdrive/MyDrive/Colab Notebooks/input/"
with open(path+"id_to_category.pkl", "rb") as pkl_file:
    id_to_category = pickle.load(pkl_file)
print(id_to_category)
{2: 'beyaz-esya', 27: 'sigortacilik', 12: 'giyim', 4: 'cep-telefon-kategori', 26: 'saglik', 22: 'mobilya-ev-tekstili', 10: 'finans', 24: 'mutfak-arac-gerec', 20: 'medya', 29: 'temizlik', 19: 'kucuk-ev-aletleri', 21: 'mekan-ve-eglence', 18: 'kisisel-bakim-ve-kozmetik', 23: 'mucevher-saat-gozluk', 28: 'spor', 17: 'kargo-nakliyat', 13: 'hizmet-sektoru', 1: 'anne-bebek', 0: 'alisveris', 5: 'egitim', 9: 'etkinlik-ve-organizasyon', 30: 'turizm', 8: 'enerji', 3: 'bilgisayar', 7: 'emlak-ve-insaat', 31: 'ulasim', 16: 'kamu-hizmetleri', 6: 'elektronik', 11: 'gida', 14: 'icecek', 25: 'otomotiv', 15: 'internet'}
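
For reference, a dictionary like this can be written and read back with the standard pickle module. The sketch below shows one possible save-and-load round trip; it writes to a temporary folder and uses a three-entry dictionary (entries copied from the output above) purely for illustration, so the file location is an assumption and not the path used in this tutorial.

```python
import os
import pickle
import tempfile

# A small subset of the mapping printed above
id_to_category = {0: "alisveris", 11: "gida", 31: "ulasim"}

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "id_to_category.pkl")
    # Save: pickle files must be opened in binary write mode
    with open(file_path, "wb") as pkl_file:
        pickle.dump(id_to_category, pkl_file)
    # Load: binary read mode
    with open(file_path, "rb") as pkl_file:
        restored = pickle.load(pkl_file)

print(restored == id_to_category)  # True
```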

Create a Function for Classification

We now define a simple function that receives a review and returns its predicted class:


def classify (text):
  pred=loaded_end_to_end_model.predict([text])
  return id_to_category[np.argmax(pred)]
examples=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']

classify(examples[1])


'ulasim'
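
If you would like to return more than the single best category, the prediction vector can be ranked first. The helper below is a hypothetical sketch in plain Python: top_categories is not part of the tutorial code, the probabilities are made up, and only the first four category ids from the dictionary above are used. It returns the k highest-scoring categories as a dictionary.

```python
def top_categories(probabilities, id_to_category, k=3):
    """Return the k most probable categories as a {category: probability} dictionary."""
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return {id_to_category[i]: probabilities[i] for i in ranked[:k]}


id_to_category = {0: "alisveris", 1: "anne-bebek", 2: "beyaz-esya", 3: "bilgisayar"}
probabilities = [0.1, 0.6, 0.05, 0.25]  # made-up scores for four categories

print(top_categories(probabilities, id_to_category, k=2))
# {'anne-bebek': 0.6, 'bilgisayar': 0.25}
```

With the real model, you would pass one row of the predict() output (for example, converted to a list) together with the full id_to_category dictionary.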

Add User Interface to the End-to-End Model with Gradio

We can design a very simple yet useful interface with the Gradio library.

First, install the library


!pip install gradio

Then, import it.


import gradio as gr

The main entry point of the Gradio library is the Interface() function. It takes 3 important parameters to build an interface:

  • A function to call with the inputs
  • A list of inputs
  • A list of outputs.

In our case, we would like to run the classify() function passing one text input and getting the result as one text output.

Moreover, if you like, you can provide some example inputs as well.


iface = gr.Interface(fn=classify, inputs="text", outputs="text", examples=examples)


After configuring the interface, we can launch and use it:

iface.launch()





Deploy to HF Spaces

Here, I will show the simplest way to deploy the above model and its interface to HF Spaces. There are other ways, but this is the simplest one.


  • First, sign in to your HF account, get to the Spaces section, and click create new Spaces.




Then, create your space by filling in the form





Click the FILES menu





Continue with creating a new file option. We will create 2 important files.






  1. requirements.txt: It will include all the necessary libraries to run our code.



  2. app.py: It will include all of the code above:

import numpy as np 
import tensorflow as tf
import pickle
import string
import pandas as pd
import gradio as gr

tr_stop_words = pd.read_csv('tr_stop_word.txt',header=None)

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")    
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*","")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    #remove stop words
    no_stop_words = ' '+no_punctuations+ ' '
    for each in tr_stop_words.values:
      no_stop_words = tf.strings.regex_replace(no_stop_words, ' '+each[0]+' ' , r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +"," ")
    #remove Turkish chars
    no_I = tf.strings.regex_replace(no_extra_space, "ı","i")
    no_O = tf.strings.regex_replace(no_I, "ö","o")
    no_C = tf.strings.regex_replace(no_O, "ç","c")
    no_S = tf.strings.regex_replace(no_C, "ş","s")
    no_G = tf.strings.regex_replace(no_S, "ğ","g")
    no_U = tf.strings.regex_replace(no_G, "ü","u")

    return no_U
loaded_end_to_end_model = tf.keras.models.load_model("end_to_end_model")
pkl_file = open("id_to_category.pkl", "rb")
id_to_category = pickle.load(pkl_file)

def classify (text):
  pred=loaded_end_to_end_model.predict([text])
  return id_to_category[np.argmax(pred)]

examples=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']

iface = gr.Interface(fn=classify, inputs="text", outputs="text", examples=examples)
iface.launch()



Here is the screenshot:





  3. Lastly, we need to upload all the necessary files:
  • the stop words file (tr_stop_word.txt)
  • the id-to-category file (id_to_category.pkl)
  • the end-to-end model folder (end_to_end_model)

You can drag and drop these files over the browser.







Do not forget to commit after you upload files and folders.

Notice that each time you commit a change, the system will build the service automatically:







You can observe the logs to monitor installations or locate any possible error messages if any.









When there are no errors, you will see the Running message. You can click the App button to interact with your model via the Gradio interface. Well done!






SUMMARY

In this tutorial, we have learned:

  • What a Keras TextVectorization layer is
  • Why we need to use a Keras TextVectorization layer in Natural Language Processing (NLP) tasks
  • How to employ a Keras TextVectorization layer in Text Preprocessing
  • How to integrate a Keras TextVectorization layer to a trained model
  • How to save and load a Keras TextVectorization layer and a model with a Keras TextVectorization layer
  • How to integrate a Keras TextVectorization layer with TensorFlow Data Pipeline API (tf.data)
  • How to design, train, save and load an End-to-End model using Keras TextVectorization layer
  • How to deploy the End-to-End model with a Keras TextVectorization layer implemented with a custom standardize (custom_standardization) function
  • How to use the Gradio library and the HuggingFace Spaces platform.

All the above topics are presented in a multi-class text classification context.

If you like this tutorial, please follow the Murat Karakaya Akademi YouTube channel and muratkarakaya.net.

Thank you for your patience!

#Keep Deep Learning :)

Comments or Questions?

Please share your Comments or Questions.

Thank you in advance.

Do not forget to check out the next parts!

Take care!

You can access the Murat Karakaya Akademi via:

YouTube

Facebook

Instagram

LinkedIn

GitHub

Kaggle

muratkarakaya.net