# Questions & Answers in Machine Learning & Deep Learning & Artificial Intelligence

In this post, I will share the Answers to the Questions & Comments about Machine Learning & Deep Learning topics that are posted to my YouTube channel Murat Karakaya Akademi.

If you have any questions please do not hesitate to post them on the Murat Karakaya Akademi YouTube channel. I will reply to them as soon as possible.

You can access many tutorials on Machine Learning & Deep Learning topics implemented with Python, TensorFlow, and Keras in English or Turkish on my YouTube channel Murat Karakaya Akademi.

*Last updated: 12 03 2022*

# When I convert an existing Series or column to a category dtype my Category then I find this kind of error : TypeError: data type ‘Category ‘ not understood……how to resolve it, sir?

I think some values in the column cannot be converted to categorical values. Check the values in the column. For details, see the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#object-creation
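Another thing worth checking, judging only from the error text in the question: pandas dtype names are case-sensitive, and the message shows `'Category '` (capitalized, with a trailing space) rather than `'category'`. The sketch below uses a made-up Series; the data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical example data; replace with your own column.
colors = pd.Series(["red", "green", "red", "blue"])

# The dtype name must be the lowercase string "category".
as_category = colors.astype("category")
print(as_category.dtype)
print(list(as_category.cat.categories))

# A capitalized or padded name is not a dtype pandas recognizes.
try:
    colors.astype("Category ")
    bad_name_rejected = False
except (TypeError, ValueError) as err:
    bad_name_rejected = True
    print("Rejected dtype name:", err)
```

If the lowercase name still fails on your data, inspecting the column values as suggested above is the next step.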

# Thank you sir, it works for me but when i change the dataset (eg. Movie_Poster-dataset) i got this error ! in the section (Show some samples from the data pipeline), File “/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py”, line 3363, in get_loc raise KeyError(key) from err KeyError: ‘Filenames’

It seems to me that pandas generates the error. Most probably, it is due to a key error in a data frame: you may be referring to a column or record in a data frame with a key (name or record number) that does not exist. Please check how you access the record in the data frame.
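A quick way to check this is to list the columns that actually exist before indexing. The sketch below is only an illustration: the data frame and its column names are hypothetical stand-ins for the new dataset, and only the key `Filenames` comes from the traceback.

```python
import pandas as pd

# Hypothetical frame standing in for the new dataset's metadata file.
df = pd.DataFrame({"filename": ["a.jpg", "b.jpg"], "genre": ["Action", "Drama"]})

wanted = "Filenames"  # the key from the traceback

# List the columns that actually exist before indexing.
print(df.columns.tolist())

if wanted in df.columns:
    filenames = df[wanted]
else:
    # Look for a column that differs only by case or a trailing "s".
    candidates = [c for c in df.columns
                  if c.lower().rstrip("s") == wanted.lower().rstrip("s")]
    print(f"Column {wanted!r} not found; similar columns: {candidates}")
    filenames = df[candidates[0]] if candidates else None
```

If a similar column shows up, either rename it or use the name that the new dataset really has.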

# Thanks for the awesome information! Do you have any information on multilabel classification for imbalanced datasets? I’ve been looking around for answers as to how to apply class weights to my model and so far I’ve only learned that the class weight parameter in the model.compile method is bugged.

In this playlist, you can find a series on handling imbalanced datasets: https://youtube.com/playlist?list=PLQflnv_s49v-RGv6jb_sFOb_of9VOEpKG

# If we have 50 classes (multiclass problem like yours), so after training when I do `model.predict()`, it will return probabilities for 50 classes and the one with largest value is associated to the input. Suppose I would like to predict feature/input against arbitrary 10 labels only and not 50 classes, any hint/idea how to do that please?

The last layer of the model must have 10 units.

# @Murat Karakaya Akademi Thank you Professor. I have an idea here to share, but before I will write full problem description: “I have trained a model where output is multi class, so a soft max during prediction. I have feature and a smaller set of candidates I want scored. So, let us say during training I have 100 possible output in softmax but during prediction I want to score 10 of them against a given feature.” What do you think now please? I suggest to use Bernoulli at loss function (binary loss) instead of categorical cross entropy loss with softmax layer as the labels will be treated independently of each other, in this way after training, if we pass 10 labels out of 50 and let others say 0, we can predict which among the 10 candidates is associated with input. What do you think now please?

Why do you have different outputs at training and inference time? This is not a usual practice or requirement. Please review your approach to the problem. Take care!

# Hello sir, may I ask that if I were using a multivariate multistep encoder-decoder, my Inputs are multivariate and hope to predict for 1 output each timestep, but the prediction result will have multiple outputs in each time step, same as the shape of the Inputs, how do I choose which of them as my outputs? Thanks

As explained in the tutorial series, you need to prepare the dataset as inputs (X) and outputs (y). Then, the last layer should have as many neurons (units) as the number of output features in y. Take care!

Colab: https://colab.research.google.com/drive/1ErnVEZOmlu_nInxaoLStW0BHzgT4meVj?usp=sharing

Seq2Seq playlist: https://youtube.com/playlist?list=PLQflnv_s49v-4aH-xFcTykTpcyWSY4Tww

# Thank you so much Mr. Karakaya! I’ve been working on deep learning subjects for a while now, but there are not many effective sources on this matter. This is the best channel, I’m glad I have found you. Thanks for your tutorials!

Thank you for the motivating comment!

# Thanks for the video, can you please help me with understanding this, since 1D convolution, works on rows, one filter sums all the 3 session scores for 1 student, how is it summing for all 5 students ? Does not 1D convolutions works only on rows and only on one column?

Check the following tutorial: “Conv1D: Understanding tf.keras.layers” (https://youtu.be/WZdxt9xatrY).

# Hi. Thanks for the informative video. In the video, you used an example that predicted a single value (i.e. house prices). Is it possible to use it to predict more than one value at once (eg. house price and a number of rooms)?

If you can prepare the train data by including the number of rooms as the second target, then the shape of the train data will be [404, 12, 2] instead of [404, 13, 1]. Since the last layer of the model has as many neurons as n_features = train_data_reshaped.shape[2], the model will predict 2 numbers: the price and the number of rooms.

# Hello sir, after training the whole model I want to make a prediction for a single image, how can I do that?

You can use the trained model’s predict method, for example:

`model.fit(...)`

`pred = model.predict(sample)`

Be careful about the shape of “sample”.
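Regarding the shape: Keras models expect a leading batch dimension even for a single sample. A minimal NumPy sketch, assuming a hypothetical 64x64 RGB image (the actual size depends on what your model was trained on):

```python
import numpy as np

# Hypothetical single image: height x width x channels.
image = np.random.rand(64, 64, 3).astype("float32")

# Add a leading batch axis so the shape becomes (1, 64, 64, 3).
sample = np.expand_dims(image, axis=0)
print(sample.shape)

# With a trained Keras model this would then be:
# pred = model.predict(sample)   # pred[0] is the prediction for the image
```

The image must also go through the same preprocessing (resizing, scaling) that was applied to the training data.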

# Sir good evening. In encoder-decoder with teacher forcing I have used Embedded layer to represent input instead of one_hot encoding. I got a problem while executing decoder_lstm.predict( ) like lstm expects three tensors but receive one tensor. actually I passed [target_seq ]+encoder_state_values as argument to it. why I have been getting this error ? could you give me some suggestions please to get rid of this. error: ValueError: Layer dec_lstm_layer expects 3 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor ‘dummy/embedding_14/embedding_lookup/Identity_1:0’ shape=(None, 1, 10) dtype=float32>]

I was able to run it like this, defining an input sequence and processing it:

`encoder_inputs = Input(shape=(n_timesteps_in,), name="encoder_inputs")`

`embedding = tf.keras.layers.Embedding(n_features, n_features)`

`encoder_inputs_embedded = embedding(encoder_inputs)`

`encoder_lstm = LSTM(LSTMoutputDimension, return_state=True, name="encoder_lstm")`

`LSTM_outputs, state_h, state_c = encoder_lstm(encoder_inputs_embedded)`

# Regarding the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART D: CODING ENCODER DECODER MODEL WITH TEACHER FORCING” (https://youtu.be/RRP0czWtOeM)

# Thanks a lot for this kind of lecture to make understand the difficult concepts like encoder and decoders with simple examples … thank you sir

Thank you for the nice comment and keep Deep Learning!

# Hi sir how can I do semi-supervised text classification with a 20% label?

You can check this example, “Semi-supervised Classification on a Text Dataset”, in the scikit-learn 1.0.1 documentation: https://scikit-learn.org/stable/auto_examples/semi_supervised/plot_semi_supervised_newsgroups.html

# Hello Professor, If we have instead of one-hot encoding vectors of length 10 a vector of length 50, the input has 2 vectors each of size 50 instead of 4 one-hot encoded vectors each of size 10 in your video. In our case, I replace the categorical cross-entropy you used with mse loss function, but I have a question. How I should initialize the decoder in our case since you initialized it as follows: # Prepare decoder initial input data: just contains the START character 0 # Note that we made it a constant one-hot-encoded in the model # that is, [1 0 0 0 0 0 0 0 0 0] is the initial input for each loop decoder_input_data = np.zeros((batch_size, 1, n_features)) decoder_input_data[:, 0, 0] = 1 Should we keep it as it’s given that our vectors are not one-hot encoded but real-valued? Should we initialize it randomly each time? What would you recommend, please? Thanks.

You first convert your data to one hot and then create your sequence. Add START symbols (which are one-hot encoded as well) to the head of your sequence. And train your model.

# Thank you, Professor. Can we use GPT3 you used similar to how we use the seq2seq model to encode input and decode it keeping in mind minimizing loss, please?

Yes, absolutely

# Good morning. could you explain to me how to use attention in teacher forcing, please? I was not able to use it. I got errors.

You can watch and access all the codes from the tutorials:

# Regarding the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART D: CODING ENCODER DECODER MODEL WITH TEACHER FORCING” (https://youtu.be/RRP0czWtOeM) Hello Professor. I see that using the same trained decoder in the inference stage is problematic, so by how you did it, it seems you just create another decoder in inference time. This new decoder is trained to accept output in a generic way, but not provide actual output. But my question is, you did not train the model in inference time, but all you did is just avoid giving real output to the decoder to predict the next output, so what does that solve specific, please? I am just not getting this idea. Thanks.

Basically, you borrowed the trained layers from the trained model to create an inference model. Thus your inference model does not need to be re-trained. Watch the tutorial again: I emphasize that approach.

# Regarding the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART D: CODING ENCODER DECODER MODEL WITH TEACHER FORCING” (https://youtu.be/RRP0czWtOeM) Can I use dropout layers in the for loop of the decoder or you don’t think this is recommended, please?

As far as I know, using dropout layers in Encoder-Decoder is not so common.

# Regarding the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART D: CODING ENCODER DECODER MODEL WITH TEACHER FORCING” (https://youtu.be/RRP0czWtOeM) 18:46 would you only like to change the sample to have 0 in one for one of its one-hot encoded vectors please? Or you would like to set all samples first one-hot encoded vectors to have 0? Also, in the training, validation loss and loss is over 0.2, do you think this is a good indicator, please? How to reduce that? What would you recommend?

You can do it either way. The absolute value of loss depends on the input values, predicted values, and loss function. Thus you cannot look and say 0.2 is good. You need to select and monitor appropriate metric(s).

# Is one hot encoding necessary for series forecasting using encoder-decoder? Can I just use the raw numbers?

Good question. In practice, feeding raw numbers directly into a neural network often makes training inefficient. Using some encoding creates a relationship between these raw numbers, which helps the model relate them.

# Regarding the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART F Encoder-Decoder with Bahdanau & Luong Attention” (https://youtu.be/FEVCmJXc7eI)

# Question 1: For 25:20, you mentioned h_t is for decoder and h_s for the encoder, but you said the opposite at 25:23, please correct me if that is wrong? Question 2: Also at 28:20, you set values = all hidden states of decoder but you said in the video encoder. Question 3: I am not sure if you used `V` or not at 29:27, please? Also, where W1 and W2 dense layers are added, please? Are they added on top of the LSTM unit cell after we get the hidden unit state? Question 4: Not sure why you returned attention weights at 31:23 given that we used attention weights over values to produce context vector. So what is the purpose please of returning it?

A1: h_t is for decoder and h_s for the encoder.

A2: My mistake. values = all hidden states of encoder.

A3: v is written in an open form where value= tf.nn.tanh().

A4: They are returned in case you need to visualize them.

# The best LSTM explanation!! Thanks a lot

SEQUENCE-TO-SEQUENCE LEARNING PART B: USING LSTM

Glad you think so!

# Can you implement nucleus sampling as well?

Thank you for the comment. The request is noted. However, as you know from the tutorial, you can modify the top-k sampling such that it can sum up the probabilities of the entries instead of picking up the top k entries directly. Nonetheless, when I’m available I will publish a new tutorial on it. Keep Deep Learning!
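Until then, the modification described above can be sketched roughly as follows. This is a minimal NumPy sketch and not code from the tutorial; the function name `nucleus_sample`, the threshold `p`, and the example probabilities are assumptions for illustration.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample an index from the smallest set of top entries whose total mass reaches p."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # indices sorted by descending probability
    cumulative = np.cumsum(probs[order])
    # Keep entries up to and including the first one whose cumulative mass reaches p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()   # renormalize so they sum to 1
    return int(rng.choice(kept, p=kept_probs)), kept

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
token, nucleus = nucleus_sample(probs, p=0.75, rng=np.random.default_rng(0))
print(sorted(nucleus.tolist()), token)
```

Unlike top-k, the number of kept candidates changes from step to step: a peaked distribution keeps few entries, a flat one keeps many.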

# I have one question when you did the top-k-sampling why did you do the softmax again?

After selecting the top k candidates, we select one of them according to their relative scores (probabilities). However, since you selected the top k from all possible candidates, the probabilities of these k candidates do not sum up to 1.0! By using the Softmax function, we can recalculate their relative probabilities such that their total is 1.0 (which makes them a proper probability distribution!). Is that clear now?
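The renormalization step can be illustrated with a small NumPy sketch. It assumes the model outputs raw scores (logits); the helper names `softmax` and `top_k_sample` and the example scores are made up for this illustration and are not taken from the tutorial notebook.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())      # subtract the max for numerical stability
    return e / e.sum()

def top_k_sample(logits, k=3, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    top_idx = np.argsort(logits)[-k:]        # indices of the k largest scores
    top_probs = softmax(logits[top_idx])     # renormalize over the k candidates only
    choice = int(rng.choice(top_idx, p=top_probs))
    return choice, top_idx, top_probs

logits = [2.0, 1.0, 0.5, -1.0, -3.0]
full = softmax(logits)
choice, top_idx, top_probs = top_k_sample(logits, k=3, rng=np.random.default_rng(0))
print(full[top_idx].sum())   # mass of the top-k under the full distribution (below 1)
print(top_probs.sum())       # total after re-applying softmax to the top-k only
```

Without the second softmax, the k probabilities could not be passed to a sampler that requires a distribution summing to 1.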

# How do we generate text i.e many sentences with respect to classes (Harassments, Sexual Activity, Terrorism, and Drugs) in this regard please guide me.

You can generate text applying several different approaches in the literature. You can begin with my tutorials in this playlist Text Generation in Deep Learning with Tensorflow & Keras:

Enjoy!

# If I have X_train= (3181683, 263), Y_Train = (3181683, ) , X_test= (1363579, 263), and Y_test = ( 1363579, ) How do I write a 1D convolutional and 2D convolutional, train the model and get and the accuracy?

You can watch the tutorial to understand the Keras Conv1d layer:

# Hello sir, I am using a custom keras model and getting the error message: “raise NotImplementedError(str(self) + ‘ does not implement get_config()’) NotImplementedError: <__main__.OrthogonalRegularizer object at 0x000002810AB07520> does not implement get_config()” when using model.save(‘/path’). I provided the necessary files via gmail. I would appreciate it if you help me. Thank you.

Hello Gökhan, as explained in the tutorial, you need to define the “get_config()” method for the custom object, and when loading the saved model, you need to provide the name of the custom object. Please check the shared code carefully: https://colab.research.google.com/drive/1gfvcXwBDel8USWuMeb-hrSKihXGy_bSl?usp=sharing If this does not solve your problem, try the “tf-nightly” version via “!pip install tf-nightly”. Good luck!

# Hi Murat, ur explanation is awesome, I wanted to learn a pose estimation model to identify coordinates, can you provide complete tutorial on it?

Hi Nisha! I noted your request. The important concepts about pose detection are given in the below tutorial. Thank you!

# I tried to encode a large text corpus with one-hot encoding as you did, but unfortunately, I had a very large batch matrix of 2000x3000x400 which broke my memory and needs over 90 GB of RAM. What would you recommend in my case, please?

Hi Mohammad, I strongly recommend you use an Embedding Layer for encoding text inputs.
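To see why this helps with memory, compare a one-hot tensor with the integer token ids that an Embedding layer takes as input. The sizes below are deliberately small and hypothetical so the sketch runs anywhere; only the vocabulary size of 400 echoes the question.

```python
import numpy as np

# Small, hypothetical sizes so the example runs anywhere.
n_samples, seq_len, vocab_size = 20, 30, 400

rng = np.random.default_rng(0)
token_ids = rng.integers(0, vocab_size, size=(n_samples, seq_len), dtype=np.int32)

# One-hot encoding adds a vocab_size axis: (n_samples, seq_len, vocab_size).
one_hot = np.eye(vocab_size, dtype=np.float32)[token_ids]

print(one_hot.shape, one_hot.nbytes)        # dense one-hot tensor
print(token_ids.shape, token_ids.nbytes)    # integer ids, the input an Embedding layer takes
ratio = one_hot.nbytes // token_ids.nbytes
print(ratio)                                # one-hot is vocab_size times larger here
```

With integer ids, the input size no longer grows with the vocabulary; the vocabulary only affects the size of the embedding matrix inside the model.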

# Thank you very much, sir, for dedicating such an eye-opening video to understanding the LSTM parameters. Sir, I have a question At 19:25 why did you multiply each W, U,b parameter with 4 (line number 6,8,10)??? The second question is in every time step (i.e., 5 timesteps) h and c state will change??? Basically, I want to reproduce the LSTM using NumPy by Keras derived LSTM parameters, so what parameters should I keep on focus?

Thank you for the questions. 1. We multiply each W, U, b parameter by 4 because there are 4 gates in an LSTM cell. 2. At each time step, the LSTM cell creates new h and c states. 3. The input and the output parameters define the internal behavior of the LSTM cell.
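The factor of 4 can be checked with a few lines of arithmetic. The function below is a sketch of the standard parameter count for an LSTM layer with bias; the name `lstm_param_count` and the example sizes (10 input features, 1 unit) are assumptions chosen for illustration.

```python
def lstm_param_count(n_features, units):
    """Trainable parameters of a standard LSTM layer with bias."""
    w = n_features * units   # input weights W, per gate
    u = units * units        # recurrent weights U, per gate
    b = units                # bias b, per gate
    return 4 * (w + u + b)   # input, forget, output gates plus the cell candidate

# Hypothetical sizes: 10 input features and 1 LSTM unit.
print(lstm_param_count(n_features=10, units=1))   # 48
```

To reproduce the layer in NumPy, the three weight arrays returned by the Keras layer can be split into these four equal blocks along their last axis.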

# Hi sir, Awesome information. Content is straight to the point. 1. Please make a video on [ “multi-label text classification problem” ] 2. what are the score or metrics considered for [ multi-label text classification problem” ] Thank you.

Hi, we just published a new tutorial about Multi-Class Text Classification: https://youtu.be/KnSeSdXMX_Q In this tutorial we discussed the metrics. Also please keep following the playlist about Classification with Keras:

# While I can figure out how an auto-encoder works, in that the loss is based on comparing input and output which drives the weight updates through back-propagation, I have some troubles with the current model: how does the encoder learn? the encoder consumes the entire input sequence, while the decoder consumes one time-step at a time. Is this true for both training and inference?

Good questions.

1. To understand how the Encoder-Decoder learns, please watch parts A and B in the Seq2Seq playlist: https://youtube.com/playlist?list=PLQflnv_s49v-4aH-xFcTykTpcyWSY4Tww

2. Yes, it is true for both training and inference: the encoder consumes the entire input sequence, while the decoder consumes one time-step at a time.

# Thank you very much for the LSTM tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART B: USING LSTM” (https://youtu.be/7L5bkMu0Pgg). It’s very clear, but I have only one question please about TimeDistributed(Dense(n_features, activation='softmax')). I did not understand the combination here please of TimeDistributed(Dense()). Second, I did not understand why you used n_features with the Dense layer, please?

Thank you for the question. In the earlier versions of the Keras API, the Dense layer could produce output only in [BatchSize, Features] shape. However, in the tutorial, we need output in the shape [BatchSize, TimeSteps, Features]. Therefore, we used the TimeDistributed() wrapper class to repeat the Dense layer as many times as TimeSteps. However, nowadays, you do not need to use TimeDistributed(), because the Dense layer can now also produce output in [BatchSize, TimeSteps, Features] shape.

# One question regarding the LSTM tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART B: USING LSTM” (https://youtu.be/7L5bkMu0Pgg) please: if we have 3 inputs to a model not only one vector [1,2,3,4] where you create a custom one-hot encoded vector and then train the model on top of it. I mean by 3 inputs, 3 source of inputs where is an array by itself. I am not sure if LSTM is applicable in this case please?

Thank you for the question. As I explained in the tutorial, LSTM expects a batch of input in the shape [Batch, TimeSteps, Features]. For each time step, there can be many features. If each input is an array, you can flatten it. Then you would have 3*4 = 12 inputs as Features. I hope it is clear now.
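As a small NumPy illustration of the 3*4 = 12 case: the three arrays below are hypothetical input sources, each providing 4 values per time step, and the batch and time-step sizes are arbitrary.

```python
import numpy as np

batch, timesteps = 2, 5

# Three hypothetical input sources, each giving an array of length 4 per time step.
source_a = np.zeros((batch, timesteps, 4))
source_b = np.ones((batch, timesteps, 4))
source_c = np.full((batch, timesteps, 4), 2.0)

# Join them along the last (feature) axis: 3 * 4 = 12 features per time step.
x = np.concatenate([source_a, source_b, source_c], axis=-1)
print(x.shape)   # (2, 5, 12)
```

The resulting array has the [Batch, TimeSteps, Features] layout and can be fed to an LSTM layer directly.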

# In the tutorial “Sampling in Text Generation” (https://youtu.be/0RFQ6QOYL68), suppose I want to predict which next word based on the previous 10 words( same as phone words prediction). Which sampling approach should I use?

As I explained in the tutorial, you have a few options to try. Each option has its advantages and disadvantages explained in the tutorial. Top-k or Top-p would be more useful for text prediction or generation. Take a look at this tutorial list to see how I apply sampling techniques for text generation: https://www.youtube.com/playlist?list=PLQflnv_s49v9QOres0xwKyu21Ai-Gi3Eu

# Hello, “Conv1d: Keras 1D Convolution Model For Regression (Boston House Prices Prediction)” (https://youtu.be/JzoIHdkFcQU) is a great video, I was just wondering how to determine batch size? And how to vary it? I currently have a tensor with rows being my inputs in one column, and I have approximately 1300 of these vectors in the form of a tensor.

Hello, thank you for the comment. Batch size (or mini-batch size) is determined (or limited) by many factors such as the dataset size, GPU capacity, model complexity, running time expectations, speed of convergence of the solution, etc. There are no hard rules for that. You can check out this discussion for more details: https://stackoverflow.com/questions/46654424/how-to-calculate-optimal-batch-size. For the second question, we fix the batch size during a training run. If you want to see the impact of the batch size on YOUR DATASET & MODEL, you generally increase or decrease it by powers of 2. For instance, if the initial batch size is 128, you might like to try 64 or 256 as possible batch sizes. But it is not a rule, just a custom for most people. In deep learning, 1300 samples are mostly not considered enough data. To train a model well, we usually need many more samples so that the model can function better in a real-life case. I suggest you collect more data. DL is a data-dependent process. Good luck!

# Hi, in the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART F Encoder-Decoder with Bahdanau & Luong Attention” (https://youtu.be/FEVCmJXc7eI), if inputs to model is not multiple of batch_size I am getting shape error while training. How can this problem be solved? I have used STEPS_PER_EPOCHS, but still, this did not help

You can use the “drop_remainder” argument of the batch method in the tf.data.Dataset API for input pipelines. From the documentation:

drop_remainder: (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch.

# Fantastic tutorial “Text Generation in Deep Learning with Keras: Fundamentals” (https://youtu.be/Ait1_xNmxII). Can we use this with time series, please?

Thank you for the comment! It is a very interesting question. I have not seen that the Text Generation approaches are used to generate time-series data but why not? We can try it.

# Hi, I just had a question regarding the technique to deal with variable-length sequences. You’ve mentioned in the second video in the seq2seq series that there are other ways other than padding to deal with Variable length sequences, as in some applications we cannot really judge the maximum length of the sequence. So are there other techniques that deal with variable length sequences and could you mention what they are?

Very good question! Another method is to use the ragged tensors in TF. Unfortunately, they are a little bit confusing. You can visit the official documentation to learn more: https://www.tensorflow.org/guide/ragged_tensor

# If we use Teacher Forcing here then how to proceed in the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART F Encoder-Decoder with Bahdanau & Luong Attention” (https://youtu.be/FEVCmJXc7eI)? Any hints?

As I kindly requested from all followers in the notebook (DO IT YOURSELF: Add Teacher Forcing), you first need to watch/study the https://youtube.com/playlist?list=PLQflnv_s49v-4aH-xFcTykTpcyWSY4Tww tutorial series. Then, you can prepare the data pipeline as *encoder_input*, *decoder_input* and *decoder_output* for training. Then you can use these datasets for teacher forcing. Good luck!

# What is the shape of Y in model.fit(X,Y) is it ( num_of_images, 5 ) in the tutorial “How to solve Multi-Label Classification Problems in Deep Learning with Tensorflow & Keras?” (https://youtu.be/QBHjpjymqbM)?

Pay attention: The last layer of the model has *number_of_classes* units. So the output (*y_pred*) will be a vector with [*batch_size, number_of_classes*] dimension.

# Thank you very much for the video “Download Datasets from Kaggle to GOOGLE COLAB” (https://youtu.be/_rlt4mzLDLc) . I’m stuck here as you can see below please help. !kaggle competitions download -c predictive-maintenance Traceback (most recent call last): File “/usr/local/bin/kaggle”, line 5, in <module> from kaggle.cli import main File “/usr/local/lib/python2.7/dist-packages/kaggle/__init__.py”, line 23, in <module> api.authenticate() File “/usr/local/lib/python2.7/dist-packages/kaggle/api/kaggle_api_extended.py”, line 146, in authenticate self.config_file, self.config_dir)) IOError: Could not find kaggle.json. Make sure it’s located in /root/.kaggle. Or use the environment method.

Please ensure that you downloaded your Kaggle Token to the correct directory used in the video/Colab notebook.

# Can you please explain to me the Lambda layer, along with the K.concatenate function? why are we using axis=1 in the tutorial “SEQUENCE-TO-SEQUENCE LEARNING PART D: CODING ENCODER DECODER MODEL WITH TEACHER FORCING” (https://youtu.be/RRP0czWtOeM)

First, `Lambda` layers are best suited for simple operations or quick experimentation. Here we just gather the outputs for each time step one by one. We use a `Lambda` layer to put them together as a single output. To do so, we use the K.concatenate function, since the output for each step has 3 dimensions. K.concatenate is a Keras backend function that concatenates a list of tensors along the given axis. *axis=1* indicates that the outputs are concatenated along the second dimension, which is the time-step dimension here. You can see an example with the related Concatenate layer here: https://keras.io/api/layers/merging_layers/concatenate/
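The effect of axis=1 can be seen with plain NumPy arrays, whose concatenation follows the same axis convention. In this sketch the per-step outputs are assumed to have shape (batch, 1, features); the sizes and values are made up for illustration.

```python
import numpy as np

batch_size, n_features, n_steps = 2, 10, 4

# Hypothetical decoder outputs: one (batch, 1, features) array per time step.
step_outputs = [np.full((batch_size, 1, n_features), float(t)) for t in range(n_steps)]

# axis=1 is the time-step axis, so the steps are stacked one after another.
sequence = np.concatenate(step_outputs, axis=1)
print(sequence.shape)   # (2, 4, 10)
```

Concatenating on axis=2 instead would merge the steps into the feature dimension and give shape (2, 1, 40), which is not what the decoder output should look like.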

# Hello sir can u explain unsupervised encoder and decoder extractive summarization thank u!

Unfortunately I have not worked on that area yet.

# I have a question for the tutorial “LSTM: Understanding the Number of Parameters” (https://youtu.be/B08baRr2LlY) . If I am giving a one-to-many approach. Let’s say I have 6 input parameters, then my input vector would be (batch, 1, 6) and if I want to predict an output parameter value for 500 time steps, my output vector would be (1, 500). Now, should my LSTMoutputDimension be 500 or 1? (if I have return sequence = True).

First, remember that LSTM *expects* input data to be a 3D tensor such that: `[batch_size, timesteps, feature]`. Please note that *timesteps* and *features* are two different concepts! If you want to predict an output parameter value for 500 time steps, then the shape would be something like `[batch_size, 500, feature]`, not `[batch_size, timesteps, 500]`! These are two different outputs! Please first understand the problem and the expected output in terms of `[timesteps, feature]`!

I hope it is clear now.

# Hi I want to generate images with text that have uniform with and height using trdg TextRecognitionDataGenerator or any other method for training an OCR model. Please can you help me to this? Thanks.

Unfortunately I have not worked on that area yet

# @Murat Karakaya Akademi Fantastic explanation “LSTM: How it works? How to use? How to set up parameters correctly?” Could you please explain how it’s 48? Why we have 2 (None, 1) in 17:43 (https://www.youtube.com/watch?v=7nnSjZBJVDs&t=17m43s) please?

It was my mistake. As I noted in the pinned comment below: “*WARNING: At 17:00: Please ignore the verbal explanation at this moment. Let me clarify the numbers: The shape of the hidden cell state (state_c) is (None,1) and the total number of parameters of the LSTM layer is 48! Sorry for the inconvenience.*” The output in the summary does not fit in the space on the notebook. Indeed, there should be *4 (None,1) shapes* for the output of that layer. Thank you!

# Sir can u please make a video on seq2seq with attention using Keras Attention layers. I tried using other resources but errors prompting using keras Attention layers

Hello Dinesh, actually I have a tutorial on this topic: https://youtu.be/FEVCmJXc7eI . I implemented the Bahdanau and Luong attention layers from scratch. At that time, Keras did NOT have attention layers. Now it provides them. Luckily, my implementation and the Keras implementation are very similar. So you can import the Keras layers and use them as I did in my tutorial with a custom layer. Take care!

# Awesome. This (LSTM: How it works? How to use? How to set up parameters correctly? https://youtu.be/7nnSjZBJVDs) is actually how you teach LSTM ! Everybody just blabbers LSTM LSTM and does coding without even understanding it.

Thank you so much!

# Hi Sir, Hope you and your family is well! can you please explain, why have you used batch_size for the decoder_input_data? i am little confused here. Lets take below scenario. 1. Suppose we have 20 rows in our data set (lets assume shape of each row is changed to (4,10)) 2. Now lets use batch_size = 2. Will the first 10 rows from our dataset get sent into the decoder and then the loss get calculated? 3. Now after calculating the loss, the decoder will make weight updates? and then second batch will be sent?

Hi Sourav! In general, we train a deep learning model by using mini batches from the data. It is a well-known approach to increase the speed of convergence of the model during training. In your scenario, your data shape is (20,4,10). If the batch size is 2, you have 10 batches of data and each batch has only 2 rows (samples). Now your mini batch shape is (2,4,10). That is, the first **2 rows** from your dataset get sent into the decoder and then the loss gets calculated according to the average error of these 2 predictions. Then, the decoder’s weights get updated according to the calculated gradients. After that, we continue to train the model by using the next batch in the forward pass. I hope this clarifies your questions. Take care!
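The batch arithmetic in this scenario can be verified with a short NumPy sketch. The array of zeros is only a placeholder for the real dataset.

```python
import numpy as np

data = np.zeros((20, 4, 10))   # 20 samples, each of shape (4, 10)
batch_size = 2

# Slice the data into consecutive mini batches of batch_size samples.
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
print(len(batches))        # 10
print(batches[0].shape)    # (2, 4, 10)
```

During training, one weight update is made per mini batch, so one epoch over this data makes 10 updates.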

# Sir from the last lectures: “SEQUENCE-TO-SEQUENCE LEARNING PART D: CODING ENCODER DECODER MODEL WITH TEACHER FORCING (https://youtu.be/RRP0czWtOeM)”, I understood that we need to use TimeDistributed Dense for the 2nd LSTM, i.e., the decoder for this case. Why did you use only the Dense layer for outputs?

Hi Dinesh. As shown in the tutorials, we can use either TimeDistributed(Dense) or a plain Dense layer after a Recurrent Neural Network layer such as LSTM, and they behave the same as long as the LSTM parameter return_sequences=True is set. Both approaches apply a single Dense layer (with the exact same weights) to each time-step output of the LSTM layer, then average the loss and apply it to the Dense layer weights. You can read more discussions here: https://stackoverflow.com/questions/44611006/timedistributeddense-vs-dense-in-keras-same-number-of-parameters/44616780
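The equivalence can be illustrated outside Keras with plain matrix arithmetic. The sketch below is not Keras code: it assumes a random array standing in for the LSTM output and one shared kernel `W` and bias `b` (all names and sizes are hypothetical), and compares applying them to the whole 3D tensor with applying them step by step.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, lstm_units, n_features = 2, 4, 16, 10

lstm_out = rng.normal(size=(batch, timesteps, lstm_units))  # stand-in for LSTM output
W = rng.normal(size=(lstm_units, n_features))               # one shared kernel
b = rng.normal(size=(n_features,))                          # one shared bias

# Dense on a 3D tensor: the kernel acts on the last axis only.
dense_all_steps = lstm_out @ W + b

# TimeDistributed-style: the same kernel applied to each time step separately.
per_step = np.stack([lstm_out[:, t, :] @ W + b for t in range(timesteps)], axis=1)

same = np.allclose(dense_all_steps, per_step)
print(dense_all_steps.shape, same)
```

Both paths use the same number of parameters, lstm_units * n_features + n_features, regardless of the number of time steps.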

# I have gone through a ton of resources but still, I couldn’t understand how exactly practical code works with theory. After seeing your videos I got a good understanding of LSTM. Thank u, sir.

Great to hear! Thank you :)

# Struggling to understand LSTM for a long time. This video “LSTM: Understanding the Number of Parameters” https://youtu.be/B08baRr2LlY is an eye-opener.

Thank you for your motivating feedback!

# Thanks, very good playlist on seq2seq (Seq2Seq Learning Tutorials https://youtube.com/playlist?list=PLQflnv_s49v-4aH-xFcTykTpcyWSY4Tww)

Glad you liked it!

# Thank you for the great tutorial (tf.data: Build Efficient TensorFlow Input Pipelines for Image Datasets https://youtu.be/5MQ63pDxULw).

# Point1: Regarding the last part with TensorBoard, you explained the input-pipeline analyzer, which shows that the major part of the time is spent with the input pipeline, even after using the map, cache, and prefetch methods. It would be a great help if you can give more information on this point, and also inform me if there are some more steps to improve the input pipeline.

# Point2: It would be a great help if you can share an example with a python generator with tf. data. (ie. tf.data.Dataset.from_generator) using real image datasets. Thank you again for the great tutorial.

Thank you for the feedback! For Point 1, TensorBoard is a continuously changing utility. Therefore, I am waiting for it to become somewhat more mature. For Point 2, I do not use generators anymore, since we have the TensorFlow Data Pipeline tf.data :)) Take care!

# I want to ask about, I have 4 Classes but my problem here I don’t have Train.Csv file that contains information of Labels Data images, how can I label the dataset I have …?

Either you have to store each class of images in a separate folder, or you need to manually create a list (csv file) that maps each image file to its label.
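If you go with the folder option, the label list can be generated rather than typed by hand. The following is a sketch using only the Python standard library; the function name `build_label_csv`, the CSV column names, and the demo folder names are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path

def build_label_csv(image_root, csv_path, extensions=(".jpg", ".jpeg", ".png")):
    """Write one 'filename,label' row per image, using the folder name as the label."""
    rows = []
    for class_dir in sorted(p for p in Path(image_root).iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in extensions:
                rows.append((str(image_path.relative_to(image_root)), class_dir.name))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)
    return rows

# Tiny demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for label in ["cat", "dog"]:
        (Path(tmp) / label).mkdir()
        (Path(tmp) / label / f"{label}_1.jpg").touch()
    rows = build_label_csv(tmp, Path(tmp) / "train.csv")
    print(rows)
```

For your 4 classes, you would create 4 folders, move each image into the folder of its class, and point `image_root` at their parent directory.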

# Thank you so much, sir! One video on “1D CNN classification problem” please!

Thank you for the comment and the request. Noted. I will prepare and upload a tutorial on “1D CNN classification problem” probably in June 2021. Please activate the channel notification. By the way, you can watch the “Conv1d: Keras 1D Convolution Model For Regression (Boston House Prices Prediction)” tutorial. Take care!

# This is a brilliant tutorial on Conv1D: Understanding tf.keras.layers https://youtu.be/WZdxt9xatrY!

Thank you!

# How to deploy seq2seq models in production eg. using flask. Can you make tutorial on deployment?

I actually have a video showing how to serve a model with FastAPI: https://youtu.be/iZaWHylSvh0 However, it is in Turkish :))

# Thank you so much. I was looking everywhere for this tutorial LSTM: Understanding Outputs (https://youtu.be/B66760rvHA8)

You are most welcome! Thank you for the comment. Take care!

# I didn’t understand this part in the SEQUENCE-TO-SEQUENCE LEARNING PART D: CODING ENCODER DECODER MODEL WITH TEACHER FORCING video:

    decoder_state_input_h = Input(shape=(LSTMoutputDimension,))
    decoder_state_input_c = Input(shape=(LSTMoutputDimension,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

As explained in the video, at each step we need to initialize the decoder LSTM with a hidden state and a cell state. Therefore, we first collect the current decoder_state_input_h and decoder_state_input_c in a list, decoder_states_inputs, as the state input for the next step. Then we initialize the LSTM with it: decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs). We also provide the encoded data to produce the next step's output. Since we have already trained the model, we can reuse its layers to create a new model for inference. That is it. You need to understand the difference between training and inference. Take care.

# decoder_lstm() ‘s initial_state={shouldn’t it be h and c state of encoder? } and second question: decoder_model = Model() ‘s [decoder_inputs] + decoder_states_inputs why is there a + sign?? both are separate things and should be multiplied with separate weight values, i.e., one goes to the embedding layer and the states should go to the state weights.

- decoder_lstm() ‘s initial_state={shouldn’t it be h and c state of encoder? }: YES
- The Keras Model class (https://keras.io/api/models/model/) accepts a list of inputs. Please notice that when we use the ‘+’ symbol with [decoder_inputs] and decoder_states_inputs, we DO NOT sum up these parameters. Both are Python lists, so ‘+’ CONCATENATES them into a single list of three input tensors, and that list is passed to the Model as its inputs. You can check this example for more explanation of multiple inputs: https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/ I hope everything is crystal clear :)
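To see what the ‘+’ symbol does here, the small sketch below replaces the Keras Input tensors with plain strings (an assumption made only so that it runs without TensorFlow). The variable names follow the code in the question.

```python
# Plain strings stand in for the Keras Input tensors named in the question.
decoder_inputs = "decoder_inputs"
decoder_states_inputs = ["decoder_state_input_h", "decoder_state_input_c"]

# '+' joins the two Python lists; nothing is added numerically.
model_inputs = [decoder_inputs] + decoder_states_inputs

print(model_inputs)
print(len(model_inputs))
```

The result is one list with three entries, which is the form of input list that the Model constructor receives in the question's code.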

# One more question, when we want to use the trained model, is there any need to have another loop? or we can input data like model.predict() and done! in the teacher force, for prediction, we have another loop ?? I am a little bit confused?

The encoder just encodes the data. The decoder decodes it and then classifies/regresses the data. Therefore, in most problem settings, you need to use them together. That is the reason why they are called encoder-decoder. However, as in an Autoencoder setup, after training the encoder-decoder, you might use just the encoder (or the decoder) to encode (or decode) data and then use the result somewhere else. But you need to know why you need this :)
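On the loop part of the question: with a model trained by teacher forcing, there is no correct output to feed the decoder at prediction time, so the usual approach is a loop in which each predicted token becomes the next decoder input. Below is a minimal sketch of such a loop. The function toy_decoder_step is a hypothetical stand-in for a call to the inference decoder model, and its arithmetic is invented only to make the example runnable.

```python
def toy_decoder_step(token, state):
    """Hypothetical stand-in for one decoder prediction: returns (next_token, new_state)."""
    new_state = state - 1
    next_token = 0 if new_state == 0 else token + 1
    return next_token, new_state


def greedy_decode(initial_state, start_token, stop_token, max_steps):
    """Feed each predicted token back in as the next input until stop_token or max_steps."""
    token, state = start_token, initial_state
    outputs = []
    for _ in range(max_steps):
        token, state = toy_decoder_step(token, state)
        if token == stop_token:
            break
        outputs.append(token)
    return outputs


# initial_state plays the role of the encoder's final states.
decoded = greedy_decode(initial_state=4, start_token=1, stop_token=0, max_steps=10)
print(decoded)
```

In a real setup, the initial state would come from the encoder and the step function would call the decoder model, but the shape of the loop stays the same.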

# Hello professor, Thank you for this clear presentation. Why did you say that Loss should be Crossentropy?

I said that, for classification tasks, most of the time in deep learning we use a loss function based on cross-entropy. I also provided links to other types of loss functions in the video and in the notebook. You can access them all here: https://keras.io/api/losses/ As you will see in the list, the other losses for classification tasks are Poisson and KL divergence. That is, 3 out of the 5 loss functions listed for classification are based on cross-entropy.
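To make the idea of a cross-entropy loss concrete, here is a hand-written sketch of categorical cross-entropy for a single sample. It is not the Keras implementation, and the small eps value that guards against log(0) is my own assumption.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 1.0, 0.0]  # the correct class is the second one

# A confident, correct prediction gives a small loss.
confident = categorical_cross_entropy(y_true, [0.05, 0.90, 0.05])

# An unsure prediction for the same sample gives a larger loss.
unsure = categorical_cross_entropy(y_true, [0.40, 0.30, 0.30])

print(confident < unsure)
```

The loss depends only on the probability assigned to the correct class, which is why it suits classification tasks.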

# Hello, Thank you for this tutorial. I have a question if we have the input of shape (None, 7,6) and output of shape (None, 2,6), Is there any need for padding (you told in the video that for ANN all the input and output should be in the same shape !)? As I see you applied padding to both x and y train in the tutorial and you only applied to mask on the input. How about my case? Input: (None,7,6) and Output:(None, 2,6). Thank you!

Hello Ehsan! As explained in the tutorial, padding is only necessary for the input, as models expect a fixed input size most of the time. Moreover, padding (combined with masking) is important for getting recurrent layers (and transformers too!) to work well on sequences of different lengths. We (mostly) do not pad the output. However, in your train set (X & y), you need to provide tensors with a fixed shape. Ragged tensors are rare in applications as far as I have seen. I hope it is clear now.

# @Murat Karakaya Akademi Thank you if there is no need for padding of output and there is an obligation of having equal size input and output, so how can we make them equal? could you please give me one example for the case of having different input and output sizes? input: (none, 7,6) and output is (none, 2,6)

I have to clarify the situation. As explained in the tutorial, in general, the input (X) and the output (y) should be fixed-size tensors. But they do not necessarily need to have equal shapes! As in your example, X’s shape could be (None, 7, 6). That means every sample in X must be a 7x6 (rank-2) tensor. Likewise, every sample in the y data set must have a shape of 2x6. You do not make the X and y shapes equal. However, (most of the time) you have to ensure that every sample in a data set (X or y) has the same tensor shape. There is a special kind of tensor, called a ragged tensor, which is the exception. Look here: https://www.tensorflow.org/guide/ragged_tensor. I hope everything is clear and simple :) Let me know.
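As a small illustration of what padding does, here is a hand-written sketch in plain Python. It is not the Keras padding utility; the function name, the choice to pad on the right, and the pad value of 0 are assumptions made for this example.

```python
def pad_to_length(sequences, max_len, pad_value=0):
    """Right-pad (or truncate) each sequence so that all have exactly max_len items."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_len])
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded


raw = [[5, 2], [7, 1, 4, 9], [3]]
fixed = pad_to_length(raw, max_len=4)
print(fixed)
```

After padding, every sample has the same length, so the data set can be stored as one fixed-shape tensor, and a mask can tell the model which positions hold the pad value.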

# This is a really amazing video: How to solve Multi-Class Classification Problems in Deep Learning with Tensorflow & Keras? (https://www.youtube.com/watch?v=t8T43ayNPTQ). The Google official tutorial pages should have links to his videos. Your tutorials are really awesome, not personal but if you could improve the accent you would have more PVs. Once again, your videos are the best on the planet.

Thank you for the nice comment and the suggestion. I’m working on my pronunciation.

# No one on YouTube can match you, sir! This channel is the best channel for ML aspirants.

Thank you so much Ablishek!

# In case if our encoder-decoder doesn’t perform well, can we create an encoder-decoder model by stacking multiple LSTM layers? If yes how would we stack?

Yes, you’re right. You can stack LSTM layers in different ways. One simple way is to stack them by connecting each hidden state at each step. You can learn how to do it in this video: https://www.youtube.com/watch?v=B66760rvHA8 and that video: https://www.youtube.com/watch?v=7L5bkMu0Pgg. Take care!

# This Encoder-Decoder model is supervised or unsupervised?

Encoder-Decoder models are supervised models since we provide the correct outputs during training.

# I train my cyclegan model and I have errors like: WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0.

Hi there! As the warning message indicates, this CycleGAN implementation uses an old version of TensorFlow: it relies on the contrib module, which is not included in TensorFlow 2.0. Thus, there could be many other issues when trying to run this CycleGAN implementation. I strongly suggest checking the documentation of the CycleGAN implementation to find the TensorFlow version used in the original code. For a quick fix, you can follow the suggestion in the warning and update the code where a dataset is accessed so that it uses the tf.data API. Good luck!

# Please, can you explain how to export Keras subclassed model (the Decoder) with 2 args as a SavedModel? I can’t save my subclass model, can you help me?

I have not tried it myself, but the official documentation offers a solution for subclassed models: https://www.tensorflow.org/guide/keras/save_and_serialize#custom_objects Good luck!

# Murat Karakaya Akademi; it is a really good example. Can you please share some details regarding Total params: 8,817, Trainable params: 8,817, Non-trainable params: 0. How we are getting these params?

The parameters are the weights and biases of the neurons. You can calculate the number of parameters from the input shape and the number of neurons in each layer. Trainable parameters are the parameters whose weight and bias values are updated during training. If you create a new layer in a model, its parameters are mostly trainable. On the other hand, if you import a pre-trained layer with its parameters (e.g. in Transfer Learning) and you want to keep these parameter values intact during training, you can set them as untrainable. Their number will then be reported as Non-trainable params. I hope it is clear now. If you want more specific information, you can watch LSTM: Understanding the Number of Parameters: https://youtu.be/B08baRr2LlY
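As a rough illustration of how such counts come about, the sketch below applies the standard formulas for a fully connected (Dense) layer and an LSTM layer. The layer sizes are hypothetical and are not the ones of the model in the question, so the total does not reproduce 8,817.

```python
def dense_params(input_dim, units):
    """One weight per input per neuron, plus one bias per neuron."""
    return (input_dim + 1) * units


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units + 1) * units)


# Hypothetical model: an LSTM with 16 units on 8 input features, then a Dense layer with 1 unit.
lstm_count = lstm_params(input_dim=8, units=16)
dense_count = dense_params(input_dim=16, units=1)
total = lstm_count + dense_count

print(lstm_count, dense_count, total)
```

Applying the same formulas layer by layer to your own model summary should let you check the reported total.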

# Sir, please create videos on the transformer (Vaswani et all). It will be really helpful for people in remote places like me…thank you

Good News: I’m working on it :) But it will take some time…

# Sir plz upload the self-attention and transformer videos too we really need it plz it’s a request.

Thank you for the comment. They are on my agenda.

# Best tutorial to learn in an orderly fashion, best from a programmer's point of view!…thank you…can you also make a tutorial on a transformer with attention(vaswani et all 2017) please.

Yes, soon :))

# Only one word it is AWESOME! WONDERFUL EXPLANATION, nice presentation very smooth sir the way you explain the flow with diagram and code is outstanding. I have been searching seq2seq for a long time and ur playlist it’s just way awesome really got the whole seq2seq. Salute!! Can u plz make videos on the NLP domain I mean transformers, Bert, and all? I would be the first viewer I assure u that. I really want to understand these things and yr EXPLANATION is awesome everyone just gives an overview but ur content is to the point. Thank u so much for uploading such wonderful content. Keep it up!

Thank you so much. I plan to prepare self-attention and multi-attention topics and then transformers. After that, I will prepare their implementation on NLP. Please keep watching :)

# I can't wait for tf.nn.conv2d explanation

Actually, I have already prepared a 3-part video series for Conv2D. Here is the link: https://youtu.be/ukZTjkkhjOk

# Thanks so much, it is really wonderful, please one video for LSTM with the same example for a regression problem

Thank you for the comment. Here it is: Conv1d: Keras 1D Convolution Model For Regression (Boston House Prices Prediction)

# Murat Karakaya Akademi thanks so much for your efforts and we are looking for LSTM and hybrid CNN and LSTM

Thank you. Noted.

# One video “CNN-1d for classification” pleasee! That’ll be so useful!

I will prepare it soon.

# @Murat Karakaya Akademi please if possible to mention how to visualize the weights of conv1d ( am using personalized weights that is why) and the data after being treated in each convolutional layer (the data filtered with the given filter in each layer)

Noted.

# Is there please any way to force Conv1D to apply a Gaussian 1d filter on my data? I do need this, please

Hello, I have new content for Conv1d: https://youtu.be/JzoIHdkFcQU Maybe it helps you.

# I just have a 1D array input. For example: x_train = [1.3, 1.5, 0, 0, 2, 1.7, 0, 0] y_train = [1, 1, 1, 1, 2, 2, 2, 2] x_train are sparse coefficients and y_train are class labels. I get this by processing some other data. I need to pass it as an input to conv1D and classify the data. How should I give the input shape of the data? How should I modify the data while I give used in model.fit(x_train, y_train, epochs =10)?

Thank you for the question. I could not fully understand the question but, as discussed in the video, you should shape your input as a 3D tensor for Conv1D: (batch size, time dimension, feature dimension). The meaning of these dimensions depends on your application and its requirements. I have prepared new content on how to use Conv1d for a regression problem. I hope it helps you: https://youtu.be/JzoIHdkFcQU

# Selam Murat bey, at first, thanks a lot for these great and easy-to-understand explanations. I was wondering why the center of the “+” sign is not yellow in Feature Map 1 and Feature Map 2, rather seems to represent some values near zero.

Thank you for the question. Actually, I tried to explain this in the video. Since we are multiplying an 8x8 matrix and moving 1 column and row at a time, around the center of the + there are overlapping areas where some parts are mostly 0, and the feature maps are just a compilation of these areas and the filter. That is the reason. If we had only 1 filter that sat perfectly on the center of the +, there would be a perfect output. However, we are moving the filter over the whole image, including the center. I hope this clarifies it.

# Also, I have the same problem in the input shape, how can I arrange the data to apply Conv1D in regression? (My data size is 478 observation with 8 features and have only one output) so what’s the input shape?

Thank you for the question. You can first decide the batch size. Let’s say GPU memory is enough and you can feed all the data at once, so the batch size will be 1. Then the number of inputs will be the number of observations, 478, and the number of features (the dimension of each input) is 8. So the input dimension would be (1, 478, 8). That is, there are 478 rows and 8 columns in 1 table. Actually, the dimensions of the input depend on your exact application needs/requirements. I hope this helps.
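The shape described above can be produced with NumPy as in the sketch below. The zero-filled array is placeholder data, and the second layout is an alternative assumption of mine (each observation treated as its own sample), shown only to illustrate that the right shape depends on the application.

```python
import numpy as np

# Placeholder data: 478 observations with 8 features each.
data = np.zeros((478, 8), dtype="float32")

# Layout from the answer: 1 sample with 478 steps and 8 features per step.
as_one_sequence = data[np.newaxis, ...]

# Alternative assumption: 478 samples, the 8 features treated as 8 steps of 1 channel.
per_observation = data[..., np.newaxis]

print(as_one_sequence.shape)
print(per_observation.shape)
```

Either array is 3D, as Conv1D expects, but the two layouts make the filter slide over different things: over observations in the first, over features in the second.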

# @Murat Karakaya Akademi thanks so much for your reply but I am still confused about my data size and how many hidden layers should add. So kindly tell the whole structure model of CNN for data size (478 and 8 features) and if the data size reaches 10000 and 8 features also what is the changes of the model will happen? thanks so much!

Hello again. I have prepared new content on how to use Conv1d for a regression problem. I hope it helps you: https://youtu.be/JzoIHdkFcQU

# @Murat Karakaya Akademi A video on classification problem in the case of CNN-1d with a visualization of the output of conv1D layer each time will be very helpful

I am considering preparing a tutorial on Conv1D for classification. Please keep following.

# Data type if we are using MaxPooling 1d could be numpy.array of shape (x,y,1) and we don't need obligatory to convert that to tensor right?

No, you do not need to convert a numpy array to a tensor before feeding it to a layer. As you see in the video, firstDay, secondDay, and log are all numpy arrays.

# Comments or Questions?

Please share your Comments or Questions.

Thank you in advance.

Take care!