Hi Reddit
I’m comparing two input pipelines. One is built with tf.keras.utils.image_dataset_from_directory and the other is built “manually” by reading files from a list with tf.data.Dataset.from_tensor_slices. My first intuition was that the tf.data.Dataset.from_tensor_slices pipeline should be faster, as demonstrated here.
But this is not the case. image_dataset_from_directory is approximately 6x faster for batches of 32 to 128 images. The performance gap is similar on Colab and on my local machine (run from PyCharm).
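In case you want to reproduce the comparison, a minimal timing loop like the one below is enough (a sketch; benchmark and num_batches are just illustrative names, not part of the pipeline code):

    import time

    def benchmark(ds, num_batches=100):
        # pull one batch first so one-time costs (graph tracing, file listing)
        # do not skew the timing
        it = iter(ds)
        next(it)
        start = time.perf_counter()
        for _ in range(num_batches - 1):
            next(it)
        elapsed = time.perf_counter() - start
        print(f"{num_batches - 1} batches in {elapsed:.2f} s "
              f"({(num_batches - 1) / elapsed:.1f} batches/s)")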
So far, I have tried to avoid the “zip” of the two datasets by having read_image output both the image and the label at once (a sketch of that variant is at the end of the post). It did not change anything.
Can you help me build a decent input pipeline with tf.data.Dataset.from_tensor_slices? I would like to work with a huge dataset to train a GAN, and I do not want to lose time on data loading. Did I code something wrong, or are the tests from here outdated?
To be pragmatic, I will use the faster of the two approaches. But as an exercise, I would like to know whether my input pipeline with tf.data.Dataset.from_tensor_slices is OK.
Here is the code. data_augmentation_train is a Sequential model of augmentation layers (the same in both approaches).
=================================
Approach n°1: tf.keras.utils.image_dataset_from_directory
=================================

    AUTOTUNE = tf.data.AUTOTUNE

    train_ds = tf.keras.utils.image_dataset_from_directory(
        trainFolder,
        validation_split=0.2,
        subset="training",
        seed=123,
        image_size=(img_height, img_width),
        batch_size=batch_size)
    class_names = train_ds.class_names
    print(class_names)

    train_ds = train_ds.cache()
    train_ds = train_ds.shuffle(1000)
    train_ds = train_ds.map(
        lambda x, y: (data_augmentation_train(x, training=True), y),
        num_parallel_calls=AUTOTUNE)
    # assign the result, otherwise prefetch() is a no-op
    train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)
=======================================
Approach n°2: tf.data.Dataset.from_tensor_slices
=======================================

    def read_image(filename):
        image = tf.io.read_file(filename)
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, [img_height, img_width])
        return image

    def configure_dataset(filenames, labels, augmentation=False):
        dsfilename = tf.data.Dataset.from_tensor_slices(filenames)
        dsfile = dsfilename.map(read_image, num_parallel_calls=AUTOTUNE)
        if augmentation:
            dsfile = dsfile.map(lambda x: data_augmentation_train(x, training=True))
        dslabels = tf.data.Dataset.from_tensor_slices(labels)
        ds = tf.data.Dataset.zip((dsfile, dslabels))
        ds = ds.shuffle(buffer_size=1000)
        ds = ds.batch(batch_size)
        ds = ds.prefetch(buffer_size=AUTOTUNE)
        return ds

    filenames, labels, class_names = readFilesAndLabels(trainFolder)
    ds = configure_dataset(filenames, labels, augmentation=True)
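For completeness, the zip-free variant I mentioned above looked roughly like this (a simplified sketch; read_image_and_label and configure_dataset_no_zip are just illustrative names):

    def read_image_and_label(filename, label):
        # same decoding/resizing as read_image, but the label is passed through,
        # so the image and label datasets never need to be zipped
        image = tf.io.read_file(filename)
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, [img_height, img_width])
        return image, label

    def configure_dataset_no_zip(filenames, labels, augmentation=False):
        ds = tf.data.Dataset.from_tensor_slices((filenames, labels))
        ds = ds.map(read_image_and_label, num_parallel_calls=AUTOTUNE)
        if augmentation:
            ds = ds.map(lambda x, y: (data_augmentation_train(x, training=True), y))
        ds = ds.shuffle(buffer_size=1000)
        ds = ds.batch(batch_size)
        ds = ds.prefetch(buffer_size=AUTOTUNE)
        return ds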