Categories
Misc

Community spotlight: Fun with torchopt

Today, we want to call attention to a highly useful package in the torch ecosystem: torchopt. It extends torch by providing a set of popular optimization algorithms not available in the base library. As this post will show, it is also fun to use!

Categories
Misc

Arguably a newbie question (discussion) about epochs and training.

Imagine we have a neural network, or any model that can be trained iteratively over epochs, and consider the two training cases below:

  1. We have x amount of data and we train until we reach 0.1 loss. Let us say this took 100 epochs.
  2. We greatly increase the size of our data to 100x and train for a much smaller number of epochs (let us say 1 epoch) to reach 0.1 loss.

In terms of test performance and real-time predictions, will there be any significant difference between these two cases, given that the loss is the same (let us assume the validation and training accuracies are the same as well)?

I am sensing a contradiction between training on a small dataset for many epochs and training on a large dataset for few epochs.

In my own experience, I could not get a transformer network to converge to acceptable metrics without data augmentation while training for a low number of epochs. I have also tried training it for thousands of epochs with limited data, with no luck.

To keep the argument simple, I assume everything else stays the same between the two cases. Quite obviously, increasing the data size to 100x will not automatically result in 0.1 loss after a single epoch; models do not scale linearly like that.

submitted by /u/ege6211
[visit reddit] [comments]

Categories
Offsites

Vector-Quantized Image Modeling with Improved VQGAN

In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora.

This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer codes (represented as natural numbers), and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second stage CNN or Transformer is then trained to model the distribution of encoded latent variables. The second stage can also be applied to autoregressively generate an image after the training. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation for downstream discriminative tasks (such as image classification).

In “Vector-Quantized Image Modeling with Improved VQGAN”, we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.

Vector-Quantized Image Modeling with ViT-VQGAN
One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this that introduces an adversarial loss to promote high quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allows it to capture distant interactions using fewer layers.

In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.
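To make the quantization step concrete, here is a hedged sketch of the factorized code lookup described above; the codebook size, the random weights, and the function names are illustrative assumptions, and the real model is trained end-to-end with straight-through gradients, which this sketch omits.

```
import numpy as np

CODEBOOK_SIZE = 8192   # illustrative codebook size
ENC_DIM = 768          # ViT encoder output dimension (from the post)
CODE_DIM = 32          # low-dimensional lookup space (32 or 8 in the post)

proj = np.random.randn(ENC_DIM, CODE_DIM).astype(np.float32)            # learned linear projection
codebook = np.random.randn(CODEBOOK_SIZE, CODE_DIM).astype(np.float32)  # learned codes

def quantize(encoder_outputs):
    """Map each encoder output vector to the index of its nearest codebook entry."""
    z = encoder_outputs @ proj                                  # (num_patches, CODE_DIM)
    # Squared Euclidean distance from every patch vector to every code.
    d = (z ** 2).sum(1, keepdims=True) - 2 * z @ codebook.T + (codebook ** 2).sum(1)
    return d.argmin(axis=1)                                     # one integer token per patch

tokens = quantize(np.random.randn(32 * 32, ENC_DIM).astype(np.float32))
print(tokens.shape)   # (1024,) -- one token per 8x8 patch of a 256x256 image
```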

Overview of the proposed ViT-VQGAN (left) and VIM (right), which, when working together, is capable of both image generation and image understanding. In the first stage, ViT-VQGAN converts images into discrete integers, which the autoregressive Transformer (Stage 2) then learns to model. Finally, the Stage 1 decoder is applied to these tokens to enable generation of high quality images from scratch.

With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8×8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model.

VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat). We extend the unconditional generation to class-conditioned generation by prepending a class-ID token before the image tokens during both training and sampling.
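A hedged sketch of the class-conditioned sampling loop described in the two paragraphs above; the `transformer` callable, the sequence length, and the temperature are illustrative assumptions, not the released model.

```
import numpy as np

SEQ_LEN = 1024   # 32x32 tokens for a 256x256 image made of 8x8 patches

def sample_image_tokens(transformer, class_id, temperature=1.0):
    """Autoregressively sample image tokens, conditioned on a prepended class-ID token."""
    tokens = [class_id]                                    # class token goes first
    for _ in range(SEQ_LEN):
        logits = transformer(np.array([tokens]))[0, -1]    # logits for the next position
        scaled = logits / temperature
        probs = np.exp(scaled - np.max(scaled))            # numerically stable softmax
        probs /= probs.sum()
        tokens.append(int(np.random.choice(len(probs), p=probs)))
    # Drop the class token; the Stage 1 ViT-VQGAN decoder maps the rest back to pixels.
    return np.array(tokens[1:])
```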

Uncurated set of dog samples from class-conditioned image generation trained on ImageNet. Conditioned classes: Irish terrier, Norfolk terrier, Norwich terrier, Yorkshire terrier, wire-haired fox terrier, Lakeland terrier.

To test the image understanding capabilities of VIM, we also fine-tune a linear projection layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Similar to ImageGPT, we take a layer output at a specific block, average over the sequence of token features (frozen) and insert a softmax layer (learnable) projecting averaged features to class logits. This allows us to capture intermediate features that provide more information useful for representation learning.
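A hedged sketch of that linear-probe setup in Keras; the sequence length and feature width are illustrative assumptions, and the input stands in for the frozen intermediate-block token features of the Stage 2 Transformer.

```
import tensorflow as tf

SEQ_LEN = 1024       # number of image tokens
FEATURE_DIM = 1024   # illustrative width of the chosen Transformer block
NUM_CLASSES = 1000   # ImageNet

# Input: precomputed (frozen) token features from one intermediate block.
token_features = tf.keras.Input(shape=(SEQ_LEN, FEATURE_DIM))
pooled = tf.keras.layers.GlobalAveragePooling1D()(token_features)           # average over the sequence
logits = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(pooled)   # the only learnable part

probe = tf.keras.Model(token_features, logits)
probe.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```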

Experimental Results
We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 CloudTPUv4 cores. All models are trained with an input image resolution of 256×256. On top of the pre-learned ViT-VQGAN image quantizer, we train Transformer models for unconditional and class-conditioned image synthesis and compare with previous work.

We measure the performance of our proposed methods for class-conditioned image synthesis and unsupervised representation learning on the widely used ImageNet benchmark. In the table below we demonstrate the class-conditioned image synthesis performance measured by the Fréchet Inception Distance (FID). Compared to prior work, VIM improves the FID to 3.07 (lower is better), a relative improvement of 58.6% over the VQGAN model (FID 7.35). VIM also improves the capacity for image understanding, as indicated by the Inception Score (IS), which goes from 188.6 to 227.4, a 20.6% improvement relative to VQGAN.
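For reference, FID is the standard Fréchet distance between Gaussians fitted to Inception features of real and generated images (Heusel et al., 2017):

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the feature means and covariances for real and generated images; lower values mean the generated images' feature statistics are closer to the real data.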

Model               Acceptance Rate    FID      IS
Validation data     1.0                1.62     235.0
DCTransformer       1.0                36.5     N/A
BigGAN              1.0                7.53     168.6
BigGAN-deep         1.0                6.84     203.6
IDDPM               1.0                12.3     N/A
ADM-G, 1.0 guid.    1.0                4.59     186.7
VQVAE-2             1.0                ~31      ~45
VQGAN               1.0                17.04    70.6
VQGAN               0.5                10.26    125.5
VQGAN               0.25               7.35     188.6
ViT-VQGAN (Ours)    1.0                4.17     175.1
ViT-VQGAN (Ours)    0.5                3.04     227.4

Fréchet Inception Distance (FID) comparison between different models for class-conditional image synthesis and Inception Score (IS) for image understanding, both on ImageNet with resolution 256×256. The acceptance rate shows results filtered by a ResNet-101 classification model, similar to the process in VQGAN.

After training a generative model, we test the learned image representations by fine-tuning a linear layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Our model outperforms previous generative models on the image understanding task, improving classification accuracy through linear probing (i.e., training a single linear classification layer, while keeping the rest of the model frozen) from 60.3% (iGPT-L) to 73.2%. These results showcase VIM’s strong generation results as well as image representation learning abilities.

Conclusion
We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. We hope our results can inspire future work towards more unified approaches for image generation and understanding.

Acknowledgements
We would like to thank Xin Li, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu for the preparation of the VIM paper. We thank Wei Han, Yuan Cao, Jiquan Ngiam‎, Vijay Vasudevan, Zhifeng Chen and Claire Cui for helpful discussions and feedback, and others on the Google Research and Brain Team for support throughout this project.

Categories
Misc

using Resnet (pretrained) with keras functional API (Transfer learning)

I am using the Keras functional API for multiple streams of inputs.

I want to use a pretrained ResNet50 in between my layers, and I would like to freeze the ResNet layers during training, i.e., transfer learning.

How can I do it?

```
CNN_Input = Input(shape=INPUT_SHAPE)

resnet = tf.keras.applications.ResNet50(
    include_top=False,
    input_shape=None,
    pooling='avg',
    classes=NUM_CLASSES,
    weights='imagenet')(CNN_Input)

# -------------- BELOW TWO LINES ARE CAUSING ERRORS --------------
# I assumed we do it the same way we did for a Sequential model
for layer in resnet.layers:
    layer.trainable = False

CNN = Flatten()(resnet)
CNN = BatchNormalization()(CNN)
CNN = Dropout(0.2)(CNN)
CNN = Dense(1024, activation='relu')(CNN)
CNN = Dense(512, activation='relu')(CNN)
CNN = Dense(256, activation='relu')(CNN)
CNN = BatchNormalization()(CNN)
CNN = Dropout(0.2)(CNN)
CNN = Dense(64, activation='relu')(CNN)

CAT_Input = Input(shape=(3,))

CAT = Dense(32, activation='relu')(CAT_Input)
CAT = Dropout(0.2)(CAT)
CAT = Dense(64, activation='relu')(CAT)

merge = concatenate([CNN, CAT])
hidden = Dense(64, activation='relu')(merge)
hidden = Dense(64, activation='relu')(hidden)
hidden = Dense(32, activation='relu')(hidden)
output = Dense(NUM_CLASSES, activation='softmax')(hidden)

model = Model(inputs=[CNN_Input, CAT_Input], outputs=output)
print(model.summary())
```
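For what it is worth, one common pattern (a minimal, hedged sketch, not the only way) is to instantiate the ResNet50 base as its own model object, freeze it, and only then call it on the input tensor. The error above comes from the fact that `ResNet50(...)(CNN_Input)` returns a Keras tensor, which has no `.layers` attribute; also, the `classes` argument only matters when `include_top=True`. The input shape and class count below are hypothetical placeholders.

```
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

INPUT_SHAPE = (224, 224, 3)   # hypothetical; use your own input shape
NUM_CLASSES = 10              # hypothetical

# Build the pretrained base as a standalone model and freeze it.
resnet_base = tf.keras.applications.ResNet50(
    include_top=False, weights='imagenet', pooling='avg')
resnet_base.trainable = False        # freezes every layer in the base
# equivalently: for layer in resnet_base.layers: layer.trainable = False

CNN_Input = Input(shape=INPUT_SHAPE)
features = resnet_base(CNN_Input, training=False)   # keep BatchNorm layers in inference mode
output = Dense(NUM_CLASSES, activation='softmax')(features)

model = Model(inputs=CNN_Input, outputs=output)
model.summary()
```

The rest of the CNN/CAT branches can then be attached to `features` exactly as in the original snippet.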

submitted by /u/im-AMS
[visit reddit] [comments]

Categories
Misc

Customize TF object detection API

Let's say I have an object detector from the TF Object Detection API. Not much customization can be done to the detector itself, but I want to add a branch that runs an LSTM on the detected images to generate a description. Is there a way to do this in a single architecture, or is the only way to train the detector first and then separately train the LSTM on the detected images (two stages)?
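For context, here is a hedged sketch of the two-stage variant only; nothing in it is part of the Object Detection API, and the encoder choice, shapes, and vocabulary size are illustrative assumptions. A frozen CNN encodes each crop produced by the already-trained detector, and an LSTM decoder, trained separately with teacher forcing, generates the description.

```
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000   # hypothetical caption vocabulary
MAX_LEN = 20        # hypothetical maximum caption length

# Frozen image encoder applied to crops coming out of the detector.
encoder = tf.keras.applications.MobileNetV2(
    include_top=False, pooling='avg', weights='imagenet')
encoder.trainable = False

crop_in = layers.Input(shape=(224, 224, 3))                 # a detected region, resized
caption_in = layers.Input(shape=(MAX_LEN,), dtype='int32')  # teacher-forced token IDs

img_feat = layers.Dense(256, activation='relu')(encoder(crop_in, training=False))
tok_emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_in)
# Initialize the LSTM state from the image features, then decode the caption.
lstm_out = layers.LSTM(256, return_sequences=True)(
    tok_emb, initial_state=[img_feat, img_feat])
word_logits = layers.Dense(VOCAB_SIZE)(lstm_out)

captioner = Model([crop_in, caption_in], word_logits)
```

Whether this can be fused into a single trainable graph with an Object Detection API model is a separate question; the exported detector SavedModels are not drop-in Keras layers, which is why the two-stage route is the usual one.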

submitted by /u/giakou4
[visit reddit] [comments]

Categories
Misc

TensorFlow Introduces A New On-Device Embedding-based Search Library That Allows Finding Similar Images, Text or Audio From Millions of Data Samples in a Few Milliseconds

A new on-device embedding-based search library has been announced that lets people find similar images, text, or audio in just a few milliseconds from millions of data samples.

It works by embedding the search query into a high-dimensional vector that semantically represents what the query means. It then uses ScaNN (Scalable Nearest Neighbors) to search for similar items in a pre-built database. To use it with your own dataset, you build a custom TFLite Searcher model with the Model Maker Searcher API (tutorial) and then deploy it to devices with the Task Library Searcher API (vision/text).
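To make the idea concrete, here is a hedged, brute-force sketch of embedding-based search in plain NumPy. It is not the ScaNN or TFLite Searcher API; it only illustrates what the library does far more efficiently with an approximate index. The database and query embeddings are random stand-ins.

```
import numpy as np

# Pre-built "database": one embedding per indexed item (random values as stand-ins).
db = np.random.rand(100_000, 128).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

def search(query_embedding, k=5):
    """Exact cosine-similarity search; ScaNN replaces this with a fast approximate index."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = db @ q                              # cosine similarity (rows are unit-norm)
    top_k = np.argpartition(-scores, k)[:k]      # k best candidates, unordered
    return top_k[np.argsort(-scores[top_k])]     # ...sorted best-first

query = np.random.rand(128).astype(np.float32)   # embedding of the query image/text/audio
print(search(query, k=5))
```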

https://i.redd.it/vn0r4rvhu5091.gif

submitted by /u/No_Coffee_4638
[visit reddit] [comments]

Categories
Misc

Using tensorflow without keras

If I don't need to define custom layers, is it still common practice to use TensorFlow (2.0) without the Keras API, or is everyone just using Keras now? Is there a scenario where it makes sense to use TF without Keras if I'm not defining custom layers or anything?

EDIT: I don't even see any tutorials in the tensorflow.org docs that define custom layers without Keras. Is it even possible to build a simple model with TF 2.0 without using Keras?
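It is possible; here is a minimal, hedged sketch of a training loop that uses nothing but tf.Variable and tf.GradientTape, with no Keras layers, losses, or optimizers. The toy data and learning rate are made up for illustration.

```
import tensorflow as tf

# Toy data: y = 3x + 2 plus a little noise.
x = tf.random.normal([256, 1])
y = 3.0 * x + 2.0 + 0.1 * tf.random.normal([256, 1])

# Model parameters as plain variables.
w = tf.Variable(tf.random.normal([1, 1]))
b = tf.Variable(tf.zeros([1]))

@tf.function
def train_step(x_batch, y_batch, lr=0.1):
    with tf.GradientTape() as tape:
        pred = tf.matmul(x_batch, w) + b
        loss = tf.reduce_mean(tf.square(pred - y_batch))
    dw, db = tape.gradient(loss, [w, b])
    w.assign_sub(lr * dw)      # manual gradient descent, no Keras optimizer
    b.assign_sub(lr * db)
    return loss

for step in range(200):
    loss = train_step(x, y)
print(float(loss), w.numpy(), b.numpy())
```

That said, most TF 2.x tutorials route model building through Keras simply because it covers the common cases with much less code.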

submitted by /u/berimbolo21
[visit reddit] [comments]

Categories
Misc

Model predictions are shifted

Hey, I have trained an image classification model to classify plant diseases. Training went well: when testing the model with evaluate() it gives 90% accuracy, and it was all rainbows until I started predicting on single images.

I have 38 classes. The problem is that if I test the model on an image from class 1, it predicts class 10;

test it on an image from class 2 and it predicts class 11; class 3 becomes class 12, class 4 becomes class 13, and so on, as if the labels are shifted linearly.

Most images from any class have their labels shifted to another class. I don't know what's going on here; any help would be appreciated.

Code for reading the image and feeding it to the model:

img = np.resize(img, (256, 256, 3))   # Preprocessing the image
img = image.img_to_array(img)
# x = np.true_divide(x, 255)
img = np.expand_dims(img, axis=0)
img = img / 255
prediction = model.predict(img)
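A hedged guess at the cause, since the training code is not shown: this kind of consistent shift often comes from a mismatch between the class ordering used during training and the ordering assumed at prediction time. Directory-based loaders such as flow_from_directory sort class names as strings, so '10' comes before '2'. Continuing from the snippet above, with a hypothetical train_generator:

```
# Hypothetical check, assuming the data was loaded with
# ImageDataGenerator.flow_from_directory (folder names become class labels,
# sorted alphabetically as strings).
class_indices = train_generator.class_indices      # e.g. {'1': 0, '10': 1, '11': 2, ...}
index_to_class = {v: k for k, v in class_indices.items()}

pred_index = int(np.argmax(prediction, axis=1)[0])
print("Predicted class:", index_to_class[pred_index])
```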

submitted by /u/Ali99695
[visit reddit] [comments]

Categories
Misc

Getting a TypeError when using cuda, but not when using cpu

I'm trying to import a lot of images into TensorFlow to train a Keras model to do some image enhancement for me.

When disabling my GPU with:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

the code runs absolutely fine, but the training takes ages. But when I enable my RTX 3060, with CUDA v11.7 installed properly, I get this error:

TypeError: Input 'filename' of 'ReadFile' Op has type float32 that does not match expected type of string.

Why? How can switching from CPU to GPU change the datatype?

The failing code is this:

def read_image(image_path):
    image = tf.io.read_file(image_path)   # <-- this line
    image = tf.image.decode_png(image, channels=3)
    image.set_shape([None, None, 3])
    image = tf.cast(image, dtype=tf.float32) / 255.0
    return image

def random_crop(low_image, enhanced_image):
    low_image_shape = tf.shape(low_image)[:2]
    low_w = tf.random.uniform(
        shape=(), maxval=low_image_shape[1] - IMAGE_SIZE + 1, dtype=tf.int32
    )
    low_h = tf.random.uniform(
        shape=(), maxval=low_image_shape[0] - IMAGE_SIZE + 1, dtype=tf.int32
    )
    enhanced_w = low_w
    enhanced_h = low_h
    low_image_cropped = low_image[
        low_h : low_h + IMAGE_SIZE, low_w : low_w + IMAGE_SIZE
    ]
    enhanced_image_cropped = enhanced_image[
        enhanced_h : enhanced_h + IMAGE_SIZE, enhanced_w : enhanced_w + IMAGE_SIZE
    ]
    return low_image_cropped, enhanced_image_cropped

def load_data(low_light_image_path, enhanced_image_path):
    low_light_image = read_image(low_light_image_path)   # <-- and this line
    enhanced_image = read_image(enhanced_image_path)
    low_light_image, enhanced_image = random_crop(low_light_image, enhanced_image)
    return low_light_image, enhanced_image

def get_dataset(low_light_images, enhanced_images):
    dataset = tf.data.Dataset.from_tensor_slices((low_light_images, enhanced_images))
    dataset = dataset.map(load_data, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    return dataset

train_low_light_images = sorted(glob("./lol_dataset/our485/low/*"))[:MAX_TRAIN_IMAGES]
train_enhanced_images = sorted(glob("./lol_dataset/our485/high/*"))[:MAX_TRAIN_IMAGES]

And no, this is not my code; it is grabbed from a GitHub repository. It's a proof of concept, to see whether I can do what I want it to do. I'll rewrite it if I decide to go forward with Keras.
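A hedged suggestion for where to look (not a confirmed fix): tf.io.read_file only accepts string tensors, so it is worth checking what dtype actually enters the pipeline in the GPU environment. In particular, if the glob matches no files there, from_tensor_slices builds a float32 tensor from the empty list, which produces exactly this TypeError.

```
import tensorflow as tf
from glob import glob

paths = sorted(glob("./lol_dataset/our485/low/*"))
print(len(paths))               # if this prints 0, the file paths never made it in

dataset = tf.data.Dataset.from_tensor_slices((paths, paths))
print(dataset.element_spec)     # both components should show dtype=tf.string
```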

submitted by /u/lynet_101
[visit reddit] [comments]