Building an app for school, we chose to identify fish species. I manually downloaded images for 2 species but it took forever. Do you guys know of any site that has a bunch of pictures of a specific fish species? I know it’s a long shot but it’s worth asking. thanks!
Hello, I am planning a project with TensorFlow, and looking through the possible directions to take it, there are a number of considerations. I have done some basic guided projects, but I am looking to create a complete study from start to finish on my own. I would like to hear from anyone who has completed their own research projects about their favorite or preferred libraries. Any ideas or information is greatly appreciated, thank you.
I am attempting to create an image classifier with VGG16 transfer learning using the ImageFolder dataset builder. The data is successfully built from a directory using this code:
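A minimal sketch of the sort of ImageFolder build involved (assuming the tfds.ImageFolder builder; the path and split are placeholders, not the original code):

import tensorflow_datasets as tfds

builder = tfds.ImageFolder('fish_images/')   # placeholder path; one subfolder per class
ds = builder.as_dataset(split='train', shuffle_files=True)  # elements are dicts with 'image', 'image/filename', 'label'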
ValueError: Missing data for input "vgg16_input". You passed a data dictionary with keys ['image', 'image/filename', 'label']. Expected the following keys: ['vgg16_input']
Due to this error, I have attempted to index the ds variable with ['image'] and ['label'], but to no avail. How can I proceed in using the ImageFolder dataset as training material for my VGG16 transfer learning CNN?
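One possible direction, sketched under the assumption that the dataset comes from tfds.ImageFolder: request (image, label) tuples with as_supervised=True and resize the images to VGG16's expected input size before passing the dataset to model.fit.

import tensorflow as tf

ds = builder.as_dataset(split='train', as_supervised=True)  # yields (image, label) tuples instead of dicts
ds = ds.map(lambda image, label: (tf.image.resize(tf.cast(image, tf.float32), (224, 224)), label))
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)
# model.fit(ds, epochs=...)  # the VGG16 model itself is not shown in the post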
If the answer is obvious, please forgive me. This is my first time attempting to create a CNN with Tensorflow.
Use the high-level nvCOMP API for easy compression and decompression and the low-level API for more advanced workflows.
Compression can improve performance in a variety of use cases such as DL workloads, databases, and general HPC. On the GPU, compression can accelerate inter-GPU communications for collaborative workflows. It can increase the size of datasets that a single GPU can handle by compressing data before it’s stored to global memory. It can also accelerate the data link between the CPU and GPU.
For any of these workflows to be beneficial, compression and decompression must be fast and operate at a high enough compression ratio on a given dataset to be useful. However, compression ratios and throughputs of different algorithms vary widely from dataset to dataset. It can be difficult to select the best one without a lot of specialized knowledge about the algorithms and data statistics.
The NVIDIA nvCOMP library enables you to incorporate high-performance GPU compression and decompression in your applications. The library provides a set of unified APIs that allow you to quickly swap compression formats to achieve best performance on your datasets with minimal changes to code.
With nvCOMP, you can quickly and easily experiment with different algorithms to find the one with the best performance for your use case. In recent releases, we’ve updated nvCOMP to further improve and unify the interfaces. As of the newly released version 2.2, we provide an easy-to-use, high-level C++ API and a versatile low-level batch C API. In this post, we cover both interfaces in detail. You also learn how to use them effectively and when you should choose one over the other.
High-level API
The high-level API is easier to use and abstracts the work of exposing parallelism to the GPU. It is most useful when you have to compress a contiguous buffer into a contiguous, compressed buffer. This works well, for example, when compressing a buffer before sending it over a network or saving it to disk.
The following examples use the high throughput GDeflate compression format. GDeflate is deflate-like and can be mapped efficiently to data parallel architectures, such as GPUs. It is a good starting point if you don't have constraints on the compression format to use.
The high-level interface is a C++ API based on the nvcompManagerBase class hierarchy. Each derived Manager class is declared in its associated header in nvcomp/include. For example, the GdeflateManager used in this post is declared in nvcomp/include/gdeflate.hpp.
To get started, construct the desired Manager class. Each Manager constructor has a unique set of arguments; however, a few arguments are generally shared. All subclasses allow construction with a specified stream ID to use for all kernels and memory transfers. You can also specify the device ID to use. If you don’t specify values for these two arguments, the default stream and device are used.
Another common input is the uncompressed chunk size. This is used during compression to split the buffer into independent chunks for processing. Larger chunk sizes typically lead to higher compression ratios at the expense of less parallelism exposed to the GPU. A good starting chunk size is 64 KB, but feel free to experiment with these values to explore the associated tradeoffs for your datasets.
The Manager classes are also constructed with format-specific arguments. You can check the associated header in nvcomp/include for a description of the arguments to the Manager class constructor and to see how to construct the Manager object for your chosen format.
const size_t uncomp_chunk_size = 64 * 1024;
cudaStream_t stream;
cudaStreamCreate(&stream);
const int gdeflate_algorithm = 0; // Use standard GDeflate
const int device_id = 0; // Use the default device
GdeflateManager gdeflate_manager{uncomp_chunk_size, gdeflate_algorithm, stream, device_id};
nvcompManager requires a temporary scratch workspace to do compression and decompression. This required scratch space is of fixed size based on the particular compression format arguments and the maximum occupancy of the compression and decompression kernels. If it makes sense for your use case, you can provide a scratch buffer to the nvcompManager object after construction, using set_scratch_buffer.
Manually setting the scratch buffer may be desirable to control the memory allocation scheme used for this allocation. If you’re OK with the default, we suggest skipping this step and enabling the nvcompManager object to handle the allocation.
This buffer is reused for all compression and decompression operations that nvcompManager performs. If the nvcompManager object allocates the scratch buffer, it is freed when the object is destroyed.
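As a rough sketch (assuming the get_required_scratch_buffer_size and set_scratch_buffer accessors from the 2.2 headers; check them in nvcomp/include), providing your own scratch buffer might look like this:

// Query the fixed scratch size and hand the manager a user-owned buffer.
size_t scratch_size = gdeflate_manager.get_required_scratch_buffer_size();
uint8_t* scratch_buffer;
cudaMalloc(&scratch_buffer, scratch_size);
gdeflate_manager.set_scratch_buffer(scratch_buffer);
// ... compress / decompress ...
// A user-provided buffer is not freed by the manager, so release it yourself.
cudaFree(scratch_buffer);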
Compression
Now you’re ready to compress a buffer. First, configure the compression using the configure_compression API. This asynchronous operation returns a CompressionConfig object.
The configuration step only requires the size of the uncompressed input buffer. You must allocate a GPU-accessible memory buffer of at least the maximum compressed size reported in the returned CompressionConfig to serve as the result buffer for the compression routine. With this information, compression can be performed, as shown in the following code example:
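A sketch of that flow, assuming the configure_compression and compress signatures from the 2.2 high-level headers (input_buffer_len and device_input_ptr stand in for your uncompressed data):

// Configure compression for an uncompressed buffer of input_buffer_len bytes.
CompressionConfig comp_config = gdeflate_manager.configure_compression(input_buffer_len);

// Allocate a result buffer of at least the reported maximum compressed size.
uint8_t* comp_buffer;
cudaMallocAsync(&comp_buffer, comp_config.max_compressed_buffer_size, stream);

// Compress asynchronously on the manager's stream.
gdeflate_manager.compress(device_input_ptr, comp_buffer, comp_config);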
The buffer that results from high-level interface compression includes a header before the compressed data (Figure 1). This header records how the buffer was compressed, so you can construct an nvcompManager object directly from a compressed buffer and decompress it without knowing which format or settings produced it.
Figure 1. HLIF compressed data format
To do this, use the create_manager API declared in nvcompManagerFactory.hpp. This synchronous API takes as input the compressed buffer along with optional stream and device IDs.
auto decomp_nvcomp_manager = create_manager(comp_buffer, stream);
If you already have the information about how the buffer was compressed, you can construct a new manager using that configuration as described earlier. You can also reuse the same nvcompManager object that was used for compression to perform decompression. These approaches have the advantage that they don’t require synchronizing the stream.
Given an nvcompManager object and a compressed buffer, decompression is performed similarly to compression with a couple of minor differences. For one, there are two possible ways to do the decompression configuration. If you have the CompressionConfig object used for the compression, you can configure the decompression completely asynchronously.
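Sketched with the same assumptions as the compression example above, the fully asynchronous path using the saved CompressionConfig might look like this:

// Configure decompression from the CompressionConfig saved at compression time.
DecompressionConfig decomp_config = gdeflate_manager.configure_decompression(comp_config);

// Allocate the output buffer from the configured decompressed size.
uint8_t* decomp_buffer;
cudaMallocAsync(&decomp_buffer, decomp_config.decomp_data_size, stream);

// Decompress asynchronously on the manager's stream.
gdeflate_manager.decompress(decomp_buffer, comp_buffer, decomp_config);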
One example use case for this API is in the training of large neural networks. The size of the neural network or the size of the training set that you can use is limited based on the memory capacity of the GPU. Using compression, you can effectively increase this capacity without having to offload data to the CPU or use multiple GPUs.
Specifically, backpropagation-based training involves computing activation maps during the forward pass and then reusing them in the computation of the backward pass. These activation maps are large and relatively sparse, making them good fits for compression. Use the gdeflate_manager to compress the maps and hold in memory the compressed buffers and the CompressionConfig objects from each layer of the network. This enables fully asynchronous backpropagation, including decompression.
You can also configure the decompression using the compressed buffer if you don't have the CompressionConfig object that was used. This is a synchronous operation: it must perform a cudaMemcpyAsync from the device and wait for it to complete. All synchronization is limited to the stream specified in the nvcompManager constructor and is not device-wide.
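Assuming the buffer-based overload behaves as described, this variant is a one-liner:

// Reads the header embedded in comp_buffer; synchronizes the manager's stream.
DecompressionConfig decomp_config = gdeflate_manager.configure_decompression(comp_buffer);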
Finally, there are two types of error checking in the high-level API: std::runtime_error exceptions and checking the nvcompStatus_t value.
If any CUDA APIs fail, these raise std::runtime_error exceptions. You can catch these in your application or leave them unhandled, in which case your application fails with a descriptive error message of what went wrong. This can happen if, for example, the output buffer that you provided was of insufficient size or wasn’t accessible on the GPU.
The second form of error-checking is to check the nvcompStatus_t value in the CompressionConfig or DecompressionConfig object. This status is set during the associated kernel call. Corrupt input buffers and other errors trigger it.
Low-level API
The low-level API provides a C interface for more advanced workflows. It compresses or decompresses batches of independent chunks that you provide. It's up to you to chunk the data and to provide enough chunks to exploit the GPU's parallel processing capabilities.
This is the most efficient way to process the data if you have many independent, discontiguous buffers. The low-level API avoids the overhead of packing the resulting compressed chunks into a single contiguous compressed buffer. It also avoids the compression-ratio overhead of storing metadata about how the buffer was compressed, as the high-level API does.
This workflow fits well with database applications, for example, where you tend to have many independent columns to compress or decompress. This API is used in RAPIDS and in the NVIDIA Spark implementation.
Compression
For compression in the low-level API, you must allocate a temporary scratch buffer. The temporary buffer is similar to that described in the high-level API. However, the buffer size is dependent on the size of the input buffer so it must be redefined and possibly reallocated with each new set of user inputs.
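For example, sizing and allocating the workspace for a GDeflate batch might look like this (a sketch assuming chunk_size and batch_size are already defined and using the batch GetTempSize entry point):

// Query and allocate the temporary workspace for this batch of chunks.
size_t temp_bytes;
nvcompBatchedGdeflateCompressGetTempSize(batch_size, chunk_size, nvcompBatchedGdeflateDefaultOpts, &temp_bytes);

void* device_temp_ptr;
cudaMalloc(&device_temp_ptr, temp_bytes);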
Next, the maximum size of a compressed chunk in the batch should be computed. This allows you to allocate a collection of result buffers. In the following example, batch_size is the number of chunks to process. The device array of result pointers is constructed in pinned host memory before copying to the device.
size_t max_out_bytes;
nvcompBatchedGdeflateCompressGetMaxOutputChunkSize(chunk_size, nvcompBatchedGdeflateDefaultOpts, &max_out_bytes);
// Allocate output space on the device
void ** host_compressed_ptrs;
cudaMallocHost((void**)&host_compressed_ptrs, sizeof(size_t) * batch_size);
for(size_t ix_chunk = 0; ix_chunk < batch_size; ++ix_chunk) {
    cudaMalloc(&host_compressed_ptrs[ix_chunk], max_out_bytes);
}
With all these inputs computed, you can now do compression asynchronously as shown.
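A sketch of that call, assuming device_uncompressed_ptrs and device_uncompressed_bytes have already been populated on the device, device_compressed_ptrs is the device-side copy of the pinned pointer array built above, and device_compressed_bytes is a device array that receives the compressed size of each chunk (check the argument order against gdeflate.h):

nvcompBatchedGdeflateCompressAsync(
    device_uncompressed_ptrs,
    device_uncompressed_bytes,
    chunk_size,                       // maximum uncompressed chunk size
    batch_size,
    device_temp_ptr,
    temp_bytes,
    device_compressed_ptrs,
    device_compressed_bytes,
    nvcompBatchedGdeflateDefaultOpts,
    stream);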
To begin work towards decompression, pre-compute the decompressed sizes based on the compressed buffer. If you already have this information, skip this step.
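For GDeflate, that precomputation might look like the following (a sketch assuming the GetDecompressSizeAsync entry point; device_uncompressed_bytes is a device array of batch_size entries to be filled in):

// Compute the decompressed size of each chunk from the compressed data.
nvcompBatchedGdeflateGetDecompressSizeAsync(
    device_compressed_ptrs,
    device_compressed_bytes,
    device_uncompressed_bytes,
    batch_size,
    stream);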
nvCOMP provides a set of benchmarks for each of the formats in both the low-level and high-level interfaces. Figure 2 compares the performance of the high-level and low-level APIs on a few different datasets with large contiguous buffers. The results were collected on an NVIDIA A100 GPU.
Figure 2a. Decompression throughputs for various datasets.
Figure 2b. Compression ratios for various datasets.
Figure 2c. Compression throughputs for various datasets.
As you can see from the results, the difference in performance between the low- and high-level APIs is negligible when working with large contiguous buffers. The choice of which to use then comes down to your use case. Use the low-level API if you have many small buffers or to avoid the memory footprint associated with the high-level API.
Figure 3 shows performance across different buffer sizes in log-scale. To produce these results, the mortgage-int dataset presented as part of Figure 2 was split into many batches of batchSize as shown. The file is over 314 MB. For the 1 MB batch size, 315 compression and decompression operations are performed. At a 400 MB batch size, a single compression and decompression operation is performed.
Batching the data in this way has little effect on the low-level batch API, which already processes its input as independent chunks.
Figure 3a. Compression throughputs for various batch sizes operating on a 314 MB file.
Figure 3b. Decompression throughputs for various batch sizes operating on a 314 MB file.
As demonstrated, the performance of the high-level interface degrades heavily for small batch sizes. This shows the utility of using the low-level batch API when compressing or decompressing many smaller buffers. The low-level batch API can do the operations using fewer, higher-occupancy kernels, while the high-level API requires many small kernel launches with associated tail effects and occupancy concerns.
We include benchmark applications with the library so that you can try out different compression formats and see which works best on your data. The provided benchmarks are benchmark_hlif and benchmark_<format>_chunked. For more information, see the nvCOMP README.
Summary
Now you’ve learned how to use the high-level nvCOMP API for easy compression and decompression. You’ve learned when it may be better to use the low-level API as well as how to use it.
For more information, see the latest version of the NVIDIA/nvcomp GitHub repo. For fully worked, compilable examples that you can adapt to your use cases, see the lowlevel_c_quickstart.md and highlevel_cpp_quickstart.md walkthroughs along with the associated example files.
This is an NLP and Python based project. We are trying to achieve something new, and the project is almost there; the only thing left is connecting the files. 😌
Hey everyone, I have been experimenting with object detection using Python, OpenCV, and TensorFlow, but I keep getting this error. P.S. Both the code and "myData" are in the same folder.
the code:
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from keras.utils.np_utils import to_categorical
from keras.layers import Dropout, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
import cv2
from sklearn.model_selection import train_test_split
import pickle
import os
import pandas as pd
import random
from keras.preprocessing.image import ImageDataGenerator
########### Parameters
path = "myData"           # folder with all the class folders
labelFile = 'labels.csv'  # file with all names of classes
batch_size_val = 50       # how many to process together
steps_per_epoch_val = 2000
epochs_val = 10
imageDimesions = (32, 32, 3)
testRatio = 0.2           # if 1000 images, 200 will be used for testing
validationRatio = 0.2     # if 1000 images, 20% of remaining 800 will be 160 for validation
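# NOTE: the post does not show how the images are loaded or split; the lines below are a
# minimal sketch (assuming one numbered subfolder per class inside "myData") so that the
# later shape checks have something to work with -- adapt to your actual layout.
images = []
classNo = []
myList = os.listdir(path)
noOfClasses = len(myList)
for classIndex in range(noOfClasses):
    picList = os.listdir(os.path.join(path, str(classIndex)))
    for pic in picList:
        img = cv2.imread(os.path.join(path, str(classIndex), pic))
        img = cv2.resize(img, (imageDimesions[0], imageDimesions[1]))
        images.append(img)
        classNo.append(classIndex)
images = np.array(images)
classNo = np.array(classNo)
X_train, X_test, y_train, y_test = train_test_split(images, classNo, test_size=testRatio)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=validationRatio)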
# X_train = ARRAY OF IMAGES TO TRAIN
# y_train = CORRESPONDING CLASS ID
######################### TO CHECK IF NUMBER OF IMAGES MATCHES THE NUMBER OF LABELS FOR EACH DATA SET
print("Data Shapes")
print("Train", end=""); print(X_train.shape, y_train.shape)
print("Validation", end=""); print(X_validation.shape, y_validation.shape)
print("Test", end=""); print(X_test.shape, y_test.shape)
assert (X_train.shape[0] == y_train.shape[0]), "The number of images is not equal to the number of labels in the training set"
assert (X_validation.shape[0] == y_validation.shape[0]), "The number of images is not equal to the number of labels in the validation set"
assert (X_test.shape[0] == y_test.shape[0]), "The number of images is not equal to the number of labels in the test set"
assert (X_train.shape[1:] == imageDimesions), "The dimensions of the Training images are wrong"
assert (X_validation.shape[1:] == imageDimesions), "The dimensions of the Validation images are wrong"
assert (X_test.shape[1:] == imageDimesions), "The dimensions of the Test images are wrong"
######################### READ CSV FILE
data = pd.read_csv(labelFile)
print("data shape ", data.shape, type(data))
######################### DISPLAY SOME SAMPLES IMAGES OF ALL THE CLASSES
num_of_samples = []
cols = 5
num_classes = noOfClasses
fig, axs = plt.subplots(nrows=num_classes, ncols=cols, figsize=(5, 300))
fig.tight_layout()
for i in range(cols):
    for j, row in data.iterrows():
        x_selected = X_train[y_train == j]
        axs[j][i].imshow(x_selected[random.randint(0, len(x_selected) - 1), :, :], cmap=plt.get_cmap("gray"))
        axs[j][i].axis("off")
        if i == 2:
            axs[j][i].set_title(str(j) + "-" + row["Name"])
            num_of_samples.append(len(x_selected))
######################### DISPLAY A BAR CHART SHOWING NO OF SAMPLES FOR EACH CATEGORY
print(num_of_samples)
plt.figure(figsize=(12, 4))
plt.bar(range(0, num_classes), num_of_samples)
plt.title("Distribution of the training dataset")
plt.xlabel("Class number")
plt.ylabel("Number of images")
plt.show()
######################### PREPROCESSING THE IMAGES
def preprocessing(img):
    img = grayscale(img)  # CONVERT TO GRAYSCALE
    img = equalize(img)   # STANDARDIZE THE LIGHTING IN AN IMAGE
    img = img / 255       # NORMALIZE VALUES BETWEEN 0 AND 1 INSTEAD OF 0 TO 255
    return img
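# NOTE: grayscale() and equalize() are called above but not defined in the post; assumed
# OpenCV-based implementations are sketched here.
def grayscale(img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

def equalize(img):
    return cv2.equalizeHist(img)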
X_train = np.array(list(map(preprocessing, X_train)))  # ITERATE AND PREPROCESS ALL IMAGES
X_validation = np.array(list(map(preprocessing, X_validation)))
X_test = np.array(list(map(preprocessing, X_test)))
cv2.imshow("GrayScale Images", X_train[random.randint(0, len(X_train) - 1)])  # CHECK IF THE PREPROCESSING IS DONE PROPERLY
######################### AUGMENTATION OF IMAGES: TO MAKE IT MORE GENERIC
dataGen = ImageDataGenerator(width_shift_range=0.1,   # 0.1 = 10%; IF MORE THAN 1, E.G. 10, IT REFERS TO NO. OF PIXELS, E.G. 10 PIXELS
                             height_shift_range=0.1,
                             zoom_range=0.2,           # 0.2 MEANS IT CAN GO FROM 0.8 TO 1.2
                             shear_range=0.1,          # MAGNITUDE OF SHEAR ANGLE
                             rotation_range=10)        # DEGREES
dataGen.fit(X_train)
batches = dataGen.flow(X_train, y_train, batch_size=20)  # REQUESTING THE DATA GENERATOR TO GENERATE IMAGES; BATCH SIZE = NO. OF IMAGES CREATED EACH TIME IT IS CALLED
X_batch, y_batch = next(batches)
NVIDIA's GTC conference is packed with smart people and programming. The virtual gathering, which takes place from March 21-24, sits at the intersection of some of the fastest-moving technologies of our time. It features a lineup of speakers from every corner of industry, academia and research.
I finished reading the newly released TinyML Cookbook. Whether you are a professional seeking to dive deeper into the world of TinyML or just starting out, you will find this book useful for your practical experiments.
I would like to share my impressions of this book with you and give you a list of similar books on this subject. The book is focused on practical TinyML use cases, which are referred to as "recipes". The author is Gian Marco Iodice, a team and tech lead in the Machine Learning Group at Arm. He co-created the Arm Compute Library in 2017, which is currently the most performant library for ML on Arm and is deployed on billions of devices worldwide.
To summarize, here are the top three takeaways from this book:
Practicing the whole workflow of developing ML models for microcontrollers
Learning techniques to build tiny ML models for memory-constrained devices
Developing a complete and memory-efficient vision recognition pipeline for microcontrollers
This book touches upon many use cases that will allow you to quickly start developing machine learning applications on microcontrollers through practical examples, without any prior knowledge of edge devices.
The TinyML Cookbook gives a comprehensive overview of TinyML applications, covering some of the essentials for developing intelligent apps on the Arduino Nano 33 BLE Sense and Raspberry Pi Pico, as well as general requirements for a good dataset. Most importantly, the book contains many examples of code and datasets ready to be deployed on any device.
Here is one thing that I find particularly useful. I have been facing particular challenges in implementing an LED status indicator on the breadboard. In the past, I used to physically link two or more metal connections together when connecting external components to the microcontroller. Because of the small area between each pin, making direct contact with the microcontroller's pins can be difficult. To make this operation run smoothly, the author provides links to the Arduino Nano and Raspberry Pi Pico pinout diagrams, offering step-by-step instructions for constructing the circuit that turns the LED on when the platform is plugged into power. He also suggests using a Digikey online tool to identify the color bands of the resistor.
I think it will also be useful to mention here the top five books on TinyML that I found relevant to this topic.