Categories
Misc

Optimizing Models with Grappler

I read that Grappler can optimize TF models, which sounds like an interesting feature.

I use Keras to create, train, and save models, and I also convert them to TFLite in order to deploy them on other systems. My question is whether this optimization is already integrated into Keras and/or the TFLite conversion.

It seems like quite a big topic to get into, so if it already happens automatically, I don’t want to spend so much time on it.
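
For reference, in TF 2.x individual Grappler passes can be inspected and toggled globally through tf.config.optimizer; a minimal sketch (option names per the TensorFlow docs):

import tensorflow as tf

# Toggle individual Grappler passes globally; unset options keep their defaults.
tf.config.optimizer.set_experimental_options({
    "constant_folding": True,
    "arithmetic_optimization": True,
    "layout_optimizer": True,
})

# Show which options have been set explicitly.
print(tf.config.optimizer.get_experimental_options())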

submitted by /u/norobot12
[visit reddit] [comments]

Categories
Misc

Compiling an older version of TensorFlow on a Pi 400

Howdy there! Pardon if this ain’t the right place, but basically the gist is: I need an older version of TensorFlow that I (miraculously) used to run on an 18-year-old 32-bit PC, because I was too poor for a Pi before my friend got me one as a gift once my dino finally died. I specifically need to get TensorFlow 1.12.2 running for a bot that I host for my buddies and me, but I’m at something of a loss as to how I’d do it.

I’ve looked into guides, but all of them are for 2.0 and/or just announce “TensorFlow can be installed with pip/apt now!” with a command that doesn’t work for me. It feels as if all the guides for installing TensorFlow 1.x on a Pi 4 onward have been wiped from the internet or replaced with stuff that just doesn’t work or help; what guides I could find didn’t work either. Bazel’s certainly not making it easy. On top of that, no Python wheels work either, with all of them seemingly not built for aarch64 or the specific version of TF I need.

So yeah, that’s where I’m at. My Pi OS is the 64-bit Lite version, if that helps at all. Anything at all would be greatly appreciated, as my poor bot has been offline for a long while. Thanks!

submitted by /u/Loooooooooouie
[visit reddit] [comments]

Categories
Misc

help with resnet50 and identifying new, never-before-seen images

I need help in coming up with a technique to determine when an object is not in the list of already classified objects.

I am sorting parts that are identical in shape but differ in the text written on them. Unfortunately, the text is written in a circle, so I can’t simply read it with OCR. (At least I couldn’t make that work.)

I have about 20 different labels on the parts I currently have. For some parts I have 1,000 pictures; for others, only 100.

I discovered while sorting that there are actually more than 20 different labels: there are parts I didn’t know about that occur only once or twice in five hundred parts. I would like to be able to identify and handle parts like this separately.

I am using keras.resnet50 to classify the images (parts). I get about 96% accuracy in identifying a part correctly when I use two different classifiers to look at the same image and then check their agreement; I do this because I would rather reject parts than have them sorted incorrectly. Classifier two just has a different number of epochs than classifier one.
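
For concreteness, a minimal sketch of that agreement check, assuming clf_a and clf_b are the two trained Keras models (names hypothetical):

import numpy as np

def classify_or_reject(clf_a, clf_b, batch):
    # Predicted class from each independently trained classifier.
    pred_a = np.argmax(clf_a.predict(batch), axis=1)
    pred_b = np.argmax(clf_b.predict(batch), axis=1)
    # Keep a label only when both models agree; otherwise send the
    # part to a reject bin (-1) rather than risk mis-sorting it.
    return np.where(pred_a == pred_b, pred_a, -1)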

My images are 300×300.

I have looked at the prediction confidence, but 95% of my predictions are almost certain (say, 0.99999), so I haven’t been able to do the simple thing of just thresholding on the strength of the prediction.

In case this is relevant: the ResNet50 classifier I have trained might be overfitting. Within three epochs it is at 99.7% training accuracy with a val_accuracy of 98%.

Epoch 3/3

99/99 [==============================] - 572s 6s/step - loss: 0.0138 - accuracy: 0.9967 - val_loss: 0.3574 - val_accuracy: 0.9796

Any ideas?

Also, should I be posting this question on another sub as well?

submitted by /u/marlon1492
[visit reddit] [comments]

Categories
Misc

Algorithm for interrogating a model to find what activates a specific class?

I have an NN multi-class text classifier. The input is a sequence of words, each looked up in a vocabulary and replaced by its embedding, so the model effectively receives a vector of embeddings. Since the domain of this model is a closed set (a fixed vocabulary and sequence length), I’m thinking it should be possible to try a small subset of all possible inputs (random phrases) to get an idea of what words would maximize a given class.

My question: is there a more efficient algorithm to do this, other than trying random sequences of words from the vocabulary?

I’m thinking there has to be a way to do this backwards: looking at the weights of the units in the last layers relative to the embedding values.
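
One common “backwards” approach is gradient ascent on the input embeddings rather than reading the weights directly. A minimal sketch, assuming a hypothetical model that maps a batch of embedded sequences to class logits, and an embedding_matrix of shape (vocab_size, embed_dim):

import tensorflow as tf

def words_maximizing_class(model, embedding_matrix, seq_len, embed_dim,
                           target_class, steps=200, lr=0.1):
    # Start from a random point in embedding space and ascend the target logit.
    x = tf.Variable(tf.random.normal([1, seq_len, embed_dim]))
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            logit = model(x, training=False)[0, target_class]
        # Negate because optimizers minimize; this maximizes the class logit.
        grads = tape.gradient(-logit, [x])
        opt.apply_gradients(zip(grads, [x]))
    # Snap each optimized embedding to its nearest vocabulary word (cosine).
    vocab = tf.math.l2_normalize(embedding_matrix, axis=1)
    query = tf.math.l2_normalize(x[0], axis=1)
    return tf.argmax(tf.matmul(query, vocab, transpose_b=True), axis=1).numpy()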

Thanks

submitted by /u/AbIgnorantesBurros
[visit reddit] [comments]

Categories
Misc

Optimizing Access to Parquet Data with fsspec

This post details how the filesystem specification’s new parquet module provides a format-aware byte-caching optimization.

As datasets continue to grow in size, the adoption of cloud-storage platforms like Amazon S3 and Google Cloud Storage (GCS) is becoming more common. Although node-local storage is likely to result in better I/O performance, this approach can become impractical once the dataset exceeds the single-terabyte scale.

For cases where remote storage is the only practical solution, much of the PyData ecosystem has already adopted Filesystem Spec (fsspec) as a universal file-system API. While fsspec has provided users with adequate remote storage access since the introduction of s3fs and gcsfs, the internal byte-transfer and caching algorithms have not yet leveraged format-specific optimizations for high-performance file formats like Parquet.

Key learnings

In this post, we introduce the fsspec.parquet module, which provides a format-aware, byte-caching optimization for remote Parquet files. This module is both experimental and limited in scope to a single public API: open_parquet_file.

This API provides faster remote-file access. When compared with the default read_parquet behavior in the pandas and cuDF DataFrame libraries, there is a consistent performance boost in overall throughput for partial I/O (column-chunk and row-group selection) from large Parquet files.

We also discuss the optimizations used within this new module and present the preliminary performance results.

What is fsspec?

Filesystem Spec (fsspec) is an open-source project providing a unified Python interface to a variety of backend storage systems. The name refers both to the fsspec Python library itself and to a larger GitHub organization containing many system-specific repositories (for example, s3fs and gcsfs).

The advantage of using fsspec within a Python-based library or application is that the same POSIX-like file API can read and write remote objects and local files alike. When a cloud-based object is opened by an fsspec-compatible file system, the underlying application is given an AbstractBufferedFile object, or some other file-like object designed to duck-type with native Python file objects. These file-like objects can then be treated the same way as local Python file handles.
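
For example, a minimal sketch of that duck-typing (the bucket and object names here are hypothetical):

import fsspec

# The same file API works whether the path is local or remote.
with fsspec.open("s3://my-bucket/my-data.parquet", mode="rb") as f:
    f.seek(-8, 2)      # seek relative to the end, like a native file
    tail = f.read(8)   # last 8 bytes: footer length + "PAR1" magic
    print(tail)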

One obvious difference is that the AbstractBufferedFile object must use file system-specific commands internally to put and get bytes to and from the remote storage system. Because these internal data-transfer operations typically have higher latency than local disk accesses, fsspec offers several caching strategies to prefetch data while avoiding many small requests.
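
The caching strategy is chosen when a file is opened. A minimal sketch (strategy names per fsspec.caching; the path is hypothetical):

import fsspec

fs = fsspec.filesystem("s3")  # or "gcs", "abfs", ...

# "readahead" is the default strategy; "none" disables caching entirely.
with fs.open("my-bucket/my-data.parquet", mode="rb", cache_type="readahead") as f:
    head = f.read(4)  # a small read triggers a larger read-ahead transfer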

Later in this post, we explain why the default caching strategy is often suboptimal for large Parquet files, and how a new KnownPartsOfAFile (“parts”) option can dramatically reduce read latency.

What is Parquet?

Parquet is a binary column-oriented data-storage format, designed with performance, compression, and partial I/O in mind. Due to its dramatic performance advantages over text-formatted files like CSV, the open-source format has experienced rapid growth in popularity since its initial release in 2013.

To understand the optimizations used within the fsspec.parquet module, it is useful to have a high-level understanding of the Parquet specification. Figure 1 shows that all Parquet files consist of a collection of contiguously stored row-groups, and each of these row-groups includes a collection of contiguously stored column chunks.

A Parquet file is divided into row-groups, and row-groups are divided into column chunks. Every Parquet file also includes a single block of footer metadata.
Figure 1. High-level visual representation of the Parquet file format

The majority of the bytes in a Parquet file correspond to these column chunks. The remaining bytes are used to store file metadata within a Thrift-formatted footer. This footer metadata includes byte offsets and optional statistics (min, max, valid count, null count) for every column chunk in the file.

Because this vital information is consolidated in the footer, it is only necessary to sample the end of a Parquet file to determine the specific location of each column-chunk in the file.
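A minimal sketch of that footer-only lookup using pyarrow (the file name is hypothetical; any seekable file-like object, such as one opened through fsspec, also works as the source):

import pyarrow.parquet as pq

# Reading the metadata touches only the footer, not the column data.
md = pq.ParquetFile("my-data.parquet").metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)
    print(col.path_in_schema, col.file_offset, col.total_compressed_size)
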

What is the purpose of fsspec’s new Parquet module?

fsspec is already the most common means of loading Parquet files in Python. The primary purpose of the new fsspec.parquet module is to provide an optimized utility for exactly this task.

Internally, this new utility (open_parquet_file) essentially wraps a conventional open call in Parquet-specific logic that is designed to begin the transfer of all relevant data from the remote file into local memory before the user has even initiated an explicit read operation.

Although reads of any size may benefit from this new utility, the most significant improvements are seen when the read operation targets only a subset of all columns and row-groups. For example, when the remote read is using column projection, the same list of columns can be passed directly to open_parquet_file:

from fsspec.parquet import open_parquet_file
import pandas as pd

path = "s3://my-bucket/my-data"  # any fsspec-supported protocol works here
columns = ["col1", "col2"]
options = {"necessary": "credentials"}

with open_parquet_file(
    path,
    columns=columns,
    storage_options=options,
) as f:
    # Read from the pre-cached file object, not the raw path.
    df = pd.read_parquet(f, columns=columns)

As we mentioned earlier, the primary purpose of fsspec’s new open_parquet_file function is to improve read performance from large Parquet files by using a format-aware caching strategy within AbstractBufferedFile.open.

To understand why the specific optimizations used by the Parquet module are beneficial, it is helpful to start with an understanding of the simple cache-free approach, as well as the default read-ahead strategy.

Understanding cache-free and read-ahead approaches

Figure 2 shows how cache-free file access is likely to map onto remote get calls within a read_parquet operation (after the remote file is already opened) for both an ideal and naive sequence of read calls.

The overall number of remote-get operations can be dramatically reduced by coalescing multiple requests for adjacent data blocks.
Figure 2. Comparison of ideal and naive data access patterns from a remote Parquet file

In this particular example, assume the Parquet file is large enough (~400MB+) to comprise four distinct row-groups, and the read_parquet call is targeting two out of the four available columns. This means the AbstractBufferedFile object must ultimately transfer eight distinct column-chunk ranges, along with the footer metadata, into local memory.

When caching is disabled, a well-tuned Parquet library can move only the necessary data into local memory using five distinct requests. However, as fsspec’s file-like interface contains state, these five read calls are always serial, and each one incurs a full round-trip latency.

This ideal-access scenario is incapable of leveraging concurrency to minimize latency, even though the strategy is likely to minimize the number of file-system requests and produce high overall throughput. For Parquet libraries that take a naive-access approach and do not explicitly minimize the number of read calls, the read_parquet operation is likely to suffer from high latency and low overall throughput!

At this point, we’ve established that the I/O library is unable to take advantage of file-system concurrency after the file is already opened for reading. What is even more important to consider is that the default read-ahead caching strategy can degrade the observed performance even further when partial I/O is involved. This is because the inherent assumption of read-ahead caching is that sequential file accesses are likely to be contiguous.

Figure 3 shows that read-ahead caching is likely to transfer about 20 MB of unnecessary data: 5 MB of read-ahead for each row-group.

Using read-ahead caching to load a subset of columns from a Parquet file will result in unnecessary data transfer.
Figure 3. Comparison of the read-ahead and cache-free file-access strategies in fsspec

As you see in the following performance results, the benefit of disabled caching over read-ahead caching is dependent on both the engine used within read_parquet and the specific storage system.

If the engine already includes format-aware optimizations to move the necessary byte ranges into local memory (that is, pyarrow and fastparquet), then the no-caching option is likely the better choice. When the engine assumes local file access is fast (cuDF), some form of fsspec-level caching may be critical.

In either case, the engine is unable to begin the transfer of the required byte ranges until after the fsspec file object is created. It is limited by the additive latency of sequential data transfer.

Now that we’ve established a high-level understanding of both the default and cache-free AbstractBufferedFile behavior within a typical read_parquet call, we can explain the two general optimizations used by open_parquet_file to improve overall read throughput.

Optimization 1: Use the “parts” caching strategy

Instead of the format-agnostic caching strategies used within AbstractBufferedFile by default, the new KnownPartsOfAFile (“parts”) option caches exactly what you need before the file is even opened.

In other words, an external Parquet engine (either fastparquet or pyarrow) first parses the footer metadata up front. The necessary byte ranges are then transferred into local memory and stored in the “parts” cache, ensuring that downstream applications never have to wait for remote data after an open fsspec file object is acquired.

Figure 4 shows that there is no longer any relationship between the logic used to access data within read_parquet and the number or size of get calls used for remote-data access. From the perspective of the Parquet engine, any read operation is near-instantaneous, because all necessary data is already cached in memory.

When the “parts” caching strategy is used, read operations will not result in new remote-file accesses.
Figure 4. The fsspec’s “parts” caching strategy. All necessary data must be cached in local memory before the file-like object is even opened.
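
A minimal sketch of how the “parts” cache can be populated by hand (the path and byte offsets are hypothetical; open_parquet_file does all of this for you):

import fsspec

fs = fsspec.filesystem("s3")
path = "my-bucket/my-data.parquet"

# Byte ranges we already know we need (for example, from the parquet footer).
ranges = [(4, 1_048_580), (788_000_000, 789_000_000)]
blobs = fs.cat_ranges([path] * len(ranges),
                      [s for s, _ in ranges],
                      [e for _, e in ranges])

f = fs.open(
    path,
    mode="rb",
    cache_type="parts",
    cache_options={"data": dict(zip(ranges, blobs))},
)
# Reads that fall inside the pre-fetched ranges never touch the network.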

Optimization 2: Transfer “parts” asynchronously and in parallel

Although the previous optimization avoids many small get operations and unnecessary data transfer within read_parquet, the KnownPartsOfAFile cache must still be populated before AbstractBufferedFile can be initialized.

To do this as efficiently as possible, use cat_ranges to fetch all the required column chunks at once, asynchronously and in parallel using asyncio. Because the total number of transferred column chunks can be large for files containing many fields or multiple row-groups, adjacent byte-range requests are also aggregated, as long as the size of the aggregated request stays below an upper limit.
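
A minimal sketch of that aggregation step, using the merge_offset_ranges helper from fsspec.utils (the path and offsets are made up):

import fsspec
from fsspec.utils import merge_offset_ranges

fs = fsspec.filesystem("s3")
path = "my-bucket/my-data.parquet"

# Three nearby requests collapse into fewer, larger ones when the gap
# between them is below max_gap.
paths, starts, ends = merge_offset_ranges(
    [path] * 3, [0, 100, 10_000], [90, 200, 20_000], max_gap=64_000
)
blobs = fs.cat_ranges(paths, starts, ends)  # transferred concurrently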

Figure 5 shows that this approach ultimately leads to the concurrent transfer of multiple byte ranges of optimal size.

Diagram shows that all necessary remote data is transferred into local-memory in parallel. Adjacent data transfers are coalesced to reduce the overall number of remote-file accesses.
Figure 5. Data-transfer optimizations used in fsspec.parquet

Preliminary fsspec.parquet benchmark results

To compare the performance of open_parquet_file with other fsspec- and pyarrow-based file-handling methods, we used a benchmarking script with the following library versions:

  • pyarrow-6.0.1
  • fastparquet-0.8.0
  • cudf-22.04
  • fsspec/s3fs/gcsfs-2022.2.0

Using this script, we read a single column from a 789 MB Parquet file containing a total of 10 row-groups and 30 columns with snappy compression, requiring the transfer of roughly 27 MB of data. The results for both S3 and GCS, summarized in Figures 6-7, clearly show a significant performance boost of 85% or more when moving from the default caching strategy to open_parquet_file.

In fact, the new function effectively matches the performance of PyArrow’s native S3FileSystem implementation, which is specifically designed for optimal performance with its Parquet engine. In this post, we compare it only with a native PyArrow filesystem for this single Amazon S3 benchmark, because PyArrow only offers publicly supported implementations for Amazon S3 and Hadoop at the time of publication.

Using fsspec’s Parquet module reduces the read latency for both cuDF and fastparquet, and roughly matches optimal performance in pyarrow.
Figure 6. General benchmark results for S3 storage
Using fsspec’s Parquet module reduces the read latency for cudf, fastparquet, and pyarrow. The latency is consistently reduced by 50% or more.
Figure 7. General benchmark results for GCS storage

To illustrate that the benefit of using open_parquet_file scales with the overall size of the file, we also read a single column from a 12 GB Parquet file containing the first day of the Criteo dataset from public GCS storage with compression disabled. Figure 8 shows that the new fsspec function can offer a 10x speedup or more over default caching.

Using fsspec’s Parquet module significantly reduces the read latency for cuDF, fastparquet, and PyArrow. The measured speedup is a factor of ten or more.
Figure 8. Criteo-dataset benchmark results for GCS storage

Test it out yourself

In this post, we introduced the fsspec.parquet module, whose open_parquet_file utility provides a format-aware, byte-caching optimization for opening remote Parquet files. Benchmarking clearly suggests that the new optimization can offer significant performance improvements over the default file-opening behavior in fsspec, and may even approach the performance of optimized C++ file-system implementations in PyArrow for partial I/O.

Since being officially released in version 2021.11.0 of fsspec, the open_parquet_file utility has already been adopted by both the RAPIDS cuDF library and Dask DataFrame.

Due to the dramatic and consistent improvement available for cuDF-based workflows, this new feature has already been adopted as the default file-opening approach within cudf.read_parquet and dask_cudf.read_parquet.

For Dask users without GPU resources, the optimized caching approach can now be opted into by passing an open_file_options argument to read_parquet. For example, the following code example instructs Dask to open all Parquet data files with the parquet pre-caching method:

import dask.dataframe as dd

ddf = dd.read_parquet(
    path, open_file_options={"precache_options": {"method": "parquet"}}
)

Given this early success, we are hoping to expand and simplify the available pre-caching options in fsspec, and establish a clear precache_options API within all file-opening functions.

Community feedback here is critical. Please engage on GitHub or comment below, and let us know what you think!

Categories
Misc

Linux seems stuck on cuDNN, how do I fix it?

hi

I’m running the same repo of TensorFlow GPU code on both machines.

On Windows, it gets past this stage:

- weight name: discriminator/gan/conv6/bias:0, shape: [256], size: 256
Trigger callback: Total counts of trainable weights: 33579064. Total size of trainable weights: 0G 32M 24K 56B (Assuming 32-bit data type.)
2022-05-05 11:29:14.865680: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2022-05-05 11:29:15.131382: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-05-05 11:29:15.901900: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.

However, on Linux it stays at

Total size of trainable weights: 0G 32M 24K 56B (Assuming 32-bit data type.)

forever.

I assume this Linux machine wasn’t able to open cuDNN?

cuDNN was installed with conda; conda list shows:

$ conda list
cudatoolkit    10.0.130    hf841e97_10    conda-forge
cudnn          7.6.5.32    ha8d7eb6_1     conda-forge

The versions seem fine for tensorflow-gpu 1.15.

After installing CUDA with the system package manager, there is also another cudatoolkit 10 and the cudnn that comes with tensorflow-gpu. I did not restart the machine. Does this matter?

2022-05-05 12:46:14.496348: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2022-05-05 12:46:14.520365: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2601325000 Hz
2022-05-05 12:46:14.521958: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5565bcb72f40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-05-05 12:46:14.521982: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-05-05 12:46:14.523737: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-05-05 12:46:14.526821: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
2022-05-05 12:46:14.526889: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 510.60.2
2022-05-05 12:46:14.526904: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 495.44.0
2022-05-05 12:46:14.526916: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 495.44.0 does not match DSO version 510.60.2 -- cannot find working devices in this configuration
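
For what it’s worth, the log itself points at a likely cause: the kernel module reports driver 495.44.0 while the userspace libcuda is 510.60.2, a mismatch that commonly appears after a driver upgrade without a reboot. A hedged sketch of how to check:

# Version the loaded kernel module reports:
cat /proc/driver/nvidia/version

# Version the userspace driver library reports:
nvidia-smi

# If they differ, rebooting (or reloading the nvidia kernel modules)
# usually brings them back in sync.
sudo reboot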

Is this because the CUDA 11 installed by the system is not compatible with the cudatoolkit 10 from conda?

Any idea how to fix this one?

thanks a lot

submitted by /u/boydbuilding
[visit reddit] [comments]

Categories
Misc

Does TensorFlow 1.15 work with CUDA 11 and cuDNN 8?

Hi,

I found this:

https://github.com/NVIDIA/tensorflow

but I have no idea how to make it work.

no matches found: nvidia-tensorflow[horovod] 
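
Side note on that error: “no matches found” is what zsh prints when unquoted square brackets are treated as a glob pattern. Assuming the install commands from the NVIDIA/tensorflow README, quoting the requirement usually gets past it:

pip install --user nvidia-pyindex
pip install --user 'nvidia-tensorflow[horovod]'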

Right now:

$ nvcc --version
Cuda compilation tools, release 11.6

and

$ conda list
cudatoolkit    10.0.130    hf841e97_10    conda-forge
cudnn          7.6.5.32    ha8d7eb6_1     conda-forge

and TensorFlow doesn’t work right now. The repo was developed under TensorFlow 1.15.

What is the best solution right now?

Make TensorFlow 1.15 work with CUDA 11 and cuDNN 8?

Or downgrade the system CUDA 11?

Or something else?

thanks a ton

submitted by /u/boydbuilding
[visit reddit] [comments]

Categories
Misc

Driver’s Ed: How Waabi Uses AI, Simulation to Teach Autonomous Vehicles to Drive

Teaching the AI brains of autonomous vehicles to understand the world as humans do requires billions of miles of driving experience. The road to achieving this astronomical level of driving leads to the virtual world. On the latest episode of the AI Podcast, Waabi CEO and founder Raquel Urtasun joins NVIDIA’s Katie Burke Washabaugh to discuss how Waabi uses AI and simulation to teach autonomous vehicles to drive.

The post Driver’s Ed: How Waabi Uses AI, Simulation to Teach Autonomous Vehicles to Drive appeared first on NVIDIA Blog.

Categories
Misc

Upcoming Event: Session on Vision AI at the Edge, from Zero to Deployment Using Low-Code Development at Embedded Vision Summit

This Embedded Vision Summit session will showcase a low-code approach to developing fully optimized and accelerated vision AI applications using the DeepStream SDK and Graph Composer.