Categories
Misc

Scaling XGBoost Performance with RAPIDS cuML, Dask, and Google Kubernetes Engine (GKE)

RAPIDS cuML provides scalable, GPU-accelerated machine learning models with a Python interface based on the scikit-learn API. This guide will walk through how to easily train cuML models on multi-node, multi-GPU (MNMG) clusters managed by Google’s Kubernetes Engine (GKE) platform. We will examine a subset of the available MNMG algorithms, illustrate how they leverage Dask on a large public dataset, and provide a series of code samples for exploring and recording their performance.

Dask as our distributed framework

Our first task will be to bring up our Dask cluster within Kubernetes. This will provide us with the ability to run distributed algorithms in a MNMG environment, and explore some of the implications this has on how we design our workflows, do analysis, and build models.

MNMG cuML and XGBoost

Once our Dask cluster is up and running, and we’ve had a chance to load some data and get a feel for the major ideas, we’ll take a look at the machine learning models RAPIDS has available and the flavors they come in (out-of-band and in-framework), then go through the process of training some of those models and looking at their performance in our cluster.

Pre-Requisites

Before we get started, we need a few pieces of software and a running Kubernetes cluster. I’ll provide a quick run-through for spinning one up in GKE on Google’s Cloud Platform (GCP), as well as this more detailed guide; if you’re interested in more details about GCP or Kubernetes, I encourage you to look into the links at the end of this guide.

Software

Local Environment

A Python versioning and package manager such as conda. You’ll want to create a new environment, which the RAPIDS install command below does.

$ conda create -n rapids-core-0.19 -c rapidsai -c nvidia -c conda-forge \
    -c defaults cuml=0.19 python=3.8 cudatoolkit=11.0

We’ll modify this to be our scheduler/worker deployment container for our Dask cluster.

$ docker pull rapidsai/rapidsai-core:0.19-cuda11.0-runtime-ubuntu18.04-py3.8

File system interface to GCP’s cloud storage (GCS) system.

$ conda activate rapids-core-0.19
$ conda install -c conda-forge dask-kubernetes gcsfs jupyterlab nodejs
$ pip install xgboost seaborn dask_labextension tqdm

This adds a plugin to Jupyter that lets us see what our Dask cluster is doing in real time. Very handy.

$ jupyter labextension install dask-labextension
$ python -m ipykernel install --user --name rapids-core-0.19 --display-name "RAPIDS-0.19"
  • Pull the latest notebook code and specification files from rapids cloud-ml-examples. Relevant files are found in dask/kubernetes
$ git clone https://github.com/rapidsai/cloud-ml-examples.git

At this point, you’ve got a RAPIDS-0.19 conda environment, configured with all the libraries we need, and quality of life updates for Jupyter that will make the Dask experience more interactive.

Next up: what data are we using and where do we get it?

Data

For this guide, we’ll be using a subset of the public NYC-Taxi dataset, hosted on GCS by Anaconda. The data can be accessed directly from ‘gcs://anaconda-public-data/nyc-taxi’, and can be explored easily with the gsutil utility.

$ gsutil ls -r gs://anaconda-public-data/nyc-taxi

We’ll examine a medium-sized, 150 million record set stored in parquet format, and a larger, ~450 million record set covering 2014, 2015, and 2016, saved in CSV format. This will give us a chance to observe the substantial benefit of selecting the proper storage format.

Kubernetes on GKE

Optional: The steps below only need to be completed if you want to run distributed inference using the Forest Inference Library, or to experiment with the parquet-converted mortgage data.

Configuring our Dask Cluster

Before we can launch our Dask cluster, we need to create our scheduler/worker container, push it to Google Container Registry (GCR), and update our Dask-Kubernetes configuration files to reflect our specific Kubernetes cluster.

Cluster specific items

  • Navigate to the cloud-ml-examples repo you downloaded in the ‘Local Environment’ step above.
$ cd dask/kubernetes
$ ls
Dask_cuML_Exploration.ipynb   Dockerfile   specs
  • Build your scheduler/worker container, tag it with the GCR path corresponding to your GCP project, and push it to your GCR repo.
$ docker build --tag gcr.io/${YOUR_PROJECT}/${YOUR_REPO}/dask-unified:0.19 --file Dockerfile .
$ docker push gcr.io/${YOUR_PROJECT}/${YOUR_REPO}/dask-unified:0.19
  • Update the two yaml files sched-spec.yaml and worker-spec.yaml found in ./specs
    • Find the image entry under the containers block and set it to your GCR image path. Next, locate the limits and requests blocks and set their cpu and memory elements based on available resources in your cluster.

For example, n1-standard-4 has 4 vcpus, and 15 GB of memory, so we might configure our container specification as follows (you can find the exact amount of allocatable resources in the GCP console by looking at the ‘Nodes’ table in your cluster details).

containers:
   - image: gcr.io/${YOUR_PROJECT}/${YOUR_REPO}/dask-unified:0.19
   … SNIP … 
   resources:
     limits:
       cpu: "3"
       memory: 13G
       nvidia.com/gpu: 1
     requests:
       cpu: "3"
       memory: 13G

Deploy the cluster

At this point, we’re finished with all the configuration elements and can start exploring the code. I’ll reference the relevant bits here, and you can refer to the underlying notebook for additional details. To get started, bring up a JupyterLab instance on your workstation and open ‘Dask_cuML_Exploration.ipynb’.

  • Make sure you select the RAPIDS-0.19 kernel we installed previously.

  • Run the first three cells to launch your Dask cluster. These will:
    • Create scheduler and worker pod templates from ‘sched-spec.yaml’ and ‘worker-spec.yaml’.
    • Create a cluster from the pod templates, attach a Dask client, and scale up the cluster to have two workers.
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster(pod_template=worker_pod,
                      scheduler_pod_template=sched_pod)

client = Client(cluster)

n_workers = 2
cluster.scale(n_workers)

Note: This process may take 5-10 minutes for the first run, as each worker will need to pull its container.

  • During this time, it can be useful to open a separate terminal window and monitor your Kubernetes activity with kubectl. This will also allow you to get the external IP of the Dask scheduler once it’s created, and to begin monitoring the cluster.
$ watch kubectl get all
Every 2.0s: kubectl get all                                    drobison-mint: Thu Feb 11 12:21:02 2021

NAME                       READY   STATUS    RESTARTS   AGE
pod/dask-61d38cef-e57k2r   1/1     Running   0          54m
pod/dask-61d38cef-e7gbzk   1/1     Running   0          54m
pod/dask-61d38cef-ebck7r   1/1     Running   0          56m

NAME                      TYPE           CLUSTER-IP   EXTERNAL-IP
service/dask-61d38cef-e   LoadBalancer   10.44.8.55   [YOUR EXTERNAL IP]
Figure 1. Dask cluster connection panel.
  • Once the cluster is finished creating, you should see something like the screen below.
Figure 2. Dask dashboard during a running task.

Running the next few cells will create a number of helper functions that aggregate timings and scale worker counts, define data loading mechanisms for our medium and large NYC-Taxi datasets (along with some pre-processing and data cleanup), and set up some simple visualization functions to let us explore the results.
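The timing snippets below rely on a SimpleTimer context manager defined in the notebook’s helper code. It is not reproduced in this post, but a minimal stand-in, assuming elapsed time is recorded in nanoseconds (as the timer.elapsed / 1e9 prints below imply), could look like this:

import time

class SimpleTimer:
    # Minimal sketch of a SimpleTimer-style helper (not the notebook's original).
    # It records elapsed wall-clock time in nanoseconds, matching the
    # `timer.elapsed / 1e9` usage in the snippets below.
    def __enter__(self):
        self._start = time.perf_counter_ns()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter_ns() - self._start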

ETL example

Most data scientists are probably aware that the choice of file format matters, but it’s not always clear how much, or what the underlying trade-off is. As a quick illustration, let’s look at the time required to read in ~150 million rows from the CSV vs. parquet data formats.

CSV

import dask
import dask_cudf
from dask.distributed import wait

base_path = 'gcs://anaconda-public-data/nyc-taxi/csv'

with SimpleTimer() as timer_csv:
    df_csv_2014 = dask_cudf.read_csv(f'{base_path}/2014/yellow_*.csv', chunksize=25e6)
    df_csv_2014 = clean(df_csv_2014, remap, must_haves)
    df_csv_2014 = df_csv_2014.query(' and '.join(query_frags))
    
    with dask.annotate(workers=set(workers)):
        df_csv_2014 = client.persist(collections=df_csv_2014)
        
    wait(df_csv_2014)

print(df_csv_2014.columns)
rows_csv = df_csv_2014.iloc[:,0].shape[0].compute()
print(f"CSV load took {timer_csv.elapsed/1e9} sec. For {rows_csv} rows of data => {rows_csv/(timer_csv.elapsed/1e9)} rows/sec")

On an eight GPU cluster this takes around 350 seconds, for ~155,500,000 rows.

Parquet

with SimpleTimer() as timer_parquet:
    df_parquet = dask_cudf.read_parquet(f'gs://anaconda-public-data/nyc-taxi/nyc.parquet', chunksize=25e6)
    df_parquet = clean(df_parquet, remap, must_haves)
    df_parquet = df_parquet.query(' and '.join(query_frags))
    
    with dask.annotate(workers=set(workers)):
        df_parquet = client.persist(collections=df_parquet)
    
    wait(df_parquet)

print(df_parquet.columns)
rows_parquet = df_parquet.iloc[:,0].shape[0].compute()
print(f"Parquet load took {timer_parquet.elapsed/1e9} sec. For {rows_parquet} rows of data => {rows_parquet/(timer_parquet.elapsed/1e9)} rows/sec")

On the same eight GPU cluster, the parquet read takes around 98 seconds for ~138,300,000 rows, a speedup of more than 3x over the CSV read in terms of rows per second. For larger datasets this can result in a tremendous amount of time saved.
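If you have a GCS bucket of your own, one way to bank that speedup for later runs is to write the cleaned dataframe back out in parquet format. A minimal sketch (the bucket path below is a placeholder, not one from the notebook) might look like:

# Hypothetical one-off conversion: persist the cleaned CSV-derived dataframe as
# parquet so subsequent runs get the faster read path. Writing to GCS goes
# through the gcsfs package installed earlier.
df_csv_2014.to_parquet("gs://YOUR_BUCKET/nyc-taxi/2014.parquet")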

Multi-Node cuML training

Here, we’ll examine the process of training a Random Forest Regressor model across a set of workers in your cluster, examine the performance, and outline how we can scale up to more workers when necessary.

The performance sweep code goes through a fairly straightforward process; a rough sketch of the core timing loop follows the list below.

  • Calls the data loader, which reads and load-balances our dataset across Dask workers.
  • Calls model.fit ‘samples’ times, either in an X ~ y form for supervised models like RF, or using only X (KMeans, NN, etc.), and records the resulting timings.
  • Calls model.predict ‘samples’ times, for all rows in X, and records the resulting timings.
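The benchmark_sweep helper itself lives in the notebook; as a rough, hypothetical sketch of just the timing loop described above (data loading, worker scaling, and CSV output omitted, and all names placeholders), it might look something like this:

import time

def time_fit_predict(model_cls, X, y=None, samples=5, **model_kwargs):
    # Placeholder sketch: time repeated fit/predict calls for a cuML Dask model.
    fit_times, predict_times = [], []
    for _ in range(samples):
        model = model_cls(**model_kwargs)

        start = time.perf_counter()
        if y is not None:
            model.fit(X, y)       # supervised models, X ~ y
        else:
            model.fit(X)          # unsupervised models (KMeans, NN, ...)
        fit_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        model.predict(X)          # predict over all rows of X
        predict_times.append(time.perf_counter() - start)
    # Note: the real helper also waits on the distributed results, so the
    # timings reflect completed work rather than lazy task submission.
    return fit_times, predict_times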

Two-Node performance

From the RAPIDS’ documentation: This distributed algorithm uses an embarrassingly-parallel approach. For a forest with N trees being built on w workers, each worker simply builds N/w trees on the data it has available locally. In many cases, partitioning the data so that each worker builds trees on a subset of the total dataset works well, but it generally requires the data to be well-shuffled in advance. Alternatively, callers can replicate all of the data across workers so that rf.fit receives w partitions, each containing the same data. This would produce results approximately identical to single-GPU fitting.

import numpy as np

from cuml.dask.ensemble import RandomForestRegressor

rf_kwargs = {
    "workers": client.has_what().keys(),
    "n_estimators": 10,
    "max_depth": 6
}
rf_csv_path = f"./{out_prefix}_random_forest_regression.csv"

benchmark_sweep(client=client, model=RandomForestRegressor,
                **benchmark_kwargs,
                out_path=rf_csv_path,
                response_dtype=np.int32,
                model_kwargs=rf_kwargs)

visualize_csv_data(rf_csv_path)

Running this, you’ll see the following:

Starting weak-scaling performance sweep for:
 model      : <cuml.dask.ensemble.RandomForestRegressor>
 data loader: <taxi_medium data loader>
Configuration
==========================
Worker counts             : [2]
Fit/Predict samples       : 5
Data load samples         : 1
- Max data fraction       : 1.00
 - Train                  : 1.00
 - Infer                  : 1.00
Model fit                 : X ~ y
- Response DType          : int32
Writing results to        : ./taxi_medium_random_forest_regression.csv
- Method                  : append


Sampling load times with 2 workers, with 12.5 percent of total data.

100%|██████████| 1/1 [17:19] load samples with 2 workers, with a mean time of 1039.3022 sec.
Sweeping 'fit' with 2 workers. Sampling 5 times with 12.5 percent of total data.

100%|██████████| 5/5 [06:55] 'fit' samples using 2 workers, with a mean time of 83.0431 sec.
Sweeping 'predict' with 2 workers. Sampling 5 times with 12.5 percent of total data.

100%|██████████| 5/5 [07:23] 'predict' samples using 2 workers, with a mean time of 88.6003 sec.


  hardware  n_workers     type     ci.low    ci.high
0        T4          2      fit  82.610233  83.476041
1        T4          2  predict  86.701627  90.498879
Figure 3. Example box plots for 2 worker fit and predict after five iterations using T4 hardware.

Note that if we wanted to check our algorithm performance for multiple hardware types, we could rerun the previous commands on a different cluster configuration, and since we’re set to append to our existing data set we would then see something similar to the graph below. (See the Vis and Analysis section of the notebook for more information).

Figure 4. Example box plots for 2 worker fit and predict with Random Forest for T4 and A100 hardware.

Scaling up and out

At this point we’ve trained our random forest using two workers and collected some data; now let’s assume we want to scale up our workflow to support a larger dataset.

There are two possible scaling cases to consider. The first is that we want to scale our worker count and our Kubernetes cluster already has sufficient resources; in this case, all we need to do is tell our KubeCluster object to scale the cluster, and it will spin up additional worker pods and connect them to the scheduler.

n_workers = 16
cluster.scale(n_workers)

The second scenario is one where we don’t have sufficient Kubernetes resources to launch additional workers. In this case, we’ll need to go back to GKE and increase the size of our node pool before we can scale up our worker count. Once we’ve done that, we can go back, update our sweep configuration to run with four, eight, and sixteen workers, and kick off another run. Examining the results, we see the relatively flat profile that we would expect for a weak scaling run.

… SNIP … 
  hardware  n_workers     type     ci.low    ci.high
0        T4          2      fit  82.610233  83.476041
2        T4          8  predict  21.372635  25.744045
3        T4         16      fit   3.417929   3.491950
6        T4         16  predict  17.952964  21.310312
8        T4          4  predict  45.985099  47.991526
9        T4          2  predict  86.701627  90.498879
10       T4          8      fit   8.470403   9.081820
11       T4          4      fit  30.791707  31.451253
Figure 5. T4 Random Forest weak scaling with 2, 4, 8, and 16 worker nodes, using the small Taxi dataset.

Similarly, if we want to gather additional scaling data for another hardware type, say A100s, we can rebuild our cluster, selecting A100s instead of T4s, and re-run our performance sweeps to produce the following.

… SNIP … 
  hardware  n_workers     type     ci.low    ci.high
1      A100          4  predict  36.307192  39.593791
2        T4          8  predict  21.372635  25.744045
3        T4         16      fit   3.417929   3.491950
4      A100         16      fit   1.438821   1.541184
7      A100          8  predict  23.161591  25.073527
9        T4          8      fit   8.470403   9.081820
10     A100          4      fit   5.009861   5.124773
11       T4          4      fit  30.791707  31.451253
13       T4          2      fit  82.610233  83.476041
14     A100          2      fit  13.178733  13.259799
15     A100          2  predict  42.331923  44.573160
17     A100         16  predict  17.292678  17.742979
19       T4         16  predict  17.952964  21.310312
21     A100          8      fit   2.302896   2.337939
22       T4          4  predict  45.985099  47.991526
23       T4          2  predict  86.701627  90.498879
Figure 6. T4 and A100 Random Forest weak scaling with 2, 4, 8, and 16 worker nodes, using the small Taxi dataset.

XGBoost performance

Following similar steps, we can evaluate cluster performance for all our other algorithms, including XGBoost. The following example is trained on a subset of the much larger ‘mortgage’ dataset, which is available here. Note that because this dataset is not publicly hosted on GCP, some additional steps are required to pull the data, push it to a private GCP bucket, and convert the dataset to Parquet. The setup required for GCP/GKE is covered in the optional portion of the ‘Kubernetes on GKE’ section of this document; the scripts for converting the mortgage dataset to parquet can be found here.

In addition to the steps described above, we will also utilize the RAPIDS Forest Inference Library (FIL) for accelerated inference of our trained XGBoost model. The process for this is somewhat different from what occurs with RandomForest. After our initial XGBoost model is fit, we will save the model to a centralized GCP bucket, and subsequently instantiate the model as a FIL object on each of our available workers. Once that step is completed, we can perform FIL-based inference locally on each worker for its portion of the dataset.
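The exact notebook code is not reproduced here, but a rough sketch of that flow, using the Dask XGBoost API together with cuML’s ForestInference, might look like the following. All variable names, parameters, and paths are placeholders rather than the notebook’s own.

import xgboost as xgb
from cuml import ForestInference

# Distributed training on dask_cudf collections (X_train, y_train are placeholders).
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
output = xgb.dask.train(
    client,
    {"tree_method": "gpu_hist", "max_depth": 6, "eta": 0.1},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]

# Save the trained model somewhere every worker can reach it, for example a
# GCS bucket or shared volume (the path below is a placeholder).
model_path = "xgb_mortgage.model"
booster.save_model(model_path)

def predict_with_fil(partition):
    # Each worker loads the saved model as a FIL object and scores its own
    # local partition of the dataset.
    fil = ForestInference.load(model_path, model_type="xgboost")
    return fil.predict(partition)

predictions = X_test.map_partitions(predict_with_fil)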

Conclusion

Congratulations! At this point you’ve gone through the process of spinning up a Dask cluster in GKE, loaded a substantial dataset, performed distributed training using multiple nodes and GPUs, and built familiarity with the Dask ecosystem and monitoring tools for Jupyter.

Going forward, this should provide you with a basic template for utilizing Dask, RAPIDS, and XGBoost with your own datasets to build and evaluate your workflow in Kubernetes.

For more information about the technologies we’ve used, such as RAPIDS, Dask, and Kubernetes, check out the links below.

Categories
Misc

Does someone know why this is happening?

I’ve been trying to use tensorflow.js recently with a model that I had trained in Python and converted it to .JSON (normal procedure).

My model architecture was:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 256) 7680
_________________________________________________________________
bidirectional (Bidirectional (None, 2048) 10493952
_________________________________________________________________
dense (Dense) (None, 128) 262272
_________________________________________________________________
dense_1 (Dense) (None, 10) 1290
=================================================================
Total params: 10,765,194
Trainable params: 10,765,194
Non-trainable params: 0
_________________________________________________________________
Inside that Bidirectional layer there is an LSTM layer with 1024 units.

But on Javascript when I call:

const MODEL_URL = "../model/model.json"

async function run() {
  // Load the model from the CDN.
  const model = await tf.loadLayersModel(MODEL_URL, strict=false);
  // Print out the architecture of the loaded model.
  // This is useful to see that it matches what we built in Python.
  console.log(model.summary());
}

I get this error:

Uncaught (in promise) TypeError: e.forEach is not a function
    at bg (util_base.js:681)
    at Mw (tensor_ops_util.js:44)
    at Lw (tensor.js:56)
    at Ww (io_utils.js:225)
    at RM (models.js:334)
    at models.js:316
    at c (runtime.js:63)
    at Generator._invoke (runtime.js:293)
    at Generator.next (runtime.js:118)
    at bv (runtime.js:747)

But I looked at that util_base.js file and there’s no e.forEach call at line 681; actually, it doesn’t even have 681 lines:

The util_base.js file

I created an issue on the tfjs repository on GitHub reporting it as a bug, but I think the contributors didn’t believe me or didn’t want to help; they simply told me that this code ran normally in their environment, so now I don’t know what to do.
Does anybody have an idea of what is causing this error?
If you want to reproduce the code by yourselves here is the Glitch link: https://glitch.com/edit/#!/spotted-difficult-neptune

submitted by /u/fatorius_hs

Categories
Misc

Illegal instruction (core dumped)

Can anyone help me fix this error:

$ sudo docker run -it tensorflow/tensorflow:latest-gpu-jupyter bash

WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...

root@4d6368436b20:/tf# python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
Illegal instruction (core dumped)

My system is: Debian 11, NVIDIA driver version 460.73

submitted by /u/notooth1

Categories
Misc

Does ModelCheckpoint Callback Reset For Each Model.fit()?

I’m running a classifier that is drawing data from two large datasets using two generators. I build one model and then train it in a loop that looks something like this:

myModelCheckpoint = ModelCheckpoint("dirname")

for _ in range(nIterations):
    x_train, y_train = getTrainingDataFromGenerators()
    model.fit(x_train, y_train, ... epochs=10, callbacks=[myModelCheckpoint])

What I want is for ModelCheckpoint to fire on the single best model over all nIterations. But it seems like it resets and starts over for each model.fit(). I’ve seen a model get saved for a particular val_acc that is lower than the best val_acc of the previous model.fit().

Essentially I want a global ModelCheckpoint, not local to a particular model.fit(). Is that possible?

submitted by /u/Simusid

Categories
Misc

NEW on NGC: MATLAB R2021a Container for Ampere GPUs Fast Tracks Deep Learning and Scientific Computing

Announcing the availability of MATLAB R2021a on the NGC catalog, NVIDIA’s hub of GPU-optimized AI and HPC software. The latest version provides full support for running deep learning, automotive, and scientific analysis on NVIDIA’s Ampere GPUs.

In addition to supporting Ampere GPUs, the latest version also includes the following features and benefits: 

Download the MATLAB R2021a container from the NGC catalog. 

Categories
Misc

Getting the Most Out of NVIDIA T4 on AWS G4 Instances

With the continued growth of AI models and data sets and the rise of real-time applications, getting optimal inference performance has never been more important. In this post, you learn how to get the best natural language inference performance from AWS G4dn instances powered by NVIDIA T4 GPUs, and how to deploy BERT networks easily using NVIDIA Triton Inference Server.

As the explosive growth of AI models continues unabated, natural language processing and understanding are at the forefront of this growth. As the industry heads toward trillion-parameter models and beyond, acceleration for AI inference is now a must-have.

Many organizations deploy these services in the cloud and seek to get optimal performance and utility out of every instance they rent. Instances like the AWS G4dn, powered by NVIDIA T4 GPUs, are a great platform for delivering AI inference to cutting-edge applications. The combination of Tensor Core technology, TensorRT, INT8 precision, and NVIDIA Triton Inference Server teams up to get the best inference performance from AWS.

For starters, here are the dollars and cents. Running BERT-based networks, AWS customers can get a million sentences inferenced for about a dime. Using BERT Large, which is about three times larger than BERT Base, you can get a million sentences inferenced for around 30 cents. The efficiency of the T4 GPU that powers the AWS g4dn.xlarge instance means you can cost effectively deploy smart, powerful natural language applications to attract new customers, and deliver great experiences to existing customers.

BERT language networks are compute-intensive but inferencing with g4dn.xlarge costs under 50 cents per million sentences.
Figure 1. NVIDIA T4 on the AWS g4dn.xlarge instance delivers great performance that translates into cost savings and great customer experiences.

Deploying inference-powered applications is still sometimes harder than it must be. To that end, we created NVIDIA Triton Inference Server. This open-source server software eases deployment, with automatic load balancing, automatic scaling, and dynamic batching. This last feature is especially useful as many natural language AI applications must operate in real time.

Diagram shows NVIDIA Triton architecture.
Figure 2. Triton Inference Server simplifies model deployment on any CPU or GPU-powered systems.

Currently, NVIDIA Triton supports a variety of major AI frameworks, including TensorFlow, TensorRT, PyTorch, and ONNX. You can also implement your own custom inference workload by using the Python and C++ custom backend.

With the new feature introduced in the NVIDIA Triton model analyzer tool, you can set a latency budget of five milliseconds. NVIDIA Triton automatically sets the optimal batch size for the best throughput while maintaining that latency budget. In addition, NVIDIA Triton is tightly coupled with Kubernetes. It can be used with cloud provider-managed Kubernetes services like Amazon EKS, Google Kubernetes Engine, and Azure Kubernetes Service.

BERT inference performance

Because language models are often used in real-time applications, we discuss performance at several latency targets, specifically 5ms and 10ms. To simulate a real-world application, assume that there are multiple end users all sending inference requests with a batch size of one simultaneously. What’s of interest is how many requests can be handled per second.

For these measurements on G4dn, we used NVIDIA Triton to serve the BERT QA model with a sequence length of 128 and precision of INT8. Using INT8 precision, we saw up to an 80% performance improvement compared to FP16, which translates into more simultaneous requests at any given latency requirement.

Much higher throughput can be obtained with the NVIDIA Triton optimizations of concurrent model execution and dynamic batching. You can also make use of the model analyzer tool to help you find the optimal configurations to maximize the throughput under the required latency of 5 ms and 10 ms.

             Batch throughput   Real-time throughput   Batch inference cost    Real-time inference cost
                                                       per 1M inferences       per 1M inferences
BERT Base    1,794              1,794                  $0.08                   $0.08
BERT Large   525                449                    $0.33                   $0.28
Table 1. Low-latency performance and cost per million inferences. Throughput measured in sentences/second.

For BERT Base, you can see that T4 can achieve nearly 1,800 sentences/sec within a 10ms latency budget. T4 very quickly reaches its maximum throughput, so the real-time throughput is about the same as the high batch throughput. This means that a single T4 GPU can deliver answers to nearly 1,800 requests per second, and deliver a million of these answers for less than a dime, making it a cost-effective solution.

BERT-Large is about three times larger than BERT-Base and can deliver more accurate and refined answers. With this model, T4 on G4dn can deliver 449 sentences per second within the 10ms latency limit, and 525 sentences/sec for batch throughput. In terms of cost per million inferences, this translates into an instance cost of 33 cents for real-time and 28 cents for batch throughput, again delivering great performance per dollar.
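These cost figures follow directly from the throughput numbers and the on-demand instance price of $0.526 per hour quoted later in this post: one million sentences at 449 sentences/sec takes roughly 2,230 seconds, or about 0.62 hours, which works out to roughly $0.33, while the 525 sentences/sec batch throughput gives about 0.53 hours, or roughly $0.28. The same arithmetic at 1,794 sentences/sec yields the $0.08 figure for BERT Base.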

Optimal inference with TensorRT

TensorRT makes it easy to get the most out of your GPU inference performance, with layer and tensor fusion, reduced mixed precision and optimized kernels.
Figure 3. TensorRT delivers optimal performance, latency, accuracy, and efficiency on NVIDIA data center platforms.

NVIDIA TensorRT plays a key role in getting the most performance and value out of AWS G4 instances. This SDK delivers high-performance deep learning inference; it includes a deep learning inference optimizer and runtime that bring low latency and high throughput to deep learning inference applications. Figure 3 shows the major features:

  1. Reduced mixed precision: Maximizes throughput by quantizing models to INT8 while preserving accuracy.
  2. Layer and tensor fusion: Optimized use of GPU memory and bandwidth by fusing nodes in a kernel.
  3. Kernel auto-tuning: Selects best layers and algorithms based on the target GPU platform.
  4. Dynamic tensor memory: Minimizes memory footprint and reuses memory for tensors efficiently.
  5. Multi-stream execution: Uses a scalable design to process multiple input streams in parallel.
  6. Time fusion: Optimizes recurrent neural networks over time with dynamically generated kernels.

TensorRT maximizes throughput by quantizing models to INT8 while preserving accuracy, and automatically selects best data layers and algorithms that are optimized for the target GPU platform.

TensorRT and NVIDIA Triton Inference Server software are both available from NGC Catalog, the curated set of NVIDIA GPU-optimized software for AI, HPC, and visualization. The NGC Catalog consists of containers, pretrained models, Helm charts for Kubernetes deployments, and industry-specific AI toolkits with SDKs. TensorRT and NVIDIA Triton are also both available in the NGC Catalog in AWS Marketplace, making it even easier to use these resources on AWS G4 instances.

Amazon EC2 G4 instances

AWS offers the G4dn Instance based on NVIDIA T4 GPUs, and describes G4dn as “the lowest cost GPU-based instances in the cloud for machine learning inference and small scale training.”

Amazon EC2 offers a variety of G4 instances with one or multiple GPUs, and with different amounts of vCPU and memory. You can perform BERT inference below 5 ms on a single T4 GPU with 16 GB of memory, such as on a g4dn.xlarge instance. The cost of this instance at the time of publication is $0.526 per hour on demand in the US East (N. Virginia) Region.

Running BERT on AWS G4dn

Here’s how to get the most performance from the popular language model BERT, a transformer-based model introduced by Google a few years ago. We discuss performance for both BERT-Base and BERT-Large and then walk through how to set up NVIDIA Triton to perform inferences on both models.

To experience the outstanding performance shown earlier, follow the detailed steps in the TensorRT demo. To save the effort of complicated environment setup and configuration, you can start directly with the monthly updated, performance-optimized containers available on NGC.

Launch an EC2 instance. Select the Deep Learning AMI (Ubuntu 18.04) version 43.0 to run on a g4dn.xlarge instance with at least 150 GB of storage space. Log in to your instance.

Clone TensorRT repository into your local environment:

git clone -b master https://github.com/nvidia/TensorRT TensorRT 

Pull and launch the container:

docker run --gpus all -it --rm -v $HOME/TensorRT:/workspace/TensorRT nvcr.io/nvidia/tensorflow:21.04-tf1-py3 

Set up the NGC command line interface and download the dataset as well as the models. To set up the NGC command line interface, follow the download instructions based on your OS (AMD64 Linux, in this case).

Change to the BERT directory:

cd /workspace/TensorRT/demo/BERT 

Download SQuAD v2.0 training and dev dataset.

bash ./scripts/download_squad.sh v2_0 

Download the TensorFlow checkpoints for the BERT base model with sequence length 128, fine-tuned for SQuAD v2.0. It takes a few minutes to download the model.

bash scripts/download_model.sh base 

Install related packages and build the TensorRT engine. To build an engine, follow these steps. First, install the required package:

pip install pycuda 

Create a directory to store the engines:

mkdir -p engines 

Run the builder.py script to build the engine with FP16 precision:

python3 builder.py \
         -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt \
         -o engines/bert_base_128.engine \
         -b 1 -s 128 --fp16 \
         -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1

Run the builder.py script to build the engine with INT8 precision to obtain the best performance, as shown earlier:

python3 builder.py \
         -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt \
         -o engines/bert_base_128_int8mix.engine \
         -b 1 -s 128 --int8 --fp16 --strict \
         -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1 \
         --squad-json ./squad/train-v2.0.json \
         -v models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/vocab.txt \
         --calib-num 100 -iln -imh

Test and benchmark the TensorRT engines that you just created. To benchmark the FP16 engine built above, run the following command:

python3 perf.py -e /workspace/TensorRT/demo/BERT/engines/bert_base_128.engine -b 1 -s 128 

To benchmark the INT8 engine, run the following command:

python3 perf.py -e /workspace/TensorRT/demo/BERT/engines/bert_base_128_int8mix.engine -b 1 -s 128 

For more information about performance results on various GPU architectures and settings, see BERT Inference Using TensorRT: Results.

Deploy the BERT QA model for inference with NVIDIA Triton Inference Server

NVIDIA Triton supports the following optimization modes:

  • Concurrent model execution: Enables multiple models, or multiple instances of the same model, to execute in parallel on the same GPU or on multiple GPUs to exploit the parallelism of GPU better.
  • Dynamic batching: Instruct the server to wait a predefined amount of time to combine individual inference requests into a preferred batch size preconfigured to enhance GPU utilization and improve inference throughput.

For more information about framework-specific optimization, see Optimization.

In this section, we show you how to deploy the TensorRT model with NVIDIA Triton Inference Server and turn on concurrent model execution. We also demonstrate dynamic batching with only a few lines of code. Follow these steps on the g4dn.xlarge instance launched earlier.

In the first step, you regenerate the model files with a larger batch size to enable the NVIDIA Triton dynamic batching optimizations. This should be run in the same container as in the previous sections.

Regenerate the TensorRT engine files with a larger batch size:

mkdir -p triton_engines
 python3 builder.py \
         -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt \
         -o triton_engines/bert_base_128_int8mix.engine \
         -b 16 -s 128 --int8 --fp16 --strict \
         -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1 \
         --squad-json ./squad/train-v2.0.json \
         -v models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/vocab.txt \
         --calib-num 100 -iln -imh

Exit the current Docker environment and create a directory for triton_serving by running the following commands:

exit
 mkdir -p $HOME/triton_serving/bert_base_qa/1
 cp $HOME/TensorRT/demo/BERT/triton_engines/bert_base_128_int8mix.engine $HOME/triton_serving/bert_base_qa/1/model.plan 

You can change the bert_base_128_int8mix.engine to other engine files that you would like to serve on NVIDIA Triton.

It should be in the following format if everything works correctly:

triton_serving
 └── bert_base_qa
     └── config.pbtxt (optional)
     └── 1
         └── model.plan 

In this format, triton_serving is the model repository containing all your models, bert_base_qa is the model name, and 1 is the version number.

If you don’t know what to put into the config.pbtxt file yet, you may use the --strict-model-config False flag to let NVIDIA Triton serve the model with an automatically generated configuration.

In addition to the default configuration automatically generated by the NVIDIA Triton server, we recommend finding an optimal configuration based on the actual workload that users need. Download our example config.pbtxt file.

As you can see from the config.pbtxt file, you only need four lines of code to enable dynamic batching:

dynamic_batching {
   preferred_batch_size: 4
   max_queue_delay_microseconds: 2000
 } 

Here, the preferred_batch_size option means the preferred batch size that you would like to combine your input requests into. The max_queue_delay_microseconds option is how long the NVIDIA Triton server waits when the preferred size cannot be created from the available requests.

For concurrent model execution, directly specify the model concurrency per GPU by changing the count number in the instance_group.

instance_group {
   count: 2
   kind: KIND_GPU
 } 

For more information about the configuration files, see Model Configuration.

Start the NVIDIA Triton server by running the following command:

docker run --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $HOME/triton_serving/:/models nvcr.io/nvidia/tritonserver:21.04-py3 tritonserver --model-repository=/models (--strict-model-config False) 

The --strict-model-config False is only needed if you are not including the config.pbtxt file in your NVIDIA Triton server directory.

With that, congratulations on having your first NVIDIA Triton server running! Feel free to use the HTTP or gRPC protocol to send your requests, or use the Performance Analyzer tool to test the server performance.
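As an illustration, here is a minimal HTTP client sketch using the tritonclient Python package (pip install tritonclient[http]). The tensor names, shapes, and dtypes below are assumptions, not taken from the engine built earlier; query the model metadata that Triton reports for your model before relying on them.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Inspect the served model to confirm its real input/output tensor names.
print(client.get_model_metadata("bert_base_qa"))

# Placeholder BERT-style inputs for a sequence length of 128.
seq_len = 128
dummy_inputs = {
    "input_ids": np.zeros((1, seq_len), dtype=np.int32),    # token ids (placeholder name)
    "segment_ids": np.zeros((1, seq_len), dtype=np.int32),  # sentence A/B ids (placeholder name)
    "input_mask": np.ones((1, seq_len), dtype=np.int32),    # attention mask (placeholder name)
}

inputs = []
for name, data in dummy_inputs.items():
    tensor = httpclient.InferInput(name, list(data.shape), "INT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer(model_name="bert_base_qa", inputs=inputs)
print(result.get_response())  # raw response with output tensor names and shapes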

NVIDIA Triton performance benchmarking with perf_analyzer

For the following benchmark, you use the perf_analyzer application to generate concurrent inference requests and measure the throughput and latency of those requests. By default, perf_analyzer sends requests with concurrency number 1 and batch size 1. The whole process works as follows:

  • The perf_analyzer sends one inference request to NVIDIA Triton, waits for the response, and only sends the subsequent request once the previous response is received.
  • To simulate multiple end users using the service simultaneously, increase the request concurrency number to generate more loads to the NVIDIA Triton server.

While the NVIDIA Triton server from the previous step is still running, open a new terminal, connect using SSH to the instance that you were running, and run the NGC NVIDIA Triton SDK container:

docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:21.04-py3-sdk 

Generate a real input file in the required format. For an easy start, you can also directly download an example input.json.

Run the perf analyzer with the following command:

perf_analyzer -m bert_base_qa --input-data input.json 

This starts the perf analyzer, sending requests with the default request concurrency of 1 and batch size of 1. A detailed log of throughput and latency with a breakdown is printed for further analysis. You can also add the --concurrency-range and -b flags to increase the request concurrency and batch size to simulate heavier load scenarios. For example:

perf_analyzer -m bert_base_qa --input-data input.json --concurrency-range 8 -b 1 

The preceding command sends the request with request concurrency 8 and batch size 1 to the NVIDIA Triton server.

Tables 2 and 3 show the inference throughput and latency results.

Instance      Batch size   Request concurrency   Model concurrency per GPU   Preferred batch size for dynamic batching   Throughput (sentences/sec)   p99 latency (ms)
g4dn.xlarge   1            1                     1                           Not enabled                                 427                          2.5
g4dn.xlarge   1            8                     2                           4                                           1,639                        5.2
g4dn.xlarge   1            16                    2                           8                                           1,794                        9.4
Table 2. NVIDIA Triton serving BERT-Base QA inference performance (concurrent model execution and dynamic batching).

Instance      Batch size   Request concurrency   Model concurrency per GPU   Preferred batch size for dynamic batching   Throughput (sentences/sec)   p99 latency (ms)
g4dn.xlarge   1            1                     1                           Not enabled                                 211                          4.8
g4dn.xlarge   1            4                     2                           2                                           449                          9.5
Table 3. NVIDIA Triton serving BERT-Large QA inference performance (concurrent model execution and dynamic batching).

Tables 2 and 3 show that NVIDIA Triton can provide higher throughput with the concurrent model execution and dynamic batching features, compared to the baseline without these optimizations on the same infrastructure.

With the default configuration for BERT-Base, you can reduce P99 latency to 2.5 ms. By combining dynamic batching and concurrent model execution, you can achieve a throughput of 1,639 sentences/sec with P99 latency around 5 ms, and 1,794 sentences/sec with P99 latency under 10 ms.

For BERT-Large inference, you can achieve:

  • A P99 latency as low as 4.8 ms in the lowest-latency configuration
  • A best throughput of 449 sentences/sec under a 10 ms P99 latency, with dynamic batching and concurrent model execution enabled

Conclusion

To summarize, you converted a fine-tuned BERT model for QA tasks into a TensorRT engine, which is highly optimized for inference. The optimized BERT QA engine was then deployed on NVIDIA Triton Inference Server, with concurrent model execution and dynamic batching to get the best performance from NVIDIA T4 GPUs.

Be sure to visit NGC, where you can find GPU-optimized AI, high-performance computing (HPC), and data analytics applications, as well as enterprise-grade containers, pretrained AI models, and industry-specific SDKs to aid in the development of your own workload. Also, stay tuned for the upcoming TensorRT 8, which includes new features like sparsity optimization for NVIDIA Ampere Architecture GPUs, quantization-aware training, and an enhanced compiler to accelerate transformer-based networks.

Categories
Misc

AI Slam Dunk: Startup’s Checkout-Free Stores Provide Stadiums Fast Refreshments

With live sports making a comeback, one thing remains a constant: Nobody likes to miss big plays while waiting in line for a cold drink or snack. Zippin offers sports fans checkout-free refreshments, and it’s racking up wins among stadiums as well as retailers, hotels, apartments and offices. The startup, based in San Francisco, develops …

The post AI Slam Dunk: Startup’s Checkout-Free Stores Provide Stadiums Fast Refreshments appeared first on The Official NVIDIA Blog.

Categories
Offsites

Learning to Manipulate Deformable Objects

While the robotics research community has driven recent advances that enable robots to grasp a wide range of rigid objects, less research has been devoted to developing algorithms that can handle deformable objects. One of the challenges in deformable object manipulation is that it is difficult to specify such an object’s configuration. For example, with a rigid cube, knowing the configuration of a fixed point relative to its center is sufficient to describe its arrangement in 3D space, but a single point on a piece of fabric can remain fixed while other parts shift. This makes it difficult for perception algorithms to describe the complete “state” of the fabric, especially under occlusions. In addition, even if one has a sufficiently descriptive state representation of a deformable object, its dynamics are complex. This makes it difficult to predict the future state of the deformable object after some action is applied to it, which is often needed for multi-step planning algorithms.

In “Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks,” to appear at ICRA 2021, we release an open-source simulated benchmark, called DeformableRavens, with the goal of accelerating research into deformable object manipulation. DeformableRavens features 12 tasks that involve manipulating cables, fabrics, and bags and includes a set of model architectures for manipulating deformable objects towards desired goal configurations, specified with images. These architectures enable a robot to rearrange cables to match a target shape, to smooth a fabric to a target zone, and to insert an item in a bag. To our knowledge, this is the first simulator that includes a task in which a robot must use a bag to contain other items, which presents key challenges in enabling a robot to learn more complex relative spatial relations.

The DeformableRavens Benchmark
DeformableRavens expands our prior work on rearranging objects and includes a suite of 12 simulated tasks involving 1D, 2D, and 3D deformable structures. Each task contains a simulated UR5 arm with a mock gripper for pinch grasping, and is bundled with scripted demonstrators to autonomously collect data for imitation learning. Tasks randomize the starting state of the items within a distribution to test generality to different object configurations.

Examples of scripted demonstrators for manipulation of 1D (cable), 2D (fabric), and 3D (bag) deformable structures in our simulator, using PyBullet. These show three of the 12 tasks in DeformableRavens. Left: the task is to move the cable so it matches the underlying green target zone. Middle: the task is to wrap the cube with the fabric. Right: the task is to insert the item in the bag, then to lift and move the bag to the square target zone.

Specifying goal configurations for manipulation tasks can be particularly challenging with deformable objects. Given their complex dynamics and high-dimensional configuration spaces, goals cannot be as easily specified as a set of rigid object poses, and may involve complex relative spatial relations, such as “place the item inside the bag”. Hence, in addition to tasks defined by the distribution of scripted demonstrations, our benchmark also contains goal-conditioned tasks that are specified with goal images. For goal-conditioned tasks, a given starting configuration of objects must be paired with a separate image that shows the desired configuration of those same objects. A success for that particular case is then based on whether the robot is able to get the current configuration to be sufficiently close to the configuration conveyed in the goal image.

Goal-Conditioned Transporter Networks
To complement the goal-conditioned tasks in our simulated benchmark, we integrated goal-conditioning into our previously released Transporter Network architecture — an action-centric model architecture that works well on rigid object manipulation by rearranging deep features to infer spatial displacements from visual input. The architecture takes as input both an image of the current environment and a goal image with a desired final configuration of objects, computes deep visual features for both images, then combines the features using element-wise multiplication to condition pick and place correlations to manipulate both the rigid and deformable objects in the scene. A strength of the Transporter Network architecture is that it preserves the spatial structure of the visual images, which provides inductive biases that reformulate image-based goal conditioning into a simpler feature matching problem and improves the learning efficiency with convolutional networks.

An example task involving goal-conditioning is shown below. In order to place the green block into the yellow bag, the robot needs to learn spatial features that enable it to perform a multi-step sequence of actions to spread open the top opening of the yellow bag, before placing the block into it. After it places the block into the yellow bag, the demonstration ends in a success. If in the goal image the block were placed in the blue bag, then the demonstrator would need to put the block in the blue bag.

An example of a goal-conditioned task in DeformableRavens. Left: A frontal camera view of the UR5 robot and the bags, plus one item, in a desired goal configuration. Middle: The top-down orthographic image of this setup, which is size 160×320 and passed as the goal image to specify the task success criterion. Right: A video of the demonstration policy showing that the item goes into the yellow bag, instead of the blue one.

Results
Our results suggest that goal-conditioned Transporter Networks enable agents to manipulate deformable structures into flexibly specified configurations without test-time visual anchors for target locations. We also significantly extend prior results using Transporter Networks for manipulating deformable objects by testing on tasks with 2D and 3D deformables. Results additionally suggest that the proposed approach is more sample-efficient than alternative approaches that rely on using ground-truth pose and vertex position instead of images as input.

For example, the learned policies can effectively simulate bagging tasks, and one can also provide a goal image so that the robot must infer into which bag the item should be placed.

An example of policies trained using Transporter Networks applied in action on bagging tasks, where the objective is to first open the bag, then to put one (left) or two (right) items in the bag, then to insert the bag into the target zone. The left animation is zoomed in for clarity.
An example of the learned policy using Goal-Conditioned Transporter Networks. Left: The frontal camera view. Middle: The goal image that the Goal-Conditioned Transporter Network receives as input, which shows that the item should go in the red bag, instead of the blue distractor bag. Right: The learned policy putting the item in the red bag, instead of the distractor bag (colored yellow in this case).

We encourage other researchers to check out our open-source code to try the simulated environments and to build upon this work. For more details, please check out our paper.

Future Work
This work exposes several directions for future development, including the mitigation of observed failure modes. As shown below, one failure is when the robot pulls the bag upwards and causes the item to fall out. Another is when the robot places the item on the irregular exterior surface of the bag, which causes the item to fall off. Future algorithmic improvements might allow actions that operate at a higher frequency rate, so that the robot can react in real time to counteract such failures.

Examples of failure cases from the learned Transporter-based policies on bag manipulation tasks. Left: the robot inserts the cube into the opening of the bag, but the bag pulling action fails to enclose the cube. Right: the robot fails to insert the cube into the opening, and is unable to perform recovery actions to insert the cube in a better location.

Another area for advancement is to train Transporter Network-based models for deformable object manipulation using techniques that do not require expert demonstrations, such as example-based control or model-based reinforcement learning. Finally, the ongoing pandemic limited access to physical robots, so in future work we will explore the necessary ingredients to get a system working with physical bags, and to extend the system to work with different types of bags.

Acknowledgments
This research was conducted during Daniel Seita’s internship at Google’s NYC office in Summer 2020. We thank our collaborators Pete Florence, Jonathan Tompson, Erwin Coumans, Vikas Sindhwani, and Ken Goldberg.

Categories
Misc

NLP and Text Processing with RAPIDS: Now Simpler and Faster

This post was originally published on the RAPIDS AI Blog.

TL;DR: Google famously noted that “speed isn’t just a feature, it’s the feature.” This is not only true for search engines but for all of RAPIDS. In this post, we will showcase performance improvements for string processing across cuDF and cuML, which enable acceleration across diverse text processing workflows.

Introduction

In our previous post, we showed basic text preprocessing with RAPIDS. Since then, we have come a long way in speed improvements, memory reductions, and API simplification.

Here is what we’ll cover in this post:

  • Built-in, Simplified String and Categorical Support
  • GPU TextVectorizers: Leaner and Meaner
  • Accelerating Diverse String Workflows

Built-in Support for Strings and Categoricals

Goodbye, cuStrings, nvStrings, and nvCategory! We hardly knew ye. Our first couple of posts about string manipulation on GPUs involved separate, specialized libraries for working with string data on the device. It also required significant expertise to integrate with other RAPIDS libraries like cuDF and cuML. Since then, we open-sourced, rearchitected, and migrated those string and text-related features into more user-friendly DataFrame APIs as part of cuDF. In addition, we adopted the “Apache Arrow” format for cuDF’s string representation, resulting in substantial memory savings and speedups.

Old Categorization Using nvcategory

Old categorization using nvcategory.

Updated Categorization Using Built-in Categorical dtype

Updated categorization with built-in categorical support.

Example workflow

As a concrete, non-toy example of these improvements, consider our recently updated Gutenberg corpus analysis notebook. Previously we had to (slowly) jump through a few hoops, but no longer!

With our improved Pandas string API coverage, we not only have simpler code, but we also get double the performance: where we previously took 2.31 s, we now take only 1.05 s, pushing our overall speedup against Pandas to 151x.

Check out the comparison between the previous and updated notebooks below.

Previous:

Pre-Processing using nvtext+nvstrings

Updated:

Updated Pre-Processing with the latest API
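As an illustrative example (not the notebook’s exact code) of what the updated API looks like, basic cleanup and tokenization can now be written directly against cuDF’s Pandas-like string methods:

import cudf

# Toy documents for illustration only.
text = cudf.Series([
    "It was the best of times, it was the worst of times.",
    "Call me Ishmael.",
])

cleaned = (
    text.str.lower()
        .str.replace(r"[^a-z ]", "", regex=True)  # strip punctuation and digits
        .str.normalize_spaces()
)
tokens = cleaned.str.split(" ")
print(tokens)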

GPU TextVectorizers: leaner and meaner

We recently launched the feature.text subpackage in cuML by adding Count and TF-IDF vectorizers, kick-starting a series of natural language processing (NLP) transformers on GPUs.

Since then, we have added hashing vectorizer (20x faster than scikit-learn) and improved our existing Count/TF-IDF vectorizer performance by 3.3x and memory by 2x.

Hashing Vectorizer Speed Up vs Sklearn

In our recent NLP post, we analyzed 5 million COVID-related tweets by first vectorizing them using TF-IDF and then clustering and searching in the vector space. With our recent changes (GitHub 2554, 2575, 5666), we have improved the TF-IDF vectorization of that workflow on both the memory and runtime fronts.

  • Peak memory usage decreased from 19 GB to 8 GB.
  • Run time improved from 26 s to 8 s, pushing our overall speedup to 21x over scikit-learn.

All the preceding improvements mean that your TF-IDF work can scale much further.
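As a small, self-contained illustration of the vectorizers discussed above (the toy documents are ours, not from the tweet workflow), the API mirrors scikit-learn’s:

import cudf
from cuml.feature_extraction.text import HashingVectorizer, TfidfVectorizer

docs = cudf.Series([
    "rapids makes gpu data science easy",
    "tf idf vectorizers now run on the gpu",
    "the hashing vectorizer is stateless and fast",
])

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)   # sparse matrix living on the GPU
print(tfidf_matrix.shape)

hasher = HashingVectorizer(n_features=2 ** 16)
hashed = hasher.fit_transform(docs)
print(hashed.shape)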

Scale-out TF-IDF across multiple machines

You can also scale your TF-IDF workflow to multiple GPUs and machines using cuML’s distributed TF-IDF transformer. The transformer gives you a distributed vectorized matrix, which can be used with distributed machine learning models like cuml.dask.naive_bayes to get end-to-end acceleration across machines.

Accelerating diverse string workflows

We are adding more string functionality, like character_tokenize, character_ngrams, ngram_tokenize, filter_tokens, and filter_alphanum, as well as higher-level text-processing APIs like a GPU-accelerated BERT tokenizer and text vectorizers, to help enable the more complex string and text manipulation logic you find in real-world NLP applications.
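A quick taste of those string features, as a sketch (exact argument names can differ between cuDF releases):

import cudf

s = cudf.Series(["Hello, GPU world!", "RAPIDS accelerates NLP."])

print(s.str.character_tokenize())                  # one row per character
print(s.str.filter_alphanum(" "))                  # replace non-alphanumeric characters
print(s.str.ngrams_tokenize(n=2, separator="_"))   # token bigrams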

In the next installment, we will put all these features through their paces in a specialized NLP benchmark. In the meantime, try RAPIDS in your NLP work on Google Colab or BlazingSQL notebooks, see our documentation page, and if you see something missing, we welcome feature requests on GitHub!

Categories
Misc

No dashboards are active for the current data set

I am trying to detect objects using TensorFlow and Google Colab. The steps are given in the link below: https://medium.com/swlh/tensorflow-2-object-detection-api-with-google-colab-b2af171e81cc

When I came to the step of starting TensorBoard, I get:

 No dashboards are active for the current data set. 

After two steps, when training the model, I’m now getting a lot of warnings and eventually an error:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "/usr/local/lib/python3.7/dist-packages/object_detection-0.1-py3.7.egg/object_detection/model_lib_v2.py", line 639, in train_loop
    loss = _dist_train_step(train_input_iter)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[100,51150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node Loss/Compare_9/IOU/Intersection/Minimum_1 (defined at /local/lib/python3.7/dist-packages/object_detection-0.1-py3.7.egg/object_detection/core/box_list_ops.py:257) ]]
  Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [[Func/Loss/localization_loss_1/write_summary/summary_cond/then/_0/input/_71/_348]]
  Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  (1) Resource exhausted: OOM when allocating tensor with shape[100,51150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node Loss/Compare_9/IOU/Intersection/Minimum_1 (defined at /local/lib/python3.7/dist-packages/object_detection-0.1-py3.7.egg/object_detection/core/box_list_ops.py:257) ]]
  Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_51563]
Errors may have originated from an input operation.
Input Source operations connected to node Loss/Compare_9/IOU/Intersection/Minimum_1:
  Loss/Compare_9/IOU/Intersection/split (defined at /local/lib/python3.7/dist-packages/object_detection-0.1-py3.7.egg/object_detection/core/box_list_ops.py:250)
Input Source operations connected to node Loss/Compare_9/IOU/Intersection/Minimum_1:
  Loss/Compare_9/IOU/Intersection/split (defined at /local/lib/python3.7/dist-packages/object_detection-0.1-py3.7.egg/object_detection/core/box_list_ops.py:250)
Function call stack:
_dist_train_step -> _dist_train_step

How can I handle these issues? And does TensorBoard play an important role here, or is activating it optional?

Thx in advance for your answers.

submitted by /u/kursat44