Speedy Model Training With RAPIDS + Determined AI

Model developers no longer face a steep learning curve to accelerate model training. By utilizing two open-source software projects, Determined AI’s Deep Learning Training Platform and the RAPIDS accelerated data science toolkit, they can easily achieve up to 10x speedups in data preprocessing and train models at scale. 

Making GPUs accessible

As the field of deep learning advances, practitioners are increasingly expected to make a significant investment in GPUs, either on-prem or from the cloud. Hardware is only half the story behind the proliferation of AI, though. NVIDIA’s success in powering data science has as much to do with software as hardware: widespread GPU adoption would be very difficult without convenient software abstractions that make GPUs easy for model developers to use. RAPIDS is a software suite that bridges the gap from CUDA primitives to data-hungry analytics and machine learning use cases.

Similarly, Determined AI’s deep learning training platform frees the model developer from hassles: operational hassles they are guaranteed to hit in a cluster setting, and model development hassles as they move from toy prototype to scale.  On the operational side, the platform handles distributed systems concerns like training job orchestration, storage layer integration, centralized logging, and automatic fault tolerance for long-running jobs.  On the model development side, machine learning engineers only need to maintain one version of code from the model prototype phase to more advanced tasks like multi-GPU (and multi-node) distributed training and hyperparameter tuning.  Further, the platform handles the boilerplate engineering required to track workload dependencies, metrics, and checkpoints.

At their core, both Determined AI and RAPIDS make the GPU accessible to machine learning engineers via intuitive APIs: Determined as the platform for accelerating and tracking deep learning training workflows, and RAPIDS as the suite of libraries speeding up parts of those training workflows.

For the remainder of this post, we’ll examine a model development process in which RAPIDS accelerates training data set construction within a Determined cluster, at which point Determined handles scaled-out, fault-tolerant model training and hyperparameter tuning.

The RAPIDS experience will look familiar to ML engineers who are accustomed to tackling data manipulation with pandas or NumPy, and model training with PyTorch or TensorFlow. (RAPIDS is not alone in offering familiar interfaces atop GPU acceleration: CuPy is NumPy-compatible, and OpenCV’s GPU module API is “kept similar with the CPU interface where possible.”)

Getting started

To use Determined and RAPIDS to accelerate model training, there are a few requirements to meet upfront. On the RAPIDS side, OS and CUDA version requirements are listed here.  One is worth calling out explicitly: RAPIDS requires NVIDIA P100 or later generation GPUs, ruling out the NVIDIA K80 in AWS P2 instances.

After satisfying these prerequisites, making RAPIDS available to tasks running on Determined is simple. Because Determined supports custom Docker images for running training workloads, we can create an image that contains the appropriate version of RAPIDS installed via conda. This is as simple as specifying the RAPIDS dependency in a Conda environment file:

name: Rapids
channels:
 - rapidsai
 - nvidia
 - conda-forge
dependencies:
 - rapids=0.14

And updating the base Conda environment in your custom image Dockerfile:

FROM determinedai/environments:cuda-10.0-pytorch-1.4-tf-1.15-gpu-0.7.0 as base
COPY environment.yml /tmp/
RUN conda --version && \
    conda env update --name base --file /tmp/environment.yml && \
    conda clean --all --force-pkgs-dirs --yes
RUN eval "$(conda shell.bash hook)" && conda activate base

After building and pushing this image to a Docker repository, you can run experiments, notebooks, or shell sessions by configuring the environment image that these tasks should use.
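
For instance, a fragment of an experiment configuration could select the image like this (a minimal sketch; the registry path is a placeholder, not from the original post):

environment:
  image: my-registry/determined-rapids:0.14  # hypothetical image name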

The model

To showcase the potency of integrating RAPIDS and Determined, we picked a tabular learning task that would typically benefit from nontrivial data preprocessing, based on the TabNet architecture and the pytorch-tabnet library implementing it. TabNet brings the power of deep learning to tabular data-driven use cases and offers some nice interpretability properties to boot.  One benchmark explored in the TabNet paper is the Rossman store sales prediction task of building a model to predict revenue across thousands of stores based on tabular data describing the stores, promotions, and nearby competitors.  Since Rossman dataset access requires signing off on an agreement, we train our model on generated data of a similar schema and scale so that users can more easily run this example.  All assets for this experiment are available on GitHub.
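
As a point of reference, pytorch-tabnet also exposes a standalone, sklearn-style wrapper; the following is a minimal sketch with synthetic placeholder data, not the code used in this experiment:

import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

# Synthetic placeholder data standing in for the generated Rossman-like tables.
X_train, y_train = np.random.rand(1000, 10), np.random.rand(1000, 1)
X_valid, y_valid = np.random.rand(200, 10), np.random.rand(200, 1)

model = TabNetRegressor()
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], max_epochs=10)
preds = model.predict(X_valid)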

Data prep with RAPIDS, training with Determined

With multiple CSVs to ingest and denormalize, the Rossman revenue prediction task is ripe for RAPIDS. The high-level flow to develop a revenue prediction model looks like this:

  • Read location and historical sales CSVs into cuDF DataFrames residing in GPU memory.
  • Join these data sets into a denormalized DataFrame. This GPU-accelerated join is handled by cuDF.
  • Construct a PyTorch Dataset from the denormalized DataFrame.
  • Train with Determined!

RAPIDS cuDF’s familiar pandas-esque interface makes data ingest and manipulation a breeze:

df_store = cudf.read_csv(STORE_CSV)
df_train = cudf.read_csv(TRAIN_CSV).join(df_store,
                                        how='left',
                                        on='store_id',
                                        rsuffix='store')
df_valid = cudf.read_csv(VAL_CSV).join(df_store,
                                      how='left',
                                      on='store_id',
                                      rsuffix='store')

We then use CuPy to get from a cuDF DataFrame to a PyTorch Dataset and DataLoader, which we expose via Determined’s Trial interface.
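
A minimal sketch of that conversion (column names are placeholders; assumes a cuDF version whose DataFrame.values returns a CuPy array):

import cupy as cp
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder feature/target columns, for illustration only.
feature_cols = ["promo", "competition_distance", "day_of_week"]

def to_dataset(df):
    # cuDF -> CuPy keeps the data on the GPU; no host round-trip.
    X = cp.asarray(df[feature_cols].values, dtype=cp.float32)
    y = cp.asarray(df["sales"].values, dtype=cp.float32)
    # torch.as_tensor consumes the CUDA array interface, avoiding a copy.
    return TensorDataset(torch.as_tensor(X, device="cuda"),
                         torch.as_tensor(y, device="cuda"))

train_loader = DataLoader(to_dataset(df_train), batch_size=1024)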

Given that RAPIDS cuDF is a drop-in replacement for pandas, it’s trivial to toggle between the two libraries and compare the performance of their analogous APIs. In this simplified case, cuDF showed a 10x speedup over pandas: preprocessing that took a minute on a vCPU completed in only 6 seconds on a single NVIDIA V100 GPU.
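
One way to implement that toggle is a module-level switch; a minimal sketch, assuming only calls shared by both APIs and the CSV constants from the snippet above:

import pandas
import cudf

USE_GPU = True
xdf = cudf if USE_GPU else pandas  # both modules expose read_csv, join, etc.

df_store = xdf.read_csv(STORE_CSV)
df_train = xdf.read_csv(TRAIN_CSV).join(df_store,
                                        how='left',
                                        on='store_id',
                                        rsuffix='store')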

  Another option is to use DLPack, an intermediate format that both cuDF and PyTorch support, either directly or via NVTabular’s Torch DataLoader, which does the same under the covers.

Figure 1: Faster enterprise AI with RAPIDS and Determined.

On an absolute scale, this might not seem like a big deal: whether the overall training job takes 20 or 21 minutes doesn’t seem to matter much. However, given the iterative nature of deep learning model tuning, the time and cost savings quickly add up. For a hyperparameter tuning experiment training hundreds or thousands of models, on data larger than the couple of gigabytes used here, and perhaps with more complex data transformations, savings on the order of GPU-minutes per trained model can translate to savings on the order of GPU-days or GPU-weeks at scale, netting your organization hundreds or thousands of dollars in infrastructure cost.

Determined and the broader RAPIDS toolkit

The RAPIDS library suite goes far beyond the DataFrame manipulation we leveraged in this example. To name a couple of other components:

  • RAPIDS cuML offers GPU-accelerated ML algorithms mirroring sklearn (a minimal sketch follows this list).
  • NVTabular, which sits atop RAPIDS, offers high-level abstractions for feature engineering and building recommenders.
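
As a taste of that sklearn-style API, here is a minimal cuML sketch on toy data (purely illustrative, not from the original post):

import cudf
from cuml.linear_model import LinearRegression

# Toy data; cuML estimators accept cuDF inputs and run on the GPU.
X = cudf.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                    "x2": [2.0, 1.0, 0.0, -1.0]})
y = cudf.Series([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()  # same fit/predict contract as sklearn
model.fit(X, y)
preds = model.predict(X)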

If you’re using these libraries, you’ll soon be able to train on a Determined cluster and get the platform’s resource management, experiment tracking, and hyperparameter tuning capabilities. We’ve heard from our users that the need for these tools isn’t limited to deep learning, so we are pushing into the broader ML space and making Determined not only the platform for PyTorch and TensorFlow model development, but for any Python-based model development. Stay tuned, and in the meantime you can learn more about this development from our 2021 roadmap discussion during our most recent community meetup.

Get started

If you’d like to learn more about (and test drive!) RAPIDS and Determined, check out the RAPIDS quick start and Determined’s quick start documentation. We’d love to hear your feedback on the RAPIDS and Determined community Slack channels. Happy training!

NVIDIA Deepens Commitment to Streamlining Recommender Workflows with GTC Spring Sessions

Ensuring recommenders are meaningful, personalized, and relevant to a single customer is not easy. Scaling a personalized recommender experience to hundreds of thousands, or millions, of customers comes with unique challenges that data scientists and machine learning engineers tackle every day. Scaling challenges often pose obstacles to effective ETL, training, retraining, or deploying models into production.

To tackle these challenges, machine learning engineers and data scientists within the industry utilize a hybrid combination of tools, techniques, and algorithms. NVIDIA is committed to helping streamline recommender workflows with Merlin, an open-source, interoperable framework designed to support machine learning engineers and data scientists with preprocessing, feature engineering, training, and inference. Merlin supports industry leaders who are tackling common recommender challenges as they provide relevant, impactful, and fresh recommendations at scale.

Here are a few key sessions from industry leaders in media, delivery-on-demand, and retail at GTC Spring 2021.

  • AI-First Social Media Feeds: A View From the Trenches
    Discusses efficient training of extremely large recommender models with billions of parameters distributed across multiple GPUs and workers as well as the importance of continual model updates in near-real time to deal with the key challenge of concept drift. The session includes how solutions in NVIDIA’s Merlin stack resolve key bottlenecks faced in general purpose deep learning frameworks.

Registration is free; visit the GTC website for more information.

Completing a TensorFlow Android app

Hello everyone!

I have some questions on finishing the implementation of my TensorFlow application. I need advice on how to optimize my model.

Background

I have been working on an object detection Android app based on the one provided by TensorFlow. I have added Bluetooth capabilities and implemented my own standalone Simple Online and Realtime Tracking algorithm (just so I could understand the code better in case I have to tune things). I do not want to get into the specifics of my application, but the simplest analogy is an Android app looking down at a conveyor belt. When the app sees a specific object on the conveyor belt at a certain location, it sends a Bluetooth signal for some mechanism to take action on that object at that location (this probably describes half the possible apps here haha).

My application has been tested and works successfully when using one of the default TFLite models in a simulation environment. However, the objects I plan to track are not in the standard TFLite models, so I need to create my own custom model. This is the final step of my app development.

I have (with much pain) figured out how to create a model generation pipeline: TFRecords > train > convert to TFLite > test on the Android app. I have not studied machine learning but realize that with my technical/programming/math skills I can kind of brute force a basic model and then learn the theory in more detail once my prototype is working. I have spent a fair bit of time browsing TensorFlow’s GitHub issues to produce a model that can somewhat detect my objects, but not well enough, and it is slower than the example TFLite model (on my phone, inference time is now 150ms instead of an average of 50ms). I am now looking to decrease the inference time and increase the accuracy of my model.

My current model generation pipeline uses ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8 (as I couldn’t get ssd_mobilenet_v2_320x320_coco17_tpu-8 to work), trains on my TFRecords, converts to TFLite (with the tf.lite.Optimize.DEFAULT optimization flag), and finally attaches metadata. I plug this into the Android app and then test.
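
For reference, a rough sketch of my conversion step (the SavedModel path is a placeholder):

import tensorflow as tf

# Rough sketch of the TFLite conversion step; the SavedModel path is a placeholder.
converter = tf.lite.TFLiteConverter.from_saved_model("exported/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)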

My computer is slow, so I eventually plan on renting an EC2 instance, going through a bunch of parameters in ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8’s pipeline.config, generating a bunch of TFLite models, and rating their accuracy. As a final test step, I will test the models for speed on my phone. The fastest and most accurate combination will be the TFLite model of choice.

Questions

In ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8’s pipeline.config, what parameters are good to vary to get a good parameter sweep?

What parameters are good to vary so that the resultant TFLite model is faster (the 5 MB TFLite model runs inference in 50ms while the 10 MB model takes 150ms)?

What EC2 machine do you recommend using? I understand that Amazon has machine learning tools, but with the time I spent creating my model generation pipeline I am very hesitant to jump into additional exploratory work.

I’ll add the ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8’s pipeline.config file in the comments.

submitted by /u/tensorpipelinetest

What Will NVIDIA CEO Jensen Huang Cook Up This Time at NVIDIA GTC?

Don’t blink. Accelerated computing is moving innovation forward faster than ever. And there’s no way to get smarter, quicker, about how it’s changing your world than to tune in to NVIDIA CEO Jensen Huang’s GTC keynote Monday, April 12, starting at 8:30 a.m. PT. The keynote, delivered again from the kitchen in Huang’s home, will…

GTC 21: 5 Data Center Networking and Ecosystem Sessions You Shouldn’t Miss!

As NVIDIA CEO Jensen Huang stated at last year’s GTC, “the data center is the new unit of computing.” The traditional way of using the server as the unit of computing is quickly fading away. More and more applications are moving to data centers located at the edge, in different availability zones, or in private enterprise clouds. Modern workloads such as AI/ML, edge computing, cloud-native microservices, and 5G services are becoming increasingly disaggregated, distributed, and data hungry. These applications demand efficient, secure, and accelerated computing and data processing across all layers of the application stack.

Computing accelerated by NVIDIA GPUs and data processing units (DPUs) is at the heart of modern data centers. DPUs are a game-changing new technology that accelerates GPU and CPU access to data while enabling software-defined, hardware-accelerated infrastructure. With DPUs, organizations can efficiently and effectively deploy networking, cyber security, and storage in virtualized as well as containerized environments.

We at NVIDIA are on a mission to bring the next-generation data center vision to reality. Join us at NVIDIA GTC’21 (Apr 12-16, 2021) to witness the data center innovation we are pioneering. Register for the top DPU sessions at GTC to learn how NVIDIA Networking solutions are powering next-generation data centers.

Palo Alto Networks and NVIDIA present Accelerated 5G Security: DPU-Based Acceleration of Next-Generation Firewalls [S31671]

5G offers many new capabilities, such as lower latency, higher reliability and throughput, agile service deployment through cloud-native architectures, and greater device density. A new approach is needed to achieve L7 security at these rates with software-based firewalls. Integrating Palo Alto Networks’ next-generation firewall with the NVIDIA DPU enables industry-leading, high-performance security. NVIDIA’s BlueField-2 DPU provides a rich set of network offload engines designed to address the acceleration needs of security-focused network functions in today’s most demanding markets, such as 5G and the cloud.

Speakers:

  • Sree Koratala, VP Product Management Mobility Security, Palo Alto Networks
  • John McDowall, Senior Distinguished Engineer, Palo Alto Networks
  • Ash Bhalgat, Senior Director, Cloud, Telco & Security Market Development, NVIDIA

China Mobile, Nokia and NVIDIA present Turbocharge Cloud-Native Applications with Virtual Data Plane Accelerated Networking [S31563]

Great progress has been made leveraging hardware to accelerate cloud networking for a wide range of cloud-based applications. This talk will examine how cloud networking for public cloud infrastructure-as-a-service can be accelerated using NVIDIA’s new open-standards technology called virtual data plane acceleration (vDPA). In addition, this presentation will examine the early validation results and acceleration benefits of deploying NVIDIA ASAP2 vDPA technology in China Mobile’s BigCloud cloud service.

Speakers:

  • Sharko Cheng, Senior Network Architect, Cloud Networking Products Department, CMCC
  • Mark Iskra, Director, Nokia/Nuage Networks
  • Ash Bhalgat, Senior Director, Cloud, Telco & Security Market Development, NVIDIA

Red Hat and NVIDIA present Implementing Virtual Network Offloading Using Open Source Tools on BlueField-2 [S31380]

NVIDIA and Red Hat have been working together to provide an elegant and 100% open-source solution using the BlueField DPU for hardware offloading of the software-defined networking tasks in cloud-native environments. With BlueField DPUs, we can encrypt, encapsulate, switch, and route packets right on the DPU, effectively dedicating all the server’s processing capacity to running business applications. This talk will discuss typical use cases and demonstrate the performance advantages of using BlueField’s hardware offload capabilities with Red Hat Enterprise Linux and the Red Hat OpenShift container platform.

Speakers:

  • Rashid Khan, Director of Networking, Red Hat
  • Rony Efraim, Networking Software and Systems Architect, NVIDIA

NVIDIA DPU team presents Program Data Center Infrastructure Acceleration with the Release of DOCA and the Latest DPU Software [S32205]

NVIDIA is releasing the first version of DOCA, a set of libraries, SDKs, and tools for programming the NVIDIA BlueField DPU, as well as the new version 3.6 of the DPU software. Together, these enable new infrastructure acceleration and management features in BlueField and simplify programming and application integration. DPU developers can offload and accelerate networking, virtualization, security, and storage features, including VirtIO for NFV/VNFs, BlueField SNAP for elastic storage virtualization, regular expression matching for malware detection, and deep packet inspection to enable sophisticated routing, firewall, and load-balancing applications.

Speakers:

  • Ami Badani, VP Marketing, NVIDIA
  • Ariel Kit, Director of Product Marketing for Networking, NVIDIA

F5/NGINX and NVIDIA present kTLS Hardware Offload Performance Benchmarking for NGINX Web Server [S31551]

Encrypted communication usage is steadily growing across most internet services. TLS is the leading security protocol implemented on top of TCP. kTLS (kernel TLS) provides TLS support in the Linux kernel layer. It was introduced in kernel v4.13 as a software offload for user-space TLS libraries and was extended in kernel v4.18 with an infrastructure for performing hardware-accelerated encryption/decryption in SmartNICs and DPUs. This session will review the life cycle of a hardware-offloaded kTLS connection and the driver-hardware interaction that supports it, while demonstrating and analyzing the significant performance gain of offloading kTLS operations to hardware, using NGINX as the target workload and NVIDIA’s mlx5e driver on top of a ConnectX-6 Dx SmartNIC.

Speakers:

  • Damian Curry, Business Development Technical Director, F5
  • Bar Tuaf, Software Engineer, NVIDIA

Register today for free and start building your schedule. 

Once you are signed in, you can explore all GTC conference topics here. Topics include areas of interest such as data center networking and virtualization, HPC, deep learning, data science, and autonomous machines, as well as industries including healthcare, public sector, retail, and telecommunications.

See you at GTC’21!

World of Difference: GTC to Spotlight AI Developers in Emerging Markets

Startups don’t just come from Silicon Valley — they hail from Senegal, Saudi Arabia, Pakistan, and beyond. And hundreds will take the stage at the GPU Technology Conference. GTC, running April 12-16, will spotlight developers and startups advancing AI in Africa, Latin America, Southeast Asia, and the Middle East. Registration is free, and provides access…

NVIDIA-Powered Systems Ready to Bask in Ice Lake

Data-hungry workloads such as machine learning and data analytics have become commonplace. To cope with these compute-intensive tasks, enterprises need accelerated servers that are optimized for high performance. Intel’s 3rd Gen Intel Xeon Scalable processors (code-named “Ice Lake”), launched today, are based on a new architecture that enables a major leap in performance and scalability.

We built Datature, a platform that allows anyone to train their own TensorFlow Object Detection / Segmentation model using drag and drop!

submitted by /u/xusty

Tutorial: Building a Magic Wand

submitted by /u/AR_MR_XR

Jetson Project of the Month: DeepWay, AI-based navigation aid for the visually impaired

Satinder Singh won the Jetson Project of the Month for DeepWay, an AI-based navigation assistance system for the visually impaired. The project, which runs on an NVIDIA Jetson Nano Developer Kit, monitors the path of a person and provides guidance to walk on the correct side and avoid any oncoming pedestrians. 

In addition to the Jetson Nano, Satinder’s setup includes an Arduino Nano, a webcam, two servo motors, a USB audio adapter, 3D-printed eyeglasses, and a few peripherals. The Arduino Nano controls the servo motors, which nudge the user in the correct direction by gently tapping them on either side of the head. If the Jetson Nano identifies pedestrians, a voice prompt sent through the USB audio adapter can warn the user.

To train his convolutional neural network, Satinder collected 10,000 images across three classes for left, center, and right positions within a lane, and trained his model on a GPU-enabled virtual machine on Microsoft Azure. He used both PyTorch and Keras to train a U-Net semantic segmentation model and, after further analysis, picked the U-Net model trained in Keras for its better performance.

DeepWay – Project Walk Through 

Satinder believes that his solution stands out among others because of its portability and affordability. His plans include further optimizing it with NVIDIA TensorRT. We can’t wait to see how this project evolves.

The World Health Organization estimates at least 2.2 billion people around the world have a vision impairment. Many among this group would benefit from navigation aids. Open-source prototypes such as DeepWay foster innovation around these use cases and are great examples of AI being used for social good.

If you’re interested in building your own ‘DeepWay’ or extending it, Satinder has shared the instructions and the code here.