
How to execute tf.signal.stft?

Hi,

I am trying to get the result of tf.signal.stft, e.g.

test_stft = tf.math.log(tf.abs(tf.signal.stft(test,frame_length=512,frame_step=128))) 

I thought eager execution would give me the result, but all I get is:

 tf.Tensor([], shape=(20000, 0, 257), dtype=float32) 

What can I do to get TensorFlow to actually calculate the result? I have trouble understanding eager mode and graph mode. Maybe a good YouTube resource would also help.
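For reference: a shape of (20000, 0, 257) means the STFT produced zero frames, which happens when the last axis of test (the samples axis) is shorter than frame_length=512. A minimal sketch of a call that does produce values, assuming a (batch, samples) layout with illustrative sizes:

import tensorflow as tf

# A signal with at least frame_length samples along the last axis;
# the sizes here are illustrative, not from the original post.
test = tf.random.normal([4, 16000])  # (batch, samples)

stft = tf.signal.stft(test, frame_length=512, frame_step=128)
log_mag = tf.math.log(tf.abs(stft))
print(log_mag.shape)  # (4, 122, 257) -> (batch, frames, fft_bins)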

submitted by /u/alex_bababu


NASA and NVIDIA Collaborate to Accelerate Scientific Data Science Use Cases, Part 1

Over the past couple of years, NVIDIA and NASA have been working closely on accelerating data science workflows using RAPIDS and integrating these GPU-accelerated libraries with scientific use cases. In this blog, we’ll share some of the results from an atmospheric science use case, and code snippets to port existing CPU workflows to RAPIDS on NVIDIA GPUs.

Accelerated Simulation of Air Pollution from Christoph Keller

One example science use case from NASA Goddard simulates chemical compositions of the atmosphere to monitor, forecast, and better understand the impact of air pollution on the environment, vegetation, and human health. Christoph Keller, a research scientist at the NASA Global Modeling and Assimilation Office, is exploring alternative approaches based on machine learning models to simulate the chemical transformation of air pollution in the atmosphere. Doing such calculations with a numerical model is computationally expensive, which limits the use of comprehensive air quality models for real-time applications such as air quality forecasting. For instance, the NASA GEOS composition forecast model GEOS-CF, which simulates the distribution of 250 chemical species in Earth's atmosphere in near real-time, needs to run on more than 3,000 CPUs, and more than 50% of the required compute cost is related to the simulation of chemical interactions between these species.

Figure 1: Simulation of atmospheric chemistry in the NASA GEOS composition forecast model GEOS-CF: 56 million grid cells (25×25 km², 72 levels) and 250 chemical species.

We were able to accelerate the simulation of atmospheric chemistry in the NASA GEOS model with GEOS-Chem chemistry more than 10-fold by replacing the model's default numerical chemical solver with XGBoost emulators. To train these gradient-boosted decision tree models, we produced a dataset using hourly output from the original GEOS model with GEOS-Chem chemistry. The input dataset contains 126 key physical and chemical parameters such as air pollution concentrations, temperature, humidity, and sun intensity. Based on these inputs, the XGBoost model is trained to predict the chemical formation (or destruction) of an air pollutant under the given atmospheric conditions. Separate emulators are trained for individual chemicals.

To make sure that the emulators are accurate for the wide range of atmospheric conditions found in the real world, the training data needs to capture all geographic locations and annual seasons. This results in very large training datasets, quickly spanning hundreds of millions of data points, which makes training slow. Using RAPIDS Dask-cuDF (GPU-accelerated dataframes) and training XGBoost on an NVIDIA DGX-1 with 8 V100 GPUs, we were able to achieve a 50x overall speedup compared to dual 20-core Intel Xeon E5-2698 CPUs on the same node.
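A rough sketch of that setup (the cluster options and file pattern are illustrative, not taken from the NASA workflow):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# Start one Dask worker per visible GPU, then read the training data
# into a distributed GPU DataFrame.
cluster = LocalCUDACluster()
client = Client(cluster)
ddf = dask_cudf.read_csv("training_data_*.csv")  # hypothetical file pattern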

An example of this is given in the gc-xgb repo sample code, showcasing the creation of an emulator for the chemical compound ozone (O3), a key air pollutant and climate gas. For demonstration purposes, a comparatively small training data set spanning 466,830 samples is used. Each sample contains up to 126 non-zero features, and the full size of the training data contains 58,038,743 entries. In the provided example, the training data – along with the corresponding labels – is loaded from a pre-generated txt file in svmlight / libsvm format, available in the GMAO code repo:
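A minimal sketch of that loading step (the file name is hypothetical, standing in for the actual GMAO file):

from sklearn.datasets import load_svmlight_file
import xgboost as xgb

# Load the features (as a sparse matrix) and labels from the svmlight
# text file, then wrap them in a DMatrix for XGBoost.
X, y = load_svmlight_file("o3_training_data.svm")  # hypothetical file name
dtrain = xgb.DMatrix(X, label=y)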

Loading the training data from a pre-generated text file, as shown in the example here, sidesteps the data preparation process whereby the 4-dimensional model data (latitude × longitude × altitude × time) generated by the GEOS model (in netCDF format) is read, subsampled, and flattened.

The loaded training data can directly be used to train an XGBoost model:
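A sketch of that training call (the hyperparameter values are illustrative, not the GMAO settings):

params = {
    "objective": "reg:squarederror",
    "tree_method": "gpu_hist",  # use "hist" to fall back to CPU training
    "max_depth": 8,             # illustrative value
    "learning_rate": 0.1,       # illustrative value
}
model = xgb.train(params, dtrain, num_boost_round=500)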

Setting the tree_method to ‘gpu_hist’ instead of ‘hist’ performs the training on GPUs instead of CPUs, yielding a significant speed-up in training time even for the comparatively small sample training data used in this example. The difference is even more pronounced on the much larger datasets needed to develop emulators suitable for actual use in the GEOS model. Since our application requires training dozens of ML emulators – ideally on a recurring basis as new model data is produced – the much shorter training time on RAPIDS is critical and ensures a short enough model development cycle.

As shown in the figure below, the chemical tendencies of ozone (i.e., the change in ozone concentration due to atmospheric chemistry) predicted by the gradient-boosted decision tree model show good agreement with the true chemical tendencies simulated by the numerical model. Given the relatively small training sample size (466,830 samples), the model trained here shows some signs of overfitting, with the correlation coefficient R2 dropping from 0.95 on the training data to 0.88 on the validation data, and the normalized root-mean-square error (NRMSE) increasing from 22% to 35%. This indicates that larger training samples are needed to ensure that the training dataset captures all chemical environments.

Figure 2: Ozone tendencies predicted by the XGBoost model (y-axis) vs. the true values simulated by the numerical model (x-axis), for the training data (left) and the validation data (right).

In order to deploy the XGBoost emulator in the GEOS model as a replacement for the GEOS-Chem chemical solver, the XGBoost algorithm needs to be called from within the GEOS modeling system, which is written in Fortran. To do so, the trained XGBoost model is saved to disk so that it can be read (and invoked) from the Fortran model by leveraging XGBoost's C API (the XGBoost interface for Fortran can be found in the fortran2xgb GitHub repo).
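The save step itself is a single call (the file name is illustrative):

# Persist the trained booster to disk so the Fortran side can load it
# through XGBoost's C API.
model.save_model("trained_o3_emulator.model")  # hypothetical file name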

As shown in the figure below, running the GEOS model with atmospheric chemistry emulated by XGBoost produces surface ozone concentrations that are similar to the numerical solution (red vs. black line). The blue line shows a simulation using a model with no chemistry, highlighting the critical role of atmospheric chemistry for surface ozone.

GEOS model simulations using XGBoost emulators instead of the GEOS-Chem chemical solver have the potential to be 20-50% faster than the reference simulation, depending on the model configuration (such as horizontal and temporal resolution). By offering a much faster calculation of atmospheric chemistry, these ML emulators open the door for a range of new applications, such as probabilistic air quality forecasts or a better combination of atmospheric observations and model simulations. Further improvements to the ML emulators can be achieved through mass balance considerations and by accounting for error correlations, tasks that Christoph and colleagues are currently working on.

Figure 3: Surface concentrations of O3 at four locations for the GEOS-Chem reference (black), the XGBoost model (red), and a simulation with no chemistry (blue). The XGBoost model reproduces the concentration patterns in these regions well.

In the next blog, we’ll talk about another application leveraging XGBoost and RAPIDS for live monitoring of air quality across the globe during the COVID-19 pandemic.

References:

Keller, C. A., Clune, T. L., Thompson, M. A., Stroud, M. A., Evans, M. J., and Ronaghi,  Z.: Accelerated Simulation of Air Pollution Using NVIDIA RAPIDS, GPU Technology Conference, https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20190033152.pdf, 2019.

Keller, C. A. and Evans, M. J.: Application of random forest regression to the calculation of gas-phase chemistry within the GEOS-Chem chemistry model v10, Geosci. Model Dev., 12, 1209–1225, https://doi.org/10.5194/gmd-12-1209-2019, 2019.


Sequence-example static shape in ‘map’ of tf.data.Dataset

My dataset uses tf.train.SequenceExample, which contains a sequence of N elements, where N by definition can vary from one sequence to another. I want to select M elements (M is fixed for all sequences) uniformly from the N elements. For example, if the sequence has N=10 elements, then for M=2 I want to select the index=0 and index=5 elements. M will always be smaller than any N in the dataset.

Now the issue is, when the dataset iterator calls the parser function through the ‘map’ method, it is executed in graph mode, and the axis dimension corresponding to N is None. So I can't iterate on that axis to find the value of N.

I resolved this issue by using tf.py_function, but it is 10x slower. I tried using tf.data.AUTOTUNE in num_parallel_calls and in prefetch, and also set deterministic=False, but performance is still 10x slower.

What is another possible solution for this?
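One possible graph-mode-friendly approach (a sketch, assuming the parser receives the sequence as a tensor seq): tf.shape returns the dynamic length of an axis even when its static dimension is None, so the indices can be computed with plain TensorFlow ops inside map:

import tensorflow as tf

M = 2  # fixed number of elements to select

def select_uniform(seq):
    # tf.shape gives the runtime value of N even in graph mode.
    n = tf.shape(seq)[0]
    # Uniformly spaced indices: for N=10, M=2 this yields [0, 5].
    idx = tf.cast(tf.linspace(0.0, tf.cast(n, tf.float32), M + 1)[:-1], tf.int32)
    return tf.gather(seq, idx)

# dataset = dataset.map(select_uniform)  # no tf.py_function needed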

submitted by /u/learnml


Model was constructed with shape (None, 1061, 4) for input … but it was called on an input with incompatible shape (None, 4).

EDIT: SOLVED. Thank you all so much!

I’m building a neural network where my inputs are 2d arrays, each representing one day of data.

I have a container array that holds 7 days’ arrays, each of which has 1,061 4×1 arrays. That sounds very confusing to me, so here’s a diagram:

container array [
  matrix 1 [
    vector 1 [a, b, c, d]
    ...
    vector 1061 [e, f, g, h]
  ]
  ...
  matrix 7 [
    vector 1 [i, j, k, l]
    ...
    vector 1061 [m, n, o, p]
  ]
]

In other words, the container’s shape is (7, 1061, 4).

That container array is what I pass to the fit method for “x”. And here’s how I construct the network:

input_shape = (1061, 4)

network = Sequential()
network.add(Input(shape=input_shape))
network.add(Dense(2**6, activation="relu"))
network.add(Dense(2**3, activation="relu"))
network.add(Dense(2, activation="linear"))
network.compile(
    loss="mean_squared_error",
    optimizer="adam",
)

The network compiles and trains, but I get the following warning while training:

WARNING:tensorflow:Model was constructed with shape (None, 1061, 4) for input KerasTensor(type_spec=TensorSpec(shape=(None, 1061, 4), dtype=tf.float32, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (None, 4).

I double-checked my inputs, and indeed there are 7 arrays of shape (1061, 4). What am I doing wrong here?

Thank you in advance for the help!

submitted by /u/bens_scraper


Siege the Day as Stronghold Series Headlines GFN Thursday

It’s Thursday, which means it’s GFN Thursday — when GeForce NOW members can learn what new games and updates are streaming from the cloud. This GFN Thursday, we’re checking in on one of our favorite gaming franchises, the Stronghold series from Firefly Studios. We’re also sharing some sales Firefly is running on the Stronghold franchise.



NVIDIA’s Shalini De Mello Talks Self-Supervised AI, NeurIPS Successes

Shalini De Mello, a principal research scientist at NVIDIA who’s made her mark inventing computer vision technology that contributes to driver safety, finished 2020 with a bang — presenting two posters at the prestigious NeurIPS conference in December. A 10-year NVIDIA veteran, De Mello works on self-supervised and few-shot learning, 3D reconstruction, viewpoint estimation, and more.



Announcing the 2021 Research Scholar Program Recipients

In March 2020 we introduced the Research Scholar Program, an effort focused on developing collaborations with new professors and encouraging the formation of long-term relationships with the academic community. In November we opened the inaugural call for proposals for this program, which was received with enthusiastic interest from faculty who are working on cutting-edge research across many areas of computer science, including machine learning, human-computer interaction, health research, systems, and more.

Today, we are pleased to announce that in this first year of the program we have granted 77 awards, which include 86 principal investigators representing 15+ countries and over 50 universities. Of the 86 award recipients, 43% identify as a member of a historically marginalized group within technology. Please see the full list of 2021 recipients on our web page, as well as in the list below.

We offer our congratulations to this year’s recipients, and look forward to seeing what they achieve!

Algorithms and Optimization
Alexandros Psomas, Purdue University
Auction Theory Beyond Independent, Quasi-Linear Bidders
Julian Shun, Massachusetts Institute of Technology
Scalable Parallel Subgraph Finding and Peeling Algorithms
Mary Wootters, Stanford University
The Role of Redundancy in Algorithm Design
Pravesh K. Kothari, Carnegie Mellon University
Efficient Algorithms for Robust Machine Learning
Sepehr Assadi, Rutgers University
Graph Clustering at Scale via Improved Massively Parallel Algorithms

Augmented Reality and Virtual Reality
Srinath Sridhar, Brown University
Perception and Generation of Interactive Objects

Geo
Miriam E. Marlier, University of California, Los Angeles
Mapping California’s Compound Climate Hazards in Google Earth Engine
Suining He, The University of Connecticut
Fairness-Aware and Cross-Modality Traffic Learning and Predictive Modeling for Urban Smart Mobility Systems

Human Computer Interaction
Arvind Satyanarayan, Massachusetts Institute of Technology
Generating Semantically Rich Natural Language Captions for Data Visualizations to Promote Accessibility
Dina EL-Zanfaly, Carnegie Mellon University
In-the-making: An intelligence mediated collaboration system for creative practices
Katharina Reinecke, University of Washington
Providing Science-Backed Answers to Health-related Questions in Google Search
Misha Sra, University of California, Santa Barbara
Hands-free Game Controller for Quadriplegic Individuals
Mohsen Mosleh, University of Exeter Business School
Effective Strategies to Debunk False Claims on Social Media: A large-scale digital field experiments approach
Tanushree Mitra, University of Washington
Supporting Scalable Value-Sensitive Fact-Checking through Human-AI Intelligence

Health Research
Catarina Barata, Instituto Superior Técnico, Universidade de Lisboa
DeepMutation – A CNN Model To Predict Genetic Mutations In Melanoma Patients
Emma Pierson, Cornell Tech, the Jacobs Institute, Technion-Israel Institute of Technology, and Cornell University
Using cell phone mobility data to reduce inequality and improve public health
Jasmine Jones, Berea College
Reachout: Co-Designing Social Connection Technologies for Isolated Young Adults
Mojtaba Golzan, University of Technology Sydney, Jack Phu, University of New South Wales
Autonomous Grading of Dynamic Blood Vessel Markers in the Eye using Deep Learning
Serena Yeung, Stanford University
Artificial Intelligence Analysis of Surgical Technique in the Operating Room

Machine Learning and Data Mining
Aravindan Vijayaraghavan, Northwestern University, Sivaraman Balakrishnan, Carnegie Mellon University
Principled Approaches for Learning with Test-time Robustness
Cho-Jui Hsieh, University of California, Los Angeles
Scalability and Tunability for Neural Network Optimizers
Golnoosh Farnadi, University of Montreal, HEC Montreal/MILA
Addressing Algorithmic Fairness in Decision-focused Deep Learning
Harrie Oosterhuis, Radboud University
Search and Recommendation Systems that Learn from Diverse User Preferences
Jimmy Ba, University of Toronto
Model-based Reinforcement Learning with Causal World Models
Nadav Cohen, Tel-Aviv University
A Dynamical Theory of Deep Learning
Nihar Shah, Carnegie Mellon University
Addressing Unfairness in Distributed Human Decisions
Nima Fazeli, University of Michigan
Semi-Implicit Methods for Deformable Object Manipulation
Qingyao Ai, University of Utah
Metric-agnostic Ranking Optimization
Stefanie Jegelka, Massachusetts Institute of Technology
Generalization of Graph Neural Networks under Distribution Shifts
Virginia Smith, Carnegie Mellon University
A Multi-Task Approach for Trustworthy Federated Learning

Mobile
Aruna Balasubramanian, State University of New York – Stony Brook
AccessWear: Ubiquitous Accessibility using Wearables
Tingjun Chen, Duke University
Machine Learning- and Optical-enabled Mobile Millimeter-Wave Networks

Machine Perception
Amir Patel, University of Cape Town
WildPose: 3D Animal Biomechanics in the Field using Multi-Sensor Data Fusion
Angjoo Kanazawa, University of California, Berkeley
Practical Volumetric Capture of People and Scenes
Emanuele Rodolà, Sapienza University of Rome
Fair Geometry: Toward Algorithmic Debiasing in Geometric Deep Learning
Minchen Wei, The Hong Kong Polytechnic University
Accurate Capture of Perceived Object Colors for Smart Phone Cameras
Mohsen Ali, Information Technology University of the Punjab, Pakistan, Izza Aftab, Information Technology University of the Punjab, Pakistan
Is Economics From Afar Domain Generalizable?
Vineeth N Balasubramanian, Indian Institute of Technology Hyderabad
Bridging Perspectives of Explainability and Adversarial Robustness
Xin Yu, University of Technology Sydney, Linchao Zhu, University of Technology Sydney
Sign Language Translation in the Wild

Networking
Aurojit Panda, New York University
Bertha: Network APIs for the Programmable Network Era
Cristina Klippel Dominicini, Instituto Federal do Espirito Santo
Polynomial Key-based Architecture for Source Routing in Network Fabrics
Noa Zilberman, University of Oxford
Exposing Vulnerabilities in Programmable Network Devices
Rachit Agarwal, Cornell University
Designing Datacenter Transport for Terabit Ethernet

Natural Language Processing
Danqi Chen, Princeton University
Improving Training and Inference Efficiency of NLP Models
Derry Tanti Wijaya, Boston University, Anietie Andy, University of Pennsylvania
Exploring the evolution of racial biases over time through framing analysis
Eunsol Choi, University of Texas at Austin
Answering Information Seeking Questions In The Wild
Kai-Wei Chang, University of California, Los Angeles
Certified Robustness against Language Differences in Cross-Lingual Transfer
Mohohlo Samuel Tsoeu, University of Cape Town
Corpora collection and complete natural language processing of isiXhosa, Sesotho and South African Sign languages
Natalia Diaz Rodriguez, University of Granada (Spain) + ENSTA, Institut Polytechnique Paris, Inria. Lorenzo Baraldi, University of Modena and Reggio Emilia
SignNet: Towards democratizing content accessibility for the deaf by aligning multi-modal sign representations

Other Research Areas
John Dickerson, University of Maryland – College Park, Nicholas Mattei, Tulane University
Fairness and Diversity in Graduate Admissions
Mor Nitzan, Hebrew University
Learning representations of tissue design principles from single-cell data
Nikolai Matni, University of Pennsylvania
Robust Learning for Safe Control

Privacy
Foteini Baldimtsi, George Mason University
Improved Single-Use Anonymous Credentials with Private Metabit
Yu-Xiang Wang, University of California, Santa Barbara
Stronger, Better and More Accessible Differential Privacy with autodp

Quantum Computing
Ashok Ajoy, University of California, Berkeley
Accelerating NMR spectroscopy with a Quantum Computer
John Nichol, University of Rochester
Coherent spin-photon coupling
Jordi Tura i Brugués, Leiden University
RAGECLIQ – Randomness Generation with Certification via Limited Quantum Devices
Nathan Wiebe, University of Toronto
New Frameworks for Quantum Simulation and Machine Learning
Philipp Hauke, University of Trento
ProGauge: Protecting Gauge Symmetry in Quantum Hardware
Shruti Puri, Yale University
Surface Code Co-Design for Practical Fault-Tolerant Quantum Computing

Structured Data, Extraction, Semantic Graph, and Database Management
Abolfazl Asudeh, University Of Illinois, Chicago
An end-to-end system for detecting cherry-picked trendlines
Eugene Wu, Columbia University
Interactive training data debugging for ML analytics
Jingbo Shang, University of California, San Diego
Structuring Massive Text Corpora via Extremely Weak Supervision

Security
Chitchanok Chuengsatiansup, The University of Adelaide, Markus Wagner, The University of Adelaide
Automatic Post-Quantum Cryptographic Code Generation and Optimization
Elette Boyle, IDC Herzliya, Israel
Cheaper Private Set Intersection via Advances in “Silent OT”
Joseph Bonneau, New York University
Zeroizing keys in secure messaging implementations
Yu Feng , University of California, Santa Barbara, Yuan Tian, University of Virginia
Exploit Generation Using Reinforcement Learning

Software Engineering and Programming Languages
Kelly Blincoe, University of Auckland
Towards more inclusive software engineering practices to retain women in software engineering
Fredrik Kjolstad, Stanford University
Sparse Tensor Algebra Compilation to Domain-Specific Architectures
Milos Gligoric, University of Texas at Austin
Adaptive Regression Test Selection
Sarah E. Chasins, University of California, Berkeley
If you break it, you fix it: Synthesizing program transformations so that library maintainers can make breaking changes

Systems
Adwait Jog, College of William & Mary
Enabling Efficient Sharing of Emerging GPUs
Heiner Litz, University of California, Santa Cruz
Software Prefetching Irregular Memory Access Patterns
Malte Schwarzkopf, Brown University
Privacy-Compliant Web Services by Construction
Mehdi Saligane, University of Michigan
Autonomous generation of Open Source Analog & Mixed Signal IC
Nathan Beckmann, Carnegie Mellon University
Making Data Access Faster and Cheaper with Smarter Flash Caches
Yanjing Li, University of Chicago
Resilient Accelerators for Deep Learning Training Tasks


Elevate Game Content Creation and Collaboration with NVIDIA Omniverse

With everyone shifting to a remote work environment, game development and professional visualization teams worldwide need a solution for real-time collaboration and more efficient workflows.

To boost creativity and innovation, developers need access to powerful technology accelerated by GPUs and easy access to secure datasets, no matter where they’re working from. And as many developers concurrently work on a project, they need to be able to manage version control of a dataset to ensure everyone is working on the latest assets.

NVIDIA Omniverse addresses these challenges. It is an open, multi-GPU enabled platform that makes it easy to accelerate development workflows and collaborate in real time. 

The primary goal of Omniverse is to support universal interoperability across various applications and 3D ecosystems. Using Pixar’s Universal Scene Description and NVIDIA RTX technology, Omniverse allows people to easily work with leading 3D applications and collaborate simultaneously with colleagues and customers, wherever they may be.

USD is the foundation for Omniverse. The open-source 3D scene description is easily extensible and was originally developed to simplify content creation and facilitate frictionless interchange of assets between disparate software tools.

The Omniverse platform comprises multiple components designed to help developers connect 3D applications and transform workflows:

  • Move assets throughout your pipeline seamlessly. Omniverse Connect opens the portals that allow content creation tools to connect to the Omniverse platform. With Omniverse Connect, users can work in their favorite industry software applications. 
  • Manage and store assets. Omniverse Nucleus allows users to store, share, and collaborate on project data and provides the unique ability to collaborate live across multiple applications. Nucleus works on a local machine, on-premises, or in the cloud. 
  • Quickly build tools. Omniverse Kit is a powerful toolkit for developers to create new Omniverse apps and extensions.
  • Access new technology. Omniverse Simulation includes PhysX 5, Flow, and Blast, plus NVIDIA AI SDKs and apps such as Omniverse Audio2Face, Omniverse Deep Search, and many more. 

Learn more about NVIDIA Omniverse, which is currently in open beta. 

In addition to Omniverse, there are several other SDKs that enable developers to create richer, more lifelike content.

NVIDIA OptiX Ray Tracing Engine

OptiX provides a programmable, GPU-accelerated ray-tracing pipeline that is scalable across multiple NVIDIA GPU architectures. Developers can easily use this framework with other existing NVIDIA tools, and OptiX has already been successfully deployed in a broad range of commercial applications.

NanoVDB

NanoVDB accelerates OpenVDB applications; OpenVDB is the industry standard for motion picture visual effects. NanoVDB is fully optimized for high performance and quality in real time on NVIDIA GPUs and is completely compatible with OpenVDB structures, which allows for efficient creation and visualization.

Texture Tools Exporter

The Texture Tools Exporter enables both the creation of highly compressed texture files, which save memory in applications, and the processing of complex, high-quality images. It supports all modern compression algorithms, making it a seamless and versatile tool for developers.

At GTC, starting on Monday, April 12, there will be over 70 technical sessions that dive into NVIDIA Omniverse. Register for free and experience its impact on the future of game development.

And don’t forget to check out all the game development sessions at GTC.


Speedy Model Training With RAPIDS + Determined AI

Model developers no longer face a steep learning curve to accelerate model training. By utilizing two open-source software projects, Determined AI’s Deep Learning Training Platform and the RAPIDS accelerated data science toolkit, they can easily achieve up to 10x speedups in data preprocessing and train models at scale. 

Making GPUs accessible

As the field of deep learning advances, practitioners are increasingly expected to make a significant investment in GPUs, either on-prem or from the cloud. Hardware is only half the story behind the proliferation of AI, though. NVIDIA’s success in powering data science has as much to do with software as hardware: widespread GPU adoption would be very difficult without convenient software abstractions that make GPUs easy for model developers to use. RAPIDS is a software suite that bridges the gap from CUDA primitives to data-hungry analytics and machine learning use cases.

Similarly, Determined AI’s deep learning training platform frees the model developer from hassles: operational hassles they are guaranteed to hit in a cluster setting, and model development hassles as they move from toy prototype to scale.  On the operational side, the platform handles distributed systems concerns like training job orchestration, storage layer integration, centralized logging, and automatic fault tolerance for long-running jobs.  On the model development side, machine learning engineers only need to maintain one version of code from the model prototype phase to more advanced tasks like multi-GPU (and multi-node) distributed training and hyperparameter tuning.  Further, the platform handles the boilerplate engineering required to track workload dependencies, metrics, and checkpoints.

At their core, both Determined AI and RAPIDS make the GPU accessible to machine learning engineers via intuitive APIs: Determined as the platform for accelerating and tracking deep learning training workflows, and RAPIDS as the suite of libraries speeding up parts of those training workflows.

For the remainder of this post, we’ll examine a model development process in which RAPIDS accelerates training data set construction within a Determined cluster, at which point Determined handles scaled out, fault-tolerant model training and hyperparameter tuning.

The RAPIDS experience will look familiar to ML engineers who are accustomed to tackling data manipulation with pandas or NumPy, and model training with PyTorch or TensorFlow. (RAPIDS is not alone in offering familiar interfaces atop GPU acceleration: CuPy is NumPy-compatible, and OpenCV’s GPU module API interface is “kept similar with the CPU interface where possible.”)

Getting started

To use Determined and RAPIDS to accelerate model training, there are a few requirements to meet upfront. On the RAPIDS side, OS and CUDA version requirements are listed here.  One is worth calling out explicitly: RAPIDS requires NVIDIA P100 or later generation GPUs, ruling out the NVIDIA K80 in AWS P2 instances.

After satisfying these prerequisites, making RAPIDS available to tasks running on Determined is simple. Because Determined supports custom Docker images for running training workloads, we can create an image that contains the appropriate version of RAPIDS installed via conda. This is as simple as specifying the RAPIDS dependency in a Conda environment file:

name: Rapids
channels:
 - rapidsai
 - nvidia
 - conda-forge
dependencies:
 - rapids=0.14

And updating the base Conda environment in your custom image Dockerfile:

FROM determinedai/environments:cuda-10.0-pytorch-1.4-tf-1.15-gpu-0.7.0 as base
COPY environment.yml /tmp/
RUN conda --version && \
    conda env update --name base --file /tmp/environment.yml && \
    conda clean --all --force-pkgs-dirs --yes
RUN eval "$(conda shell.bash hook)" && conda activate base

After building and pushing this image to a Docker repository, you can run experiments, notebooks, or shell sessions by configuring the environment image that these tasks should use.
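For instance, an experiment configuration can point at the custom image like this (a sketch; the image tag is illustrative):

environment:
  image: my-registry/determined-rapids:0.14  # hypothetical image tag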

The model

To showcase the potency of integrating RAPIDS and Determined, we picked a tabular learning task that would typically benefit from nontrivial data preprocessing, based on the TabNet architecture and the pytorch-tabnet library implementing it. TabNet brings the power of deep learning to tabular data-driven use cases and offers some nice interpretability properties to boot.  One benchmark explored in the TabNet paper is the Rossman store sales prediction task of building a model to predict revenue across thousands of stores based on tabular data describing the stores, promotions, and nearby competitors.  Since Rossman dataset access requires signing off on an agreement, we train our model on generated data of a similar schema and scale so that users can more easily run this example.  All assets for this experiment are available on GitHub.

Data prep with RAPIDS, training with Determined

With multiple CSVs to ingest and denormalize, the Rossman revenue prediction task is ripe for RAPIDS.  The high-level flow to develop a revenue prediction model looks like this:

  • Read location and historical sales CSVs into cuDF DataFrames residing in GPU memory.
  • Join these data sets into a denormalized DataFrame. This GPU-accelerated join is handled by cuDF.
  • Construct a PyTorch Dataset from the denormalized DataFrame.
  • Train with Determined!

RAPIDS cuDF’s familiar pandas-esque interface makes data ingest and manipulation a breeze:

# Read the store metadata and the historical sales data directly into
# GPU memory, then denormalize with GPU-accelerated left joins.
df_store = cudf.read_csv(STORE_CSV)
df_train = cudf.read_csv(TRAIN_CSV).join(df_store,
                                         how='left',
                                         on='store_id',
                                         rsuffix='store')
df_valid = cudf.read_csv(VAL_CSV).join(df_store,
                                       how='left',
                                       on='store_id',
                                       rsuffix='store')

We then use CuPy to get from a cuDF DataFrame to a PyTorch Dataset and DataLoader to expose via Determined’s Trial interface.
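A sketch of that hand-off (the label column name is hypothetical, and all feature columns are assumed to be numeric):

import cupy as cp
import torch
from torch.utils.data import TensorDataset, DataLoader

# cuDF columns expose their data as CuPy arrays, so the features and
# labels can stay on the GPU end to end.
features = cp.asarray(df_train.drop(columns=["sales"]).values)  # "sales" is a hypothetical label column
labels = cp.asarray(df_train["sales"].values)

# torch.as_tensor understands CuPy arrays through __cuda_array_interface__,
# avoiding a round-trip through host memory.
dataset = TensorDataset(torch.as_tensor(features, device="cuda"),
                        torch.as_tensor(labels, device="cuda"))
loader = DataLoader(dataset, batch_size=1024)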

Given that RAPIDS cuDF is a drop-in replacement for pandas, it’s trivial to toggle between the two libraries and compare the performance of their analogous APIs. In this simplified case, cuDF showed a 10x speedup over pandas: the ingest-and-join step took only 6 seconds on a single NVIDIA V100 GPU, versus a minute on the vCPU.
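The toggle itself can be as small as an import swap (a sketch, not from the original post):

import pandas as xdf   # CPU path
# import cudf as xdf   # GPU path: identical calls, executed on the GPU

df_store = xdf.read_csv(STORE_CSV)  # the ingest-and-join code runs unchanged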

  (Another option for getting from a cuDF DataFrame to PyTorch is DLPack, an intermediate format that both cuDF and PyTorch support, either directly or via NVTabular’s Torch DataLoader, which does the same under the covers.)

Figure 1: Faster enterprise AI with RAPIDS and Determined.

On an absolute scale, this might not seem like a big deal: whether the overall training job takes 20 or 21 minutes doesn’t seem to matter much. However, given the iterative nature of deep learning model tuning, the time and cost savings quickly add up. For a hyperparameter tuning experiment training hundreds or thousands of models, on data larger than a couple of GB, and perhaps with more complex data transformations, savings on the order of GPU-minutes per trained model can translate to savings on the order of GPU-days or weeks at scale, netting your organization hundreds or thousands of dollars in infrastructure cost.

Determined and the broader RAPIDS toolkit

The RAPIDS library suite goes far beyond the data frame manipulation we leveraged in this example.  To name a couple:

  • RAPIDS cuML offers GPU-accelerated ML algorithms mirroring sklearn.
  • NVTabular, which sits atop RAPIDS, offers high-level abstractions for feature engineering and building recommenders.

If you’re using these libraries, you’ll soon be able to train on a Determined cluster and get the platform’s resource management, experiment tracking, and hyperparameter tuning capabilities. We’ve heard from our users that the need for these tools isn’t limited to deep learning, so we are pushing into the broader ML space and making Determined the platform not only for PyTorch and TensorFlow model development, but for any Python-based model development. Stay tuned; in the meantime, you can learn more about this development from our 2021 roadmap discussion during our most recent community meetup.

Get started

If you’d like to learn more about (and test drive!) RAPIDS and Determined, check out the RAPIDS quick start and Determined’s quick start documentation. We’d love to hear your feedback on the RAPIDS and Determined community Slack channels. Happy training!


NVIDIA Deepens Commitment to Streamlining Recommender Workflows with GTC Spring Sessions

Ensuring recommenders are meaningful, personalized, and relevant to a single customer is not easy. Scaling a personalized recommender experience to hundreds of thousands, or millions, of customers comes with unique challenges that data scientists and machine learning engineers tackle every day. Scaling challenges often create obstacles to effective ETL, training, retraining, or deploying models into production.

To tackle these challenges, machine learning engineers and data scientists within the industry utilize a combination, or hybrid, of tools, techniques, and algorithms. NVIDIA is committed to helping streamline recommender workflows with Merlin, an open-source framework that is interoperable and designed to support machine learning engineers and data scientists with preprocessing, feature engineering, training, and inference. Merlin supports industry leaders who are tackling common recommender challenges as they provide relevant, impactful, and fresh recommendations at scale.

Here are a few key sessions from industry leaders in media, delivery-on-demand, and retail at GTC Spring 2021:

  • AI-First Social Media Feeds: A View From the Trenches
    Discusses efficient training of extremely large recommender models with billions of parameters distributed across multiple GPUs and workers as well as the importance of continual model updates in near-real time to deal with the key challenge of concept drift. The session includes how solutions in NVIDIA’s Merlin stack resolve key bottlenecks faced in general purpose deep learning frameworks.

Registration is free. Visit the GTC website for more information.