Categories
Misc

layers.Conv2D( ) -> Running this function kills the Python Kernel

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import cifar10

tf.__version__  # outputs -> '2.8.0'

# Normalizing the data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Defining the model using the Functional API
def my_model():
    inputs = keras.Input(shape=(32, 32, 3))
    x = layers.Conv2D(32, 3)(inputs)
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 5, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.Conv2D(128, 3)(x)
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation='relu')(x)
    outputs = layers.Dense(10)(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

# Building the model
model = my_model()

# Compiling the model
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=3e-4),
    metrics=['accuracy'],
)

# Running the model
model.fit(x_train, y_train, batch_size=64, epochs=10, verbose=2)

# Testing the model
model.evaluate(x_test, y_test, batch_size=1, verbose=2)

I tried the above code in both Jupyter Notebook and VS Code.

In both cases, it kills the Python kernel. Below is the error message screenshot from VS Code.

Error message from VS Code

When I run a simple MLP, and even a deep MLP, on the MNIST digit dataset, it works fine, even with more than 10 million parameters. So I am guessing it's definitely not the VRAM, because the CNN model above has only ~400K parameters according to model.summary().

This problem occurs only when I use the Conv2D function.

Using TensorFlow 2.8.0

CPU: i7-7920HQ

GPU: Quadro P4000 8 GB

RAM: 64 GB

OS: Windows 11 Pro

Using the latest NVIDIA driver, CUDA Toolkit, and cuDNN.

Installed TensorFlow using the tutorial below.

https://github.com/jeffheaton/t81_558_deep_learning/blob/master/install/manual_setup2.ipynb

(found via a YouTube video) https://www.youtube.com/watch?v=OEFKlRSd8Ic

I would appreciate it if someone could help me fix this issue.
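
One quick way to narrow this down (a minimal diagnostic sketch, not a confirmed fix for this particular setup) is to check whether the crash follows the GPU: enabling memory growth, or hiding the GPU entirely, usually separates a CUDA/cuDNN installation problem from a problem in the Keras code itself.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'  # show all TensorFlow and cuDNN log output

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)

# Option 1: let the GPU allocate memory incrementally instead of grabbing
# all of it up front (a commonly suggested workaround for cuDNN
# initialization failures triggered by the first Conv2D call).
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Option 2: hide the GPU entirely; if training then works on the CPU,
# the CUDA Toolkit / cuDNN installation is the first thing to re-check.
# tf.config.set_visible_devices([], 'GPU')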

submitted by /u/Hello_World_2K22

Categories
Offsites

Unlocking the Full Potential of Datacenter ML Accelerators with Platform-Aware Neural Architecture Search

Continuing advances in the design and implementation of datacenter (DC) accelerators for machine learning (ML), such as TPUs and GPUs, have been critical for powering modern ML models and applications at scale. These improved accelerators exhibit peak performance (e.g., FLOPs) that is orders of magnitude better than traditional computing systems. However, there is a fast-widening gap between the available peak performance offered by state-of-the-art hardware and the actual achieved performance when ML models run on that hardware.

One approach to address this gap is to design hardware-specific ML models that optimize both performance (e.g., throughput and latency) and model quality. Recent applications of neural architecture search (NAS), an emerging paradigm to automate the design of ML model architectures, have employed a platform-aware multi-objective approach that includes a hardware performance objective. While this approach has yielded improved model performance in practice, the details of the underlying hardware architecture are opaque to the model. As a result, there is untapped potential to build full capability hardware-friendly ML model architectures, with hardware-specific optimizations, for powerful DC ML accelerators.

In “Searching for Fast Model Families on Datacenter Accelerators”, published at CVPR 2021, we advanced the state of the art of hardware-aware NAS by automatically adapting model architectures to the hardware on which they will be executed. The approach we propose finds optimized families of models for which additional hardware performance gains cannot be achieved without loss in model quality (called Pareto optimization). To accomplish this, we infuse a deep understanding of hardware architecture into the design of the NAS search space for discovery of both single models and model families. We provide quantitative analysis of the performance gap between hardware and traditional model architectures and demonstrate the advantages of using true hardware performance (i.e., throughput and latency), instead of the performance proxy (FLOPs), as the performance optimization objective. Leveraging this advanced hardware-aware NAS and building upon the EfficientNet architecture, we developed a family of models, called EfficientNetX, that demonstrate the effectiveness of this approach for Pareto-optimized ML models on TPUs and GPUs.

Platform-Aware NAS for DC ML Accelerators
To achieve high performance, ML models need to adapt to modern ML accelerators. Platform-aware NAS integrates knowledge of the hardware accelerator properties into all three pillars of NAS: (i) the search objectives; (ii) the search space; and (iii) the search algorithm (shown below). We focus on the new search space because it contains the building blocks needed to compose the models and is the key link between the ML model architectures and accelerator hardware architectures.

We construct TPU/GPU specialized search spaces with TPU/GPU-friendly operations to infuse hardware awareness into NAS. For example, a key adaptation is maximizing parallelism to ensure different hardware components inside the accelerators work together as efficiently as possible. This includes the matrix multiplication units (MXUs) in TPUs and the TensorCore in GPUs for matrix/tensor computation, as well as the vector processing units (VPUs) in TPUs and CUDA cores in GPUs for vector processing. Maximizing model arithmetic intensity (i.e., optimizing the parallelism between computation and operations on the high bandwidth memory) is also critical to achieve top performance. To tap into the full potential of the hardware, it is crucial for ML models to achieve high parallelism inside and across these hardware components.
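
As a rough illustration of what arithmetic intensity means here, the back-of-envelope sketch below estimates FLOPs per byte of memory traffic for a single convolution. It assumes float32 storage and no on-chip data reuse, so it is only a first-order approximation, not how TPUs or GPUs are actually modeled in the search.

def conv2d_arithmetic_intensity(h, w, c_in, c_out, k, bytes_per_elem=4):
    # FLOPs for a stride-1, 'same'-padded convolution: 2 FLOPs per multiply-accumulate
    flops = 2 * h * w * c_out * (k * k * c_in)
    # Naive off-chip traffic: read the input and weights, write the output (float32)
    traffic = bytes_per_elem * (h * w * c_in + k * k * c_in * c_out + h * w * c_out)
    return flops / traffic

# Example: a 56x56x64 feature map passed through a 3x3, 64-channel convolution
print(conv2d_arithmetic_intensity(56, 56, 64, 64, 3))  # roughly 130 FLOPs per byte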

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.

Advanced platform-aware NAS has an optimized search space containing a set of complementary techniques to holistically improve parallelism for ML model execution on TPUs and GPUs:

  1. It uses specialized tensor reshaping techniques to maximize the parallelism in the MXUs / TensorCores.
  2. It dynamically selects different activation functions depending on matrix operation types to ensure overlapping of vector and matrix/tensor processing.
  3. It employs hybrid convolutions and a novel fusion strategy to strike a balance between total compute and arithmetic intensity to ensure that computation and memory access happens in parallel and to reduce the contention on VPUs / CUDA cores.
  4. With latency-aware compound scaling (LACS), which uses hardware performance instead of FLOPs as the performance objective to search for model depth, width, and resolution, we ensure parallelism at all levels for the entire model family on the Pareto-front (a toy sketch of this idea follows the list).
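
For intuition only, here is a toy sketch of the latency-aware compound scaling idea. The measure_latency and accuracy_proxy callables are placeholders, and the candidate grid and latency budget are made-up values rather than anything from the paper; the real search measures latency on actual TPU/GPU hardware.

import itertools

def lacs_search(measure_latency, accuracy_proxy, latency_budget=2.0):
    """Pick depth/width/resolution multipliers (alpha, beta, gamma) that
    maximize an accuracy proxy subject to a *measured* latency budget."""
    grid = [1.0 + 0.1 * i for i in range(11)]  # candidate multipliers 1.0 .. 2.0
    best_score, best_cfg = float("-inf"), None
    for alpha, beta, gamma in itertools.product(grid, repeat=3):
        if measure_latency(alpha, beta, gamma) > latency_budget:
            continue  # over the measured-latency budget, not a FLOPs budget
        score = accuracy_proxy(alpha, beta, gamma)
        if score > best_score:
            best_score, best_cfg = score, (alpha, beta, gamma)
    return best_cfg, best_score

# Toy demo with dummy stand-ins for the hardware measurement and the accuracy proxy
print(lacs_search(
    measure_latency=lambda a, b, g: a * b * b * g * g,  # pretend latency grows with compute
    accuracy_proxy=lambda a, b, g: a + b + g))           # pretend accuracy grows with scale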

EfficientNet-X: Platform-Aware NAS-Optimized Computer Vision Models for TPUs and GPUs
Using this approach to platform-aware NAS, we have designed EfficientNet-X, an optimized computer vision model family for TPUs and GPUs. This family builds upon the EfficientNet architecture, which itself was originally designed by traditional multi-objective NAS without true hardware-awareness as the baseline. The resulting EfficientNet-X model family achieves an average speedup of ~1.5x–2x over EfficientNet on TPUv3 and GPUv100, respectively, with comparable accuracy.

In addition to the improved speeds, EfficientNet-X has shed light on the non-proportionality between FLOPs and true performance. Many think FLOPs are a good ML performance proxy (i.e., FLOPs and performance are proportional), but they are not. While FLOPs are a good performance proxy for simple hardware such as scalar machines, they can exhibit a margin of error of up to 400% on advanced matrix/tensor machines. For example, because of its hardware-friendly model architecture, EfficientNet-X requires ~2x more FLOPs than EfficientNet, but is ~2x faster on TPUs and GPUs.
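
Plugging in the approximate ratios quoted above makes the gap concrete:

# Approximate ratios quoted above (EfficientNet-X vs. EfficientNet)
flops_ratio = 2.0   # EfficientNet-X uses ~2x the FLOPs
speedup     = 2.0   # ...yet runs ~2x faster on TPUs and GPUs

# Achieved FLOPs per second therefore differ by roughly flops_ratio * speedup
print(flops_ratio * speedup)  # ~4x, i.e. FLOPs alone mispredict performance by ~400%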

EfficientNet-X family achieves 1.5x–2x speedup on average over the state-of-the-art EfficientNet family, with comparable accuracy on TPUv3 and GPUv100.

Self-Driving ML Model Performance on New Accelerator Hardware Platforms
Platform-aware NAS exposes the inner workings of the hardware and leverages these properties when designing hardware-optimized ML models. In a sense, the “platform-awareness” of the model is a “gene” that preserves knowledge of how to optimize performance for a hardware family, even on new generations, without the need to redesign the models. For example, TPUv4i delivers up to 3x higher peak performance (FLOPS) than its predecessor TPUv2, but EfficientNet performance only improves by 30% when migrating from TPUv2 to TPUv4i. In comparison, EfficientNet-X retains its platform-aware properties even on new hardware and achieves a 2.6x speedup when migrating from TPUv2 to TPUv4i, utilizing almost all of the 3x peak performance gain expected when upgrading between the two generations.

Hardware peak performance ratio of TPUv2 to TPUv4i and the geometric mean speedup of EfficientNet-X and EfficientNet families, respectively, when migrating from TPUv2 to TPUv4i.

Conclusion and Future Work
We demonstrate how to improve the capabilities of platform-aware NAS for datacenter ML accelerators, especially TPUs and GPUs. Both platform-aware NAS and the EfficientNet-X model family have been deployed in production and materialize up to ~40% efficiency gains and significant quality improvements for various internal computer vision projects across Google. Additionally, because of its deep understanding of accelerator hardware architecture, platform-aware NAS was able to identify critical performance bottlenecks on TPUv2-v4i architectures and has enabled design enhancements to future TPUs with significant potential performance uplift. As next steps, we are working on expanding platform-aware NAS’s capabilities to the ML hardware and model design beyond computer vision.

Acknowledgements
Special thanks to our co-authors: Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le. We also thank many collaborators including Jeff Dean, David Patterson, Shengqi Zhu, Yun Ni, Gang Wu, Tao Chen, Xin Li, Yuan Qi, Amit Sabne, Shahab Kamali, and many others from the broad Google research and engineering teams who helped on the research and the subsequent broad production deployment of platform-aware NAS.

Categories
Misc

Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience

Eating into open hours and menus, a labor shortage has gobbled up fast-food service employees, but some restaurants are trying out a new staff member to bring back the drive-thru good times: AI. Toronto startup HuEx is in pilot tests with a conversational AI assistant for drive-thrus to help support service at several popular Canadian…

The post Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience appeared first on The Official NVIDIA Blog.

Categories
Misc

Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop

Don’t be fooled by the candy canes, hot cocoa and CEO’s jolly demeanor. Santa’s workshop is the very model of a 21st-century enterprise: pioneering mass customization and perfecting a worldwide distribution system able to meet almost bottomless global demand.

The post Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop appeared first on The Official NVIDIA Blog.

Categories
Misc

Wrong output from tensorflow pose classification while test data is 97% accurate

Hi all, I’m running into a problem with a student project.

I’m trying to use the tensorflow pose estimation library to create a script that recognizes different human gestures (specifically, pointing up, pointing left, and pointing right) using Movenet.

I followed the tutorial [ https://www.tensorflow.org/lite/tutorials/pose_classification ] to train my neural network using 3000+ pictures of gestures sourced from fellow students. The testing section of the tutorial shows that the model has a 97% accuracy on the test data subselection.

The output of this tutorial gives a .tflite file, and links to the [ https://github.com/tensorflow/examples/tree/master/lite/examples/pose_estimation/raspberry_pi ] github as a tutorial on how to use this .tflite to classify new input.

However the classifications seem to be completely off, not one gesture seems to be recognized. Suspicious of this result, I tried inserting some of the old training videos as input. These also seem to be classified completely wrongly, which leads me to think there is something wrong with my execution of the code.

Has anyone run into a similar problem using TensorFlow pose classification before? Or does anyone have an idea of what I could be doing wrong? I followed all the steps in the tutorials multiple times and am getting a bit hopeless…

The code I use to run the pose classification from the github:

import pose_estimation

pose_estimation.run(
    'movenet_lightning',                  # estimation_model: str
    'keypoint',                           # tracker_type: str (apparently not needed when using singlepose)
    'gesture_classifier_using_lighting',  # classification_model: str
    'gesture_labels.txt',                 # label_file: str
    'Kaj7.mp4',                           # camera_id: int (currently set to an example video used in training)
    600,                                  # width: int
    600)                                  # height: int

submitted by /u/newbroo

Categories
Misc

NVIDIA and SoftBank Group Announce Termination of NVIDIA’s Acquisition of Arm Limited

NVIDIA and SoftBank Group Corp. (SBG) today announced the termination of the previously announced transaction whereby NVIDIA would acquire Arm Limited from SBG.

Categories
Misc

AVX-512 on windows?

I have looked high and low and I can find no clarity on whether or not it’s even worth getting a CPU with the Intel DL Boost instruction set for use in Windows.

All I seem to get is AVX and AVX2 reporting as enabled.

Is my assumption that AVX512 is not supported in Windows correct? Do I just need to build from source?
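
For what it's worth, one quick way to see what the CPU itself reports (a sketch using the third-party py-cpuinfo package; whether TensorFlow actually uses those instructions still depends on how the wheel was built):

# pip install py-cpuinfo
import cpuinfo

flags = cpuinfo.get_cpu_info().get('flags', [])
print(sorted(f for f in flags if f.startswith('avx')))
# An AVX-512-capable CPU lists entries such as 'avx512f' alongside 'avx' and 'avx2'.
# The stock Windows TensorFlow wheels are generally reported to be built without
# AVX-512 kernels, so taking advantage of AVX-512 typically does mean building from source.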

submitted by /u/luxdesigns

Categories
Misc

Get Started on NLP and Conversational AI with NVIDIA DLI Courses

Building an NLP model or AI-powered chatbot? Developers can learn how to create, train, and deploy sample models with free NVIDIA DLI courses.

This past year, NVIDIA announced several major breakthroughs in conversational AI for building and deploying automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications.

To get developers started with some quick examples in a cloud GPU-accelerated environment, NVIDIA Deep Learning Institute (DLI) is offering three fast, free, self-paced courses.

What will you learn?

These instructional DLI courses give developers a taste of how to use modern tools to quickly create GPU-accelerated conversational AI and NLP applications. Learning objectives include:

  • Train a Text Classification Model Using TAO Toolkit
    • Train and fine-tune a BERT text classification model on the SST-2 dataset.
    • Run evaluation and inference on the model.
    • Export the model to ONNX format or Riva format for deployment.
  • Deploy a Text Classification Model Using Riva
    • Use Riva ServiceMaker to take a TAO-exported Riva model and convert it for final deployment.
    • Deploy the model(s) locally on the Riva Server.
    • Send inference requests from a demo client using Riva API bindings.
  • Riva Speech API Demo
    • Send audio to an ASR model and receive back text.
    • Use NLP models to transform text, classify text, and classify tokens.
    • Send text to a TTS model and receive back audio.

Upon course completion, developers will be familiar with:

  • How to train, infer, and export a text classification model using NVIDIA TAO Toolkit on NVIDIA GPUs.
  • How to deploy a text classification model using NVIDIA Riva on NVIDIA GPUs.
  • How to construct requests to an NVIDIA Riva Speech server from a sample client.

Why is text classification useful?

Text classification answers the question: Which category does this bit of text belong in? For example, if you want to know whether a movie review is positive or negative, you can use two categories to build a sentiment analysis project.

Take this one step further, and classify sentences or documents by topic using several categories. In both use cases, you start with a pre-trained language model and then “train” a classifier using example classified text to create your text classification project.
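
For a concrete picture of the two-category case, here is a generic Keras toy example for illustration only; the DLI courses themselves use the TAO Toolkit with a pretrained BERT model rather than this setup.

# A tiny two-category ("sentiment") text classifier, purely illustrative
import tensorflow as tf

texts  = ["great movie", "loved it", "terrible film", "waste of time"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vectorize = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(tf.constant(texts), tf.constant(labels), epochs=20, verbose=0)

print(model.predict(tf.constant(["what a great film"])))  # closer to 1 means positive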

Granted, text classification is just one of many NLP tasks that use a pre-trained language model to understand written language. Once developers try NVIDIA TAO Toolkit and NVIDIA Riva to train and deploy text classification projects, they will be in a position to extend that experience to additional NLP tasks, such as named entity recognition (NER) and question answering.

How does the NVIDIA Riva Speech API work?

The Riva Speech API server exposes a simple API for performing speech recognition, speech synthesis, and a variety of NLP inferences. In this course, developers use Python examples to run several of these API calls from within a Riva sample client. The server is prepopulated with ASR, NLP, and TTS models. These built-in models allow developers to test several conversational AI components quickly with ease.

Start learning about NLP and conversational AI

Categories
Misc

Create Speech AI Applications in Multiple Languages and Customize Text-to-Speech with Riva


This month, NVIDIA released world-class speech-to-text models for Spanish, German, and Russian in Riva, empowering enterprises to deploy speech AI applications globally. In addition, enterprises can now create expressive speech interfaces using Riva’s customizable text-to-speech pipeline.

NVIDIA Riva is a GPU-accelerated speech AI SDK for developing real-time applications like live captioning, adding voice to text-based chatbots, and generating real-time transcription in call centers. For easy implementation, Riva offers highly accurate pretrained models in the NGC catalog. 

With the TAO Toolkit, these models can be customized for any industry, including telecommunications, finance, unified communications as a service, and healthcare. Developers can use Riva to deploy these models out of the box. They are optimized to run in real time, with latency under 300 ms, in the cloud, in the data center, and at the edge.

Riva release highlights include:

  • World-class speech recognition skills in Spanish, German, and Russian.
  • Customizable text-to-speech pipeline for expressive interactions.
  • Low-code fine-tuning workflow with TAO Toolkit.

Automatic speech recognition in multiple languages

Every conversational AI application, from call centers to virtual assistants, relies heavily on automatic speech recognition. Enterprises can extend these apps globally with Riva automatic speech recognition in English, Spanish, German, and Russian.

Figure 1: NVIDIA Riva world-class automatic speech recognition is available in English, Spanish, German, and Russian.

The non-English automatic speech recognition models are trained on a variety of open-source datasets, such as Mozilla Common Voice, as well as private datasets. Riva automatic speech recognition models are developed to provide out-of-the-box accuracy and serve as a great starting point for adapting to industry, jargon, dialect, or even noisy surroundings. On popular evaluation datasets, these models deliver world-class accuracy on several industry applications.

Customizable text-to-speech pipelines

For customers to enjoy lifelike dialogues, speech applications must offer human-like expressions. Using FastPitch, a new model created by the NVIDIA speech AI research team, Riva helps developers customize the text-to-speech pipeline and create expressive speech interfaces. For example, at inference time, developers can vary voice pitch and speed using SSML tags.
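
For example, an SSML fragment of the kind described above might look like the following; the specific rate and pitch values here are placeholders, and the exact attributes and value ranges Riva accepts should be checked against the Riva TTS documentation.

# Illustrative SSML with prosody controls; values are placeholders, not documented Riva settings
ssml_text = """
<speak>
  This sentence uses the default voice,
  <prosody rate="slow" pitch="+1st">this one is slower and slightly higher,</prosody>
  and this one is back to normal.
</speak>
"""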

Figure 2: NVIDIA Riva provides customizable text-to-speech pipelines for more expressive interactions.

The latest state-of-the-art models, such as FastPitch in Riva, help text-to-speech pipelines run several times faster than other competing options on the market.

Resources

Subscribe to the NVIDIA Developer Blog to stay up to date on all things Conversational AI/NLP.

Categories
Misc

Analyzing Fabric-wide Network Latency with NetQ 4.1.0


NetQ 4.1.0 was recently released, introducing fabric-wide network latency and buffer occupancy analysis along with many other enhancements. For more information about all the new capabilities, see the NetQ 4.1.0 User Guide.

This post covers the following features:

  • Flow-based fabric-wide latency and buffer occupancy analysis (new)
  • What Just Happened (WJH) dashboard (new)
  • Generic webhook notifications (new)
  • Validation improvements
  • gNMI streaming enhancements

Fabric-wide latency and buffer occupancy analysis

For the first time, NetQ offers fabric-wide network latency and buffer occupancy analysis, using live application traffic to troubleshoot network issues that impact application performance. NetQ, working with Cumulus Linux, samples packets matching a 4-tuple or 5-tuple application flow, then analyzes and reports per-switch latency (maximum, minimum, average) and buffer occupancy details along the path of the flow.
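
For reference, the flow is identified by the familiar packet header fields. The minimal illustration below only shows what the 4-tuple and 5-tuple refer to; NetQ itself takes these fields through its own UI and CLI.

from collections import namedtuple

# 5-tuple: source/destination address, source/destination port, and protocol;
# a 4-tuple is commonly the same key without the protocol field.
FiveTuple = namedtuple("FiveTuple", "src_ip dst_ip src_port dst_port protocol")

flow = FiveTuple(src_ip="10.1.1.10", dst_ip="10.2.2.20",
                 src_port=49152, dst_port=443, protocol="tcp")
print(flow)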

The NetQ graphical user interface reports all the possible paths, paths in use, and per-path details (Figure 1). On each switch, you can see minimum latency, maximum latency, and average latency. 

WJH is an always-on, full-packet-inspection tool that detects network issues, such as packet drops, congestion, and latency problems, at line rate. Working with WJH, flow telemetry can be enabled on demand for deep analysis and troubleshooting of traffic matching specific flows.

Using these capabilities together, network engineers can proactively identify and root cause server and application issues and inform the server or application administrator about the possible outage or performance impact.

Figure 1. Flow-based telemetry: NetQ analyzes all paths between nodes to determine latency statistics

How does this differ from sFlow?

NetQ flow-based telemetry enables you to select the flow for analysis using 5-tuple or 4-tuple information of the packet, including VXLAN inner or outer headers.

sFlow does not have this level of flexibility and normally monitors at a physical port level. sFlow also provides a specific device attribute without correlating to peer devices in the network and therefore does not have the capability to provide data to build a network topology.

How does this differ from traceroute?

In the case of traceroute, the host generates a packet that runs through the network to collect the trace data. NetQ flow telemetry analysis uses the actual application packets to build the trace data. The data you get from flow-based telemetry includes latency and buffer occupancy reported by the hardware. Traceroute provides only ping-level latency accuracy, which is far less precise than hardware-level latency measurements.

How does this differ from in-band flow analysis?

In-band flow analysis is intrusive in nature as each hop adds metadata to the packets in the data plane. This creates network overhead. NetQ flow-based telemetry does not alter the data plane packet structure.

What Just Happened events dashboard

The new What Just Happened (WJH) dashboard introduced with NetQ 4.1 presents a timeline view of WJH events, the top switches generating WJH events, the top event types, and more. This enables you to quickly absorb insights from WJH events data collected from all the switches in the fabric. WJH event details can be used to trigger the flow telemetry analysis described earlier.

Figure 2. The What Just Happened events dashboard presents key network events

Here’s how to get the most out of WJH: Learn, clean, and personalize.

Learn

Collect WJH events from all switches centrally into NetQ. This enables you to see what is happening in your network with respect to drops, congestion, ACLs, and more. NetQ dashboards are organized by drop category:

  • L1
  • L2
  • Router
  • Tunnel
  • Buffer
  • ACL

Clean

WJH reports issues that inform network, server, and storage admins. Resolve the network issues identified by WJH in priority order.

Personalize

Set WJH filters on NetQ to receive only selected WJH events going forward. For example, you may not want to receive ACL drops until the next revision of ACL updates. This reduces the volume of events to what matters to you.

NetQ also offers the flexibility to receive only specific event types, or only events with a specified severity, and you can set up threshold-crossing alerts on WJH events that integrate with notification tools such as PagerDuty and Slack.

Validation enhancements

In the production network, NetQ validations provide insight into the live state of the network and help with proactive monitoring and troubleshooting. As part of NetQ 4.1.0, validation checks are re-architected, resulting in significant performance improvements.

  • Hourly network validation checks can be enabled or disabled depending on protocols running in your data center fabric.
  • Global validation check filters applied to hourly network validation checks with NetQ 4.1.0 enable network admins to establish a clean network validation state matching what’s running in the network.

When this baseline is established, it is easy to detect any deviations using NetQ.

In NetQ 4.1, a grouping concept has been added to create multiple validation scopes within a site.  Using this, customers with multiple fabrics in a single site can run per-fabric on-demand and scheduled validations.

Figure 3. NetQ validation checks: the validation check process has been re-architected for higher performance and ease of use
Figure 4. Validation device groups: switches can be grouped to make it simpler and faster to validate a subset of the network

gNMI streaming enhancements

NetQ 4.1.0 supports gNMI, the gRPC network management interface, to collect WJH data from the NetQ Agent on SONiC in addition to Cumulus Linux. For Cumulus Linux switches, system resource and interface counters can be streamed using gNMI. YANG Model details are available in the User Guide.

Generic webhook notification support

NetQ 4.1.0 introduced support for generic webhook notifications in addition to email, syslog, PagerDuty, and Slack event notification distribution options. Generic webhook enables NetQ to integrate with custom applications using event payload information in JSON format. 
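
As a sketch of what the receiving side can look like (standard library only; the payload fields accessed below are hypothetical, since the actual JSON schema is defined in the NetQ documentation):

# Minimal sketch of a custom webhook receiver for JSON event notifications
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class NetqWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        # Route on hypothetical fields; consult the NetQ docs for the real schema
        print("received event:", event.get("severity"), event.get("message"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), NetqWebhookHandler).serve_forever()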

Summary

In this post, you’ve seen an overview of the new capabilities available with NetQ 4.1.0. You can further explore NetQ 4.1.0 using NVIDIA Air.  For more information, see Troubleshooting Networks with NetQ.