Categories
Misc

GTC21: Top 5 Higher Education and Research Technical Sessions

Join speakers and panelists considered to be pioneers of AI, technologists, and creators who are re-imagining what is possible in higher education and research.

This year at GTC, you will join speakers and panelists considered to be pioneers of AI, technologists, and creators who are re-imagining what is possible in higher education and research. 

By registering for this free event you’ll get access to these top sessions, and more: 

  1. The DLI University Ambassador Program: Preparing Today’s Students and Researchers for Tomorrow’s AI and Accelerated Computing Challenges

    NVIDIA’s Deep Learning Institute (DLI) University Ambassador Program facilitates free hands-on training in AI and accelerated computing in academia to solve real-world problems. The program provides educators the opportunity to get certified to teach instructor-led DLI workshops, at no cost, across campuses and academic conferences. DLI workshop topics span from the fundamentals of deep learning, accelerated computing, and accelerated data science to more advanced workshops on NLP, intelligent recommender systems, and health-care imaging analysis. Select courses offer a DLI certificate to demonstrate subject matter competency and support career growth. Join NVIDIA’s higher education leadership and Professor Manuel Ujaldon from the University of Malaga, Spain to learn more about the program and how to apply. Manuel will also discuss how he leverages the program for his students and how these resources have helped him improve the quality of online learning during the COVID-19 pandemic.

    Joe Bungo, DLI Program Manager, NVIDIA
    Manuel Ujaldón, Professor, Computer Architecture Department, University of Malaga

  1. Insights From NVIDIA Research

    This talk will give some highlights from NVIDIA Research over the past year. Topics will include high-performance optical signaling, deep learning accelerators, applying AI to video coding, and the latest in computer graphics.

    Bill Dally, Chief Scientist and SVP Research, NVIDIA

  1. NASA Frontier Development Lab: Severe Weather Prediction Using Lightning Data

    Learn how a state-of-the-art model for time series classification was used for severe weather (tornadoes and severe hail thunderstorms) prediction using lightning data from the Geostationary Lightning Mapper (GLM) device aboard the NOAA GOES-16 satellite. These results are the outcome of NASA Frontier Development Lab (FDL) 2020: Lightning and Extreme Weather, and were accepted to oral presentations in two NeurIPS 2020 workshops and are currently submitted for publication. According to our tests, the GPU implementation of the convolutional time series approach was 27x faster than a CPU implementation. This dramatic gain in speed allowed rapid iteration to build more and better models. Find out how leveraged by NVIDIA V100 GPUs, our results suggest that, with a 15 minute lead time, false alarms for warned thunderstorms could be decreased by 70% and that tornadoes and large hail could be correctly identified approximately 3 out of 4 times using lightning data only.

    Ivan Venzor, Data Science Deputy Director, Banregio

  1. Hands-On Deep Learning Robotics Curriculum in High Schools with Jetson Nano

    NVIDIA’s Jetson AI Ambassador will show you how to implement fundamental deep learning and robotics in high schools with Jetson Nano developer kit. If you’re interested in AI education, you’ll learn how to get the most from NVIDIA Jetson Nano developer kit and DLI courses. This AI curriculum includes deep learning fundamental concepts, Python programming skills, an image processing algorithm, and an integrated robotics project (Jetbot). Students can evaluate what they’ve learned and have an instant response on the robot’s behavior.

    David Tseng, Manager, CAVEDU Education

  1. NVIDIA Kaolin and Omniverse for 3D Deep Learning Research

    Get an introduction to NVIDIA’s Kaolin library for accelerating 3D deep learning research, as well as a demonstration of APIs for Kaolin’s GPU-optimized operations such as modular differentiable rendering, fast conversions between representations, data loading, 3D checkpoints, and more. In this session, we’ll also demonstrate using the Omniverse Kaolin application to visualize training datasets, generate new ones, and to observe the progress of ongoing training of models.

    Jean-Francois Lafleche, Deep Learning Engineer, NVIDIA
    Clement Fuji Tsang, Research Scientist, NVIDIA

Visit the GTC website to view more recommended sessions and to register for the free conference.

Categories
Misc

Now Hear This: Startup Gives Businesses a New Voice

Got a conflict with your 2pm appointment? Just spin up a quick assistant that takes good notes and when your boss asks about you even identifies itself and explains why you aren’t there. Nice fantasy? No, it’s one of many use cases a team of some 50 ninja programmers, AI experts and 20 beta testers Read article >

The post Now Hear This: Startup Gives Businesses a New Voice appeared first on The Official NVIDIA Blog.

Categories
Misc

Drum Roll, Please: AI Startup Sunhouse Founder Tlacael Esparza Finds His Rhythm

Drawing on his trifecta of degrees in math, music and music technology, Tlacael Esparza, co-founder and CTO of Sunhouse, is revolutionizing electronic drumming. Esparza has created Sensory Percussion, a combination of hardware and software that uses sensors and AI to allow a single drum to produce a complex range of sounds depending on where and Read article >

The post Drum Roll, Please: AI Startup Sunhouse Founder Tlacael Esparza Finds His Rhythm appeared first on The Official NVIDIA Blog.

Categories
Misc

Art and Music in Light of AI

In the sea of virtual exhibitions that have popped up over the last year, the NVIDIA AI Art Gallery offers a fresh combination of incredible visual art, musical experiences and poetry, highlighting the narrative of an emerging art form based on AI technology. The online exhibit — part of NVIDIA’s GTC event — will feature Read article >

The post Art and Music in Light of AI appeared first on The Official NVIDIA Blog.

Categories
Misc

Maintaining Container Security as the Core of NGC with Anchore Enterprise

Containers have quickly gained strong adoption in the software development and deployment process and has truly enabled us to manage software complexity. It is not surprising that, by a recent Gartner report, more than 70% of global organizations will be running containerized applications in production by 2023. That’s up from less than 20% in 2019. … Continued

Containers have quickly gained strong adoption in the software development and deployment process and has truly enabled us to manage software complexity. It is not surprising that, by a recent Gartner report, more than 70% of global organizations will be running containerized applications in production by 2023. That’s up from less than 20% in 2019.

However, containers also bring security challenges to IT and security practitioners. Shipping containers can be a potential hiding place for illegal contraband. You may not be fully aware of the contents of a software container. That’s why it’s critical to have a comprehensive understanding of the contents of the containers that you deploy. Security is no longer an afterthought for IT and security admins, but there is a need to adopt security best practices early in the software building process.

Today, there are numerous software marketplaces from which to pull a variety of containerized software tools to help you speed up software development. However, this speedup in the development process is counterproductive if the DevSecOps or IT team flags the software for security lapses, preventing deployment to production. This can lead to delays in production and, eventually, revenue loss.

To speed up development in a repeated and an automated format, the most common starting point is to download a publicly available image and build on top of it. Unknowingly, you might expose your new application code or service to the risk of vulnerabilities, which are inherited from base images. Some of the most common threats include images that have unpatched vulnerabilities or mistakenly granting many privileges that can have potential escalation in production environments, related to exposed insecure ports, private keys, or secrets. Relying on software images from trusted sources, like NVIDIA NGC, can play a key role in accelerating the software development cycle.

When you layer your own application code with NGC images as base images, you may only have to worry about the code layers that you add on top of it. Secondly, every time a CVE is identified in any layer, you must build an image from scratch, which may take several hours and may be time-consuming. Using NGC images to build production applications or services helps you reduce time to deployments.

Container security at the core of the NGC catalog

The software from NGC provides a high level of security assurance required by enterprises. Curated containers on NGC can enable rapid application development with minimal investment as the NGC containers undergo performance regression testing, and functional and security checks ahead of a release.

The NGC container publication process has container image scanning by Anchore at its core. Image scanning refers to the process of analyzing the contents of a container image to detect security issues, vulnerabilities, or bad practices.

NGC registry integrates security scanning as an SaaS offering where images are retrieved and scanned with the Anchore solution. The security scans include checks like the following:

  • Vulnerability, such as CVE-mapping
  • Metadata scans such as Dockerfiles
  • Data or key leaks such as crypto keys
  • Open ports

The scanning policy for CVEs measures severity into critical, high, medium, and low vulnerabilities using the Common Vulnerability Scoring System (CVSS). Known CVEs are patched before publishing an image to NGC. The scan results may vary in time as new CVEs are published each hour and the new CVEs may not be known at the time of publishing. The scan results allow publishers to identify any red flags early in the development process, saving development time using Anchore’s best-in-class, high signal-to-noise ratio scanning technology, which means fewer false positives.

Figure 1 shows a sample results of an image scanned in NGC, with two high vulnerabilities found in OS packages. It also provides CVE links to detailed descriptions on the security threats exposed and if it was patched in upstream versions. Developers and security can analyze the risk further to triage it.

Sample results of an image scanned in NGC with two high vulnerabilities found in OS packages
Figure 1. Detailed security scan obtained during the container publishing process in NGC.

The most popular products like PyTorch, TensorFlow, Triton, TensorRT, MXNet, RAPIDS, CUDA, and nv-HPC SDK update their NGC images on a monthly release cadence, assuring that the latest security patches are applied.

As software complexity increases with the need for additional capabilities, you rely on additional packages and software layers, which in turn increases security risks and exposures. Our security development practices drive to a minimal memory footprint as we provide thinner images in flavors of development and deployment images. For example, the CUDA base is used to build applications but the CUDA runtime image is used for deployments. This leads to a smaller attack surface, where unused packages or debug tools are eliminated.

Thus, NGC aims to provide a strong foundation for enterprises by adapting to security best practices such as scanning and other approaches. As upgrading, testing, and deploying gets easier with containers, you are encouraged to upgrade to the latest NGC image versions. This not only reduces security risks from recently found CVEs, but also allows you to get maximum performance delivered on NVIDIA GPUs.

GTC 21

To learn more about container security, join us for the Industry Experts Discuss Container Security and Best Practices for Software Development Stakeholders GTC panel session on April 14, 1PM (registration required to view). During the session, security industry experts discuss the best practices that data scientists and developers can follow to be more vigilant in identifying and pulling secure software images. Register today!

Categories
Misc

N Ways to SAXPY: Demonstrating the Breadth of GPU Programming Options

Back in 2012, NVIDIAN Mark Harris wrote Six Ways to Saxpy, demonstrating how to perform the SAXPY operation on a GPU in multiple ways, using different languages and libraries. Since then, programming paradigms have evolved and so has the NVIDIA HPC SDK. In this post, I demonstrate five ways to implement a simple SAXPY computation … Continued

Back in 2012, NVIDIAN Mark Harris wrote Six Ways to Saxpy, demonstrating how to perform the SAXPY operation on a GPU in multiple ways, using different languages and libraries. Since then, programming paradigms have evolved and so has the NVIDIA HPC SDK.

In this post, I demonstrate five ways to implement a simple SAXPY computation using NVIDIA GPUs. Why is this interesting? Because it demonstrates the breadth of options that you have today for programming NVIDIA GPUs. It also covers the four main approaches to GPU computing:

  • GPU-accelerated libraries
  • Compiler directives
  • Standard language parallelism
  • GPU programming languages

SAXPY stands for Single-Precision A·X Plus Y,  a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like the following:

void saxpy_cpu(int n, float a, float *x, float *y)
{
	for (int i = 0; i 

Given this basic example code, I can now show you five ways to SAXPY on GPUs. I chose SAXPY because it is a short and simple code, but it shows enough of the syntax of each programming approach to compare them. Because it does relatively little computation, SAXPY isn’t that useful for demonstrating the difference in performance between the different programming models, but that’s not my intent here. My goal is to demonstrate multiple ways to program on the NVIDIA platform today, rather than to recommend one over another. That would require taking other factors into account and is beyond the scope of this post.

I discuss implementations of SAXPY in the following models:

  • CUDA C++—A C++ language extension to support the CUDA programming model and allow C++ code to be executed on NVIDIA GPUs.
  • cuBLAS—A GPU-accelerated implementation of the basic linear algebra subroutines (BLAS) optimized for NVIDIA GPUs.
  • OpenACC—Using compiler directives to tell the compiler that a given portion of the code can be parallelized and letting the compiler figure out how to do it.
  • Standard C++—Using the NVC++ compiler and parallel execution policies added to the standard library with C++11 and 17.
  • Thrust—A high-level, GPU-accelerated parallel algorithms library.

After going through all the implementations, I show what performance looks like when SAXPY is accelerated through these approaches.

CUDA C++ SAXPY

__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
	unsigned int t_id = threadIdx.x + blockDim.x * blockIdx.x;
	unsigned int stride = blockDim.x * gridDim.x;
	for (int i = t_id; i >>(n, 2.0, dev_x, dev_y);
cudaDeviceSynchronize();

CUDA C++ is a GPU programming language that provides extensions to the C/C++ language for expressing parallel computation. Device functions, called kernels, are declared with the __global__ specifier to show that they can be called either from host code or device code. Device memory to hold the float vector is allocated using cudaMalloc. Then, the kernel defined is called with an execution configuration:

>>

Each thread launched executes the kernel, using built-in variables like threadIdx, blockDim, and blockIdx. The variables are assigned by the device for each thread and block and are used to calculate the index of the elements in the vector for which it is responsible. In doing so, each thread does the multiply-add operation on a limited number of elements of the vector. In the case where the number of threads is less than the size of the vector, each thread computes a stride to operate on multiple elements so that the entire vector is taken care of (Figure 1).

An array of n=8 blocks with the first four set as stride or the number of threads in the grid. Four GPU threads, each with arrows originating from the thread and ending at the array blocks corresponding to the thread id and thread id + stride.
Figure 1. How GPU threads operate across a large array using a grid-stride when the number of threads is fewer than the number of elements. The stride is the number of threads in the grid, so the kernel loops over the data array one grid-size at a time.

cuBLAS SAXPY

cublasHandle_t handle;
cublasCreate(&handle);
unsigned int n = 1UL 

SAXPY, being a BLAS operation, has an implementation in the NVIDIA cuBLAS library. It involves initializing a cuBLAS library context by passing a handle to cublasCreate, allocating memory for the vectors, and then calling the library function cublasSaxpy while passing in the vector and scalar values. Finally, cublasDestroy and cudaFree are used to release the resources associated with the cuBLAS library context and device memory allocated for the vectors, respectively.

OpenACC C++ SAXPY

void saxpy(int n, float a, float *restrict x, float *restrict y)
{
#pragma acc kernels

	for (int i = 0; i 

OpenACC is a directive-based programming model that uses compiler directives through #pragma to tell the compiler that a portion of the code can be parallelized. The compiler then analyzes the instruction and automatically generates code for the GPU.  OpenACC provides options for fine-tuning launch configurations in those instances where the automatically generated code may not be optimal.

Compilers with support for NVIDIA GPUs like nvc++ can offload computation to the GPU using unified memory to seamlessly copy data between the host and device. Adding #pragma acc kernels tells the compiler to generate a kernel for the following for loop. Because you allocated x and y on the host using the malloc instruction, the compiler uses unified memory to move the vector to the device before computation and back to the host afterward. The compiler generates instructions to move the vectors x and y into device memory and do a fused multiply-add for each element.

std::par C++ SAXPY

void saxpy(int N, float a, float *restrict x, float *restrict y)
{
std::transform(std::execution::par_unseq, x, x + N, y, y,[=](float xi, float yi) { return a * xi + yi; });
}
float alpha = 2.0;
unsigned int n = 1UL 

With the NVIDIA NVC++ compiler, you can use GPU acceleration in standard C++ with no language extensions, pragmas, directives, or libraries, other than the C++ standard library. The code, being standard C++, is portable to other compilers and systems and can be accelerated on NVIDIA GPUs or multicore CPUs using NVC++.

With the new features for parallel execution and execution policies introduced with C++11 and 17, algorithms in the standard library like std::transform and std::reduce added an execution policy as the first parameter to any algorithm that supports execution policies. You can thus pass std::execution::par_unseq to std::transform, defining a lambda that captures by value and performs the saxpy operation. When compiled using the -stdpar command line option, the compiler compiles standard algorithms that are called with a parallel execution policy for execution on NVIDIA GPUs.

Thrust SAXPY

struct saxpy_functor
{
    const float a;
    saxpy_functor(float _a) : a(_a) {}
    __global__ float operator()(const float &x, const float &y)
    {
        return a * x + y;
    }
};
float alpha = 2.0;
unsigned int n = 1UL  x(n);
thrust::device_vector y(n);
thrust::fill(x.begin(), x.end(), 1.0);
thrust::fill(y.begin(), y.end(), 1.0);
thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(alpha));

Thrust is a parallel algorithms library that resembles the C++ Standard Template Library (STL). It provides parallel building blocks to develop fast, portable algorithms. Interoperability with established technologies like CUDA and TBB, along with its modular design, allows you to focus on the algorithms instead of the platform-specific implementations.

Here, you allocate memory on the device, in this case the NVIDIA GPU for x and y. You then use the fill function to initialize them. Finally, you use the Thrust transform algorithm along with the defined functor saxpy_functor to apply the y=a*x+y operation to each element of x and y.

SAXPY performance

While SAXPY is a bandwidth-bound operation and not computationally complex, its highly parallel nature means that it still benefits from GPU acceleration if the problem size is large enough. When compared to a dual socket AMD EPYC 7742 system with 128 cores and 256 threads, an NVIDIA A100 GPU was 23x faster, executing more than 3000 SAXPY operations in the time that the CPU took to do 140. Furthermore, all the GPU-accelerated implementations gave a similar performance, with cuBLAS edging out the rest by a slight margin (Figure 2).

Bar graph showing how many times each implementation could run the SAXPY operation in 1 second. CPU std_par C++ with the lowest value of 140 with the GPU implementations of CUDA C++, CUBLAS, OpenACC, std_par C++, and Thrust with similar values of 3100 with CUBLAS being slightly faster than the others on the order of 100-200 iterations.
Figure 2. SAXPY performance on a dual socket AMD EPYC 7742 system with 128 cores and 256 threads when compared to GPU accelerated implementations on A100.

Accelerating your code with NVIDIA GPUs

The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools enabling you to choose the programming model that works best for you and still get excellent performance by accelerating your code using NVIDIA GPUs. Learn more and get started today:

Try NVIDIA GPU acceleration for your code from any major cloud service provider. Try A100 in the cloud at Google Cloud Platform, Amazon Web Services, or Alibaba Cloud.

Categories
Misc

Tensorflow 1.14, Fix : “google.protobuf.message.DecodeError”: Error parsing message

Protobuf v3.15 Error: google.protobuf.message.DecodeError, When using tf.graph(), loading TensorFlow model into memory. After changing tf.graph() snippet above into TensorFlow v2, same error was getting.

I have tried protobuf 3.12.4(same on colabs), same error appeared

https://stackoverflow.com/questions/66842689/tensorflow-1-14-fix-google-protobuf-message-decodeerror-error-parsing-mess

Traceback (most recent call last): File "object_detection/webcam.py", line 25, in <module> od_graph_def.ParseFromString(serialized_graph) google.protobuf.message.DecodeError: Error parsing message [ WARN:0] global C:projectsopencv-pythonopencvmodulesvideoiosrccap_msmf.cpp (674) SourceReaderCB::~SourceReaderCB terminating async callback 

I have reinstalled different protobuf version and still same error is getting.

I have trained a “SSD MobileNet” model using TensorFlow version 1.14 CPU for Webcam Object-detection with OpenCV. After installing required libraries of TensorFlow, I run model_builder_tf1.py and it successfully passed all 21 tests.

Snippet: to load TensorFlow model into memory using tf.graph()

detection_graph = tf.Graph() with detection_graph.as_default(): od_graph_def = tf.compat.v1.GraphDef() with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid: serialized_graph = fid.read() od_graph_def.ParseFromString(serialized_graph) tf.import_graph_def(od_graph_def, name='') sess = tf.compat.v1.Session(graph=detection_graph) 

Note that TensorFlow 1.14 is installed on conda environment.

Using protobuf==3.8, another of error appeared

AttributeError: module ‘google.protobuf.descriptor’ has no attribute ‘_internal_create_key

Can someone please give a solution to this problem.

submitted by /u/jhivesh
[visit reddit] [comments]

Categories
Misc

Doubling Network File System Performance with RDMA-Enabled Networking

This post was originally published on the Mellanox blog. Network File System (NFS) is a ubiquitous component of most modern clusters. It was initially designed as a work-group filesystem, making a central file store available to and shared among several client servers. As NFS became more popular, it was used for mission-critical applications, which required access … Continued

This post was originally published on the Mellanox blog.

Network File System (NFS) is a ubiquitous component of most modern clusters. It was initially designed as a work-group filesystem, making a central file store available to and shared among several client servers. As NFS became more popular, it was used for mission-critical applications, which required access to storage. Next, migration to higher performing networks was implemented to improve client-to-NFS communications. In addition to higher networking speeds (today 100 GbE and soon 200 GbE), the industry has been looking for technologies that offload stateless networking functions that run on the CPU to the IO subsystems. This leaves more CPU cycles free to run business applications and maximizes the data center efficiency.

One of the more popular networking offload technologies is RDMA (Remote Direct Memory Access). RDMA makes data transfers more efficient and enables fast data move­ment between servers and storage without involving its CPU. Throughput is increased, latency reduced, and CPU power is freed up for the applications. RDMA technology is already widely used for efficient data transfer in render farms and large cloud deployments, including the following:

  • Microsoft Azure
  • HPC solutions (including machine learning and deep learning)
  • iSER and NVMe-oF-based storage
  • Mission-critical SQL database solutions such as Oracle RAC (Exadata)
  • IBM DB2 pureScale
  • Microsoft SQL solutions and Teradata
Figure 1. Data communication over TCP vs. RDMA.

Figure 1 shows why IT managers have been deploying RoCE (RDMA over Converged Ethernet). RoCE uses advances in Ethernet to enable more efficient RDMA over Ethernet and enables widespread deployment of RDMA technologies in mainstream data center applications.

The growing deployment of RDMA-enabled networking solutions in public and private clouds—like RoCE that enables ruining RDMA over Ethernet, plus the recent NFS protocol extensions—enables NFS communication over RoCE. For more information, see the Open Source NFS/RDMA Roadmap presentation given at the OpenFabrics Workshop in 2017 by Chuck Lever, an upstream Linux contributor and Linux kernel architect at Oracle. For more information about how to run NFS over RoCE, see How to Configure NFS over RDMA (RoCE).

To evaluate the boost that RoCE enables compared to TCP, we ran the IOzone test, measured the read/write IOPS, and throughput of multi-thread read or write tests. The tests were performed on a single client against a Linux NFS server using tmpfs, so that storage latency was removed from the picture and transport behavior exposed.

  • Client server: Intel Core i5-3450S CPU @ 2.80GHz one socket, four cores, HT disabled 16-GB RAM, 1333 MHz DDR3, non-ECC HCA together with the NVIDIA Mellanox ConnectX-5 100 GbE NIC (SW version 16.20.1010) plugged into in a PCIe 3.0 x16 slot.
  • NFS server: Intel Xeon CPU E5-1620 v4 @ 3.50GHz one socket, four cores, HT disabled 64-GB RAM, 2400 MHz DDR4 HCA, together with the ConnectX-5 100 GbE NIC (16.20.1010) plugged into in a PCIe 3.0 x16 slot.

The client and the NFS server were connected over a single 100-GbE NVIDIA Mellanox LinkX copper cable to the NVIDIA Mellanox Spectrum switch using the SN2700 model with its 32 x 100-GbE ports, which is the lowest latency Ethernet switch available in the market today. This makes it ideal for running latency-sensitive applications over Ethernet.

The following charts show the bandwidth and IOPS measured for performance over RoCE vs. TCP, running the IOzone test.

Figure 2. Running NFS over RoCE enables 2X to 3X higher bandwidth (using a 128-KB block size, read and write with 16 threads, aggregate throughput).
Figure 3. NFS over RoCE enables up to 140% higher IOPs (using a 8-KB block size, read and write with 16 threads, aggregate IOPs).
Figure 4. NFS over RoCE enables up to 150% higher IOPs (using a 2-KB block size, read and write with 16 threads, aggregate IOPs).

Conclusion

Running NFS over RDMA-enabled networks—such as RoCE, which offloads the CPU from performing the data communication job—generates a significant performance boost. As a result, Mellanox expects that NFS over RoCE will eventually replace NFS over TCP and become the leading transport technology in data centers.

Categories
Misc

Integrating with Telephone Networks to Enable Real-Time AI Services

Many of you may not recognize my company, Ribbon Communications. We are best known for building and securing large telecom networks for communication service providers (also known as phone companies). However, there’s a good chance that in the next day or two, you’ll place a call that traverses a piece of our gear somewhere in … Continued

Many of you may not recognize my company, Ribbon Communications. We are best known for building and securing large telecom networks for communication service providers (also known as phone companies). However, there’s a good chance that in the next day or two, you’ll place a call that traverses a piece of our gear somewhere in the world. In addition to service providers, we have substantial practice working with large enterprises, the kinds of organizations that need carrier-grade services, either because of their size or the critical nature of their communications. That includes universities, healthcare institutions, financial services, government agencies, and so on.

A short while ago, one of our customers, one of the largest investment banks in the world, approached Ribbon with a problem. They wanted to use advanced AI to analyze their contact center calls, in real-time, so that they could make immediate business decisions based on AI-based observations. They wanted to be able to ingest the audio stream, immediately transcribe it into text, and then also immediately analyze the text to look for issues such as customer satisfaction, threatening behavior, and fraud attempts. The sooner the text was transcribed, the easier it would be to store and search. Our customer could also use it for other forms of trend analysis that could spot upcoming issues, for example, customer sentiment with a certain agent.

Anyone that has ever tried to search a recording can appreciate why a bank with thousands of calls a day would rather store transcriptions than audio and would rather use AI tools to search for issues compared to traditional search tools. Unfortunately, the bank was stymied by several common technical issues that stood in their way:

  • The bank needed a secure element that could sit in the middle of thousands of contact center calls and replicate all the call media streams so the streams could be sent to an AI engine.
  • Because the element is in the middle of these calls, it can’t ever fail, and it can’t degrade the calls. It also had to be extremely secure such that a third party couldn’t find a way to intercept the streams. Nor could it be compromised or overloaded using a DoS attack.
  • The telephone network uses a different media format than AI engines accept: Real-time Transport Protocol (RTP). The bank could not just send raw audio streams of all calls to an AI engine.
  • The bank wanted to use the real-time audio streams to execute multiple AI-based services at the same time. That means that they needed multiple copies of the real-time audio sent to different AI services simultaneously to enable different constituencies in the bank to analyze the data and use the results for their own purposes.

Because the bank could not overcome these issues, they were forced to record calls in another format, store them, and then send the recordings to an AI engine for analysis. Recording was not acceptable as it introduced two drawbacks:

  • The transcription and analysis are not real-time so there’s no way to leverage AI to react to issues happening right now. That dramatically reduces the value.
  • Recordings can reduce audio quality. As you all know, lower audio quality inherently reduces the transcription accuracy of an AI platform.
Ribbon’s AI Gateway integrates into the public phone network, converting phone conversations into linear audio streams so those streams can be sent to NVIDIA’s Jarvis platform where they are translated into text and used in real-time by data analytics tools.
Figure 1. Ribbon AI gateway concept.

Ribbon, working in collaboration with NVIDIA, created a solution. We used our extensive experience in managing telephone network audio and signaling and combined that with the NVIDIA Jarvis advanced conversational AI platform, powered by GPU technology.

Ribbon is well-known for its telecom network security software—session border controllers (SBCs) —that provides swire-speed packet inspection and media manipulation. We took that know-how and created a secure interface to the telecom network so that we could securely access and replicate thousands of high-quality streams of telephone audio from the bank’s contact center (or any telephony source).

In real-time, we convert those streams from RTP into AI-acceptable audio. The audio goes to Jarvis, to be transcribed in real-time. Line-of-business owners can then use that data for many different applications. The bank already has distinct use cases in mind but it’s obvious that developers could find thousands of use cases and the value could be applied across hundreds of different industries. Any organization that receives a high volume of calls is a potential target. Target applications include:

  • Regulatory compliance
  • Real-time security or fraud analysis
  • Real-time sentiment analysis
  • Real-time translation 

Figure 1 shows that the Ribbon AI gateway becomes a secure bridge between the telephone network and AI data analytics domain. After the audio moves into text, the breadth of potential applications grows exponentially. The ability to get that almost instantaneously expands the potential opportunities to use the data and value of that data.

Ribbon’s AI gateway architecture

Ribbon’s AI Gateway has multiple internal services including a SIP Call Control Interface to integrate with a Session Border Controller. It has a Media Relay Function that converts the audio. IT also has a REST Server and File System that manage rest requests for business applications and store data, respectively.
Figure 2. AI gateway architecture, interface view.

Figure 2 provides a view of the AI gateway components, how they connect to the contact center and integrate with the Jarvis AI engine.

In the diagram, the Ribbon SBC acts as a secure spigot that delivers thousands of high-quality streams of audio to the AI gateway, using standard telecoms protocols: Session Initiation Protocol (SIP) for call signaling and RTP with various codecs for call audio. Each call participant has a separate audio stream sent to the AI gateway, to ensure the quality of the audio.

The Media Relay component then converts each audio stream from RTP to AI-acceptable audio and delivers it to the AI engine for conversion to text or other Jarvis application functions. The text for each audio stream is then sent back to the AI gateway.

The AI gateway is controlled using REST APIs. This allows a business application to dynamically instruct and control how each call is handled. For example, a business application might be set up to focus on the quality of engagements for an organization’s premium customers. The application would match incoming caller ID to the premium customers’ phone number. When there is a match, those calls would be selected for transcription and real-time analysis. The same type of filtering could be used to look for new customers, customers in a certain geography, time of day, and so on. Alternatively, they could target all calls or only a percentage to sample calls, based on defined rules.

An application can instruct the AI gateway whether to use the Jarvis AI engine to convert a call’s audio to text from speech or use some other Jarvis AI function. It is even possible to instruct the AI gateway to perform different functions on the same audio stream.

Finally, the application can instruct the AI gateway to stream the AI data output either in real-time or at the end of the call. It can choose one or multiple destinations. The AI gateway can provide multiple AI streams from an individual audio call to different business functions, in parallel. Businesses often have siloed organizations that have distinct requirements. They want their own feed of data so that they can unilaterally act on it. This allows different departments—like compliance, operations, or security—to use the call data to address their own specific business needs.

To demonstrate the AI gateway capabilities, we deployed a single Amazon EC2 instance in AWS. For benchmarking performance, we deployed a separate test harness and drove hundreds of simultaneous voice calls at the AI gateway instance. Using the g4dn.2xlarge EC2 instance type running Jarvis EA2 ASR, with T4 GPU, we generated 220 simultaneous voice streams in 110 simultaneous calls. Each GPU provided an order of magnitude capacity improvement over CPUs.

The AI gateway can direct call traffic to multiple GPUs to scale well beyond 100 simultaneous calls, to support the thousands of concurrent calls that a large contact center would field.

Conclusion

The speech-to-text use case is only the beginning. By providing the ability to convert from text back to speech and inject this into the call path to the caller, the AI gateway can provide a basis for real-time conversational AI agents to engage directly with contact center customers.

This AI gateway capability opens literally thousands of potential application use cases that can be tailored to fit specific business verticals and go beyond the confines of the contact center environment.

If you are interested in learning more, look for our conference talk at the upcoming GTC session, Real-time Integration of Telephony Network with AI Speech-to-Text Translation or contact me directly at Ribbon.

Categories
Misc

Reducing Costs with One-Pass Reverse Time Migration

Reverse time migration (RTM) is a powerful seismic migration technique, providing geophysicists with the ability to create accurate 3D images of the subsurface. Steep dips? Complex salt structure? High velocity contrast? No problem. By splitting the upgoing and downgoing wavefields and combining them with an accurate velocity model, RTM can image even the most complex … Continued

Reverse time migration (RTM) is a powerful seismic migration technique, providing geophysicists with the ability to create accurate 3D images of the subsurface. Steep dips? Complex salt structure? High velocity contrast? No problem. By splitting the upgoing and downgoing wavefields and combining them with an accurate velocity model, RTM can image even the most complex geologic formations.

The algorithm migrates each shot independently using this basic workflow:

  1. Compute the downgoing wavefield.
  2. Reconstruct the upgoing wavefield and reverse it in time.
  3. Correlate up and down wavefields at each image point.
  4. Repeat for all shots and combine in a 3D image.

While simple in concept, the computational costs made RTM economically unviable until the early 2010s, when parallel processing with NVIDIA GPUs dramatically reduced the migration time and hardware footprint needed.

Reducing RTM costs by increasing computational efficiency

There are several factors driving computational requirements for tilted transversely isotropic (TTI) RTM. One is the calculation of first, second, and cross-derivatives along x, y, and z. Earlier versions of GPU, such as the Fermi and Kepler generations, had limited streaming multiprocessors (SMs), shared memory, and compiler technology.

Paulius Micikevicius famously overcame these issues by splitting the derivative calculations into two or three passes, with each pass computing a set of derivatives. This major breakthrough allowed seismic processors to run RTM in an economical and time-efficient manner. However, each pass requires a round-trip to memory. Each round-trip to memory hinders performance and drives up costs.

While multi-pass RTM was the best you could do in 2012, you can do much better today with the NVIDIA Volta or NVIDIA Ampere Architecture generations. If your RTM kernel hasn’t been tuned since the days of Paulius, you are leaving significant value on the table.

Moving to a one-pass RTM

A one-pass TTI RTM kernel reads the wavefield one time, computes all necessary derivatives, and writes the updated wavefields to global memory one time. By eliminating multiple read/write roundtrips to memory, this implementation dramatically increases the performance gained on GPUs. It also helps the algorithm scale linearly across multiple GPUs in a node. Figure 2 shows the performance and strong scaling gained by reducing the number of passes on V100, T4, and A100 GPUs.

For seismic processing in the cloud, T4 provides a particularly good price/performance solution. On-premises servers for seismic processing typically have four to eight V100 or A100 GPUs per node. For these configurations, reducing the number of passes from three to one improves RTM kernel performance by 78-98%!

Performance benchmarks for TTI RTM on A100s. The graph shows benchmarks for three, two, and one pass TTI RTM, and near linear scaling from one to eight GPUs in a node.
Figure 1. A100 performance on multi and single-pass TTI RTM with linear scaling.
Performance benchmarks for TTI RTM on T4s. The graph shows benchmarks for three, two, and one pass TTI RTM, and near linear scaling from one to eight GPUs in a node.
Figure 2. T4 performance on multi and single-pass TTI RTM with linear scaling.
Performance benchmarks for TTI RTM on T4s. The graph shows benchmarks for three, two, and one pass TTI RTM, and near linear scaling from one to eight GPUs in a node.
Figure 3. V100 performance on multi and single-pass RTM with linear scaling.

Conclusion

Reducing the number of passes in your RTM kernel can dramatically improve code performance and decrease costs. To make the development easier, NVIDIA has developed a collection of code examples showing how to implement a GPU-accelerated RTM using best practices. If we have an NDA in place for you, you can have free access to this code.

Of course, the number of passes in an RTM kernel is only one piece of the puzzle. There are several other tricks shown in the example code to further increase performance, such as compression.

If you’re interested in accessing the NVIDIA RTM implementation or want assistance in optimizing your code, please comment below.