Got a conflict with your 2pm appointment? Just spin up a quick assistant that takes good notes and, when your boss asks about you, even identifies itself and explains why you aren’t there. Nice fantasy? No, it’s one of many use cases a team of some 50 ninja programmers, AI experts and 20 beta testers …
Drawing on his trifecta of degrees in math, music and music technology, Tlacael Esparza, co-founder and CTO of Sunhouse, is revolutionizing electronic drumming. Esparza has created Sensory Percussion, a combination of hardware and software that uses sensors and AI to allow a single drum to produce a complex range of sounds depending on where and …
In the sea of virtual exhibitions that have popped up over the last year, the NVIDIA AI Art Gallery offers a fresh combination of incredible visual art, musical experiences and poetry, highlighting the narrative of an emerging art form based on AI technology. The online exhibit, part of NVIDIA’s GTC event, will feature …
Containers have quickly gained strong adoption in the software development and deployment process and have truly enabled us to manage software complexity. It is not surprising that, according to a recent Gartner report, more than 70% of global organizations will be running containerized applications in production by 2023. That’s up from less than 20% in 2019.
However, containers also bring security challenges to IT and security practitioners. Just as a shipping container can be a hiding place for contraband, a software container can hold contents that you are not fully aware of. That’s why it’s critical to have a comprehensive understanding of the contents of the containers that you deploy. Security can no longer be an afterthought for IT and security admins: security best practices must be adopted early in the software build process.
Today, there are numerous software marketplaces from which to pull a variety of containerized software tools to help you speed up software development. However, this speedup in the development process is counterproductive if the DevSecOps or IT team flags the software for security lapses, preventing deployment to production. This can lead to delays in production and, eventually, revenue loss.
To speed up development in a repeatable and automated way, the most common starting point is to download a publicly available image and build on top of it. Unknowingly, you might expose your new application code or service to vulnerabilities inherited from the base image. Some of the most common threats are images with unpatched vulnerabilities and images that mistakenly grant excessive privileges, which can lead to privilege escalation in production environments through exposed insecure ports, private keys, or secrets. Relying on software images from trusted sources, like NVIDIA NGC, can play a key role in accelerating the software development cycle.
When you layer your own application code on top of NGC base images, you only have to worry about the code layers that you add. Without a maintained base image, every time a CVE is identified in any layer, you must rebuild the image from scratch, which can take several hours. Using NGC images to build production applications or services helps you reduce time to deployment.
Container security at the core of the NGC catalog
The software from NGC provides a high level of security assurance required by enterprises. Curated containers on NGC can enable rapid application development with minimal investment as the NGC containers undergo performance regression testing, and functional and security checks ahead of a release.
The NGC container publication process has container image scanning by Anchore at its core. Image scanning refers to the process of analyzing the contents of a container image to detect security issues, vulnerabilities, or bad practices.
The NGC registry integrates security scanning as a SaaS offering, in which images are retrieved and scanned with the Anchore solution. The security scans include checks like the following:
Vulnerabilities, such as CVE mapping
Metadata scans such as Dockerfiles
Data or key leaks such as crypto keys
Open ports
The scanning policy for CVEs classifies severity as critical, high, medium, or low using the Common Vulnerability Scoring System (CVSS). Known CVEs are patched before an image is published to NGC. The scan results may vary over time because new CVEs are published each hour and may not be known at the time of publishing. The scan results allow publishers to identify red flags early in the development process. Anchore’s scanning technology has a high signal-to-noise ratio, which means fewer false positives and less development time spent on triage.
Figure 1 shows sample results for an image scanned in NGC, with two high-severity vulnerabilities found in OS packages. It also provides CVE links to detailed descriptions of the security threats and whether they were patched in upstream versions. Developers and security teams can analyze the risk further to triage it.
The most popular products, such as PyTorch, TensorFlow, Triton, TensorRT, MXNet, RAPIDS, CUDA, and the NVIDIA HPC SDK, update their NGC images on a monthly release cadence, so that the latest security patches are applied.
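As a rough illustration of the severity buckets, the following sketch maps a CVSS base score to a qualitative rating. It assumes the CVSS v3.x rating scale; the thresholds that the NGC scanning policy actually applies are not stated in this post, and `cvss_severity` is a name chosen here for illustration.

```cpp
#include <string>

// Map a CVSS v3.x base score (0.0 to 10.0) to its qualitative severity rating.
// The thresholds follow the CVSS v3.x rating scale; they are an assumption here,
// not a description of the NGC scanning policy.
std::string cvss_severity(double score)
{
    if (score < 0.0 || score > 10.0) return "invalid";
    if (score == 0.0) return "none";
    if (score < 4.0)  return "low";       // 0.1 to 3.9
    if (score < 7.0)  return "medium";    // 4.0 to 6.9
    if (score < 9.0)  return "high";      // 7.0 to 8.9
    return "critical";                    // 9.0 to 10.0
}
```

For example, `cvss_severity(7.5)` returns `"high"` and `cvss_severity(9.8)` returns `"critical"`.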
As software complexity increases with the need for additional capabilities, you rely on additional packages and software layers, which in turn increases security risks and exposures. Our security development practices aim for a minimal footprint: we provide thinner images in development and deployment flavors. For example, the CUDA base image is used to build applications but the CUDA runtime image is used for deployments. This leads to a smaller attack surface, where unused packages or debug tools are eliminated.
Thus, NGC aims to provide a strong foundation for enterprises by adopting security best practices such as scanning and other approaches. As upgrading, testing, and deploying gets easier with containers, you are encouraged to upgrade to the latest NGC image versions. This not only reduces security risks from recently found CVEs, but also allows you to get maximum performance delivered on NVIDIA GPUs.
Back in 2012, NVIDIAN Mark Harris wrote Six Ways to Saxpy, demonstrating how to perform the SAXPY operation on a GPU in multiple ways, using different languages and libraries. Since then, programming paradigms have evolved and so has the NVIDIA HPC SDK.
In this post, I demonstrate five ways to implement a simple SAXPY computation using NVIDIA GPUs. Why is this interesting? Because it demonstrates the breadth of options that you have today for programming NVIDIA GPUs. It also covers the four main approaches to GPU computing:
GPU-accelerated libraries
Compiler directives
Standard language parallelism
GPU programming languages
SAXPY stands for Single-Precision A·X Plus Y, a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like the following:
void saxpy_cpu(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
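To make the definition concrete, here is a minimal worked sketch of the same operation on `std::vector`. The name `saxpy_reference` is introduced here for illustration and is not part of the original post.

```cpp
#include <cstddef>
#include <vector>

// Reference SAXPY on std::vector: y[i] = a * x[i] + y[i] for every element.
// saxpy_reference is an illustrative name, not one used by the original post.
void saxpy_reference(float a, const std::vector<float>& x, std::vector<float>& y)
{
    // Only touch indices that exist in both vectors.
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With `a = 2`, `x = {1, 2, 3}`, and `y = {10, 20, 30}`, the call leaves `y` equal to `{12, 24, 36}`.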
Given this basic example code, I can now show you five ways to SAXPY on GPUs. I chose SAXPY because it is a short and simple code, but it shows enough of the syntax of each programming approach to compare them. Because it does relatively little computation, SAXPY isn’t that useful for demonstrating the difference in performance between the different programming models, but that’s not my intent here. My goal is to demonstrate multiple ways to program on the NVIDIA platform today, rather than to recommend one over another. That would require taking other factors into account and is beyond the scope of this post.
I discuss implementations of SAXPY in the following models:
CUDA C++: A C++ language extension to support the CUDA programming model and allow C++ code to be executed on NVIDIA GPUs.
cuBLAS: A GPU-accelerated implementation of the basic linear algebra subroutines (BLAS) optimized for NVIDIA GPUs.
OpenACC: Using compiler directives to tell the compiler that a given portion of the code can be parallelized and letting the compiler figure out how to do it.
Standard C++: Using the NVC++ compiler and the parallel execution policies added to the standard library in C++17.
Thrust: A parallel algorithms library that resembles the C++ Standard Template Library (STL).
After going through all the implementations, I show what performance looks like when SAXPY is accelerated through these approaches.
CUDA C++ SAXPY
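__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    unsigned int t_id = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    for (int i = t_id; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}
// Host code: dev_x and dev_y are device vectors allocated with cudaMalloc.
// num_blocks and threads_per_block are placeholder names for the launch configuration.
saxpy_cuda<<<num_blocks, threads_per_block>>>(n, 2.0, dev_x, dev_y);
cudaDeviceSynchronize();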
CUDA C++ is a GPU programming language that provides extensions to the C/C++ language for expressing parallel computation. Device functions, called kernels, are declared with the __global__ specifier to show that they can be called either from host code or device code. Device memory to hold the float vector is allocated using cudaMalloc. Then, the kernel defined is called with an execution configuration:
<<<num_blocks, threads_per_block>>>
Each thread launched executes the kernel, using built-in variables like threadIdx, blockDim, and blockIdx. The variables are assigned by the device for each thread and block and are used to calculate the index of the elements in the vector for which it is responsible. In doing so, each thread does the multiply-add operation on a limited number of elements of the vector. In the case where the number of threads is less than the size of the vector, each thread computes a stride to operate on multiple elements so that the entire vector is taken care of (Figure 1).
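The striding behavior can be checked on the CPU. The sketch below emulates the loop shape of a grid-stride kernel in plain C++; it is an illustration only, with `total_threads` standing in for `blockDim.x * gridDim.x` and `grid_stride_coverage` being a name invented here.

```cpp
#include <cstddef>
#include <vector>

// CPU emulation of a CUDA grid-stride loop, for illustration only.
// total_threads stands in for blockDim.x * gridDim.x and thread_id for the
// global thread index. Returns how many times each element was visited.
std::vector<int> grid_stride_coverage(std::size_t n, std::size_t total_threads)
{
    std::vector<int> visits(n, 0);
    for (std::size_t thread_id = 0; thread_id < total_threads; ++thread_id)
        for (std::size_t i = thread_id; i < n; i += total_threads)  // same loop shape as the kernel
            ++visits[i];
    return visits;
}
```

With 10 elements and 4 emulated threads, thread 1 handles indices 1, 5, and 9, and every element is visited exactly once, which is what lets a launch with fewer threads than elements still cover the whole vector.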
cuBLAS SAXPY
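cublasHandle_t handle;
cublasCreate(&handle);
unsigned int n = 1UL << 20;  // shift amount was truncated in the source; 20 is a placeholder
float alpha = 2.0f;
cublasSaxpy(handle, n, &alpha, dev_x, 1, dev_y, 1);  // dev_x, dev_y: device vectors from cudaMalloc
cublasDestroy(handle);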
SAXPY, being a BLAS operation, has an implementation in the NVIDIA cuBLAS library. It involves initializing a cuBLAS library context by passing a handle to cublasCreate, allocating memory for the vectors, and then calling the library function cublasSaxpy while passing in the vector and scalar values. Finally, cublasDestroy and cudaFree are used to release the resources associated with the cuBLAS library context and device memory allocated for the vectors, respectively.
OpenACC C++ SAXPY
void saxpy(int n, float a, float *restrict x, float *restrict y)
{
#pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
OpenACC is a directive-based programming model that uses compiler directives through #pragma to tell the compiler that a portion of the code can be parallelized. The compiler then analyzes the instruction and automatically generates code for the GPU. OpenACC provides options for fine-tuning launch configurations in those instances where the automatically generated code may not be optimal.
Compilers with support for NVIDIA GPUs, like nvc++, can offload computation to the GPU and use unified memory to move data between the host and device. Adding #pragma acc kernels tells the compiler to generate a kernel for the for loop that follows. Because x and y are allocated on the host with malloc, the compiler uses unified memory to move the vectors to the device before the computation and back to the host afterward, and it generates a fused multiply-add for each element.
std::par C++ SAXPY
void saxpy(int N, float a, float *restrict x, float *restrict y)
{
std::transform(std::execution::par_unseq, x, x + N, y, y,[=](float xi, float yi) { return a * xi + yi; });
}
float alpha = 2.0;
unsigned int n = 1UL << 20;  // shift amount was truncated in the source; 20 is a placeholder
With the NVIDIA NVC++ compiler, you can use GPU acceleration in standard C++ with no language extensions, pragmas, directives, or libraries, other than the C++ standard library. The code, being standard C++, is portable to other compilers and systems and can be accelerated on NVIDIA GPUs or multicore CPUs using NVC++.
With the parallel algorithms introduced in C++17, standard library algorithms such as std::transform and std::reduce accept an execution policy as their first parameter. You can thus pass std::execution::par_unseq to std::transform, together with a lambda that captures by value and performs the SAXPY operation. When you compile with the -stdpar command line option, the compiler builds standard algorithms that are called with a parallel execution policy for execution on NVIDIA GPUs.
Thrust SAXPY
Thrust is a parallel algorithms library that resembles the C++ Standard Template Library (STL). It provides parallel building blocks to develop fast, portable algorithms. Interoperability with established technologies like CUDA and TBB, along with its modular design, allows you to focus on the algorithms instead of the platform-specific implementations.
In the Thrust version, you allocate memory for x and y on the device, in this case the NVIDIA GPU. You then use the fill function to initialize them. Finally, you use the Thrust transform algorithm along with a user-defined functor, saxpy_functor, to apply the y=a*x+y operation to each element of x and y.
SAXPY performance
While SAXPY is a bandwidth-bound operation and not computationally complex, its highly parallel nature means that it still benefits from GPU acceleration if the problem size is large enough. When compared to a dual socket AMD EPYC 7742 system with 128 cores and 256 threads, an NVIDIA A100 GPU was 23x faster, executing more than 3000 SAXPY operations in the time that the CPU took to do 140. Furthermore, all the GPU-accelerated implementations gave a similar performance, with cuBLAS edging out the rest by a slight margin (Figure 2).
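A back-of-the-envelope sketch shows why SAXPY is bandwidth-bound. It assumes 4-byte floats and that each element is read and written exactly once with no cache reuse; `SaxpyCost`, `saxpy_cost`, and `saxpy_intensity` are names introduced here for illustration.

```cpp
#include <cstddef>

// Memory traffic and arithmetic for one SAXPY over n single-precision elements.
struct SaxpyCost {
    double bytes_moved;  // read x[i], read y[i], write y[i]: three floats per element
    double flops;        // one multiply and one add per element
};

SaxpyCost saxpy_cost(std::size_t n)
{
    SaxpyCost c;
    c.bytes_moved = 3.0 * sizeof(float) * static_cast<double>(n);
    c.flops = 2.0 * static_cast<double>(n);
    return c;
}

// Arithmetic intensity in FLOPs per byte; the same for every n > 0.
double saxpy_intensity(std::size_t n)
{
    const SaxpyCost c = saxpy_cost(n);
    return c.flops / c.bytes_moved;
}
```

Under these assumptions the intensity is 2 FLOPs per 12 bytes, or about 0.17 FLOPs per byte, so run time is governed by how fast memory can be streamed rather than by arithmetic throughput.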
Accelerating your code with NVIDIA GPUs
The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools enabling you to choose the programming model that works best for you and still get excellent performance by accelerating your code using NVIDIA GPUs.
Protobuf v3.15 error: google.protobuf.message.DecodeError when loading a TensorFlow model into memory using tf.Graph(). After converting the tf.Graph() snippet to TensorFlow v2, I get the same error.
I have tried protobuf 3.12.4 (the same version as on Colab) and the same error appeared.
Traceback (most recent call last):
  File "object_detection/webcam.py", line 25, in <module>
    od_graph_def.ParseFromString(serialized_graph)
google.protobuf.message.DecodeError: Error parsing message
[ WARN:0] global C:\projects\opencv-python\opencv\modules\videoio\src\cap_msmf.cpp (674) SourceReaderCB::~SourceReaderCB terminating async callback
I have reinstalled different protobuf versions and I still get the same error.
I have trained a “SSD MobileNet” model using TensorFlow version 1.14 CPU for Webcam Object-detection with OpenCV. After installing required libraries of TensorFlow, I run model_builder_tf1.py and it successfully passed all 21 tests.
Snippet to load the TensorFlow model into memory using tf.Graph():
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.compat.v1.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
sess = tf.compat.v1.Session(graph=detection_graph)
Note that TensorFlow 1.14 is installed in a conda environment.
Using protobuf==3.8, a different error appeared:
AttributeError: module 'google.protobuf.descriptor' has no attribute '_internal_create_key'
Can someone please suggest a solution to this problem?
This post was originally published on the Mellanox blog.
Network File System (NFS) is a ubiquitous component of most modern clusters. It was initially designed as a work-group filesystem, making a central file store available to and shared among several client servers. As NFS became more popular, it was used for mission-critical applications, which required access to storage. Next, migration to higher performing networks was implemented to improve client-to-NFS communications. In addition to higher networking speeds (today 100 GbE and soon 200 GbE), the industry has been looking for technologies that offload stateless networking functions that run on the CPU to the IO subsystems. This leaves more CPU cycles free to run business applications and maximizes the data center efficiency.
One of the more popular networking offload technologies is RDMA (Remote Direct Memory Access). RDMA makes data transfers more efficient and enables fast data movement between servers and storage without involving their CPUs. Throughput is increased, latency is reduced, and CPU power is freed up for the applications. RDMA technology is already widely used for efficient data transfer in render farms and large cloud deployments, including the following:
Microsoft Azure
HPC solutions (including machine learning and deep learning)
iSER and NVMe-oF-based storage
Mission-critical SQL database solutions such as Oracle RAC (Exadata)
IBM DB2 pureScale
Microsoft SQL solutions and Teradata
Figure 1 shows why IT managers have been deploying RoCE (RDMA over Converged Ethernet). RoCE uses advances in Ethernet to enable more efficient RDMA over Ethernet and enables widespread deployment of RDMA technologies in mainstream data center applications.
RDMA-enabled networking solutions are increasingly deployed in public and private clouds, for example RoCE, which enables running RDMA over Ethernet. Together with the recent NFS protocol extensions, this enables NFS communication over RoCE. For more information, see the Open Source NFS/RDMA Roadmap presentation given at the OpenFabrics Workshop in 2017 by Chuck Lever, an upstream Linux contributor and Linux kernel architect at Oracle. For more information about how to run NFS over RoCE, see How to Configure NFS over RDMA (RoCE).
To evaluate the boost that RoCE enables compared to TCP, we ran the IOzone test and measured the read/write IOPS and throughput of multi-threaded read and write tests. The tests were performed on a single client against a Linux NFS server using tmpfs, so that storage latency was removed from the picture and transport behavior was exposed.
Client server: Intel Core i5-3450S CPU @ 2.80 GHz (one socket, four cores, HT disabled); 16 GB RAM (1333 MHz DDR3, non-ECC); HCA: NVIDIA Mellanox ConnectX-5 100 GbE NIC (SW version 16.20.1010) plugged into a PCIe 3.0 x16 slot.
NFS server: Intel Xeon CPU E5-1620 v4 @ 3.50 GHz (one socket, four cores, HT disabled); 64 GB RAM (2400 MHz DDR4); HCA: ConnectX-5 100 GbE NIC (SW version 16.20.1010) plugged into a PCIe 3.0 x16 slot.
The client and the NFS server were connected over a single 100-GbE NVIDIA Mellanox LinkX copper cable to the NVIDIA Mellanox Spectrum switch using the SN2700 model with its 32 x 100-GbE ports, which is the lowest latency Ethernet switch available in the market today. This makes it ideal for running latency-sensitive applications over Ethernet.
The following charts show the bandwidth and IOPS measured for performance over RoCE vs. TCP, running the IOzone test.
Conclusion
Running NFS over RDMA-enabled networks—such as RoCE, which offloads the CPU from performing the data communication job—generates a significant performance boost. As a result, Mellanox expects that NFS over RoCE will eventually replace NFS over TCP and become the leading transport technology in data centers.
Many of you may not recognize my company, Ribbon Communications. We are best known for building and securing large telecom networks for communication service providers (also known as phone companies). However, there’s a good chance that in the next day or two, you’ll place a call that traverses a piece of our gear somewhere in the world. In addition to service providers, we have substantial practice working with large enterprises, the kinds of organizations that need carrier-grade services, either because of their size or the critical nature of their communications. That includes universities, healthcare institutions, financial services, government agencies, and so on.
A short while ago, one of our customers, one of the largest investment banks in the world, approached Ribbon with a problem. They wanted to use advanced AI to analyze their contact center calls, in real-time, so that they could make immediate business decisions based on AI-based observations. They wanted to be able to ingest the audio stream, immediately transcribe it into text, and then also immediately analyze the text to look for issues such as customer satisfaction, threatening behavior, and fraud attempts. The sooner the text was transcribed, the easier it would be to store and search. Our customer could also use it for other forms of trend analysis that could spot upcoming issues, for example, customer sentiment with a certain agent.
Anyone that has ever tried to search a recording can appreciate why a bank with thousands of calls a day would rather store transcriptions than audio and would rather use AI tools to search for issues compared to traditional search tools. Unfortunately, the bank was stymied by several common technical issues that stood in their way:
The bank needed a secure element that could sit in the middle of thousands of contact center calls and replicate all the call media streams so the streams could be sent to an AI engine.
Because the element is in the middle of these calls, it can’t ever fail, and it can’t degrade the calls. It also had to be extremely secure such that a third party couldn’t find a way to intercept the streams. Nor could it be compromised or overloaded using a DoS attack.
The telephone network uses a different media format than AI engines accept: Real-time Transport Protocol (RTP). The bank could not just send raw audio streams of all calls to an AI engine.
The bank wanted to use the real-time audio streams to execute multiple AI-based services at the same time. That means that they needed multiple copies of the real-time audio sent to different AI services simultaneously to enable different constituencies in the bank to analyze the data and use the results for their own purposes.
Because the bank could not overcome these issues, they were forced to record calls in another format, store them, and then send the recordings to an AI engine for analysis. Recording was not acceptable as it introduced two drawbacks:
The transcription and analysis are not real-time so there’s no way to leverage AI to react to issues happening right now. That dramatically reduces the value.
Recordings can reduce audio quality. As you all know, lower audio quality inherently reduces the transcription accuracy of an AI platform.
Ribbon, working in collaboration with NVIDIA, created a solution. We used our extensive experience in managing telephone network audio and signaling and combined that with the NVIDIA Jarvis advanced conversational AI platform, powered by GPU technology.
Ribbon is well-known for its telecom network security software, session border controllers (SBCs), which provide wire-speed packet inspection and media manipulation. We took that know-how and created a secure interface to the telecom network so that we could securely access and replicate thousands of high-quality streams of telephone audio from the bank’s contact center (or any telephony source).
In real-time, we convert those streams from RTP into AI-acceptable audio. The audio goes to Jarvis, to be transcribed in real-time. Line-of-business owners can then use that data for many different applications. The bank already has distinct use cases in mind but it’s obvious that developers could find thousands of use cases and the value could be applied across hundreds of different industries. Any organization that receives a high volume of calls is a potential target. Target applications include:
Regulatory compliance
Real-time security or fraud analysis
Real-time sentiment analysis
Real-time translation
Figure 1 shows that the Ribbon AI gateway becomes a secure bridge between the telephone network and AI data analytics domain. After the audio moves into text, the breadth of potential applications grows exponentially. The ability to get that almost instantaneously expands the potential opportunities to use the data and value of that data.
Ribbon’s AI gateway architecture
Figure 2 provides a view of the AI gateway components, how they connect to the contact center and integrate with the Jarvis AI engine.
In the diagram, the Ribbon SBC acts as a secure spigot that delivers thousands of high-quality streams of audio to the AI gateway, using standard telecoms protocols: Session Initiation Protocol (SIP) for call signaling and RTP with various codecs for call audio. Each call participant has a separate audio stream sent to the AI gateway, to ensure the quality of the audio.
The Media Relay component then converts each audio stream from RTP to AI-acceptable audio and delivers it to the AI engine for conversion to text or other Jarvis application functions. The text for each audio stream is then sent back to the AI gateway.
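The first step in any RTP-to-audio conversion is locating the codec payload inside each packet. The sketch below parses the fixed RTP header defined in RFC 3550 and reports where the payload starts. It is an illustration of the packet format only, not Ribbon's Media Relay implementation; `RtpPacketView`, `read_be32`, and `parse_rtp` are names introduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fields of the fixed RTP header (RFC 3550) plus the location of the audio payload.
struct RtpPacketView {
    std::uint8_t  payload_type;     // 7-bit codec identifier, e.g. 0 = PCMU (G.711 mu-law)
    std::uint16_t sequence_number;
    std::uint32_t timestamp;
    std::uint32_t ssrc;
    std::size_t   payload_offset;   // index of the first payload byte
    std::size_t   payload_size;     // payload length in bytes, padding removed
};

// Read a 32-bit big-endian (network order) value starting at index i.
static std::uint32_t read_be32(const std::vector<std::uint8_t>& b, std::size_t i)
{
    return (static_cast<std::uint32_t>(b[i]) << 24) | (static_cast<std::uint32_t>(b[i + 1]) << 16) |
           (static_cast<std::uint32_t>(b[i + 2]) << 8) | static_cast<std::uint32_t>(b[i + 3]);
}

// Parse one RTP packet into out; returns false if the buffer is malformed.
bool parse_rtp(const std::vector<std::uint8_t>& pkt, RtpPacketView& out)
{
    if (pkt.size() < 12 || (pkt[0] >> 6) != 2) return false;   // fixed header, version 2
    std::size_t offset = 12 + 4 * static_cast<std::size_t>(pkt[0] & 0x0F);  // skip CSRC list
    if ((pkt[0] & 0x10) != 0) {                                 // X bit: header extension present
        if (pkt.size() < offset + 4) return false;
        const std::size_t words = (static_cast<std::size_t>(pkt[offset + 2]) << 8) | pkt[offset + 3];
        offset += 4 + 4 * words;
    }
    std::size_t end = pkt.size();
    if (offset > end) return false;
    if ((pkt[0] & 0x20) != 0) {                                 // P bit: trailing padding present
        const std::size_t pad = pkt[end - 1];                   // last byte counts the padding bytes
        if (pad == 0 || pad > end - offset) return false;
        end -= pad;
    }
    out.payload_type    = pkt[1] & 0x7F;
    out.sequence_number = static_cast<std::uint16_t>((pkt[2] << 8) | pkt[3]);
    out.timestamp       = read_be32(pkt, 4);
    out.ssrc            = read_be32(pkt, 8);
    out.payload_offset  = offset;
    out.payload_size    = end - offset;
    return true;
}
```

The bytes from `payload_offset` onward would then be decoded according to `payload_type` into whatever sample format the AI engine expects; that decoding step depends on the codec and is not shown.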
The AI gateway is controlled using REST APIs. This allows a business application to dynamically instruct and control how each call is handled. For example, a business application might be set up to focus on the quality of engagements for an organization’s premium customers. The application would match incoming caller ID to the premium customers’ phone number. When there is a match, those calls would be selected for transcription and real-time analysis. The same type of filtering could be used to look for new customers, customers in a certain geography, time of day, and so on. Alternatively, they could target all calls or only a percentage to sample calls, based on defined rules.
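The selection logic described above could look something like the following sketch. It is purely hypothetical: `CallSelector`, its fields, and `should_transcribe` are invented for illustration and do not come from Ribbon's REST API, which is not documented in this post.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical selection rule that a business application might evaluate before
// asking the gateway to transcribe a call. None of these names are Ribbon's.
struct CallSelector {
    std::unordered_set<std::string> premium_numbers;  // caller IDs of premium customers
    unsigned sample_percent;                          // share of other calls to sample, 0 to 100

    // call_index is a running counter of incoming calls, used for deterministic sampling.
    bool should_transcribe(const std::string& caller_id, unsigned long call_index) const
    {
        if (premium_numbers.count(caller_id) != 0) return true;  // always take premium callers
        return (call_index % 100) < sample_percent;              // sample the remaining calls
    }
};
```

A real application would send the resulting decision to the gateway over its REST interface; the counter-based sampling is just one simple way to select a fixed percentage of calls.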
An application can instruct the AI gateway whether to use the Jarvis AI engine to convert a call’s audio to text from speech or use some other Jarvis AI function. It is even possible to instruct the AI gateway to perform different functions on the same audio stream.
Finally, the application can instruct the AI gateway to stream the AI data output either in real-time or at the end of the call. It can choose one or multiple destinations. The AI gateway can provide multiple AI streams from an individual audio call to different business functions, in parallel. Businesses often have siloed organizations that have distinct requirements. They want their own feed of data so that they can unilaterally act on it. This allows different departments—like compliance, operations, or security—to use the call data to address their own specific business needs.
To demonstrate the AI gateway capabilities, we deployed a single Amazon EC2 instance in AWS. For benchmarking performance, we deployed a separate test harness and drove hundreds of simultaneous voice calls at the AI gateway instance. Using the g4dn.2xlarge EC2 instance type running Jarvis EA2 ASR, with T4 GPU, we generated 220 simultaneous voice streams in 110 simultaneous calls. Each GPU provided an order of magnitude capacity improvement over CPUs.
The AI gateway can direct call traffic to multiple GPUs to scale well beyond 100 simultaneous calls, to support the thousands of concurrent calls that a large contact center would field.
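For rough capacity planning, the single-GPU figure can be extrapolated as in the sketch below. It assumes capacity scales linearly with the number of GPUs, which the benchmark above does not establish, and `gpus_needed` is a name introduced here.

```cpp
// GPUs needed for a target number of concurrent calls, assuming capacity scales
// linearly from the single-T4 measurement above (110 calls, two streams per call).
// Linear scaling is an assumption made for this estimate, not a measured result.
unsigned gpus_needed(unsigned concurrent_calls, unsigned calls_per_gpu = 110)
{
    if (calls_per_gpu == 0) return 0;  // guard against division by zero
    return (concurrent_calls + calls_per_gpu - 1) / calls_per_gpu;  // ceiling division
}
```

Under that assumption, a contact center with 5,000 concurrent calls would need `gpus_needed(5000)`, which is 46 T4-class GPUs.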
Conclusion
The speech-to-text use case is only the beginning. By providing the ability to convert from text back to speech and inject this into the call path to the caller, the AI gateway can provide a basis for real-time conversational AI agents to engage directly with contact center customers.
This AI gateway capability opens literally thousands of potential application use cases that can be tailored to fit specific business verticals and go beyond the confines of the contact center environment.
Reverse time migration (RTM) is a powerful seismic migration technique, providing geophysicists with the ability to create accurate 3D images of the subsurface. Steep dips? Complex salt structure? High velocity contrast? No problem. By splitting the upgoing and downgoing wavefields and combining them with an accurate velocity model, RTM can image even the most complex geologic formations.
The algorithm migrates each shot independently using this basic workflow:
Compute the downgoing wavefield.
Reconstruct the upgoing wavefield and reverse it in time.
Correlate up and down wavefields at each image point.
Repeat for all shots and combine in a 3D image.
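Steps 3 and 4 can be sketched with a zero-lag cross-correlation imaging condition, a common choice that this post does not name explicitly, so treat it as an assumption. The example uses a 1D image for brevity, and `Wavefield` and `correlate_shot` are names introduced here.

```cpp
#include <cstddef>
#include <vector>

// Wavefield snapshots indexed as field[t][x]: time step t, image point x (1D for brevity).
using Wavefield = std::vector<std::vector<float>>;

// Zero-lag cross-correlation imaging condition for one shot:
//   image[x] += sum over t of down[t][x] * up[t][x]
// Both wavefields are assumed to be aligned on the same time axis already.
void correlate_shot(const Wavefield& down, const Wavefield& up, std::vector<float>& image)
{
    const std::size_t nt = down.size() < up.size() ? down.size() : up.size();
    for (std::size_t t = 0; t < nt; ++t)
        for (std::size_t x = 0; x < image.size(); ++x)
            if (x < down[t].size() && x < up[t].size())
                image[x] += down[t][x] * up[t][x];
}
```

Calling `correlate_shot` once per shot with the same `image` accumulates the contributions of all shots, which corresponds to the final step of the workflow.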
While simple in concept, the computational costs made RTM economically unviable until the early 2010s, when parallel processing with NVIDIA GPUs dramatically reduced the migration time and hardware footprint needed.
Reducing RTM costs by increasing computational efficiency
There are several factors driving computational requirements for tilted transversely isotropic (TTI) RTM. One is the calculation of first, second, and cross-derivatives along x, y, and z. Earlier GPU generations, such as Fermi and Kepler, had a limited number of streaming multiprocessors (SMs), limited shared memory, and less mature compiler technology.
Paulius Micikevicius famously overcame these issues by splitting the derivative calculations into two or three passes, with each pass computing a set of derivatives. This major breakthrough allowed seismic processors to run RTM in an economical and time-efficient manner. However, each pass requires a round-trip to memory. Each round-trip to memory hinders performance and drives up costs.
While multi-pass RTM was the best you could do in 2012, you can do much better today with the NVIDIA Volta or NVIDIA Ampere Architecture generations. If your RTM kernel hasn’t been tuned since the days of Paulius, you are leaving significant value on the table.
Moving to a one-pass RTM
A one-pass TTI RTM kernel reads the wavefield one time, computes all necessary derivatives, and writes the updated wavefields to global memory one time. By eliminating multiple read/write roundtrips to memory, this implementation dramatically increases the performance gained on GPUs. It also helps the algorithm scale linearly across multiple GPUs in a node. Figure 2 shows the performance and strong scaling gained by reducing the number of passes on V100, T4, and A100 GPUs.
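The difference between the two approaches can be illustrated with a toy 1D stencil on the CPU. This is not a TTI RTM kernel: it only contrasts three passes that each stream arrays through memory with one fused pass that reads the input once and writes the output once. The function names and the unit grid spacing are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Three-pass version: each pass streams whole arrays through memory.
std::vector<float> update_three_pass(const std::vector<float>& u)
{
    const std::size_t n = u.size();
    std::vector<float> d1(n, 0.0f), d2(n, 0.0f), out(n, 0.0f);
    for (std::size_t i = 1; i + 1 < n; ++i)        // pass 1: first derivative
        d1[i] = 0.5f * (u[i + 1] - u[i - 1]);
    for (std::size_t i = 1; i + 1 < n; ++i)        // pass 2: second derivative
        d2[i] = u[i + 1] - 2.0f * u[i] + u[i - 1];
    for (std::size_t i = 0; i < n; ++i)            // pass 3: combine the temporaries
        out[i] = d1[i] + d2[i];
    return out;
}

// One-pass version: read u once, keep both derivatives in local variables, write once.
std::vector<float> update_one_pass(const std::vector<float>& u)
{
    const std::size_t n = u.size();
    std::vector<float> out(n, 0.0f);
    for (std::size_t i = 1; i + 1 < n; ++i) {
        const float d1 = 0.5f * (u[i + 1] - u[i - 1]);
        const float d2 = u[i + 1] - 2.0f * u[i] + u[i - 1];
        out[i] = d1 + d2;
    }
    return out;
}
```

Both functions produce the same result, for example `{0, 4, 6, 8, 0}` for the input `{0, 1, 4, 9, 16}`, but the fused version never writes or re-reads the temporary arrays `d1` and `d2`. On a GPU, those avoided round trips to global memory are where the savings come from.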
For seismic processing in the cloud, T4 provides a particularly good price/performance solution. On-premises servers for seismic processing typically have four to eight V100 or A100 GPUs per node. For these configurations, reducing the number of passes from three to one improves RTM kernel performance by 78-98%!
Conclusion
Reducing the number of passes in your RTM kernel can dramatically improve code performance and decrease costs. To make the development easier, NVIDIA has developed a collection of code examples showing how to implement a GPU-accelerated RTM using best practices. If we have an NDA in place for you, you can have free access to this code.
Of course, the number of passes in an RTM kernel is only one piece of the puzzle. There are several other tricks shown in the example code to further increase performance, such as compression.
If you’re interested in accessing the NVIDIA RTM implementation or want assistance in optimizing your code, please comment below.
This tutorial is the seventh installment of introductions to the RAPIDS ecosystem. The series explores and discusses various aspects of RAPIDS that allow its users to solve ETL (Extract, Transform, Load) problems, build ML (Machine Learning) and DL (Deep Learning) models, explore expansive graphs, process geospatial, signal, and system log data, or use SQL via BlazingSQL to process data.
In the age of the Internet, abundant IoT devices, social media, web servers, and more, data flows at incredible speeds. In 2019, Forbes reported that every minute, Americans use approximately 4.4PB of internet data, which converts to roughly 1MB of data per Internet user per minute.
Not only is the volume of data increasing over time, but so are the speeds at which data arrives. Over the years, we went from dial-up modem connections with speeds up to 56kbit/s in the 1990s to contemporary 10Gbit networks, which are starting to gain some popularity. 1Gbit networks are still the most common way to interconnect devices at home and in the office, unless you are on a WiFi network.
Many of the Internet services offered these days rely on prompt and fast processing of this constant waterfall of data. cuStreamz is one of the newer additions to the RAPIDS stack. It aims to take the streaming data processing historically done on the CPU and accelerate it on the GPU. Thanks to the immense parallelism of GPUs, processing streaming data has become much faster, with a friendly Python interface.
In the previous posts we showcased other areas:
In the first post, python pandas tutorial, we introduced cuDF, the RAPIDS DataFrame framework for processing large amounts of data on an NVIDIA GPU.
In the sixth post, the use of RAPIDS cuGraph, we introduced a GPU framework for processing and analyzing cyber logs.
Today, we talk about cuStreamz – a library that uses GPUs to process streaming data. To help you get familiar with cuStreamz, we also published a cheat sheet that can be downloaded here: cuStreamz cheatsheet, and an interactive notebook with all the current functionality of cuStreamz showcased here.
Streaming frameworks
First released in 2011, Apache Kafka has quickly become a standard for managing vast quantities of fast-moving data with low latency and high-level APIs. Kafka is a distributed platform that maintains a list of topics that systems can subscribe to (the so-called consumers) and publish their data onto (the producers). Data in Kafka, as in many other distributed systems, is replicated among multiple workers (or brokers): if any of the brokers disconnects from the cluster, or otherwise dies, the data is not lost and is still available from other brokers. This improves the resiliency and availability that today’s Internet service companies require.
Streamz is a Python framework that focuses on processing high-velocity data and allows for branching, joining, controlling the flow of messages, and sinking the data to disk or other streams. Here’s what a set of distinct pipelines might look like:
A pipeline can split into multiple branches. The popular Lambda architecture, for example, uses two branches: one to process fast-moving, near real-time data, and another to provide batch processing.
RAPIDS cuStreamz builds on top of the streamz framework and allows the messages to be batched into cuDF DataFrames instead of text messages. This, on its own, enables significant speed-ups when processing messages that conform to the same schema by tapping into the power of GPUs. Also, if the data volume cannot fit on a single machine, cuStreamz supports distributing the processing with Dask-cuDF DataFrames.
Setting up locally
It is easy to get started. In this section, we will show you how to set up your own mini-Kafka cluster using Docker. To use cuStreamz, you will, of course, need an NVIDIA GPU with Pascal architecture (GTX 1000-series) or newer as required by RAPIDS.
Next, let’s set up our Kafka cluster. If you clone the GitHub repository, inside the cheatsheets/cuStreamz folder navigate to the Kafka folder and open the docker-compose.yaml file.
Docker Compose uses YAML configuration files to set up the whole cluster. The first service we start is Zookeeper. Zookeeper is a service used to track naming and configuration data for Kafka; it maintains information about the status of the cluster nodes and their topics, partitions, replication, and so on. In addition, the Zookeeper service allows multiple clients to carry out concurrent reads and writes to keep up with the volume and velocity of the incoming and outgoing data calls.
In this example, we use the cp-zookeeper:5.4.3 image from Confluent to start our Zookeeper service; the server started will be named zookeeper. The Zookeeper service can be replicated among multiple servers to make it resilient; the Zookeeper servers talk to each other on port 2888, and leader election uses port 3888. Clients that want to use Zookeeper connect to the service on port 2181, and that port gets forwarded to the host via the ports setting. We also map some host folders to the container so the data that Zookeeper stores is persisted.
Next, we start two Kafka worker nodes (one shown here for brevity).
The cp-kafka image comes from Confluent’s Docker Hub repository; here, we also use version 5.4.3. There are plenty of environment variables, but let’s review just the most important from our point of view:
KAFKA_LISTENERS identifies a list of server names and ports the server will be listening on. Note that the external and internal ports are different: to facilitate communication between multiple Docker containers, the server will be placed on the Docker internal network (in our case kafka_kafka) and the kafka0 server will be listening on port 19092. If you would like to connect to this service from the host, you can use localhost and port 9092. The same list is provided in the KAFKA_ADVERTISED_LISTENERS environment variable.
KAFKA_INTER_BROKER_LISTENER_NAME tells Kafka which listener name to use for internal communication between brokers: in our case, this is LISTENER_DOCKER_INTERNAL, but any recognizable name should work. Should you change this name, however, you will also have to change it in KAFKA_LISTENERS and KAFKA_ADVERTISED_LISTENERS.
KAFKA_ZOOKEEPER_CONNECT specifies the address of the Zookeeper service to connect to; in our case, that is zookeeper:2181.
KAFKA_BROKER_ID is a unique identifier of the Kafka node and by convention should be included in the name of the service and the server name.
We also identify the zookeeper as a service this container depends on.
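To make the relationship between these variables concrete, here is a small Python sketch that assembles them for a given broker. It only mirrors the description above and is not the repository's docker-compose.yaml: the helper name kafka_broker_env and the external listener name LISTENER_DOCKER_EXTERNAL are assumptions, and a real compose file needs further settings that are not covered here.

```python
def kafka_broker_env(broker_id, zookeeper="zookeeper:2181",
                     internal_port=19092, external_port=9092):
    """Build the environment variables discussed above for one broker.

    LISTENER_DOCKER_EXTERNAL is an assumed name for the host-facing
    listener; only LISTENER_DOCKER_INTERNAL is named in the text.
    """
    host = f"kafka{broker_id}"  # broker id is part of the server name
    listeners = (
        f"LISTENER_DOCKER_INTERNAL://{host}:{internal_port},"
        f"LISTENER_DOCKER_EXTERNAL://localhost:{external_port}"
    )
    return {
        "KAFKA_LISTENERS": listeners,
        # The text states that the same list is advertised.
        "KAFKA_ADVERTISED_LISTENERS": listeners,
        "KAFKA_INTER_BROKER_LISTENER_NAME": "LISTENER_DOCKER_INTERNAL",
        "KAFKA_ZOOKEEPER_CONNECT": zookeeper,
        "KAFKA_BROKER_ID": str(broker_id),
    }


env = kafka_broker_env(0)
for name, value in env.items():
    print(f"{name}: {value}")
```

Generating the values from one place makes it harder for the listener name in KAFKA_INTER_BROKER_LISTENER_NAME to drift out of sync with the two listener lists.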
To start all these services, navigate to the folder where the docker-compose.yaml file is saved and run docker-compose up in the terminal. To stop the services, press Ctrl-C or run docker-compose down from another terminal window. Once the services are running, you can list all the containers by running the docker ps command.
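With all the services running, let’s create a sample topic. Run the following command in the terminal, replacing <kafka-container> with the name of one of the Kafka containers listed by docker ps.
docker exec -ti <kafka-container> bash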
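Once inside, run the following command, replacing <replication-factor> and <partitions> with the values you want. The replication factor cannot exceed the number of brokers in the cluster.
kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor <replication-factor> --partitions <partitions> --topic test
Now, you should be able to subscribe to the topic test to either sink or consume the messages. Your Kafka service is running!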
Let’s get streaming!
In this example, we will be using the official RAPIDS container. Go ahead and pull the latest one following the examples at https://rapids.ai/start.html. Start the container using the command listed on the RAPIDS website. You should now be able to navigate to https://localhost:8888 and access JupyterLab.
Before we move forward, we need to connect this container to the kafka_kafka network. Do so with the following command from the terminal, replacing <rapids-container> with the name of your RAPIDS container as listed by docker ps.
docker network connect kafka_kafka <rapids-container>
From now on, we should be able to access the kafka0:19092 server from the RAPIDS container.
Note that if custreamz is not available in your container, you will need to install it first.
We will be using the .from_kafka_batched(...) method to subscribe, as this allows us to use the CUDA Kafka connector and return the messages in the form of a cudf DataFrame. The first parameter specifies the topic name and is followed by the dictionary with the consumer configuration. Next, we set the interval at which the stream object checks the Kafka topic for new messages: 2 seconds in this example. Setting engine to cudf specifies that the messages should be returned as DataFrames. We can now provide the rest of the pipeline and start the listener.
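import cudf
from streamz import Stream
from streamz.dataframe import DataFrame

# Kafka consumer configuration. The broker address comes from the setup above;
# the group.id value is an assumed, illustrative name.
kafka_conf = {
    'bootstrap.servers': 'kafka0:19092',
    'group.id': 'custreamz-demo',
}

# Subscribe to the test topic, polling every 2 seconds and returning
# each batch of messages as a cudf DataFrame.
source = Stream.from_kafka_batched(
    'test',
    kafka_conf,
    poll_interval='2s',
    engine='cudf',
    start=False,
)

def process_batch(messages):
    batch_df = cudf.DataFrame()
    # Each column of the incoming DataFrame holds one message field.
    for message in messages:
        df_split = messages[message].str.tokenize()
        df_split = (
            df_split
            .to_frame('word')
            .reset_index()
            .groupby(by='word')
            .agg({'index': 'count'})
            .rename(columns={'index': 'count'})
            .reset_index()
        )
        print("\nWord Count for this batch:")
        print(df_split)
        batch_df = cudf.concat([batch_df, df_split])
    return batch_df

stream_df = source.map(process_batch)

# Create a streamz dataframe to get stateful word count
sdf = DataFrame(stream_df, example=cudf.DataFrame({'word':[], 'count':[]}))

# Formatting the print statements
def print_format(sdf):
    print("\nGlobal Word Count:")
    return sdf

# Print cumulative word count from the start of the stream, after every batch.
# One can also sink the output to a list.
sdf.groupby('word').sum().stream.gather().map(print_format).sink(print)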
After this, run:
source.start()
Et voila! We now have a running listener to the test topic!
The code here is pretty self-explanatory, but at a high level, we expect each batch of messages to arrive as a DataFrame. We split the text into words with the .tokenize() functionality of RAPIDS cudf and then count the occurrences of each word in the batch. Finally, we create a Streamz DataFrame that we use to produce the final tally of words by summing the occurrences of each word across batches.
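If you want to reason about the logic without a GPU or a Kafka cluster, the same per-batch and cumulative counting can be mimicked on the CPU with the standard library. This is only a stand-in sketch: whitespace splitting is assumed to approximate .tokenize(), the function name count_batch is hypothetical, and the batches are made-up sample data.

```python
from collections import Counter


def count_batch(messages):
    """Count the words in one batch of text messages."""
    counts = Counter()
    for text in messages:
        # Whitespace splitting stands in for cudf's .tokenize().
        counts.update(text.split())
    return counts


# Running tally across batches, like the stateful Streamz DataFrame.
global_counts = Counter()
batches = [["RAPIDS rocks!"], ["RAPIDS streams", "rocks! again"]]
for batch in batches:
    batch_counts = count_batch(batch)
    global_counts.update(batch_counts)
    print("Word count for this batch:", dict(batch_counts))
    print("Global word count:", dict(global_counts))
```

The per-batch Counter plays the role of process_batch, and the running global_counts plays the role of the groupby('word').sum() applied to the stream.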
With the consumer now running, let’s produce some messages! Open a new notebook and install the kafka-python package.
The bootstrap_servers parameter is the address of our kafka0 server. Every message we emit will be a UTF-8 encoded JSON string. Now we can start pushing messages onto the topic message bus:
producer.send('test',{'text': 'RAPIDS rocks!'})
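For reference, the producer setup described above might look like the following sketch. It assumes the kafka-python package is installed and that the notebook runs in a container attached to the kafka_kafka network, so the broker is reachable at kafka0:19092; the helper names encode_message and make_producer are hypothetical.

```python
import json


def encode_message(payload):
    """Serialize a message as a UTF-8 encoded JSON string."""
    return json.dumps(payload).encode("utf-8")


def make_producer(bootstrap_servers="kafka0:19092"):
    """Create a producer that JSON-encodes every message it sends.

    kafka-python is imported here so that the serializer above can be
    tried out even where the package or a broker is not available.
    """
    from kafka import KafkaProducer  # provided by the kafka-python package

    return KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=encode_message,
    )


# What the broker receives for the message sent above.
encoded = encode_message({"text": "RAPIDS rocks!"})
```

With a broker running, producer = make_producer() followed by producer.send('test', {'text': 'RAPIDS rocks!'}) and producer.flush() publishes the message; the call to flush() makes sure it is actually delivered before the cell finishes.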
The notebook running the cuStreamz consumer should now print a DataFrame indexed by the words RAPIDS and rocks!, with a count of 1 against each. You can now experiment further with it!
With the introduction of cuStreamz, the RAPIDS ecosystem can speed up the processing of fast-moving data. You can try the above examples and more for yourself at app.blazingsql.com and download the cuStreamz cheatsheet here.