Categories
Misc

Can't save TextVectorization model. I just failed the exam because I couldn't save it. This isn't from the exam. This is the kaggle dataset on NLP I used to practice on, and just went there now to save the model to show the error I got. How do you save it? Or is there a way to not use it?

submitted by /u/masterdipp
[visit reddit] [comments]
Categories
Misc

Boosting Inline Packet Processing Using DPDK and GPUdev with GPUs

The inline processing of network packets using GPUs is a packet-analysis technique useful to a number of different application domains: signal processing, network security, information gathering, input reconstruction, and so on.

The main requirement of these application types is to move received packets into GPU memory as soon as possible, to trigger the CUDA kernel responsible for executing parallel processing on them.

The general idea is to create a continuous asynchronous pipeline able to receive packets from the network card directly into the GPU memory. You also dedicate a CUDA kernel to process the incoming packets without needing to synchronize the GPU and CPU.

An effective application workflow involves creating a continuous asynchronous pipeline coordinated among the following components using lockless communication mechanisms:

  • A network controller to feed the GPU memory with received network packets
  • A CPU to query the network controller to get info about received packets
  • A GPU to receive the packet info and process the packets directly in GPU memory

Figure 1 shows a typical packet workflow scenario for an accelerated inline packet processing application using an NVIDIA GPU and a ConnectX network card.

Figure 1. Typical inline packet processing workflow scenario

Avoiding latency is crucial in this context. The more the communication between the different components is optimized, the more responsive the system is, with increased throughput. Every step has to happen inline as soon as the resource it needs is available, without blocking any of the other waiting components.

You can clearly identify two different flows:

  • Data flow: Optimized data (network packets) exchange between network card and GPU over the PCIe bus.
  • Control flow: The CPU orchestrates the GPU and network card.

Data flow

The key is optimized data movement (send or receive packets) between the network controller and the GPU. It can be implemented through the GPUDirect RDMA technology, which enables a direct data path between an NVIDIA GPU and third-party peer devices such as network cards, using standard features of the PCI Express bus interface.

GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory on a PCI Express base address register (BAR) region. For more information, see Developing a Linux Kernel Module using GPUDirect RDMA in the CUDA Toolkit documentation. The Benchmarking GPUDirect RDMA on Modern Server Platforms post provides more in-depth analysis of GPUDirect RDMA bandwidth and latency in case of network operations (send and receive) performed with standard IB verbs using different system topologies. 

Figure 2. NVIDIA GPUDirect RDMA enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express

To enable GPUDirect RDMA on a Linux system, the nvidia-peermem module is required (available in CUDA 11.4 and later). Figure 3 shows the ideal system topology to maximize the GPUDirect RDMA internal throughput: a dedicated PCIe switch between GPU and NIC, rather than going through the system PCIe connection shared with other components.

Figure 3. Ideal topology to maximize internal data throughput between the network controller and GPU

Control flow

The CPU is the main player, coordinating and synchronizing activities between the network controller and the GPU: it wakes up the NIC to receive packets into GPU memory and notifies the CUDA workload that new packets are available for processing.

When dealing with a GPU, it’s really important to emphasize the asynchrony between CPU and GPU. For example, consider a simple application executing the following three steps in the main loop:

  • Receive packets.
  • Process packets.
  • Send back modified packets.

In this post, I cover four different methods to implement the control flow in this kind of application, including pros and cons.

Method 1 

Figure 4 shows the easiest but least effective approach: a single CPU thread receives packets, launches the CUDA kernel to process them, waits for the CUDA kernel to complete, and sends the modified packets back to the network controller.

Figure 4. Workflow of a single CPU passing a packet to the CUDA kernel and waiting for completion to take the next step

This approach only makes sense if packet processing is intensive enough to benefit from the GPU, for example, when a difficult and time-consuming algorithm can exploit a high degree of parallelism across packets. If processing is light, it may perform worse than just processing packets with the CPU, without the GPU involved.
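
As a rough sketch only (this code is not from the original post), the Method 1 loop on a single CPU thread looks like the following. The helper copy_pkt_addrs_to_gpu, the process_packets kernel, and the d_pkt_addrs array are hypothetical stand-ins, and packet payloads are assumed to already reside in GPU memory.

/* Hypothetical single-threaded receive -> process -> send loop (Method 1).
   BURST_SIZE is assumed to be <= 1024 so one thread block can cover a burst. */
struct rte_mbuf *mbufs[BURST_SIZE];
uint8_t **d_pkt_addrs;   /* Device array of packet payload addresses (setup elided). */

while (!force_quit) {
    /* 1. Receive a burst of packets whose payloads land in GPU memory. */
    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, mbufs, BURST_SIZE);
    if (nb_rx == 0)
        continue;

    /* 2. Make the packet addresses visible to the GPU (hypothetical helper),
       launch the processing kernel, and block until it completes. */
    copy_pkt_addrs_to_gpu(mbufs, nb_rx, d_pkt_addrs);
    process_packets<<<1, nb_rx, 0, stream>>>(d_pkt_addrs, nb_rx);
    cudaStreamSynchronize(stream);

    /* 3. Only now can the same thread transmit the modified packets. */
    rte_eth_tx_burst(port_id, queue_id, mbufs, nb_rx);
}

Every iteration pays for the kernel launch and for the synchronization, which is exactly the serialization that the following methods try to remove.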

Method 2 

In this approach, the application splits the CPU workload into two CPU threads: one for receiving packets and launching GPU processing, and the other for waiting for completion of GPU processing and transmitting modified packets over the network (Figure 5). 

Figure 5. Split CPU threads to process packets through a GPU

A drawback of this approach is the launch of a new CUDA kernel for each burst of accumulated packets. The CPU has to pay for CUDA kernel launch latency for every iteration. If the GPU is overwhelmed, the packet processing may not be executed immediately, causing a delay.

Method 3

Figure 6 shows the third approach, which involves the use of a CUDA persistent kernel.

Figure 6. Inline packet processing using a persistent CUDA kernel.

A CUDA persistent kernel is a pre-launched kernel that busy-waits for a notification from the CPU: new packets have arrived and are ready to be processed. Once it has processed those packets, the kernel notifies the second CPU thread that it can move forward to send them.

The easiest way to implement this notification system is to share some memory between CPU and GPU using a busy wait-on-flag update mechanism. While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, you can use these same APIs to create perfectly valid CPU mappings of the GPU memory. The advantage of a CPU-driven copy is the small overhead involved. This feature can be enabled today through the GDRCopy library. 
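
As an illustration, here is a minimal sketch (not taken from the original post) of mapping a GPU-resident flag into the CPU address space with GDRCopy; the page size, alignment, error handling, and flag layout are simplified assumptions.

#include <cuda_runtime.h>
#include <gdrapi.h>
#include <stdint.h>

#define GPU_PAGE_SIZE (64 * 1024)   /* GDRCopy works at GPU-page granularity. */

void map_gpu_flag_example(void)
{
    uint32_t *d_flag;   /* Flag living in GPU memory. */
    /* Over-allocate one GPU page; the returned pointer is assumed page aligned. */
    cudaMalloc((void **)&d_flag, GPU_PAGE_SIZE);
    cudaMemset(d_flag, 0, sizeof(uint32_t));

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    void *map_ptr = NULL;

    /* Pin the GPU page, exposing it on the PCIe BAR, then map it for the CPU. */
    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_flag, GPU_PAGE_SIZE, 0, 0, &mh);
    gdr_map(g, mh, &map_ptr, GPU_PAGE_SIZE);

    /* A plain CPU store now updates GPU memory that a CUDA kernel can poll. */
    volatile uint32_t *h_flag = (volatile uint32_t *)map_ptr;
    *h_flag = 1;   /* For example, "packets ready". */

    /* Cleanup when done. */
    gdr_unmap(g, mh, map_ptr, GPU_PAGE_SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_flag);
}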

Directly mapping GPU memory for signaling makes the flag modifiable from the CPU and cheaper in latency for the GPU to poll. You could also put that flag in CPU pinned memory visible from the GPU, but a CUDA kernel polling on a flag in CPU memory would consume more PCIe bandwidth and increase the overall latency.

The problem with this fast solution is that it is risky and not supported by the CUDA programming model. GPU kernels cannot be preempted. If not written correctly, the persistent kernel may loop forever. Also, a long-running persistent kernel may lose synchronization with respect to other CUDA kernels, CPU activity, memory allocation status, and so on.

A persistent kernel also holds GPU resources (for example, streaming multiprocessors), which may not be the best option if the GPU is busy with other tasks. If you use CUDA persistent kernels, you really must have a good handle on your application.

Method 4

The final approach is a hybrid of the previous ones: use CUDA stream memory operations to wait for or update the notification flag, pre-launching on the CUDA stream one CUDA kernel per set of received packets.

Figure 7. Hybrid approach to inline packet processing using a combination of models

The difference with this approach is that the GPU hardware polls (with cuStreamWaitValue) on the memory flag rather than blocking the GPU streaming multiprocessors, and the packet-processing kernel is triggered only when the packets are ready.

Similarly, when the processing kernel ends, cuStreamWriteValue notifies the CPU thread responsible for sending that the packets have been processed. 

The downside of this approach is that the application must, from time to time, re-fill the GPU with a new sequence of cuStreamWaitValue + CUDA kernel + cuStreamWriteValue, so as not to be left with an empty stream that is not ready to process more packets. A CUDA Graph can be a good approach for reposting this sequence on the stream.
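
For illustration, a possible pre-launch sequence for one item might look like the following sketch (an assumption on my part, not code from the original post). The flag addresses are assumed to be GPU-visible 32-bit words, packet_processing_kernel and the flag values are placeholders, and the device is assumed to support CUDA stream memory operations.

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdint.h>

/* Hypothetical packet-processing kernel: one thread per packet. */
__global__ void packet_processing_kernel(uint8_t **pkt_addrs, uint32_t num_pkts)
{
    if (threadIdx.x < num_pkts) {
        /* ... process pkt_addrs[threadIdx.x] ... */
    }
}

void prelaunch_item(cudaStream_t stream, CUdeviceptr ready_flag, CUdeviceptr done_flag,
                    uint8_t **pkt_addrs, uint32_t num_pkts)
{
    /* The GPU front end waits on the flag without occupying any SM; the kernel
       below starts only when the CPU receive thread sets the flag to 1. */
    cuStreamWaitValue32((CUstream)stream, ready_flag, 1, CU_STREAM_WAIT_VALUE_EQ);

    /* Packet-processing kernel for this set of packets (up to 256 here). */
    packet_processing_kernel<<<1, 256, 0, stream>>>(pkt_addrs, num_pkts);

    /* Notify the CPU thread responsible for sending that processing is done. */
    cuStreamWriteValue32((CUstream)stream, done_flag, 2, CU_STREAM_WRITE_VALUE_DEFAULT);
}

The CPU can queue several such items on the stream in advance, and later re-fill the stream (or replay the sequence with a CUDA Graph) as items are consumed.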

Different approaches are suited to different application patterns.

DPDK and GPUdev

The Data Plane Development Kit (DPDK) is a set of libraries to help accelerate packet processing workloads running on a wide variety of CPU architectures and different devices.  

In DPDK 21.11, NVIDIA introduced a new library named GPUdev to bring the notion of a GPU into the context of DPDK and to enhance the dialog among CPUs, network cards, and GPUs. GPUdev was extended with more features in DPDK 22.03.

The goals of the library are as follows:

  • Introduce the concept of a GPU device managed from a DPDK generic library.
  • Implement basic GPU memory interactions, hiding GPU-specific implementation details. 
  • Reduce the gap between the network card, GPU device, and CPU by enhancing the communication.
  • Simplify the DPDK integration into GPU applications.
  • Expose GPU driver-specific features through a generic layer.

For NVIDIA GPUs, the GPUdev library functionality is implemented at the DPDK driver level through the CUDA driver DPDK library. To enable all the available gpudev features for an NVIDIA GPU, DPDK must be built on a system with the CUDA libraries and GDRCopy installed.

With the features offered by this new library, you can easily implement inline packet processing with GPUs, taking care of both the data flow and the control flow.

DPDK receives packets in a mempool, a contiguous chunk of memory. With the following sequence of instructions, you can enable GPUDirect RDMA to allocate the mempool in GPU memory and register it with the network device.

struct rte_pktmbuf_extmem gpu_mem;

gpu_mem.buf_ptr = rte_gpu_mem_alloc(gpu_id, gpu_mem.buf_len, alignment);

/* Make the GPU memory visible to DPDK */
rte_extmem_register(gpu_mem.buf_ptr, gpu_mem.buf_len,
                    NULL, gpu_mem.buf_iova, NV_GPU_PAGE_SIZE);

/* Create DMA mappings on the NIC */
rte_dev_dma_map(rte_eth_devices[PORT_ID].device, gpu_mem.buf_ptr,
                gpu_mem.buf_iova, gpu_mem.buf_len);

/* Create the actual mempool */
struct rte_mempool *mpool = rte_pktmbuf_pool_create_extbuf(..., &gpu_mem, ...);

Figure 8 shows the structure of the mempool: 

Figure 8. Structure of the mempool for inline packet processing
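
As a brief follow-up (an assumption, not shown in the original post), the GPU-backed mempool is then used like any other mempool, for example when setting up the receive queue, so the NIC DMAs incoming packets straight into GPU memory:

/* Attach the externally allocated (GPU memory) mempool to an RX queue.
   PORT_ID, QUEUE_ID, and the descriptor count are illustrative values. */
ret = rte_eth_rx_queue_setup(PORT_ID, QUEUE_ID, 2048,
                             rte_eth_dev_socket_id(PORT_ID),
                             NULL /* default rte_eth_rxconf */, mpool);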

For the control flow, to enable the notification mechanism between CPU and GPU, you can use the gpudev communication list: a shared memory structure between the CPU memory and the CUDA kernel. Each item of the list can hold the addresses of received packets (mbufs) and a flag to track the processing status of that item (ready with packets, done with processing, and so on).

struct rte_gpu_comm_list {
    /** DPDK GPU ID that will use the communication list. */
    uint16_t dev_id;
    /** List of mbufs populated by the CPU with a set of mbufs. */
    struct rte_mbuf **mbufs;
    /** List of packets populated by the CPU with a set of mbufs info. */
    struct rte_gpu_comm_pkt *pkt_list;
    /** Number of packets in the list. */
    uint32_t num_pkts;
    /** Status of the packets’ list. CPU pointer. */
    enum rte_gpu_comm_list_status *status_h;
    /** Status of the packets’ list. GPU pointer. */
    enum rte_gpu_comm_list_status *status_d;
};

Example pseudo-code: 

struct rte_mbuf *rx_mbufs[MAX_MBUFS];
int item_index = 0;
struct rte_gpu_comm_list *comm_list = rte_gpu_comm_create_list(gpu_id, NUM_ITEMS);

while (!exit_condition) {
    ...
    // Receive and accumulate enough packets
    nb_rx += rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), rx_pkts);

    // Populate the next item in the communication list.
    rte_gpu_comm_populate_list_pkts(&(comm_list[item_index]), rx_mbufs, nb_rx);
    ...
    item_index = (item_index + 1) % NUM_ITEMS;
}

For simplicity, assuming that the application follows the CUDA persistent kernel scenario, the polling side on the CUDA kernel would look something like the following code example:

__global__ void cuda_persistent_kernel(struct rte_gpu_comm_list *comm_list, int comm_list_entries)
{
    int item_index = 0;
    uint32_t wait_status;

    /* GPU kernel keeps checking the exit condition as it can't be preempted. */
    while (!exit_condition()) {
        wait_status = RTE_GPU_VOLATILE(comm_list[item_index].status_d[0]);
        if (wait_status != RTE_GPU_COMM_LIST_READY)
            continue;

        if (threadIdx.x < comm_list[item_index].num_pkts) {
            /* Each CUDA thread processes a different packet. */
            packet_processing(comm_list[item_index].pkt_list[threadIdx.x].addr,
                              comm_list[item_index].pkt_list[threadIdx.x].size, ..);
        }
        __syncthreads();

        /* Notify that the packets in this item have been processed. */
        if (threadIdx.x == 0) {
            RTE_GPU_VOLATILE(comm_list[item_index].status_d[0]) = RTE_GPU_COMM_LIST_DONE;
            __threadfence_system();
        }

        /* Wait for new packets on the next communication list entry. */
        item_index = (item_index + 1) % comm_list_entries;
    }
}
Figure 9. Example workflow of the pseudo-code for the polling side in a persistent kernel
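
To complete the picture, the CPU thread on the transmit side could look something like the following sketch (my assumption, in the spirit of the pseudo-code above): it waits for the GPU to mark an item as done, transmits the packets, and recycles the item so the receive thread can reuse it.

enum rte_gpu_comm_list_status status;
int tx_index = 0;

while (!exit_condition) {
    /* Check, through the CPU-visible status pointer, whether the GPU is done. */
    if (RTE_GPU_VOLATILE(comm_list[tx_index].status_h[0]) != RTE_GPU_COMM_LIST_DONE)
        continue;

    /* Send the processed packets back to the network. */
    rte_eth_tx_burst(port_id, queue_id, comm_list[tx_index].mbufs,
                     comm_list[tx_index].num_pkts);

    /* Reset the item so the receive thread can reuse it. */
    rte_gpu_comm_cleanup_list(&(comm_list[tx_index]));
    tx_index = (tx_index + 1) % NUM_ITEMS;
}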

A concrete use case at NVIDIA that uses the DPDK gpudev library for inline packet processing is in the Aerial application framework for building high-performance, software-defined, 5G applications. In this case, packets must be received in GPU memory and reordered according to 5G-specific packet headers, with the effect that signal processing can start on the reordered payload.

Figure 10. Inline packet processing use case with DPDK gpudev in Aerial 5G software

l2fwd-nv application

To provide a practical example of how to implement inline packet processing and use the DPDK gpudev library, the l2fwd-nv sample code has been released on the /NVIDIA/l2fwd-nv GitHub repo. This is an extension of the vanilla DPDK l2fwd example, enhanced with GPU capabilities. The application receives packets, swaps the MAC addresses (source and destination) of each packet, and transmits the modified packets.

L2fwd-nv provides an implementation example for all the approaches discussed in this post for comparison:

  • CPU only
  • CUDA kernels per set of packets
  • CUDA persistent kernel
  • CUDA Graphs

As an example, Figure 11 shows the timeline for CUDA persistent kernel with DPDK gpudev objects.

Figure 11. Example timeline for the CUDA persistent kernel using DPDK gpudev objects

To measure l2fwd-nv performance against the DPDK testpmd packet generator, two Gigabyte servers connected back-to-back have been used (Figure 12), each with an Intel Xeon Gold 6240R CPU, a dedicated PCIe Gen3 switch, Ubuntu 20.04, MOFED 5.4, and CUDA 11.4.

Figure 12. Two Gigabyte server configurations to test l2fwd-nv performance

Figure 13 shows that peak I/O throughput is identical when using either CPU or GPU memory for the packets, so there is no inherent penalty for using one over the other. Packets here are forwarded without being modified.

Figure 13. Peak I/O throughput is identical

To highlight differences between different GPU packet handling approaches, Figure 14 shows the throughput comparison between Method 2 (CUDA kernel per set of packets) and Method 3 (CUDA persistent kernel). Both methods keep packet sizes to 1024 bytes, varying the number of accumulated packets before triggering the GPU work to swap packets’ MAC addresses.

Figure 14. Differences between GPU packet handling methods

For both approaches, 16 packets per iteration causes too many interactions in the control plane and peak throughput is not achieved. With 32 packets per iteration, the persistent kernel can keep up with peak throughput, while individual launches per iteration still have too much control plane overhead. For 64 and 128 packets per iteration, both approaches are able to reach peak I/O throughput. These are not zero-packet-loss throughput measurements.

Conclusion

In this post, I discussed several approaches to optimize inline packet processing using GPUs. Depending on your application needs, you can apply several workflow models to gain improved performance as a result of reduced latency. The DPDK gpudev library also helps simplify your coding efforts to achieve optimum results in the shortest amount of time.

Other factors to consider, depending on the application, include how much time to spend accumulating enough packets on the receive side before triggering the packet processing, how many threads are available to maximize parallelism among different tasks, and how long the kernel should remain persistent in execution.

Categories
Misc

Convert Tensorflow model to CoreML for iOS app

I’m trying to convert my tensorflow image classification model to a CoreML model that I can integrate in an iOS app. The tensorflow model takes in an image of shape (1,28,28,1) and outputs a softmax array. When I try to convert it to CoreML I get the following error:

raise TypeError("Unsupported numpy type: %s" % (nptype))

TypeError: Unsupported numpy type: float32

Does anyone have any experience with this?

submitted by /u/berimbolo21
[visit reddit] [comments]

Categories
Misc

What is the current industry standard in distributed tensorflow setup?

In mapreduce everything is handled by yarn and hadoop for resource management, how is tensorflow setup in distributed environment?

Is Horovod recommended?

submitted by /u/Rough_Source_123
[visit reddit] [comments]

Categories
Misc

Updating the CUDA Linux GPG Repository Key

To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by the apt, dnf/yum, and zypper package managers beginning April 27, 2022.

If you don’t update your repository signing keys, expect package management errors when attempting to access or install packages from CUDA repositories.

To ensure continued access to the latest NVIDIA software, complete the following steps.

Remove the outdated signing key

Debian, Ubuntu, WSL

$ sudo apt-key del 7fa2af80

Fedora, RHEL, openSUSE, SLES

$ sudo rpm --erase gpg-pubkey-7fa2af80*

Install the new key

For Debian-based distributions, including Ubuntu, you must also install the new package or manually install the new signing key.

Install the new cuda-keyring package

To avoid the need for manual key installation steps, NVIDIA is providing a new helper package to automate the installation of new signing keys for NVIDIA repositories. 

Replace $distro/$arch in the following commands with values appropriate for your OS; for example:

  • debian10/x86_64
  • debian11/x86_64
  • ubuntu1604/x86_64
  • ubuntu1804/cross-linux-sbsa
  • ubuntu1804/ppc64el
  • ubuntu1804/sbsa
  • ubuntu1804/x86_64
  • ubuntu2004/cross-linux-sbsa
  • ubuntu2004/sbsa
  • ubuntu2004/x86_64
  • ubuntu2204/sbsa
  • ubuntu2204/x86_64
  • wsl-ubuntu/x86_64

Debian, Ubuntu, WSL

$ wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb

Alternate method: Manually install the new signing key

If you can’t install the cuda-keyring package, you can install the new signing key manually (not the recommended method).

Debian, Ubuntu, WSL

$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/3bf863cc.pub

RPM distros

On a fresh installation of Fedora, RHEL, openSUSE, or SLES, dnf/yum/zypper prompts you to accept the new key when the repository signing key changes. Accept the change when prompted.

Replace $distro/$arch in the following commands with values appropriate for your OS; for example:

  • fedora32/x86_64
  • fedora33/x86_64
  • fedora34/x86_64
  • fedora35/x86_64
  • opensuse15/x86_64
  • rhel7/ppc64le
  • rhel7/x86_64
  • rhel8/cross-linux-sbsa
  • rhel8/ppc64le
  • rhel8/sbsa
  • rhel8/x86_64
  • sles15/cross-linux-sbsa
  • sles15/sbsa
  • sles15/x86_64

For upgrades on RPM-based distros including Fedora, RHEL, and SUSE, you must also run the following command.

Fedora and RHEL 8

$ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-$distro.repo

RHEL 7

$ sudo yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/$arch/cuda-rhel7.repo

openSUSE and SLES

$ sudo zypper removerepo cuda-$distro-$arch
$ sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-$distro.repo

Working with containers

CUDA applications built using older NGC base containers may contain outdated repository keys. If you build Docker containers using these images as a base and update the package manager or install additional NVIDIA packages as part of your Dockerfile, these commands may fail as they would on a non-container system. To work around this, integrate the earlier commands into the Dockerfile you use to build the container.

Existing containers in which the package manager is not used to install updates are not affected by this key rotation.

Categories
Misc

How DNEG Helped Win Another Visual-Effects Oscar by Bringing ‘Dune’ to Life With NVIDIA RTX

Featuring stunning visuals from futuristic interstellar worlds, including colossal sand creatures, Dune captivated audiences around the world. The sci-fi film picked up six Oscars last month at the 94th Academy Awards, including for Best Sound and Visual Effects. Adapted from Frank Herbert’s 1965 novel of the same name, Dune tells the story of Paul Atreides…

The post How DNEG Helped Win Another Visual-Effects Oscar by Bringing ‘Dune’ to Life With NVIDIA RTX appeared first on NVIDIA Blog.

Categories
Misc

Coursera TF Developer Certification Course Certification worth the price?

Hello,
So I am about to do the https://www.coursera.org/professional-certificates/tensorflow-in-practice#instructors to prepare for the real TF certificate. My company pays for the certificate, but for the course I'd have to use a shared account. Hence the granted Coursera certificate will be addressed to another name. So my question is, is it worth it to pay the price for Coursera and the course to get the cert in my name? (It would be approx 44€ * 4 months.) Or does it not matter since I am going to do the TF Developer certificate anyways?

submitted by /u/sickTheBest
[visit reddit] [comments]

Categories
Misc

Ranking within groups

New to tensorflow. I have used ML.NET in the past.

I am trying to build a model that will rank a group of volunteers’ ability to sign up new subscribers for a service in a given day. Let’s say that for a year I hire 10 new volunteers each day (features are age, gender, education, etc … and my label is number-of-subscription-signups) . After a year I have 3650 rows of data that I want to train but use the ‘date’ as a group designation so that ranking consideration does not span across different days. How can I specify group id? or what is the proper terminology for what I’m trying to achieve (and I can do my own research)?

I want to use ‘group id’ as it’s being used in THIS ml.net EXAMPLE.

Thanks

submitted by /u/MemphisRay
[visit reddit] [comments]

Categories
Misc

Your Odyssey Awaits: Stream ‘Lost Ark’ to Nearly Any Device This GFN Thursday

It’s a jam-packed GFN Thursday. This week brings the popular, free-to-play, action role-playing game Lost Ark to gamers across nearly all their devices, streaming on GeForce NOW. And that’s not all. GFN Thursday also delivers an upgraded experience in the 2.0.40 update. M1-based MacBooks, iMacs and Mac Minis are now supported natively. Plus, membership gift…

The post Your Odyssey Awaits: Stream ‘Lost Ark’ to Nearly Any Device This GFN Thursday appeared first on NVIDIA Blog.

Categories
Misc

MoViNets: Mobile Video Networks for Efficient Video Recognition

Anyone try using MoViNets for custom dataset?

submitted by /u/InternalStorm133
[visit reddit] [comments]