AI factories rely on more than just compute fabrics. While the East-West network connecting the GPUs is critical to AI application performance, the storage fabric—connecting high-speed storage arrays—is equally important. Storage performance plays a key role across several stages of the AI lifecycle, including training checkpointing, inference techniques such as retrieval-augmented generation…
Just Released: CUTLASS 3.8
Provides support for the NVIDIA Blackwell SM100 architecture. CUTLASS is a collection of CUDA C++ templates and abstractions for implementing high-performance GEMM computations.
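For a flavor of what those templates look like in practice, here is a minimal sketch using CUTLASS's classic device-level GEMM API, following the pattern of the library's basic GEMM example. It instantiates a single-precision, column-major GEMM; the wrapper name run_sgemm is just an illustrative choice, and the Blackwell SM100 kernels themselves are built through the newer CUTLASS 3.x collective-builder interface, which this sketch does not cover.

```cpp
#include "cutlass/gemm/device/gemm.h"

// Single-precision GEMM with column-major A, B, and C (classic device-level API).
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C and D

// Computes D = alpha * A * B + beta * C on device pointers (C and D alias here).
cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha,
                          float const *A, int lda,
                          float const *B, int ldb,
                          float beta,
                          float *C, int ldc) {
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},       // problem size
                       {A, lda},        // TensorRef for A
                       {B, ldb},        // TensorRef for B
                       {C, ldc},        // TensorRef for C (source)
                       {C, ldc},        // TensorRef for D (destination)
                       {alpha, beta});  // epilogue scalars
  return gemm_op(args);
}
```

The same pattern extends to mixed-precision and Tensor Core kernels by swapping the element types, layouts, and architecture tags in the template parameters.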
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode communication primitives optimized for NVIDIA GPUs and networking. NCCL is a central piece of software for multi-GPU deep learning training. It handles any kind of inter-GPU communication, be it over PCI, NVLink, or networking. It uses advanced topology detection, optimized communication graphs…
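To show how those primitives are typically invoked, below is a minimal single-process sketch that all-reduces one buffer per visible GPU using ncclCommInitAll and ncclAllReduce. The element count and float data are placeholder choices, and error checking is omitted for brevity; NCCL picks the transport (NVLink, PCIe, or the network) from the topology it detects.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>
#include <vector>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev < 1) return 0;

  const size_t count = 1 << 20;  // elements per GPU (arbitrary for illustration)
  std::vector<ncclComm_t> comms(ndev);
  std::vector<float*> sendbuf(ndev), recvbuf(ndev);
  std::vector<cudaStream_t> streams(ndev);

  // Allocate one send/receive buffer and one stream per device.
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
    cudaMemset(sendbuf[i], 0, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // One communicator per visible GPU, all within a single process.
  ncclCommInitAll(comms.data(), ndev, nullptr);

  // Sum-reduce every GPU's buffer and broadcast the result back to all of them.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }

  // Clean up communicators, buffers, and streams.
  for (int i = 0; i < ndev; ++i) {
    ncclCommDestroy(comms[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    cudaStreamDestroy(streams[i]);
  }
  printf("all-reduce complete on %d GPU(s)\n", ndev);
  return 0;
}
```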
Just Released: NVIDIA cuDNN 9.7
Bringing support for NVIDIA Blackwell architecture across data center and GeForce products, NVIDIA cuDNN 9.7 delivers speedups of up to 84% for FP8 Flash Attention operations and optimized GEMM capabilities with advanced fusion support to accelerate deep learning workloads.
Just Released: CUTLASS 3.8
CUTLASS 3.8 extends support to the NVIDIA Blackwell SM100 architecture, with kernels reaching up to 99% of peak Tensor Core performance, and brings essential features such as Mixed Input GEMMs for efficient model quantization and Grouped GEMM capabilities that accelerate mixture-of-experts (MoE) models through parallel expert computation.
The recently released DeepSeek-R1 model family has brought a new wave of excitement to the AI community, allowing enthusiasts and developers to run state-of-the-art reasoning models with problem-solving, math and code capabilities, all from the privacy of local PCs. With up to 3,352 trillion operations per second of AI horsepower, NVIDIA GeForce RTX 50 Series
The AI tools for Art Newsletter – Issue 1
DeepSeek-R1 Now Live With NVIDIA NIM
DeepSeek-R1 is an open model with state-of-the-art reasoning capabilities. Instead of offering direct responses, reasoning models like DeepSeek-R1 perform multiple inference passes over a query, conducting chain-of-thought, consensus and search methods to generate the best answer. Performing this sequence of inference passes — using reason to arrive at the best answer — is known as
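To make the consensus idea concrete, here is a small illustrative sketch (not from the article) of just the aggregation step: given final answers extracted from several sampled chain-of-thought passes over the same query, a majority vote selects the response to return. The candidate strings below are made up; in practice each would come from a separate call to the served model.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Majority-vote consensus over answers from several inference passes
// (the aggregation half of a self-consistency scheme).
std::string consensus_answer(const std::vector<std::string>& candidates) {
  std::map<std::string, int> votes;
  for (const auto& answer : candidates) ++votes[answer];

  std::string best;
  int best_count = 0;
  for (const auto& [answer, count] : votes) {
    if (count > best_count) {
      best = answer;
      best_count = count;
    }
  }
  return best;
}

int main() {
  // Hypothetical final answers extracted from five sampled reasoning passes.
  std::vector<std::string> candidates = {"42", "41", "42", "42", "17"};
  std::cout << "consensus: " << consensus_answer(candidates) << "\n";  // prints 42
  return 0;
}
```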