Programming Distributed Multi-GPU Tensor Operations with cuTENSOR v1.4

NVIDIA cuTENSOR version 1.4 supports up to 64-dimensional tensors and distributed multi-GPU tensor operations, and improves the tensor contraction performance model.

Today, NVIDIA is announcing the availability of cuTENSOR version 1.4, which supports up to 64-dimensional tensors and distributed multi-GPU tensor operations, and improves the tensor contraction performance model. The software is available now as a free download.

Download the cuTENSOR software.

What’s New?

Supports up to 64-dimensional tensors.
Supports distributed, multi-GPU tensor operations.
Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT).
Improved performance for tensor contractions that have an overall large contracted dimension (i.e., a parallel reduction was added).
Improved performance for tensor contractions that have a tiny contracted dimension (
Improved performance for outer-product-like tensor contractions (e.g., C[a,b,c,d] = A[b,d] * B[a,c]).
Additional bug fixes.

For more information, see the cuTENSOR Release Notes.
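
To make the contraction notation in the list above concrete, here is a minimal reference sketch in plain Python. It only illustrates what an expression such as C[a,b,c,d] = A[b,d] * B[a,c] computes; it is not the cuTENSOR API and says nothing about how the library implements it. The helper names `contract` and `make_tensor`, and the dictionary-based tensor representation, are assumptions made for this illustration.

```python
from itertools import product


def make_tensor(modes, extent, fn):
    """Build a dense tensor as a dict: index tuple -> value (illustrative only)."""
    return {i: fn(*i) for i in product(*(range(extent[m]) for m in modes))}


def contract(modes_c, modes_a, a, modes_b, b, extent):
    """Reference semantics for C[modes_c] = A[modes_a] * B[modes_b].

    Modes that appear in A or B but not in C are contracted (summed over).
    With no contracted modes the result is outer-product-like.
    """
    contracted = [m for m in dict.fromkeys(modes_a + modes_b) if m not in modes_c]
    c = {}
    for free in product(*(range(extent[m]) for m in modes_c)):
        idx = dict(zip(modes_c, free))
        total = 0.0
        # One (empty) iteration when there is nothing to sum over.
        for summed in product(*(range(extent[m]) for m in contracted)):
            idx.update(zip(contracted, summed))
            total += (a[tuple(idx[m] for m in modes_a)]
                      * b[tuple(idx[m] for m in modes_b)])
        c[free] = total
    return c


# Outer-product-like case from the list: C[a,b,c,d] = A[b,d] * B[a,c]
extent = {"a": 2, "b": 3, "c": 2, "d": 2}
A = make_tensor("bd", extent, lambda b, d: float(10 * b + d))
B = make_tensor("ac", extent, lambda a, c: float(a + 2 * c + 1))
C = contract("abcd", "bd", A, "ac", B, extent)

# Case with a contracted mode k (an ordinary matrix product): C[m,n] = A[m,k] * B[k,n]
ext2 = {"m": 2, "k": 3, "n": 2}
A2 = make_tensor("mk", ext2, lambda m, k: float(m + k))
B2 = make_tensor("kn", ext2, lambda k, n: float(k * n))
C2 = contract("mn", "mk", A2, "kn", B2, ext2)
```

In the first call every mode of A and B also appears in C, so nothing is summed and each output element is a single product. In the second call the mode k is missing from C, so it is the contracted dimension.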

About cuTENSOR

cuTENSOR is a high-performance CUDA library for tensor primitives; its key features include:

Extensive mixed-precision support:
- FP64 inputs with FP32 compute.
- FP32 inputs with FP16, BF16, or TF32 compute.
- Complex-times-real operations.
- Conjugate (without transpose) support.

Supports up to 64-dimensional tensors.
Supports arbitrary data layouts.
Supports trivially serializable data structures.
Enhancements to main computational routines:
- Direct (i.e., transpose-free) tensor contractions.
- Tensor reductions (including partial reductions).
- Element-wise tensor operations:
  - Support for various activation functions.
  - Arbitrary tensor permutations.
  - Conversion between different data types.
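
As a rough illustration of two of the routines listed above, the following plain-Python sketch shows what a tensor permutation and a partial reduction mean when tensors live in flat buffers addressed through strides. It is a reference for the semantics only, not cuTENSOR code: the function names (`packed_strides`, `permute`, `reduce_sum`) are hypothetical, and the dense layout in which the first mode varies fastest is simply the convention chosen for this sketch.

```python
from itertools import product


def packed_strides(extents):
    """Strides of a dense layout in which the first mode varies fastest."""
    strides, step = [], 1
    for e in extents:
        strides.append(step)
        step *= e
    return strides


def permute(a, modes_a, strides_a, modes_b, strides_b, extent):
    """B[modes_b] = A[modes_a]: same elements, different mode order."""
    b = [0.0] * len(a)
    for idx in product(*(range(extent[m]) for m in modes_a)):
        pos = dict(zip(modes_a, idx))
        src = sum(pos[m] * s for m, s in zip(modes_a, strides_a))
        dst = sum(pos[m] * s for m, s in zip(modes_b, strides_b))
        b[dst] = a[src]
    return b


def reduce_sum(a, modes_a, strides_a, modes_out, extent):
    """Partial reduction: sum over every mode of A that is absent from modes_out."""
    strides_out = packed_strides([extent[m] for m in modes_out])
    size = 1
    for m in modes_out:
        size *= extent[m]
    out = [0.0] * size
    for idx in product(*(range(extent[m]) for m in modes_a)):
        pos = dict(zip(modes_a, idx))
        src = sum(pos[m] * s for m, s in zip(modes_a, strides_a))
        dst = sum(pos[m] * s for m, s in zip(modes_out, strides_out))
        out[dst] += a[src]
    return out


# A[a,b] with extents a=2, b=3, stored so that A[a,b] == a + 2*b
extent = {"a": 2, "b": 3}
A = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
strides_A = packed_strides([extent["a"], extent["b"]])  # [1, 2]

# Permutation B[b,a] = A[a,b]
strides_B = packed_strides([extent["b"], extent["a"]])  # [1, 3]
B = permute(A, "ab", strides_A, "ba", strides_B, extent)

# Partial reductions: keep one mode, sum over the other
sum_over_b = reduce_sum(A, "ab", strides_A, "a", extent)
sum_over_a = reduce_sum(A, "ab", strides_A, "b", extent)
```

Because every access goes through explicit strides, the same loops would work for any layout, which is the sense in which a routine can accept arbitrary data layouts without first transposing the data.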

Learn more

On Math Libraries, see Recent Developments in NVIDIA Math Libraries (GTC #S31754).
For the latest on HPC software, see A Deep Dive into the latest HPC software (GTC #S31286).
Catch up on Tensor Core-Accelerated Math Libraries for Dense and Sparse Linear Algebra in AI and HPC (GTC #CWES1098).
Read technical details in our cuTENSOR Product Documentation.

Recent Developer posts

On Fortran enhancements to support Tensor Cores, read Bringing Tensor Cores to Standard Fortran.
Benefit from A100 acceleration and read Getting Immediate Speedups with NVIDIA A100 TF32.
To gain AI training benefits, see Accelerating AI Training with NVIDIA TF32 Tensor Cores.
