Categories
Misc

I can’t seem to get TensorFlow working on my 3070 for NLP or CNN models

I have been using TensorFlow for a while now, but I recently ran into a problem with one of my programs. While trying to create a convolutional network I got the error

Epoch 1/4 Process finished with exit code -1073740791 (0xC0000409) 

where I have never had this error before. I have the latest CUDA and cuDNN versions installed in the right folders, so I don’t know what the problem is. Any help is appreciated, thanks.

from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf

VOCAB_SIZE = 88584
MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)

train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),      # Graph vector form, 32 dimensions
    tf.keras.layers.LSTM(32),                       # Long-short term memory
    tf.keras.layers.Dense(1, activation="sigmoid")  # Between 0-1
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss="mean_squared_error",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

history = model.fit(x=train_data, y=train_labels, batch_size=128, epochs=10)

submitted by /u/Cheif_Cheese

Categories
Misc

📢 New Course on TensorFlow and Keras by OpenCV

submitted by /u/spmallick
Categories
Misc

2022-02-11 16:51:01.357924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1900] Ignoring visible gpu device (device: 0, name: GeForce 820M, pci bus id: 0000:04:00.0, compute capability: 2.1) with Cuda compute capability 2.1. The minimum required Cuda capability is 3.5.

Hello everyone.

Is there a way to bypass this without having to only use my CPU?

Thanks

submitted by /u/dalpendre

Categories
Offsites

An International Scientific Challenge for the Diagnosis and Gleason Grading of Prostate Cancer

In recent years, machine learning (ML) competitions in health have attracted ML scientists to work together to solve challenging clinical problems. These competitions provide access to relevant data and well-defined problems where experienced data scientists come to compete for solutions and learn new methods. However, a fundamental difficulty in organizing such challenges is obtaining and curating high quality datasets for model development and independent datasets for model evaluation. Importantly, to reduce the risk of bias and to ensure broad applicability of the algorithm, evaluation of the generalisability of resulting algorithms should ideally be performed on multiple independent evaluation datasets by an independent group of scientists.

One clinical problem that has attracted substantial ML research is prostate cancer, a condition that 1 in 9 men develop in their lifetime. A prostate cancer diagnosis requires pathologists to examine biological tissue samples under a microscope to identify cancer and grade the cancer for signs of aggressive growth patterns in the cells. However, this cancer grading task (called Gleason grading) is difficult and subjective due to the need for visual assessment of cell differentiation and Gleason pattern predominance. Building a large dataset of samples with expert annotations can help with the development of ML systems to aid in prostate cancer grading.

To help accelerate and enable more research in this area, Google Health, Radboud University Medical Center and Karolinska Institutet joined forces to organize a global competition, the Prostate cANcer graDe Assessment (PANDA) Challenge, on the open Kaggle platform. In "Artificial Intelligence for Diagnosis and Gleason Grading of Prostate Cancer: the PANDA challenge", published in Nature Medicine, we present the results of the challenge. The study design of the PANDA challenge provided the largest public whole-slide image dataset available and was open to participants from April 21st until July 23rd, 2020. The development datasets remain available for further research. In this effort, we compiled and publicly released a European cohort of prostate cancer cases for algorithm development and pioneered a standardized evaluation setup for digital pathology that enabled independent, blinded external validation of the algorithms on data from both the United States and EU.

The global competition attracted participants from 65 countries (the size of the circle for each country illustrates the number of participants).

Design of the PANDA Challenge
The challenge had two phases: a development phase (i.e., the Kaggle competition) and a validation phase. During the competition, 1,290 developers from 65 countries competed in building the best performing Gleason grading algorithm, having full access to a development set for algorithm training. Throughout the competition teams submitted algorithms that were evaluated on a hidden tuning set.

In the validation phase, a selection of top-performing algorithms was independently evaluated on internal and external validation datasets with high quality reference grades from panels of expert prostate pathologists. In addition, a group of general pathologists graded a subset of the same cases to put the difficulty of the task and dataset in context. The algorithms submitted by the teams were then compared to grades from groups of international and US general pathologists on these subsets.

Overview of the PANDA challenge’s phases for development and validation.

Research Velocity During the Challenge
We found that a group of Gleason grading ML algorithms developed during a global competition could achieve pathologist-level performance and generalize well to intercontinental and multinational cohorts. On all external validation sets, these algorithms achieved high agreement with urologic pathologists (prostate specialists) and high sensitivity for detecting tumor in biopsies. The Kaggle platform enabled the tracking of teams’ performance throughout the competition. Impressively, the first team to achieve high agreement with the prostate pathologists (above 0.90 quadratically weighted Cohen’s kappa) on the internal validation set did so within the first 10 days of the competition. By the 33rd day, the median performance of all teams exceeded a score of 0.85.

Progression of algorithms’ performances throughout the competition, as shown by the highest score on the tuning and internal validation sets among all participating teams. During the competition teams could submit their algorithm for evaluation on the tuning set, after which they received their score. At the same time, algorithms were evaluated on the internal validation set, without disclosing these results to the participating teams. The development of the top score obtained by any team shows the rapid improvement of the algorithms.

Learning from the Challenge
By moderating the discussion forum on the Kaggle platform, we learned that the teams’ openness in sharing code via Colab notebooks led to rapid improvement across the board, a promising sign for future public challenges, and a clear indication of the power of sharing knowledge on a common platform.

Organizing a public challenge that evaluates algorithm generalization across independent cohorts using high quality reference standard panels presents substantial logistical difficulties. Assembling this size of a dataset across countries and organizations was a massive undertaking. This work benefited from an amazing collaboration between the three organizing institutions which have all contributed respective publications in this space, two in Lancet Oncology and one in JAMA Oncology. Combining these efforts provided a high quality foundation on which this competition could be based. With the publication, Radboud and Karolinska research groups are also open sourcing the PANDA challenge development datasets to facilitate the further improvement of prostate Gleason grading algorithms. We look forward to seeing many more advancements in this field, and more challenges that can catalyze extensive international knowledge sharing and collaborative research.

Acknowledgements
Key contributors to this project at Google include Po-Hsuan Cameron Chen, Kunal Nagpal, Yuannan Cai, David F. Steiner, Maggie Demkin, Sohier Dane, Fraser Tan, Greg S. Corrado, Lily Peng, Craig H. Mermel. Collaborators on this project include Wouter Bulten, Kimmo Kartasalo, Peter Ström, Hans Pinckaers, Hester van Boven, Robert Vink, Christina Hulsbergen-van de Kaa, Jeroen van der Laak, Mahul B. Amin, Andrew J. Evans, Theodorus van der Kwast, Robert Allan, Peter A. Humphrey, Henrik Grönberg, Hemamali Samaratunga, Brett Delahunt, Toyonori Tsuzuki, Tomi Häkkinen, Lars Egevad, Masi Valkonen, Pekka Ruusuvuori, Geert Litjens, Martin Eklund and the PANDA Challenge consortium. We thank Ellery Wulczyn, Annisah Um’rani, Yun Liu, and Dale Webster for their feedback on the manuscript and guidance on the project. We thank our collaborators at NMCSD, particularly Niels Olson, for internal re-use of de-identified data which contributed to the US external validation set. Sincere appreciation also goes to Sami Lachgar, Ashley Zlatinov, and Lauren Winer for their feedback on the blogpost.

Categories
Misc

Slow initialization of model with dynamic batch size in the C API

I’m experiencing very slow loading times (5 minutes) with an EfficientNet architecture I converted from PyTorch (PT -> ONNX_TF -> TF).

The problem is only present when I load the model with the `v1.compat` mode or with the C API, which is my final goal. However, it loads fast in the standard TF v2 mode in Python (< 1s). After loading the models, the inference seems equally fast and correct in all cases. I’m using `v1.compat` only for debugging, as it seems to behave similarly to the C API.

I’ve noticed that the issue disappears when I export the model from PyTorch with a fixed batch size of 1; however, I would prefer to have a dynamic batch size.

I created a topic in the TensorFlow forum with access to the models, and details to reproduce.

I’m looking for ideas on what could be the issue, and whether the resulting SavedModels could be modified so that loading is as fast in the C API as in TF v2.

submitted by /u/pablo_alonso

Categories
Misc

Is there a way to lock the seed so training a network will always return the same results?
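A minimal sketch of the usual approach, assuming TensorFlow 2.x (full determinism also depends on the specific ops, library versions, and hardware in use):

import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy RNG (used by many preprocessing utilities)
tf.random.set_seed(SEED)  # TensorFlow's global RNG (weight init, shuffling, dropout)

# TF 2.8+: request deterministic kernels; some GPU ops may still be nondeterministic or slower.
tf.config.experimental.enable_op_determinism()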

submitted by /u/Ninja181

Categories
Offsites

Guiding Frozen Language Models with Learned Soft Prompts

Large pre-trained language models, which are continuing to grow in size, achieve state-of-the-art results on many natural language processing (NLP) benchmarks. Since the development of GPT and BERT, standard practice has been to fine-tune models on downstream tasks, which involves adjusting every weight in the network (i.e., model tuning). However, as models become larger, storing and serving a tuned copy of the model for each downstream task becomes impractical.

An appealing alternative is to share across all downstream tasks a single frozen pre-trained language model, in which all weights are fixed. In an exciting development, GPT-3 showed convincingly that a frozen model can be conditioned to perform different tasks through "in-context" learning. With this approach, a user primes the model for a given task through prompt design, i.e., hand-crafting a text prompt with a description or examples of the task at hand. For instance, to condition a model for sentiment analysis, one could attach the prompt, "Is the following movie review positive or negative?" before the input sequence, "This movie was amazing!"

Sharing the same frozen model across tasks greatly simplifies serving and allows for efficient mixed-task inference, but unfortunately, this is at the expense of task performance. Text prompts require manual effort to design, and even well-designed prompts still far underperform compared to model tuning. For instance, the performance of a frozen GPT-3 175B parameter model on the SuperGLUE benchmark is 5 points below a fine-tuned T5 model that uses 800 times fewer parameters.

In "The Power of Scale for Parameter-Efficient Prompt Tuning", presented at EMNLP 2021, we explore prompt tuning, a more efficient and effective method for conditioning frozen models using tunable soft prompts. Just like engineered text prompts, soft prompts are concatenated to the input text. But rather than selecting from existing vocabulary items, the "tokens" of the soft prompt are learnable vectors. This means a soft prompt can be optimized end-to-end over a training dataset. In addition to removing the need for manual design, this allows the prompt to condense information from datasets containing thousands or millions of examples. By comparison, discrete text prompts are typically limited to under 50 examples due to constraints on model input length. We are also excited to release the code and checkpoints to fully reproduce our experiments.

Prompt tuning retains the strong task performance of model tuning, while keeping the pre-trained model frozen, enabling efficient multitask serving.

Prompt Tuning
To create a soft prompt for a given task, we first initialize the prompt as a fixed-length sequence of vectors (e.g., 20 tokens long). We attach these vectors to the beginning of each embedded input and feed the combined sequence into the model. The model’s prediction is compared to the target to calculate a loss, and the error is back-propagated to calculate gradients; however, we apply these gradient updates only to our new learnable vectors, keeping the core model frozen. While soft prompts learned in this way are not immediately interpretable, at an intuitive level the soft prompt is extracting evidence about how to perform a task from the labeled dataset, performing the same role as a manually written text prompt, but without the need to be constrained to discrete language.
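As a rough illustration of the mechanics (a sketch, not the paper's T5X implementation; the prompt length, embedding size, and layer names here are placeholders), a Keras layer can hold the learnable prompt vectors and prepend them to the embedded input while everything else stays frozen:

import tensorflow as tf

PROMPT_LEN = 20   # number of soft-prompt "tokens"
EMBED_DIM = 512   # must match the frozen model's embedding dimension

class SoftPrompt(tf.keras.layers.Layer):
    """Trainable prompt vectors prepended to the embedded input sequence."""
    def build(self, input_shape):
        self.prompt = self.add_weight(
            name="prompt", shape=(PROMPT_LEN, EMBED_DIM), trainable=True)

    def call(self, embedded_inputs):
        batch_size = tf.shape(embedded_inputs)[0]
        tiled = tf.repeat(self.prompt[tf.newaxis], batch_size, axis=0)
        return tf.concat([tiled, embedded_inputs], axis=1)

# Usage sketch: frozen_model is a hypothetical pre-trained network whose weights
# are frozen (frozen_model.trainable = False); during training, gradients flow
# only into the SoftPrompt weights:
#   embedded = frozen_embedding(token_ids)
#   outputs  = frozen_model(SoftPrompt()(embedded))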

Our codebase, implemented in the new JAX-based T5X framework, makes it easy for anyone to replicate this procedure, and provides practical hyperparameter settings, including a large learning rate (0.3), which we found was important for achieving good results.

Since soft prompts have a small parameter footprint (we train prompts with as few as 512 parameters), one can easily pass the model a different prompt along with each input example. This enables mixed-task inference batches, which can streamline serving by sharing one core model across many tasks.

Left: With model tuning, incoming data are routed to task-specific models. Right: With prompt tuning, examples and prompts from different tasks can flow through a single frozen model in large batches, better utilizing serving resources.

Improvement with Scale
When evaluated on SuperGLUE and using a frozen T5 model, prompt tuning significantly outperforms prompt design using either GPT-3 or T5. Furthermore, as model size increases, prompt tuning catches up to the performance level of model tuning. Intuitively, the larger the pre-trained model, the less of a "push" it needs to perform a specific task, and the more capable it is of being adapted in a parameter-efficient way.

As scale increases, prompt tuning matches model tuning, despite tuning 25,000 times fewer parameters.

The effectiveness of prompt tuning at large model scales is especially important, since serving separate copies of a large model can incur significant computational overhead. In our paper, we demonstrate that larger models can be conditioned successfully even with soft prompts as short as 5 tokens. For T5 XXL, this means tuning just 20 thousand parameters to guide the behavior of an 11 billion parameter model.

Resilience to Domain Shift
Another advantage of prompt tuning is its resilience to domain shift. Since model tuning touches every weight in the network, it has the capacity to easily overfit on the provided fine-tuning data and may not generalize well to variations in the task at inference time. By comparison, our learned soft prompts have a small number of parameters, so the solutions they represent may be more generalizable.

To test generalizability, we train prompt tuning and model tuning solutions on one task, and evaluate zero-shot on a closely related task. For example, when we train on the Quora Question Pairs task (i.e., detecting if two questions are duplicates) and evaluate on MRPC (i.e., detecting if two sentences from news articles are paraphrases), prompt tuning achieves +3.2 points higher accuracy than model tuning.

Train    Eval    Tuning    Accuracy       F1
QQP      MRPC    Model     73.1 ±0.9      81.2 ±2.1
QQP      MRPC    Prompt    76.3 ±0.1      84.3 ±0.3
MRPC     QQP     Model     74.9 ±1.3      70.9 ±1.2
MRPC     QQP     Prompt    75.4 ±0.8      69.7 ±0.3
On zero-shot domain transfer between two paraphrase detection tasks, prompt tuning matches or outperforms model tuning, depending on the direction of transfer.

Looking Forward
Prompt-based learning is an exciting new area that is quickly evolving. While several similar methods have been proposed, such as Prefix Tuning, WARP, and P-Tuning, we discuss their pros and cons and demonstrate that prompt tuning is the simplest and the most parameter-efficient method.

In addition to the Prompt Tuning codebase, we’ve also released our LM-adapted T5 checkpoints, which we found to be better-suited for prompt tuning compared to the original T5. This codebase was used for the prompt tuning experiments in FLAN, and the checkpoints were used as a starting point for training the BigScience T0 model. We hope that the research community continues to leverage and extend prompt tuning in future research.

Acknowledgements
This project was a collaboration between Brian Lester, Rami Al-Rfou and Noah Constant. We are grateful to the following people for feedback, discussion and assistance: Waleed Ammar, Lucas Dixon, Slav Petrov, Colin Raffel, Adam Roberts, Sebastian Ruder, Noam Shazeer, Tu Vu and Linting Xue.

Categories
Misc

Latest Releases and Resources: Feb. 3-10

Sharpen your conversational AI, vehicle routing, or CUDA Python skills; learn how Metropolis boosts go-to-market efforts; find solutions for AI inference deployment.

Our weekly roundup covers the most recent software updates, learning resources, events, and notable news.


Courses

Learn to Deploy a Text Classification Model Using Riva (DLI)

This free, 30-minute online course is self-paced and includes a sample notebook from the NGC TAO Toolkit - Conversational AI collection, complete with a live GPU environment.

Learn more: Deploy a Text Classification Model Using Riva

Optimized Vehicle Routing (DLI)

In this free one-hour course, participants will work through a demonstration of a common vehicle routing optimization problem at their own pace. Upon completion, participants will be able to preprocess input data for use by the NVIDIA ReOpt routing solver and compose variants of the problem that reflect real-world business constraints.

Register online: Optimized Vehicle Routing

Fundamentals of Accelerated Computing with CUDA Python (DLI)

This Deep Learning Institute workshop teaches you the fundamental tools and techniques for running GPU-accelerated Python applications using CUDA GPUs and the Numba compiler. The workshop is being offered Feb. 23 from 9 am to 5 pm PT.

At the conclusion of the workshop, you’ll have an understanding of the fundamental tools and techniques for GPU-accelerated Python applications with CUDA and Numba, including how to (see the short sketch after this list):

  • GPU-accelerate NumPy ufuncs with a few lines of code.
  • Configure code parallelization using the CUDA thread hierarchy.
  • Write custom CUDA device kernels for maximum performance and flexibility.
  • Use memory coalescing and on-device shared memory to increase CUDA kernel bandwidth.

Register online: Fundamentals of Accelerated Computing with CUDA Python


WebinarsΒ 

Learn How Metropolis Boosts Go-to-Market Efforts at a Developer Meetup

Join NVIDIA experts at developer meetups Feb. 16 and 17, and find out how the Metropolis program can grow your vision AI business and enhance go-to-market efforts.

Learn how:

  • Metropolis Validation Labs optimize your applications and accelerate deployments.
  • NVIDIA Fleet Command simplifies provisioning and management of edge deployments, accelerating the time to scale from POC to production.
  • NVIDIA LaunchPad provides easy access to GPU instances for faster POCs and customer trials.

Register online: How the NVIDIA Metropolis Program will Supercharge Your Business

A Flexible Solution for Every AI Inference Deployment

Dive into NVIDIA inference solutions, including open-source NVIDIA Triton Inference Server and NVIDIA TensorRT, with a webinar and live Q&A, Feb. 23 at 10 a.m. PT.

Learn how:

  • To optimize, deploy, and scale AI models in production using Triton Inference Server and TensorRT.
  • Triton streamlines inference serving across multiple frameworks, across different query types (real-time, batch, streaming), on CPUs and GPUs, and with a model analyzer for efficient deployment.
  • To standardize workflows to optimize models using TensorRT and framework integrations with PyTorch and TensorFlow.
  • Real-world customers are benefitting from Triton and TensorRT.

Register online: A Flexible Solution for Every AI Inference Deployment

Categories
Misc

Implementing High-Precision Decimal Arithmetic with CUDA int128

This post details CUDA’s new int128 support and how to implement decimal fixed-point arithmetic on top of it.

"Truth is much too complicated to allow anything but approximations." – John von Neumann

The history of computing has demonstrated that there is no limit to what can be achieved with the relatively simple arithmetic implemented in computer hardware. But the "truth" that computers represent using finite-size numbers is fundamentally approximate. As David Goldberg wrote, "Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation." Floating point is the most widely used representation of real numbers, implemented in many processors, including GPUs. It is popular due to its ability to represent a large dynamic range of values and to trade off range and precision.

Unfortunately, floating point’s flexibility and range can cause trouble in applications where accuracy within a fixed range is more important: think dollars and cents. Binary floating point numbers cannot exactly represent every decimal value, and their approximation and rounding can lead to accumulation of errors that may be unacceptable in accounting calculations, for example. Moreover, adding very large and very small floating-point numbers can result in truncation errors. For these reasons, many currency and accounting computations are implemented using fixed-point decimal arithmetic, which stores a fixed number of fractional digits. Depending on the range required, fixed-point arithmetic may need a larger number of bits.

NVIDIA GPUs do not implement fixed-point arithmetic in hardware, but a GPU-accelerated software implementation can be efficient. In fact, the RAPIDS cuDF library has provided efficient 32- and 64-bit fixed-point decimal numbers and computation for a while now. But some users of RAPIDS cuDF and GPU-accelerated Apache Spark need the higher range and precision provided by 128-bit decimals, and now NVIDIA CUDA 11.5 provides preview support of the 128-bit integer type (__int128) that is needed to implement 128-bit decimal arithmetic.

In this post, after introducing CUDA’s new int128 support, we detail how we implemented decimal fixed-point arithmetic on top of it. We then demonstrate how 128-bit fixed-point support in RAPIDS cuDF enables key Apache Spark workloads to run entirely on GPU.

Introducing CUDA __int128

In NVIDIA CUDA 11.5, the NVCC offline compiler has added preview support for the signed and unsigned __int128 data types on platforms where the host compiler supports it. The nvrtc JIT compiler has also added support for 128-bit integers, but requires a command-line option, --device-int128, to enable this support. Arithmetic, logical, and bitwise operations are all supported on 128-bit integers. Note that DWARF debug support for 128-bit integers is not available yet and will be available in a subsequent CUDA release. With the 11.6 release, cuda-gdb and Nsight Visual Studio Code Edition have added support for inspecting this new variable type.

NVIDIA GPUs compute integers in 32-bit quantities, so 128-bit integers are represented using four 32-bit unsigned integers. The addition, subtraction, and multiplication algorithms are straightforward and use the built-in PTX addc/madc instructions to handle multiple-precision values. Division and remainder are implemented using a simple O(n^2) division algorithm, similar to Algorithm 1.6 in Brent and Zimmermann’s book Modern Computer Arithmetic, with a few optimizations to improve the quotient selection step and minimize correction steps.

One of the motivating use cases for 128-bit integers is using them to implement decimal fixed-point arithmetic. 128-bit decimal fixed-point support is included in the 21.12 release of RAPIDS libcudf. Keep reading to find out more about fixed-point arithmetic and how __int128 is used to enable high-precision computation.

Fixed-point Arithmetic

Fixed-point numbers represent real numbers by storing a fixed number of digits for the fractional part. Fixed-point numbers can also be used to "omit" the lower-order digits of integer values (i.e. if you want to represent multiples of 1000, you can use a base-10 fixed-point number with scale equal to 3). One easy way to remember the difference between fixed-point and floating point is that with fixed-point numbers, the decimal "point" is fixed, whereas in floating-point numbers the decimal "point" can float (move).

The basic idea behind fixed-point numbers is that even though the values being represented can have fractional digits (aka the 0.23 in 1.23), you actually store the value as an integer. To represent 1.23, for example, you can construct a fixed_point number with scale = -2 and value 123. This representation can be converted to a floating point number by multiplying the value by the radix raised to the scale. So in our example, 1.23 is produced by multiplying 123 (value) by 0.01 (10 (radix) to the power of -2 (scale)). When constructing a fixed-point number, the opposite occurs and you "shift" the value so that you can store it as an integer: with the floating point number 1.23 and scale -2, you divide by 0.01 (10 (radix) to the power of -2 (scale)).
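To make the shifting concrete, here is a tiny language-agnostic illustration in Python (not the libcudf implementation; the names are illustrative), storing the value as an integer next to its scale:

RADIX = 10

def to_fixed(x, scale):
    # Shift out the fractional digits (scale is negative for fractional digits) and store an integer.
    return round(x * RADIX ** -scale)

def to_float(value, scale):
    # Recover the represented number by multiplying by the radix raised to the scale.
    return value * RADIX ** scale

stored = to_fixed(1.23, -2)    # stored integer: 123
approx = to_float(stored, -2)  # recovered value: approximately 1.23

# With matching scales, addition operates directly on the underlying integers:
total = to_fixed(1.1, -1) + to_fixed(1.1, -1)  # 11 + 11 = 22, representing 2.2 at scale -1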

Note that fixed-point representations are not unique because you can choose multiple scales. For the example of 1.23, you can use any scale less than -2, such as -3 or -4. The only difference is that the number stored on disk will be different; 123 for scale -2, 1230 for scale -3 and 12300 for scale -4. When you know that your use case only requires a set number of decimal places, you should use the least precise (aka largest) scale possible to maximize the range of representable values. With scale -2 the range is roughly -20 to +20 million (with two decimal places), whereas with scale -3 the range is roughly -2 to +2 million (with three decimal places). If you know you are modeling money and you don’t need three decimal places, scale -2 is a much better option.

Another parameter of a fixed-point type is the base. In the examples in this post, and in RAPIDS cuDF, we use base 10, or decimal fixed point. Decimal fixed point is the easiest to think about because we are comfortable with decimal (base 10) numbers. The other common base for fixed-point numbers is base 2, also known as binary fixed point. This simply means that instead of shifting value by powers of 10, the scale shifts a value by powers of 2. You can see some examples of binary and decimal fixed-point values later in the "Examples" section.

Fixed point versus floating point

Fixed Point                                                          Floating Point
Narrower, static range                                               Wider, dynamic range
Exact representation avoids certain truncation and rounding errors   Approximate representation leads to certain truncation and rounding errors
Keeps absolute error constant                                        Keeps relative error constant
Table 1: Comparison of floating point and fixed point.

Absolute error is the difference between the real value and its computer representation (in either fixed or floating point). Relative error is the ratio of the absolute error to the represented value.

To demonstrate issues with floating point representations that fixed point can address, let’s look at exactly how floating point is represented. A floating point number cannot represent all values exactly. For instance, the closest 32-bit floating point number to value 1.1 is 1.10000002384185791016 (see float.exposed to visualize this). The trailing "imprecision" can lead to errors when performing arithmetic operations. For example, 1.1 + 1.1 yields 2.20000004768371582031.

Figure 1: Visualization of 1.1 in floating-point.
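These numbers are easy to reproduce; a quick check using NumPy's float32, matching the values above:

import numpy as np

x = np.float32(1.1)
print(f"{x:.20f}")      # 1.10000002384185791016
print(f"{x + x:.20f}")  # 2.20000004768371582031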

In contrast, when using fixed-point representations, an integer is used to store the exact value. To represent 1.1 using a fixed-point number with a scale equal to -1, the value 11 is stored. Arithmetic operations are performed on the underlying integer, so adding 1.1 + 1.1 as fixed-point numbers simply adds 11 + 11, yielding 22, which represents the value 2.2 exactly.

Why is fixed-point arithmetic important?

As shown in the example preceding, fixed-point arithmetic avoids the precision and rounding errors inherent in floating point numbers while still providing the ability to represent fractional digits. Floating point provides a much larger range of values by keeping relative error constant. However, this means it can suffer from large absolute (truncation) errors when adding very large and very small numbers and run into rounding errors. Fixed-point representation always has the same absolute error at the cost of being able to represent a reduced range of values. If you know you need a specific precision after the decimal/binary point, then fixed point allows you to maintain accuracy of those digits without truncation even as the value grows, up to the limits of the range. If you need more range, you have to add more bits. Hence decimal128 becomes important for some users.

            Lower Bound                                  Upper Bound
decimal32   -21474836.48                                 21474836.47
decimal64   -92233720368547758.08                        92233720368547758.07
decimal128  -1701411834604692317316873037158841057.28    1701411834604692317316873037158841057.27
Table 2: Ranges for decimal32, decimal64, and decimal128 with scale = -2.

There are many applications and use cases for fixed_point numbers. You can find a list of actual applications that use fixed_point numbers on Wikipedia.

fixed_point in RAPIDS libcudf

Overview

The core of the RAPIDS libcudf `fixed_point` type is a simple class template.

template <typename Rep, Radix Rad>
class fixed_point {
  Rep _value;
  scale_type _scale;
};

The fixed_point class is templated on:

  • Rep: the representation of the fixed_point number (for example, the integer type used)
  • Rad: the Radix of the number (for example, base 2 or base 10)

The decimal32 and decimal64 types use int32_t and int64_t for the Rep, respectively, and both have Radix::BASE_10. The scale is a strongly typed run-time variable (see the Run-Time Scale and Strong Typing subsections below).

The fixed_point type has several constructors (see the Constructors subsection below), explicit conversion operators to cast to both integral and floating point types, and a full complement of operators (addition, subtraction, etc.).

Sign of Scale

In most C++ fixed-point implementations (including RAPIDS libcudf’s), a negative scale indicates the number of fractional digits. A positive scale indicates the multiple that is representable (for example, if scale = 2 for a decimal32, multiples of 100 can be represented).

auto const number_with_neg_scale = decimal32{1.2345, scale_type{-2}}; // 1.23
auto const number_with_pos_scale = decimal32{12345,  scale_type{2}};  // 12300

Constructors

The following (simplified) constructors are provided in libcudf:

fixed_point(T const& value, scale_type const& scale)
fixed_point(scaled_integer<Rep> s)  // already "shifted" value
fixed_point(T const& value)         // scale = 0
fixed_point()                       // scale = 0, value = 0

Where Rep is a signed integer type and T can be either an integral type or a floating-point type.

Design and motivation

There are a number of design goals for libcudf’s fixed_point type. These include:

  • Need for a run-time scale
  • Consistency with potential standard C++ fixed_point types
  • Strong typing

These design motivations are detailed below.

Run-time scale and third-party fixed-point libraries

We studied eight existing fixed-point C++ libraries during the design phase. The primary reason for not using a third-party library is that all of the existing fixed-point types/libraries are designed with the scale being a compile-time parameter. This does not work for RAPIDS libcudf as it needs scale to be a run-time parameter.

While RAPIDS libcudf is a C++ library that can be used in C++ applications, it is also the backend for RAPIDS cuDF, which is a Python library. Python is an interpreted (rather than compiled, like C++) language. Moreover, cuDF must be able to read or receive fixed-point data from other data sources. This means that we do not know the scale of the fixed-point values at compile time. Therefore, the fixed_point type in RAPIDS libcudf needs to have a run-time scale parameter.

The main library we referenced was CNL, the Compositional Numeric Library by John McFarlane that is currently the reference for an ISO C++ proposal to add fixed-point types to the C++ standard. We aim for the RAPIDS libcudf fixed_point type to be as similar as possible to the potentially standardized type. Here’s a simple comparison between RAPIDS libcudf and CNL.

CNL (Godbolt Link)

using namespace cnl;
auto x = fixed_point<int32_t, -2>{1.23};

RAPIDS libcudf

using namespace numeric;
auto x = fixed_point<int32_t, Radix::BASE_10>{1.23, scale_type{-2}};

Or alternatively:

using namespace numeric;
auto x = decimal32{1.23, scale_type{-2}};

The most important difference to notice here is the -2 as a template (aka compile-time parameter) in the CNL example versus the scale_type{-2} as a run-time parameter in the RAPIDS libcudf example.

Strong typing

Strong typing has been incorporated into the design of the fixed_point type; two examples are the strongly typed scale_type used for the run-time scale and the scaled_integer type used in the constructors above.

RAPIDS libcudf adheres to strong typing best practices and strongly typed APIs because of the protection and expressivity strong typing provides. I won’t go down the rabbit hole of weak versus strong typing here, but if you would like to read more about it there are many great resources, including Jonathan Boccara’s Fluent C++ post on how typed C++ is, and why it matters.

Adding support for decimal128

RAPIDS libcudf 21.12 adds decimal128 as a supported fixed_point type. This required a number of changes, the first being the addition of the decimal128 type alias that relies on the __int128 type provided by CUDA 11.5.

using decimal32  = fixed_point<int32_t, Radix::BASE_10>;
using decimal64  = fixed_point<int64_t, Radix::BASE_10>;
using decimal128 = fixed_point<__int128_t, Radix::BASE_10>;

This required a number of internal changes, including updating type traits functions, __int128_t specializations for certain functions, and adding support so that cudf::column_view and friends work with decimal128. The following simple examples demonstrate the use of libcudf APIs with decimal128 (note, all of these examples work the same for decimal32 and decimal64).

Examples

Simple currency

A simple currency example uses the decimal32 type provided by libcudf with scale -2 to represent exactly $17.29:

auto const money = decimal32{17.29, scale_type{-2}};

Summing large and small numbers

Fixed point is very useful when summing both large and small values. As a simple toy example, the following piece of code sums the powers of 10 from exponent -2 to 9.

template <typename T>
auto sum_powers_of_10() {
    auto iota = std::views::iota(-2, 10);
    return std::transform_reduce(
        iota.begin(), iota.end(), 
        T{}, std::plus{}, 
        [](auto e) -> T { return std::pow(10, e); });
}

Comparing 32-bit floating-point and decimal fixed point shows the following results:

sum_powers_of_10<float>();        // 1111111168.000000
sum_powers_of_10<decimal_type>(); // 1111111111.11

Where decimal_type is a 32-bit base-10 fixed-point type. You can see an example of this using the CNL library on Godbolt here.

Avoiding floating-point rounding issues

An example of where floating-point values run into issues (in C++) is the following piece of code (see in Godbolt):

std::cout << std::roundf(256.49999f) << std::endl; // prints 257, not the expected 256

The equivalent code in RAPIDS libcudf does not have the same issue (see on GitHub):

auto col    = // decimal32 column with scale -5 and value 256.49999
auto result = cudf::round(col); // result is 256

The value of 256.49999 is not representable with a 32-bit binary float and therefore rounds to 256.5 before the std::roundf function is called. This problem can be avoided with a fixed-point representation because 256.49999 is representable with any base-10 (decimal) type that has five or more fractional digits of precision.

Binary versus decimal fixed point

// Decimal Fixed Point
using decimal32 = fixed_point<int32_t, Radix::BASE_10>;
auto const a    = decimal32{17.29, scale_type{-2}};  // 17.29
auto const b    = decimal32{4.2,   scale_type{ 0}};  // 4
auto const c    = decimal32{1729,  scale_type{ 2}};  // 1700

// Binary Fixed Point
using binary32  = fixed_point<int32_t, Radix::BASE_2>;
auto const x    = binary32{17.29, scale_type{-2}};  // 17.25
auto const y    = binary32{4.2,   scale_type{ 0}};  // 4
auto const z    = binary32{1729,  scale_type{ 2}};  // 1728

decimal128

// Multiplying two decimal128 numbers
auto const x = decimal128{1.1, scale_type{-1}};
auto const y = decimal128{2.2, scale_type{-1}};
auto const z = x * y;  // 2.42 with scale equal to -2

// Adding two decimal128 numbers
auto const x = decimal128{1.1, scale_type{-1}};
auto const y = decimal128{2.2, scale_type{-1}};
auto const z = x + y;  // 3.3 with scale equal to -1

DecimalType in RAPIDS Spark

DecimalType in Apache Spark SQL is a data type that can represent Java BigDecimal values. SQL queries operating on financial data make significant use of the decimal type. Unlike the RAPIDS libcudf implementation of fixed-point decimal numbers, the maximum precision possible for DecimalType in Spark is limited to 38 digits. The scale, which is defined as the number of digits after the decimal point, is also capped at 38. This definition is the negative of the C++ scale. For example, a decimal value like 0.12345 has a scale of 5 in Spark but a scale of -5 in libcudf.

Spark closely follows the Apache Hive specification on precision calculations for operations and provides options to the user to configure precision loss for decimal operations. Spark SQL is aggressive about promoting the precision of the result column when performing operations like aggregation, windowing, casting, and so on. This behavior in and of itself is what makes decimal128 extremely relevant to real-world queries and answers the question: "Why do we need support for high-precision decimal columns?" Consider the example below, specifically the hash aggregate, which has a multiplication expression involving a decimal64 column, price, and a non-decimal column, quantity. Spark first casts the non-decimal column to an appropriate decimal column. It then determines the result precision, which is greater than the input precision. Therefore, it is quite common for the result to be a decimal128 value even if decimal64 inputs are involved.

scala> val queryDfGpu = readPar.agg(sum('price*'quantity))
queryDfGpu: org.apache.spark.sql.DataFrame = [sum((price * quantity)): decimal(32,2)]

scala> queryDfGpu.explain
== Physical Plan ==
*(2) HashAggregate(keys=[], 
functions=[sum(CheckOverflow((promote_precision(cast(price#19 as decimal(12,2))) * promote_precision(cast(cast(quantity#20 as decimal(10,0)) as decimal(12,2)))), DecimalType(22,2), true))])
+- Exchange SinglePartition, true, [id=#429]
   +- *(1) HashAggregate(keys=[], 
functions=[partial_sum(CheckOverflow((promote_precision(cast(price#19 as decimal(12,2))) * promote_precision(cast(cast(quantity#20 as decimal(10,0)) as decimal(12,2)))), DecimalType(22,2), true))])
  	+- *(1) ColumnarToRow
     	+- FileScan parquet [price#19,quantity#20] Batched: true,DataFilters: 
[], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/dec_walmart.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct

With the introduction of the new decimal128 data type in libcudf, the RAPIDS plug-in for Spark is able to leverage higher precision and keep computation on the GPU where previously it needed to fall back to the CPU.

As an example, let’s look at a simple query that operates on the following schema.

{
    id   	:   IntegerType   	// Unique ID
    prodName :   StringType    	// Product name will be used to aggregate / partition
    price	:   DecimalType(11,2)   // Decimal64
    quantity :   IntegerType   	// Quantity of product
    
}

This query computes the unbounded window over totalCost, which is the sum(price*quantity). It then groups the result by the prodName after a sort and returns the minimum totalCost.

// Run window operation
val byProdName = Window.partitionBy('prodName)
val queryDfGpu = readPar
  .withColumn("totalCost", sum('price * 'quantity) over byProdName)
  .sort("prodName")
  .groupBy("prodName")
  .min("totalCost")

The RAPIDS Spark plug-in is set up to run operators on the GPU only if all the expressions can be evaluated on the GPU. Let’s first look at the following physical plan for this query without decimal128 support.

Without decimal128 support, every operator falls back to the CPU because child expressions that contain a decimal128 type cannot be supported. Therefore, the containing exec or parent expression will also not execute on the GPU, to avoid inefficient row-to-column and column-to-row conversions.

== Physical Plan ==
*(3) HashAggregate(keys=[prodName#18], functions=[min(totalCost#66)])
+- *(3) HashAggregate(keys=[prodName#18], 
functions=[partial_min(totalCost#66)])
   +- *(3) Project [prodName#18, totalCost#66]
  	+- Window [sum(_w0#67) windowspecdefinition(prodName#18, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS totalCost#66], [prodName#18]
     	+- *(2) Sort [prodName#18 ASC NULLS FIRST], false, 0
        	+- Exchange hashpartitioning(prodName#18, 1), true, [id=#169]
           	+- *(1) Project [prodName#18, 
CheckOverflow((promote_precision(cast(price#19 as decimal(12,2))) * 
promote_precision(cast(cast(quantity#20 as decimal(10,0)) as 
decimal(12,2)))), DecimalType(22,2), true) AS _w0#67]
              	+- *(1) ColumnarToRow
                +- FileScan parquet [prodName#18,price#19,quantity#20] 
Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/dec_walmart.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct

The query plan after enabling decimal128 support shows that all the operations can now run on the GPU. The absence of ColumnarToRow and RowToColumnar transitions (which show up for the collect operation in the query) enables better performance by running the entire query on the GPU.

== Physical Plan ==
GpuColumnarToRow false
+- GpuHashAggregate(keys=[prodName#18], functions=[gpumin(totalCost#31)]),
filters=ArrayBuffer(None))
   +- GpuHashAggregate(keys=[prodName#18], 
functions=[partial_gpumin(totalCost#31)]), filters=ArrayBuffer(None))
  	+- GpuProject [prodName#18, totalCost#31]
     	+- GpuWindow [prodName#18, _w0#32, gpusum(_w0#32, DecimalType(32,2)) gpuwindowspecdefinition(prodName#18, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS totalCost#31], [prodName#18]
            +- GpuCoalesceBatches batchedbykey(prodName#18 ASC NULLS FIRST)
            +- GpuSort [prodName#18 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@3204b591
              	+- GpuShuffleCoalesce 2147483647
                +- GpuColumnarExchange gpuhashpartitioning(prodName#18, 1),
 true, [id=#57]
                    	+- GpuProject [prodName#18, 
gpucheckoverflow((gpupromoteprecision(cast(price#19 as decimal(12,2))) * gpupromoteprecision(cast(cast(quantity#20 as decimal(10,0)) as 
decimal(12,2)))), DecimalType(22,2), true) AS _w0#32]
                       	+- GpuFileGpuScan parquet 
[prodName#18,price#19,quantity#20] Batched: true, DataFilters: [], Format: 
Parquet, Location: InMemoryFileIndex[file:/tmp/dec_walmart.parquet], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

For the multiplication operation, the quantity column is cast to decimal64 (precision = 10) and the price column, which is already of type decimal64, is cast up to a precision of 12, making both columns the same type. The result column is sized to a precision of 22, which is of type decimal128 since the precision is greater than 18. This is shown in the GpuProject node of the plan above.

The window operation over the sum() also promotes the precision further to 32.

We use NVIDIA Decision Support (NDS), an adaptation of the TPC-DS data science benchmark often used by Spark customers and providers, to measure speedup. NDS consists of the same 100+ SQL queries as the industry standard benchmark but has modified parts for dataset generation and execution scripts. Results from NDS are not comparable to TPC-DS.

Preliminary runs of a subset of NDS queries demonstrate significant performance improvement due to decimal128 support, as shown in the following graph. These were run on a cluster of eight nodes, each with one A100 GPU and 1024 CPU cores, running executors with 16 cores on Spark 3.1.1. Each executor uses 240 GiB of memory. The queries show excellent speedups of nearly 8x, which can be attributed to operations that were previously falling back to the CPU now running on the GPU, thereby avoiding row-to-column and column-to-row transitions and other associated overheads. On average, the end-to-end run time of all the NDS queries shows a 2x improvement. This is (hopefully) just the beginning!

Figure 2: Performance evaluation of a subset of NDS queries.

With the 21.12 release of the RAPIDS plug-in for Spark, decimal128 support is available for the majority of operators. Some special handling of overflow conditions to maintain result compatibility between CPU and GPU is necessary. The ultimate goal of this effort is to allow retail and financial queries to fully benefit from GPU acceleration through the RAPIDS for Spark Plugin.

Summary

fixed_point types in RAPIDS libcudf, the addition of DecimalType, and decimal128 support in the RAPIDS plug-in for Spark enable exciting use cases that were previously only possible on the CPU to now run on the GPU. If you want to get started with RAPIDS libcudf or the RAPIDS plug-in for Spark, visit their respective project pages.

Categories
Misc

Play PC Games on Your Phone With GeForce NOW This GFN Thursday

Who says you have to put your play on pause just because you’re not at your PC? This GFN Thursday takes a look at how GeForce NOW makes PC gaming possible on Android and iOS mobile devices to support gamers on the go. This week also comes with sweet in-game rewards for members playing Eternal.

The post Play PC Games on Your Phone With GeForce NOW This GFN Thursday appeared first on The Official NVIDIA Blog.