
Constructing Transformers For Longer Sequences with Sparse Attention Methods

Natural language processing (NLP) models based on Transformers, such as BERT, RoBERTa, T5, or GPT-3, are successful for a wide variety of tasks and a mainstay of modern NLP research. The versatility and robustness of Transformers are the primary drivers behind their wide-scale adoption, allowing them to be easily adapted for a diverse range of sequence-based tasks: as a seq2seq model for translation, summarization, generation, and others, or as a standalone encoder for sentiment analysis, POS tagging, machine reading comprehension, etc. The key innovation in Transformers is the introduction of a self-attention mechanism, which computes similarity scores for all pairs of positions in an input sequence and can be evaluated in parallel for each token of the input sequence. This avoids the sequential dependency of recurrent neural networks and enables Transformers to vastly outperform previous sequence models like LSTMs.

A limitation of existing Transformer models and their derivatives, however, is that the full self-attention mechanism has computational and memory requirements that are quadratic in the input sequence length. With commonly available current hardware and model sizes, this typically limits the input sequence to roughly 512 tokens, and prevents Transformers from being directly applicable to tasks that require larger context, like question answering, document summarization or genome fragment classification. Two natural questions arise: 1) Can we achieve the empirical benefits of quadratic full Transformers using sparse models with computational and memory requirements that scale linearly with the input sequence length? 2) Is it possible to show theoretically that these linear Transformers preserve the expressivity and flexibility of the quadratic full Transformers?

We address both of these questions in a recent pair of papers. In “ETC: Encoding Long and Structured Inputs in Transformers”, presented at EMNLP 2020, we present the Extended Transformer Construction (ETC), a novel method for sparse attention that uses structural information to limit the number of computed pairs of similarity scores. This reduces the quadratic dependency on input length to linear and yields strong empirical results in the NLP domain. Then, in “Big Bird: Transformers for Longer Sequences”, presented at NeurIPS 2020, we introduce another sparse attention method, called BigBird, which extends ETC to more generic scenarios where prerequisite domain knowledge about structure present in the source data may be unavailable. Moreover, we show theoretically that our proposed sparse attention mechanism preserves the expressivity and flexibility of the quadratic full Transformers. Our proposed methods achieve a new state of the art on challenging long-sequence tasks, including question answering, document summarization and genome fragment classification.

Attention as a Graph
The attention module used in Transformer models computes similarity scores for all pairs of positions in an input sequence. It is useful to think of the attention mechanism as a directed graph, with tokens represented by nodes and the similarity score computed between a pair of tokens represented by an edge. In this view, the full attention model is a complete graph. The core idea behind our approach is to carefully design sparse graphs, such that one only computes a linear number of similarity scores.

Full attention can be viewed as a complete graph.

Extended Transformer Construction (ETC)
For NLP tasks that require long and structured inputs, we propose a structured sparse attention mechanism, which we call Extended Transformer Construction (ETC). To achieve structured sparsification of self-attention, we developed the global-local attention mechanism. Here the input to the Transformer is split into two parts: a global input where tokens have unrestricted attention, and a long input where tokens can only attend to either the global input or to a local neighborhood. This achieves linear scaling of attention, which allows ETC to significantly scale input length.
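
To make the linear scaling concrete, here is a minimal base R sketch of a global-local attention mask. It is an illustration under stated assumptions, not the ETC implementation: the helper names `global_local_mask()` and `count_scores()`, the two global tokens, and the neighborhood radius of 3 are all hypothetical choices made for this example.

```r
# Hypothetical helper: build a boolean attention mask for a global-local pattern.
# Rows are queries, columns are keys; TRUE means the similarity score is computed.
global_local_mask <- function(n_global, n_long, radius) {
  n <- n_global + n_long
  mask <- matrix(FALSE, nrow = n, ncol = n)
  g <- seq_len(n_global)
  # Global tokens have unrestricted attention, and every token may attend to them.
  mask[g, ] <- TRUE
  mask[, g] <- TRUE
  # Long-input tokens additionally attend to a local neighborhood of the long input.
  long_idx <- n_global + seq_len(n_long)
  pos <- seq_len(n_long)
  mask[long_idx, long_idx] <- abs(outer(pos, pos, "-")) <= radius
  mask
}

# Number of similarity scores computed for a given long-input length.
count_scores <- function(n_long) {
  sum(global_local_mask(n_global = 2, n_long = n_long, radius = 3))
}

n_long <- c(100, 200, 400)
counts <- sapply(n_long, count_scores)
full <- (2 + n_long)^2

# Doubling the long input roughly doubles the sparse count,
# while the full-attention count roughly quadruples.
print(rbind(n_long = n_long, sparse = counts, full = full))
```

With these settings the sparse count grows by a fixed amount per additional long-input token, which is what linear scaling means here.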

In order to further exploit the structure of long documents, ETC combines additional ideas: representing the positional information of the tokens in a relative way, rather than using their absolute position in the sequence; using an additional training objective beyond the usual masked language model (MLM) used in models like BERT; and flexible masking of tokens to control which tokens can attend to which other tokens. For example, given a long selection of text, a global token is applied to each sentence, which connects to all tokens within the sentence, and a global token is also applied to each paragraph, which connects to all tokens within the same paragraph.

An example of document-structure-based sparse attention in the ETC model. The global variables are denoted by C (in blue) for paragraphs and S (yellow) for sentences, while the local variables are denoted by X (grey) for tokens corresponding to the long input.

With this approach, we report state-of-the-art results on five challenging NLP datasets requiring long or structured inputs: TriviaQA, Natural Questions (NQ), HotpotQA, WikiHop, and OpenKP.

Test set result on Question Answering. For both verified TriviaQA and WikiHop, using ETC achieved a new state of the art.

BigBird
Extending the work of ETC, we propose BigBird, a sparse attention mechanism that is also linear in the number of tokens and is a generic replacement for the attention mechanism used in Transformers. In contrast to ETC, BigBird doesn’t require any prerequisite knowledge about structure present in the source data. Sparse attention in the BigBird model consists of three main parts:

  • A set of global tokens attending to all parts of the input sequence
  • All tokens attending to a set of local neighboring tokens
  • All tokens attending to a set of random tokens

BigBird sparse attention can be seen as adding a few global tokens to a Watts-Strogatz graph.
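
The three components can be sketched as a boolean mask in base R. This is only an illustration: the helper name `bigbird_mask()`, its parameters, and the concrete sizes are assumptions made for this example, and the mask works on individual tokens rather than on blocks.

```r
# Hypothetical helper: boolean mask combining global, local, and random attention.
# Rows are queries, columns are keys; TRUE means the similarity score is computed.
bigbird_mask <- function(n, n_global, radius, n_random, seed = 1) {
  set.seed(seed)
  pos <- seq_len(n)
  # Local: each token attends to neighbors within `radius` positions.
  mask <- abs(outer(pos, pos, "-")) <= radius
  # Global: the first `n_global` tokens attend everywhere and are attended to by all.
  g <- seq_len(n_global)
  mask[g, ] <- TRUE
  mask[, g] <- TRUE
  # Random: each token attends to `n_random` keys drawn uniformly at random.
  for (i in pos) mask[i, sample.int(n, n_random)] <- TRUE
  mask
}

m <- bigbird_mask(n = 512, n_global = 2, radius = 3, n_random = 3)
cat("scores computed:", sum(m), "of", length(m), "\n")
```

Every non-global row has at most (2 * radius + 1) + n_global + n_random entries set, so the total number of computed scores grows linearly with the sequence length.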

In the BigBird paper, we explain why sparse attention is sufficient to approximate quadratic attention, partially explaining why ETC was successful. A crucial observation is that there is an inherent tension between how few similarity scores one computes and the flow of information between different nodes (i.e., the ability of tokens to influence one another). Global tokens serve as a conduit for information flow, and we prove that sparse attention mechanisms with global tokens can be as powerful as the full attention model. In particular, we show that BigBird is as expressive as the original Transformer, is computationally universal (following the work of Yun et al. and Perez et al.), and is a universal approximator of continuous functions. Furthermore, our proof suggests that the use of random graphs can further help ease the flow of information, which motivates the random attention component.

This design scales to much longer sequence lengths for both structured and unstructured tasks. Further scaling can be achieved with gradient checkpointing, which trades training time for sequence length. This lets us extend our efficient sparse transformers to include generative tasks that require an encoder and a decoder, such as long document summarization, on which we achieve a new state of the art.

Summarization ROUGE score for long documents. Both for BigPatent and ArXiv datasets, we achieve a new state of the art result.

Moreover, the fact that BigBird is a generic replacement also allows it to be extended to new domains without pre-existing domain knowledge. In particular, we introduce a novel application of Transformer-based models where long contexts are beneficial — extracting contextual representations of genomic sequences (DNA). With longer masked language model pre-training, BigBird achieves state-of-the-art performance on downstream tasks, such as promoter-region prediction and chromatin profile prediction.

On multiple genomics tasks, such as promoter region prediction (PRP) and chromatin-profile prediction, including transcription factor (TF), histone-mark (HM) and DNase I hypersensitive (DHS) detection, we outperform baselines. Moreover, our results show that Transformer models can be applied to multiple genomics tasks that are currently underexplored.

Main Implementation Idea
One of the main impediments to the large-scale adoption of sparse attention is the fact that sparse operations are quite inefficient on modern hardware. Behind both ETC and BigBird, one of our key innovations is an efficient implementation of the sparse attention mechanism. As modern hardware accelerators like GPUs and TPUs excel at coalesced memory operations, which load blocks of contiguous bytes at once, it is not efficient to have small sporadic look-ups caused by a sliding window (for local attention) or random element queries (random attention). Instead, we transform the sparse local and random attention into dense tensor operations to take full advantage of modern single instruction, multiple data (SIMD) hardware.

To do this, we first “blockify” the attention mechanism to better leverage GPUs/TPUs, which are designed to operate on blocks. Then we convert the sparse attention mechanism computation into a dense tensor product through a series of simple matrix operations such as reshape, roll, and gather, as illustrated in the animation below.

Illustration of how sparse window attention is efficiently computed using roll and reshape, and without small sporadic look-ups.
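
The roll idea can be sketched in base R. This is a simplified illustration, not the released implementation: it works on single tokens instead of blocks, it wraps around at the sequence edges, and `roll_rows()`, `key_idx()`, and the sizes are hypothetical names and values chosen for this example. The sketch checks that the dense, roll-based computation reproduces the corresponding entries of the full quadratic score matrix.

```r
# Circularly shift the rows of a matrix so that row i of the result is row i + shift.
roll_rows <- function(x, shift) {
  n <- nrow(x)
  x[((seq_len(n) - 1 + shift) %% n) + 1, , drop = FALSE]
}

set.seed(42)
n <- 8       # sequence length
d <- 4       # embedding size
radius <- 1  # window radius
Q <- matrix(rnorm(n * d), n, d)
K <- matrix(rnorm(n * d), n, d)

# Dense route: one elementwise product and row sum per window offset.
# Column j holds the score between query i and key i + offsets[j] (wrapping around).
offsets <- -radius:radius
window_scores <- sapply(offsets, function(s) rowSums(Q * roll_rows(K, s)))

# Reference: the full quadratic score matrix, from which we pick the same entries.
full_scores <- Q %*% t(K)
key_idx <- function(i, s) ((i - 1 + s) %% n) + 1
reference <- sapply(
  offsets,
  function(s) full_scores[cbind(seq_len(n), key_idx(seq_len(n), s))]
)

print(all.equal(window_scores, reference))
```

Only n * (2 * radius + 1) scores are computed on the dense route, and each step touches contiguous rows rather than scattered individual elements.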

Recently, “Long Range Arena: A Benchmark for Efficient Transformers” provided a benchmark of six tasks that require longer context, and performed experiments to benchmark all existing long range transformers. The results show that the BigBird model, unlike its counterparts, clearly reduces memory consumption without sacrificing performance.

We show that carefully designed sparse attention can be as expressive and flexible as the original full attention model. Along with theoretical guarantees, we provide a very efficient implementation which allows us to scale to much longer inputs. As a consequence, we achieve state-of-the-art results for question answering, document summarization and genome fragment classification. Given the generic nature of our sparse attention, the approach should be applicable to many other tasks like program synthesis and long form open domain question answering. We have open sourced the code for both ETC (github) and BigBird (github), both of which run efficiently for long sequences on both GPUs and TPUs.

This research resulted from a collaboration with Amr Ahmed, Joshua Ainslie, Chris Alberti, Vaclav Cvicek, Avinava Dubey, Zachary Fisher, Guru Guruganesh, Santiago Ontañón, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang, and Manzil Zaheer, who co-authored the EMNLP and NeurIPS papers.


GTC 21: Top 5 Professional Visualization Sessions

This year at GTC we have several sessions for professional content creators looking to take advantage of the latest NVIDIA technology to enable the creation of beautiful worlds quicker and easier than ever before. Find out how to add life-like realism to your projects with new ray tracing features and faster real-time volumetric rendering and simulation. Also, learn how to collaborate with creators around the world seamlessly and effortlessly no matter what software you use. We have hundreds of sessions on graphics, simulation, and design to choose from. Registration is free.

These are the five graphics sessions you can’t miss: 

Image courtesy of Adobe Substance and Zhelongxu

What’s New in OptiX
Catch up with the latest additions to the OptiX SDK and learn tips and tricks on how best to implement them into your products.

NanoVDB: A GPU-Friendly and Portable VDB Data Structure for Real-Time Rendering and Simulation
Learn how NanoVDB accelerates real-time rendering and simulation of the most graphically intensive volumetric effects on NVIDIA GPUs.

An Overview of NVIDIA CloudXR
Learn all about NVIDIA CloudXR, a groundbreaking innovation for streaming VR and AR from any OpenVR application on a remote server to a client device. Get details on the architecture of the CloudXR software stack and explore the use cases.

Building Omniverse Kit Apps and Extensions
Learn how to leverage Omniverse Kit to build amazing applications and extensions.

Making a Connector for Omniverse
Learn how to connect with the Omniverse platform, send data to it, and establish a live sync session. There will also be a USD 101 overview to get you started.

Register for free and check out GTC sessions that dive into the latest technologies for graphics and simulation. A quick registration is required to view the GTC catalog with over 1,400 free sessions covering XR, graphics, simulation, design, and more.


Question about using tf.stop_gradient in separate Actor-Critic networks for A2C implementation for TF2

submitted by /u/AvisekEECS


AI wakeword dataset single word audiofile chess

Hi, I would like to make a simple AI that can detect a hotword/wakeword, and I need a lot of short audio files with the names of chess pieces. How can I get a dataset like that, or is there another open-source hotword/wakeword detector? (STT would be too weak, and the common wakeword mechanisms are deprecated or commercial ;( )

submitted by /u/Feeling_Wait_7132


sparklyr 1.6: weighted quantile summaries, power iteration clustering, spark_write_rds(), and more

Sparklyr 1.6 is now available on CRAN!

To install sparklyr 1.6 from CRAN, run

install.packages("sparklyr")

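In this blog post, we shall highlight the following features and enhancements from sparklyr 1.6:

  • Weighted quantile summaries
  • Power iteration clustering
  • spark_write_rds() + collect_from_rds()
  • Dplyr-related improvements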

Weighted quantile summaries

Apache Spark is well-known for supporting approximate algorithms that trade off marginal amounts of accuracy for greater speed and parallelism. Such algorithms are particularly beneficial for performing preliminary data explorations at scale, as they enable users to quickly query certain estimated statistics within a predefined error margin, while avoiding the high cost of exact computations. One example is the Greenwald-Khanna algorithm for on-line computation of quantile summaries, as described in Greenwald and Khanna (2001). This algorithm was originally designed for efficient \(\epsilon\)-approximation of quantiles within a large dataset without the notion of data points carrying different weights, and the unweighted version of it has been implemented as approxQuantile() since Spark 2.0. However, the same algorithm can be generalized to handle weighted inputs, and as sparklyr user @Zhuk66 mentioned in this issue, a weighted version of this algorithm makes for a useful sparklyr feature.

To properly explain what weighted-quantile means, we must clarify what the weight of each data point signifies. For example, if we have a sequence of observations \((1, 1, 1, 1, 0, 2, -1, -1)\), and would like to approximate the median of all data points, then we have the following two options:

  • Either run the unweighted version of approxQuantile() in Spark to scan through all 8 data points

  • Or alternatively, “compress” the data into 4 tuples of (value, weight): \((1, 0.5), (0, 0.125), (2, 0.125), (-1, 0.25)\), where the second component of each tuple represents how often a value occurs relative to the rest of the observed values, and then find the median by scanning through the 4 tuples using the weighted version of the Greenwald-Khanna algorithm
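
The equivalence of the two options can be checked with a small base R sketch. Note that this uses an exact weighted quantile, not the Greenwald-Khanna approximation, and that `weighted_quantile()` is a hypothetical helper written for this illustration rather than a sparklyr function.

```r
# Exact weighted quantile (not the Greenwald-Khanna approximation): the smallest
# value whose cumulative share of the total weight reaches probability `p`.
weighted_quantile <- function(values, weights, p) {
  ord <- order(values)
  cum <- cumsum(weights[ord]) / sum(weights)
  values[ord][which(cum >= p)[1]]
}

obs <- c(1, 1, 1, 1, 0, 2, -1, -1)

# "Compress" the observations into (value, weight) tuples.
tab <- table(obs)
values <- as.numeric(names(tab))
weights <- as.numeric(tab) / length(obs)
print(data.frame(value = values, weight = weights))

# Option 1: scan all 8 observations, each carrying the same weight.
median_unweighted <- weighted_quantile(obs, rep(1, length(obs)), 0.5)
# Option 2: scan the 4 (value, weight) tuples.
median_weighted <- weighted_quantile(values, weights, 0.5)

print(c(unweighted = median_unweighted, weighted = median_weighted))
```

Both routes return 1 as the median, while the second one only has to scan 4 tuples instead of 8 observations.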

We can also run through a contrived example involving the standard normal distribution to illustrate the power of weighted quantile estimation in sparklyr 1.6. Suppose we cannot simply run qnorm() in R to evaluate the quantile function of the standard normal distribution at \(p = 0.25\) and \(p = 0.75\). How can we get a rough idea of the 1st and 3rd quartiles of this distribution? One way is to sample a large number of data points from this distribution, and then apply the Greenwald-Khanna algorithm to our unweighted samples, as shown below:


library(sparklyr)

sc <- spark_connect(master = "local")

num_samples <- 1e6
samples <- data.frame(x = rnorm(num_samples))

samples_sdf <- copy_to(sc, samples, name = random_string())

# approximate the 25th and 75th percentiles of the unweighted samples
samples_sdf %>%
  sdf_quantile(
    column = "x",
    probabilities = c(0.25, 0.75),
    relative.error = 0.01
  ) %>%
  print()
##        25%        75%
## -0.6629242  0.6874939

Notice that because we are working with an approximate algorithm, and have specified relative.error = 0.01, the estimated value of \(-0.6629242\) from above could be anywhere between the 24th and the 26th percentile of all samples. In fact, it falls in the \(25.36896\)-th percentile:

pnorm(-0.6629242)
## [1] 0.2536896

Now how can we make use of weighted quantile estimation from sparklyr 1.6 to obtain similar results? Simple! We can sample a large number of \(x\) values uniformly randomly from \((-\infty, \infty)\) (or alternatively, just select a large number of values evenly spaced between \((-M, M)\) where \(M\) is approximately \(\infty\)), and assign each \(x\) value a weight of \(\displaystyle \frac{1}{\sqrt{2 \pi}}e^{-\frac{x^2}{2}}\), the standard normal distribution’s probability density at \(x\). Finally, we run the weighted version of sdf_quantile() from sparklyr 1.6, as shown below:


library(sparklyr)

sc <- spark_connect(master = "local")

num_samples <- 1e6
M <- 1000
# evenly spaced x values, each weighted by the standard normal density at x
samples <- tibble::tibble(
  x = M * seq(-num_samples / 2 + 1, num_samples / 2) / num_samples,
  weight = dnorm(x)
)

samples_sdf <- copy_to(sc, samples, name = random_string())

samples_sdf %>%
  sdf_quantile(
    column = "x",
    weight.column = "weight",
    probabilities = c(0.25, 0.75),
    relative.error = 0.01
  ) %>%
  print()
##    25%    75%
## -0.696  0.662

Voilà! The estimates are not too far off from the 25th and 75th percentiles (in relation to our abovementioned maximum permissible error of \(0.01\)):

pnorm(-0.696)
## [1] 0.2432144
pnorm(0.662)
## [1] 0.7460144
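
The same idea can be tried without Spark. The sketch below is a base R illustration under stated assumptions: it uses an exact weighted quantile instead of the Greenwald-Khanna approximation, a hypothetical helper `weighted_quantile()`, and an evenly spaced grid on \((-10, 10)\) chosen for this example. Here qnorm() is used only to check the estimates.

```r
# Exact weighted quantile: the smallest value whose cumulative share of the
# total weight reaches probability `p`.
weighted_quantile <- function(values, weights, p) {
  ord <- order(values)
  cum <- cumsum(weights[ord]) / sum(weights)
  values[ord][which(cum >= p)[1]]
}

# Evenly spaced grid, with each x weighted by the standard normal density at x.
x <- seq(-10, 10, by = 0.001)
w <- dnorm(x)

probs <- c(0.25, 0.75)
est <- sapply(probs, function(p) weighted_quantile(x, w, p))

print(rbind(estimate = est, qnorm = qnorm(probs)))
```

Because the weights are proportional to the density, the weighted quantiles of the grid land within about one grid step of the true quartiles.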

Power iteration clustering

Power iteration clustering (PIC), a simple and scalable graph clustering method presented in Lin and Cohen (2010), first finds a low-dimensional embedding of a dataset, using truncated power iteration on a normalized pairwise-similarity matrix of all data points, and then uses this embedding as the “cluster indicator”, an intermediate representation of the dataset that leads to fast convergence when used as input to k-means clustering. This process is very well illustrated in figure 1 of Lin and Cohen (2010), in which the leftmost image is the visualization of a dataset consisting of 3 circles, with points colored in red, green, and blue indicating clustering results, and the subsequent images show the power iteration process gradually transforming the original set of points into what appears to be three disjoint line segments, an intermediate representation that can be rapidly separated into 3 clusters using k-means clustering with \(k = 3\).

In sparklyr 1.6, ml_power_iteration() was implemented to make the PIC functionality in Spark accessible from R. It expects as input a 3-column Spark dataframe that represents a pairwise-similarity matrix of all data points. Two of the columns in this dataframe should contain 0-based row and column indices, and the third column should hold the corresponding similarity measure. In the example below, we will see a dataset consisting of two circles being easily separated into two clusters by ml_power_iteration(), with the Gaussian kernel being used as the similarity measure between any 2 points:

library(sparklyr)

gen_similarity_matrix <- function() {
  # Gaussian similarity measure
  gaussian_similarity <- function(pt1, pt2) {
    exp(-sum((pt2 - pt1) ^ 2) / 2)
  }
  # generate evenly distributed points on a circle centered at the origin
  gen_circle <- function(radius, num_pts) {
    seq(0, num_pts - 1) %>%
      purrr::map_dfr(
        function(idx) {
          theta <- 2 * pi * idx / num_pts
          radius * c(x = cos(theta), y = sin(theta))
        }
      )
  }
  # generate points on both circles
  pts <- rbind(
    gen_circle(radius = 1, num_pts = 80),
    gen_circle(radius = 4, num_pts = 80)
  )
  # populate the pairwise similarity matrix (stored as a 3-column dataframe)
  similarity_matrix <- data.frame()
  for (i in seq(2, nrow(pts)))
    similarity_matrix <- similarity_matrix %>%
      rbind(seq(i - 1L) %>%
        purrr::map_dfr(~ list(
          src = i - 1L, dst = .x - 1L,
          similarity = gaussian_similarity(pts[i,], pts[.x,])
        ))
      )

  similarity_matrix
}

sc <- spark_connect(master = "local")
sdf <- copy_to(sc, gen_similarity_matrix())
clusters <- ml_power_iteration(
  sdf, k = 2, max_iter = 10, init_mode = "degree",
  src_col = "src", dst_col = "dst", weight_col = "similarity"
)

clusters %>% print(n = 160)
## # A tibble: 160 x 2
##        id cluster
##     <dbl>   <int>
##   1     0       1
##   2     1       1
##   3     2       1
##   4     3       1
##   5     4       1
##   ...
##   157   156       0
##   158   157       0
##   159   158       0
##   160   159       0

The output shows points from the two circles being assigned to separate clusters, as expected, after only a small number of PIC iterations.
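
For intuition about what happens inside PIC, here is a toy base R sketch of the two steps on a hand-made 6-point similarity matrix. It is a simplified illustration rather than Spark's implementation: the similarity values are made up, and the k-means step is seeded at the extremes of the embedding so that the result is deterministic.

```r
# Toy similarity matrix: points 1-3 form one tight group, points 4-6 a looser one,
# with only a very weak similarity between the two groups.
A <- matrix(0.001, 6, 6)
A[1:3, 1:3] <- 1.0
A[4:6, 4:6] <- 0.5
diag(A) <- 0

# Row-normalize the similarity matrix.
W <- A / rowSums(A)

# Degree-based initial vector, followed by a truncated power iteration.
v <- rowSums(A) / sum(A)
for (iter in 1:10) {
  v <- as.vector(W %*% v)
  v <- v / sum(abs(v))
}

# Cluster the 1-D embedding with k-means (k = 2), seeded at its extremes.
fit <- kmeans(v, centers = c(min(v), max(v)), algorithm = "Lloyd")

print(round(v, 4))
print(fit$cluster)
```

After 10 iterations the entries of the embedding are nearly constant within each group but still clearly different between groups, so k-means assigns points 1-3 and points 4-6 to different clusters.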

spark_write_rds() + collect_from_rds()

spark_write_rds() and collect_from_rds() are implemented as a less memory-consuming alternative to collect(). collect() retrieves all elements of a Spark dataframe through the Spark driver node, and can therefore cause slowness or out-of-memory failures when collecting large amounts of data. In contrast, spark_write_rds(), when used in conjunction with collect_from_rds(), can retrieve all partitions of a Spark dataframe directly from Spark workers, rather than through the Spark driver node. First, spark_write_rds() will distribute the tasks of serializing Spark dataframe partitions in RDS version 2 format among Spark workers. Spark workers can then process multiple partitions in parallel, each handling one partition at a time and persisting the RDS output directly to disk, rather than sending dataframe partitions to the Spark driver node. Finally, the RDS outputs can be re-assembled to R dataframes using collect_from_rds().

Shown below is an example of spark_write_rds() + collect_from_rds() usage, where RDS outputs are first saved to HDFS, then downloaded to the local filesystem with hadoop fs -get, and finally, post-processed with collect_from_rds():


library(sparklyr)
library(nycflights13)  # assumed source of the `flights` dataset

num_partitions <- 10L
sc <- spark_connect(master = "yarn", spark_home = "/usr/lib/spark")
flights_sdf <- copy_to(sc, flights, repartition = num_partitions)

# Spark workers serialize all partitions in RDS format in parallel and write RDS
# outputs to HDFS
spark_write_rds(
  flights_sdf,
  dest_uri = "hdfs://<namenode>:8020/flights-part-{partitionId}.rds"
)

# Run `hadoop fs -get` to download RDS files from HDFS to local file system
for (partition in seq(num_partitions) - 1)
  system2(
    "hadoop",
    c("fs", "-get", sprintf("hdfs://<namenode>:8020/flights-part-%d.rds", partition))
  )

# Post-process RDS outputs
# (parentheses are needed because `%>%` binds more tightly than `-`)
partitions <- (seq(num_partitions) - 1) %>%
  lapply(function(partition) collect_from_rds(sprintf("flights-part-%d.rds", partition)))

# Optionally, call `rbind()` to combine data from all partitions into a single R dataframe
flights_df <- do.call(rbind, partitions)
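
The write-per-partition, then re-assemble pattern can be tried locally without Spark or HDFS. The sketch below is only an analogy in base R, using saveRDS() and readRDS() in place of spark_write_rds() and collect_from_rds(); the partitioning of mtcars, the temporary directory, and the file names are assumptions made for this example.

```r
# Split a data frame into "partitions" and persist each one as its own RDS file.
num_partitions <- 4L
part_id <- rep_len(seq_len(num_partitions) - 1L, nrow(mtcars))
out_dir <- tempfile("rds-parts-")
dir.create(out_dir)
paths <- file.path(out_dir, sprintf("part-%d.rds", seq_len(num_partitions) - 1L))

parts <- split(mtcars, part_id)
for (i in seq_along(parts)) saveRDS(parts[[i]], paths[i], version = 2)

# Read the partitions back one file at a time and re-assemble them.
restored <- do.call(rbind, lapply(paths, readRDS))

print(dim(restored))
```

The rows come back grouped by partition rather than in their original order, so any comparison with the input should first align the rows, for example by row name.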

Dplyr-related improvements

Similar to other recent sparklyr releases, sparklyr 1.6 comes with a number of dplyr-related improvements, such as

  • Support for where() predicate within select() and summarize(across(…)) operations on Spark dataframes
  • Addition of if_all() and if_any() functions
  • Full compatibility with dbplyr 2.0 backend API

select(where(…)) and summarize(across(where(…)))

The dplyr where(…) construct is useful for applying a selection or aggregation function to multiple columns that satisfy some boolean predicate. For example,


library(dplyr)

iris %>% select(where(is.numeric))

returns all numeric columns from the iris dataset, and


iris %>% summarize(across(where(is.numeric), mean))

computes the average of each numeric column.

In sparklyr 1.6, both types of operations can be applied to Spark dataframes, e.g.,


library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")
iris_sdf <- copy_to(sc, iris, name = random_string())

iris_sdf %>% select(where(is.numeric))

iris_sdf %>% summarize(across(where(is.numeric), mean))

if_all() and if_any()

if_all() and if_any() are two convenience functions from dplyr 1.0.4 (see here for more details) that effectively¹ combine the results of applying a boolean predicate to a tidy selection of columns using the logical and/or operators.

Starting from sparklyr 1.6, if_all() and if_any() can also be applied to Spark dataframes, e.g.,


library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")
iris_sdf <- copy_to(sc, iris, name = random_string())

# Select all records with Petal.Width > 2 and Petal.Length > 2
iris_sdf %>% filter(if_all(starts_with("Petal"), ~ .x > 2))

# Select all records with Petal.Width > 5 or Petal.Length > 5
iris_sdf %>% filter(if_any(starts_with("Petal"), ~ .x > 5))
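
To spell out the semantics, the two filters can be written in base R on the local iris data frame. This is a sketch of what the predicates mean, not of how dplyr or sparklyr implement them; the variable names are chosen for this example.

```r
# Columns whose names start with "Petal" (the tidy selection in the example).
petal_cols <- grep("^Petal", names(iris), value = TRUE)

# if_all(): keep a row only when the predicate holds for every selected column.
all_gt_2 <- Reduce(`&`, lapply(iris[petal_cols], function(col) col > 2))

# if_any(): keep a row when the predicate holds for at least one selected column.
any_gt_5 <- Reduce(`|`, lapply(iris[petal_cols], function(col) col > 5))

print(nrow(iris[all_gt_2, ]))
print(nrow(iris[any_gt_5, ]))
```

In other words, if_all() folds the per-column results with the logical "and" operator, and if_any() folds them with the logical "or" operator.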

Compatibility with dbplyr 2.0 backend API

Sparklyr 1.6 is fully compatible with the newer dbplyr 2.0 backend API (by implementing all interface changes recommended here), while still maintaining backward compatibility with the previous edition of the dbplyr API, so that sparklyr users will not be forced to switch to any particular version of dbplyr.

This should be a mostly non-user-visible change as of now. In fact, the only discernible behavior change is that the following code

library(dbplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

print(dbplyr_edition(sc))

outputs

## [1] 2

if sparklyr is working with dbplyr 2.0+, and

## [1] 1

otherwise.


In chronological order, we would like to thank the following contributors for making sparklyr 1.6 awesome:

We would also like to give a big shout-out to the wonderful open-source community behind sparklyr, without whom we would not have benefitted from numerous sparklyr-related bug reports and feature suggestions.

Finally, the author of this blog post also very much appreciates the highly valuable editorial suggestions from @skeydan.

If you wish to learn more about sparklyr, we recommend checking out some previous sparklyr release posts such as sparklyr 1.5 and sparklyr 1.4.

That is all. Thanks for reading!

Greenwald, Michael, and Sanjeev Khanna. 2001. “Space-Efficient Online Computation of Quantile Summaries.” SIGMOD Rec. 30 (2): 58–66.

Lin, Frank, and William Cohen. 2010. “Power Iteration Clustering.” In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 655–62.

  1. modulo possible implementation-dependent short-circuit evaluations


sparklyr 1.6: weighted quantile summaries, power iteration clustering, spark_write_rds(), and more

Sparklyr 1.6 is now available on CRAN!

To install sparklyr 1.6 from CRAN, run


In this blog post, we shall highlight the following features and enhancements from sparklyr 1.6:

Weighted quantile summaries

Apache Spark is well-known for supporting approximate algorithms that trade off marginal amounts of accuracy for greater speed and parallelism. Such algorithms are particularly beneficial for performing preliminary data explorations at scale, as they enable users to quickly query certain estimated statistics within a predefined error margin, while avoiding the high cost of exact computations. One example is the Greenwald-Khanna algorithm for on-line computation of quantile summaries, as described in Greenwald and Khanna (2001). This algorithm was originally designed for efficient (epsilon)- approximation of quantiles within a large dataset without the notion of data points carrying different weights, and the unweighted version of it has been implemented as approxQuantile() since Spark 2.0. However, the same algorithm can be generalized to handle weighted inputs, and as sparklyr user @Zhuk66 mentioned in this issue, a weighted version of this algorithm makes for a useful sparklyr feature.

To properly explain what weighted-quantile means, we must clarify what the weight of each data point signifies. For example, if we have a sequence of observations ((1, 1, 1, 1, 0, 2, -1, -1)), and would like to approximate the median of all data points, then we have the following two options:

  • Either run the unweighted version of approxQuantile() in Spark to scan through all 8 data points

  • Or alternatively, “compress” the data into 4 tuples of (value, weight): ((1, 0.5), (0, 0.125), (2, 0.125), (-1, 0.25)), where the second component of each tuple represents how often a value occurs relative to the rest of the observed values, and then find the median by scanning through the 4 tuples using the weighted version of the Greenwald-Khanna algorithm

We can also run through a contrived example involving the standard normal distribution to illustrate the power of weighted quantile estimation in sparklyr 1.6. Suppose we cannot simply run qnorm() in R to evaluate the quantile function of the standard normal distribution at (p = 0.25) and (p = 0.75), how can we get some vague idea about the 1st and 3rd quantiles of this distribution? One way is to sample a large number of data points from this distribution, and then apply the Greenwald-Khanna algorithm to our unweighted samples, as shown below:


sc <- spark_connect(master = "local")

num_samples <- 1e6
samples <- data.frame(x = rnorm(num_samples))

samples_sdf <- copy_to(sc, samples, name = random_string())

samples_sdf %>%
    column = "x",
    probabilities = c(0.25, 0.75),
    relative.error = 0.01
  ) %>%
##        25%        75%
## -0.6629242  0.6874939

Notice that because we are working with an approximate algorithm, and have specified relative.error = 0.01, the estimated value of (-0.6629242) from above could be anywhere between the 24th and the 26th percentile of all samples. In fact, it falls in the (25.36896)-th percentile:

## [1] 0.2536896

Now how can we make use of weighted quantile estimation from sparklyr 1.6 to obtain similar results? Simple! We can sample a large number of (x) values uniformly randomly from ((-infty, infty)) (or alternatively, just select a large number of values evenly spaced between ((-M, M)) where (M) is approximately (infty)), and assign each (x) value a weight of (displaystyle frac{1}{sqrt{2 pi}}e^{-frac{x^2}{2}}), the standard normal distribution’s probability density at (x). Finally, we run the weighted version of sdf_quantile() from sparklyr 1.6, as shown below:


sc <- spark_connect(master = "local")

num_samples <- 1e6
M <- 1000
samples <- tibble::tibble(
  x = M * seq(-num_samples / 2 + 1, num_samples / 2) / num_samples,
  weight = dnorm(x)

samples_sdf <- copy_to(sc, samples, name = random_string())

samples_sdf %>%
    column = "x",
    weight.column = "weight",
    probabilities = c(0.25, 0.75),
    relative.error = 0.01
  ) %>%
##    25%    75%
## -0.696  0.662

Voilà! The estimates are not too far off from the 25th and 75th percentiles (in relation to our abovementioned maximum permissible error of (0.01)):

## [1] 0.2432144
## [1] 0.7460144

Power iteration clustering

Power iteration clustering (PIC), a simple and scalable graph clustering method presented in Lin and Cohen (2010), first finds a low-dimensional embedding of a dataset, using truncated power iteration on a normalized pairwise-similarity matrix of all data points, and then uses this embedding as the “cluster indicator”, an intermediate representation of the dataset that leads to fast convergence when used as input to k-means clustering. This process is very well illustrated in figure 1 of Lin and Cohen (2010) (reproduced below)

in which the leftmost image is the visualization of a dataset consisting of 3 circles, with points colored in red, green, and blue indicating clustering results, and the subsequent images show the power iteration process gradually transforming the original set of points into what appears to be three disjoint line segments, an intermediate representation that can be rapidly separated into 3 clusters using k-means clustering with (k = 3).

In sparklyr 1.6, ml_power_iteration() was implemented to make the PIC functionality in Spark accessible from R. It expects as input a 3-column Spark dataframe that represents a pairwise-similarity matrix of all data points. Two of the columns in this dataframe should contain 0-based row and column indices, and the third column should hold the corresponding similarity measure. In the example below, we will see a dataset consisting of two circles being easily separated into two clusters by ml_power_iteration(), with the Gaussian kernel being used as the similarity measure between any 2 points:

gen_similarity_matrix <- function() {
  # Gaussian similarity measure
  gaussian_similarity <- function(pt1, pt2) {
    exp(-sum((pt2 - pt1) ^ 2) / 2)
  }
  # generate evenly distributed points on a circle centered at the origin
  gen_circle <- function(radius, num_pts) {
    seq(0, num_pts - 1) %>%
      purrr::map_dfr(
        function(idx) {
          theta <- 2 * pi * idx / num_pts
          radius * c(x = cos(theta), y = sin(theta))
        }
      )
  }
  # generate points on both circles
  pts <- rbind(
    gen_circle(radius = 1, num_pts = 80),
    gen_circle(radius = 4, num_pts = 80)
  )
  # populate the pairwise similarity matrix (stored as a 3-column dataframe)
  similarity_matrix <- data.frame()
  for (i in seq(2, nrow(pts)))
    similarity_matrix <- similarity_matrix %>%
      rbind(seq(i - 1L) %>%
        purrr::map_dfr(~ list(
          src = i - 1L, dst = .x - 1L,
          similarity = gaussian_similarity(pts[i,], pts[.x,])
        ))
      )

  similarity_matrix
}

library(sparklyr)

sc <- spark_connect(master = "local")
sdf <- copy_to(sc, gen_similarity_matrix())
clusters <- ml_power_iteration(
  sdf, k = 2, max_iter = 10, init_mode = "degree",
  src_col = "src", dst_col = "dst", weight_col = "similarity"
)

clusters %>% print(n = 160)
## # A tibble: 160 x 2
##        id cluster
##     <dbl>   <int>
##   1     0       1
##   2     1       1
##   3     2       1
##   4     3       1
##   5     4       1
##   ...
##   157   156       0
##   158   157       0
##   159   158       0
##   160   159       0

The output shows points from the two circles being assigned to separate clusters, as expected, after only a small number of PIC iterations.

spark_write_rds() + collect_from_rds()

spark_write_rds() and collect_from_rds() are implemented as a less memory-consuming alternative to collect(). Unlike collect(), which retrieves all elements of a Spark dataframe through the Spark driver node, hence potentially causing slowness or out-of-memory failures when collecting large amounts of data, spark_write_rds(), when used in conjunction with collect_from_rds(), can retrieve all partitions of a Spark dataframe directly from Spark workers, rather than through the Spark driver node. First, spark_write_rds() will distribute the tasks of serializing Spark dataframe partitions in RDS version 2 format among Spark workers. Spark workers can then process multiple partitions in parallel, each handling one partition at a time and persisting the RDS output directly to disk, rather than sending dataframe partitions to the Spark driver node. Finally, the RDS outputs can be re-assembled to R dataframes using collect_from_rds().

Shown below is an example of spark_write_rds() + collect_from_rds() usage, where RDS outputs are first saved to HDFS, then downloaded to the local filesystem with hadoop fs -get, and finally, post-processed with collect_from_rds():


library(sparklyr)
library(nycflights13)

num_partitions <- 10L
sc <- spark_connect(master = "yarn", spark_home = "/usr/lib/spark")
flights_sdf <- copy_to(sc, flights, repartition = num_partitions)

# Spark workers serialize all partitions in RDS format in parallel and write RDS
# outputs to HDFS
spark_write_rds(
  flights_sdf,
  dest_uri = "hdfs://<namenode>:8020/flights-part-{partitionId}.rds"
)

# Run `hadoop fs -get` to download RDS files from HDFS to local file system
for (partition in seq(num_partitions) - 1)
  system2(
    "hadoop",
    c("fs", "-get", sprintf("hdfs://<namenode>:8020/flights-part-%d.rds", partition))
  )

# Post-process RDS outputs
# (parentheses are needed because `%>%` binds more tightly than `-`)
partitions <- (seq(num_partitions) - 1) %>%
  lapply(function(partition) collect_from_rds(sprintf("flights-part-%d.rds", partition)))

# Optionally, call `rbind()` to combine data from all partitions into a single R dataframe
flights_df <- do.call(rbind, partitions)

Dplyr-related improvements

Similar to other recent sparklyr releases, sparklyr 1.6 comes with a number of dplyr-related improvements, such as

  • Support for where() predicate within select() and summarize(across(…)) operations on Spark dataframes
  • Addition of if_all() and if_any() functions
  • Full compatibility with dbplyr 2.0 backend API

select(where(…)) and summarize(across(where(…)))

The dplyr where(…) construct is useful for applying a selection or aggregation function to multiple columns that satisfy some boolean predicate. For example,


iris %>% select(where(is.numeric))

returns all numeric columns from the iris dataset, and


iris %>% summarize(across(where(is.numeric), mean))

computes the average of each numeric column.

In sparklyr 1.6, both types of operations can be applied to Spark dataframes, e.g.,


library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_sdf <- copy_to(sc, iris, name = random_string())

iris_sdf %>% select(where(is.numeric))

iris_sdf %>% summarize(across(where(is.numeric), mean))

if_all() and if_any()

if_all() and if_any() are two convenience functions from dplyr 1.0.4 that effectively¹ combine the results of applying a boolean predicate to a tidy selection of columns using the logical and/or operators.

Starting from sparklyr 1.6, if_all() and if_any() can also be applied to Spark dataframes, e.g.,


sc <- spark_connect(master = "local")
iris_sdf <- copy_to(sc, iris, name = random_string())

# Select all records with Petal.Width > 2 and Petal.Length > 2
iris_sdf %>% filter(if_all(starts_with("Petal"), ~ .x > 2))

# Select all records with Petal.Width > 5 or Petal.Length > 5
iris_sdf %>% filter(if_any(starts_with("Petal"), ~ .x > 5))
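As a rough analogy for readers less familiar with dplyr, the semantics of the two filters above can be mimicked in plain Python with the built-in all() and any(). This is only a sketch of the logic, not of how sparklyr translates these calls; the helper names and the three rows below are made up for illustration and are not taken from the iris dataset.

```python
# Hypothetical rows with the same column names as iris
rows = [
    {"Petal.Length": 1.4, "Petal.Width": 0.2, "Species": "setosa"},
    {"Petal.Length": 5.1, "Petal.Width": 2.4, "Species": "virginica"},
    {"Petal.Length": 6.0, "Petal.Width": 1.8, "Species": "virginica"},
]


def petal_columns(row):
    # Plays the role of starts_with("Petal")
    return [name for name in row if name.startswith("Petal")]


def if_all(row, predicate):
    # True only if the predicate holds for every selected column
    return all(predicate(row[name]) for name in petal_columns(row))


def if_any(row, predicate):
    # True if the predicate holds for at least one selected column
    return any(predicate(row[name]) for name in petal_columns(row))


all_gt_2 = [r for r in rows if if_all(r, lambda x: x > 2)]
any_gt_5 = [r for r in rows if if_any(r, lambda x: x > 5)]
print(len(all_gt_2), len(any_gt_5))
```

Like the dplyr functions, Python's all() and any() may stop evaluating as soon as the result is determined.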

Compatibility with dbplyr 2.0 backend API

Sparklyr 1.6 is fully compatible with the newer dbplyr 2.0 backend API (by implementing all interface changes recommended for that API), while still maintaining backward compatibility with the previous edition of the dbplyr API, so that sparklyr users will not be forced to switch to any particular version of dbplyr.

This should be a mostly non-user-visible change as of now. In fact, the only discernible behavior change will be the following code

library(dbplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

print(dbplyr_edition(sc))

outputting

[1] 2

if sparklyr is working with dbplyr 2.0+, and

[1] 1

if otherwise.


In chronological order, we would like to thank the following contributors for making sparklyr 1.6 awesome:

We would also like to give a big shout-out to the wonderful open-source community behind sparklyr, without whom we would not have benefitted from numerous sparklyr-related bug reports and feature suggestions.

Finally, the author of this blog post also very much appreciates the highly valuable editorial suggestions from @skeydan.

If you wish to learn more about sparklyr, we recommend checking out some previous sparklyr release posts such as sparklyr 1.5 and sparklyr 1.4.

That is all. Thanks for reading!

Greenwald, Michael, and Sanjeev Khanna. 2001. “Space-Efficient Online Computation of Quantile Summaries.” SIGMOD Rec. 30 (2): 58–66.

Lin, Frank, and William Cohen. 2010. “Power Iteration Clustering.” In Proceedings of the 27th International Conference on Machine Learning (ICML), 655–62.

  1. modulo possible implementation-dependent short-circuit evaluations


Come Sale Away with GFN Thursday

GFN Thursday means more games for GeForce NOW members, every single week. This week’s list includes the day-and-date release of Spacebase Startopia, but first we want to share the scoop on some fantastic sales available across our digital game store partners that members will want to take advantage of this very moment.

The post Come Sale Away with GFN Thursday appeared first on The Official NVIDIA Blog.


NIH, NVIDIA Use AI to Trace COVID-19 Disease Progression in Chest CT Images

Researchers from the U.S. National Institutes of Health have collaborated with NVIDIA experts on an AI-accelerated method to monitor COVID-19 disease severity over time from patient chest CT scans.

Published today in Scientific Reports, this work studied the progression of lung opacities in chest CT images of COVID patients, and extracted insights about the temporal relationships between CT features and lab measurements. 

Quantifying CT opacities can tell doctors how severe a patient’s condition is. A better understanding of the progression of lung opacities in COVID patients could help inform clinical decisions in patients with pneumonia, and yield insights during clinical trials for therapies to treat the virus. 

Selecting a dataset of more than 100 sequential chest CTs from 29 COVID patients from China and Italy, the researchers used an NVIDIA Clara AI segmentation model to automate the time-consuming task of segmenting the total lung in each CT scan. Expert radiologists reviewed the total lung segmentations, and manually segmented the lung opacities. 

To track disease progression, the researchers used generalized temporal curves, which correlated the CT imaging data with lab measurements such as white blood cell count and procalcitonin levels. They then used 3D visualizations to reconstruct the evolution of COVID opacities in one of the patients. 

The team found that lung opacities appeared between one and five days before symptom onset, and peaked a day after symptoms began. They also analyzed two opacity subtypes — ground glass opacity and consolidation — and discovered that ground glass opacities appeared earlier in the disease, and persisted for a time after the resolution of the consolidation.  

In the paper, the researchers showed how CT dynamic curves could be used as a clinical reference tool for mild COVID-19 cases, and might help spot cases that grow more severe over time. These curves could also assist clinicians in identifying chronic lung effects by flagging cases where patients have residual opacities visible in CT scans long after other symptoms dissipate. 

This paper follows research published in Nature Communications, in which the team used deep learning to distinguish COVID-19 associated pneumonia from non-COVID pneumonia in chest scans. The deep learning models were developed using the NVIDIA Clara application framework for medical imaging, and are available for research use in the NGC catalog.

Read the full paper in Scientific Reports. Download the models from NGC and visit our COVID-19 research hub for more. 

Learn more about NVIDIA’s work in healthcare at the GPU Technology Conference, April 12-16. Registration is free. The healthcare track includes 16 live webinars, 18 special events, and over 100 recorded sessions.


Recursive Classification: Replacing Rewards with Examples in RL

A general goal of robotics research is to design systems that can assist in a variety of tasks that can potentially improve daily life. Most reinforcement learning algorithms for teaching agents to perform new tasks require a reward function, which provides positive feedback to the agent for taking actions that lead to good outcomes. However, actually specifying these reward functions can be quite tedious and can be very difficult to define for situations without a clear objective, such as whether a room is clean or if a door is sufficiently shut. Even for tasks that are easy to describe, actually measuring whether the task has been solved can be difficult and may require adding many sensors to a robot’s environment.

Alternatively, training a model using examples, called example-based control, has the potential to overcome the limitations of approaches that rely on traditional reward functions. This new problem statement is most similar to prior methods based on “success detectors”, and efficient algorithms for example-based control could enable non-expert users to teach robots to perform new tasks, without the need for coding expertise, knowledge of reward function design, or the installation of environmental sensors.

In “Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification,” we propose a machine learning algorithm for teaching agents how to solve new tasks by providing examples of success (e.g., if “success” examples show a nail embedded into a wall, the agent will learn to pick up a hammer and knock nails into the wall). This algorithm, recursive classification of examples (RCE), does not rely on hand-crafted reward functions, distance functions, or features, but rather learns to solve tasks directly from data, requiring the agent to learn how to solve the entire task by itself, without requiring examples of any intermediate states. Using a version of temporal difference learning — similar to Q-learning, but replacing the typical reward function term using only examples of success — RCE outperforms prior approaches based on imitation learning on simulated robotics tasks. Coupled with theoretical guarantees similar to those for reward-based learning, the proposed method offers a user-friendly alternative for teaching robots new tasks.

Top: To teach a robot to hammer a nail into a wall, most reinforcement learning algorithms require that the user define a reward function. Bottom: The example-based control method uses examples of what the world looks like when a task is completed to teach the robot to solve the task, e.g., examples where the nail is already hammered into the wall.

Example-Based Control vs Imitation Learning
While the example-based control method is similar to imitation learning, there is an important distinction — it does not require expert demonstrations. In fact, the user can actually be quite bad at performing the task themselves, as long as they can look back and pick out the small fraction of states where they did happen to solve the task.

Additionally, whereas previous research used a stage-wise approach in which the model first uses success examples to learn a reward function and then applies that reward function with an off-the-shelf reinforcement learning algorithm, RCE learns directly from the examples and skips the intermediate step of defining the reward function. Doing so avoids potential bugs and bypasses the process of defining the hyperparameters associated with learning a reward function (such as how often to update the reward function or how to regularize it) and, when debugging, removes the need to examine code related to learning the reward function.

Recursive Classification of Examples
The intuition behind the RCE approach is simple: the model should predict whether the agent will solve the task in the future, given the current state of the world and the action that the agent is taking. If there were data that specified which state-action pairs lead to future success and which state-action pairs lead to future failure, then one could solve this problem using standard supervised learning. However, when the only data available consists of success examples, the system doesn’t know which states and actions led to success, and while the system also has experience interacting with the environment, this experience isn’t labeled as leading to success or not.

Left: The key idea is to learn a future success classifier that predicts for every state (circle) in a trajectory whether the task will be solved in the future (thumbs up/down). Right: In the example-based control approach, the model is provided only with unlabeled experience (grey circles) and success examples (green circles), so one cannot apply standard supervised learning. Instead, the model uses the success examples to automatically label the unlabeled experience.

Nonetheless, one can piece together what these data would look like, if they were available. First, by definition, a successful example must be one that solves the given task. Second, even though it is unknown whether an arbitrary state-action pair will lead to success in solving a task, it is possible to estimate how likely it is that the task will be solved if the agent started at the next state. If the next state is likely to lead to future success, it can be assumed that the current state is also likely to lead to future success. In effect, this is recursive classification, where the labels are inferred based on predictions at the next time step.

The underlying algorithmic idea of using a model’s predictions at a future time step as a label for the current time step closely resembles existing temporal-difference methods, such as Q-learning and successor features. The key difference is that the approach described here does not require a reward function. Nonetheless, we show that this method inherits many of the same theoretical convergence guarantees as temporal difference methods. In practice, implementing RCE requires changing only a few lines of code in an existing Q-learning implementation.
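To make the recursion concrete, here is a deliberately simplified Python sketch. It is not the objective from the paper (RCE trains a classifier over state-action pairs, typically a neural network) but a tabular toy under stated assumptions: a hypothetical five-state chain in which the policy always moves one step to the right, a single success example (the last state), and a discount factor of 0.9 that shrinks the label borrowed from the next state. All names are made up for this illustration.

```python
GAMMA = 0.9            # assumed discount applied to the bootstrapped label
NUM_STATES = 5         # hypothetical chain of states 0..4; the policy always moves right
SUCCESS_STATES = {4}   # the only user-provided success example


def recursive_labels(num_sweeps=50):
    # p[s] estimates how likely the task is to be solved in the future from state s
    p = [0.0] * NUM_STATES
    for _ in range(num_sweeps):
        for s in range(NUM_STATES):
            if s in SUCCESS_STATES:
                # Success examples are labeled positive by definition
                p[s] = 1.0
            else:
                # The label for the current state is inferred from the
                # prediction at the next state; no reward function is involved
                p[s] = GAMMA * p[s + 1]
    return p


predictions = recursive_labels()
print([round(x, 4) for x in predictions])
```

States closer to the success example end up with higher predicted success, even though none of the intermediate states was ever labeled by a user.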

We evaluated the RCE method on a range of challenging robotic manipulation tasks. For example, in one task we required a robotic hand to pick up a hammer and hit a nail into a board. Previous research into this task [1, 2] has used a complex reward function (with terms corresponding to the distance between the hand and the hammer, the distance between the hammer and the nail, and whether the nail has been knocked into the board). In contrast, the RCE method requires only a few observations of what the world would look like if the nail were hammered into the board.

We compared the performance of RCE to a number of prior methods, including those that learn an explicit reward function and those based on imitation learning, all of which struggle to solve this task. This experiment highlights how example-based control makes it easy for users to specify even complex tasks, and demonstrates that recursive classification can successfully solve these sorts of tasks.

The RCE approach solves the task of hammering a nail into a board more reliably than prior approaches based on imitation learning [SQIL, DAC] and those that learn an explicit reward function [VICE, ORIL, PURL].

We have presented a method to teach autonomous agents to perform tasks by providing them with examples of success, rather than meticulously designing reward functions or collecting first-person demonstrations. An important aspect of example-based control, which we discuss in the paper, is what assumptions the system makes about the capabilities of different users. Designing variants of RCE that are robust to differences in users’ capabilities may be important for applications in real-world robotics. The code is available, and the project website contains additional videos of the learned behaviors.

We thank our co-authors, Ruslan Salakhutdinov and Sergey Levine. We also thank Surya Bhupatiraju, Kamyar Ghasemipour, Max Igl, and Harini Kannan for feedback on this post, and Tom Small for helping to design figures for this post.


GTC 21: Top 5 Arm Computing and Ecosystem Sessions

From powering the world’s largest supercomputers and cloud data centers, to edge devices on factory floors and city streets, the NVIDIA accelerated computing platform is used to help solve the world’s most challenging computational problems. 

NVIDIA and Arm are working together to open new opportunities for partners, users, and developers, driving a new wave of computing around the world. 

Explore all the Arm accelerated computing and ecosystem sessions at GTC. Here are a few key sessions you may be interested in. 

  1. A Vision for the Next Decade of Computing 

    AI, 5G, and the internet of things are sparking the world’s potential. And for many hardware engineers and software developers, these technologies will also become the challenge of their careers. The question is how to invisibly integrate the new intelligence everywhere by creating more responsive infrastructure that links people, processes, devices, and data seamlessly. Getting there will require architectural leaps, new partnerships, and plenty of creativity. Arm President Rene Haas will discuss the forces pushing these advances and how Arm’s global developer ecosystem will react to drive the next wave of compute.

    Speaker: Rene Haas, President, IP Products Group, Arm

  2. Introducing Developer Tools for Arm and NVIDIA Systems

    NVIDIA GPUs on Arm servers are here. In migrating to, or developing on, Arm servers with NVIDIA GPUs, developers using native code, CUDA, and OpenACC continue to need tools and toolchains to succeed and to get the most out of applications. We’ll explore the role of key tools and toolchains on Arm servers, from Arm, NVIDIA and elsewhere — and show how each tool fits in the end-to-end journey to production science and simulation.

    Speaker: Daniel Owens, Product Director, Infrastructure Software, Arm

  3. The Arm HPC User Group: An Open Community for Arm-Based Research and Engagement

    We’ll introduce the newly created Arm HPC User Group, which provides a forum for application developers, system integrators, tool vendors, and implementers to share their experiences. Learn about the history of Arm for HPC and see what plans the Arm HPC User Group has to engage with users and researchers over the coming year. You don’t need an in-depth technical knowledge of either Arm systems or HPC to attend or appreciate this talk.

    Speaker: Jeffrey Young, Senior Research Scientist, Georgia Tech

  4. HPC Applications on Arm and NVIDIA A100

    By design, HPC applications have radically different performance characteristics across domains of expertise. Achieving a balanced computing platform that addresses a breadth of HPC applications is a fundamental advance in the HPC state of the art. We demonstrate that Arm-based CPUs (such as the Ampere Altra), paired with NVIDIA GPUs (such as the NVIDIA A100), comprise a balanced, performant, and scalable supercomputing platform for any HPC application, whether CPU-bound, GPU-accelerated, or GPU-bound. We present the runtime performance profiles of representative applications from genomics.

    Speakers: Thomas Bradley, Director of Developer Technology, NVIDIA; John Linford, Director of HPC Applications, Arm

  5. Scalable, Efficient, Software-Defined 5G-Enabled Edge Based on NVIDIA GPUs and Arm Servers

    We’ll demonstrate a scalable, performance-optimized 5G-enabled edge cloud that’s based on Arm servers with NVIDIA GPUs. We’ll focus on a fully software-defined 5G Distributed Unit (DU) with an NVIDIA GPU/Aerial-based PHY layer, with the upper layers running on an Ampere Altra server based on the Arm Neoverse N1 CPU. We’ll cover the performance, scale, and power benefits of this architecture for a centralized radio access network architecture.

    Speakers: Anupa Kelkar, Product Manager, NVIDIA; Mo Jabbari, Senior Segment Marketing Manager, Arm

Register today for free and start building your schedule. Once you are signed in, you can view all Arm sessions.

You can also explore all GTC conference topics. Topics include areas of interest such as GPU programming, HPC, deep learning, data science, and autonomous machines, or industries including healthcare, public sector, retail, and telecommunications.