Categories
Misc

libcu++ Open-Source GPU-enabled C++ Standard Library Updated

libcu++, the NVIDIA C++ Standard Library, provides a C++ Standard Library for your entire system which can be used in and between CPU and GPU codes. The NVIDIA C++ Standard Library is an open source project. 

Version 1.4.0 of libcu++ is a major release providing several feature enhancements and bug fixes. It adds support for the following: , NVCC + MSVC support for , and backports of C++20 and C++17 features to C++14.

Other enhancements include improved and reorganized documentation, atomics decoupled from host Standard Library in MSVC, and revamped examples and benchmarks.  

Additional information, examples and documentation can be found below.

libcu++ is available on GitHub and is included in the NVIDIA HPC SDK and the CUDA Toolkit.

Categories
Misc

torch 0.2.0 – Initial JIT support and many bug fixes

We are happy to announce that version 0.2.0 of torch has just
landed on CRAN.

This release includes many bug fixes and some nice new features
that we will present in this blog post. You can see the full
changelog in the NEWS.md file.

The features that we will discuss in detail are:

  • Initial support for JIT tracing
  • Multi-worker dataloaders
  • Print methods for nn_modules

Multi-worker dataloaders

dataloaders now respond to the num_workers argument and will run
the pre-processing in parallel workers.

For example, say we have the following dummy dataset that does a
long computation:

library(torch)
dat <- dataset(
  "mydataset",
  initialize = function(time, len = 10) {
    self$time <- time
    self$len <- len
  },
  .getitem = function(i) {
    Sys.sleep(self$time)
    torch_randn(1)
  },
  .length = function() {
    self$len
  }
)
ds <- dat(1)
system.time(ds[1])
   user  system elapsed
  0.029   0.005   1.027

We will now create two dataloaders, one that executes
sequentially and another executing in parallel.

seq_dl <- dataloader(ds, batch_size = 5)
par_dl <- dataloader(ds, batch_size = 5, num_workers = 2)

We can now compare the time it takes to process two batches
sequentially to the time it takes in parallel:

seq_it <- dataloader_make_iter(seq_dl)
par_it <- dataloader_make_iter(par_dl)
two_batches <- function(it) {
  dataloader_next(it)
  dataloader_next(it)
  "ok"
}
system.time(two_batches(seq_it))
system.time(two_batches(par_it))
   user  system elapsed
  0.098   0.032  10.086
   user  system elapsed
  0.065   0.008   5.134

Note that it is batches that are obtained in parallel, not
individual observations. This way, we will be able to support
datasets with variable batch sizes in the future.

Using multiple workers is not necessarily
faster than serial execution because there’s a considerable
overhead when passing tensors from a worker to the main session as
well as when initializing the workers.

This feature is enabled by the powerful callr package and works on all
operating systems supported by torch. callr lets us create
persistent R sessions, so we pay the overhead of transferring
potentially large dataset objects to workers only once.

In the process of implementing this feature we have made
dataloaders behave like coro iterators. This means that
you can now use coro’s syntax for looping
through the dataloaders:

coro::loop(for (batch in par_dl) {
  print(batch$shape)
})
[1] 5 1
[1] 5 1

This is the first torch release including the multi-worker
dataloaders feature, and you might run into edge cases when using
it. Do let us know if you find any problems.

Initial JIT support

Programs that make use of the torch package are inevitably R
programs and thus, they always need an R installation in order to
execute.

As of version 0.2.0, torch allows users to JIT trace torch R
functions into TorchScript. JIT (just-in-time) tracing invokes
an R function with example inputs, records all operations that
occurred when the function was run, and returns a script_function
object containing the TorchScript representation.

The nice thing about this is that TorchScript programs are
easily serializable, optimizable, and they can be loaded by another
program written in PyTorch or LibTorch without requiring any R
dependency.

Suppose you have the following R function that takes a tensor,
and does a matrix multiplication with a fixed weight matrix and
then adds a bias term:

w <- torch_randn(10, 1)
b <- torch_randn(1)
fn <- function(x) {
  a <- torch_mm(x, w)
  a + b
}

This function can be JIT-traced into TorchScript with jit_trace
by passing the function and example inputs:

x <- torch_ones(2, 10)
tr_fn <- jit_trace(fn, x)
tr_fn(x)
torch_tensor
-0.6880
-0.6880
[ CPUFloatType{2,1} ]

Now all torch operations that happened when computing the result
of this function were traced and transformed into a graph:

tr_fn$graph
graph(%0 : Float(2:10, 10:1, requires_grad=0, device=cpu)):
  %1 : Float(10:1, 1:1, requires_grad=0, device=cpu) = prim::Constant[value=-0.3532 0.6490 -0.9255 0.9452 -1.2844 0.3011 0.4590 -0.2026 -1.2983 1.5800 [ CPUFloatType{10,1} ]]()
  %2 : Float(2:1, 1:1, requires_grad=0, device=cpu) = aten::mm(%0, %1)
  %3 : Float(1:1, requires_grad=0, device=cpu) = prim::Constant[value={-0.558343}]()
  %4 : int = prim::Constant[value=1]()
  %5 : Float(2:1, 1:1, requires_grad=0, device=cpu) = aten::add(%2, %3, %4)
  return (%5)

The traced function can be serialized with jit_save:

jit_save(tr_fn, "linear.pt")

It can be reloaded in R with jit_load, but it can also be
reloaded in Python with torch.jit.load:

import torch
fn = torch.jit.load("linear.pt")
fn(torch.ones(2, 10))
tensor([[-0.6880],
        [-0.6880]])

How cool is that?!

This is just the initial support for JIT in R, and we will continue
developing it. Specifically, in the next version of torch we plan
to support tracing nn_modules directly, which will also let you
take advantage of TorchScript to make your models run faster.
Currently, you need to detach all parameters before tracing them;
see an example here.

Also note that tracing has some limitations, especially when
your code has loops or control flow statements that depend on
tensor data. See ?jit_trace to learn more.
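
To see why data-dependent control flow is a problem for tracing,
here is a small sketch in plain Python. It is not how torch or
TorchScript is implemented; the Tracer class and the trace and
replay helpers are hypothetical names made up for this
illustration. The tracer runs the function once on an example
input and records only the operations that actually executed, so
the branch that was not taken never reaches the recorded program.

```python
class Tracer:
    """Wraps a number and records every arithmetic operation applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def __mul__(self, k):
        self.tape.append(("mul", k))
        return Tracer(self.value * k, self.tape)

    def __add__(self, k):
        self.tape.append(("add", k))
        return Tracer(self.value + k, self.tape)


def fn(x):
    # Data-dependent control flow: the branch depends on the input's value.
    if x.value > 0:
        return x * 2
    return x + 100


def trace(func, example):
    """Run func once on the example input and return the recorded operations."""
    tape = []
    func(Tracer(example, tape))
    return tape


def replay(tape, value):
    """Apply the recorded operations to a new input, ignoring the original code."""
    for op, k in tape:
        value = value * k if op == "mul" else value + k
    return value


tape = trace(fn, 3)               # the example input takes the "x > 0" branch
print(tape)                       # [('mul', 2)]
print(replay(tape, 5))            # 10, same as calling fn on a positive input
print(replay(tape, -1))           # -2, the recorded branch is reused blindly
print(fn(Tracer(-1, [])).value)   # 99, what the original function returns
```

The recorded program agrees with the original function only for
inputs that follow the same path as the example input, which is
the kind of limitation described above.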

New print method for nn_modules

In this release we have also improved the nn_module printing
methods in order to make it easier to understand what’s
inside.

For example, if you create an instance of an nn_linear module
you will see:

nn_linear(10, 1)
An `nn_module` containing 11 parameters.
── Parameters ──────────────────────────────────────────────────────────────────
● weight: Float [1:1, 1:10]
● bias: Float [1:1]

You immediately see the total number of parameters in the module
as well as their names and shapes.

This also works for custom modules (possibly including
sub-modules). For example:

my_module <- nn_module(
  initialize = function() {
    self$linear <- nn_linear(10, 1)
    self$param <- nn_parameter(torch_randn(5,1))
    self$buff <- nn_buffer(torch_randn(5))
  }
)
my_module()
An `nn_module` containing 16 parameters.
── Modules ─────────────────────────────────────────────────────────────────────
● linear: <nn_linear> #11 parameters
── Parameters ──────────────────────────────────────────────────────────────────
● param: Float [1:5, 1:1]
── Buffers ─────────────────────────────────────────────────────────────────────
● buff: Float [1:5]

We hope this makes it easier to understand nn_module objects. We
have also improved autocomplete support for nn_modules and we will
now show all sub-modules, parameters and buffers while you
type.

torchaudio

torchaudio is an extension for torch developed by Athos Damiani
(@athospd), providing audio loading, transformations, common
architectures for signal processing, pre-trained weights and access
to commonly used datasets. It is an almost literal translation of
PyTorch’s Torchaudio library to R.

torchaudio is not yet on CRAN, but you can already try the
development version available here.

You can also visit the pkgdown website for examples
and reference documentation.

Other features and bug fixes

Thanks to community contributions we have found and fixed many
bugs in torch. We have also added new features including:

You can see the full list of changes in the NEWS.md
file.

Thanks very much for reading this blog post, and feel free to
reach out on GitHub for help or discussions!

The photo used in this post preview is by Oleg Illarionov on Unsplash.

Categories
Misc

Scotland’s Rural College Makes Moo-ves Against Bovine Tuberculosis with AI

Each morning millions of bleary-eyed people pour milk into their bowls of cereal or cups of coffee without a second thought as to where that beverage came from. Few will consider the processes in place to maintain the health of the animals involved in milk production and to ensure that the final product is fit …

The post Scotland’s Rural College Makes Moo-ves Against Bovine Tuberculosis with AI appeared first on The Official NVIDIA Blog.

Categories
Misc

Ubuntu install help

Hi there! I am brand new to Ubuntu and I have just built a
computer running it. I am currently trying to download TensorFlow
for my computer and I am trying to run the command:

$. ./venv/bin/activate.fish # fish

and it’s returning this:

bash: ./venv/bin/activate.fish: line 4: syntax error near unexpected token `-d'

bash: ./venv/bin/activate.fish: line 4: `function deactivate -d "Exit virtualenv and return to normal shell environment"'

Any thoughts?

submitted by /u/Elrekl

Categories
Misc

layers with zero parameters have no weights?

When we print model.summary(), it shows the number of parameters
associated with each layer. When the number is zero, it means there
are no weights associated with this layer. Is that correct?

A very basic question but I just want to confirm. Having issues
with loading pretrained weights ‘by_name’, hence trying to
debug.

submitted by /u/juggy94

Categories
Offsites

Privacy Considerations in Large Language Models

Machine learning-based language models trained to predict the next word in a sentence have become increasingly capable, common, and useful, leading to groundbreaking improvements in applications like question-answering, translation, and more. But as language models continue to advance, new and unexpected risks can be exposed, requiring the research community to proactively work to develop new ways to mitigate potential problems.

One such risk is the potential for models to leak details from the data on which they’re trained. While this may be a concern for all large language models, additional issues may arise if a model trained on private data were to be made publicly available. Because these datasets can be large (hundreds of gigabytes) and pull from a range of sources, they can sometimes contain sensitive data, including personally identifiable information (PII) — names, phone numbers, addresses, etc., even if trained on public data. This raises the possibility that a model trained using such data could reflect some of these private details in its output. It is therefore important to identify and minimize the risks of such leaks, and to develop strategies to address the issue for future models.

If one prompts the GPT-2 language model with the prefix “East Stroudsburg Stroudsburg…”, it will autocomplete a long block of text that contains the full name, phone number, email address, and physical address of a particular person whose information was included in GPT-2’s training data.

In “Extracting Training Data from Large Language Models”, a collaboration with OpenAI, Apple, Stanford, Berkeley, and Northeastern University, we demonstrate that, given only the ability to query a pre-trained language model, it is possible to extract specific pieces of training data that the model has memorized. As such, training data extraction attacks are realistic threats on state-of-the-art large language models. This research represents an early, critical step intended to inform researchers about this class of vulnerabilities, so that they may take steps to mitigate these weaknesses.

Ethics of Language Model Attacks
A training data extraction attack has the greatest potential for harm when applied to a model that is available to the public, but for which the dataset used in training is not. However, since conducting this research on such a dataset could have harmful consequences, we instead mount a proof of concept training data extraction attack on GPT-2, a large, publicly available language model developed by OpenAI, that was trained using only public data. While this work focuses on GPT-2 specifically, the results apply to understanding what privacy threats are possible on large language models generally.

As with other privacy- and security-related research, it is important to consider the ethics of such attacks before actually performing them. To minimize the potential risk of this work, the training data extraction attack in this work was developed using publicly available data. Furthermore, the GPT-2 model itself was made public by OpenAI in 2019, and the training data used to train GPT-2 was collected from the public internet, and is available for download by anyone who follows the data collection process documented in the GPT-2 paper.

Additionally, in accordance with responsible computer security disclosure norms, we followed up with individuals whose PII was extracted, and secured their permission before including references to this data in publication. Further, in all publications of this work, we have redacted any personally identifying information that may identify individuals. We have also worked closely with OpenAI in the analysis of GPT-2.

The Training Data Extraction Attack
By design, language models make it very easy to generate a large amount of output data. By seeding the model with random short phrases, the model can generate millions of continuations, i.e., probable phrases that complete the sentence. Most of the time, these continuations will be benign strings of sensible text. For example, when asked to predict the continuation of the string “Mary had a little…”, a language model will have high confidence that the next token is the word “lamb”. However, if one particular training document happened to repeat the string “Mary had a little wombat” many times, the model might predict that phrase instead.

The goal of a training data extraction attack is then to sift through the millions of output sequences from the language model and predict which text is memorized. To accomplish this, our approach leverages the fact that models tend to be more confident on results captured directly from their training data. These membership inference attacks enable us to predict if a result was used in the training data by checking the confidence of the model on a particular sequence.

The main technical contribution of this work is the development of a method for inferring membership with high accuracy along with techniques for sampling from models in a way that encourages the output of memorized content. We tested a number of different sampling strategies, the most successful of which generates text conditioned on a wide variety of input phrases. We then compare the output of two different language models. When one model has high confidence in a sequence, but the other (equally accurate) model has low confidence in a sequence, it’s likely that the first model has memorized the data.
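
The comparison between two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-token log-probabilities below are made-up numbers rather than real model outputs, the function names are hypothetical, and the ratio of average negative log-likelihoods is one simple way to express "one model is confident, the other is not", not necessarily the exact metric used in the paper.

```python
def mean_negative_log_likelihood(token_logprobs):
    """Average negative log-probability per token (the log of the perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)


def memorization_score(target_logprobs, reference_logprobs):
    """Higher when the target model is far more confident than the reference model."""
    target_nll = mean_negative_log_likelihood(target_logprobs)
    reference_nll = mean_negative_log_likelihood(reference_logprobs)
    return reference_nll / target_nll


# Hypothetical per-token natural-log probabilities for three candidate sequences:
# (log-probs under the target model, log-probs under the reference model).
candidates = {
    "memorized-looking": ([-0.1, -0.2, -0.1], [-2.0, -2.5, -1.5]),
    "common phrase": ([-0.2, -0.1, -0.3], [-0.3, -0.2, -0.1]),
    "unlikely text": ([-3.0, -2.5, -3.5], [-3.2, -2.4, -3.3]),
}

scores = {name: memorization_score(t, r) for name, (t, r) in candidates.items()}

# Rank candidates so the most suspicious sequences are checked first.
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.2f}")
```

A sequence that both models find easy (a common phrase) or both find hard scores close to 1, while a sequence that only the target model finds easy rises to the top of the list for manual verification.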

Results
Out of 1800 candidate sequences from the GPT-2 language model, we extracted over 600 that were memorized from the public training data, with the total number limited by the need for manual verification. The memorized examples cover a wide range of content, including news headlines, log messages, JavaScript code, PII, and more. Many of these examples are memorized even though they appear infrequently in the training dataset. For example, many of the PII samples we extract are found in only a single document in the dataset. However, in most of these cases, the originating document contains multiple instances of the PII, and as a result, the model still learns it as high likelihood text.

Finally, we also find that the larger the language model, the more easily it memorizes training data. For example, in one experiment we find that the 1.5 billion parameter GPT-2 XL model memorizes 10 times more information than the 124 million parameter GPT-2 Small model. Given that the research community has already trained models 10 to 100 times larger, this means that as time goes by, more work will be required to monitor and mitigate this problem in increasingly large language models.

Lessons
While we demonstrate these attacks on GPT-2 specifically, they show potential flaws in all large generative language models. The fact that these attacks are possible has important consequences for the future of machine learning research using these types of models.

Fortunately, there are several ways to mitigate this issue. The most straightforward solution is to ensure that models do not train on any potentially problematic data. But this can be difficult to do in practice.

The use of differential privacy, which allows training on a dataset without revealing any details of individual training examples, is one of the most principled techniques to train machine learning models with privacy. In TensorFlow, this can be achieved with the use of the tensorflow/privacy module (or similar for PyTorch or JAX) that is a drop-in replacement for existing optimizers. Even this can have limitations and won’t prevent memorization of content that is repeated often enough. If this is not possible, we recommend at least measuring how much memorization occurs so appropriate action can be taken.
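To make the idea concrete, here is a minimal sketch of the clip-and-noise step that differentially private optimizers are generally built around. It is written in plain Python and is not the tensorflow/privacy API; the function names, parameters, and example gradients are assumptions chosen for illustration.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier,
                          rng=random):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Clipping bounds how much any single example can influence the update;
    the noise (std = noise_multiplier * max_norm) masks what remains.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Made-up per-example gradients, for illustration only.
grads = [[3.0, 4.0], [0.0, 0.0]]
noiseless = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=0.0)
noisy = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1)
```

With `noise_multiplier=0.0` the function reduces to an average of clipped gradients, which gives no privacy guarantee; it is shown only to make the clipping effect visible.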

Language models continue to demonstrate great utility and flexibility—yet, like all innovations, they can also pose risks. Developing them responsibly means proactively identifying those risks and developing ways to mitigate them. We hope that this effort to highlight current weaknesses in large language modeling will raise awareness of this challenge in the broader machine learning community and motivate researchers to continue to develop effective techniques to train models with reduced memorization.

Acknowledgements
This work was performed jointly with Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel.

Categories
Misc

Meet the Researcher, Rommie Amaro, Simulating the SARS-CoV-2 virus with AI and HPC

‘Meet the Researcher’ is a new series in which we spotlight different researchers in academia who are using GPUs to accelerate their work. This month we spotlight Rommie Amaro, professor and endowed chair in the Department of Chemistry and Biochemistry at the University of California, San Diego.

Amaro is also the principal investigator of the Amaro Lab, which is broadly concerned with the development and application of state-of-the-art computational and theoretical techniques to investigate the structure, function, and dynamics of complex biological systems for applications to drug discovery.

This year, she led a team of 28 researchers at a number of different institutions, combining high performance computing (HPC) and AI to provide the clearest view to date of the coronavirus, winning a special Gordon Bell Prize in November 2020.

What are your research areas of focus?

I like to call myself a computational biophysical chemist. In general, we use computational models to explore biological and chemical systems. The idea of being able to use mathematics and computing to understand how biological systems work is just fascinating to me. Over the years, we’ve applied this to areas such as cancer and DNA editing.

What motivated you to pursue your recent research area of focus in supercomputing and the fight against COVID?

We had been working for a number of years on influenza virus, and so after 7-8 years of work on that, we published a paper in February on simulating the influenza virus envelope. That project was really a labor of love. It presented, for the first time, molecular simulations of a lipid enveloped virus in atomic detail. The size was 160 million atoms, very large, and to get to that point we had to overcome multiple technical challenges to simulating viruses in such detail.

When SARS-CoV-2 hit, it was an immediate and natural pivot to focus on COVID. As soon as the McLellan group deposited the spike structure in the bioRxiv, we scooped up the data and started working with it, aiming to simulate the complete spike structure in all its atomic detail.

Simulation of SARS-CoV-2

What problems does your research address?

There’s this aspect of the virus that’s very intriguing. Viruses in general have evolved something called a glycan shield, which is basically a sugary coating that the virus uses to mask itself from the human immune system.

All of the cells in our body are coated with different types of sugar molecules, called glycans. The viruses have evolved this way of looking just like other human cells by covering themselves in the same sort of sugary coating. That way, when viruses get into our bodies, our immune system doesn’t see spike protein, it sees this sugary coating. Since it looks like any other cell, the human body doesn’t recognize it as a threat and therefore doesn’t activate the immune system response.

Experimentally, we know these glycans are there but can’t take pictures of them: because the sugars move around quite a bit, one can’t get a crisp image of what they actually look like. This is where computing has a lot to give because we can basically create an atomic level structure of what those sugar molecules look like and then that gives us the picture of the sugar structures and what they’re doing. That’s what our research addresses.

Our simulations gave people the first views of what the glycan shield looks like on the viral spike protein. For researchers, it’s not only knowing where that shield is, but it’s even more critical knowing where the shield is not because there are holes in the shield, vulnerabilities of the virus. We can use that information to understand where and how antibodies bind and how to use this information in the design of novel therapeutics.

Simulation of SARS-CoV-2 spike protein 

What is the (expected) impact of your work on the field/community/world?

The reason that we care about simulating the virus in all its atomic detail is that it helps us understand how the virus works, the mechanisms of viral infection, as well as to understand and design new therapeutics. For example, drug molecules that target different virus protein molecules or different parts of the virus. It also helps us understand how antibodies work, their design with vaccines. Having information about where all the atoms are and what they’re doing is really critical.

How have you used NVIDIA technology in your current research?

The main tool my team is using is molecular dynamics simulations. We like to think of this tool as a computational microscope. We take data from different sources, bring these data streams together to create these highly detailed, accurate models of the virus.

The simulations run on chips like NVIDIA GPU systems, and use supercomputing sites that use NVIDIA GPUs. The kind of math these simulations require ports super well to GPUs. That’s where NVIDIA technology has been transformative: it’s allowed us to speed up our calculations tremendously. It’s basically making our microscope really powerful–getting better, more accurate views of the virus much faster. The sooner we have that information, the sooner we can develop drugs and therapeutics to fight the virus.

What were some of the challenges you faced while pursuing your research?

This whole year has been extraordinary. It’s a really crazy time to be a scientist, in the sense of being on the front lines trying to understand the mysteries of COVID-19. There’s so much data coming in from all over the world. We’ve had so many sleepless nights since February/March, working around the clock.

At the same time, there have been so many amazing collaborations. So many people around the world focused on one single problem. This doesn’t happen typically; we’re each working on different things.

What’s next for your research?

We’re just at the beginning, there’s so much we want to do. Now that we’ve got the virus built and simulated, and that was honestly an incredible feat in terms of how fast we were able to do it, the next is to build and simulate the host cell, the human cell. We want to understand how the virus latches on because it’s a pretty complex process.

Another aspect I’m really excited to study—there are a lot of questions about how and why the virus is airborne transmissible. One of the things we’re planning to do next is to build aerosol particles that have the virus inside of them. These systems will have about one billion atoms. This will help us understand what’s happening to the virus in the air and how long it lives and how it can infect people under a range of conditions, such as relative humidity.

There’s just so much to do.

Any advice for new researchers, or early career professionals, especially to those who are inspired and motivated by your work?

Find your passion. It won’t necessarily be science for everybody. We need so many kinds of people to make the world go round.

Find that thing that lights you up, that you just want to keep coming back to. When I was at university and Saturday came around, everyone would go tailgating. I looked forward to going into the lab. When you really find your passion, your work doesn’t feel like your job. And when your passion unites with the ability to serve others or a greater good, it’s so incredibly rewarding.

Also, education is super important–stay at it, stay in it, to get a better life.

Categories
Offsites

Personal Assistant Kino Part 1 – Overview

The Kino project is about getting to know myself through QS (Quantified Self), automating unnecessary chores, and improving my quality of life.

Image source: http://quantifiedself.com/

The series so far

Github: https://github.com/DongjunLee/quantified-self

Introduction

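Lately I have been reading a lot about bots and even built one as a small side project, and I came to feel that it was finally time to start the personal project I had most wanted to build, the one that had been sitting in the back of my mind all along.

That project is a personal assistant bot. I had long thought about building a personal assistant for myself and no one else, and in preparation I had been recording how I spend my time with Toggl, and which programs I use and how productive I am with RescueTime. In addition, my Pebble Time was tracking my steps and sleep, and I was managing my schedule with Todoist. In this way, information about me was being turned into quantified data and piling up in various places.

So I had been wanting to build a bot that knows me, based on the data about me accumulated this way. Conveniently, most of the services above provide an API.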

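Summary of data collection

  1. Toggl: An app that makes time tracking easy. I start a timer before beginning a task and stop it when the task ends, and I use it mostly to record the work I have done.
  2. RescueTime: A productivity management tool that records and reports the time spent in each app on a PC.
  3. Pebble: Tracks steps and sleep. The data appears to be stored on the smartwatch itself, so getting at it conveniently looks difficult.
  4. Todoist: An online task and to-do list manager. It is available on smartphones, PCs, the web, and other platforms, so I find it convenient to use.
  5. Other apps that collect sleep time, happiness, focus, and productivity data (if none exist, build the feature into the bot)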

Bot Platform

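With data collection set up as above before developing the bot, I next thought about what to build the bot on. Recently many messaging app companies such as Telegram, Facebook Messenger, and Line have opened up bot APIs, but they were somewhat ill-suited to personal use.

While looking around, my personal Slack workspace caught my eye. I was already connecting IFTTT to log information from various services in Slack, and using it to take personal notes and look up information. Slack also makes communication very simple, requiring only a BOT_TOKEN, and I already had development experience from building Salady Bot on Slack, so I considered it an even better fit.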

Image source: Slack

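Why I chose Slack

  1. I already run a Slack workspace set up for personal use (so it can be used privately).
  2. Other apps require setting up a server and configuring webhooks, but Slack can communicate with just a token.
  3. I have already developed a Slack bot.
  4. I was already comfortable with it from using it for teams.
  5. It is easy to integrate external services.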

Chat Bot

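As chatbots have recently become a hot topic, I have come across many articles about bots, and with frameworks such as api.ai and wit.ai, and more recently AMICA.ai and fluenty.ai in Korea, it has become easy to build a chatbot, and chatbots have become somewhat more mainstream.

Using these bot frameworks gives a rough feel for a bot's flow and how it is put together. Most frameworks use an NLP engine to extract the intent, named entities, sentiment, domain, and so on, and the bot builder then enters the responses and links them to what was extracted. For now they handle short-term context well, that is, remembering and acting on what was said a moment ago, but long-term context, remembering an older conversation and carrying it forward, is much harder. That is why the bot frameworks above are also being developed with the goal of supporting long-term context.

So, as I see it, bots fall broadly into three types.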

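  1. Basic Chatbot: The bot is simply another UX. It gives fixed responses to fixed inputs. Input is usually handled with regular expressions, and no NLP is involved.
  2. Smart Chatbot: From this stage on, NLP is applied. As with the features offered by the various bot frameworks, it extracts the intent, named entities, sentiment, domain, and so on, and responds accordingly. Here a dialog manager tracks and manages the conversation, and NLG, which generates natural responses, is also included.
  3. A.I Chatbot: A bot at this stage means strong AI. It is still hard to imagine what form it will take. I think we are moving toward this stage little by little as deep learning and reinforcement learning come together.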
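The first category, the Basic Chatbot, can be illustrated with a few lines of Python. This is only a sketch of regex-driven fixed responses; the rules and reply strings are hypothetical and are not taken from Kino.

```python
import re

# Hypothetical rules: each pattern maps to one fixed response.
RULES = [
    (re.compile(r"\b(hi|hello)\b", re.IGNORECASE), "Hello! How can I help?"),
    (re.compile(r"\bweather\b", re.IGNORECASE), "Let me check the weather."),
]
FALLBACK = "Sorry, I did not understand that."


def reply(message):
    """Return the response of the first rule whose pattern matches."""
    for pattern, response in RULES:
        if pattern.search(message):
            return response
    return FALLBACK


print(reply("Hello Kino"))
print(reply("What's the weather like?"))
print(reply("Sing me a song"))
```

Because every response is hard-coded, a bot like this never learns anything, but it is easy to reason about and can start collecting conversation data from day one.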

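Of these, my goal for the bot is, for now, a Smart Chatbot. The problem is that this requires a certain amount of data.

So it is common to start with a Basic Chatbot and collect data. Once a certain amount of data has accumulated, the bot can be automated through imitation learning, which learns from the existing logic. After that, as more data is collected and criteria emerge, the bot can be made to behave a little more intelligently according to them.

While planning this long-term project, I realized which features it would need along the way.
First, since it is a personal bot, it does not need to understand language in an elaborate way. Minimal natural language processing and simple keyword matching are enough to identify the intent.
Next, rather than building every new feature from scratch, build features on top of existing services.
Finally, it should be able to run the features I need at the times I specify.

So I registered various services as Kino's Skills, and made it possible to register a Job through simple natural language processing so that it runs at the time I want (for example, "tell me the weather every 2 hours" or "daily briefing at 8 a.m."). Below are Kino's interim results so far.

Next time, I will cover Skills and Scheduling in more detail.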
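Keyword matching plus a scheduled Job can be sketched as below. This is not Kino's actual implementation: the skill names, keyword lists, and the dictionary layout of a parsed job are assumptions, and the inputs are English renderings of the two example commands.

```python
import re

# Hypothetical skills and the keywords that trigger them.
SKILL_KEYWORDS = {
    "weather": ["weather", "forecast"],
    "briefing": ["briefing", "summary"],
}


def detect_skill(text):
    """Return the first skill whose keyword appears in the text."""
    lowered = text.lower()
    for skill, keywords in SKILL_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return skill
    return None


def parse_schedule(text):
    """Recognize 'every N hours' or 'at H am/pm'; return None otherwise."""
    lowered = text.lower()
    match = re.search(r"every (\d+) hours?", lowered)
    if match:
        return {"type": "interval", "hours": int(match.group(1))}
    match = re.search(r"at (\d{1,2})\s*(am|pm)", lowered)
    if match:
        hour = int(match.group(1)) % 12
        if match.group(2) == "pm":
            hour += 12
        return {"type": "daily", "hour": hour}
    return None


def parse_job(text):
    """Combine skill and schedule into a job, or None if either is missing."""
    skill = detect_skill(text)
    schedule = parse_schedule(text)
    if skill is None or schedule is None:
        return None
    return {"skill": skill, "schedule": schedule}


print(parse_job("Tell me the weather every 2 hours"))
print(parse_job("Daily briefing at 8 am"))
```

A real bot would hand the parsed job to a scheduler; that part is left out here to keep the example self-contained.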

Kino

And so I began building Kino, my own personal assistant. Below are the interim results.

skill_example1

Weather Skill

skill_example2

Kino tells me my schedule in the morning

guide

Intro & guide

functions

Available Skills