Categories
Misc

PyCharm for TensorFlow Developer Certification

I’m starting to prepare for the certification, and from what I know, Jupyter notebooks are only editable in PyCharm Professional Edition. The exercises in the course I’m taking (DeepLearning.AI TensorFlow Developer Professional Certificate) are in .ipynb format.

So I’m a little confused about whether the test will be done in .ipynb or .py format. If it is in .ipynb, will I need to get PyCharm Professional? Thanks in advance!

submitted by /u/jonathanneam
[visit reddit] [comments]

Categories
Misc

CUDA 11.6 Toolkit New Release Revealed

The new CUDA 11.6 Toolkit focuses on enhancing the programming model and performance of your CUDA applications.

NVIDIA announces the newest release of the CUDA development environment, CUDA 11.6. This release is focused on enhancing the programming model and performance of your CUDA applications. CUDA continues to push the boundaries of GPU acceleration and lay the foundation for new applications in HPC, visualization, AI, ML and DL, and data science.

CUDA 11.6 has several important features. This post offers an overview of the key capabilities:

  • GSP driver architecture now default on Turing and Ampere GPUs
  • New API to allow disabling nodes in instantiated graph
  • Full support of 128-bit integer type
  • Cooperative groups namespace update
  • CUDA compiler update
  • Nsight Compute 2022.1 release

CUDA 11.6 ships with the R510 driver, an update branch. The CUDA 11.6 Toolkit is available to download.

GSP driver architecture

The GSP driver architecture is now the default driver mode for all listed Turing and Ampere GPUs. The older driver architecture is supported as a fallback. For more information, see R510 Driver Readme.

Instantiated Graph Node API additions

We added a new API, cudaGraphNodeSetEnabled, to allow disabling nodes in an instantiated graph. Support is limited to kernel nodes in this release.  A corresponding API, cudaGraphNodeGetEnabled, allows querying the enabled state of a node. We’ve also added the ability to disable NULL kernel graph node launches.

128-bit integer support

CUDA 11.6 includes the full release of the 128-bit integer (__int128) data type, including compiler and developer tools support. The host-side compiler must support the __int128 type in order to use this feature.

Cooperative groups namespace

The cooperative groups namespace has been updated with new functions to improve consistency in naming, function scope, and unit dimension and size.

Implicit Group/Member    Threads                                                Blocks
thread_block::           dim_threads, num_threads, thread_rank, thread_index   (Not needed)
grid_group::             num_threads, thread_rank                              dim_blocks, num_blocks, block_rank, block_index

Table 1. New functions in the cooperative groups namespace

CUDA compiler

  • Added the -arch=native compilation option to target installed GPUs during compilation. This extends the existing -gencode=arch=compute_xx,code=sm_xx architecture specification.
  • Added the ability to create PTX files from nvlink.

Deprecated features

  • cudaDeviceSynchronize() used for on-device fork and join parallelism is deprecated in preparation for a replacement programming model with higher performance. The function continues to work in this release, but the tools emit a warning about the upcoming change.
  • CentOS Linux 8 reached end of life on December 31, 2021, and support for this OS is now deprecated in the CUDA Toolkit. CentOS Linux 8 support will be completely removed in a future release.

Categories
Misc

Hi, I need some help with a custom loss function that I’m trying to build for a class project

I’m trying to make a GAN model that reconstructs one image from another, so I have two MRI images. I attached an example of these images, where my input is the top one and the target is the bottom one. My problem is that I’m trying to create a custom loss function that penalizes background pixels that are not inside the brain. A professor told me about masking the image inside the loss function, but I’m kind of lost :/. Can anyone help me?

Edit: My image didn’t upload so I’m attaching the link https://www.researchgate.net/publication/273638459/figure/fig4/AS:272623535915014@1442009924273/Brain-MRI-and-DTI-at-2years-of-age-A-Axial-T1-weighted-images-reveal-microcephaly-with.png
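
For what it’s worth, here is a minimal sketch of how a masked reconstruction loss could look in Keras. It assumes you can derive (or separately supply) a binary brain mask for each target image; the threshold used to build the mask below is purely a placeholder assumption.

import tensorflow as tf

def masked_mae(y_true, y_pred):
    # Assumption: background pixels in the target image are (close to) zero,
    # so a simple threshold yields a binary brain mask. Swap in a real mask if you have one.
    mask = tf.cast(y_true > 0.05, tf.float32)
    # Zero out the error outside the brain so the background cannot dominate the loss.
    abs_err = tf.abs(y_true - y_pred) * mask
    # Average over brain pixels only (the epsilon avoids division by zero for empty masks).
    return tf.reduce_sum(abs_err) / (tf.reduce_sum(mask) + 1e-8)

# Hypothetical usage with a Keras generator/model:
# generator.compile(optimizer="adam", loss=masked_mae)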

submitted by /u/DaSpaceman245
[visit reddit] [comments]

Categories
Misc

Tensorflow_probability CNN gives the same output regardless of the input

submitted by /u/RaunchyAppleSauce
[visit reddit] [comments]

Categories
Misc

Is it possible for someone to give me some advice on why my model just does not work?

Hi all, I’m trying to train a TensorFlow model to differentiate between a British one pence coin and a two pence coin.

I have 50 photos of each at 255 x 255 pixels; all photos are in focus and taken in natural sunlight.

Here is the training code I am using

https://pastebin.com/khjZrSgs

And here is the testing code I am using:

https://pastebin.com/1RZc35xK

I am pretty new to this and only following YouTube tutorials, so any advice would be fantastic!

Thanks all!

submitted by /u/Adhesive_Hooks
[visit reddit] [comments]

Categories
Misc

I want to extract patches from multiple images of a batch

Consider my input data.shape = [batch_size, 10_images, img_size, img_size, 1]

I want to extract patches of size patch_size from each of the 10 images in a particular batch.

After that, I want to combine the patches of all 10 images in sequential order.

So my output would look like output.shape = [batch_size, 10_images * patch_size * patch_size * 1].

Please help.
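
One possible way to do this is with tf.image.extract_patches; a sketch follows (the sizes below are made-up example values, and note that the flattened dimension will also include the number of patches per image, which the target shape above does not show).

import tensorflow as tf

batch_size, n_images, img_size, patch_size = 4, 10, 64, 16  # example values only
data = tf.random.normal([batch_size, n_images, img_size, img_size, 1])

# Fold the 10 images into the batch dimension, because tf.image.extract_patches
# expects a 4-D [batch, height, width, channels] tensor.
x = tf.reshape(data, [batch_size * n_images, img_size, img_size, 1])

patches = tf.image.extract_patches(
    images=x,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)  # [batch_size * n_images, n_rows, n_cols, patch_size * patch_size * 1]

# Flatten the patches of all 10 images of each sample into one sequential vector.
output = tf.reshape(patches, [batch_size, -1])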

submitted by /u/ashwani_iitp
[visit reddit] [comments]

Categories
Misc

Obtain operators list

I am currently working on a TensorFlow 1 project that I would like to migrate to TensorFlow 2. The project builds a TF model and then extracts the operations used in the model with the following code:

graph = tf.get_default_graph()

operations = graph.get_operations()

I can’t find a way to do the same using TensorFlow 2. Does anyone know how to do so? Thanks!
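
In case it helps, one way this can be done in TF2 is to trace the model through a tf.function and query the resulting concrete function’s graph; a minimal sketch (the model here is just a stand-in):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,))])  # placeholder model

@tf.function
def forward(x):
    return model(x)

# Tracing produces a FuncGraph, which exposes the same get_operations() API as TF1 graphs.
concrete_fn = forward.get_concrete_function(tf.TensorSpec([None, 4], tf.float32))
operations = concrete_fn.graph.get_operations()
print([op.type for op in operations])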

submitted by /u/ostrichfear
[visit reddit] [comments]

Categories
Misc

How to have better results at forecasting?

I am a beginner in machine learning and I would like to forecast some pollution data.

I am using a dataset with values for pm2.5, pm10 and pm1 as features, and I am predicting the values for pm2.5. I built an LSTM network, but the predicted values are quite far from the real values.

What I used:

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

win_length = 2
batch_size = 32
num_features = 3
train_generator = TimeseriesGenerator(x_train, y_train, length=win_length, sampling_rate=1, batch_size=batch_size)
test_generator = TimeseriesGenerator(x_test, y_test, length=win_length, sampling_rate=1, batch_size=batch_size)

The model I used is an LSTM:

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(200, input_shape=(win_length, num_features), return_sequences=True))
model.add(tf.keras.layers.LeakyReLU(alpha=0.5))
model.add(tf.keras.layers.LSTM(128, return_sequences=True))
model.add(tf.keras.layers.LeakyReLU(alpha=0.5))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.LSTM(64, return_sequences=False))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(1))

This is the data snippet and how the predicted values look compared to the original pm2.5 values: prediction snippet

How to increase the accuracy of the forecast? I am also attaching the jupyter notebook, which contains all the analysis: https://github.com/creativitylab/dataset/blob/main/pollution%20data.ipynb

submitted by /u/MobileInformal460
[visit reddit] [comments]

Categories
Offsites

Learning to Route by Task for Efficient Inference

Scaling large language models has resulted in significant quality improvements in natural language understanding (T5), generation (GPT-3), and multilingual neural machine translation (M4). One common approach to building a larger model is to increase the depth (number of layers) and width (layer dimensionality), simply enlarging existing dimensions of the network. Such dense models take an input sequence (divided into smaller components, called tokens) and pass every token through the full network, activating every layer and parameter. While these large, dense models have achieved state-of-the-art results on multiple natural language processing (NLP) tasks, their training cost increases linearly with model size.

An alternative, and increasingly popular, approach is to build sparsely activated models based on a mixture of experts (MoE) (e.g., GShard-M4 or GLaM), where each token passed to the network follows a separate subnetwork by skipping some of the model parameters. The choice of how to distribute the input tokens to each subnetwork (the “experts”) is determined by small router networks that are trained together with the rest of the network. This allows researchers to increase model size (and hence, performance) without a proportional increase in training cost.

While this is an effective strategy at training time, sending tokens of a long sequence to multiple experts again makes inference computationally expensive because the experts have to be distributed among a large number of accelerators. For example, serving the 1.2T parameter GLaM model requires 256 TPU-v3 chips. Much like dense models, the number of processors needed to serve an MoE model still scales linearly with respect to the model size, increasing compute requirements while also resulting in significant communication overhead and added engineering complexity.

In “Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference”, we introduce a method called Task-level Mixture-of-Experts (TaskMoE), that takes advantage of the quality gains of model scaling while still being efficient to serve. Our solution is to train a large multi-task model from which we then extract smaller, stand-alone per-task subnetworks suitable for inference with no loss in model quality and with significantly reduced inference latency. We demonstrate the effectiveness of this method for multilingual neural machine translation (NMT) compared to other mixture of experts models and to models compressed using knowledge distillation.

Training Large Sparsely Activated Models with Task Information
We train a sparsely activated model, where router networks learn to send tokens of each task-specific input to different subnetworks of the model associated with the task of interest. For example, in the case of multilingual NMT, every token of a given language is routed to the same subnetwork. This differs from other recent approaches, such as the sparsely gated mixture of expert models (e.g., TokenMoE), where router networks learn to send different tokens in an input to different subnetworks independent of task.
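
To make the distinction concrete, here is a loose, purely illustrative Python sketch (not the paper’s implementation; all names, sizes, and shapes are assumptions) of a task-level MoE layer, where the routing decision depends only on the task, so every token of a task goes through the same small set of experts:

import tensorflow as tf

num_experts, d_model = 4, 16  # illustrative sizes
experts = [tf.keras.layers.Dense(d_model, activation="relu") for _ in range(num_experts)]
task_router = tf.keras.layers.Dense(num_experts)  # routes per task, not per token

def task_moe_layer(tokens, task_embedding, k=2):
    # Score the experts once for the whole task and keep the top-k of them.
    logits = task_router(task_embedding)        # [1, num_experts]
    top_k = tf.math.top_k(logits, k=k)
    weights = tf.nn.softmax(top_k.values)[0]    # [k]
    # Every token of this task is sent through the same k experts, so the
    # remaining experts can simply be discarded when serving this task.
    outputs = [experts[int(i)](tokens) for i in top_k.indices[0]]
    return tf.add_n([w * o for w, o in zip(tf.unstack(weights), outputs)])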

Inference: Bypassing Distillation by Extracting Subnetworks
A consequence of this difference in training between TaskMoE and models like TokenMoE is in how we approach inference. Because TokenMoE follows the practice of distributing tokens of the same task to many experts at both training and inference time, it is still computationally expensive at inference.

For TaskMoE, we dedicate a smaller subnetwork to a single task identity during training and inference. At inference time, we extract subnetworks by discarding unused experts for each task. TaskMoE and its variants enable us to train a single large multi-task network and then use a separate subnetwork at inference time for each task without using any additional compression methods post-training. We illustrate the process of training a TaskMoE network and then extracting per-task subnetworks for inference below.

During training, tokens of the same language are routed to the same expert based on language information (either source, target or both) in task-based MoE. Later, during inference we extract subnetworks for each task and discard unused experts.

To demonstrate this approach, we train models based on the Transformer architecture. Similar to GShard-M4 and GLaM, we replace the feedforward network of every other transformer layer with a Mixture-of-Experts (MoE) layer that consists of multiple identical feedforward networks, the “experts”. For each task, the routing network, trained along with the rest of the model, keeps track of the task identity for all input tokens and chooses a certain number of experts per layer (two in this case) to form the task-specific subnetwork. The baseline dense Transformer model has 143M parameters and 6 layers on both the encoder and decoder. The TaskMoE and TokenMoE that we train are also both 6 layers deep but with 32 experts for every MoE layer and have a total of 533M parameters. We train our models using publicly available WMT datasets, with over 431M sentences across 30 language pairs from different language families and scripts. We point the reader to the full paper for further details.

Results
In order to demonstrate the advantage of using TaskMoE at inference time, we compare the throughput, or the number of tokens decoded per second, for TaskMoE, TokenMoE, and a baseline dense model. Once the subnetwork for each task is extracted, TaskMoE is 7x smaller than the 533M parameter TokenMoE model, and it can be served on a single TPUv3 core, instead of 64 cores required for TokenMoE. We see that TaskMoE has a peak throughput twice as high as that of TokenMoE models. In addition, on inspecting the TokenMoE model, we find that 25% of the inference time has been spent in inter-device communication, while virtually no time is spent in communication by TaskMoE.

Comparing the throughput of TaskMoE with TokenMoE across different batch sizes. The maximum batch size for TokenMoE is 1024 as opposed to 4096 for TaskMoE and the dense baseline model. Here, TokenMoE has one instance distributed across 64 TPUv3 cores, while TaskMoE and the baseline model have one instance on each of the 64 cores.

A popular approach to building a smaller network that still performs well is through knowledge distillation, in which a large teacher model trains a smaller student model with the goal of matching the teacher’s performance. However, this method comes at the cost of additional computation needed to train the student from the teacher. So, we also compare TaskMoE to a baseline TokenMoE model that we compress using knowledge distillation. The compressed TokenMoE model has a size comparable to the per-task subnetwork extracted from TaskMoE.

We find that in addition to being a simpler method that does not need any additional training, TaskMoE improves upon a distilled TokenMoE model by 2.1 BLEU on average across all languages in our multilingual translation model. We note that distillation retains 43% of the performance gains achieved from scaling a dense multilingual model to a TokenMoE, whereas extracting the smaller subnetwork from the TaskMoE model results in no loss of quality.

BLEU scores (higher is better) comparing a distilled TokenMoE model to the TaskMoE and TokenMoE models with 12 layers (6 on the encoder and 6 on the decoder) and 32 experts. While both approaches improve upon a multilingual dense baseline, TaskMoE improves upon the baseline by 3.1 BLEU on average while distilling from TokenMoE improves upon the baseline by 1.0 BLEU on average.

Next Steps
The quality improvements often seen with scaling machine learning models have incentivized the research community to work toward advancing scaling technology to enable efficient training of large models. The emerging need to train models capable of generalizing to multiple tasks and modalities only increases the need for scaling models even further. However, the practicality of serving these large models remains a major challenge. Efficiently deploying large models is an important direction of research, and we believe TaskMoE is a promising step toward more inference-friendly algorithms that retain the quality gains of scaling.

Acknowledgements
We would like to first thank our coauthors – Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin and Minh-Thang Luong. We would also like to thank Wolfgang Macherey, Yuanzhong Xu, Zhifeng Chen and Macduff Richard Hughes for their helpful feedback. Special thanks to the Translate and Brain teams for their useful input and discussions, and the entire GShard development team for their foundational contributions to this project. We would also like to thank Tom Small for creating the animations for the blog post.

Categories
Misc

Expanding the NVIDIA DOCA Community for Developers in China

The developer community for NVIDIA DOCA continues to gain momentum around the world.

On January 13, NVIDIA hosted an online workshop to engage with the NVIDIA DOCA developer community in China. The core team at NVIDIA and leading partner representatives joined the workshop to discuss the application scenarios of NVIDIA BlueField DPUs and the NVIDIA DOCA software framework for cloud, data center, and edge. The workshop focused on the requirements for DOCA developers in key industries, such as consumer Internet, cybersecurity, and higher education, and designed a plan to expand the DOCA developer community in China.

Since the June 2021 launch of the DOCA community in China, nearly 1,000 developers have registered for the DOCA Early Access Program, accounting for almost half of global registrations. BlueField DPUs and DOCA show great potential for adoption in China, with developer numbers continuing to grow.

In the second half of 2021, NVIDIA held three online bootcamps to introduce the application and future technology evolution of the BlueField DPU and DOCA in modern data centers. They also explored the software stack, development environment, developer resources, developer guides, and reference applications of the DOCA Software Development Kit (SDK). More than 3,500 developers participated in the bootcamps. At the latest bootcamp, the newest services and applications released with DOCA 1.2 attracted much attention. Technical information and success stories posted on social media and in knowledge communities have many developers and industry professionals excited about future advancements.

NVIDIA is working with leading global platform providers and partners, such as Juniper Networks, Excelero, VMware, and Palo Alto Networks, to integrate and extend solutions based on BlueField DPU and the DOCA software framework. Through the workshop, NVIDIA will help enable developers in China to develop applications in scenarios such as Zero Trust Security, Morpheus AI Security, edge network service platforms, and high-speed distributed storage. A rich developer program in China will be launched in 2022.

DOCA Developer Bootcamp and Virtual Hackathon

Following the hackathons in Europe and North America, NVIDIA intends to host the first spring DOCA Developer Hackathon in China in the second quarter of 2022. Before the hackathon, NVIDIA will host an online bootcamp to teach contestants about BlueField DPU and DOCA programming skills.

NVIDIA will invite teams of developers from partners, customers, and academia to learn, collaborate, and accelerate their software designs under the guidance of NVIDIA expert mentors. The aim is to foster innovative, breakthrough software projects based on BlueField DPU and DOCA 1.2 in high-performance networking, virtualization, cybersecurity, distributed storage, accelerated AI, edge computing, and video streaming processing. The program will empower the developer community in China to create revolutionary data center infrastructure applications and services. After evaluation, NVIDIA will reward outstanding innovation teams.

DPU and DOCA Excellence Center

Leadtek (Shanghai) Information Technology Co., Ltd. and Shanghai Zentek Intelligent Technology Co., Ltd. are highly familiar with NVIDIA products and solutions, including deep learning applications in cloud, data center, and edge scenarios. They have partnered with the NVIDIA Deep Learning Institute to train partners and customers. 

As the first group of members of the NVIDIA authorized DPU and DOCA Excellence Center, the two partners have set up their own Excellence Centers and begun their pilots. During the pilots, each partner will independently build and operate a virtual development platform based on the BlueField-2 DPU, establish a third-party DPU development environment, provide an online practice development environment for DOCA developers in China, and contribute to the DPU and DOCA ecosystem with NVIDIA.

The implementation of the DOCA developer program in China will help grow the global DOCA developer community, facilitate talent development, and enhance the capabilities of developers. It will also boost the performance advantages of solutions based on BlueField DPU and the DOCA SDK, and accelerate time-to-market, creating greater value for customers and partners.

Apply now to join the DOCA developer community and get early access to the DOCA software framework.