DataBloom - Part 365

Multimodal Bottleneck Transformer (MBT): A New Model for Modality Fusion

Post author By
Post date March 15, 2022
No Comments on Multimodal Bottleneck Transformer (MBT): A New Model for Modality Fusion

Posted by Arsha Nagrani and Chen Sun, Research Scientists, Google Research, Perception Team

People interact with the world through multiple sensory streams (e.g., we see objects, hear sounds, read words, feel textures and taste flavors), combining information and forming associations between senses. As real-world data consists of various signals that co-occur, such as video frames and audio tracks, web images and their captions and instructional videos and speech transcripts, it is natural to apply a similar logic when building and designing multimodal machine learning (ML) models.

Effective multimodal models have wide applications — such as multilingual image retrieval, future action prediction, and vision-language navigation — and are important for several reasons; robustness, which is the ability to perform even when one or more modalities is missing or corrupted, and complementarity between modalities, which is the idea that some information may be present only in one modality (e.g., audio stream) and not in the other (e.g., video frames). While the dominant paradigm for multimodal fusion, called late fusion, consists of using separate models to encode each modality, and then simply combining their output representations at the final step, investigating how to effectively and efficiently combine information from different modalities is still understudied.

In “Attention Bottlenecks for Multimodal Fusion”, published at NeurIPS 2021, we introduce a novel transformer-based model for multimodal fusion in video called Multimodal Bottleneck Transformer (MBT). Our model restricts cross-modal attention flow between latent units in two ways: (1) through tight fusion bottlenecks, that force the model to collect and condense the most relevant inputs in each modality (sharing only necessary information with other modalities), and (2) to later layers of the model, allowing early layers to specialize to information from individual modalities. We demonstrate that this approach achieves state-of-the-art results on video classification tasks, with a 50% reduction in FLOPs compared to a vanilla multimodal transformer model. We have also released our code as a tool for researchers to leverage as they expand on multimodal fusion work.

A Vanilla Multimodal Transformer Model
Transformer models consistently obtain state-of-the-art results in ML tasks, including video (ViViT) and audio classification (AST). Both ViViT and AST are built on the Vision Transformer (ViT); in contrast to standard convolutional approaches that process images pixel-by-pixel, ViT treats an image as a sequence of patch tokens (i.e., tokens from a smaller part, or patch, of an image that is made up of multiple pixels). These models then perform self-attention operations across all pairs of patch tokens. However, using transformers for multimodal fusion is challenging because of their high computational cost, with complexity scaling quadratically with input sequence length.

Because transformers effectively process variable length sequences, the simplest way to extend a unimodal transformer, such as ViT, to the multimodal case is to feed the model a sequence of both visual and auditory tokens, with minimal changes to the transformer architecture. We call this a vanilla multimodal transformer model, which allows free attention flow (called vanilla cross-attention) between different spatial and temporal regions in an image, and across frequency and time in audio inputs, represented by spectrograms. However, while easy to implement by concatenating audio and video input tokens, vanilla cross-attention at all layers of the transformer model is unnecessary because audio and visual inputs contain dense, fine-grained information, which may be redundant for the task — increasing complexity.

Restricting Attention Flow
The issue of growing complexity for long sequences in multimodal models can be mitigated by reducing the attention flow. We restrict attention flow using two methods, specifying the fusion layer and adding attention bottlenecks.

Fusion layer (early, mid or late fusion): In multimodal models, the layer where cross-modal interactions are introduced is called the fusion layer. The two extreme versions are early fusion (where all layers in the transformer are cross-modal) and late fusion (where all layers are unimodal and no cross-modal information is exchanged in the transformer encoder). Specifying a fusion layer in between leads to mid fusion. This technique builds on a common paradigm in multimodal learning, which is to restrict cross-modal flow to later layers of the network, allowing early layers to specialize in learning and extracting unimodal patterns.
Attention bottlenecks: We also introduce a small set of latent units that form an attention bottleneck (shown below in purple), which force the model, within a given layer, to collate and condense information from each modality before sharing it with the other, while still allowing free attention flow within a modality. We demonstrate that this bottlenecked version (MBT), outperforms or matches its unrestricted counterpart with lower computational cost.

The different attention configurations in our model. Unlike late fusion (top left), where no cross-modal information is exchanged in the transformer encoder, we investigate two pathways for the exchange of cross-modal information. Early and mid fusion (top middle, top right) is done via standard pairwise self attention across all hidden units in a layer. For mid fusion, cross-modal attention is applied only to later layers in the model. Bottleneck fusion (bottom left) restricts attention flow within a layer through tight latent units called attention bottlenecks. Bottleneck mid fusion (bottom right) applies both forms of restriction in conjunction for optimal performance.

Bottlenecks and Computation Cost
We apply MBT to the task of sound classification using the AudioSet dataset and investigate its performance for two approaches: (1) vanilla cross-attention, and (2) bottleneck fusion. For both approaches, mid fusion (shown by the middle values of the x-axis below) outperforms both early (fusion layer = 0) and late fusion (fusion layer = 12). This suggests that the model benefits from restricting cross-modal connections to later layers, allowing earlier layers to specialize in learning unimodal features; however, it still benefits from multiple layers of cross-modal information flow. We find that adding attention bottlenecks (bottleneck fusion) outperforms or maintains performance with vanilla cross-attention for all fusion layers, with more prominent improvements at lower fusion layers.

The impact of using attention bottlenecks for fusion on mAP performance (left) and compute (right) at different fusion layers on AudioSet. Attention bottlenecks (red) improve performance over vanilla cross-attention (blue) at lower computational cost. Mid fusion, which is in fusion layers 4-10, outperforms both early (fusion layer = 0) and late (fusion layer = 12) fusion, with best performance at fusion layer 8.

We compare the amount of computation, measured in GFLOPs, for both vanilla cross-attention and bottleneck fusion. Using a small number of attention bottlenecks (four bottleneck tokens used in our experiments) adds negligible extra computation over a late fusion model, with computation remaining largely constant with varying fusion layers. This is in contrast to vanilla cross-attention, which has a non-negligible computational cost for every layer it is applied to. We note that for early fusion, bottleneck fusion outperforms vanilla cross-attention by over 2 mean average precision points (mAP) on audiovisual sound classification, with less than half the computational cost.

Results on Sound Classification and Action Recognition
MBT outperforms previous research on popular video classification tasks — sound classification (AudioSet and VGGSound) and action recognition (Kinetics and Epic-Kitchens). For multiple datasets, late fusion and MBT with mid fusion (both fusing audio and vision) outperform the best single modality baseline, and MBT with mid fusion outperforms late fusion.

Across multiple datasets, fusing audio and vision outperforms the best single modality baseline, and MBT with mid fusion outperforms late fusion. For each dataset we report the widely used primary metric, i.e., Audioset: mAP, Epic-Kitchens: Top-1 action accuracy, VGGSound, Moments-in-Time and Kinetics: Top-1 classification accuracy.

Visualization of Attention Heatmaps
To understand the behavior of MBT, we visualize the attention computed by our network following the attention rollout technique. We compute heat maps of the attention from the output classification tokens to the image input space for a vanilla cross-attention model and MBT on the AudioSet test set. For each video clip, we show the original middle frame on the left with the ground truth labels overlayed at the bottom. We demonstrate that the attention is particularly focused on regions in the images that contain motion and create sound, e.g., the fingertips on the piano, the sewing machine, and the face of the dog. The fusion bottlenecks in MBT further force the attention to be localized to smaller regions of the images, e.g., the mouth of the dog in the top left and the woman singing in the middle right. This provides some evidence that the tight bottlenecks force MBT to focus only on the image patches that are relevant for an audio classification task and that benefit from mid fusion with audio.

Summary
We introduce MBT, a new transformer-based architecture for multimodal fusion, and explore various fusion approaches using cross-attention between bottleneck tokens. We demonstrate that restricting cross-modal attention via a small set of fusion bottlenecks achieves state-of-the-art results on a number of video classification benchmarks while also reducing computational costs compared to vanilla cross-attention models.

Acknowledgements
This research was conducted by Arsha Nagrani, Anurag Arnab, Shan Yang, Aren Jansen, Cordelia Schmid and Chen Sun. The blog post was written by Arsha Nagrani, Anurag Arnab and Chen Sun. Animations were created by Tom Small.

Misc

You can use almost any GPU for Deep Learning!

Post author By
Post date March 15, 2022
No Comments on You can use almost any GPU for Deep Learning!

submitted by /u/limapedro
[visit reddit] [comments]

Misc

Expanding Hybrid-Cloud Support in Virtualized Data Centers with New NVIDIA AI Enterprise Integrations

Post author By
Post date March 15, 2022
No Comments on Expanding Hybrid-Cloud Support in Virtualized Data Centers with New NVIDIA AI Enterprise Integrations

Get the latest on NVIDIA AI Enterprise on VMware Cloud Foundation, VMware Cloud Director, and new curated labs with VMware Tanzu and Domino Data.

The new year has been off to a great start with NVIDIA AI Enterprise 1.1 providing production support for container orchestration and Kubernetes cluster management using VMware vSphere with Tanzu 7.0 update 3c, delivering AI/ML workloads to every business in VMs, containers, or Kubernetes.

New LaunchPad labs

New NVIDIA AI Enterprise labs for IT admins and MLOps are available on NVIDIA LaunchPad:

VMware vSphere with Tanzu
Domino Enterprise MLOps platform

NVIDIA AI Enterprise with VMware vSphere with Tanzu

Enterprises can get started quickly with NVIDIA AI Enterprise running on VMware vSphere with Tanzu through the free LaunchPad program that provides immediate, short-term access to NVIDIA AI running on private accelerated compute infrastructure.

A newly added curated lab gives you hands-on experience using VMware Tanzu Kubernetes Grid service to manage a containerized workload using the frameworks provided in the NVIDIA AI Enterprise software suite. In this lab, you can configure, optimize, and orchestrate resources for AI and data science workloads with VMware Tanzu. You get experience using the NGC registry, NVIDIA operators, Kubernetes, and server virtualization, all by running vSphere with Tanzu on NVIDIA-Certified Systems.

NVIDIA AI Enterprise with the Domino Enterprise MLOps platform

MLOps administrators also have something to get excited about, with the addition of another new curated lab available soon on NVIDIA LaunchPad.

NVIDIA AI Enterprise provides validation for the Domino Data Lab Enterprise MLOps Platform with VMware vSphere. Enterprises can run through the lab and get hands-on experience on how to scale data science workloads with the Domino MLOps platform. Data scientists and AI researchers will be able to launch the Domino Workspaces on-demand with container images configured with the latest data science tools and frameworks, included in NVIDIA AI Enterprise, accelerated with NVIDIA GPUs.

The Domino MLOps Platform enables automatic storing and versioning of code, data, and results. When IT administrators try this lab, they use familiar management tools provided by VMware vSphere with Tanzu. Deployed on NVIDIA-Certified Systems, IT now has the confidence of enterprise-grade security, manageability, and support.

New AI Enterprise integrations with VMware

Service providers, telcos, and hybrid cloud enterprises using VMware now have access to NVIDIA AI on the following products:

VMware Cloud Foundation
VMware Cloud Director

Architecture diagram of AI Enterprise layer stacked on Vmware VSphere with Tanzu or VMware Cloud Foundation with Tanzu, over the accelerated mainstream servers (DPU or GPU)

Enterprise AI together with VMware Cloud Foundation

NVIDIA and VMware have expanded support for NVIDIA AI Enterprise to VMware’s flagship hybrid-cloud platform, VMware Cloud Foundation 4.4. Further democratizing AI for every enterprise, the AI-Ready enterprise platform combines the benefits of the full stack VMware Cloud Foundation environment with the NVIDIA software suite running on GPU accelerated mainstream servers that are NVIDIA-Certified. This provides you with the tools and frameworks you need for successfully developing and deploying AI, while enabling IT administrators with full control of infrastructure resources.

The integration of VMware Cloud Foundation with Tanzu and NVIDIA AI Enterprise enables enterprises to extend their software-defined private cloud platform to support a flexible and easily scalable AI-ready infrastructure. Administrators can deploy, configure, and manage IT infrastructure in an automated fashion, using powerful tools like vSphere Distributed Resource Scheduler for initial placement and VMware vMotion for migration of VMs running NVIDIA GPUs.

IT admins benefit from this integration, but so do AI practitioners who can now easily consume admin-assigned resources for their AI and data analytics workloads to get them from development and deployment to scaling quickly.

Addressing the growing demand for AI in the cloud with VMware Cloud Director

As cloud providers are seeing increased demand from customers for modern applications that require accelerated compute, VMware Cloud Director 10.3.2 has added support for NVIDIA AI Enterprise. These service providers can now leverage vSphere support to run AI/ML workloads on NVIDIA Ampere-based GPUs with capabilities like vMotion, multi-tenancy GPU services combined with the most fundamental AI tools and frameworks all managed through VMware Cloud Director.

NVIDIA AI Enterprise with VMware Cloud Director enables industries like cloud service providers and telcos to provide new services for their customers. One example is intelligent video analytics (IVA) using GPU-accelerated computer vision to provide real-time insights for boosting public safety, lowering theft, and improving customer experiences.

Start your AI journey

Get started with NVIDIA AI Enterprise on LaunchPad for free. Apply now for immediate access to the software suite running on NVIDIA-Certified Systems.

Come join us at GTC, March 21-24 to hear and see NVIDIA, VMware, and our partners share more details about these new and exciting integrations. Register today.

Misc

What happens during validation in keras?

Post author By
Post date March 14, 2022
No Comments on What happens during validation in keras?

Hello I am training a model for semantic segmentation using dice loss and as a metric.

model.compile(...., loss=diceLoss, metrics=[dice_coef])

I noticed that in an epoch, the loss and dice_coef are equal. However, val_loss is never equal val_dice_coef. What is the reason behind that? I feel like something is being done (turned on/off) during validation.

I hope any of this makes sense. Thanks!

Edit: i am using segmentation-models library, using a unet with pretrained weights. I tried looking into the source code; but I couldn’t figure out what’s happening.

submitted by /u/potato-question-mark
[visit reddit] [comments]

Misc

Delivering One-Click VR Streaming Using Innoactive Portal and NVIDIA CloudXR

Post author By
Post date March 14, 2022
No Comments on Delivering One-Click VR Streaming Using Innoactive Portal and NVIDIA CloudXR

Building one-click delivery of PC-based VR applications to standalone devices by streaming from the Cloud using NVIDIA CloudXR SDK and Innoactive Portal.

Given how prevalent streaming technologies are today, it is easy to forget that on-demand, globally available applications used to be considered a radical idea. But as the popularity of AR and VR grows, the race is on to push streaming even further. With the demand for realistic content and consumer-grade hardware comes a critical challenge: How to build applications that look good on any device without compromising on the quality and performance requirements of immersive experiences?

To help solve this problem, the team at Innoactive has integrated the NVIDIA CloudXR streaming solution into their VR application deployment platform, Innoactive Portal. Customers like SAP, Deloitte, Linde Engineering, and Volkswagen Group are using the platform to deliver immersive training seamlessly to users whenever and wherever they are.

Portal is a user-centric, enterprise-grade deployment platform for managing and accessing VR content with a single click from any device. What makes Portal unique is the ability to deploy content across platforms and devices including PC VR, standalone VR, and PC 3D applications.

Using both the server and client components of the NVIDIA CloudXR SDK, with improvements in cloud computing frameworks, Portal makes it possible to build an application for PC VR and streaming to all-in-one headsets or other low-compute devices. All end users do is select the application from their library to stream it.

Overcoming the Distribution Bottlenecks of XR

Woman in VR headset interacting with the Innoactive Portal VR launcher — *Figure 1. Innoactive Portal one-click user interface for launching to VR*.

While there are several other enterprise VR platforms out there, Portal makes the process painless for both creators and end users by alleviating three of the most critical bottlenecks in distributing XR content:

The platform’s secure cloud infrastructure provides customers with centralized control of their content and users, ensuring access for only those who need it.
The one-click streaming architecture makes content easy to use for users who may not be familiar with navigating in VR.
The NVIDIA CloudXR support of more complex distribution models, like that of Portal, results in several cost-saving benefits, like the ability to build an application one time and deploy it to any device, anywhere.

It’s been technically possible to use NVIDIA CloudXR to stream a VR application for some time, but Innoactive wanted to provide this functionality to casual and even first-time VR users. They faced a set of challenges to make this happen:

When users want to start a cloud streamed VR experience, they need a suitable NVIDIA GPU-powered machine available to them on-demand.
This server running NVIDIA CloudXR not only requires good performance to perform the rendering, it also must be available to users worldwide.
At the same time, it is important to be as resource-aware as possible. To reach a broad audience, all of this must happen with minimal complexity for the user.

To deliver this zero-friction experience, Innoactive needed to build on top of both the client and server side of the NVIDIA CloudXR SDK.

Building a scalable server architecture for NVIDIA CloudXR

On the server, it is important to ensure that an NVIDIA GPU-powered machine is available on-demand to each user, no matter how many users are trying to access machines, or where or when those users require access.

Relying on cloud providers like Amazon Web Services (AWS) and Microsoft Azure for the compute resources is a logical solution. However, Portal was built with the guiding principle of being an open platform supporting all types of VR content as well as all types of environments to be deployed in.

This agnosticism of cloud providers implies that a service is needed, capable of supporting cloud-rendering machines on either cloud service. In turn this provides the most flexibility to customers who have already decided for one or the other upfront. It also enables Portal to provide maximum coverage and performance to end users by being able to switch to another data center with better latency or performance, independently of the respective provider.

For this reason, Innoactive relies on building custom virtual machine (VM) images with preinstalled software components for any cloud provider. They use a tool called Packer and enable the resulting VMs to be scaled using the respective cloud provider’s VM scale set functionality.

By building for today but planning for tomorrow, Innoactive ensures that this process can be replicated in the future and adapted to work with new cloud providers. Services are now capable of dynamically allocating rendering resources whenever and wherever needed.

Workflow starting with machine configuration through the packer, VM images, EC2 Auto Scaling group, and the Innoactive Portal — *Figure 2. Custom VM image build pipeline for cloud-agnostic NVIDIA CloudXR servers*

When going to production with cloud-rendered VR, the cost of remote rendering resources must stay as low as possible to provide a positive business case. At the same time, ensuring a great user experience requires a high level of performance, which in turn can quickly become expensive. Thus, the Portal team was tasked to find the optimal cost and performance trade-off.

One huge cost driver apart from the size of a VM is its uptime. The target is to pay for what is needed and no more. However, waiting too long to boot up a machine can make the process feel clumsy or untimely.

Due to the on-demand design of Portal, reducing user wait time is key to getting them into a cloud-rendered experience as fast as possible. Portal detects when users start engaging with it and prepares a VM in anticipation of them launching a VR application. This dramatically reduces the load time for when they launch an application, from a matter of minutes to a matter of seconds.

After the cloud-rendered session is complete, Portal’s cloud-rendering orchestration service automatically removes the machine using the respective cloud provider’s endpoints for scaling. This again ensures that only uptime is paid for, not dead time. Portal can adapt its rendering capabilities depending on the use case. For example, it can ensure that the GPU rendering pool is sufficiently large on days that are planned for VR training at a corporation while minimizing the pool size in times of low activity.

To optimize the cost-performance ratio, you should understand customer and regional differences. Keeping latency at a minimum during a session is one of the most important factors to guaranteeing a great experience. Generally, the closer a server is to the end users, the lower the latency and lag is between user input and the streamed application.

To ensure that users are allocated the nearest server, multiple algorithms are used, one example being IP lookups to narrow down user geolocations. In the future, relying on the 5G network and edge computing will further improve performance.

Diagram of Innoactive Portal’s cloud rendering architecture — *Figure 3. Portal cloud-rendering architecture*.

With the previously mentioned architecture and optimizations, Portal supports users with the intent to render a VR application remotely using NVIDIA CloudXR. Portal takes care of allocating rendering resources in the cloud, synchronizes the desired VR content to the remote machine and executes the NVIDIA CloudXR Server so that users can connect. All users do is select the desired VR content.

While providing value on its own, having a preconfigured, ready-to-roll remote rendering machine is just half of the equation. It still requires users to boot up their streaming client and configure it manually by entering the server IP address, which is not something novices are comfortable with.

Building an NVIDIA CloudXR client into Portal

By integrating the NVIDIA CloudXR SDK within the client application, Portal takes away these manual steps to provide a smooth user experience. To keep users updated throughout every step of the process, Portal uses WebSockets built on top of ASP.NET Core SignalR. The client application’s user interface is also built using Unity, to enable fast iteration while still delivering a great user experience.

This comes with the challenge of combining the NVIDIA CloudXR SDK, which is written in C++ using the Android NDK, and an interactive Unity scene into one Portal client app. The result is a single, modular Android application featuring multiple activities that can be easily loaded to a standalone VR headset like Meta Quest 2, Vive Focus 3, or Pico Neo 3.

The user interface guides users up to the point when a remote rendering server is available. It then automatically connects to it through the NVIDIA CloudXR client and ensures that the VR application is correctly started on the server, guaranteeing a frictionless entry into the streamed experience.

The business-logic layer, written in native Android (Java), contains logic to communicate back and forth with Innoactive’s backend services and handles the entire application state. It updates the user interface layer using a proprietary protocol shipped with Unity called UnitySendMessage and initializes the native NVIDIA CloudXR client library when required.

The native NVIDIA CloudXR activity is extended to use the Java Native Interface (JNI) to communicate back lifecycle events to the business layer. The example code later in this post show this. To isolate the native Unity and NVIDIA CloudXR activities from each other properly, both are designed to run in a dedicated Android process.

Diagram of the application layer architecture for the CloudXR client — *Figure 4. Application-layer architecture of the NVIDIA CloudXR client*.

The following code examples show the interfacing between Android code and NVIDIA CloudXR. First, the Android code example (Java):

import android.app.Activity;

public class MainActivity extends Activity implements ServerAvailableInterface {
    // load native NVIDIA CloudXR client library (*.aar)
    static {
        System.loadLibrary("CloudXRClient");
    }
    static native void initCloudXR(String jcmdline);

    @Override
    public void onServerReady(String serverIpAddress) {
      // initializes the native CloudXR client after server ip is known
      String arguments = String.format("-s %s", serverIpAddress);
      initCloudXR(arguments);
    }

    public void onConnected() {
      // called by native CloudXR activity when it is successfully connected
      System.out.println("Connected to CloudXR server")
    }

Second, theNVIDIA CloudXR code example (C++):

#include 
#include 
#include 

// ...

// called whenever the CloudXR client state changes
void Client::UpdateClientState(cxrClientState state, cxrStateReason reason) {
    switch (state) {
        // ...
        case cxrClientState_StreamingSessionInProgress:
            OnCloudXRConnected();
            break;
        // ...
    }
}

// Fired when the CloudXR client has established a connection to the server
void Client::OnCloudXRConnected() {
    jmethodID jOnConnected = env ->GetMethodID(mainActivity, "onConnected", "()V");
    env->CallVoidMethod(mainActivity, jOnConnected);
}

// ...

The final client application provides an intuitive interface with a minimal set of required user interactions to select the desired VR content. This interface is accessible from a second screen device independently of VR, often a laptop or phone.

With the Portal’s real-time communication, all connected clients–regardless of device–display the same status updates as a cloud-rendered session is loading. Users then receive a push notification after their content is ready in VR. When done with a session, users can directly proceed with the next streaming session without having to quit the client application.

Bundling Portal’s services together with the extended NVIDIA CloudXR client provides a globally scalable solution for users to access their VR content library from any device with just one click. Portal with NVIDIA CloudXR has changed how customers are able to deploy their PC-based VR. Where customers had a single platform per headset type, Portal now handles the complexity.

The easy-to-use, one-click frontend supports them in scaling VR-based use cases like training employees who are often new to VR. On a more technical level, customers can plan flexibly for headset deployment. They are no longer limited to PC-based VR due to the heavy calculation power that is needed to provide high-quality experiences.

Now, the power of cloud machines makes high-end rendering available on standalone devices through the Portal with NVIDIA CloudXR.

Additional resources

For more information about how Innoactive Portal has helped customers deploy VR training at scale, check out their case studies with SAP and Volkswagen Group.

For more information about NVIDIA CloudXR, see the developer page. Check out GTC 2022 this March 21-24 for the latest from NVIDIA on XR and the era of AI.

Misc

Tensorflow-gpu makes everything slow

Post author By
Post date March 14, 2022
No Comments on Tensorflow-gpu makes everything slow

Hello,

I hope it is okay to post this question here. I have installed tensorflow-gpu through the Anaconda Navigator because I have a RTX3090 that I would like to use. However, when using the environment that where I have tensorflow-gpu installed everything is super slow. Like just executing the model without training takes for ever. Even something as simple as

model = Sequential()

model.add(Dense(10, activation=”relu”)

model.add(Dense(1, activation=”sigmoid”)

model.compile(optimizer=”rmsprop”, loss=”binary_crossentropy”)

Does anyone have clue what might be the issue? Thank you in advance!

submitted by /u/0stkreutz
[visit reddit] [comments]

Misc

Jetson Project of the Month: Creating Intelligent Music with the Neurorack Deep AI-based Synthesizer

Post author By
Post date March 14, 2022
No Comments on Jetson Project of the Month: Creating Intelligent Music with the Neurorack Deep AI-based Synthesizer

An image of the Neurorack synthesizer with NVIDIA Jetson Nano. This Jetson Project of the Month enhances synthesizer-based music by applying deep generative models to a classic Eurorack machine.

Are you a fan of synthesizer-driven bands like Depeche Mode, Erasure, or Kraftwerk? Did you ever think of how cool it would be to create your own music with a synthesizer at home? And what if that process could be enhanced with the help of NVIDIA Jetson Nano?

The latest Jetson Project of the Month has found a way to do just that, bringing together a Eurorack synthesizer with a Jetson Nano to create the Neurorack. This musical audio synthesizer is the first to combine the power of deep generative models and the compactness of a Eurorack machine.

“The goal of this project is to design the next generation of musical instruments, providing a new tool for musicians while enhancing the musician’s creativity. It proposes a novel approach to think [about] and compose music,” noted the app developers, who are members of Artificial Creative Intelligence and Data Science (ACIDS) group, based at the IRCAM Laboratory in Paris, France. “We deeply think that AI can be used to achieve this quest.”

The real-time capabilities of the Neurorack rely on Jetson Nano’s processing power and Ninon Devis’ research into crafting trained models that are lightweight in both computation and memory footprint.

“Our original dream was to find a way to miniaturize deep models and allow them inside embedded audio hardware and synthesizers. As we are passionate about all forms of synthesizers, and especially Eurorack, we thought that it would make sense to go directly for this format as it was more fun! The Jetson Nano was our go-to choice right at the onset … It allowed us to rely on deep models without losing audio quality, while maintaining real-time constraints,” said Devis.

Watch a demo of the project in action here:

The developers had several key design considerations as they approached this project, including:

Musicality: the generative model chosen can produce sounds that are impossible to synthesize without using samples.
Controllability: the interface that they picked is handy and easy to manipulate.
Real-time: the hardware behaves like a traditional synthesizer and is equally reactive.
Ability to standalone: it can be played without a computer.

As the developers note in their NVIDIA Developer Forum post about this project: “The model is based on a modified Neural-Source Filter architecture, which allows real-time descriptor-based synthesis of percussive sounds.”

Neurorack uses PyTorch deep audio synthesis models (see Figure 1) to produce sounds that typically require samples, is easy to manipulate and doesn’t require a separate computer.

A diagram showing how the Neurorack is architected with its own hardware and NVIDIA Jetson Nano. — *Figure 1: The* *diagram* *shows the overall structure of the module and the relations between the hardware and software (green) components.*

The hardware features four control voltage (CV) inputs and two gates (along with a screen, rotary, and button for handling the menus), which all communicate with specific Python libraries. The behavior of these controls (and the module itself) is highly dependent on the type of deep model embedded. For this first version of the Neurorack, the developers implemented a descriptor-based impact sounds generator, described in their GitHub documentation.

The Eurorack hardware and software were developed with equal contributions from Ninon Devis, Philippe Esling, and Martin Vert on the ACIDS team. According to their website, ACIDS is “hell-bent on trying to model musical creativity by developing innovative artificial intelligence models.”

The project code and hardware design are free, open-source, and available in their GitHub repository.

The team hopes to make the project accessible to musicians and people interested in AI/embedded computing as well.

“We hope that this project will raise the interest of both communities! Right now reproducing the project is slightly technical, but we will be working on simplifying the deployment and hopefully finding other weird people like us,” Devis said. “We strongly believe that one of the key aspects in developing machine learning models for music will lead to the empowerment of creative expression, even for nonexperts.”

More detail on the science behind this project is available on their website and in their academic paper.

Two of the team members, Devis and Esling, have formed a band using the instruments they developed. They are currently working on a full-length live act, which will feature the Neurorack and plan to perform during the next SynthFest in France this April.

Sign up now for Jetson Developer Day taking place at NVIDIA GTC on March 21. This full-day event led by world-renowned experts in robotics, edge AI, and deep learning, will give you a unique deep dive into building next-generation AI-powered applications and autonomous machines.

Misc

How would you make an intentionally bad CNN?

Post author By
Post date March 14, 2022
No Comments on How would you make an intentionally bad CNN?

Hey folks,

I’m a data science lecturer and for one of my assignments this year, I want to challenge my students to fix and optimise a CNN coded in keras/TF. The gist is I need to code up a model that is BAD—something full of processing bottlenecks to slow it down, and hyperparameters that hamper the models ability to learn anything. The students will get the model, and will be tasked with “fixing” it—tidying up the input pipeline so that it runs efficiently and adjust the model parameters so that it actually fits properly.

I have a few ideas already, mostly setting up the input pipeline in a convoluted order, using suboptimal activations, etc. But I’m curious to hear other suggestions!

submitted by /u/Novasry
[visit reddit] [comments]

Misc

TF2 Source code for Custom training loop with "Custom layers", "XLA compiling", "Distributed learning", and "Gradient accumulator"

Hi, guys 🤗

I just want to share my Github repository for the Custom training loop with “Custom layers,” “XLA compiling,” “Distributed learning,” and “Gradient accumulator.”

As you know, TF2 operates better on a static graph, so TF2 with XLA compiling is easy and powerful. However, to my knowledge, there is no source code or tutorial for XLA compiling for distributed learning. Also, TF2 doesn’t natively provide a gradients accumulator, which is a well-known strategy for small hardware users.

My source code provides all of them and makes it possible to train ResNet-50 with 512 mini-batch sizes on two 1080ti. All parts are XLA compiled so that the training loop is sufficiently fast considering old-fashioned GPUs.

Actually, this repository is source code for a search-based filter pruning algorithm, so if you want to know about it, please look around Readme and the paper.