Categories
Misc

Convert to numpy

Hello members,

So my question is: I have a variable of type ops.Tensor, and I need to convert it to a NumPy array. I tried different solutions online, but in those cases the variable needs to be an ops.EagerTensor to be converted to a NumPy object (.numpy(), tf.make_ndarray, etc.). So how can I convert my tensor object to an eager tensor object, or directly to NumPy?
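
For illustration, here is a minimal sketch of where .numpy() does and does not apply, assuming TensorFlow 2.x with eager execution enabled (the default):

import tensorflow as tf

t = tf.constant([[1.0, 2.0], [3.0, 4.0]])
arr = t.numpy()              # EagerTensor -> NumPy array works directly

@tf.function
def double(x):
    # Inside a tf.function, x is a symbolic ops.Tensor and has no .numpy();
    # return it and convert the eager result outside the traced function.
    return x * 2.0

arr2 = double(t).numpy()     # the returned value is an EagerTensor again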

submitted by /u/Anonymous_Guy_12
[visit reddit] [comments]

Categories
Misc

Building a Four-Node Cluster with NVIDIA Jetson Xavier NX

Create a compact desktop cluster with four NVIDIA Jetson Xavier NX modules to accelerate training and inference of AI and deep learning workflows.

Following in the footsteps of large-scale supercomputers like the NVIDIA DGX SuperPOD, this post guides you through the process of creating a small-scale cluster that fits on your desk. Below is the recommended hardware and software to complete this project. This small-scale cluster can be utilized to accelerate training and inference of artificial intelligence (AI) and deep learning (DL) workflows, including the use of containerized environments from sources such as the NVIDIA NGC Catalog.

Hardware:

Picture of hardware components used in this post

While the Seeed Studio Jetson Mate, USB-C PD power supply, and USB-C cable are not required, they were used in this post and are highly recommended for a neat and compact desktop cluster solution.

Software:

For more information, see the NVIDIA Jetson Xavier NX development kit.

Installation

Write the JetPack image to a microSD card and perform initial JetPack configuration steps:

The first iteration through this post is targeted toward the Slurm control node (slurm-control). After you have the first node configured, you can either choose to repeat each step for each module, or you can clone this first microSD card for the other modules; more detail on this later.

For more information about the flashing and initial setup of JetPack, see Getting Started with Jetson Xavier NX Developer Kit.

While following the getting started guide above:

  • Skip the wireless network setup portion as a wired connection will be used.
  • When selecting a username and password, choose what you like and keep it consistent across all nodes.
  • Set the computer’s name to be the target node you’re currently working with, the first being slurm-control.
  • When prompted to select a value for Nvpmodel Mode, choose MODE_20W_6CORE for maximum performance.

After flashing and completing the getting started guide, run the following commands:

echo "`id -nu` ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/`id -nu`
sudo systemctl mask snapd.service apt-daily.service apt-daily-upgrade.service
sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer
sudo apt update
sudo apt upgrade -y
sudo apt autoremove -y

Disable NetworkManager, enable systemd-networkd, and configure network [DHCP]:

sudo systemctl disable NetworkManager.service NetworkManager-wait-online.service NetworkManager-dispatcher.service network-manager.service
sudo systemctl mask avahi-daemon
sudo systemctl enable systemd-networkd
sudo ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
# systemd-networkd config for eth0 (any file name ending in .network under /etc/systemd/network/ works)
cat <<EOF | sudo tee /etc/systemd/network/eth0.network > /dev/null

[Match]
Name=eth0

[Network]
DHCP=ipv4
MulticastDNS=yes

[DHCP]
UseHostname=false
UseDomains=false
EOF

sudo sed -i "/#MulticastDNS=/cMulticastDNS=yes" /etc/systemd/resolved.conf
sudo sed -i "/#Domains=/cDomains=local" /etc/systemd/resolved.conf

Configure the node hostname:

If you have already set the hostname in the initial JetPack setup, this step can be skipped.

[slurm-control]

sudo hostnamectl set-hostname slurm-control
sudo sed -i "s/127.0.1.1.*/127.0.1.1\t`hostname`/" /etc/hosts

[compute-node]

Compute nodes should follow a particular naming convention to be easily addressable by Slurm. Use a consistent identifier followed by a sequentially incrementing number (for example, node1, node2, and so on). In this post, I suggest using nx1, nx2, and nx3 for the compute nodes. However, you can choose anything that follows a similar convention.

sudo hostnamectl set-hostname nx[1-3]
sudo sed -i "s/127.0.1.1.*/127.0.1.1\t`hostname`/" /etc/hosts

Create users and groups for Munge and Slurm:

sudo groupadd -g 1001 munge
sudo useradd -m -c "MUNGE" -d /var/lib/munge -u 1001 -g munge -s /sbin/nologin munge
sudo groupadd -g 1002 slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u 1002 -g slurm -s /bin/bash slurm

Install Munge:

sudo apt install libssl-dev -y
git clone https://github.com/dun/munge
cd munge 
./bootstrap
./configure
sudo make install -j6
sudo ldconfig
sudo mkdir -m0755 -pv /usr/local/var/run/munge
sudo chown -R munge: /usr/local/etc/munge /usr/local/var/run/munge /usr/local/var/log/munge

Create or copy the Munge encryption keys:

[slurm-control]

sudo -u munge mungekey --verbose

[compute-node]

sudo sftp -s 'sudo /usr/lib/openssh/sftp-server' `id -nu`@slurm-control
# At the sftp> prompt, copy the key generated on slurm-control (the path below
# assumes the /usr/local prefix used in the Munge install step), then exit:
#   get /usr/local/etc/munge/munge.key /usr/local/etc/munge/munge.key
#   exit

Start Munge and test the local installation:

sudo systemctl enable munge
sudo systemctl start munge
munge -n | unmunge

Expected result: STATUS: Success (0)

Verify that the Munge encryption keys match from a compute node to slurm-control:

[compute-node]

munge -n | ssh slurm-control unmunge

Expected result: STATUS: Success (0)

Install Slurm (20.11.9):

cd ~
wget https://download.schedmd.com/slurm/slurm-20.11-latest.tar.bz2
tar -xjvf slurm-20.11-latest.tar.bz2
cd slurm-20.11.9
./configure --prefix=/usr/local
sudo make install -j6

Index the Slurm shared objects and copy the systemd service files:

sudo ldconfig -n /usr/local/lib/slurm
sudo cp etc/*.service /lib/systemd/system

Create directories for Slurm and apply permissions:

sudo mkdir -pv /usr/local/var/{log,run,spool} /usr/local/var/spool/{slurmctld,slurmd}
sudo chown slurm:root /usr/local/var/spool/slurm*
sudo chmod 0744 /usr/local/var/spool/slurm*

Create a Slurm configuration file for all nodes:

For this step, you can follow the included commands and use the following configuration file for the cluster (recommended). To customize variables related to Slurm, use the configuration tool.

cat <<EOF | sudo tee /usr/local/etc/slurm.conf > /dev/null
#slurm.conf for all nodes#
ClusterName=SlurmNX
SlurmctldHost=slurm-control
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/usr/local/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/usr/local/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/usr/local/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/usr/local/var/spool/slurmctld
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
JobCompType=jobcomp/none
SlurmctldDebug=info
SlurmctldLogFile=/usr/local/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/usr/local/var/log/slurmd.log

NodeName=nx[1-3] RealMemory=7000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP

EOF
sudo chmod 0744 /usr/local/etc/slurm.conf
sudo chown slurm: /usr/local/etc/slurm.conf

Install Enroot 3.3.1:

cd ~
sudo apt install curl jq parallel zstd -y
arch=$(dpkg --print-architecture)
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.3.1/enroot_3.3.1-1_${arch}.deb
sudo dpkg -i enroot_3.3.1-1_${arch}.deb

Install Pyxis (0.13):

git clone https://github.com/NVIDIA/pyxis
cd pyxis
sudo make install -j6

Create the Pyxis plug-in directory and config file:

sudo mkdir /usr/local/etc/plugstack.conf.d
echo "include /usr/local/etc/plugstack.conf.d/*" | sudo tee /usr/local/etc/plugstack.conf > /dev/null

Link the Pyxis default config file to the plug-in directory:

sudo ln -s /usr/local/share/pyxis/pyxis.conf /usr/local/etc/plugstack.conf.d/pyxis.conf

Verify Enroot/Pyxis installation success:

srun --help | grep container-image

Expected result: --container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH

Finalization

When replicating the configuration across the remaining nodes, label the Jetson Xavier NX modules and/or the microSD cards with the assigned node name. This helps prevent confusion later on when moving modules or cards around.

There are two different methods in which you can replicate your installation to the remaining modules: manual configuration or cloning slurm-control. Read over both methods and choose which method you prefer.

Manually configure the remaining nodes

Follow the “Enable and start the Slurm service daemon” section below for your current module, then repeat the entire process for the remaining modules, skipping any steps tagged under [slurm-control]. When all modules are fully configured, install them into the Jetson Mate in their respective slots, as outlined in the “Install all Jetson Xavier NX modules into the enclosure” section.

Clone slurm-control installation for remaining nodes

To avoid repeating all installation steps for each node, clone the slurm-control node’s card as a base image and flash it onto all remaining cards. This requires a microSD-to-SD card adapter if you have only one multi-port card reader and want to do card-to-card cloning. Alternatively, creating an image file from the source slurm-control card onto the local machine and then flashing target cards is also an option.

  1. Shut down the Jetson that you’ve been working with, remove the microSD card from the module, and insert it into the card reader.
  2. If you’re performing a physical card to card clone (using Balena Etcher, dd, or any other utility that will do sector by sector writes), insert the blank target microSD into the SD card adapter, then insert it into the card reader.
  3. Identify which card is which for the source (microSD) and destination (SD card) in the application that you’re using and start the cloning process.
  4. If you are creating an image file, using a utility of your choice, create an image file from the slurm-control microSD card on the local machine, then remove that card and flash the remaining blank cards using that image.
  5. After cloning is completed, insert a cloned card into a Jetson module and power on. Configure the node hostname for a compute node, then proceed to enable and start the Slurm service daemon. Repeat this process for all remaining card/module pairs.

Enable and start the Slurm service daemon:

[slurm-control]

sudo systemctl enable slurmctld
sudo systemctl start slurmctld

[compute-node]

sudo systemctl enable slurmd
sudo systemctl start slurmd

Install all Jetson Xavier NX modules into the enclosure

First power down any running modules, then remove them from their carriers. Install all Jetson modules into the Seeed Studio Jetson Mate, ensuring that the control node is placed in the primary slot labeled “MASTER” and that compute nodes 1-3 are placed in the secondary slots labeled “WORKER 1, 2, and 3,” respectively. Optional fan extension cables are available from the Jetson Mate kit for each module.

The video output on the enclosure is connected to the primary module slot, as is the vertical USB2 port, and USB3 port 1. All other USB ports are wired to the other modules according to their respective port numbers.

Photo of fully assembled cluster on a table.
Figure 1. Fully assembled cluster inside of the SeeedStudio Jetson Mate

Troubleshooting

This section contains some helpful commands to assist in troubleshooting common networking and Slurm-related issues.

Test network configuration and connectivity

The following command should show eth0 in the routable state, with IP address information obtained from the DHCP server:

networkctl status

The command should respond with the local node’s hostname and .local as the domain (for example, slurm-control.local), along with DHCP assigned IP addresses:

host `hostname`

Choose a compute node hostname that is configured and online. It should respond similarly to the previous command. For example: host nx1 – nx1.local has address 192.168.0.1. This should also work for any other host that has an mDNS resolver daemon running on your LAN.

host [compute-node-hostname]

All cluster nodes should be pingable by all other nodes, and all local LAN IP addresses should be pingable as well, such as your router.

ping [compute-node-hostname/local-network-host/ip]

Test the external DNS name resolution and confirm that routing to the internet is functional:

ping www.nvidia.com

Check Slurm cluster status and node communication

The following command shows the current status of the cluster, including node states:

sinfo -lNe

If any nodes in the sinfo output show UNKNOWN or DOWN for their state, the following command signals the specified nodes to change their state and become available for job scheduling ([ ] specifies a range of numbers following the hostname prefix ‘nx’):

scontrol update NodeName=nx[1-3] State=RESUME

The following command runs hostname on all available compute nodes. Nodes should respond back with their corresponding hostname in your console.

srun -N3 hostname

Summary

You’ve now successfully built a multi-node Slurm cluster that fits on your desk. There is a vast array of benchmarks, projects, workloads, and containers that you can now run on your mini-cluster. Feel free to share your feedback on this post and, of course, anything that your new cluster is being used for.

Power on and enjoy Slurm!

For more information, see the following resources:

Acknowledgments

Special thanks to Robert Sohigian, a technical marketing engineer on our team, for all the guidance in creating this post, providing feedback on the clarity of instructions, and for being the lab rat in multiple runs of building this cluster. Your feedback was invaluable and made this post what it is!

Categories
Misc

Driving Data Center Innovation Through Ecosystem Partners

Leading security, storage, and networking vendors are joining the DOCA and DPU community.

The DPU, or data processing unit, is a new class of programmable processors that specializes in moving data around the data center and now joins CPUs and GPUs as the third pillar of modern computing. NVIDIA DOCA is core to the NVIDIA BlueField DPU offering because it provides ecosystem partners with an open platform to deliver the advanced networking, storage, and security services needed today.

DOCA unlocks data center innovation by enabling an open ecosystem and developer community to rapidly create applications and services on top of BlueField DPUs, using industry-standard open APIs and frameworks.

Integral to our customers’ success, and our own, is the collaboration with our ecosystem partners. For more than 15 years, our ecosystem partners have harnessed the power of CUDA to develop the world’s most effective accelerated applications for a multitude of use cases.

The NVIDIA CUDA Toolkit provides everything that is needed to develop GPU-accelerated applications. Similarly, the NVIDIA DOCA Software Framework is an open SDK that enables you to rapidly create applications and services on top of BlueField DPUs.

Where partners have achieved such success with NVIDIA GPUs and CUDA, we are emulating that formula with our DPU portfolio and DOCA. Moreover, we recognize that to deliver best-in-class solutions for customers, we need to partner with the world’s leading technology vendors. Proprietary applications have their place, but who better to provide world-class security, storage, and networking solutions, than the world’s leading vendors in those fields?

A meeting of the minds

During the last two years, our ecosystem partners have been delivering innovative solutions and services essential for digital transformation. The most turbulent period in recent history has forced us all to find new ways to collaborate and embrace technology at a rate never expected. Not only have we had to adapt as individuals, but organizations across the globe have been forced to re-think their day-to-day activities.

We work closely with our partners to define and create more DOCA libraries and services to address innovative use cases.  More than ever, we’re witnessing a realignment between technology requirements in the data center and ever-changing business priorities. In turn, matching customers to ecosystem partners provides an opportunity to create customized technology solutions tuned to meet specific business objectives.

Today, NVIDIA is working with leading platform vendors and partners to integrate and expand DOCA support for commercial distributions on BlueField DPUs. Dozens of industry leaders, including VMware, Red Hat, DDN, Aria Cybersecurity, and Juniper Networks, have started to integrate their solutions using the DPU/DOCA architecture. You’ll start to see more new applications in the coming year.

Earlier this year, Palo Alto Networks, a global cybersecurity leader, developed the first next-generation firewall (NGFW) specifically designed to be accelerated by the BlueField DPU. This first-to-market, hardware-accelerated software NGFW is a prime example of how the BlueField DPU boosts performance and optimizes data center security coverage and efficiency.

Third-party developers can create and distribute DPU-accelerated applications with the DOCA SDK, which is fully integrated into the NGC catalog of containerized software. Such accelerated solutions will be wide-ranging, including advanced applications for infrastructure, storage, and security. It will be the key to unlocking data center innovation.

Try DOCA today

NVIDIA DOCA is the key to unlocking the potential of the NVIDIA BlueField DPU to offload, accelerate, and isolate data center workloads. With DOCA, you can program the data center infrastructure of tomorrow by creating software-defined, cloud-native, DPU-accelerated services with zero-trust protection to address the increasing performance and security demands of modern data centers.

To start developing on DOCA:

Categories
Misc

Upcoming Webinar: Using GPUs to Accelerate HD Mapping and Location-Based Services

Join us on July 20 for a webinar highlighting how using NVIDIA A100 GPUs can help map and location-based service providers speed up map creation and workflows, while reducing costs.

Categories
Misc

Advanced API Performance: Vulkan Clearing and Presenting

This post covers best practices for Vulkan clearing and presentation on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

With the recent Vulkan 1.3 release, it’s timely to add some Vulkan-specific tips that are not necessarily explicitly covered by the other Advanced API Performance posts. In addition to introducing new Vulkan 1.3 core features, this post shares a set of good practices for clearing and presenting surfaces.

Vulkan 1.3 Core

Vulkan 1.3 brings improvements through extensions to key parts in the API. This section summarizes our recommendations for obtaining the best performance when working with a number of these new features.

Recommended

Clears

This section provides a guideline for achieving performance when invoking clear commands. This type of command clears a region within a color image or within the bound framebuffer attachments. 

  • Use VK_ATTACHMENT_LOAD_OP_CLEAR to clear attachments at the beginning of a subpass instead of clear commands. This can allow the driver to skip loading unnecessary data.
  • Outside of a render pass instance, prefer the usage of vkCmdClearColorImage instead of a CS invocation to clear images. This path enables bandwidth optimizations.
  • If possible, batch clears to avoid interleaving single clears between dispatches.
  • Coordinate VkClearDepthStencilValue with the test function to achieve better depth testing performance:
    • Clear depth values of 0.5 or greater pair with VK_COMPARE_OP_LESS_OR_EQUAL
    • Clear depth values of 0.5 or less pair with VK_COMPARE_OP_GREATER_OR_EQUAL

Not recommended

  • Specifying more than 30 unique clear values per application (or more than 15 on Turing) does not make the most of clear bandwidth optimizations.
  • “Clear shaders” should be avoided unless there is overlap of a compute clear with a neighboring dispatch.

Present

The following section offers insight into the preferred way of using the presentation modes supported by a surface in order to achieve good performance.

Recommended

  • Rely on VK_PRESENT_MODE_FIFO_KHR or VK_PRESENT_MODE_MAILBOX_KHR (for VSync on). Noteworthy aspects:
    • VK_PRESENT_MODE_FIFO_KHR is preferred as it does not drop frames and lacks tearing.
    • VK_PRESENT_MODE_MAILBOX_KHR may offer lower latency, but frames might be dropped.
    • VK_PRESENT_MODE_FIFO_RELAXED_KHR is compelling when your application only occasionally lags behind the refresh rate, allowing tearing so that it can “catch back up”.
  • Rely on VK_PRESENT_MODE_IMMEDIATE_KHR for VSync off.
  • On Windows systems, use the VK_EXT_full_screen_exclusive extension to bypass compositing.
  • Handle both out-of-date and suboptimal swapchains to re-create stale swapchains when windows resize, for example.
  • For latency-sensitive applications, use the Vulkan Reflex SDK to minimize latency by completing game engine work just-in-time for rendering.

More information

For more information about using Vulkan with NVIDIA GPUs, see Vulkan Do’s and Don’ts.

To view the Vulkan API state, use the API Inspector in Nsight Graphics (free download).

With Nsight Systems, you can view Vulkan usage on a unified CPU-GPU timeline, investigate stutter, and track GPU cold spots to their CPU origins. Download Nsight Systems for free.

Acknowledgments

Thanks to  Piers Daniell, Ivan Fedorov, Adam Moss, Ryan Prescott, Joshua Schnarr, Juha Sjöholm, and Márton Tamás for their feedback and contributions.

Categories
Misc

Building Generally Capable AI Agents with MineDojo

NVIDIA is helping push the limits of training generalist AI agents with a new open-source framework called MineDojo.

Using video games as a medium for training AI has become a popular method within the AI research community. These autonomous agents have had great success in Atari games, StarCraft, Dota, and Go. But while these advancements have been popular for AI research, the agents do not generalize beyond a very specific set of tasks, unlike humans, who continuously learn from open-ended tasks.

Building an embodied agent that can attain high-level performance across a wide spectrum of tasks has been one of the greatest challenges facing the AI research community. In order to build a successful generalist agent, users need an environment that supports a multitude of tasks and goals, a large-scale database of multimodal knowledge, and a flexible and scalable agent architecture.

Enter Minecraft, the most played game in the world. With its flexible gameplay, players can perform a wide variety of actions, from building a medieval castle to exploring dangerous environments to gathering resources to build a Nether Portal and battle the Ender Dragon. This creative atmosphere is the perfect environment in which to train an embodied agent.

A table of images showing the NVIDIA AI agent completing different tasks.
Figure 1. The NVIDIA AI agent follows the prompts within the MineDojo framework

To take advantage of such an optimal training ground, NVIDIA researchers created MineDojo, a massive framework that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base. Building an AI powerful enough to complete these tasks would not be possible without an expansive data library.

The mission of MineDojo is to promote research toward the goal of generally capable embodied agents. For an embodied agent to be successful, the environment needs to provide an almost infinite number of open-ended tasks and actions. This is done by giving the agent access to a large database of information from which to pull knowledge and then apply what it learns. The agent’s training also needs to be scalable, so that large-scale knowledge can later be converted into actionable insights.

A few screenshots and word maps detailing the vast annotated database from YouTube, the Minecraft Wiki and Reddit for an AI agent to train on.
Figure 2. The MineDojo framework takes advantage of an Internet-scale database to train an AI agent

In MineDojo, the embodied agent has access to three internet-scale datasets. With 750,000 Minecraft YouTube videos—amounting to over 33 years of Minecraft videos—pulled into the database, over 2 million words were transcribed. 

MineDojo also scraped over 6,000 web pages from the Minecraft Wiki, with over 2.2 million bounding boxes created for the visual elements of those pages. In addition, millions of Reddit threads related to Minecraft and the variety of activities one can do within the game were captured. These threads include questions about how to solve certain tasks, showcases of achievements and creations in image and video formats, and general tips and tricks.

Screenshots of webpages being annotated for the AI agent's training from the Minecraft Wiki.
Screenshots of Reddit questions being scraped for the AI to train on.
Figure 3. Examples of content annotated and scraped from the internet for the MineDojo framework

MineDojo offers a set of simulator APIs for training AI agents. It provides unified observation and action spaces to help the agent adapt to new scenarios and multitask. Additionally, using the APIs, users can take advantage of all three worlds within the Minecraft universe to expand the number of tasks and actions the agent can perform.
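
As a rough illustration of what driving the simulator looks like, here is a minimal loop in the style of MineDojo’s quick-start; the specific task ID, action indices, and episode length are illustrative assumptions rather than a definitive reference.

import minedojo

# Hypothetical task ID; MineDojo exposes thousands of programmatic and creative tasks.
env = minedojo.make(task_id="harvest_milk", image_size=(160, 256))

obs = env.reset()
for step in range(100):
    action = env.action_space.no_op()  # start from a "do nothing" action
    action[0] = 1                      # illustrative: set the forward/backward component
    obs, reward, done, info = env.step(action)
    if done:
        break
env.close()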

Within the simulator, MineDojo splits the benchmarking tasks into two categories: programmatic tasks and creative tasks.

Programmatic tasks are well defined and can be easily evaluated, such as “surviving 3 days” or “obtain one unit of pumpkin in the forest.” 

Creative tasks are much more open-ended, such as “build a beautiful beach house.” It is very difficult to define what qualifies as a beach house with an explicit set of rules. These tasks are meant to encourage the research community to develop more human-like and imaginative AI agents.

Video clips of the variety of tasks that are benchmarked through MineDojo.
Figure 4. MineDojo currently provides benchmarks for thousands of creative and programmatic tasks

Natural language is a cornerstone of the MineDojo framework. It aids open-vocabulary understanding, provides grounding for image and video modalities, and serves as an intuitive interface to specify instructions. Combined with the latest speech recognition technology, it is possible in the near future to talk to an AI Agent as you would to a friend in multiplayer co-op mode.

For example: “plant a row of blue flowers in front of our house. Add some gold decorations to the door frame. Let’s go explore the cave next to the river,” could all be possible.

Proof of concept using MineCLIP

To help promote the project and provide a proof of concept, the MineDojo researchers implemented MineCLIP, a single language-prompted agent that completes several complex tasks within Minecraft. This novel agent learning algorithm takes advantage of the 33 years’ worth of Minecraft YouTube videos. However, it is worth pointing out that any agent can use any or all three sections of the internet-scale database at the user’s discretion.

A flowchart of the MineCLIP agent showing the reward signal to train the embodied agent.
Figure 5. MineCLIP learns to associate video and text from the large amount of YouTube videos. The association score provides a reward signal to guide the agent to learn multiple tasks in parallel

As an embodied agent, MineCLIP learns the concepts and actions of Minecraft from YouTube videos without human hand-labeling. YouTubers typically narrate what they are doing as they stream gameplay video. MineCLIP is a large Transformer model that learns to associate a video clip with its corresponding English transcript.

This association score can be provided as a reward signal to guide a reinforcement learning agent towards completing the task. For the example task, “shear a sheep to obtain wool,” MineCLIP gives a high reward to the agent if it approaches the sheep, but a low reward if the agent wanders aimlessly. It is even capable of multitasking within the game to complete a wide range of simple tasks.
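
Conceptually, that association score can be dropped into an ordinary reinforcement learning loop as a dense reward. The sketch below is a hypothetical illustration: the embedding inputs and function names are placeholders, not MineCLIP’s actual API.

import numpy as np

def association_reward(video_embedding: np.ndarray, prompt_embedding: np.ndarray) -> float:
    # Cosine similarity between the agent's recent frames and the task prompt,
    # standing in for MineCLIP's learned video-text association score.
    v = video_embedding / np.linalg.norm(video_embedding)
    p = prompt_embedding / np.linalg.norm(prompt_embedding)
    return float(np.dot(v, p))

# Inside an RL training loop, the score supplements the sparse task reward:
#   reward = env_reward + association_reward(embed(recent_frames), embed("shear a sheep"))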

Building generally capable embodied agents is a holy-grail goal of AI research. MineDojo provides a benchmark of thousands of tasks, an internet-scale knowledge base, and an innovative algorithm as a first step toward solving this grand challenge.

Stay posted to see what new models and techniques the research community comes up with next! Start using MineDojo today.

Read more about the framework and its findings. Explore other NVIDIA research.

Categories
Misc

Resources for hands on TF

Hi all,
Hopefully this is not against any group rules, but I’m a DS master’s degree student coming from a CS bachelor’s, and I really love deep learning and all the magic we can do solving optimization problems, even without NNs involved.
I have a good theoretical preparation thanks to the university, and I’ve manually coded many optimization problems, calculating gradients by hand. However, I love the autodiff that TF and PyTorch give out of the box, and I’m really looking forward to learning TF from the ground up. I really struggle to find material that does not just lead to stacking layers on a sequential Keras model…
My aim is to be able to take an idea for (say) a layer and code it using tensors and autodiff from TF, rather than looking for online code that already solves it (or even optimizers, since I’m pretty familiar with many that aren’t already implemented in TF).
Do you have any online resource or book that you feel is a good starting point? I usually learn hands-on and by reading docs, but I feel TF is better learned the way it’s supposed to be, to fully grasp everything it can offer.

In other words, I have good theoretical preparation in ML/DL but I feel I’m lacking on the practical side… so how/where can I learn to use GradientTape and those magic things (everything is accepted: online or offline, paper or digital, paid or not)?
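
For context, the kind of low-level autodiff usage being asked about looks roughly like this minimal GradientTape sketch (a toy quadratic objective, not tied to any particular resource):

import tensorflow as tf

w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2        # simple objective: minimized at w = 1

grad = tape.gradient(loss, w)    # autodiff: dloss/dw = 2 * (w - 1)
w.assign_sub(0.1 * grad)         # one hand-rolled gradient-descent step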

submitted by /u/bertosini
[visit reddit] [comments]

Categories
Misc

Error Trying to Run Python Script With Custom Model : ValueError – Shapes are Incompatible

I am having an issue trying to make use of the tutorial below for object detection using TensorFlow. I have tried to stick closely to the tutorial, with changes only for my alternative labeled image set.

Tutorial : https://gilberttanner.com/blog/tensorflow-object-detection-with-tensorflow-2-creating-a-custom-model/

Webcam Python Script : https://github.com/TannerGilbert/Tensorflow-Object-Detection-with-Tensorflow-2.0/blob/master/detect_from_webcam.py#

When I attempt to run a python script to use the model with my webcam, I get the following error…

Traceback (most recent call last):
  File "detect_from_webcam.py", line 89, in <module>
    detection_model = load_model(args.model)
  File "detect_from_webcam.py", line 21, in load_model
    model = tf.saved_model.load(model_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py", line 864, in load
    result = load_internal(export_dir, tags, options)["root"]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py", line 903, in load_internal
    ckpt_options, options, filters)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py", line 165, in __init__
    self._restore_checkpoint()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py", line 476, in _restore_checkpoint
    load_status = saver.restore(variables_path, self._checkpoint_options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 1383, in restore
    checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/base.py", line 254, in restore
    restore_ops = trackable._restore_from_checkpoint_position(self)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/base.py", line 981, in _restore_from_checkpoint_position
    tensor_saveables, python_saveables))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 352, in restore_saveables
    validated_saveables).restore(self.save_path_tensor, self.options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saving/functional_saver.py", line 339, in restore
    restore_ops = restore_fn()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saving/functional_saver.py", line 323, in restore_fn
    restore_ops.update(saver.restore(file_prefix, options))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saving/functional_saver.py", line 116, in restore
    restored_tensors, restored_shapes=None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saving/saveable_object_util.py", line 132, in restore
    self.handle_op, self._var_shape, restored_tensor)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 309, in shape_safe_assign_variable_handle
    shape.assert_is_compatible_with(value_tensor.shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_shape.py", line 1161, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (1, 1, 64, 108) and (1, 1, 64, 36) are incompatible

The primary issue seems to be the ValueError: Shapes (1, 1, 64, 108) and (1, 1, 64, 36) are incompatible reference. Unfortunately, I am diving into TensorFlow object detection headfirst without much background context. I am struggling to understand what is actually occurring with this error, what it means, and where to even start looking for a resolution.

Any guidance at all would be exceedingly helpful.

Update:

I made a slight adjustment to my exporter_main_v2.py arguments, which now results in the slightly different error: ValueError: Shapes (1, 1, 64, 54) and (1, 1, 64, 36) are incompatible.

python object_detection/exporter_main_v2.py \
    --input_type image_tensor \
    --trained_checkpoint_dir object_detection/checkpoint \
    --output_directory object_detection/inference \
    --pipeline_config_path object_detection/training/ssd_efficientdet_d0_512x512_coco17_tpu-8.config
# --input_type image_tensor is the argument I added

submitted by /u/Sir_Sparky
[visit reddit] [comments]

Categories
Misc

Minerva CQ Deploys NVIDIA Riva Enterprise in the Energy Sector

Learn how NVIDIA Inception member Minerva CQ is using NVIDIA Riva to deliver faster, personalized experiences within a global EV charging and electric mobility company.

Categories
Offsites

Identifying Disfluencies in Natural Speech

People don’t write in the same way that they speak. Written language is controlled and deliberate, whereas transcripts of spontaneous speech (like interviews) are hard to read because speech is disorganized and less fluent. One aspect that makes speech transcripts particularly difficult to read is disfluency, which includes self-corrections, repetitions, and filled pauses (e.g., words like “umm”, and “you know”). Following is an example of a spoken sentence with disfluencies from the LDC CALLHOME corpus:

But that’s it’s not, it’s not, it’s, uh, it’s a word play on what you just said.

It takes some time to understand this sentence — the listener must filter out the extraneous words and resolve all of the nots. Removing the disfluencies makes the sentence much easier to read and understand:

But it’s a word play on what you just said.

While people generally don’t even notice disfluencies in day-to-day conversation, early foundational work in computational linguistics demonstrated how common they are. In 1994, using the Switchboard corpus, Elizabeth Shriberg demonstrated that there is a 50% probability for a sentence of 10–13 words to include a disfluency and that the probability increases with sentence length.

The proportion of sentences from the Switchboard dataset with at least one disfluency plotted against sentence length measured in non-disfluent (i.e., efficient) tokens in the sentence. The longer a sentence gets, the more likely it is to contain a disfluency.

In “Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection”, we present research findings on how to “clean up” transcripts of spoken text. We create more readable transcripts and captions of human speech by finding and removing disfluencies in people’s speech. Using labeled data, we created machine learning (ML) algorithms that identify disfluencies in human speech. Once those are identified we can remove the extra words to make transcripts more readable. This also improves the performance of natural language processing (NLP) algorithms that work on transcripts of human speech. Our work puts special priority on ensuring that these models are able to run on mobile devices so that we can protect user privacy and preserve performance in scenarios with low connectivity.

Base Model Overview
At the core of our base model is a pre-trained BERTBASE encoder with 108.9 million parameters. We use the standard per-token classifier configuration, with a binary classification head being fed by the sequence encodings for each token.

Illustration of how tokens in text become numerical embeddings, which then lead to output labels.
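
A minimal Keras sketch of this configuration might look like the following, where bert_encoder is a placeholder for any pretrained BERT-Base encoder that returns one hidden vector per input token (the exact inputs depend on the BERT implementation used):

import tensorflow as tf

def build_disfluency_tagger(bert_encoder, seq_len=128):
    token_ids = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="token_ids")
    input_mask = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="input_mask")
    # [batch, seq_len, hidden] sequence encodings, one vector per token
    sequence_output = bert_encoder([token_ids, input_mask])
    # Binary per-token head: 1 = disfluent (remove), 0 = fluent (keep)
    logits = tf.keras.layers.Dense(1, name="disfluency_head")(sequence_output)
    return tf.keras.Model(inputs=[token_ids, input_mask], outputs=logits)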

<!–

Illustration of how tokens in text become numerical embeddings, which then lead to output labels.

–>

We refined the BERT encoder by continuing the pretraining on comments from the Pushshift Reddit dataset from 2019. Reddit comments are not speech data, but are more informal and conversational than wiki and book data. This trains the encoder to better understand informal language, but may run the risk of internalizing some of the biases inherent in the data. For our particular use case, however, the model only captures the syntax or overall form of the text, not its content, which avoids potential issues related to semantic-level biases in the data.

We fine-tune our model for disfluency classification on hand-labeled corpora, such as the Switchboard corpus mentioned above. Hyperparameters (batch size, learning rate, number of training epochs, etc.) were optimized using Vizier.

We also produce a range of “small” models for use on mobile devices using a knowledge distillation technique known as “self training”. Our best small model is based on the Small-vocab BERT variant with 3.1 million parameters. This smaller model achieves comparable results to our baseline at 1% the size (in MiB). You can read more about how we achieved this model miniaturization in our 2021 Interspeech paper.

Streaming
Some of the latest use cases for automatic speech transcription include automated live captioning, such as produced by the Android “Live Captions” feature, which automatically transcribes spoken language in audio being played on the device. For disfluency removal to be of use in improving the readability of the captions in this setting, it must happen quickly and in a stable manner. That is, the model should not change its past predictions as it sees new words in the transcript.

We call this live token-by-token processing streaming. Accurate streaming is difficult because of temporal dependencies; most disfluencies are only recognizable later. For example, a repetition does not actually become a repetition until the second time the word or phrase is said.

To investigate whether our disfluency detection model is effective in streaming applications, we split the utterances in our training set into prefix segments, where only the first N tokens of the utterance were provided at training time, for all values of N up to the full length of the utterance. We evaluated the model simulating a stream of spoken text by feeding prefixes to the models and measuring the performance with several metrics that capture model accuracy, stability, and latency including streaming F1, time to detection (TTD), edit overhead (EO), and average wait time (AWT). We experimented with look-ahead windows of either one or two tokens, allowing the model to “peek” ahead at additional tokens for which the model is not required to produce a prediction. In essence, we’re asking the model to “wait” for one or two more tokens of evidence before making a decision.
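
As a small illustration of the prefix-segment idea described above (the function and its look-ahead handling are our own simplification, not the paper’s code):

def prefix_segments(tokens, look_ahead=1):
    # For each prefix length n, the model must commit to predictions for
    # tokens[:n] while being allowed to peek at up to `look_ahead` extra tokens.
    segments = []
    for n in range(1, len(tokens) + 1):
        visible = tokens[:min(n + look_ahead, len(tokens))]
        segments.append((visible, n))
    return segments

print(prefix_segments(["but", "it's", "it's", "a", "word", "play"], look_ahead=1))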

While adding this fixed look-ahead did improve the stability and streaming F1 scores in many contexts, we found that in some cases the label was already clear even without looking ahead to the next token and the model did not necessarily benefit from waiting. Other times, waiting for just one extra token was sufficient. We hypothesized that the model itself could learn when it should wait for more context. Our solution was a modified model architecture that includes a “wait” classification head that decides when the model has seen enough evidence to trust the disfluency classification head.

Diagram showing how the model labels input tokens as they arrive. The BERT embedding layers feed into two separate classification heads, which are combined for the output.

<!–

Diagram showing how the model labels input tokens as they arrive. The BERT embedding layers feed into two separate classification heads, which are combined for the output.

–>

We constructed a training loss function that is a weighted sum of three factors:

  1. The traditional cross-entropy loss for the disfluency classification head
  2. A cross-entropy term that only considers up to the first token with a “wait” classification
  3. A latency penalty that discourages the model from waiting too long to make a prediction
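
A rough sketch of how these three terms might be combined in TensorFlow follows; the weights, the wait-mask construction, and the form of the latency penalty are illustrative assumptions, not the paper’s exact formulation.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def streaming_loss(disfl_labels, disfl_logits, wait_labels, wait_logits,
                   alpha=1.0, beta=0.5, gamma=0.1):
    # 1) standard cross-entropy for the disfluency head
    loss_disfl = bce(disfl_labels, disfl_logits)

    # 2) cross-entropy counted only up to the first token labeled "wait",
    #    approximated here with a per-token mask
    first_wait = tf.argmax(wait_labels, axis=1, output_type=tf.int32)
    mask = tf.sequence_mask(first_wait + 1,
                            maxlen=tf.shape(wait_labels)[1],
                            dtype=tf.float32)
    loss_wait = bce(wait_labels, wait_logits, sample_weight=mask)

    # 3) latency penalty: discourage the model from waiting too long
    loss_latency = tf.reduce_mean(tf.sigmoid(wait_logits))

    return alpha * loss_disfl + beta * loss_wait + gamma * loss_latency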

We evaluated this streaming model as well as the standard baseline with no look-ahead and with both 1- and 2-token look-ahead values:

Graph of the streaming F1 score versus the average wait time in tokens. Three data points indicate F1 scores above 0.82 across multiple wait times. The proposed streaming model achieves near top performance with much shorter wait times than the fixed look ahead models.

The streaming model achieved a better streaming F1 score than both a standard baseline with no look ahead and a model with a look ahead of 1. It performed nearly as well as the variant with fixed look ahead of 2, but with much less waiting. On average the model waited for only 0.21 tokens of context.

Internationalization
Our best outcomes so far have been with English transcripts. This is mostly due to resourcing issues: while there are a number of relatively large labeled conversational datasets that include disfluencies in English, other languages often have very few such datasets available. So, in order to make disfluency detection models available outside English, a method is needed to build models in a way that does not require finding and labeling hundreds of thousands of utterances in each target language. A promising solution is to leverage multi-language versions of BERT to transfer what a model has learned about English disfluencies to other languages in order to achieve similar performance with much less data. This is an area of active research, but we do have some promising results to outline here.

As a first effort to validate this approach, we added labels to about 10,000 lines of dialogue from the German CALLHOME dataset. We then started with the Geotrend English and German Bilingual BERT model (extracted from Multilingual BERT) and fine-tuned it with approximately 77,000 disfluency-labeled English Switchboard examples and 1.3 million examples of self-labeled transcripts from the Fisher Corpus. Then, we did further fine tuning with about 7,500 in-house–labeled examples from the German CALLHOME dataset.

Diagram illustrating the flow of labeled data and self-trained output in our best multilingual training setup. By training on both English and German data we are able to improve performance via transfer learning.

Our results indicate that fine-tuning on a large English corpus can produce acceptable precision using zero-shot transfer to similar languages like German, but at least a modest amount of German labels were needed to improve recall from less than 60% to greater than 80%. Two-stage fine-tuning of an English-German bilingual model produced the highest precision and overall F1 score.

Approach | Precision | Recall | F1
German BERTBASE model fine-tuned on 7,300 human-labeled German CALLHOME examples | 89.1% | 81.3% | 85.0
Same as above, but with an additional 7,500 self-labeled German CALLHOME examples | 91.5% | 83.3% | 87.2
English/German bilingual BERTBASE model fine-tuned on English Switchboard+Fisher, evaluated on German CALLHOME (zero-shot language transfer) | 87.2% | 59.1% | 70.4
Same as above, but subsequently fine-tuned with 14,800 German CALLHOME (human- and self-labeled) examples | 95.5% | 82.6% | 88.6

Conclusion
Cleaning up disfluencies from transcripts can improve not just their readability for people, but also the performance of other models that consume transcripts. We demonstrate effective methods for identifying disfluencies and expand our disfluency model to resource-constrained environments, new languages, and more interactive use cases.

Acknowledgements
Thank you to Vicky Zayats, Johann Rocholl, Angelica Chen, Noah Murad, Dirk Padfield, and Preeti Mohan for writing the code, running the experiments, and composing the papers discussed here. We also thank our technical product manager Aaron Schneider, Bobby Tran from the Cerebra Data Ops team, and Chetan Gupta from Speech Data Ops for their support in obtaining additional data labels.