Categories
Offsites

Alpa: Automated Model-Parallel Deep Learning

Over the last several years, the rapidly growing size of deep learning models has quickly exceeded the memory capacity of single accelerators. Earlier models like BERT (with a parameter size of < 1GB) can efficiently scale across accelerators by leveraging data parallelism in which model weights are duplicated across accelerators while only partitioning and distributing the training data. However, recent large models like GPT-3 (with a parameter size of 175GB) can only scale using model parallel training, where a single model is partitioned across different devices.

While model parallelism strategies make it possible to train large models, they are more complex in that they need to be specifically designed for target neural networks and compute clusters. For example, Megatron-LM uses a model parallelism strategy to split the weight matrices by rows or columns and then synchronizes results among devices. Device placement or pipeline parallelism partitions different operators in a neural network into multiple groups and the input data into micro-batches that are executed in a pipelined fashion. Model parallelism often requires significant effort from system experts to identify an optimal parallelism plan for a specific model. But doing so is too onerous for most machine learning (ML) researchers whose primary focus is to run a model and for whom the model’s performance becomes a secondary priority. As such, there remains an opportunity to automate model parallelism so that it can easily be applied to large models.
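To make the column-splitting idea concrete, here is a minimal sketch in plain Python. It is not Megatron-LM code: the helper names are hypothetical and nested lists stand in for tensors on separate devices. It only illustrates that multiplying by column shards of a weight matrix and concatenating the partial outputs reproduces the full product.

```python
# Hypothetical helpers; plain Python lists stand in for device tensors.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards (one per device)."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly"
    step = n // parts
    return [[row[s:s + step] for row in w] for s in range(0, n, step)]

def column_parallel_matmul(x, w, parts):
    """Each 'device' multiplies x by its column shard; the partial outputs
    are then concatenated column-wise (the synchronization step)."""
    partial = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[i] for p in partial), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
result = column_parallel_matmul(x, w, parts=2)
print(result == matmul(x, w))  # True
```

In a real system the concatenation would be a collective communication operation across devices rather than a list join.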

In “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning”, published at OSDI 2022, we describe a method for automating the complex model parallelism process. We demonstrate that with only one line of code Alpa can transform any JAX neural network into a distributed version with an optimal parallelization strategy that can be executed on a user-provided device cluster. We are also excited to release Alpa’s code to the broader research community.

Alpa Design
We begin by grouping existing ML parallelization strategies into two categories, inter-operator parallelism and intra-operator parallelism. Inter-operator parallelism assigns distinct operators to different devices (e.g., device placement) that are often accelerated with a pipeline execution schedule (e.g., pipeline parallelism). With intra-operator parallelism, which includes data parallelism (e.g., DeepSpeed-ZeRO), operator parallelism (e.g., Megatron-LM), and expert parallelism (e.g., GShard-MoE), individual operators are split and executed on multiple devices, and often collective communication is used to synchronize the results across devices.

The difference between these two approaches maps naturally to the heterogeneity of a typical compute cluster. Inter-operator parallelism has lower communication bandwidth requirements because it only transmits activations between operators on different accelerators. However, it suffers from device underutilization because of its pipeline data dependency, i.e., some operators are inactive while waiting on the outputs from other operators. In contrast, intra-operator parallelism doesn’t have the data dependency issue, but requires heavier communication across devices. In a GPU cluster, the GPUs within a node have higher communication bandwidth that can accommodate intra-operator parallelism. However, GPUs across different nodes are often connected with much lower bandwidth (e.g., Ethernet), so inter-operator parallelism is preferred there.

By leveraging heterogeneous mapping, we design Alpa as a compiler that conducts various passes when given a computational graph and a device cluster from a user. First, the inter-operator pass slices the computational graph into subgraphs and the device cluster into submeshes (i.e., a partitioned device cluster) and identifies the best way to assign a subgraph to a submesh. Then, the intra-operator pass finds the best intra-operator parallelism plan for each pipeline stage from the inter-operator pass. Finally, the runtime orchestration pass generates a static plan that orders the computation and communication and executes the distributed computational graph on the actual device cluster.

An overview of Alpa. In the sliced subgraphs, red and blue represent the way the operators are partitioned and gray represents operators that are replicated. Green represents the actual devices (e.g., GPUs).

Intra-Operator Pass
Similar to previous research (e.g., Mesh-TensorFlow and GSPMD), intra-operator parallelism partitions a tensor on a device mesh. This is shown below for a typical 3D tensor in a Transformer model with a given batch, sequence, and hidden dimensions. The batch dimension is partitioned along device mesh dimension 0 (mesh0), the hidden dimension is partitioned along mesh dimension 1 (mesh1), and the sequence dimension is replicated to each processor.

A 3D tensor that is partitioned on a 2D device mesh.
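The partitioning described above can be sketched as a small shape calculation. This is an illustration only, not Alpa's API: the function name, the `spec` encoding (a mesh dimension index per tensor dimension, or `None` for replicated), and the example sizes are assumptions.

```python
def shard_shape(tensor_shape, mesh_shape, spec):
    """Return the per-device shape of a tensor partitioned on a device mesh.

    spec[i] is the mesh dimension that tensor dimension i is split along,
    or None when that tensor dimension is replicated.
    """
    local = []
    for size, mesh_dim in zip(tensor_shape, spec):
        if mesh_dim is None:
            local.append(size)           # replicated: every device holds it all
        else:
            parts = mesh_shape[mesh_dim]
            assert size % parts == 0, "dimension must divide evenly"
            local.append(size // parts)  # split evenly along this mesh axis
    return tuple(local)

# (batch, sequence, hidden) on a 2 x 4 mesh:
# batch along mesh0, sequence replicated, hidden along mesh1.
local = shard_shape((8, 128, 1024), (2, 4), (0, None, 1))
print(local)  # (4, 128, 256)
```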

With the partitions of tensors in Alpa, we further define a set of parallelization strategies for each individual operator in a computational graph. We show example parallelization strategies for matrix multiplication in the figure below. Defining parallelization strategies on operators leads to possible conflicts on the partitions of tensors because one tensor can be both the output of one operator and the input of another. In this case, re-partition is needed between the two operators, which incurs additional communication costs.

The parallelization strategies for matrix multiplication.

Given the partitions of each operator and re-partition costs, we formulate the intra-operator pass as an Integer Linear Programming (ILP) problem. For each operator, we define a one-hot variable vector to enumerate the partition strategies. The ILP objective is to minimize the sum of compute and communication cost (node cost) and re-partition communication cost (edge cost). The solution of the ILP translates to one specific way to partition the original computational graph.
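The objective can be illustrated with a toy example. The sketch below does not use an ILP solver and is not Alpa's implementation: it brute-forces every strategy assignment for a short chain of operators, and the cost numbers are made up. Picking one strategy index per operator plays the role of the one-hot vector.

```python
from itertools import product

def best_plan(node_cost, edge_cost):
    """node_cost[v][s]: compute + communication cost of operator v under
    strategy s. edge_cost[v][s][t]: re-partition cost between operator v
    (strategy s) and operator v + 1 (strategy t).
    Returns (total cost, chosen strategy per operator)."""
    choices = [range(len(costs)) for costs in node_cost]
    best = None
    for plan in product(*choices):
        total = sum(node_cost[v][s] for v, s in enumerate(plan))
        total += sum(edge_cost[v][plan[v]][plan[v + 1]]
                     for v in range(len(plan) - 1))
        if best is None or total < best[0]:
            best = (total, plan)
    return best

# Three operators in a chain, two candidate strategies each (made-up costs).
node_cost = [[4, 2], [3, 4], [1, 5]]
edge_cost = [[[0, 1], [1, 0]],   # between operator 0 and operator 1
             [[0, 1], [1, 0]]]   # between operator 1 and operator 2
print(best_plan(node_cost, edge_cost))  # (7, (1, 0, 0))
```

Brute force is exponential in the number of operators, which is why a formulation that a solver can handle matters for real graphs.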

Inter-Operator Pass
The inter-operator pass slices the computational graph and device cluster for pipeline parallelism. As shown below, the boxes represent micro-batches of input and the pipeline stages represent a submesh executing a subgraph. The horizontal dimension represents time and shows the pipeline stage at which a micro-batch is executed. The goal of the inter-operator pass is to minimize the total execution latency, that is, the time to execute the entire workload on the device cluster, as illustrated in the figure below. Alpa uses a Dynamic Programming (DP) algorithm to minimize the total latency. The computational graph is first flattened and then fed to the intra-operator pass, where the performance of all possible partitions of the device cluster into submeshes is profiled.

Pipeline parallelism. For a given time, this figure shows the micro-batches (colored boxes) that a partitioned device cluster and a sliced computational graph (e.g., stage 1, 2, 3) are processing.
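As a rough illustration of what is being minimized, the sketch below uses a common pipeline latency model: the first micro-batch passes through every stage, and each remaining micro-batch adds one slot of the slowest stage. The exhaustive search over stage boundaries is a stand-in for the DP algorithm, not Alpa's code, and it assumes for simplicity that a stage's latency is just the sum of its layer costs (the real pass also chooses a submesh per stage).

```python
from itertools import combinations

def pipeline_latency(stage_latencies, num_microbatches):
    """Sum of all stage latencies plus (B - 1) slots of the slowest stage."""
    return sum(stage_latencies) + (num_microbatches - 1) * max(stage_latencies)

def best_stage_split(layer_costs, num_stages, num_microbatches):
    """Try every way to cut a chain of layers into contiguous stages."""
    n = len(layer_costs)
    best = None
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = (0,) + cuts + (n,)
        stages = [sum(layer_costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        latency = pipeline_latency(stages, num_microbatches)
        if best is None or latency < best[0]:
            best = (latency, stages)
    return best

# Five layers with made-up costs, three stages, four micro-batches.
print(best_stage_split([2, 2, 3, 1, 4], num_stages=3, num_microbatches=4))
# (24, [4, 4, 4])
```

The example shows why balanced stages matter: the total work is fixed, so the split with the smallest slowest stage wins.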

Runtime Orchestration
After the inter- and intra-operator parallelization strategies are complete, the runtime generates and dispatches a static sequence of execution instructions for each device submesh. These instructions include RUN a specific subgraph, SEND/RECEIVE tensors from other meshes, or DELETE a specific tensor to free the memory. The devices can execute the computational graph without other coordination by following the instructions.
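A static instruction list of this kind can be sketched as a tiny interpreter. Everything here is hypothetical (the tuple encoding, the `inbox`/`outbox` dictionaries standing in for cross-mesh communication, and the function names); it only shows how a submesh could follow a fixed sequence of RUN, SEND, RECEIVE, and DELETE steps without further coordination.

```python
def run_submesh(instructions, subgraphs, inbox, outbox):
    """Execute a static instruction list for one device submesh.

    Instructions are tuples: ("RECV", name), ("RUN", subgraph, src, dst),
    ("SEND", name), ("DELETE", name). Returns the tensors still in memory.
    """
    memory = {}
    for op, *args in instructions:
        if op == "RECV":
            memory[args[0]] = inbox.pop(args[0])      # tensor from another mesh
        elif op == "RUN":
            fn, src, dst = args
            memory[dst] = subgraphs[fn](memory[src])  # execute a subgraph
        elif op == "SEND":
            outbox[args[0]] = memory[args[0]]         # tensor to another mesh
        elif op == "DELETE":
            del memory[args[0]]                       # free the memory
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory

subgraphs = {"stage1": lambda x: [v * 2 for v in x]}
plan = [("RECV", "x"), ("RUN", "stage1", "x", "y"), ("DELETE", "x"),
        ("SEND", "y"), ("DELETE", "y")]
outbox = {}
leftover = run_submesh(plan, subgraphs, {"x": [1, 2, 3]}, outbox)
print(outbox, leftover)  # {'y': [2, 4, 6]} {}
```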

Evaluation
We test Alpa with eight AWS p3.16xlarge instances, each of which has eight 16 GB V100 GPUs, for 64 total GPUs. We examine weak scaling results of growing the model size while increasing the number of GPUs. We evaluate three models: (1) the standard Transformer model (GPT); (2) the GShard-MoE model, a transformer with mixture-of-expert layers; and (3) Wide-ResNet, a significantly different model with no existing expert-designed model parallelization strategy. The performance is measured by peta-floating point operations per second (PFLOPS) achieved on the cluster.

We demonstrate that for GPT, Alpa outputs a parallelization strategy very similar to the one computed by the best existing framework, Megatron-LM, and matches its performance. For GShard-MoE, Alpa outperforms the best expert-designed baseline on GPU (i.e., DeepSpeed) by up to 8x. Results for Wide-ResNet show that Alpa can generate the optimal parallelization strategy for models that have not been studied by experts. We also show the linear scaling numbers for reference.

GPT: Alpa matches the performance of Megatron-LM, the best expert-designed framework.
GShard MoE: Alpa outperforms DeepSpeed (the best expert-designed framework on GPU) by up to 8x.
Wide-ResNet: Alpa generalizes to models without manual plans. Pipeline and Data Parallelism (PP-DP) is a baseline model that uses only pipeline and data parallelism but no other intra-operator parallelism.
The parallelization strategy for Wide-ResNet on 16 GPUs consists of three pipeline stages and is a complicated strategy even for an expert to design. Stages 1 and 2 are on 4 GPUs performing data parallelism, and stage 3 is on 8 GPUs performing operator parallelism.

Conclusion
The process of designing an effective parallelization plan for distributed model-parallel deep learning has historically been a difficult and labor-intensive task. Alpa is a new framework that leverages intra- and inter-operator parallelism for automated model-parallel distributed training. We believe that Alpa will democratize distributed model-parallel learning and accelerate the development of large deep learning models. Explore the open-source code and learn more about Alpa in our paper.

Acknowledgements
Thanks to the co-authors of the paper: Lianmin Zheng, Hao Zhang, Yonghao Zhuang, Yida Wang, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. We would also like to thank Shibo Wang, Jinliang Wei, Yanping Huang, Yuanzhong Xu, Zhifeng Chen, Claire Cui, Naveen Kumar, Yash Katariya, Laurent El Shafey, Qiao Zhang, Yonghui Wu, Marcello Maggioni, Mingyao Yang, Michael Isard, Skye Wanderman-Milne, and David Majnemer for their contributions to this research.

Categories
Misc

Training, Test, AND Validation Split best practices

What is the best practice for splitting a dataset into train, test, and validation sets? I see all these tensorflow tutorials showing how to split data into train and test, but no mention of the validation set, which is important for hyperparameter tuning and ensuring that we can hold the test set until the end in order to get an unbiased estimate.

submitted by /u/berimbolo21

Categories
Misc

Question about increasing model accuracy

Hello, I am working on a project to detect and classify different types and stages of brain cancer. My issue is that no matter what I do I can’t get past ~80% accuracy. I have looked into various resources and papers and I can’t notice anything else I can do other than changing my model to a Dictionary Learning Model (I am currently using a CNN because I am quite new to machine learning).

If anyone could help me out if I have any obvious code errors or if they have any ideas on how to improve the accuracy that would be great.

Here is my code

submitted by /u/ACHANTAS1

Categories
Misc

Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO

Learn how to develop an end-to-end workflow starting with synthetic data generation in NVIDIA Isaac Sim, fine-tuning with the TAO Toolkit, and deploying the model with NVIDIA Isaac ROS.

From building cars to helping surgeons and delivering pizzas, robots not only automate but also speed up human tasks manyfold. With the advent of AI, you can build even smarter robots that can better perceive their surroundings and make decisions with minimal human intervention.

Take, for instance, an autonomous robot used in warehouses to move payloads from one place to another. It must perceive the free space around it, detect and avoid any obstacles in its path, and make “on-the-fly” decisions to pick a new path without any delay.

Therein lies the challenge. This means building an application powered by an AI model that has been trained and optimized to work in this environment. It requires collecting copious amounts of high-quality data and developing a highly accurate AI model to power the application. These are the key barriers when it comes to moving applications from the lab into a production environment.

In this post, we show how you can solve both your data challenge and model creation challenge with the NVIDIA Isaac platform and the TAO framework. You use NVIDIA Isaac Sim, a robotics simulation application, to create virtual environments and generate synthetic data. The NVIDIA TAO Toolkit is a low-code AI model development solution with built-in transfer learning that fine-tunes a pretrained model with a fraction of the data needed to train from scratch. Finally, you deploy the optimized model onto a robot using NVIDIA Isaac ROS and put it to work in the real world.

Diagram shows an overview of workflow with synthetic data generation using NVIDIA Isaac Sim and training a pretrained model with TAO in a simulated environment, evaluate the results and collect more data if necessary. The model is then pruned and retrained again in the simulated environment. The model then is moved to the next phase where it is fine-tuned on real-world data and eventually deployed. 
Figure 1. Overview of workflow to train a TAO toolkit model on synthetic data using NVIDIA Isaac Sim to adapt a real-world use case.

Prerequisites

Before you start, you must have the following resources for training and deploying:

  • NVIDIA GPU Driver version: >470
  • NVIDIA Docker: 2.5.0-1
  • NVIDIA GPU in the cloud or on-premises:
    • NVIDIA A100
    • NVIDIA V100
    • NVIDIA T4
    • NVIDIA RTX 30×0 (NVIDIA Isaac Sim supports NVIDIA RTX 20 series as well)
    • NVIDIA Jetson Xavier or Jetson Xavier NX
  • NVIDIA TAO Toolkit: 4.22. For more information, see the TAO Toolkit Quick Start guide
  • NVIDIA Isaac Sim and Isaac ROS

Synthetic data generation with NVIDIA Isaac Sim

In this section, we outline the steps for generating synthetic data in NVIDIA Isaac Sim. Synthetic data is annotated information that computer simulations or algorithms generate. Synthetic data can help solve data challenges when real data is difficult or expensive to acquire.

NVIDIA Isaac Sim provides three methods for generating synthetic data:

  • Replicator composer
  • Python scripts
  • GUI

For this experiment, we chose to use Python scripts to generate data with domain randomization. Domain randomization varies the parameters that define a scene in the simulation environment, including the position and scale of various objects in a scene, the lighting of the simulated environment, the color and texture of objects, and more.

Adding domain randomization to simultaneously vary many parameters of the scene improves the dataset quality and enhances the model’s performance by exposing it to a wide variety of domain parameters seen in the real world.
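The idea can be sketched independently of any simulator. The snippet below is not Isaac Sim code: the function name, the parameter names, and all value ranges are illustrative assumptions. It only shows several scene parameters being drawn at random for each generated frame.

```python
import random

def sample_scene(rng):
    """Draw one random scene configuration (all ranges are illustrative)."""
    return {
        "object_position": (rng.uniform(0, 150), rng.uniform(0, 150)),
        "object_scale": rng.uniform(0.8, 1.2),
        "light_intensity": rng.uniform(500, 1500),
        "object_color": tuple(rng.random() for _ in range(3)),  # RGB in [0, 1)
    }

rng = random.Random(0)  # fixed seed so the same dataset can be regenerated
scenes = [sample_scene(rng) for _ in range(5)]
for scene in scenes:
    print(scene)
```

Varying all parameters in the same draw, rather than one at a time, is what exposes the model to many combinations of conditions.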

In this case, you use two environments for training data: a warehouse and a small room. Next steps include adding objects into the scene that obey the laws of physics. We used sample objects from NVIDIA Isaac Sim, which also includes everyday objects from the YCB dataset.

Picture of cans, and bottles randomly placed on the floor. Another picture of a box placed underneath the cart along with some additional boxes placed in a warehouse type environment.
Figure 2. Sample simulation images from the simple room and warehouse environments

After installing NVIDIA Isaac Sim, the Isaac Sim App Selector provides an option for Open in Folder, which contains a python.sh script. This is used to run the scripts for data generation.

Follow the steps listed to generate the data.

Select the environment and add a camera to the scene:

def add_camera_to_viewport(self):
  # Add a camera to the scene and attach it to the viewport
  self.camera_rig = UsdGeom.Xformable(create_prim("/Root/CameraRig", "Xform"))
  self.camera = create_prim("/Root/CameraRig/Camera", "Camera")

Add a semantic ID to the floor:

def add_floor_semantics(self):
  # Get the floor from the stage and update its semantics
  stage = kit.context.get_stage()
  floor_prim = stage.GetPrimAtPath("/Root/Towel_Room01_floor_bottom_218")
  add_update_semantics(floor_prim, "floor")

Add objects in the scene with Physics:

def load_single_asset(self, object_transform_path, object_path, usd_object):
  # Random x, y points for the position of the USD object
  translate_x, translate_y = 150 * random.random(), 150 * random.random()
  # Load the USD object
  try:
      asset = create_prim(object_transform_path, "Xform",
               position=np.array([150 + translate_x, 175 + translate_y, -55]),
               orientation=euler_angles_to_quat(np.array([0, 0.0, 0])),
               usd_path=object_path)
      # Set the object with correct physics
      utils.setRigidBody(asset, "convexHull", False)
  except Exception as exc:
      # The original snippet omits the handler; this fallback is an assumption
      print(f"Could not load {object_path}: {exc}")

Initialize domain randomization components:

def create_camera_randomization(self):
  # A range of values to move and rotate the camera
  camera_translate_min_range, camera_translate_max_range = ((100, 100, -58),
                                                            (220, 220, -52))
  camera_rotate_min_range, camera_rotate_max_range = (80, 0, 0), (85, 0, 360)

  # Create a Transformation DR Component for the Camera
  self.camera_transform = self.dr.commands.CreateTransformComponentCommand(
                              prim_paths=[self.camera.GetPath()],
                              translate_min_range=camera_translate_min_range,
                              translate_max_range=camera_translate_max_range,
                              rotate_min_range=camera_rotate_min_range,
                              rotate_max_range=camera_rotate_max_range,
                              duration=0.5).do()

Make sure that the camera position and properties in the simulation are similar to the real-world attributes. Adding a semantic ID to the floor is necessary for generating the correct free space segmentation masks. As mentioned earlier, domain randomization was applied to help with the sim2real performance of the model.

The Offline Data Generation sample provided in the NVIDIA Isaac Sim documentation is the starting point for our scripts. Changes have been made for this use case that include adding objects to a scene with physics, updating domain randomization, and adding semantics to the floor. We have generated nearly 30,000 images with their corresponding segmentation masks for the dataset.

Train, adapt, and optimize with the TAO Toolkit

In this section, you use the TAO Toolkit to fine-tune the model with the generated synthetic data. For this task, we chose to experiment with UNET models available from NGC.

!ngc registry model list nvidia/tao/pretrained_semantic_segmentation:*

Set up your data, spec file (TAO specifications), and experiment directories:

%set_env KEY=tlt_encode
%set_env GPU_INDEX=0
%set_env USER_EXPERIMENT_DIR=/workspace/experiments
%set_env DATA_DOWNLOAD_DIR=/workspace/freespace_data
%set_env SPECS_DIR=/workspace/specs

The next step is to pick the model.

Picking the right pretrained model

A pretrained AI and deep learning model is one that has been trained on representative datasets and fine-tuned with weights and biases. You can quickly and easily fine-tune a pretrained model by applying transfer learning with only a fraction of data compared to training from scratch.

Within the realm of pretrained models, there are models that perform a specific task like detecting people, cars, license plates, and so on.

We first picked a U-Net model with ResNet10 and ResNet18 backbones. The results obtained from the models showed the walls and the floor merged as a single entity on real-world data, instead of two separate entities. This was true even though the performance of the model on simulation images showed high levels of accuracy. 

Backbone  Pruned  Train images  Val images  Image size  F1 Score  mIoU (%)  Epochs
RN10      NO      25K           4.5K        512×512     89.2      80.1      50
RN18      NO      25K           4.5K        512×512     91.1      83.0      50
Table 1. Experiments on different pretrained models available from the NGC platform for TAO.

We experimented with different backbones and image sizes to observe the trade-off of latency (FPS) to accuracy. All models in the table are the same (UNET); only the backbones are different. 

Images of a carpeted floor with objects scattered around. The blue areas represent blocked space and the red areas are free space.
Figure 3. Predictions of ResNet18 model. (left) Simulation image; (right) a real-world image.

Based on the results, it was evident that we needed a different model that better fit the use case. We picked the PeopleSemSeg model available in the NGC catalog. The model was pretrained on five million objects for the “person” class with the dataset consisting of a mix of camera heights, crowd density, and field-of-view (FOV). This model also can segment the background and the free space as two separate entities.

After training this model with the same dataset, the mean IoU increased by more than 10% and the resulting images clearly show better segmentation between floor and walls.
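For readers unfamiliar with the metrics reported in the tables, the sketch below computes per-class IoU, mean IoU, and F1 score from a predicted and a ground-truth mask using the standard definitions. It is not the TAO Toolkit's evaluation code; the function names and the tiny masks are assumptions for illustration.

```python
def confusion(pred, truth, cls):
    """Count true positives, false positives, false negatives for one class."""
    pairs = [(p, t) for prow, trow in zip(pred, truth)
             for p, t in zip(prow, trow)]
    tp = sum(p == cls and t == cls for p, t in pairs)
    fp = sum(p == cls and t != cls for p, t in pairs)
    fn = sum(p != cls and t == cls for p, t in pairs)
    return tp, fp, fn

def iou(pred, truth, cls):
    tp, fp, fn = confusion(pred, truth, cls)
    return tp / (tp + fp + fn)  # intersection over union

def f1(pred, truth, cls):
    tp, fp, fn = confusion(pred, truth, cls)
    return 2 * tp / (2 * tp + fp + fn)

def mean_iou(pred, truth, classes):
    return sum(iou(pred, truth, c) for c in classes) / len(classes)

# 1 = free space, 0 = everything else (made-up 2 x 4 masks).
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0]]
print(iou(pred, truth, 1), f1(pred, truth, 1), mean_iou(pred, truth, [0, 1]))
# 0.6 0.75 0.6
```

This sketch assumes every class appears in at least one of the two masks; otherwise the denominators would be zero.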

Backbone         Pruned  Train images  Val images  Image size  F1 Score  mIoU (%)  Epochs
PeopleSemSegNet  NO      25K           4.5K        512×512     98.1      96.4      50
PeopleSemSegNet  NO      25K           4.5K        960×544     99.0      98.1      50
Table 2. Experiments with PeopleSemSegNet Trainable model

Images show a carpeted floor with objects scattered around.
Figure 4. Prediction results for transfer learning on the peoplesemseg TAO model with synthetic data (left) and real-world data (right).

Figure 4 shows free space identification on the simulation image and the real-world images from the robot’s perspective before fine-tuning the PeopleSemSeg model on real-world data, that is, with models trained purely on NVIDIA Isaac Sim data.

The key takeaway is that while there may be many pretrained models that can do the task, it is important to pick one that is closest to your current application. This is where TAO’s purpose-built models are useful.

!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
              -e $SPECS_DIR/spec_vanilla_unet.txt \
              -r $USER_EXPERIMENT_DIR/semseg_experiment_unpruned \
              -m $USER_EXPERIMENT_DIR/peoplesemsegnet.tlt \
              -n model_freespace \
              -k $KEY

After the model is trained, evaluate the model performance on validation data:

!tao unet evaluate --gpu_index=$GPU_INDEX \
    -e $SPECS_DIR/spec_vanilla_unet.txt \
    -m $USER_EXPERIMENT_DIR/semseg_experiment_unpruned/weights/model_freespace.tlt \
    -o $USER_EXPERIMENT_DIR/semseg_experiment_unpruned/ \
    -k $KEY

When you are satisfied with the model performance on NVIDIA Isaac Sim data and the Sim2Sim validation performance, prune the model.

To run this model with minimal latency, optimize it to run on the target GPU. There are two ways to achieve this:

  • Pruning: The pruning feature in the TAO Toolkit automatically removes the unwanted layers and neurons, effectively reducing the size of the model. You must retrain the model to recover the accuracy lost during pruning.
  • Post-training quantization: Another feature in the TAO toolkit enables the model size to be further reduced. This changes its precision from FP32 to INT8, enhancing performance without sacrificing its accuracy.
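Both ideas can be illustrated with a generic sketch. This is not how the TAO Toolkit implements either feature: the functions below are hypothetical, and they show one common form of each technique (dropping channels whose normalized L2 norm falls below a threshold, and symmetric INT8 quantization with a single scale factor).

```python
import math

def prune_channels(channels, threshold):
    """Keep channels whose L2 norm, normalized by the largest norm,
    is at least `threshold`."""
    norms = [math.sqrt(sum(w * w for w in ch)) for ch in channels]
    largest = max(norms)
    return [ch for ch, n in zip(channels, norms) if n / largest >= threshold]

def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

# Three made-up channels; the second has negligible weights.
channels = [[0.9, -1.2, 0.4], [0.01, -0.02, 0.0], [0.5, 0.5, -0.7]]
kept = prune_channels(channels, threshold=0.1)
print(len(kept))  # 2

q, scale = quantize_int8(kept[0])
print(q)  # [95, -127, 42]
print(dequantize(q, scale))  # close to [0.9, -1.2, 0.4]
```

The round trip through `dequantize` shows the small rounding error that quantization introduces, which is why accuracy is checked again after this step.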

First, prune the model:

!tao unet prune \
    -e $SPECS_DIR/spec_vanilla_unet.txt \
    -m $USER_EXPERIMENT_DIR/semseg_experiment_unpruned/weights/model_freespace.tlt \
    -o $USER_EXPERIMENT_DIR/unet_experiment_pruned/model_unet_pruned.tlt \
    -eq union \
    -pth 0.1 \
    -k $KEY

Retrain the pruned model:

!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
              -e $SPECS_DIR/spec_vanilla_unet_retrain.txt \
              -r $USER_EXPERIMENT_DIR/unet_experiment_retrain \
              -m $USER_EXPERIMENT_DIR/unet_experiment_pruned/model_unet_pruned.tlt \
              -n model_unet_retrained \
              -k $KEY

When you are satisfied with the Sim2Sim validation performance of the pruned model, go to the next step to fine-tune on real-world data.

!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
              -e $SPECS_DIR/spec_vanilla_unet_domain_adpt.txt \
              -r $USER_EXPERIMENT_DIR/semseg_experiment_domain_adpt \
              -m $USER_EXPERIMENT_DIR/semseg_experiment_retrain/model_unet_pruned.tlt \
              -n model_domain_adapt \
              -k $KEY

Results

Table 3 summarizes the results for the unpruned and pruned models. The final pruned and quantized model, chosen for deployment, was 17x smaller and delivered inference performance 5x faster than the original model, measured on NVIDIA Jetson Xavier NX.

Pruned  Fine-tuned on real-world data  Training set  Validation set  F1 Score (%)  mIoU (%)  Precision  FPS
NO      NO                             Sim           Sim             0.990         0.981     FP16       3.9
YES     NO                             Sim           Sim             0.991         0.982     FP16       15.29
YES     NO                             Sim           Real            0.680         0.515     FP16       15.29
YES     YES                            Real          Real            0.979         0.960     FP16       15.29
YES     YES                            Real          Real            0.974         0.959     INT8       20.25
Table 3. Results on Sim2Sim and Sim2Real

The training dataset for the sim data consists of 25K images, whereas the real-world training data used for fine-tuning consists of only 44 images. The validation dataset of real images consists of only 56 images. For real-world data, we collected a dataset in three different indoor scenarios. The input image size for the model is 960×544. The inference performance is measured using the NVIDIA TensorRT trtexec tool.

Four images showing the results obtained from the model chosen for deployment. The images show clear delineation between the floor, walls, and obstacles directly in front of the robot.
Figure 5. Results on real-world images from the robot after fine-tuning on real-world data

Deployment with NVIDIA Isaac ROS

In this section, we show the steps to take the trained and optimized model and deploy it using NVIDIA Isaac ROS on iRobot’s Create 3 robot powered by Jetson Xavier NX. Both Create 3 and the NVIDIA Isaac ROS image segmentation node run on ROS2.

This example uses the /isaac_ros_image_segmentation/isaac_ros_unet GitHub repo for deploying free space segmentation.

Picture of a carpeted floor with objects on it and the same picture using free space identification results.
Figure 6. Image and segmentation mask using rqt_image_viewer in ROS2. (left) Uses a USB camera on the Create 3 robot; (right) Uses the isaac-ros-image-segmentation node.

To use the free space segmentation model, perform the following steps from the /NVIDIA-ISAAC-ROS/isaac_ros_image_segmentation GitHub repo.

Create a Docker interactive workspace:

$ isaac_ros_common/scripts/run_dev.sh your_ws

Clone all the package dependencies:

Build and source the workspace:

$ cd /workspaces/isaac_ros-dev
$ colcon build && . install/setup.bash

Download the trained free space identification (.etlt) model from your work machine:

$scp :

Convert the encrypted TLT model (.etlt) to the TensorRT engine plan format. Run the following command for the INT8 model:

tao converter -k tlt_encode \
               -e trt.fp16.freespace.engine \
               -p input_1,1x3x544x960,1x3x544x960,1x3x544x960 \
               unet_freespace.etlt

Follow the walkthrough from Isaac ROS image segmentation:

  • Keep the TensorRT model engine file in the right directory.
  • Create config.pbtxt.
  • Update the model engine path and name in the isaac_ros_unet launch file.
  • Rebuild and run the following commands:
$ colcon build --packages-up-to isaac_ros_unet && . install/setup.bash
$ ros2 launch isaac_ros_unet isaac_ros_unet_triton.launch.py

Summary

In this post, we showed you an end-to-end workflow starting with synthetic data generation in NVIDIA Isaac Sim, fine-tuning with the TAO Toolkit, and deploying the model with NVIDIA Isaac ROS.

Both NVIDIA Isaac Sim and TAO Toolkit are solutions that abstract away the AI framework complexity, enabling you to build and deploy AI-powered robotic applications in production, without the need for any AI expertise.

Get started with this experiment by pulling the /NVIDIA-AI-IOT/robot_freespace_seg_Isaac_TAO GitHub project.

Categories
Misc

‘In the NVIDIA Studio’ Welcomes Concept Designer Yangtian Li

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology accelerates creative workflows.  This week In the NVIDIA Studio, we welcome Yangtian Li, a senior concept artist at Singularity6. Li is a concept designer and illustrator Read article >

The post ‘In the NVIDIA Studio’ Welcomes Concept Designer Yangtian Li appeared first on NVIDIA Blog.

Categories
Misc

Trying to compare different models.

So I am trying to use neural networks for time series forecasting. I have created 3 different models (LSTM, CNN and an LSTM-CNN hybrid). Now I want to compare their performance based on the metrics MAE, RMSE and sMAPE. Here comes the dilemma: whenever I retrain the models on the same dataset, the performance of every model changes, and the previously best-performing model now performs the worst. I have tried to set the random seed to a constant, yet I am facing the same issue. Please help 🙁 Thanks.

submitted by /u/rumble_ftw

Categories
Misc

Upcoming Webinar: Optimizing DNN Inference with NVIDIA TensorRT on DRIVE Orin

Join an upcoming webinar highlighting the newest features of NVIDIA TensorRT and learn how to optimize inference engines for production on the Orin AI platform.

Categories
Misc

Mown Away: Startup Rolls Out Autonomous Lawnmower With Cutting Edge Tech

Jack Morrison and Isaac Roberts, co-founders of Replica Labs, were restless two years after their 3D vision startup was acquired, seeking another adventure. Then, in 2018, when Morrison was mowing his lawn, it struck him: autonomous lawn mowers. The two, along with Davis Foster, co-founded Scythe Robotics. The company, based in Boulder, Colo., has a Read article >

The post Mown Away: Startup Rolls Out Autonomous Lawnmower With Cutting Edge Tech appeared first on NVIDIA Blog.

Categories
Misc

Meet the Omnivore: 3D Artist Creates Towering Work With NVIDIA Omniverse

Edward McEvenue grew up making claymations in LEGO towns. Now, he’s creating photorealistic animations in virtual cities, drawing on more than a decade of experience in the motion graphics industry.

The post Meet the Omnivore: 3D Artist Creates Towering Work With NVIDIA Omniverse appeared first on NVIDIA Blog.

Categories
Misc

Neural Network Generates Global Tree Height Map, Reveals Carbon Stock Potential

Using remote sensing and an ensemble of convolutional neural networks, the study could guide sustainable forest management and climate mitigation efforts.

A new study from researchers at ETH Zurich’s EcoVision Lab is the first to produce an interactive Global Canopy Height map. Using a newly developed deep learning algorithm that processes publicly available satellite images, the study could help scientists identify areas of ecosystem degradation and deforestation. The work could also guide sustainable forest management by identifying areas for prime carbon storage—a cornerstone in mitigating climate change.

“Global high-resolution data on vegetation characteristics are needed to sustainably manage terrestrial ecosystems, mitigate climate change, and prevent biodiversity loss. With this project, we aim to fill the missing data gaps by merging data from two space missions with the help of deep learning,” said Konrad Schindler, a Professor in the Department of Civil, Environmental, and Geomatic Engineering at ETH Zurich. 

From rainforests to boreal woodland, forests play a key role in climate mitigation, absorbing up to 2 billion tons of carbon dioxide every year. Aboveground biomass, which includes all parts of the tree such as the trunk, bark, or branches, correlates with the amount of carbon stored in a forest.

Tree height is often an indicator of biomass, meaning accurate measurements could help with more precise carbon sequestration data and climate science models. This information could also guide forest management by identifying areas in need of conservation, restoration, and reforestation.

There have been many studies using AI-powered remote sensing models for forest monitoring. However, these typically work regionally and pose a compute challenge due to vast amounts of data. Models have also been unsuccessful in measuring heights over 30 meters, leading to an underestimation of tall canopies.  

A handful of current studies are deploying satellites for capturing and measuring vegetation from space. One such mission, NASA’s Global Ecosystem Dynamics Investigation (GEDI), aims to monitor the structure of forests worldwide using a space-borne laser scanner. However, it captures only sparse samples that cover less than 4% of the global landmass.

Other global remote sensing missions offer complete coverage. The Copernicus Sentinel-2 satellites capture images at a resolution of 10×10 meters per pixel, and the entire globe is captured every 5 days. However, they only see a bird’s-eye view of the vegetation and do not measure height.

The researchers developed, trained, and deployed deep learning algorithms using data from these separate remote sensing missions to create the first global vegetation height map. The team trained an ensemble of fully convolutional neural networks (CNNs) on canopy top height from the GEDI data. Using a dataset of 600 million GEDI footprints, along with the corresponding Sentinel-2 image patches, the algorithm learns to extract canopy height from spectral and textural image patterns.

A graphic showing the steps in the CNN from using GEDI data, to Sentinel-2 image layering, to the CNN processing under sparse supervision.
Figure 1. Illustration of the model training process with sparse supervision from GEDI LIDAR. The CNN takes the Sentinel-2 image and encoded geographical coordinates as an input to estimate dense canopy top height and its uncertainty (variance).
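
The sparse supervision in Figure 1 can be illustrated with a masked loss: the network predicts a height and a variance for every pixel, but only the few pixels that coincide with a GEDI footprint contribute to training. The sketch below assumes a Gaussian negative log-likelihood as the per-pixel loss, which is a common choice when a model outputs both a mean and a variance; the article does not state the exact loss, and the function name and toy values are hypothetical.

```python
import math

def masked_gaussian_nll(pred_mean, pred_var, target, has_label):
    """Average Gaussian negative log-likelihood over labeled pixels only.

    All arguments are flat lists of equal length (one entry per pixel).
    The constant 0.5 * log(2 * pi) is omitted because it does not
    affect optimization.
    """
    total = 0.0
    count = 0
    for mu, var, y, labeled in zip(pred_mean, pred_var, target, has_label):
        if not labeled:
            continue  # pixels without a reference height are ignored
        total += 0.5 * (math.log(var) + (y - mu) ** 2 / var)
        count += 1
    # With no labeled pixel in the patch there is nothing to learn from.
    return total / count if count else 0.0

# Toy patch of six pixels (heights in meters); only two have a reference label.
pred_mean = [12.0, 30.5, 8.0, 41.0, 19.5, 25.0]
pred_var = [4.0, 9.0, 1.0, 16.0, 4.0, 9.0]
target = [0.0, 33.5, 0.0, 0.0, 19.5, 0.0]  # values at unlabeled pixels are unused
has_label = [False, True, False, False, True, False]

loss = masked_gaussian_nll(pred_mean, pred_var, target, has_label)
print(f"loss over labeled pixels: {loss:.4f}")
```

Under this loss, a large predicted variance reduces the penalty for a large height error but is itself penalized by the log-variance term, so the model is pushed to report high uncertainty only where its height estimate is unreliable.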

When the models are up and running, they automatically process over 250,000 images and estimate canopy height for the map at a 10-meter ground-sampling distance. It takes 10 days to cover the globe using a high-performance cluster equipped with NVIDIA RTX 2080 GPUs. According to Schindler, ETH Zurich’s high-performance computing system, Euler, contains a variety of GPUs and runs over 1,500 graphics cards.

By modeling the uncertainty in the data and using an ensemble of five separately trained CNNs, the models have a level of transparency not often seen in deep learning algorithms. Uncertainty is quantified for every single pixel estimate, which could give researchers and forest managers confidence when making decisions based on the information.
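
As a rough illustration of how per-pixel uncertainty can come out of an ensemble, the sketch below combines the outputs of five models for a single pixel. It assumes the standard deep-ensemble rule, in which the ensemble is treated as a uniform mixture of Gaussians: the combined variance is the average predicted variance plus the spread of the predicted means. The article does not confirm that the study uses exactly this rule, and the function name and numbers are hypothetical.

```python
def combine_ensemble(means, variances):
    """Combine per-model (mean, variance) predictions for one pixel.

    Returns the ensemble mean and the total variance, where
    total variance = average predicted variance (uncertainty in the data)
                   + variance of the predicted means (model disagreement).
    """
    n = len(means)
    ensemble_mean = sum(means) / n
    data_var = sum(variances) / n
    model_var = sum((m - ensemble_mean) ** 2 for m in means) / n
    return ensemble_mean, data_var + model_var

# Five hypothetical CNN outputs for one pixel: height (m) and variance (m^2).
means = [20.0, 22.0, 24.0, 26.0, 28.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]

height, total_var = combine_ensemble(means, variances)
print(height, total_var)  # 24.0 12.0
```

In this toy case each model reports a variance of 4 m², but the models disagree with each other by a further 8 m², so the combined estimate carries more uncertainty than any single model would report on its own.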

The researchers found that protected areas, such as the Oregon Coast Range and the Ulu Temburong National Park in Borneo, often contained taller vegetation. About 34% of canopies taller than 30 meters grow in protected areas.

A canopy height map showing areas in Oregon and Borneo with dense top heights reaching up to 50 meters.
Figure 2. A dense canopy height map reveals the spatial patterns of protected areas (left) Devil’s Staircase Wilderness, Oregon, and (right) Ulu Temburong National Park, Borneo.

According to the study, the model can be deployed annually to track canopy height change over time. The researchers also point out that the maps could be used to evaluate regions where wildfires have occurred, giving a more accurate map of damage.

“We hope that this work will advance future research in climate, carbon, and biodiversity modeling. We also hope that our freely available map can support the work of conservationists in practice. In the future, we would like to expand our approach to mapping biomass as well as temporal changes on a global scale,” said lead author Nico Lang, a PhD student in the EcoVision Lab, part of the Photogrammetry and Remote Sensing group at ETH Zurich.

EcoVision also plans to make the code available soon. The lab, founded by ETH Zurich Professor Konrad Schindler and University of Zurich Professor Jan Dirk Wegner in 2017, is dedicated to developing machine learning algorithms for large-scale environmental data analysis. For more information, refer to their project page, A high-resolution canopy height model of the Earth.
