Categories
Misc

Sparse Forests with FIL


This post was originally published on the RAPIDS AI Blog.

Introduction

The RAPIDS Forest Inference Library, affectionately known as FIL, dramatically accelerates inference (prediction) for tree-based models, including gradient-boosted decision tree models (like those from XGBoost and LightGBM) and random forests. (For a deeper dive into the library overall, check out the original FIL blog.) Models in the original FIL are stored as dense binary trees. That is, the storage of the tree assumes that all leaf nodes occur at the same depth. This leads to a simple, runtime-efficient layout for shallow trees. But for deep trees, it also requires a lot of GPU memory: 2^(d+1) − 1 nodes for a tree of depth d. To support even the deepest forests, FIL supports sparse tree storage. If a branch of a sparse tree ends earlier than the maximum depth d, no storage will be allocated for potential children of that branch. This can deliver significant memory savings. While a dense tree of depth 30 will always require over 2 billion nodes, the skinniest possible sparse tree of depth 30 would require only 61 nodes.
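As a quick check of those node counts (a small illustrative computation, not FIL code):

d = 30
dense_nodes = 2 ** (d + 1) - 1  # 2,147,483,647 nodes: every level is fully allocated
sparse_min = 2 * d + 1          # 61 nodes: a spine of 30 splits, each with one leaf sibling, plus the final leaf
print(dense_nodes, sparse_min)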

Using Sparse Forests with FIL

Using sparse forests in FIL is no harder than using dense forests. The type of forest created is controlled by the new storage_type parameter to ForestInference.load(). Its possible values are:

  • DENSE to create a dense forest,
  • SPARSE to create a sparse forest,
  • AUTO (default) to let FIL decide, which currently always creates a dense forest.

There is no need to change the format of the input file, input data or prediction output. The initial model could be trained by scikit-learn, cuML, XGBoost, or LightGBM. Below is an example of using FIL with sparse forests.

from cuml import ForestInference
import sklearn.datasets
# Load the classifier previously saved with xgboost save_model()
model_path = 'xgb.model'
fm = ForestInference.load(model_path, output_class=True,
                          storage_type='SPARSE')
# Generate random sample data
X_test, y_test = sklearn.datasets.make_classification()
# Generate predictions (as a gpu array)
fil_preds_gpu = fm.predict(X_test.astype('float32'))

Implementation

Figure 1: Storing sparse forests in FIL.

Figure 1 depicts how sparse forests are stored in FIL. All nodes are stored in a single large nodes array. For each tree, the index of its root in the nodes array is stored in the trees array. Each sparse node, in addition to the information stored in a dense node, stores the index of its left child. As each node always has two children, left and right nodes are stored adjacently. Therefore, the index of the right child can always be obtained by adding 1 to the index of the left child. Internally, FIL continues to support dense as well as sparse nodes, with both approaches deriving from a base forest class.
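As an illustration of this layout, here is a minimal sketch in Python; the class and field names are hypothetical stand-ins, not FIL's actual internals:

from dataclasses import dataclass
from typing import List

@dataclass
class SparseNode:
    feature: int      # feature tested at this split (ignored for leaves)
    threshold: float  # split threshold (ignored for leaves)
    left: int         # index of the left child in the nodes array
    is_leaf: bool
    value: float      # prediction stored at leaves

def predict_tree(nodes: List[SparseNode], root: int, x) -> float:
    # Walk one sparse tree from its root index down to a leaf.
    i = root
    while not nodes[i].is_leaf:
        n = nodes[i]
        # Children are stored adjacently, so the right child is left + 1.
        i = n.left if x[n.feature] <= n.threshold else n.left + 1
    return nodes[i].value

def predict_forest(nodes: List[SparseNode], trees: List[int], x) -> float:
    # trees holds each tree's root index into the shared nodes array.
    return sum(predict_tree(nodes, t, x) for t in trees) / len(trees)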

Compared to the internal changes, the changes to the Python API have been kept to a minimum. The new storage_type parameter specifies whether to create a dense or sparse forest. Additionally, a new value, 'AUTO', has been made the default for the inference algorithm parameter; it allows FIL to choose the inference algorithm itself. For sparse forests, it currently uses the 'NAIVE' algorithm, which is the only one supported. For dense forests, it uses the 'BATCH_TREE_REORG' algorithm.
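For example, combining the two parameters (a sketch based on the loader shown earlier):

fm = ForestInference.load(model_path, output_class=True,
                          storage_type='SPARSE', algo='AUTO')
# With a sparse forest, 'AUTO' currently resolves to 'NAIVE';
# with a dense forest, it resolves to 'BATCH_TREE_REORG'.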

Benchmarks

To benchmark the sparse trees, we train a random forest using scikit-learn, specifically sklearn.ensemble.RandomForestClassifier. We then convert the resulting model into a FIL forest and benchmark the performance of inference. The data is generated using sklearn.datasets.make_classification() and contains 2 million rows, split equally between the training and validation datasets, and 32 columns. For benchmarking, inference is performed on 1 million rows.
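A minimal sketch of that setup, assuming cuML's ForestInference.load_from_sklearn converter and illustrative parameters:

import time
import sklearn.datasets
from sklearn.ensemble import RandomForestClassifier
from cuml import ForestInference

# 2 million rows, 32 columns; half for training, 1 million rows for inference
X, y = sklearn.datasets.make_classification(n_samples=2_000_000, n_features=32)
X = X.astype('float32')
X_train, X_test = X[:1_000_000], X[1_000_000:]
y_train = y[:1_000_000]

skl_model = RandomForestClassifier(n_estimators=100, max_depth=10,
                                   max_leaf_nodes=2048, n_jobs=-1)
skl_model.fit(X_train, y_train)

# Convert to a sparse FIL forest and time GPU inference
fm = ForestInference.load_from_sklearn(skl_model, output_class=True,
                                       storage_type='SPARSE')
start = time.perf_counter()
fil_preds = fm.predict(X_test)
print('FIL inference time:', time.perf_counter() - start, 's')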

We use two sets of parameters for benchmarking.

  • With a depth limit of either 10 or 20; in this case, either a dense or a sparse FIL forest fits into GPU memory.
  • Without a depth limit; in this case, the model trained by SKLearn contains very deep trees. In our benchmark runs, the trees usually have a depth between 30 and 50. Trying to create a dense FIL forest runs out of memory, but a sparse forest can be created smoothly.

In both cases, the size of the forest itself remains relatively small, as the number of leaf nodes per tree is limited to 2048 and the forest consists of 100 trees. We measure the time of CPU inference and GPU inference. The GPU inference was performed on a V100, and the CPU inference on a dual-socket system with 16 cores per socket and 2-way hyperthreading. The benchmark results are presented in Figure 2.

Figure 2: Benchmark results for FIL (dense and sparse trees) and SKLearn.

Both sparse and dense FIL predictors (where the latter is available) are about 34–60x faster than the SKLearn CPU predictor. The sparse FIL predictor is slower than the dense one for shallow forests, but can be faster for deeper forests; the exact performance difference varies. For instance, in Figure 2 with max_depth=10, the dense predictor is about 1.14x faster than the sparse predictor, but with max_depth=20 it is slower, achieving only 0.75x the speed of the sparse predictor. Therefore, the dense FIL predictor should be used for shallow forests.

For deep forests, however, the dense predictor runs out of memory, as its space requirements grow exponentially with the forest depth. The sparse predictor does not have this problem and provides fast inference on the GPU even for very deep trees.

Conclusion

With sparse forest support, FIL applies to a wider range of problems. Whether you’re building gradient-boosted decision trees with XGBoost or random forests with cuML or scikit-learn, FIL should be an easy drop-in option to accelerate your inference. As always, if you encounter any issues, feel free to file issues on GitHub or ask questions in our public Slack channel!

Categories
Misc

Epochs not running and GPU memory usage disappearing on cnn model.

I’m currently a student playing around with basic Deep Learning and tensorflow for a project.

I’ve installed and am running tensorflow on my RTX 3070, and use jupyter notebooks on anaconda for my code.

I’m currently playing around with an American Sign Language dataset (one made up of 28×28 grayscale images of various letters in ASL).

I’ve gotten simple models like:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(units=512, activation='relu', input_shape=(784,)))
model.add(Dense(units=512, activation='relu'))
model.add(Dense(units=num_classes, activation='softmax'))

working to great effect on my GPU, but if I try a convolutional neural network on the same dataset, like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Dense,
    Conv2D,
    MaxPool2D,
    Flatten,
    Dropout,
    BatchNormalization,
)

model = Sequential()
model.add(Conv2D(75, (3, 3), strides=1, padding="same", activation="relu", input_shape=(28, 28, 1)))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding="same"))
model.add(Conv2D(50, (3, 3), strides=1, padding="same", activation="relu"))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding="same"))
model.add(Conv2D(25, (3, 3), strides=1, padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding="same"))
model.add(Flatten())
model.add(Dense(units=512, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(units=num_classes, activation="softmax"))

and then I compile using:

model.compile(loss="categorical_crossentropy", metrics=["accuracy"]) 

and train using:

model.fit(x_train, y_train, epochs=20, verbose=1, validation_data=(x_valid, y_valid)) 

But if I run the above code, all I get is:

Epoch 1/20 

as my output. And while I see that the majority of my GPU memory is being used when I define the model (specifically 7.6/8GB), when I try training it, all of the memory just instantly disappears, as if there never was a model.

Can anyone tell me what is wrong here?

submitted by /u/the_mashrur

Categories
Misc

How Diversity Drives Innovation: Catch Up on Inclusion in AI with NVIDIA On-Demand

NVIDIA’s GPU Technology Conference is a hotbed for sharing groundbreaking innovations — making it the perfect forum for developers, students and professionals from underrepresented communities to discuss the challenges and opportunities surrounding AI. Last month’s GTC virtually brought together tens of thousands of attendees from around the world, with more than 20,000 developers from emerging …


Categories
Misc

Realistic Lighting in Justice with Mesh Shading


NetEase Thunder Fire Games Uses Mesh Shading To Create Beautiful Game Environments for Justice

In December, we interviewed Haiyong Qian, NetEase Game Engine Development Researcher and Manager of NetEase Thunder Fire Games Technical Center, to see what he’s learned as the Justice team added NVIDIA ray-tracing solutions to their development pipeline. Their results were nothing short of stunning, with local 8K DLSS support and global illumination through RTXGI.

Recently, NetEase introduced Mesh Shader support to Justice. Not only are the updated environments breathtaking, the game also supports 1.8 billion triangles running at over 60 FPS in 4K on an NVIDIA GeForce RTX 3060 Ti.

To learn more about the implementation and results, we sat down with Yuheng Zou, game engine developer at NetEase. His work focuses on the rendering engine in Justice, specifically GPU features enabled by DirectX 12.

Q: What are you trying to achieve by adding mesh shading to Justice?

Our first thought was to render some highly detailed models, which may need an insane number of triangles. Soon we found we could combine mesh shaders with auto-generated LODs so that rendering complexity depends almost only on resolution, rather than on polygon count. And we decided to try it out. With so much potential, we believe mesh shaders will become mainstream in future games.

Q: What is not currently working with regular compute / draw indirect / traditional methods?

A simple draw call just doesn’t work for this. It lacks the ability to process the mesh at a granularity coarser than individual triangles, such as meshlet culling.

Compute or draw indirect may be fine, but we would need to make huge changes to the rendering pipeline. The underlying idea of the algorithm is to do culling first, then draw the effective parts of the mesh. While this can be achieved by culling and compacting with a compute shader and then drawing indirect, the data exchange between the two steps can sometimes be fatal to GPU performance in a high-poly rendering context. Mesh shaders solve this problem at the hardware level.

Q: How do Mesh Shaders solve this?

Mesh shaders extend the scalability of the geometry stage and are very easy to integrate into an engine runtime. They can encapsulate the culling procedure in a single API call, which omits the tedious state and resource setup that draw indirect requires. With mesh shaders, the culling algorithms we use can be very flexible. For example, in the shadow pass we don’t have depth information, so occlusion culling is simply skipped in the shader.

Q: Is the end result of adding Mesh Shaders something your players will quickly notice, or is the effect more subtle?

Our technology customizes a high-poly mesh pipeline, including production, processing, serialization, streaming and rendering, aiming to provide our players with a refreshed experience with such high-fidelity content. And it works: soon after “Wan Fo Ku” was released, our players noticed that the models were much more elaborate than the traditional ones, and posted many close-up screenshots on the forums. While adding mesh shaders to the customized high-poly scenes does boost rendering effectiveness, how to optimize our traditional scenes remains subtle and needs more engineering effort.

Q: What kinds of environments benefited most from the technology?

Our technology enables rendering the parallax and silhouettes of models at incredible fidelity. For scenes like caves, these details produce a visually better image. It also gives ancient Chinese buildings, furniture and ornaments a “meticulous” rendering result, which lets the culture they carry be expressed in Justice to the finest extent.

Q: What is the one thing you wish you knew before you added mesh shading? What would you do differently based on your learnings?

With such detailed models, large texture resolutions are a must. We will pay more attention to texture loading and streaming. This also places further requirements on our mesh and texture compression algorithms.

Q: Any other tips for developers looking to work with Mesh Shaders for the first time?

Mesh shaders can boost the geometry stage drastically. However, careful profiling and optimization are required. We highly recommend Nsight for debugging and profiling mesh shaders.

Doing GPU culling with mesh shaders will sometimes require a change to the mesh representation data format. A clever data format design will make trial and error much faster.

Since 2016, Justice has been closely engaged with NVIDIA in China on video game graphics technology innovation. As one of the best PC MMO games in the Chinese market, Justice has attracted thirty million players over the past three years with its excellent technology and beautiful graphics.

Learn more about NVIDIA’s technical resources for game developers here.

Categories
Misc

How to get bias?


Is there a way to get the bias matrix, like the weights matrix?

Or do you need to activate some function in the layers to get it out?
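For context, in tf.keras a layer’s bias is returned alongside its kernel by get_weights(), and is also exposed as a variable attribute; a minimal sketch, assuming a Dense layer:

import tensorflow as tf

layer = tf.keras.layers.Dense(4)
layer.build(input_shape=(None, 8))  # create the kernel and bias variables

kernel, bias = layer.get_weights()  # get_weights() returns [kernel, bias]
print(kernel.shape, bias.shape)     # (8, 4) (4,)

# Equivalently, as trackable variables:
print(layer.kernel, layer.bias)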

https://preview.redd.it/f27x2fmj7g071.png?width=663&format=png&auto=webp&s=fb31dc6a757689939be40edd1c5333a5270a5322

submitted by /u/Filippo9559

Categories
Misc

What is the correct way to zip multiple inputs into a tf.data.Dataset?

Hi,

I have a multi-input model with a single target variable.

After a lot of trial and error (and countless hours on SO) I’ve come this far:

import tensorflow as tf

# Reads in filepaths to images from dataframe `train`
images = tf.data.Dataset.from_tensor_slices(train.image.to_numpy())

# Converts labels to a one-hot encoding vector
target = tf.keras.utils.to_categorical(
    train.Label.to_numpy(), num_classes=n_class, dtype='float32'
)

# Reads in the image and resizes it
images = images.map(transform_img)

# Zips the former according to
# https://stackoverflow.com/questions/65295459/how-to-use-tensorflow-dataset-correctly-for-multiple-input-layers-with-keras
input_1 = tf.data.Dataset.zip((images, target))
dataset = tf.data.Dataset.zip((input_1, target))

and I think this is almost the solution, but the second input’s shape gets distorted.

I get a warning that the expected input is (None, n_class),

but it received an input of (n_class, 1).

And an error:

ValueError: Shapes (n_class, 1) and (n_class, n_class) are incompatible

I checked, though; the shape from to_categorical is correct: (num_examples, n_class).
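For reference, a two-input tf.keras model typically expects a dataset of ((input_a, input_b), label) elements, batched so labels come out with shape (None, n_class); a minimal self-contained sketch with made-up data:

import numpy as np
import tensorflow as tf

n_class, n_examples = 4, 10
imgs = np.random.rand(n_examples, 32, 32, 3).astype('float32')
anchors = np.random.rand(n_examples, 32, 32, 3).astype('float32')
labels = tf.keras.utils.to_categorical(
    np.random.randint(n_class, size=n_examples), num_classes=n_class)

# Zip the two input streams together first, then pair them with the target:
inputs = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(imgs),
    tf.data.Dataset.from_tensor_slices(anchors),
))
targets = tf.data.Dataset.from_tensor_slices(labels)
dataset = tf.data.Dataset.zip((inputs, targets)).batch(2)
# Each element: ((imgs_batch, anchors_batch), labels_batch) with labels of shape (None, n_class)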

Could someone help an utterly confused me out?

Thanks a lot!

submitted by /u/obolli

Categories
Misc

NVIDIA Announces Four-for-One Stock Split, Pending Stockholder Approval at Annual Meeting Set for June 3

NVIDIA today announced that its board of directors declared a four-for-one split of NVIDIA’s common stock in the form of a stock dividend …

Categories
Misc

Fighting Fire with Insights: CAPE Analytics Uses Computer Vision to Put Geospatial Data and Risk Information in Hands of Property Insurance Companies

Every day, vast amounts of geospatial imagery are being collected, and yet, until recently, one of the biggest potential users of that trove — property insurers — had made surprisingly little use of it. Now, CAPE Analytics, a computer vision startup and NVIDIA Inception member, seeks to turn that trove of geospatial imagery into better …


Categories
Misc

From the Omniverse Experiment Archives: NVIDIA Omniverse RTX Racing Demo Showcases Powerful Rendering, Realistic Simulation


A team of NVIDIA artists released never-before-seen imagery and behind-the-scenes videos from the Omniverse RTX Racer playable sample project. The clips and imagery are the result of 3 weeks of progress, showcasing the Omniverse platform’s power in multi-GPU rendering, dynamic lighting, and real-time rendering.

The racing demo consists of a loop level, where a player controls a dune buggy vehicle, with the ability to steer, accelerate, decelerate, and brake.

With no difference between editing and “play” mode within NVIDIA Omniverse, users can move the camera and control the vehicle, or change the lighting, simultaneously, in real time.

Creative and Collaborative Workflows in Omniverse

The photorealistic scenes, characters, and props were created using a variety of creative applications such as Pixologic ZBrush, Autodesk 3ds Max, Autodesk Maya and Blender, textured in Substance Painter, and rendered with Omniverse RTX Renderer. Using Omniverse Create, two artists worked collaboratively on the rocky, desert environment, leveraging the platform’s live sync abilities to adjust environment art, layout and lighting simultaneously. The Omniverse platform’s simulation suite, including NVIDIA PhysX 5, delivered physically accurate simulation, providing players with a true-to-reality experience as they raced their buggy across the rugged terrain.

The cheeky alien main character was sculpted entirely in ZBrush, refined in Autodesk Maya and textured in Substance Painter.

Alien driver. Substance Painter. Rendered in NVIDIA Iray.

The dune buggy, a custom vehicle designed to race around desert dunes, was modeled in Autodesk Fusion 360 and textured in Substance Painter. The vehicle is physically based, and went through various iterations in size and dimensions to find the optimal racing balance in the physically accurate environment.

Dune buggy. Autodesk Fusion 360 viewport.

One Renderer, Dual Personalities

Omniverse RTX Renderer works in two different modes: real-time ray-traced mode and referenced path-traced mode. The demo is able to run in both modes without a significant difference in graphics quality — the materials and visual output are very similar.

At a high level, both modes implement the same rendering algorithm for path tracing, but with different optimization and approximations choices. The real-time ray-traced mode is typically used in real-time simulation applications such as games, while the reference path-traced mode is used for projects requiring ultra-high fidelity, final-frame quality results as it accumulates many samples over many frames, often requiring multiple GPUs.

Leveraging the power of four NVIDIA RTX 8000 GPUs, the NVIDIA team experimented with referenced path-traced mode, ultimately achieving offline film quality rendering at near real-time speeds.

Bringing it to Life with Omniverse Physics

The Omniverse RTX Racer project also showcases NVIDIA Omniverse’s advanced physics simulation suite including PhysX 5, exclusively available to the public as part of the platform, NVIDIA Flow, and Blast.

PhysX provides the powerful vehicle authoring technology that made the dune buggy come to life. The same vehicle authoring technology is used as part of the new NVIDIA DRIVE Sim platform for training and validating self-driving cars, now built entirely on Omniverse. With accurate steering, suspension, wheel collision, and tire friction behavior, users can easily create multiple instances of different styles of vehicles.

The team of artists and developers used this project to experiment with Flow, a sparse grid-based fluids simulation package for real-time applications that was recently integrated into the Omniverse platform. Flow was used to simulate the buggy’s blazing rocket engine and cloudy dust trail. This effect is written in Slang, the NVIDIA cross-platform shading language, and simulated entirely on NVIDIA GPUs.

The Omniverse RTX Racer sample is a living project with future iterations to come. Learn more about NVIDIA Omniverse and download the open beta today.

Dive further into Omniverse RTX Renderer, Omniverse Physics, and Collaborative Content Creation for Game Development in our exclusive, free-to-access GTC On-Demand sessions.

Categories
Misc

Getting Nan loss after some time when training custom AnoVAEGAN model.

I am training a custom AnoVAEGAN model using mixed precision and mirror strategy.

I have used Adam (with epsilon 1e-04), lr = 0.0001, MSE and BinaryCrossEntropy loss for the generator, and BinaryCrossEntropy for the discriminator.

For some reason, after some time, the MSE loss becomes NaN, which makes the overall generator loss NaN as well.
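A common mitigation when a loss goes NaN under mixed precision is dynamic loss scaling, so underflowed float16 gradients don’t poison the update; a minimal sketch with tf.keras (illustrative settings, not the original training code):

import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

# Wrap the optimizer: gradients are scaled up before the float16 backward
# pass and unscaled again before the update (steps with NaN/Inf gradients are skipped).
opt = tf.keras.optimizers.Adam(learning_rate=1e-4, epsilon=1e-4)
opt = mixed_precision.LossScaleOptimizer(opt)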

Any advice from anyone who has also faced a similar issue, please help me out here.

Github repo: https://github.com/dhruvampanchal/AnoVAEGAN

Thanks!

Check the comments for the NaN loss output.

submitted by /u/dhruvampanchal