It doesn’t make sense: the shadow of the corner T looks like it has 5 cube lengths, while the physical figure’s T has only 1 cube length on the top right.
submitted by /u/Silver4R4449
Language models are essential for modern NLP. Building a new language model from scratch can be beneficial for many domains. NVIDIA Inception member deepset bridges the gap between NLP research and industry – their core product, Haystack, is an open-source framework that enables developers to utilize the latest NLP models for semantic search and question answering at scale. Haystack Hub is their software-as-a-service (SaaS) platform, used by developers from various industries, including finance, legal, and automotive, to find answers in all kinds of text documents.
In a collaborative effort with NVIDIA and AWS, deepset used NVIDIA V100 GPUs to train their language model. The GPU performance profiles were captured with NVIDIA Nsight Systems.
The collaboration was a product of the partnership between NVIDIA Inception and AWS Activate, an initiative to support AI startups by providing access to the benefits of both acceleration programs. The benefits for NVIDIA Inception startups joining AWS Activate include business and marketing support, as well as AWS Cloud credits, which can be used to access NVIDIA’s latest generation GPUs in Amazon EC2 P3 instances. AWS Activate members that are using AI and machine learning are referred to NVIDIA Inception and can benefit from immediate preferred pricing on NVIDIA GPUs and Deep Learning Institute credits.
“A considerable amount of manual development is required to create the training data and vocabulary, configure hyperparameters, start and monitor training jobs, and run periodical evaluation of different model checkpoints. In our first training runs, we also found several bugs only after multiple hours of training, resulting in a slow development cycle. In summary, language model training can be a painful job for a developer and easily consumes multiple days of work.”
“The increased efficiency of training jobs reduces our energy usage and lowers our carbon footprint. By tackling different areas of FARM’s training pipeline, we were able to significantly optimize resource utilization. In the end, we achieved a 3.9x speedup in training time, a 12.8x reduction in training cost, and a reduction in the required developer effort from days to hours.”
Collaborating with NVIDIA and AWS, NVIDIA Inception partner deepset achieves a 3.9x speedup and a 12.8x cost reduction for training NLP models. As a result, the developer effort was significantly reduced.
Read more about technologies used in the training and their impact on improving BERT training performance.
3D computer animation is a time-consuming and highly technical medium — to complete even a single animated scene requires numerous steps, like modeling, rigging and animating, each of which is itself a sub-discipline that can take years to master. Because of its complexity, 3D animation is generally practiced by teams of skilled specialists and is inaccessible to almost everyone else, despite decades of advances in technology and tools. With the recent development of tools that facilitate game character creation and game balance, a natural question arises: is it possible to democratize the 3D animation process so it’s accessible to everyone?
To explore this concept, we start with the observation that most forms of artistic expression have a casual mode: a classical guitarist might jam without any written music, a trained actor could ad-lib a line or two while rehearsing, and an oil painter can jot down a quick gesture drawing. What these casual modes have in common is that they allow an artist to express a complete thought quickly and intuitively without fear of making a mistake. This turns out to be essential to the creative process — when each sketch is nearly effortless, it is possible to iteratively explore the space of possibilities far more effectively.
In this post, we describe Monster Mash, an open source tool presented at SIGGRAPH Asia 2020 that allows experts and amateurs alike to create rich, expressive, deformable 3D models from scratch — and to animate them — all in a casual mode, without ever having to leave the 2D plane. With Monster Mash, the user sketches out a character, and the software automatically converts it to a soft, deformable 3D model that the user can immediately animate by grabbing parts of it and moving them around in real time. There is also an online demo, where you can try it out for yourself.
Creating a walk cycle using Monster Mash. Step 1: Draw a character. Step 2: Animate it.
Creating a 2D Sketch
The insight that makes this casual sketching approach possible is that many 3D models, particularly those of organic forms, can be described by an ordered set of overlapping 2D regions. This abstraction makes the complex task of 3D modeling much easier: the user creates 2D regions by drawing their outlines, then the algorithm creates a 3D model by stitching the regions together and inflating them. The result is a simple and intuitive user interface for sketching 3D figures.
For example, suppose the user wants to create a 3D model of an elephant. The first step is to draw the body as a closed stroke (a). Then the user adds strokes to depict other body parts such as legs (b). Drawing those additional strokes as open curves provides a hint to the system that they are meant to be smoothly connected with the regions they overlap. The user can also specify that some new parts should go behind the existing ones by drawing them with the right mouse button (c), and mark other parts as symmetrical by double-clicking on them (d). The result is an ordered list of 2D regions.
Steps in creating a 2D sketch of an elephant.
Stitching and Inflation
To understand how a 3D model is created from these 2D regions, let’s look more closely at one part of the elephant. First, the system identifies where the leg must be connected to the body (a) by finding the segment (red) that completes the open curve. The system cuts the body’s front surface along that segment, and then stitches the front of the leg together with the body (b). It then inflates the model into 3D by solving a modified form of Poisson’s equation to produce a surface with a rounded cross-section (c). The resulting model (d) is smooth and well-shaped, but because all of the 3D parts are rooted in the drawing plane, they may intersect each other, resulting in a somewhat odd-looking “elephant”. These intersections will be resolved by the deformation system.
Illustration of the details of the stitching and inflation process. The schematic illustrations (b, c) are cross-sections viewed from the elephant’s front.
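For intuition, here is a toy sketch of that inflation step (our own simplification, not the Monster Mash implementation): it solves a discrete Poisson equation with a constant source term on a disk-shaped region, with zero height on the boundary, using plain Jacobi iterations. Taking the square root of the solution yields the rounded cross-section described above.

```python
import numpy as np

# Toy inflation: solve lap(h) = -c inside a disk, with h = 0 outside.
# Monster Mash solves a modified Poisson problem; the constant source
# term, the disk region, and the Jacobi solver are simplifications.
n, c = 64, 4.0
ys, xs = np.mgrid[0:n, 0:n]
inside = (xs - n / 2) ** 2 + (ys - n / 2) ** 2 < (n / 2 - 2) ** 2

h = np.zeros((n, n))
for _ in range(4000):  # Jacobi iterations
    avg = 0.25 * (np.roll(h, 1, 0) + np.roll(h, -1, 0) +
                  np.roll(h, 1, 1) + np.roll(h, -1, 1))
    h = np.where(inside, avg + 0.25 * c, 0.0)  # enforce h = 0 outside

height = np.sqrt(h)  # square root gives a rounded, dome-like profile
print(height.max())  # peak height of the inflated disk
```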
Layered Deformation
At this point we just have a static model — we need to give the user an easy way to pose the model, and also separate the intersecting parts somehow. Monster Mash’s layered deformation system, based on the well-known smooth deformation method as-rigid-as-possible (ARAP), solves both of these problems at once. What’s novel about our layered “ARAP-L” approach is that it combines deformation and other constraints into a single optimization framework, allowing these processes to run in parallel at interactive speed, so that the user can manipulate the model in real time.
The framework incorporates a set of layering and equality constraints, which move body parts along the z axis to prevent them from visibly intersecting each other. These constraints are applied only at the silhouettes of overlapping parts, and are dynamically updated each frame.
Meanwhile, in a separate thread of the framework, we satisfy point constraints to make the model follow user-defined control points (described in the section below) in the xy-plane. This ARAP-L method allows us to combine modeling, rigging, deformation, and animation all into a single process that is much more approachable to the non-specialist user.
The model deforms to match the point constraints (red dots) while the layering constraints prevent the parts from visibly intersecting.
Animation
To pose the model, the user can create control points anywhere on the model’s surface and move them. The deformation system converges over multiple frames, which gives the model’s movement a soft and floppy quality, allowing the user to intuitively grasp its dynamic properties — an essential prerequisite for kinesthetic learning.
Because the effect of deformations converges over multiple frames, our system lends 3D models a soft and dynamic quality.
To create animation, the system records the user’s movements in real time. The user can animate one control point, then play back that movement while recording additional control points. In this way, the user can build up a complex action like a walk by layering animation, one body part at a time. At every stage of the animation process, the only task required of the user is to move points around in 2D, a low-risk workflow meant to encourage experimentation and play.
Conclusion
We believe this new way of creating animation is intuitive and can thus help democratize the field of computer animation, encouraging novices who would normally be unable to try it on their own as well as experts who often require fast iteration under tight deadlines. Here you can see a few of the animated characters that have been created using Monster Mash. Most of these were created in a matter of minutes.
A selection of animated characters created using Monster Mash. The original hand-drawn outline used to create each 3D model is visible as an inset above each character.
All of the code for Monster Mash is available as open source, and you can watch our presentation and read our paper from SIGGRAPH Asia 2020 to learn more. We hope this software will make creating 3D animations more broadly accessible. Try out the online demo and see for yourself!
Acknowledgements
Monster Mash is the result of a collaboration between Google Research, Czech Technical University in Prague, ETH Zürich, and the University of Washington. Key contributors include Marek Dvorožňák, Daniel Sýkora, Cassidy Curtis, Brian Curless, Olga Sorkine-Hornung, and David Salesin. We are also grateful to Hélène Leroux, Neth Nom, David Murphy, Samuel Leather, Pavla Sýkorová, and Jakub Javora for participating in the early interactive sessions.
This year, Microsoft’s free Game Stack Live event (April 20-21), starting at 8am PDT, will offer a wide range of can’t-miss sessions for game developers, in categories that include Graphics, System & Tools, Production & Publishing, Accessibility & Inclusion, Audio, Multiplayer, and Community Connections.
NVIDIA will be participating with two talks:
Introduction to Real-Time Ray Tracing with Minecraft
This talk is aimed at graphics engineers who have little or no experience with ray tracing. It serves as a gentle introduction to many topics, including “What is ray tracing?”, “How many rays do you need to make an image?”, “The importance of [importance] sampling (and more importantly, what is importance sampling?)”, “Denoising”, and “The problem with small bright things”. Along the way, you will learn about specific implementation details from Minecraft.
RTXDI: Details on Achieving Real-time Performance
RTXDI offers realistic lighting of dynamic scenes that require computing shadows from millions of area lights. Until now, this has not been possible in video games. Traditionally, game developers have baked most lighting and supported a small number of “hero” lights that are computed at runtime. This talk gives an overview of RTXDI and offers a deep dive into previously undisclosed details that enable high performance.
Register for Game Stack Live today.
We hope you’ll join us!
This post was originally published on the Mellanox blog in April 2020.
People generally assume that faster network interconnects maximize endpoint performance. In this post, I examine the key factors and considerations when choosing the right speed for your leaf-spine data center network.
To establish a common ground and terminology, Table 1 lists the five building blocks of a standard leaf-spine networking infrastructure.
| Building block | Role |
| --- | --- |
| Network interface card (NIC) | A gateway between the server (compute resource) and the network. |
| Leaf switch / top-of-rack switch | The first connection from the NIC to the rest of the network. |
| Spine switch | The “highway” junctions, responsible for east-west traffic. Its port capacity determines the number of racks that can be connected. |
| Cable | Connects the different devices in the network. |
| Optic transceiver | Allows longer connectivity distances (above a few meters) between leaf and spine switches by modulating the data into light that traverses the optical cable. |
I start by reviewing the 2020 trends for data center leaf-spine network deployments and describing the main ecosystem that lies behind it all. Figure 1 shows an overview of leaf-spine network connectivity. It is divided into two main connectivity parts, each of which takes different factors into consideration when picking the deployment rate:
- Switch-to-switch (applies also when using a level of super-spine)
- NIC-to-switch
Together these parts comprise an ecosystem, which I now analyze in depth.
Switch-to-switch speed dynamics
New leaf-spine data center deployments in 2020 revolve around four IEEE-approved speeds: 40, 100, 200, and 400GbE. There are different combinations of supported switches per speed. For example, constructing a network of 400GbE leaf-spine connectivity requires the network owner to pick switches and cables that can support those rates.
Like every other product in the world, each speed generation goes through a unique product life cycle (PLC), and each stage comes with its own attributes:
- Introduction—Product adoption is concentrated within a small group of innovators who are not afraid to take risks or to suffer through early birth pangs. In networking, these are usually the networking giants (also known as hyperscalers).
- Growth—Occurs as leaders and decision makers start adopting a new generation.
- Maturity—Characterized by the adoption of products by more conservative customers.
- Decline—A speed generation is used to connect legacy equipment.
The main questions that pop up in my mind are, “Why do generations change?” and “What drives the ambition for faster switch-to-switch connectivity?” The answer to both is surprisingly simple: $MONEY$. When you constantly optimize your production process and, at the same time, allow bigger scale (bigger ASIC switching capacity), the result is lower connectivity costs.
This price reduction does not happen at once; it takes time to reach maturity. Hyperscalers can benefit from cost reduction even when a generation is in its Introduction stage, because being big allows them to get better prices (the economy of scale offers better buying power), often much lower than the manufacturer’s suggested retail price. In some sense, you could say that hyperscalers are paving the way for the rest of the market to use new generations.
Armed with this new knowledge, here’s some analysis.
Before focusing on the present, rewind a decade, back to 2010-11, when 10GbE was approaching maturity and the industry was hyped about transitioning from 10 to 100GbE switch-to-switch speeds. At the time, the 100GbE leaf-spine ecosystem had many caveats. Among them, spine switches based on 100GbE NRZ technology did not have the right radix for scale, providing only 12 ports of 100GbE per spine switch, meaning only 12 racks could be connected in a leaf-spine cluster.
At the same time, 40GbE switch-to-switch connectivity started to gain traction even though it was slower, thanks to mature SerDes technology, a reliable ASIC process, better scale, and lower overall cost for most of the market.
Put yourself in the shoes of a decision maker who needs to deploy a new cluster in 2011: what switch-to-switch speed would you pick? Hard dilemma, right? Fortunately, since that was a decade ago, we have accumulated plenty of data about what actually happened. Take a moment to analyze Figure 3. The 10/40GbE generation is a perfect example of a PLC curve.
From 2011 until 2015, most of the industry picked 40GbE as its leaf-spine network speed. When asked in retrospect about the benefits of 40GbE, businesses typically mention improved application performance and better ROI. Only at the end of 2015, roughly four years after the advent of 40GbE, did the 100GbE leaf-spine ecosystem begin its rise and come to be seen as reliable and cost-effective. Some deployments did benefit from 100GbE earlier, since picking “the latest and the greatest” fits some use cases, even at higher prices.
Fast forward to 2020
New data center deployments enjoy a wonderful set of switch-to-switch rate options to pick from, ranging from 40GbE to 400GbE. Most current deployments use 100GbE connections, which are mature at this point. With the continuous drive to lower costs, the demand for faster network speeds isn’t easing up, as newer 200GbE and 400GbE technologies are deployed. Figure 4 presents the attributes currently associated with each switch-to-switch speed generation.
You can conclude that each generation has its own pros and cons and picking one should be based on your personal preferences. Now I explain the dynamics taking place in the data center speed ecosystem and try to answer which switch-to-switch speed generation fits you best: 100, 200, or 400GbE?
Dynamics between switch-to-switch speed and NIC-to-switch speed
As mentioned earlier, new switch-to-switch data center deployments in 2020 revolve around four IEEE-approved speeds: 40, 100, 200, and 400GbE. Each one is at a different PLC stage (Table 2).
| Switch-to-switch speed | Generation stage (2020) |
| --- | --- |
| 40GbE | Decline |
| 100GbE | Maturity |
| 200GbE | Growth |
| 400GbE | Introduction (with several years to reach growth, according to past leaps and current market trends) |
Let me share the reasons I view the market this way. To begin with, 400GbE is the current latest and greatest. No doubt, it will take a major share of deployments in the future by offering the fastest connectivity, with a projected lowest cost per GbE. However, at present, it has not yet reached the maturity required to gain the associated benefits of commoditization.
A small number of hyperscalers—known for innovation, compute-intensive applications, engineering capabilities, and most importantly, enjoying economies of scale—are deploying clusters at that speed. To mitigate technical issues with native 400GbE connections, some have shifted to 2x200GbE or pure 200GbE deployments. The reason is that with 200GbE leaf-spine connections, hyperscalers can rely on a more resilient infrastructure, leveraging both cheaper optics and a switch radix that allows for scaling a fabric.
At present, non-hyperscalers trying to move to 400GbE switch-to-switch connectivity may come to realize that the cables and transceivers are still expensive and produced in low volumes. Moreover, the 7nm ASIC process for creating high-capacity switches is not yet optimized.
At the opposite end of the curve lies 40GbE, a generation in decline. You should consider 40GbE only if you are deploying a legacy cluster, with legacy equipment that cannot work at faster speeds.
Most of the market is not caught up in the hype and doesn’t waste money on unnecessary bandwidth; it is focused on the mature 100GbE ecosystem. 100GbE exhibits textbook characteristics when it comes to cost reduction, market availability, and reliability, which means it is not going away. It is here to stay.
This is a great opportunity to mention the other part of the story: the NIC-to-switch speed. At this point, it might seem that they co-exist orthogonally, but in fact they are entwined and affect one another.
Whether your application is in the field of intense compute, storage, or AI, the NIC is the heart of it. In practice, the NIC speed determines the optimal choice of the surrounding network infrastructure, as it connects your compute and storage to the network. When deciding the switch-to-switch speed to pick, also consider what kind of traffic, generated from the compute nodes, is going to run between the switches. Different applications have different traffic patterns. Nowadays, most of the traffic in a data center is east-west traffic, from one NIC to another.
To get the best application performance, opt for a leaf switch that has an appropriate blocking factor (optimally, non-blocking) to avoid congestion, by deploying enough uplink and downlink ports.
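As a back-of-the-envelope illustration (our own, not from the original post), the blocking factor is simply the ratio of downlink to uplink capacity. A minimal sketch in Python, using a hypothetical leaf configuration:

```python
def oversubscription(downlinks: int, downlink_gbps: int,
                     uplinks: int, uplink_gbps: int) -> float:
    """Ratio of downlink to uplink capacity; 1.0 means non-blocking."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Hypothetical leaf switch: 48 x 50GbE server-facing ports and
# 12 x 200GbE uplinks -> 2400/2400 = 1.0, i.e. non-blocking.
print(oversubscription(48, 50, 12, 200))
```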
Data center deployments frequently use NICs at one of the following speeds:
- 10GbE (NRZ)
- 25GbE (NRZ)
- 50GbE (PAM-4)
- 100GbE (PAM-4)
There are also 50GbE and 100GbE NRZ NICs, but they are less common.
This is where the complete ecosystem comes together, the point where switch-to-switch and NIC-to-switch complement each other. After reviewing dozens of different data center deployments, I noticed a clear pattern in overall costs when choosing a switch-to-switch speed together with the NIC-to-switch speed of choice. The math just works that way. There is an optimal point where a specific switch-to-switch speed generation allows the NIC-to-switch speed to maximize application performance, both in terms of bandwidth utilization and ROI.
Take into consideration the application, the desired blocking factor, and the price per GbE. If your choice is based on the NIC speed, you would probably want to use the switch-to-switch speeds shown in Table 3.
| NIC port speed | Possible use case (2020) | Recommended switch-to-switch speed |
| --- | --- | --- |
| 100GbE PAM-4 | Hyperscalers, innovators | 200/400GbE |
| 50GbE PAM-4 | Hyperscalers, innovators, AI, advanced storage applications, public cloud | 200/400GbE |
| 25GbE NRZ | Enterprises, private cloud, HCI, edge | 100GbE |
| 10GbE NRZ | Legacy | 40GbE |
50/100GbE NRZ act the same as 25GbE NRZ economically.
Of course, other combinations might be better, depending on the prices you get from your vendor, but on average, this is how I view the market.
Here are some important takeaways:
- Driven by lower cost per GbE, switch-to-switch speed keeps increasing; a new generation is introduced every several years.
- When picking according to the NIC-to-switch speed, consider the projected traffic patterns and the necessary blocking factor of the leaf switch.
- Data center maturity is determined by the maturity of both switch-to-switch and NIC-to-switch speeds.
Along comes 200GbE
If you’ve made it this far, then you must have realized that 200GbE leaf-spine speed is also an option to consider.
In December 2017, the IEEE approved a standard that contains the specifications for 200 and 400GbE. As discussed earlier, a small number of hyperscalers are upgrading their deployments from 100GbE to 400GbE directly. Practically speaking, the industry has acknowledged that 200GbE can serve as an intermediate step, much like 40GbE did in the transition from 10 to 100GbE.
So, what’s in it for you?
200GbE switch-to-switch deployments enjoy a comprehensive set of benefits:
- Increased ROI by doubling the bandwidth with a ready-to-deploy ecosystem (NICs, switches, cables, and transceivers) at an economical premium over 100GbE. The cost analysis just makes sense, providing the lowest price per GbE (Figure 6).
- The next generation of switch and NIC ASICs, with an improved feature set, including enhanced telemetry.
- A reduced cable footprint: a 200GbE switch needs half the number of ports of a high-density front-panel 100GbE switch, avoiding signal-integrity problems and cable crosstalk.
- 200G is a native rate of InfiniBand (IB). Leading the IB market in supplying switches, NICs, and cables/transceivers, NVIDIA has proven this technology mature by shipping over 1M ports of 200G, reaching economy of scale and optimizing price. The NVIDIA devices supporting 200GbE (NICs, cables, and transceivers) are shared between IB and Ethernet.
In preparation for the 200/400GbE era, NVIDIA has optimized its 200GbE switch portfolio. It allows the fabric to scale the radix with better ROI than 400GbE, by using a 64×200GbE (12.8Tbps) spine and a 12×200GbE + 48×50GbE (6.4Tbps) non-blocking leaf switch.
When you consider the competition, NVIDIA offers an optimized non-blocking leaf switch (top-of-rack) for 50G PAM-4 NICs.
NVIDIA Spectrum-2 based platforms provide a capacity of 6.4Tbps, 50G PAM-4 SerDes, and a feature set that suits the virtualized data center environment.
Using a competitor’s 12.8Tbps switch as a leaf switch is just overkill for today’s deployments, because the majority of top-of-rack switches have 48 downlink ports of 50GbE. Doing the math for a non-blocking ratio, 48 × 50GbE amounts to 2.4Tbps of downlink capacity, so the switch needs 6 uplink ports of 400GbE or 12 of 200GbE, for a total of 4.8Tbps. There is no added value in paying for unused switching capacity.
By the way, NVIDIA offers a 200GbE development kit for people who want to take the SN3700 Ethernet switch for a test drive.
Summary
Deploying or upgrading a data center in 2020? Make sure to take into consideration the following:
- The market is dynamic, and past trends may assist you in predicting future ones
- Select your switch-to-switch and NIC-to-switch speed according to your requirements
- 200GbE holds massive benefits
Disagree with the need for 200GbE, or anything else in this post? Feel free to reach out to me. I would love to have a discussion with you.
How to execute tf.signal.stft?
Hi,
I am trying to get the result of tf.signal.stft, e.g.:
test_stft = tf.math.log(tf.abs(tf.signal.stft(test,frame_length=512,frame_step=128)))
I thought eager execution would give me the result, but all I get is:
tf.Tensor([], shape=(20000, 0, 257), dtype=float32)
What can I do to get TensorFlow to finally calculate the result? I have trouble understanding eager mode and graph mode. Maybe a good YouTube resource would also help.
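For reference, a minimal eager-mode sketch (with a made-up random signal, not the poster’s data): tf.signal.stft produces zero frames whenever the last axis of the input is shorter than frame_length, which is what the empty (20000, 0, 257) shape above suggests.

```python
import numpy as np
import tensorflow as tf

# Hypothetical batch of 4 one-second signals at 16 kHz; each signal
# must be at least frame_length (512) samples long, or the frame axis
# of the STFT output will have size 0.
test = tf.constant(np.random.randn(4, 16000), dtype=tf.float32)

stft = tf.signal.stft(test, frame_length=512, frame_step=128)
log_mag = tf.math.log(tf.abs(stft) + 1e-6)  # epsilon avoids log(0)
print(log_mag.shape)  # (4, 122, 257): batch, frames, FFT bins
```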
submitted by /u/alex_bababu
Over the past couple of years, NVIDIA and NASA have been working closely on accelerating data science workflows using RAPIDS and integrating these GPU-accelerated libraries with scientific use cases. In this blog, we’ll share some of the results from an atmospheric science use case, and code snippets to port existing CPU workflows to RAPIDS on NVIDIA GPUs.
Accelerated Simulation of Air Pollution from Christoph Keller
One example science use case from NASA Goddard simulates chemical compositions of the atmosphere to monitor, forecast, and better understand the impact of air pollution on the environment, vegetation, and human health. Christoph Keller, a research scientist at the NASA Global Modeling and Assimilation Office, is exploring alternative approaches based on machine learning models to simulate the chemical transformation of air pollution in the atmosphere. Doing such calculations with a numerical model is computationally expensive, which limits the use of comprehensive air quality models for real-time applications such as air quality forecasting. For instance, the NASA GEOS composition forecast model GEOS-CF, which simulates the distribution of 250 chemical species in the Earth’s atmosphere in near real-time, needs to run on more than 3,000 CPUs, and more than 50% of the required compute cost is related to the simulation of chemical interactions between these species.
We were able to accelerate the simulation of atmospheric chemistry in the NASA GEOS Model with GEOS-Chem chemistry more than 10-fold by replacing the default numerical chemical solver in the model with XGBoost emulators. To train these gradient boosted decision tree models, we produced a dataset using hourly output from the original GEOS model with GEOS-Chem chemistry. The input dataset contains 126 key physical and chemical parameters such as air pollution concentrations, temperature, humidity, and sun intensity. Based on these inputs, the XGBoost model is trained to predict the chemical formation (or destruction) of an air pollutant under the given atmospheric conditions. Separate emulators are trained for individual chemicals.
To make sure that the emulators are accurate for the wide range of atmospheric conditions found in the real world, the training data needs to capture all geographic locations and annual seasons. This results in very large training datasets – quickly spanning hundreds of millions of data points – making training slow. Using RAPIDS Dask-cuDF (GPU-accelerated dataframes) and training XGBoost on an NVIDIA DGX-1 with 8 V100 GPUs, we are able to achieve a 50x overall speedup compared to dual 20-core Intel Xeon E5-2698 CPUs on the same node.
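As a rough sketch of what such a multi-GPU training setup can look like (the file path, column name, and hyperparameters are ours, not NASA’s):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import xgboost as xgb

cluster = LocalCUDACluster()  # one Dask worker per visible GPU
client = Client(cluster)

# Hypothetical parquet files holding the flattened training samples.
ddf = dask_cudf.read_parquet("geos_chem_training/*.parquet")
X = ddf.drop(columns=["o3_tendency"])  # 126 physical/chemical features
y = ddf["o3_tendency"]                 # label: chemical tendency of O3

dtrain = xgb.dask.DaskDMatrix(client, X, label=y)
result = xgb.dask.train(
    client,
    {"tree_method": "gpu_hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=200,
)
booster = result["booster"]  # trained model; result["history"] has metrics
```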
An example of this is given in the gc-xgb repo sample code, showcasing the creation of an emulator for the chemical compound ozone (O3), a key air pollutant and climate gas. For demonstration purposes, a comparatively small training data set spanning 466,830 samples is used. Each sample contains up to 126 non-zero features, and the full size of the training data contains 58,038,743 entries. In the provided example, the training data – along with the corresponding labels – is loaded from a pre-generated txt file in svmlight / libsvm format, available in the GMAO code repo:
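A minimal sketch of that load (the file name is a placeholder; the actual file ships with the GMAO repo):

```python
from sklearn.datasets import load_svmlight_file
import xgboost as xgb

# Placeholder file name for the pre-generated svmlight/libsvm data.
X, y = load_svmlight_file("o3_training_data.svm")
dtrain = xgb.DMatrix(X, label=y)
```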
Loading the training data from a pre-generated text file, as shown in the example here, sidesteps the data preparation process whereby the 4-dimensional model data (latitude × longitude × altitude × time) generated by the GEOS model (in netCDF format) is read, subsampled, and flattened.
The loaded training data can directly be used to train an XGBoost model:
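Continuing the sketch above (hyperparameters are illustrative, not the values used by GMAO):

```python
import xgboost as xgb

params = {
    "objective": "reg:squarederror",
    "max_depth": 8,
    "eta": 0.1,
    "tree_method": "gpu_hist",  # "hist" would train on the CPU instead
}
booster = xgb.train(params, dtrain, num_boost_round=200)
```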
Setting the tree_method parameter to ‘gpu_hist’ instead of ‘hist’ performs the training on GPUs instead of CPUs, highlighting a significant speed-up in training time even for the comparatively small sample training data used in this example. This difference is exacerbated on the much larger data sets needed for developing emulators suitable for actual use in the GEOS model. Since our application requires training of dozens of ML emulators – ideally on a recurring basis as new model data is produced – the much shorter training time on RAPIDS is critical and ensures a short enough model development cycle.
As shown in the figure below, the chemical tendencies of ozone (i.e., the change in ozone concentration due to atmospheric chemistry) predicted by the gradient boosted decision tree model show good agreement with the true chemical tendencies simulated by the numerical model. Given the relatively small training sample size (466,830 samples), the model trained here shows some signs of overfitting, with the correlation coefficient R2 dropping from 0.95 on the training data to 0.88 on the validation data, and the normalized root mean square error (NRMSE) increasing from 22% to 35%. This indicates that larger training samples are needed to ensure that the training dataset captures all chemical environments.
In order to deploy the XGBoost emulator in the GEOS model as a replacement for the GEOS-Chem chemical solver, the XGBoost algorithm needs to be called from within the GEOS modeling system, which is written in Fortran. To do so, the trained XGBoost model is saved to disk so that it can then be read (and invoked) from the Fortran model by leveraging XGBoost’s C API. (The XGBoost interface for Fortran can be found in the fortran2xgb GitHub repo.)
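On the Python side, that save is a one-liner (the file name is illustrative); the Fortran side can then load the file through the C API’s XGBoosterLoadModel:

```python
# Persist the trained booster so the Fortran/C side can load it.
booster.save_model("o3_emulator.model")
```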
As shown in the figure below, running the GEOS model with atmospheric chemistry emulated by XGBoost produces surface ozone concentrations that are similar to the numerical solution (red vs. black line). The blue line shows a simulation using a model with no chemistry, highlighting the critical role of atmospheric chemistry for surface ozone.
GEOS model simulations using XGBoost emulators instead of the GEOS-Chem chemical solver have the potential to be 20-50% faster than the reference simulation, depending on the model configuration (such as horizontal and temporal resolution). By offering a much faster calculation of atmospheric chemistry, these ML emulators open the door for a range of new applications, such as probabilistic air quality forecasts or a better combination of atmospheric observations and model simulations. Further improvements to the ML emulators can be achieved through mass balance considerations and by accounting for error correlations, tasks that Christoph and colleagues are currently working on.
In the next blog, we’ll talk about another application leveraging XGBoost and RAPIDS for live monitoring of air quality across the globe during the COVID-19 pandemic.
References:
Keller, C. A., Clune, T. L., Thompson, M. A., Stroud, M. A., Evans, M. J., and Ronaghi, Z.: Accelerated Simulation of Air Pollution Using NVIDIA RAPIDS, GPU Technology Conference, https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20190033152.pdf, 2019.
Keller, C. A. and Evans, M. J.: Application of random forest regression to the calculation of gas-phase chemistry within the GEOS-Chem chemistry model v10, Geosci. Model Dev., 12, 1209–1225, https://doi.org/10.5194/gmd-12-1209-2019, 2019.
My dataset uses tf.train.SequenceExample, which contains a sequence of N elements, where N by definition can vary from one sequence to another. I want to select M elements (M is fixed for all sequences) uniformly from the N elements. For example, if the sequence has N=10 elements, then for M=2 I want to select the index=0 and index=5 elements. M will always be smaller than any N in the dataset.
Now the issue is, when the dataset iterator calls the parser function through the ‘map’ method, it is executed in graph mode and the axis dimension corresponding to N is None. So I can’t iterate over that axis to find the value of N.
I resolved this issue by using tf.py_function, but it is 10x slower. I tried using tf.data.AUTOTUNE in num_parallel_calls and also in prefetch, and also set deterministic=False, but performance is still 10x slower.
What other possible solutions are there?
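One graph-mode-friendly sketch (our own, with made-up toy data): tf.shape returns the dynamic length even when the static axis is None, so no py_function is needed.

```python
import tensorflow as tf

M = 2  # fixed number of elements to keep

def subsample(seq):
    # tf.shape gives the *dynamic* N, even inside Dataset.map where
    # the static shape of this axis is None.
    n = tf.shape(seq)[0]
    # M+1 evenly spaced points in [0, N]; dropping the endpoint leaves
    # M indices, e.g. N=10, M=2 -> [0, 5].
    idx = tf.cast(tf.linspace(0.0, tf.cast(n, tf.float32), M + 1)[:-1],
                  tf.int32)
    return tf.gather(seq, idx)

def gen():  # toy stand-in for the parsed SequenceExamples
    yield [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    yield [1, 2, 3, 4]

ds = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32))
ds = ds.map(subsample, num_parallel_calls=tf.data.AUTOTUNE)
for x in ds:
    print(x.numpy())  # [1 6], then [1 3]
```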
submitted by /u/learnml