Categories
Misc

GPUs for ETL? Run Faster, Less Costly Workloads with NVIDIA RAPIDS Accelerator for Apache Spark and Databricks

Stylized image of a computer chip.We were stuck. Really stuck. With a hard delivery deadline looming, our team needed to figure out how to process a complex extract-transform-load (ETL) job on…Stylized image of a computer chip.

We were stuck. Really stuck. With a hard delivery deadline looming, our team needed to figure out how to process a complex extract-transform-load (ETL) job on trillions of point-of-sale transaction records in a few hours. The results of this job would feed a series of downstream machine learning (ML) models that would make critical retail assortment allocation decisions for a global retailer. Those models needed to be tested and validated on real transactional data.

However, up to that point, not a single ETL job ran to completion. Each test run took several days of processing time and all had to be terminated before completion.

Using NVIDIA RAPIDS Accelerator for Apache Spark, we observed significantly faster run times with additional cost savings when compared to a conventional approach using Spark on CPUs. Let us back up a bit.

Getting unstuck: ETL for a global retailer

The Artificial Intelligence & Analytics practice at Capgemini is a data science team that provides bespoke, platform–, and language-agnostic solutions that span the data science continuum, from data engineering to data science to ML engineering and MLOps. We are a team with deep technical experience and knowledge, having 100+ North America-based data science consultants, and a global team of 1,600+ data scientists.

For this project, we were tasked with providing an end-to-end solution for an international retailer with the following deliverables:

  • Creating the foundational ETL
  • Building a series of ML models
  • Creating an optimization engine
  • Designing a web-based user interface to visualize and interpret all data science and data engineering work

This work ultimately provided an optimal retail assortment allocation solution for each retail store. What made the project more complex was the state-space explosion that occurs after we begin to incorporate halo effects, such as interaction effects across departments. For example, if we allocated shelf space to fruit, what effect does that have on KPIs associated with allocating further shelf space to vegetables, and how can we jointly optimize those interaction effects?

But none of that ML, optimization, or front end would matter without the foundational ETL. So here we were, stuck. We were operating in an Azure cloud environment, using Databricks and Spark SQL, and even then, we were not observing the results we needed in the timeframe required by the downstream models.

Spurred by a sense of urgency, we explored potential variations that might enable us to significantly speed up our ETL process.

Accelerating ETL

Was the code inefficiently written? Did it maximize compute speed? Did it have to be refactored?

We rewrote code several times, and tested various cluster configurations, only to observe marginal gains. However, we had limited options to scale up owing to cost limitations, none of which provided the horsepower we needed to make significant gains. Remember when cramming for final exams, and time was just a little too tight, that pit in your stomach getting deeper by the minute? We were quickly running out of options and time. We needed help. Now.

With the Databricks Runtime 9.1 LTS, Databricks released a native vectorized query engine named Photon. Photon is a C++ runtime environment that can run faster and be more configurable than its traditional Java runtime environment. Databricks support assisted us for several weeks in configuring a Photon runtime for our ETL application.

We also reached out to our partners at NVIDIA, who recently updated the RAPIDS suite of accelerated software libraries. Built on CUDA-X AI, RAPIDS executes data science and analytics pipelines entirely on GPUs with APIs that look and feel like the most popular open-source libraries. They include a plug-in that integrates with Spark’s query planner to speed up Spark jobs.

With support from both Databricks and NVIDIA over the course of the following month, we developed both solutions in parallel, getting previously untenable run times down to sub-two hours, an amazing jump in speed!

This was the target speed that we needed to hit for the downstream ML and optimization models. The pressure was off, and—owing solely to having solved the ETL problem with Photon a few days earlier than we did with RAPIDS—the Databricks Photon solution was put into production.

Having emerged from the haze of anxiety surrounding the tight deadlines around the ETL processes, we collected our thoughts and results and conducted a posthoc analysis. Which solution was the fastest to implement? Which solution provided the fastest ETL? The cheapest ETL? Which solution would we implement for similar future projects?

Experimental results

To evaluate our hypotheses, we created a set of experiments. We ran these experiments on Azure using two approaches:

  1. Databricks Photon would be run on third-generation Intel Xeon Platinum 8370C (Ice Lake) CPUs in a hyper-threaded configuration. This is what was ultimately put into production for the client.
  2. RAPIDS Accelerator for Apache Spark would be run on NVIDIA GPUs.

We would run the same ETL jobs on both, using two different data sets. The data sets were five and 10 columns of mixed numeric and unstructured (text) data, each with 20 million rows that measured 156 and 565 terabytes, respectively. The number of workers was maximized as permitted by infrastructure spending limits. Each individual experiment was run three times.

The experimental parameters are summarized in Table 1.

Worker type Driver type Number of workers Platform Number of columns Data size
Standard_NC6s_v3 Standard_NC6s_v3 12 RAPIDS 10 565
Standard_E20s_v5 Standard_E16s_v5 6 PHOTON 10 565
Standard_NC6s_v3 Standard_NC6s_v3 16 RAPIDS 10 565
Standard_NC6s_v3 Standard_NC6s_v3 14 RAPIDS 10 565
Standard_NC6s_v3 Standard_NC6s_v3 14 RAPIDS 5 157
Standard_E20s_v5 Standard_E16s_v5 6 PHOTON 5 148
Table 1. ETL experimentation parameters

We examined the pure speed of runtimes. The experimental results demonstrated that run times across all different combinations of worker types, driver types, workers, data set size, platform, columns of data, and data set size were remarkably consistent and statistically and practically indifferentiable at an average of 4 min 37 sec per run, with min and max run times at 4 min 28 sec and 4 min 54 sec, respectively.

We had a DBU/hour infrastructure spending limit and, as a result, a limit on the varying workers per cluster tested. In response, we developed a composite metric that enabled the most balanced evaluation of results, which we named adjusted DBUs per minute (ADBUs). DBUs are Databricks units, a proprietary Databricks unit of computational cost. ADBUs are computed as follows:

text{emph{Adjusted DBUs per Minute}} = frac{text{emph{Runtime (mins)}}}{text{emph{Cluster DBUs Cost per Hour}}}

In the aggregate, we observed a 6% decrease in ADBUs by using RAPIDS Accelerator for Apache Spark when compared to running Spark on the Photon runtime, when accounting for the cloud platform cost. This meant we could achieve similar run times using RAPIDS at a lower cost.

Considerations

Other considerations include the ease of implementation and the need for rewriting code, both of which were similar for RAPIDS and Photon. A first-time implementation of either is not for the faint of heart.

Having done it one time, we are quite certain we can replicate the required cluster configuration tasks in a matter of hours for each. Moreover, neither RAPIDS nor Photon required us to refactor the Spark SQL code, which was a huge time savings.

The limitations of this experiment were the small number of replications, the limited number of worker and driver types, and the number of worker combinations, all owing to infrastructure cost limitations.

What’s next?

In the end, combining Databricks with RAPIDS Accelerator for Apache Spark helped expand the breadth of our data engineering toolkit, and demonstrated a new and viable paradigm for ETL processing on GPUs.

For more information, see RAPIDS Accelerator for Apache Spark.

Categories
Offsites

Symbol tuning improves in-context learning in language models

A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering or phrasing tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.

In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input–label mappings. We experiment with symbol tuning across Flan-PaLM models and observe benefits across various settings.

  • Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
  • Symbol-tuned models are much stronger at algorithmic reasoning tasks.
  • Finally, symbol-tuned models show large improvements in following flipped-labels presented in-context, meaning that they are more capable of using in-context information to override prior knowledge.
An overview of symbol tuning, where models are fine-tuned on tasks where natural language labels are replaced with arbitrary symbols. Symbol tuning relies on the intuition that when instruction and relevant labels are not available, models must use in-context examples to learn the task.

Motivation

Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context examples. One shortcoming, however, is that models are not forced to learn to use the examples because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, on the left in the figure above, although the examples can help the model understand the task (sentiment analysis), they are not strictly necessary since the model could ignore the examples and just read the instruction that indicates what the task is.

In symbol tuning, the model is fine-tuned on examples where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context examples. For example, on the right in the figure above, multiple in-context examples would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context examples, symbol-tuned models should have better performance on tasks that require reasoning between in-context examples and their labels.

Datasets and task types used for symbol tuning.

Symbol-tuning procedure

We selected 22 publicly-available natural language processing (NLP) datasets that we use for our symbol-tuning procedure. These tasks have been widely used in the past, and we only chose classification-type tasks since our method requires discrete labels. We then remap labels to a random label from a set of ~30K arbitrary labels selected from one of three categories: integers, character combinations, and words.

For our experiments, we symbol tune Flan-PaLM, the instruction-tuned variants of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.

We use a set of ∼300K arbitrary symbols from three categories (integers, character combinations, and words). ∼30K symbols are used during tuning and the rest are held out for evaluation.

Experimental setup

We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.

In-context learning

In the symbol-tuning procedure, models must learn to reason with in-context examples in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should perform better in settings where tasks are unclear and require reasoning between in-context examples and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions/relevant labels)

Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context examples. When these features are not available, models must reason with the given in-context examples to successfully perform the task.

Symbol tuning improves performance across all settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms FlanPaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on these tasks (effectively saving ∼10X inference compute).

Large-enough symbol-tuned models are better at in-context learning than baselines, especially in settings where relevant labels are not available. Performance is shown as average model accuracy (%) across eleven tasks.

Algorithmic reasoning

We also experiment on algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) List functions — identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers; and 2) simple turing concepts — reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).

On the list function and simple turing concept tasks, symbol tuning results in an average performance improvement of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ∼10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in-context for unseen task types, as symbol tuning did not include any algorithmic data.

Symbol-tuned models achieve higher performance on list function tasks and simple turing concept tasks. (A–E): categories of list functions tasks. (F): simple turing concepts task.

Flipped labels

In the flipped-label experiment, labels of in-context and evaluation examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability.

We see that there is a similar trend across all model sizes — symbol-tuned models are much more capable of following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve similar or better than average performance as pre-training–only models.

Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are.

Conclusion

We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based off of the intuition that when models cannot use instructions or relevant labels to determine a presented task, it must do so by instead learning from in-context examples. We tuned four language models using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30K arbitrary symbols as labels.

We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.

Future work

Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.

Acknowledgements

The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.

Categories
Misc

Whole Slide Image Analysis in Real Time with MONAI and RAPIDS

BiospecimenDigital pathology slide scanners generate massive images. Glass slides are routinely scanned at 40x magnification, resulting in gigapixel images. Compression…Biospecimen

Digital pathology slide scanners generate massive images. Glass slides are routinely scanned at 40x magnification, resulting in gigapixel images. Compression can reduce the file size to 1 or 2 GB per slide, but this volume of data is still challenging to move around, save, load, and view. To view a typical whole slide image at full resolution would require a monitor about the size of a tennis court. 

Like histopathology, both genomics and microscopy can easily generate terabytes of data. Some use cases involve multiple modalities. Getting this data into a more manageable size usually involves progressive transformations, until only the most salient features remain. This post explores some ways this data refinement might be accomplished, the type of analytics used, and how tools such as MONAI and RAPIDS can unlock meaningful insights. It features a typical digital histopathology image as an example, as these are now used in routine clinical settings across the globe.

MONAI is a set of open-source, freely available collaborative frameworks optimized for accelerating research and clinical collaboration in medical imaging. RAPIDS is a suite of open-source software libraries for building end-to-end data science and analytics pipelines on GPUs. RAPIDS cuCIM, a computer vision processing software library for multidimensional images, accelerates imaging for MONAI, and the cuDF library helps with the data transformation required for the workflow. 

Managing whole slide image data

Previous work has shown how cuCIM can speed up the loading of whole slide images. See, for example, Accelerating Scikit-Image API with cuCIM: n-Dimensional Image Processing and I/O on GPUs

But what about the rest of the pipeline, which may include image preprocessing, inference, postprocessing, visualization, and analytics? A growing number of instruments capture a variety of data, including multi-spectral images, and genetic and proteomic data, all of which present similar challenges.

A diagram showing how whole slide images are saved in a pyramid format, with individual high resolution tiles that can be extracted from each level of the pyramid.
Figure 1. Whole slide images are usually saved in a pyramid format that enables faster loading, viewing, and navigation of the image. At each level of the pyramid, the images may be separated into many tiles

Diseases such as cancer emanate from cell nuclei, which are only ~5-20 microns in size. To discern the various cell subtypes, the shape, color, internal textures, and patterns need to be visible to the pathologist. This requires very large images.

High-resolution images of cells. At 40x magnification, it is possible to see the nuclei of these cells.
Figure 2. A high-resolution image (40x magnification) of cells, in which some internal structures of cell nuclei can be seen. Image credit: Cancer Digital Slide Archive

Given that a common input size for a 2D deep learning algorithm (such as DenseNet) is usually around 200 x 200 pixels, high-resolution images need to be split into patches–potentially 100,000–just for one slide. 

The slide preparation and tissue staining process can take hours. While the value of low-latency inference results is minimal, the analysis must still keep up with the digital scanner acquisition rate to prevent a backlog. Throughput is therefore critical. The way to improve throughput is to process the images faster or compute many images simultaneously.

Potential solutions 

Data scientists and developers have considered many approaches to make the problem more tractable. Given the size of the images and the limited time pathologists have to make diagnoses, there is no practical way to view every single pixel at full resolution. 

Instead, they review images at lower resolution and then zoom into the regions they identify as likely to contain features of interest. They can usually make a diagnosis having viewed 1-2% of the full resolution image. In some respects, this is like a detective at a crime scene: most of the scene is irrelevant, and conclusions usually hinge on one or two fibers or fingerprints that provide key information.

Two images showing how MONAI’s HoVerNet model is able to segment and classify a histology image.
Figure 3. A low-resolution rendering of a gigapixel TCGA slide (left) and a plot of all 709,000 nuclear centroids with color-coded cell types (right)

Unlike their human counterparts, AI and machine learning (ML) are not able to discard 98-99% of the pixels of an image, because of concerns that they might miss some critical detail. This may be possible in the future, but would require considerable trust and evidence to show that it is safe. 

In this respect, ‌current algorithms treat all input pixels equally. Various algorithmic mechanisms may subsequently assign more or less weight to them (Attention, Max-Pooling, Bias and Weights), but initially they all have the same potential to influence the prediction. 

This not only puts a large computational burden on histopathology processing pipelines, but also requires moving a substantial amount of data between disk, CPU, and GPU. ‌Most histopathology slides contain empty space, redundant information, and noise. These properties can be exploited to reduce the actual computation needed to extract the important information. 

For example, it may be sufficient for a pathologist to count certain cell types within a pertinent region to classify a disease. To do this, the algorithm must turn pixel-intensity values into an array of nucleus centroids with an associated cell-type label. It is then very simple to compute the cell counts within a region. There are many ways in which whole slide images are filtered down to the essential elements for the specific task. Some examples might include:

  • Learning a set of image features using unsupervised methods, such as training a variational autoencoder, to encode image tiles into a small embedding.
  • Localizing all the features of interest (nuclei, for example) and only using this information to derive metrics using a specialized model such as HoVerNet.

MONAI and RAPIDS

For either of these approaches, MONAI provides many models and training pipelines that you can customize for your own needs. Most are generic enough to be adapted to the specific requirements of your data (the number of channels and dimensions, for example), but several are specific to, say, digital pathology.

Once these features have been derived, they can be used for analysis. However, even after this type of dimensionality reduction, there may still be many features to analyze. For example, Figure 3 shows an image (originally 100K x 60K RGB pixels) with hundreds of thousands of nuclei. Even generating an embedding for each 64 x 64 tile could still result in millions of data points for one slide.

This is where RAPIDS can help. The open-source suite of libraries for GPU-accelerated data science with Python includes tools that cover a range of common activities, such as ML, graph analytics, ETL, and visualization. There are a few underlying technologies, such as CuPy that enable different operations to access the same data in GPU memory without having to copy or restructure the underlying data. This is one of the primary reasons that RAPIDS is so, well, rapid.

A visual description of the pathology image analysis pipeline, from raw images (or omics) to predictions.
Figure 4. A diagram showing the pathology image analysis pipeline, from raw images, or omics (left) to localized features and feature graphs (middle), and finally to predictions with GNNs (right)

One of the main interaction tools for developers is the CUDA accelerated DataFrame (cuDF). Data is presented in a tabular format and can be filtered and manipulated using the cuDF API with pandas-like commands, making it easy to adopt. These dataframes are then used as the input to many of the other RAPIDS tools. 

For example, suppose you want to create a graph from all of the nuclei, linking each nucleus to its nearest neighbors within a certain radius. To do this, you need to present a dataframe to the cuGraph API that has columns representing the source and destination nodes of each graph edge (with an optional weight). To generate this list, you can use the cuML Nearest Neighbor search capability. Again, simply provide a dataframe listing all of the nuclei coordinates and cuML will do all the heavy lifting.

from cuml.neighbors import NearestNeighbors 

knn = NearestNeighbors() 
knn.fit(cdf) 
distances, indices = knn.kneighbors(cdf, 5)

Note that the distances calculated are, by default, Euclidean distances and, to save unnecessary computation, they are squared values. Secondly, the algorithm may use heuristics by default. If you want actual values, you can specify the optional algorithm=‘brute’ parameter. Either way, the computation is extremely fast on a GPU.

Next, merge the distance and indices dataframes into one single dataframe. To do this, you need to assign unique names to the distance columns first:

distances.columns=['ix2','d1','d2','d3','d4'] 
all_cols = cudf.concat(
[indices[[1,2,3,4]], distances[['d1','d2','d3','d4']]],
axis=1)

Each row must correspond to an edge in the graph, so the dataframe needs to be split into a row for each nearest neighbor. Then the columns can be renamed as ‘source’, ‘target,’ and ‘distance.’

all_cols['index1'] = all_cols.index
c1 = all_cols[['index1',1,'d1']]
c1.columns=['source','target','distance']
c2 = all_cols[['index1',2,'d2']]
c2.columns=['source','target','distance']
c3 = all_cols[['index1',3,'d3']]
c3.columns=['source','target','distance']
c4 = all_cols[['index1',4,'d4']]
c4.columns=['source','target','distance']


edges = cudf.concat([c1,c2,c3,c4])
edges = edges.reset_index()
edges = edges[['source','target','distance']]

To eliminate all ‌neighbors beyond a certain distance, use the following filter:

distance_threshold = 15
edges = edges.loc[edges["distance"] 

At this point, you could dispense with the ‘distance’ column unless the edges within the graph need to be weighted. Then create the graph itself:

cell_graph = cugraph.Graph()
cell_graph.from_cudf_edgelist(edges,source='source', destination='target', edge_attr='distance', renumber=True)

After you have the graph, you can do standard graph analysis operations. Triangle count is the number of cycles of length three. A k-core of a graph is a maximal subgraph that contains nodes of degree k or more:

count = cugraph.triangle_count(cell_graph)
coreno = cugraph.core_number(cell_graph)

It is also possible to visualize the graph, even though it may contain hundreds of thousands of edges. With a modern GPU, the graph can be viewed and navigated in real time. To generate visualizations such as this, use cuXFilter:

nodes = tiles_xy_cdf
nodes['vertex']=nodes.index
nodes.columns=['x','y','vertex']
cux_df = fdf.load_graph((nodes, edge_df))


chart0 = cfc.graph(
edge_color_palette=['gray', 'black'],
timeout=200,      
node_aggregate_fn='mean', 
node_pixel_shade_type='linear',
edge_render_type='direct',#other option available -> 'curved', 	edge_transparency=0.5)
d = cux_df.dashboard([chart0], layout=clo.double_feature)
chart0.view()
An image showing the location and connections between all cell nuclei in a histopathology slide.
Figure 5. A visualization of the graph of all 709,000 cell nuclei detected in the whole slide image

You can then pan and zoom down to the cell nuclei level to see the clusters of nearest neighbors (Figure 6).

A cluster of cell nuclei with nearest neighbors connected by lines.
Figure 6. A zoomed view of the cell nuclei graph showing nearest neighbors connected by graph edges

Conclusion

Drawing insights from raw pixels can be difficult and time consuming. Several powerful tools and techniques can be applied to large-image problems to provide near-real-time analysis of even the most challenging data. Apart from ‌ML capabilities, GPU-accelerated tools such as RAPIDS also provide powerful visualization capabilities that help to decipher the computational features that DL-based methods produce. This post has described an end-to-end set of tools that can ingest, preprocess, infer, postprocess, plot using DL, ML Graph, and GNN methods.

Get started with RAPIDS and MONAI and unleash the power of GPUs on your data. And join the MONAI Community in the NVIDIA Developer Forums. 

Categories
Misc

Customize Your Own Carrier Board with NVIDIA SDK Manager

An illustration showing an abstract workflow.NVIDIA SDK Manager is the go-to tool for installing the NVIDIA JetPack SDK on NVIDIA Jetson Developer Kits. It provides a guided and simple way to install the…An illustration showing an abstract workflow.

NVIDIA SDK Manager is the go-to tool for installing the NVIDIA JetPack SDK on NVIDIA Jetson Developer Kits. It provides a guided and simple way to install the development environment and get started with the developer kits in a matter of minutes. SDK Manager handles the dependencies between the components and brings the latest software to NVIDIA Jetson with every JetPack release.

Previously, this seamless installation experience provided by SDK Manager was limited to NVIDIA developer kits. We are expanding support across the Jetson community. To create the same seamless experience across Jetson partner products and custom carrier boards, we are enabling Jetson ecosystem partners and customers to integrate support for their Jetson-based carrier boards into NVIDIA SDK Manager. This update also gives users the ability to customize JetPack installation.

You can modify installation steps and the binaries of the NVIDIA JetPack software stack to fit your needs and overwrite NVIDIA Jetson hardware information to use your own carrier boards.

Tailor your package

You can configure the development environment by providing an extra configuration file to the SDK Manager application. This enables you to use SDK Manager to support the installation of your carrier board, customize packages, and more.

To get started, follow these steps:

  1. Create the extra configuration file customized to your needs.
  2. Using the SDK Manager and the extra configuration file you created, configure, and set up the development environment. 

Extra configuration file

The extra configuration file provides a way for you to customize your installation packages, processes, and hardware using SDK Manager.

SDK Manager uses data (hardware and software information) that is dynamically obtained for each SDK release. The data is stored in JSON manifest files that are loaded as needed during the installation session. When you supply an extra configuration file, it overwrites the original values (stored in the JSON manifest files) for the selected object or adds new objects to the installation session.

To modify objects to create your own extra config file, you must allocate the objects that need modifications from the original release manifest. The easiest way to do this is by inquiring about the original release manifest files, along with the provided example file.

For more information, see The Extra Configuration File in the NVIDIA SDK Manager documentation.

Example walkthrough

In this example, we use the following configuration to create a custom development environment:

  • Jetpack 5.1.1 (rev. 1) with customized BSP and flashing commands.
  • Customized NVIDIA Jetson AGX Xavier module.
  • SDK Manager version 1.9.3.

Create the extra configuration file

Screenshot of the SDK Manager interface.
Figure 1. SDK Manager JetPack SDK installation user interface view
  1. Download the software JSON manifest file (using the user interface or command line):
    • Using the SDK Manager user interface, run the NVIDIA SDK Manager, select the JP 5.1.1 (rev. 1)
    • Go to STEP 2 to review the list of components. 
    • When finished, exit SDK Manager.
    • To use the SDK Manager command-line interface, run the NVIDIA SDK Manager CLI with specific parameters, such as:

      # sdkmanager --cli install --logintype devzone --product Jetson --host --targetos Linux --version 5.1.1 --target JETSON_AGX_XAVIER_TARGETS --flash all

    • Review the list of components in the main window.
    • When finished, exit SDK Manager.
SDK Manager installation of JetPack SDK - list of components command-line interface view.
Figure 2. SDK Manager and JetPack SDK installation command-line interface view
  1. Obtain the software reference file (sdkml3_jetpack_511.json) from the ~/.nvsdkm/dist/ directory.
Screenshot of the software reference file directory.
Figure 3. Software reference file
  1. Obtain the hardware reference file from the ~/.nvsdkm/hwdata/ directory.
Screenshot of the hardware reference file directory.
Figure 4. Hardware reference file
  1. Download the example configuration file (extraconfig) based on JetPack 5.1.1 (rev. 1) from the JetPack 5.1.1 sample file
    • For this example, we renamed it: extra_config_jetpack_511_xavier.json
  2. Overwrite the information section.
    • From the software reference file, copy the version-related keys and values from the information section to your extra configuration file. For this example, it is:
"information": {

        "release": {

            "releaseVersion": "JetPack 5.1.1",

            "releaseEdition": "",

            "releaseRevision": 1

        }

},
  1. Overwrite the software section. This step overwrites specific component installation with your customized software and installation steps. The components are located in the components object in the s reference file.
    • In this example, we are modifying JetPack 5.1.1 (rev. 1) to support a customized BSP and flashing command, so the relevant components are:
      • components.NV_L4T_FILE_SYSTEM_AND_OS_COMP (used for the BSP)
      • components.NV_L4T_FLASH_JETSON_LINUX_COMP (used for the flash command)
  1. Copy both of the components into the software section in the extra configuration file.
    • NV_L4T_FILE_SYSTEM_AND_OS_COMP: Update the downloadFiles object with the customized BSP file information and correct installation commands for it. Refer to the schema object for details.
    • NV_L4T_FLASH_JETSON_LINUX_COMP: Update the componentInstallParameters.installCommands object with the correct flashing commands for the customized Jetson AGX Xavier board. Refer to the schema object for details.
  1. Overwrite the hardware section. This step overwrites specific hardware device parameters with your customized hardware device. The hardware device is located in the hw object in the hardware reference file and should be copied into the hw object at the extra configuration file.
    • In this example, the closest file would be Jetson AGX Xavier: ~/.nvsdkm/hwdata/HWDevices/Jetson/JETSON_AGX_XAVIER.json
    • Copy the JETSON_AGX_XAVIER object from the hardware reference file to the hw object in the extra config file, and then modify it per the customized hardware information with the guide from schema object.

Configure and set up the development environment

  • Share the extra configuration file you created with your customers. They ‌can:
    • Download the extra configuration file and run SDK Manager with the following command:

sdkmanager --extraconfig [local path to extra_config_jetpack_511_xavier.json]

This can be used along with other command-line arguments as needed. 

Learn more

Get started with SDK Manager to customize the installation packages for JetPack that support your developer community.

For more information about supported arguments, see Install with the Command Line

Share your ideas in the Jetson developer forum

Categories
Misc

AI-Fueled Productivity: Generative AI Opens New Era of Efficiency Across Industries

A watershed moment on Nov. 22, 2022, was mostly virtual, yet it shook the foundations of nearly every industry on the planet. On that day, OpenAI released ChatGPT, the most advanced artificial intelligence chatbot ever developed. This set off demand for generative AI applications that help businesses become more efficient, from providing consumers with answers Read article >

Categories
Misc

Full-Scale Gaming: ‘Dragon’s Dogma: Dark Arisen’ Comes to GeForce NOW

Arise, members! Capcom’s legendary role-playing game Dragon’s Dogma: Dark Arisen joins the GeForce NOW library today. The RPG and THQ Nordic’s Jagged Alliance 3 are newly supported on GeForce NOW, playable on nearly any device. From Dusk Till Pawn Become the Arisen and take up the challenge in Capcom’s critically acclaimed RPG. Set in a Read article >

Categories
Misc

Webinar: Empower Your Industrial Edge AI Applications with NVIDIA Jetson

Picture of NVIDIA Jetson AGX Orin Industrial SoM on a black background.Gain insights from advanced AI use cases powered by the NVIDIA Jetson Orin in ruggedized environments.Picture of NVIDIA Jetson AGX Orin Industrial SoM on a black background.

Gain insights from advanced AI use cases powered by the NVIDIA Jetson Orin in ruggedized environments.

Categories
Misc

Near-Range Obstacle Perception with Early Grid Fusion

Image of a car backing into a parking spot using object perception.Automatic parking assist must overcome some unique challenges when perceiving obstacles. An ego vehicle contains sensors that perceive the environment around…Image of a car backing into a parking spot using object perception.

Automatic parking assist must overcome some unique challenges when perceiving obstacles. An ego vehicle contains sensors that perceive the environment around the vehicle. During parking, the ego vehicle must be close to dynamic obstacles like pedestrians and other vehicles, as well as static obstacles such as pillars and poles. To fit into the parking spot, it also may be required to navigate low obstacles such as wheel barriers and curbs.

NVIDIA DRIVE Labs videos take an engineering-focused look at autonomous vehicle challenges and how the NVIDIA DRIVE team is addressing them. The following video introduces early grid fusion (EGF) as a new technique that enhances near-field obstacle avoidance in automatic parking assist.

Video 1. NVIDIA DRIVE Labs Episode 29: Enhanced Obstacle Avoidance for Autonomous Parking in Tight Spaces

Existing parking obstacle perception solutions depend on either ultrasonic sensors or fisheye cameras. Ultrasonic sensors are mounted on the front and rear bumpers and typically don’t cover the flank. As a result, the system is unable to perceive the ego vehicle’s sides—especially for dynamic obstacles.

Fisheye cameras, on the other hand, suffer from degraded performance in low visibility, low light, and bad weather conditions.

The NVIDIA DRIVE platform is equipped with a suite of cameras, radar, and ultrasonic sensors to minimize unseen areas and maximize sensing redundancy for all operating conditions. EGF uses a machine-learned early fusion of multiple sensor inputs to provide an accurate, efficient, and robust near-field 3D obstacle perception.

Image showing DNN output and camera views in an automatic parking process.
Figure 1. EGF detecting parked cars as obstacles while parking with NVIDIA automatic parking assist

Early grid fusion overview

To better understand the innovative technique behind EGF, look at its DNN architecture and output/input representation.

Output: Height map representation

EGF outputs a height map with a grid resolution of 4 cm. Each pixel in the height map has a float value representing the height relative to the local ground.

In Figure 2, the green highlighted panel is the output of the EGF DNN. Light blue represents the ground. Yellow represents low obstacles, for example, the curb in the back. Bright red represents the outline of high obstacles, for example, the rounded L shape contours for parked cars and the dot for a tree behind the ego vehicle. The dark red area behind the bright red outlines represents potential occluded areas behind high obstacles.

Image showing DNN output, ultrasonic, and camera in a tight outdoor perpendicular parking.
Figure 2. EGF input and output visualization

This representation enables EGF to capture rich information about the world around it. The high-resolution grid can represent the rounded corner of cars on the rear left and right of the ego vehicle. Capturing the rounded corners is essential for the parking planner to have sufficient free space to perform the parking maneuver between two parked cars in a tight spot.

With different height values per pixel, you can distinguish between the curb for which the car has sufficient clearance and the light pole on the curb that the car must stay clear of.

Input: Ultrasonic and camera

Most multi-sensor fusion perception solutions are late-fusion systems operating on the detection level. Traditional ultrasonic detections obtained by trilateration are fused with polygon detections from the camera in a late-fusion stage, usually with hand-crafted fusion rules.

In contrast, EGF uses an early-fusion approach. Low-level signals from the sensor are directly fed to the DNN, which learns the sensor fusion through a data-driven approach.

For ultrasonic sensors, EGF taps into the raw envelope interface that provides reflection intensity with sub-centimeter accuracy. These envelope signals are projected into a planar view map using the extrinsic position and intrinsic beam properties of the ultrasonic sensor (Figure 3 bottom left). These ultrasonic maps, as shown in the highlighted pink panel in Figure 2, capture much more information than trilateration detections. This enables height detection in EGF.

Diagram shows the MLMCF shared trunk, USSNET shared trunk, and the combined features.
Figure 3. EGF DNN architecture

For camera sensors, EGF shares the image encoder backbone with MLMCF–NVIDIA multitask multi-camera perception backbone used for higher-speed driving. First, we process image features through CNN layers. Then, we perform uplifting of the features from an image space to a bird’s eye view space using a learned transform per camera (Figure 3 top-right box).

The ultrasonic and camera feature maps are then fused in an encoder network, and the height map is decoded from the combined features (Figure 3 right side).

Conclusion

EGF is an innovative, machine-learning–based, perception component to increase safety for autonomous parking. By using early fusion for multi-modality raw sensor signals, EGF builds a high level of trust for near-field obstacle avoidance.

To learn more about the software functionality that we’re building, see the rest of the NVIDIA DRIVE Labs video series. Catch up on more NVIDIA DRIVE posts.

Categories
Misc

Apache Airflow for Authoring Workflows in NVIDIA Base Command Platform

Laptop with dataSo, you have a ton of data pipelines today and are considering investing in GPU acceleration through NVIDIA Base Command Platform. What steps should you take?…Laptop with data

So, you have a ton of data pipelines today and are considering investing in GPU acceleration through NVIDIA Base Command Platform. What steps should you take? Use workflow management to integrate NVIDIA Base Command into your existing pipeline. 

A workflow manager enables you to easily manage your pipelines, and connect to Base Command to leverage NVIDIA compute power. This example uses Apache Airflow, which comes with a rich open-source community, is well established, and widely adopted. 

What is workflow management and why is it important?

Workflow management enables you to connect and manage all tasks in a pipeline. It accomplishes this by creating, documenting, and monitoring all steps required to complete necessary tasks. It streamlines your workflow by making sure that everything is completed correctly and efficiently.

A business often has a BizOps team, MLOps team, and DevOps team working on various tasks to reach a given goal. For a simple workflow, many people complete various tasks, some are related or dependent upon each other, while others are completely independent. Workflow management can provide invaluable support for reaching the final outcome, particularly in complex situations.

To provide an analogy, imagine you are at your favorite sushi restaurant, and you place an order for your favorite roll. In the kitchen, there are several chefs working on various tasks to prepare your sushi. One is preparing the fish, the next is carefully slicing vegetables, the third is making the rice (cooking, washing, seasoning), and the fourth is toasting the nori over an open flame. 

Only after each chef has completed their task can a master sushi chef assemble the roll. Here we see multiple roles with different expertise, required to accomplish various tasks in order to complete the end goal.

Flow diagram (boxes with arrows to the next step) depicting the steps: wash rice, cook rice, season rice, toast nori, slice vegetables, and prepare fish to make a sushi roll.
Figure 1. Example workflow to make a sushi roll

If the sushi restaurant offers 50 different menu items, there will be at least 50 different workflows. Figure 2 shows a workflow that includes just several menu items.

Flow diagram depicting several processes for different menu items including green tea, tempura shrimp, tempura chicken, and three different sushi rolls.
Figure 2. Example workflow for several menu items at a sushi restaurant

Now think of a food hall with 20 restaurants, each with their own menus and workflows.

Complex flow diagram depicting the process to make several menu items at many restaurants.
Figure 3. Example workflow of several restaurants in a food hall

You can see how this situation becomes too much for a human to organize. Digital tools help organize and execute complex tasks—tools like Apache Airflow.

If you need to maintain current processes while also adding new steps, workflow management is key. Managing workflows is an established problem, and as AI adoption accelerates, it is clear that bringing AI tasks and outcomes into existing workflows becomes the next challenge.

Apache Airflow

What does it mean to include AI as part of a bigger workflow for deploying applications? In 2015, Airbnb had trouble managing their complex data pipelines, so they created Airflow. After doing market research, they found that most people were using cron schedulers or internal workflow tools. These tools were not very sophisticated, and did not anticipate future needs. They were “make it up as you go” kind of tools. 

Airflow was made to be scalable and dynamic. It was open sourced in 2016 and became part of the Apache Foundation. This made Airflow increasingly popular and led to its rich open-source community.

NVIDIA Base Command Platform

NVIDIA Base Command Platform is an AI training platform that enables businesses and scientists to accelerate AI development. NVIDIA Base Command enables you to train AI with NVIDIA GPU acceleration. NVIDIA Base Command, in combination with NVIDIA-accelerated AI infrastructure, provides a cloud-hosted solution for AI development so you can avoid the overhead and pitfalls of deploying and running a do-it-yourself platform. 

NVIDIA Base Command efficiently configures and manages AI workloads, delivers integrated dataset management, and executes them on right-sized resources ranging from a single GPU to large-scale, multi-node clusters.

Apache Airflow plus NVIDIA Base Command Platform

Having a tool like Apache Airflow schedule and run jobs, as well as monitor their progress, helps streamline the model training process. Additionally, once the model is trained and ready for production, you can use Airflow to get the results from Base Command Platform and use it in NVIDIA Fleet Command for production. Airflow reaches across platforms to make an end to end pipeline easier to operate. Adding AI with Base Command to a new or existing pipeline is made easier with a workflow management tool. 

Key Airflow features for MLOPs

Airflow is a popular, well-established tool with a large user community. Many companies already use it, and abundant resources are available. It is open source and one of the first well-known workflow management tools. Cron schedulers have their place, but make it difficult to manage a pipeline when a job fails. Workflow tools (like Airflow) help resolve dependencies, when another job depends on the output of a failed task. 

Workflow management tools have more features; for example, alerting team members if a task/job fails so that someone can fix it and rerun jobs. Applying workflow management tools can benefit many people in the workflow, including data engineers doing ETL jobs, data scientists doing model training jobs, analysts doing reporting jobs, and more.

Tasks and DAGs

Airflow uses Directed Acyclic Graphs (DAGs) to run a workflow. DAGs are built in Python. You set up your tasks and dependencies, and Airflow returns a graph depicting your workflow. Airflow triggers jobs to run once their dependencies have been fully met. 

Figure 4 shows a DAG workflow to bake and frost a cake. Some of the tasks are dependent, such as measuring and mixing the ingredients, to bake the cake. Other tasks, such as ‘preheat oven,’ are necessary to complete the final goal: a frosted cake. Everything needs to be connected to complete the final product.

In this DAG, ‘measure ingredients,’ ‘preheat oven,’ and ‘make frosting’ would be triggered and executed first. When those tasks are completed, the next steps will be run in accordance to their dependencies. 

Flow diagram showing steps to bake a cake including: measure ingredients, mix batter, preheat oven, bake cake, make frosting.
Figure 4. DAG depicting workflow to bake a cake

Airflow UI

The Airflow UI is intuitive and easy to use. It can be used to trigger your DAGs, as well as monitor the progress of tasks and DAGs. You can also view logs, which can be used for troubleshooting. 

Dynamic jobs 

Dynamic jobs enable you to run the same job, while changing a few parameters. These jobs will run in parallel, and you are able to add variables instead of coding the same job with minor changes multiple times. 

Continuing with the cake example, suppose you set out to make a single chocolate cake, but then decide to start a bakery. Instead of manually creating tasks for each separate cake, you can give Airflow a list of cakes: strawberry, coconut, and red velvet (Figure 5). You can do this through the UI or by uploading a JSON file. Airflow will dynamically create three more jobs to make three more cake flavors, instead of manually recreating the process for each new cake flavor. If someone is allergic to coconut, you can remove it from the list. Or you could have a flavor of the week and programmatically change the variables (cake flavors) weekly.  

Flow diagram showing the steps to bake several flavors of cake (chocolate, strawberry, red velvet, coconut)  following the same steps.
Figure 5. DAG depicting dynamic workflow to bake several different cakes
Screenshot of the Airflow UI, showing the dynamic variables (cake flavors) used to create Figure 5
Figure 6. ‌List of variables from Airflow UI used to dynamically create jobs

If you apply this approach to an ML pipeline, you can imagine all that can be accomplished. The updates can be programmatic and automated as part of a larger pipeline. Combined with a more complex Base Command job, such as running some framework and possibly only changing one simple variable or set of variables per container, then compare the results of all of the different job runs to make a decision. 

Airflow could then be configured to kick off an additional multi-node training run based on the results, or the winning model could be uploaded to the private registry and further workloads or tasks could be integrated with Fleet Command to take it to production. 

How to use Airflow

First, make sure you have a Kubernetes environment, with Helm installed. Helm is a package manager for Kubenetes used to find, share, and use software. If you are working from a Mac, Homebrew can help with installing Helm. 

helm repo add apache-airflow https://airflow.apache.org
helm repo update
kubectl create namespace airflow
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --debug

Next, generate a secret for the webserver: 

Generate Secret for UX
python3 -c 'import secrets; print(secrets.token_hex(16))'
helm show values apache-airflow/airflow > values.yaml
Open the file
When inside file, change: ‘webserverSecretKey: ’ 
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug

Airflow stores DAGs in a local file system. A more scalable and straightforward way to keep DAGs is in a GitHub repository. Airflow checks for new/updated DAGs in your file system, but using GitSync, Airflow looks at a GitHub repository which is easier to maintain and change.

Next, configure GitSync. When you add the key to your GitHub repository, be sure to enable write access.

ssh-keygen -t ed25519 -C “airflow-git-ssh”
  When asked where to save, press enter
  When asked for a passcode, press enter (do NOT put in a passcode, it will break)
Copy/Paste public key into private github repository 
Repository > settings > deploy key > new key 
kubectl create secret generic airflow-git-ssh 
 --from-file=gitSshKey=/Users/skropp/.ssh/id_ed25519 
 --from-file=known_hosts=/Users/skropp/.ssh/known_hosts 
 --from-file=id_ed25519.pub=/Users/skropp/.ssh/id_ed25519.pub 
 -n airflow
edit values.yaml (see below)
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml —-debug

Original:

gitSync:
 enabled: false
 # git repo clone url
 # ssh examples ssh://git@github.com/apache/airflow.git
 # https example: https://github.com/apache/airflow.git
 repo: https://github.com/apache/airflow.git
 branch: v2-2-stable 
 rev: HEAD 
 depth: 1
 # the number of consecutive failures allowed before aborting
 maxFailures: 0
 # subpath within the repo where dags are located
 # should be "" if dags are at repo root
 subPath: "tests/dags"
 # if your repo needs a user name password
 # you can load them to a k8s secret like the one below
 # ___
 # apiVersion: vI
 # kind: Secret
 # metadata:
 #    name: git-credentials
 # data:
 #    GIT_SYNC_USERNAME: 
 #    GIT_SYNC_PASSWORD: 
 # and specify the name of the secret below
 # sshKeySecret: airflow-ssh-secret

With changes:

gitSync:
 enabled: True
 # git repo clone url
 # ssh examples ssh://git@github.com/apache/airflow.git
 # https example: https://github.com/apache/airflow.git
 repo: ssh://git@github.com/sushi-sauce/airflow.git
 branch: main 
 rev: HEAD 
 depth: 1
 # the number of consecutive failures allowed before aborting
 maxFailures: 0
 # subpath within the repo where dags are located
 # should be "" if dags are at repo root
 subPath: ""
 # if your repo needs a user name password
 # you can load them to a k8s secret like the one below
 # ___
 # apiVersion: v1
 # kind: Secret
 # metadata:
 # name: git-credentials
 # data:
 # GIT_SYNC_USERNAME: 
 # and specify the name of the secret below
 credentialsSecret: git-credentials
 # If you are using an ssh clone url, you can load
 # the ssh private key to a k8s secret like the one below
 # ___
 # apiVersion: v1
 # kind: Secret
 # metadata:
 # name: airflow-ssh-secret
 # data:
 # key needs to be gitSshKey
 # gitSshKey: 
 # and specify the name of the secret below
sshKeySecret: airflow-git-ssh

Now you have Airflow running, and a place to store DAG files. 

DAG examples

Below, find a simple example DAG, and one that is more complex.

Simple example DAG

A very simple DAG is shown below:

  • Lines 1-3 import various tools and operators needed to run the tasks.
  • Lines 5-6 create a Python function that prints a message.
  • Lines 8-10 define the DAG, giving the display name hello_world and a description, as well as schedule interval and start date. Schedule interval and start date are required configurations.
  • Line 12 defines the task, task_id names the task, Python callable calls the function, dag=DAG brings in the configs set above.
1 from datetime import datetime
2 from airflow import DAG
3 from airflow.operators.python_operator import PythonOperator
4
5 def print_hello():
6 return 'Hello world from first Airflow DAG!'
7
8 dag = DAG('hello_world', description='Hello World DAG',
9		schedule interval='0 12 * * *',
10 	 	start_date=datetime (2017, 3, 20), catchup=False)
11
12 hello_operator = PythonOperator (task_id-'hello_task', python_callable=print_hello, dag=dag)
13
14 hello_operator
View of the Airflow interface showing the single task “hello task” generated by the simple DAG code
Figure 7. Airflow graph generated from the simple example code

More complex example DAG

This example creates the same task three times, echoing hello from task 1, 2, and 3. It gets interesting when you use a list of variables instead of simply numbers in your loop. This means you can‌ change pieces of your code to dynamically create different jobs.

1 from airflow import DAG
2 from airflow.operators.bash_operator import BashOperator
3 from airflow.operators. dummy_operator import DummyOperator
4 from datetime import datetime, timedelta
5
6 # Step 1 - Define the default arguments for DAG
7 default_args = {
8 	'depends_on_past': False,
9 	'start_date': datetime (2020, 12, 18),
10 	'retry_delay': timedelta(minutes=5)
11 }
12
13 # Step 2 - Declare a DAG with default arguments
14 dag = DAG( 'hello_dynamic_tasks',
15 	schedule_interval='0 8 * * *' 
16 	default_args=default_args,
17 	catchup=False
18 	)
19 # Step 3 - Declare dummy start and stop tasks
20 start_task = DummyOperator(task_id='start', dag=dag)
21 end_task = DummyOperator (task_id='end', dag=dag)
22
23 # Step 4 - Create dynamic tasks for a range of numbers
24 for 1 in range(1, 4):
25 	# Step 4a - Declare the task
26 	t1 = BashOperator (
27 		task_id='task_t' + str(i),
28 		bash _command='echo hello from task: '+str(i), 
29		dag=dag
30 	)
31 # Step 4b - Define the sequence of execution of tasks
32 start_task »> t1 >> end_task

While similar to the first example, this example uses the placeholder operator to create empty tasks, and a loop to create dynamic tasks.

Flow diagram showing the tasks start, task 1, task 2, task 3, and end. Tasks 1,2, and 3 are parallel.
Figure 8. Airflow graph generated from the more complex example code

Example DAG with NVIDIA Base Command

Airflow can leverage the Base Command API. Fleet Command uses the same API. This enables Airflow to use many NVIDIA AI platforms, making an accelerated AI pipeline easy to manage with Airflow. Let’s walk through some code from Airflow showing the tasks needed to connect to Base Command and run a job.

t1= PythonOperator(
task_id = 'api_connect'
python_callable= find_api_key,
dag = dag,
)
t2 = PythonOperator (
task_id = 'token',
python_callable = get_token,
op_kwargs=("org":org_, "team": team_),
dag = dag
)
t3 = PythonOperator (
task_id = 'get_dataset',
op kwargs=("org":org_), 
python_callable = get_datasets, 
dag = dag
)

t5 = PythonOperator(
task_id = 'job',
python_callable run_job,
dag = dag
)

for element in instance_v:
t4 = PythonOperator (
task_id = 'create_job_' + str(element),
op_kwargs={"org":org_. ,"team": team_, "ace": ace_, "name": name_, "command": command_ , "container": container_, "instance": str(element))},
python_callable=create_job,
dag = dag
)

t1 >> t2 >> t3 >> t4 >> t5

Key functions being used in the tasks include:

def find_api_key(ti):
        expanded_conf_file_path = os.path.expanduser("~/.ngc/config")
        if os.path.exists(expanded_conf_file_path):
            print("Config file exists, pulling API key from it")
            try:
                config_file = open(expanded_conf_file_path, "r")
                lines = config_file.readlines()
                for line in lines:
                 if "apikey" in line:
                    elements = line.split()
                    return elements[-1]
                   
            except:
                print("Failed to find the API key in config file")
                return ''
        elif os.environ.get('API_KEY'):
            print("Using API_KEY environment variable")
            return os.environ.get('API_KEY')
            
        else:
            print("Could not find a valid API key")
            return ''
       
def get_token(ti, org,team ):
        api = ti.xcom_pull(task_ids='api_connect')
        '''Use the api key set environment variable to generate auth token'''
        scope_list = []
        scope = f'group/ngc:{org}'
        scope_list.append(scope)
        if team:
            team_scope = f'group/ngc:{org}/{team}'
            scope_list.append(team_scope)

        querystring = {"service": "ngc", "scope": scope_list}
 
 auth = '$oauthtoken:{0}'.format(api)
        auth = base64.b64encode(auth.encode('utf-8')).decode('utf-8')
        headers = {
          'Authorization': f'Basic {auth}',
          'Content-Type': 'application/json',
          'Cache-Control': 'no-cache',
        }
        url = 'https://authn.nvidia.com/token'
        response = requests.request("GET", url, headers=headers, params=querystring)
        if response.status_code != 200:
            raise Exception("HTTP Error %d: from %s" % (response.status_code, url))

        return json.loads(response.text.encode('utf8'))["token"]

  • Task 1 finds your API key using a Python function defined in the DAG.
  • Task 2 gets a token; there are two very interesting things happening here:
    • To get a token, you need to give the API your key. Task 1 finds the key, but in Airflow, all the tasks are separate. (Note that I used the xcom feature in get_token to pull the results of Task 1 into Task 2. xcom pulls the API key found in the function find_api_key, into get_token to generate a token.)
    • The org and team arguments are Airflow variables. This means you can go into the Airflow UI and change the credentials depending on who is using it. This makes changing users clean and easy.
  • Task 3 gets the dataset needed for the job. Similarly, it uses the org variable defined in the Airflow UI.
  • Task 4 is the main character. For each element in the list of instances, Airflow creates a job. Variables are also used for team, org, container, name, command, and instance. If you want to change any of these components, make the change on the variable page inside of Airflow.
  • Task 5 runs the job.
The Airflow UI, showing the values assigned to the instance variable.
Figure 9. List of instances from Airflow UI, enabling you to run the job on one, two, four, and eight GPUs, respectively
Flow diagram showing how to run a dynamic job in Base Command Platform. The steps include: api connect, token, get dataset, create job on ‘instance’, run job. It has four different instances running in parallel.
Figure 10. Graph showing dynamic job to Base Command Platform, where the same job is running on four different instances: one, two, four, and eight GPUs

Conclusion

Incorporating a workflow management tool such as Apache Airflow into your data pipelines is crucial for managing and executing complex tasks efficiently. With the rapid adoption of AI in various industries, the need to integrate AI tasks into existing workflows becomes increasingly important.

Integrating Airflow with an AI platform such as NVIDIA Base Command, which leverages GPU acceleration, streamlines the process of training and deploying AI models. The automation and flexibility of Airflow, combined with NVIDIA computing power through Base Command Platform, enable efficient experimentation, model comparison, and decision making within your ML pipeline. 

A well-managed, faster workflow is the end product. Base Command Platform and Airflow together empower organizations to optimize data pipelines, enhance collaboration among different teams, and facilitate the integration of accelerated AI into existing workflows. This leads to quicker AI development and deployment that is more effective, scalable, and reliable.

To learn more, watch the NVIDIA Base Command demo. And check out the related post, Simplifying AI Development with NVIDIA Base Command Platform

Categories
Misc

Score! Team NVIDIA Takes Trophy in Recommendation Systems

A crack NVIDIA team of five machine learning experts spread across four continents won all three tasks in a hotly contested, prestigious competition to build state-of-the-art recommendation systems. The results reflect the group’s savvy applying the NVIDIA AI platform to real-world challenges for these engines of the digital economy. Recommenders serve up trillions of search Read article >