Categories
Misc

Predicting Metastatic Cancer Risk with AI

Using movies of living cancer cells, scientists create a convolutional neural network that can identify and predict aggressive metastatic melanomas.

Using a newly developed AI algorithm, researchers from the University of Texas Southwestern Medical Center are making early detection of aggressive forms of skin cancer possible. The study, recently published in Cell Systems, describes a deep learning model that can predict whether a melanoma will spread aggressively by examining cell features undetectable to the human eye.

“We now have a general framework that allows us to take tissue samples and predict mechanisms inside cells that drive disease, mechanisms that are currently inaccessible in any other way,” said senior author Gaudenz Danuser, the Patrick E. Haggerty Distinguished Chair in Basic Biomedical Science at the University of Texas Southwestern.

Melanoma—a serious form of skin cancer caused by changes in melanocyte cells—is the most likely of all skin cancers to spread if not caught early. Quickly identifying it helps doctors create effective treatment plans, and melanoma diagnosed early has a 5-year survival rate of about 99%.

Doctors often use biopsies, blood tests, or X-rays, CT, and PET scans to determine the stage of melanoma and whether it has spread to other areas of the body, known as metastasizing. Changes in cellular behavior could hint at the likelihood of the melanoma to spread, but they are too subtle for experts to observe. 

The researchers thought AI could be very valuable in determining the metastatic potential of melanoma, but until now AI models have not been able to interpret these cellular characteristics.

“We propose an algorithm that combines unsupervised deep learning and supervised conventional machine learning, along with generative image models to visualize the specific cell behavior that predicts the metastatic potential. That is, we map the insight gained by AI back into a data cue that is interpretable by human intelligence,” said Andrew Jamieson, study coauthor and assistant professor in bioinformatics at UT Southwestern. 

Using tumor images from seven patients with a documented timeline of metastatic melanoma, the researchers compiled a time-lapse dataset of more than 12,000 single melanoma cells in petri dishes. The dataset amounted to approximately 1,700,000 raw images, which the researchers analyzed with a deep learning algorithm to identify different cellular behaviors.

Time lapse of single melanoma cells cropped from the field of view and used as the input for the autoencoder.
Credit: Danuser et al/Cell Systems

Based on these features, the team then “reverse engineered” a deep convolutional neural network able to tease out the physical properties of aggressive melanoma cells and predict whether cells have high metastatic potential.

The experiments were run on the UT Southwestern Medical Center BioHPC cluster with CUDA-accelerated NVIDIA V100 Tensor Core GPUs. The team trained multiple deep learning models on the 1.7 million cell images to visualize and explore the massive dataset, which began as more than 5 TB of raw microscopy data.

The researchers then tracked the spread of melanoma cells in mice and tested whether these specific predictors lead to highly metastatic cells. They found that the cell types they’d classified as highly metastatic spread throughout the entire animal, while those classified as having low metastatic potential did not.

There is more work to be done before the research can be deployed in a medical setting. The team also points out that the study raises questions about whether this applies to other cancers, or if melanoma metastasis is an outlier. 

“The result seems to suggest that the metastatic potential, at least of melanoma, is set by cell-autonomous rather than environmental factors,” Jamieson said. 

Applications of the study could also go beyond cancer, and transform diagnoses of other diseases.


Read the full article in Cell Systems>>

Categories
Misc

Writing Portable Rendering Code with NVRHI


Modern graphics APIs, such as Direct3D 12 and Vulkan, are designed to provide relatively low-level access to the GPU and eliminate the GPU driver overhead associated with API translation. This low-level interface allows applications to have more control over the system and provides the ability to manage pipelines, shader compilation, memory allocations, and resource descriptors in a way that is best for each application.

On the other hand, this closer-to-the-hardware access to the GPU means that the application must manage these things on its own, instead of relying on the GPU driver. A basic “hello world” program that draws a single triangle using these APIs can grow to a thousand lines of code or more. In a complex renderer, managing the GPU memory, descriptors, and so on, can quickly become overwhelming if not done in a systematic way.

If an application or an engine must work with more than one graphics API, it can be done in two ways:

  • Duplicate the rendering code to work with each API separately. This approach has an obvious drawback of having to develop and maintain multiple independent implementations.
  • Implement an abstraction layer over the graphics APIs that provides the necessary functionality in a common interface. This approach has a different drawback: the abstraction layer itself must be developed and maintained. Most major game engines take the second approach.

NVIDIA Rendering Hardware Interface (NVRHI) is a library that addresses these drawbacks. It defines a custom, higher-level graphics API that maps well to the three supported native graphics APIs: Vulkan, D3D12, and D3D11. It manages resources, pipelines, descriptors, and barriers in a safe and automatic way that can be easily disabled or bypassed when necessary to reduce CPU overhead. On top of that, NVRHI provides a validation layer that ensures that the application’s use of the API is correct, similar to what the Direct3D debug runtime or the Vulkan validation layers do, but at a higher level.

There are some features related to portability that NVRHI doesn’t provide. First, it doesn’t compile shaders at run time or read shader reflection data to bind resources dynamically. In fact, NVRHI doesn’t process shaders at run time at all. The application provides a platform-specific shader binary, that is, a DXBC, DXIL, or SPIR-V blob, and NVRHI passes it directly to the underlying graphics API. Matching the binding layouts is left up to the application and is validated by the underlying graphics API. Second, NVRHI doesn’t create graphics devices or windows. That is also left up to the application or other libraries, such as GLFW.

In this post, I go over the main features of NVRHI and explain how each feature helps graphics engineers be more productive and write safer code.

  • Resource lifetime management
  • Binding layouts and binding sets
  • Automatic resource state tracking
  • Upload management
  • Interaction with graphics APIs
  • Shader permutations

Resource lifetime management

In Vulkan and D3D12, the application must take care to destroy only the device resources that the GPU is no longer using. This can be done with little overhead if the resource usage is planned carefully, but the problem is in the planning.

NVRHI follows the D3D11 resource lifetime model almost exactly. Resources, such as buffers, textures, or pipelines, have a reference count. When a resource handle is copied, the reference count is incremented. When the handle is destroyed, the reference count is decremented. When the last handle is destroyed and the reference count reaches zero, the resource object is destroyed, including the underlying graphics API resource. But that’s what D3D12 does as well, right? Not quite.

NVRHI also keeps internal references to resources that are used in command lists. When a command list is opened for recording, a new instance of the command list is created. That instance holds references to each resource it uses. When the command list is closed and submitted for execution, the instance is stored in a queue along with a fence or semaphore value that can be used to determine if the instance has finished executing on the GPU. The same command list can be reopened for recording immediately after that, even while the previous instance is still executing on the GPU.

The application should call the nvrhi::IDevice::runGarbageCollection method occasionally, at least one time per frame. This method looks at the in-flight command list instance queue and clears the instances that have finished executing. Clearing the instance automatically removes the internal references to the resources used in the instance. If a resource has no other references left, it is destroyed at that time.

This behavior can be shown with the following code example:

{
    // Create a buffer in a scope, which starts with a reference count of 1
    nvrhi::BufferHandle buffer = device->createBuffer(...);

    // Creates an internal instance of the command list
    commandList->open();

    // Adds a buffer reference to the instance, which increases the reference count to 2
    commandList->clearBufferUInt(buffer, 0);

    commandList->close();

    // The local reference to the buffer is released here, decrementing the reference count to 1
}

// Puts the command list instance into the queue
device->executeCommandList(commandList);

// Likely doesn't do anything with the instance
// because it's just been submitted and still executing on the GPU
device->runGarbageCollection();

device->waitForIdle();

// This time, the buffer should be destroyed because
// waitForIdle ensures that all command list instances
// have finished executing, so when the finished instance
// is cleared, the buffer reference count is decremented to zero
// and it can be safely destroyed
device->runGarbageCollection();

The “fire and forget” pattern shown here, when the application creates a resource, uses it, and then immediately releases it, is perfectly fine in NVRHI, unlike D3D12 and Vulkan.

You might wonder whether this type of resource tracking becomes expensive if the application performs many draw calls with lots of resources bound for each draw call. Not really. Draw calls and dispatches do not deal with individual resources. Textures and buffers are grouped into immutable binding sets, which hold permanent references to their resources from the moment they are created and are tracked as a single object.

So, when a certain binding set is used in a command list, the command list instance only stores a reference to the binding set. And that store is skipped if the binding set is already bound, so that repeated draw calls with the same bindings do not add tracking cost. I explain binding sets in more detail in the next section.

Another thing that can help reduce the CPU overhead imposed by resource lifetime tracking is the trackLiveness setting that is available on binding sets and acceleration structures. When this parameter is set to false, the internal references are not created for that particular resource. In this case, the application is responsible for keeping its own reference and not releasing it while the resource is in use.
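For illustration, disabling liveness tracking on a binding set might look like the following sketch. It uses the binding set API covered in the next section and assumes that trackLiveness is a plain field on the descriptor struct; constantBuffer and bindingLayout are placeholders for objects created elsewhere.

auto bindingSetDesc = nvrhi::BindingSetDesc()
    .addItem(nvrhi::BindingSetItem::ConstantBuffer(0, constantBuffer));
bindingSetDesc.trackLiveness = false; // assumption: a plain bool field that defaults to true

nvrhi::BindingSetHandle bindingSet = device->createBindingSet(bindingSetDesc, bindingLayout);

// With tracking disabled, the application must keep bindingSet (and the resources
// it references) alive for as long as command lists that use it may still be
// executing on the GPU.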

Binding layouts and binding sets

NVRHI features a unique resource binding model designed for safety and runtime efficiency. As mentioned earlier, various resources that are used by graphics or compute pipelines are grouped into binding sets.

Put simply, a binding set is an array of resource views that are bound to particular slots in a pipeline. For example, a binding set may contain a structured buffer SRV bound to slot t1, a UAV for a single texture mip level bound to slot u0, and a constant buffer bound to slot b2. All the bindings in a set share the same visibility mask (which shader stages will see that binding) and register space, both dictated by the binding layout.

Binding layouts are the NVRHI version of D3D12 root signatures and Vulkan descriptor set layouts. A binding layout is like a template for a binding set. It declares what resource types are bound to which slots, but does not tell which specific resources are used.

Like the root signatures and descriptor set layouts, NVRHI binding layouts are used to create pipelines. A single pipeline may be created with multiple binding layouts. This can be useful for binning resources into different groups according to their modification frequency, or for binding different sets of resources to different pipeline stages.

The following code example shows how a basic compute pipeline can be created with one binding layout:

auto layoutDesc = nvrhi::BindingLayoutDesc()
     .setVisibility(nvrhi::ShaderType::All)
     .addItem(nvrhi::BindingLayoutItem::Texture_SRV(0))     // texture at t0
     .addItem(nvrhi::BindingLayoutItem::ConstantBuffer(2)); // constants at b2
  
// Create a binding layout.
nvrhi::BindingLayoutHandle bindingLayout = device->createBindingLayout(layoutDesc);
  
auto pipelineDesc = nvrhi::ComputePipelineDesc()
       .setComputeShader(shader)
       .addBindingLayout(bindingLayout);
  
// Use the layout to create a compute pipeline.
nvrhi::ComputePipelineHandle computePipeline = device->createComputePipeline(pipelineDesc); 

Binding sets can only be created from a matching binding layout. Matching means that the layout must have the same number of items, of the same types, bound to the same slots, in the same order. This may look redundant, and indeed the D3D12 and Vulkan APIs have less redundancy in their descriptor systems, but the redundancy is useful: it makes the code more obvious, and it allows the NVRHI validation layer to catch more bugs.

auto bindingSetDesc = nvrhi::BindingSetDesc()
       // An SRV for two mip levels of myTexture.
       // Subresource specification is optional, default is the entire texture.
     .addItem(nvrhi::BindingSetItem::Texture_SRV(0, myTexture, nvrhi::Format::UNKNOWN,
       nvrhi::TextureSubresourceSet().setBaseMipLevel(2).setNumMipLevels(2)))
     .addItem(nvrhi::BindingSetItem::ConstantBuffer(2, constantBuffer));
  
// Create a binding set using the layout created in the previous code snippet.
nvrhi::BindingSetHandle bindingSet = device->createBindingSet(bindingSetDesc, bindingLayout); 

Because the binding set descriptor contains almost all the information necessary to create the binding layout as well, it is possible to create both with one function call. That may be useful when creating some render passes that only need one binding set.

#include <nvrhi/utils.h> // header providing the nvrhi::utils helpers
...
nvrhi::BindingLayoutHandle bindingLayout;
nvrhi::BindingSetHandle bindingSet;
nvrhi::utils::CreateBindingSetAndLayout(device, /* visibility = */ nvrhi::ShaderType::All,
       /* registerSpace = */ 0, bindingSetDesc, /* out */ bindingLayout, /* out */ bindingSet);
  
// Now you can create the pipeline using bindingLayout. 

Binding sets are immutable. When you create a binding set, NVRHI allocates the descriptors from the heap on D3D12 or creates a descriptor set on Vulkan and populates it with the necessary resource views.

Later, when the binding set is used in a draw or dispatch call, the binding operation is lightweight and translates to the corresponding graphics API binding calls. There is no descriptor creation or copying happening at render time.
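For example, a compute dispatch that uses the pipeline and binding set from the earlier snippets might look like the following sketch. It follows the builder pattern shown above; the exact setter names on ComputeState are my assumption, and the thread group count is arbitrary.

auto state = nvrhi::ComputeState()
    .setPipeline(computePipeline)
    .addBindingSet(bindingSet);

commandList->open();
commandList->setComputeState(state); // lightweight: the descriptors were created with the binding set
commandList->dispatch(64, 1, 1);     // thread group counts are application-specific
commandList->close();
device->executeCommandList(commandList);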

Automatic resource state tracking

Explicit barriers that change resource states and introduce dependencies in the graphics pipelines are an important part of both D3D12 and Vulkan APIs. They allow applications to minimize the number of pipeline dependencies and bubbles and to optimize their placement. They reduce CPU overhead at the same time by removing that logic from the driver. That’s relevant mostly to tight render loops that draw lots of geometry. Most of the time, especially when writing new rendering code, dealing with barriers is just annoying and bug-prone.

NVRHI implements a system that tracks the state of each resource and, optionally, each subresource per command list. When a command interacts with a resource, the resource is transitioned into the state required for that command, if it’s not already in that state. For example, a writeTexture command transitions the texture into the CopyDest state, and a subsequent draw operation that reads from the texture transitions it into the ShaderResource state.

Special handling is applied when a resource is in the UnorderedAccess state for two consecutive commands: there is no transition involved, but a UAV barrier is inserted between the commands. It is possible to disable the insertion of UAV barriers temporarily, if necessary.

I said earlier that NVRHI tracks the state of each resource per command list. An application may record multiple command lists in any order or in parallel and use the same resource differently in each command list. Therefore, you can’t track the resource states globally or per-device because the barriers need to be derived while the command lists are being recorded. Global tracking may not happen in the same order as actual resource usage on the device command queue when the command lists are executed.

So, you can track resource states in each command list separately. In a sense, this can be viewed as a differential equation. You know how the state changes inside the command list, but you don’t know the boundary conditions, that is, which state each resource is in when you enter and exit the command list in their order of execution.

The application must provide the boundary conditions for each resource. There are two ways to do that:

  • Explicit: Use the beginTrackingTextureState and beginTrackingBufferState functions after opening the command list and the setTextureState and setBufferState functions before closing it.
  • Automatic: Use the initialState and keepInitialState fields of the TextureDesc and BufferDesc structures when creating the resource. Then, each command list that uses the resource assumes that it’s in the initial state upon entering the command list and transitions it back into the initial state before leaving the command list.

Here, you might wonder about avoiding the CPU overhead of resource state tracking, or manually optimizing barrier placement. Well, you can! The command lists have the setEnableAutomaticBarriers function that can completely disable automatic barriers. In this mode, use the setTextureState and setBufferState functions where a barrier is necessary. It still uses the same state tracking logic but potentially at a lower frequency.
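As a rough sketch, manual barrier placement might look like the following. It uses only the functions named above plus a commitBarriers call, which I assume flushes the pending transitions; colorBuffer and vertexBuffer are placeholder resources.

commandList->open();
commandList->setEnableAutomaticBarriers(false);

// Declare the states the resources must be in before the next draw calls.
commandList->setTextureState(colorBuffer, nvrhi::AllSubresources,
    nvrhi::ResourceStates::RenderTarget);
commandList->setBufferState(vertexBuffer, nvrhi::ResourceStates::VertexBuffer);
commandList->commitBarriers(); // assumption: issues the accumulated transitions

// ... draw calls using colorBuffer and vertexBuffer ...

commandList->close();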

Upload management

NVRHI automates another aspect of modern graphics APIs that is often annoying to deal with. That’s the management of upload buffers and the tracking of their usage by the GPU.

Typically, when a texture or buffer must be updated from the CPU every frame or multiple times per frame, a staging buffer is allocated whose size is several times larger than the resource memory requirements, which allows multiple frames to be in flight on the GPU. Alternatively, portions of a large staging buffer are suballocated at run time. It is possible to implement the same strategy using NVRHI, but there is a built-in implementation that works well for most use cases.

Each NVRHI command list has its own upload manager. When writeBuffer or writeTexture is called, the upload manager tries to find an existing buffer that is no longer used by the GPU that can fit the necessary data. If no such buffer is available, a new buffer is created and added to the upload manager’s pool. The provided data is copied into that buffer, and then a copy command is added to the command list. The tracking of which buffers are used by the GPU is performed automatically.

ConstantBufferStruct myConstants;
myConstants.member = value;
  
// This is all that's necessary to fill the constant buffer with data and have it ready for rendering.
commandList->writeBuffer(constantBuffer, &myConstants, sizeof(myConstants)); 

The upload manager never releases its buffers, nor shares them with other command lists. If an application performs a large number of uploads, such as during scene loading, and then switches to a less upload-intensive mode of operation, it’s better to create a separate command list for the uploading activity and release it when the uploads are done. That releases the upload buffers associated with the command list.

It’s not necessary to wait for the GPU to finish copying data from the upload buffers. The resource lifetime tracking system described earlier does not release the upload buffers until the copies are done.
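A sketch of the separate upload command list pattern described above might look like the following; the createCommandList call and the writeTexture arguments follow the NVRHI API as I understand it, and sceneTextures is hypothetical application data.

// Use a dedicated command list for bulk uploads during loading.
nvrhi::CommandListHandle uploadCommandList = device->createCommandList();

uploadCommandList->open();
for (const auto& item : sceneTextures) // hypothetical application data
    uploadCommandList->writeTexture(item.texture, /* arraySlice = */ 0, /* mipLevel = */ 0,
        item.pixels, item.rowPitch);
uploadCommandList->close();
device->executeCommandList(uploadCommandList);

// Dropping the handle releases the upload manager and its staging buffers
// once garbage collection determines that the GPU has finished with them.
uploadCommandList = nullptr;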

Interaction with graphics APIs

Sometimes, it is necessary to escape the abstraction layers and do something with the underlying graphics API directly. Maybe you have to use some feature that is not supported by NVRHI, demonstrate some API usage in a sample application, or make the portable rendering code work with a native resource coming from elsewhere. NVRHI makes it relatively easy to do these things.

Every NVRHI object has a getNativeObject function that returns an underlying API resource of the necessary type. The expected type is passed to that function, and it only returns a non-NULL value if that type is available, to provide some type safety.

Supported types include interfaces like ID3D11Device or ID3D12Resource and handles like vk::Image. In addition, the NVRHI texture objects have a getNativeView function that can create and return texture views, such as SRV or UAV.

For example, to issue some native D3D12 rendering commands in the middle of an NVRHI command list, you might use code like the following example:

ID3D12GraphicsCommandList* d3dCmdList = nvrhiCommandList->getNativeObject(
       nvrhi::ObjectTypes::D3D12_GraphicsCommandList);
  
D3D12_CPU_DESCRIPTOR_HANDLE d3dTextureRTV = nvrhiTexture->getNativeView(
       nvrhi::ObjectTypes::D3D12_RenderTargetViewDescriptor);
  
const float clearColor[4] = { 0.f, 0.f, 0.f, 0.f };
d3dCmdList->ClearRenderTargetView(d3dTextureRTV, clearColor, 0, nullptr); 

Shader permutations

The final productivity feature to mention here is the batch shader compiler that comes with NVRHI. It is an optional feature, and NVRHI can be completely functional without it. NVRHI accepts shaders compiled through other means. Still, it is a useful tool.

It is often necessary to compile the same shader with multiple combinations of preprocessor definitions. However, the native tools that Visual Studio provides for shader compilation, for example, do not make this task easy at all.

The NVRHI shader compiler solves exactly this problem. Driven by a text file that lists the shader source files and compilation options, it generates option permutations and calls the underlying compiler (DXC or FXC) to generate the binaries. The binaries for different versions of the same shader are then packaged into one file of a custom chunk-based format that can be processed using functions provided by NVRHI.

The application can load the file with all the shader permutations and pass it to nvrhi::utils::createShaderPermutation or nvrhi::utils::createShaderLibraryPermutation, along with the list of preprocessor definitions and their values. If the requested permutation exists in the file, the corresponding shader object is created. If it doesn’t, an error message is generated.

In addition to permutation processing, the shader compiler has other nice features. First, it scans the source files to build a tree of headers included in each one. It detects if any of the headers have been modified, and whether a particular shader must be rebuilt. Second, it can build all the outdated shaders in parallel using all available CPU cores.

Conclusion

In this post, I covered some of the most important features of NVRHI that, in my opinion, make it a pleasure to use. For more information about NVRHI, see the NVIDIAGameWorks/nvrhi GitHub repo, which includes a tutorial and a more detailed programming guide. The Donut Examples repository on GitHub has several complete applications written with NVRHI.

If you have questions about NVRHI, post an issue on GitHub or send me a message on Twitter (@more_fps).

Categories
Misc

How do I use Autograph and TensorArrays correctly?

The following python snippet

@tf.function
def testfn(arr):
    if arr.size() == 0:
        raise ValueError('arr is empty')
    else:
        return tf.constant(True)

arr = tf.TensorArray(tf.float32, 3)
arr = arr.unstack([tf.constant(range(5), dtype=tf.float32) for i in range(3)])
tf.print('Size:', arr.size())  # "Size: 3"
arrVal = testfn(arr)  # Unexpected(?) ValueError: arr is empty

produces the following output

Size: 3
---------------------------------------------------------------------------
ValueError                                Traceback (...)
ValueError: in user code:

    <ipython-input-72-47c594b1a619>:4 testfn  *
        if arr.size() == 0: raise ValueError('arr is empty')

    ValueError: arr is empty

and I don’t understand why. When I print the size of the array I get 3. When I call testfn without the decorator, python also evaluates arr.size() to 3 in eager mode. But why does Autograph compile the array to something of size 0? And more importantly, how do I solve this problem? (I am very new to tensorflow)

submitted by /u/Dubmove
[visit reddit] [comments]

Categories
Misc

Auto trigger bot for CS:GO in Tensorflow Keras

Don’t worry, this bot is absolutely abysmal in online games if anyone tried to use it for that purpose. The detection is nowhere near as fast as the human eye-to-finger reflex, even if you are a complete “noob”.

The reason I made this series is that not only was it fun and unusual for me, maybe even uncharted territory as far as I am aware, but also I felt it would be a fun way for young people to get started in learning about basic neural networks applied to computer vision.

The project includes two rather large datasets for CS:GO, one in 92×192 pixels and the other is 28×28 pixels. So if you are looking for unusual or “exotic” datasets to work with you may find this an interesting project.

The whole series can be found here:
https://james-william-fletcher.medium.com/list/fps-machine-learning-autoshoot-bot-for-csgo-100153576e93

It documents my journey through creating very basic neural networks in C and working my way up all the way to a CNN fully programmed and trained in C to then using a mixture of Tensorflow Keras and C. It covers many different aspects of working with neural networks, the intricacies of exporting Keras FNN into C and using Keras CNN models from C using a bridging daemon.

Maybe some of you will find it interesting or informative and I am always welcome to even the harshest of criticism.

submitted by /u/SirFletch
[visit reddit] [comments]

Categories
Misc

Translate a tic tac toe board to text

Let’s say you have a picture of a tic tac toe board after a game is done. I’m trying to make something that would be able to translate the board to text.

I found this article which seemed to be pretty close but not exactly, also it’s not using Tensorflow.

My assumption is there would have to be some way to split the image into each of the nine different spaces, then check each space for either X,O, or BLANK. That’s the part I’m trying to wrap my head around. I’ve heard of stuff like image segmentation but not sure if that’s what I need.

Thanks!

submitted by /u/ThomasWaldick
[visit reddit] [comments]

Categories
Misc

NVIDIA Announces Upcoming Events for Financial Community

SANTA CLARA, Calif., Aug. 19, 2021 (GLOBE NEWSWIRE) — NVIDIA will present at the following events for the financial community: BMO 2021 Technology Summit Tuesday, Aug. 24, at 10 a.m. Pacific …

Categories
Misc

Meet the Researcher: Peerapon Vateekul, Deep Learning Solutions for Medical Diagnosis and NLP


‘Meet the Researcher’ is a series spotlighting researchers in academia who use NVIDIA technologies to accelerate their work. 

This month’s spotlight features Peerapon Vateekul, assistant professor at the Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University (CU), Thailand. Vateekul drives collaboration activities between CU and the NVIDIA AI Technology Center (NVAITC) including seminars and workshops on joint applied research in Medical AI and NLP. He has been collaborating with NVIDIA since 2016 and became a university ambassador and certified DLI instructor for NVIDIA in 2018.

What are your research areas of focus?

My research focuses on interdisciplinary data analysis, applying machine learning techniques and a deep learning approach, to various domains. This includes AI-assisted medical diagnosis, hydrometeorology, geoinformatics, NLP, and finance. Some of my recent work focuses on medical diagnoses, such as working on AI-assisted solution polyp detection in colonoscopies in real time.

For NLP, my research group recently presented a research project that deploys software agents equipped with natural language understanding capabilities to read scholarly publications on the web.

When did you know that you wanted to be a researcher, that you wanted to pursue this field?

I realized that I wanted to be a researcher in the machine learning domain when I pursued my master’s degree. It was even clearer to me after I returned to Thailand and joined the Department of Computer Engineering at CU. I had the chance to collaborate with many researchers and professors from various schools. It felt impactful to be able to apply machine learning techniques to solve real-world problems.

 What is the impact of your work on the field/community/world?

I tackle real-world problems. AI-assisted telemedicine has played an important role in the pandemic era, so I want to develop models that can help doctors diagnose patients more accurately and efficiently.

For example, our research team is now implementing a real-time polyp detection solution for colonoscopies that can be deployed in an operating room. In this work, we have overcome the real-time inference constraint with both software and hardware solutions. On the hardware side, we used a medical computer with an NVIDIA GeForce RTX GPU alongside a video switcher to analyze and render the video in real time.

How do you use NVIDIA technology in your research?

I have been using NVIDIA technology in two ways. First, a powerful server with NVIDIA GPUs is crucial to my research. Second, I have been using NVIDIA SDKs and pretrained models from NVIDIA NGC in my research work. For example, I have collaborated with Prof. Sira Sriswasdi from the medical school at CU. We aim to improve COVID-19 diagnosis and prognosis prediction by using multi-zone lung segmentation in chest x-ray images. We’re using the NVIDIA Clara SDK, a complete solution for AI-assisted healthcare that also provides federated learning to train a model across multiple sites (hospitals) without sharing sensitive patient data.

What’s next for your research? 

The next step in my research is to translate the research work into a product (software) that can be used in real-world scenarios. Furthermore, I plan to extend my current work in many directions. For example, I plan to extend the solution to support other parts of the gastrointestinal tract. Apart from polyp detection in colonoscopy, we can train a model to segment gastric intestinal metaplasia areas in gastroscopy.

Also, we are working to extend our scientific research system, ESRA, which now supports only publications in the computer science domain, to other domains, including bioinformatics.

Any advice for new researchers, especially to those who are inspired and motivated by your work?

For me, building successful research requires three main factors: suitable machine learning techniques, training data along with domain experts, and a powerful GPU server. In addition, it is important to be part of the research community: AI-related technologies change rapidly, and staying involved in the community helps you keep your knowledge up to date.

To learn more about the work that Peerapon Vateekul and his group is doing, visit his academia webpage.

Categories
Misc

Offloading and Isolating Data Center Workloads with NVIDIA Bluefield DPU

The Data Processing Unit, or DPU, has recently become popular in data center circles. But not everyone agrees on what tasks a DPU should perform or how it should do them. Idan Burstein, DPU Architect at NVIDIA, presents the applications and use cases that drive the architecture of the NVIDIA BlueField DPU.

Today’s data centers are evolving rapidly and require new types of processors called data processing units (DPUs). The new requirements demand a specific type of DPU architecture, capable of offloading, accelerating, and isolating specific workloads. On August 23 at the Hot Chips 33 conference, NVIDIA silicon architect Idan Burstein discusses changing data center requirements and how they have driven the architecture of the NVIDIA BlueField DPU family.

Why is a DPU needed?

Data centers today have changed from running applications in silos on dedicated server clusters. Now, resources such as CPU compute, GPU compute, and storage are disaggregated so that they can be composed (allocated and assembled) as needed. They are then recomposed (reallocated) as the applications and workloads change.

GPU-accelerated AI is becoming mainstream and enhancing myriad business applications, not just scientific applications. Servers that were primarily virtualized are now more likely to run in containers on bare metal servers, which still need software-defined infrastructure even though they no longer have a hypervisor or VMs. Cybersecurity tools such as firewall agents and anti-malware filters must run on every server to support a zero-trust approach to information security. These changes have huge consequences for the way networking, security, and management need to work, driving the need for DPUs in every server.

The best definition of the DPU’s mission is to offload, accelerate, and isolate infrastructure workloads.

  • Offload: Take over infrastructure tasks from the server CPU so more CPU power can be used to run applications.
  • Accelerate: Run infrastructure functions more quickly than the CPU can, using hardware acceleration in the DPU silicon.
  • Isolate: Move key data plane and control plane functions to a separate domain on the DPU, both to relieve the server CPU from the work and to protect the functions in case the CPU or its software are compromised.

A DPU should be able to do all three tasks.

Diagram shows evolution of modern servers, from the model on the left, which shows all infrastructure tasks running in software on the server’s CPU cores, to the model on the right, which shows infrastructure tasks offloaded to and accelerated by the DPU. This change frees up many server CPU cores to run application workloads.
Figure 1. Data centers evolve to be software-defined, containerized, and composable. Offloading infrastructure tasks to the DPU improves server performance, efficiency, and security.

Moving CPU cores around is not enough

One approach tried by some DPU vendors is to place a large number of CPU cores on the DPU to offload the workloads from the server CPU. Whether these are Arm, RISC, X86, or some other type of CPU core, the approach is fundamentally flawed because the server’s CPUs or GPUs are already efficient for CPU-optimal or GPU-optimal workloads. While it’s true that Arm (or RISC or other) cores on a DPU might be more power efficient than a typical server CPU, the power savings are not worth the added complexity unless the Arm cores have an accelerator for that specific workload.

In addition, servers built on Arm CPUs are already available, for example, Amazon EC2 Graviton-based instances, Oracle A1 instances, or servers built on Ampere Computing’s Altra CPUs and Fujitsu’s A64FX CPUs. Applications that run more efficiently on Arm can already be deployed on server Arm cores. They should only be moved to DPU Arm cores if it’s part of the control plane or an infrastructure application that must be isolated from the server CPU.

Offloading a standard application workload from n server X86 cores to n or 2n Arm cores on a DPU doesn’t make technical or financial sense. Neither does offloading AI or serious machine learning workloads from server GPUs to DPU Arm cores. Moving workloads from a server’s CPU and GPU to the DPU’s CPU without any acceleration is at best a shell game and at worst decreases server performance and efficiency.

Diagram shows why naively moving application or infrastructure workloads from the server CPU to the DPU CPU without suitable hardware acceleration does not provide any benefits to performance or efficiency. It shows that such a split merely moves the CPU workload around so what previously ran on 30 CPU cores now requires 36 cores, 18 CPU cores and 18 DPU cores.
Figure 2. Moving application workloads from the server’s CPU cores to the DPU’s CPU cores without acceleration doesn’t provide any benefits, unless those workloads must be isolated from the server CPU domain.

Best types of acceleration for a DPU

It’s clear that a proper DPU must use hardware acceleration to add maximum benefit to the data center. But what should it accelerate? The DPU is best suited for offloading workloads involving data movement and security. For example, networking is an ideal task to offload to DPU silicon, along with remote direct memory access (RDMA), which accelerates data movement between servers for AI, HPC, big data, and storage workloads.

When the DPU has acceleration hardware for specific tasks, it can offload and run those with much higher efficiency than a CPU core. A properly designed DPU can perform the work of 30, 100, or even 300 CPU cores when the workload meets the DPU’s hardware acceleration capabilities.

The DPU’s CPU cores are ideal for running control plane or security workloads that must be isolated from the server’s application and OS domain. For example, in a bare metal server, the tenants don’t want a hypervisor or VM running on their server to do remote management, telemetry, or security, because it hurts performance or may interfere with their applications. Yet the cloud operator still needs the ability to monitor the server’s performance and to detect, block, or isolate security threats if they invade that server.

A DPU can run this software in isolation from the application domain, providing security and control while not interfering with the server’s performance or operations.

Learn more at Hot Chips

To learn more about how the NVIDIA BlueField DPU chip architecture meets the performance, security, and manageability requirements of modern data centers, attend Idan Burstein’s session at Hot Chips 33. Idan explores what DPUs should offload or isolate, and explains what current and upcoming NVIDIA DPUs accelerate, allowing them to improve performance, efficiency, and security in modern data centers.

Categories
Misc

Is IoT Defining Edge Computing? Or is it the Other Way Around?

Edge computing is quickly becoming standard technology for organizations heavily invested in IoT, allowing organizations to process more data and generate better insights.

The only thing more impressive than the growth of IoT in the last decade is the predicted explosive growth of IoT in the next decade. Up from 46 billion devices in 2021, ARM predicts that one trillion IoT devices will be produced by 2035.

That’s over 100 IoT devices for every person on earth. The impact of this growth is amazing. As these devices continue to become smarter and more capable, organizations are finding creative new uses, as well as locations for these devices to operate.   

With IoT spending predicted to hit $1 trillion in 2022, companies are seeing the value of IoT as an investment. That’s because every location in which IoT devices are present has the potential to become a data collection site, providing invaluable insights for virtually every industry. With new and more accurate insights, retailers can reduce shrinkage and streamline distribution system processes, manufacturers can detect visual anomalies on high-speed product lines, and hospitals can provide contact-free patient interactions.  

What is AI for IoT?

Organizations have rallied around the power of vision to generate insights from IoT devices. Why? 

Computer vision is a broad term for the work done with deep neural networks to develop human-vision capabilities for applications. It uses images and videos to automate tasks and generate insights. Devices, infrastructure, and spaces can leverage this power to enhance their perception, in much the same way the field of robotics has benefited from the technology. 

While every computer vision setup is different, they all have one thing in common: they generate a ton of data. IDC predicts that IoT devices alone will generate over 90 zettabytes of data. The typical smart factory generates about 5 petabytes of video data per day and a smart city could generate 200 petabytes of data per day.

The sheer number of devices installed and the amount of data collected are putting a strain on traditional cloud and data center infrastructure, because computer vision algorithms running in the cloud cannot process data fast enough to return real-time insights. For many organizations, high latency presents a significant safety concern.

Take the example of an autonomous forklift in a fulfillment center for a major retailer. The forklift uses a variety of sensors to perceive the world around it, making decisions off of the data it collects. It understands where it can and cannot drive, it can identify objects to move around the warehouse, and it knows when to stop abruptly to avoid colliding with a human worker in its path. 

If the forklift sends data to the cloud, waits for it to be processed, and insights sent back to then act on, the forklift might not be able to stop in time to avoid a collision with a human worker. 

In addition to latency concerns, sending the massive amount of data collected by IoT devices to the cloud for processing is extremely costly. This high cost is why only 25% of IoT data gets analyzed*. In 451 Research’s study “Voice of the Enterprise: Internet of Things, Organizational Dynamics – Quarterly Advisory Report,” respondents admitted to storing only about half of the IoT data they create, and to analyzing only about half of the data they store. By choosing not to process data due to high transit costs, organizations are neglecting valuable insights that could have a significant impact on their business.

These are some of the reasons why organizations have started using edge computing. 

What is edge computing and its importance for IoT?

Edge computing is the concept of capturing and processing data as close to the source of the data as possible. This is done by deploying servers or other hardware to process data at the physical location of the IoT sensors. Since edge computing processes data locally—on the “edge” of a network, instead of in the cloud or a data center—it minimizes latency and data transit costs, allowing for real-time feedback and decision-making. 

Edge computing allows organizations to process more data and generate more complete insights, which is why it is quickly becoming standard technology for organizations heavily invested in IoT. In fact, IDC reports that the edge computing market will be worth $34 billion by 2023.

Although the benefits of edge computing for AI applications using IoT are tangible, the combination of edge and IoT solutions has been an afterthought for many organizations. Ideally, the convergence of these technologies is baked into the design, allowing the full potential of computer vision to be realized and reaching new levels of automation and efficiency.

To learn more about how edge computing works and the benefits of edge computing, read the edge computing introduction post.

Check out Considerations for Deploying AI at the Edge to learn more about the technologies involved in an edge deployment.


* 451 Research “Voice of the Enterprise: Internet of Things, Organizational Dynamics – Quarterly Advisory Report”

Categories
Misc

Analyzing Cassandra Data using GPUs, Part 1


Editor’s Note: Watch the Analysing Cassandra Data using GPUs workshop.

Organizations keep much of their high-speed transactional data in fast NoSQL data stores like Apache Cassandra®. Eventually, requirements emerge to obtain analytical insights from this data. Historically, users have leveraged external, massively parallel processing analytics systems like Apache Spark for this purpose. However, today’s analytics ecosystem is quickly embracing AI and ML techniques whose computation relies heavily on GPUs.

In this post, we explore a cutting-edge approach for processing Cassandra SSTables by parsing them directly into GPU device memory using tools from the RAPIDS ecosystem. This will let users reach insights faster with less initial setup and also make it easy to migrate existing analytics code written in Python.

In this first post of a two-part series, we will take a quick dive into the RAPIDS project and explore a series of options to make data from Cassandra available for analysis with RAPIDS. Ultimately we will describe our current approach: parsing SSTable files in C++ and converting them into a GPU-friendly format, making the data easier to load into GPU device memory.

If you want to skip the step-by-step journey and try out sstable-to-arrow now, check out the second post.

What is RAPIDS?

RAPIDS is a suite of open source libraries for doing analytics and data science end-to-end on a GPU. It emerged from CUDA, a developer toolkit from NVIDIA that lets developers take advantage of their GPUs.

RAPIDS takes common AI/ML APIs like pandas and scikit-learn and makes them available for GPU acceleration. Data science, and particularly machine learning, involves numerous parallel calculations, which makes it well suited to run on a GPU, which can “multitask” at a few orders of magnitude higher throughput than current CPUs (image from rapids.ai):

Figure 1: GPU vs CPU parallelism (image from rapids.ai).

Once we get the data on the GPU in the form of a cuDF (essentially the RAPIDS equivalent of a pandas DataFrame), we can interact with it using an almost identical API to the Python libraries you might be familiar with, such as pandas, scikit-learn, and more, as shown in the images from RAPIDS below:

Figures 2 and 3: The cuDF API closely mirrors pandas (images from RAPIDS).

Note the use of Apache Arrow as the underlying memory format. Arrow is based on columns rather than rows, which makes analytic queries faster. It also comes with an inter-process communication (IPC) mechanism used to transfer an Arrow record batch (that is, a table) between processes. The IPC format is identical to the in-memory format, which eliminates any extra copying or deserialization costs and gets us some extremely fast data access.
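For illustration, here is a minimal pyarrow sketch (independent of Cassandra) of writing a table to the IPC stream format and reading it back:

Python:

import pyarrow as pa

# build a small in-memory table
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# serialize it with the Arrow IPC stream format
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# read it back; the stream format matches the in-memory format,
# so there is no expensive deserialization step
with pa.ipc.open_stream(buf) as reader:
    roundtrip = reader.read_all()

assert roundtrip.equals(table)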

The benefits of running analytics on a GPU are clear. All you need is the proper hardware, and you can migrate existing data science code to run on the GPU simply by finding and replacing the names of Python data science libraries with their RAPIDS equivalents.

How do we get Cassandra data onto the GPU?

Over the past few weeks, I have been looking at five different approaches, listed in order of increasing complexity below:

  • Fetch the data using the Cassandra driver, convert it into a pandas DataFrame, and then turn it into a cuDF.
  • Same as the preceding, but skip the pandas step and transform data from the driver directly into an Arrow table.
  • Read SSTables from the disk using Cassandra server code, serialize it using the Arrow IPC stream format, and send it to the client.
  • Same as approach 3, but use our own parsing implementation in C++ instead of using Cassandra code.
  • Same as approach 4, but use GPU vectorization with CUDA while parsing the SSTables.

First, I will give a brief overview of each of these approaches, then go through a comparison at the end and explain our next steps.

Fetch data using the Cassandra driver

This approach is quite simple because you can use existing libraries without having to do too much hacking. We grab the data from the driver, setting session.row_factory to our pandas_factory function to tell the driver how to transform the incoming data into a pandas.DataFrame. Then, it is a simple matter to call the cudf.DataFrame.from_pandas function to load our data onto the GPU, where we can then use the RAPIDS libraries to run GPU-accelerated analytics.

The following code requires you to have access to a running Cassandra cluster. See the DataStax Python Driver docs for more info. You will also want to install the required Python libraries with Conda:

Bash:

conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql cudf pyarrow pandas numpy cassandra-driver

Python:

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

import pandas as pd
import pyarrow as pa
import cudf
from blazingsql import BlazingContext

import config

# connect to the Cassandra server in the cloud and configure the session settings
cloud_config = {
        'secure_connect_bundle': '/path/to/secure/connect/bundle.zip'
}
auth_provider = PlainTextAuthProvider(user='your_username_here', password='your_password_here')
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

def pandas_factory(colnames, rows):
    """Read the data returned by the driver into a pandas DataFrame"""
    return pd.DataFrame(rows, columns=colnames)
session.row_factory = pandas_factory

# run the CQL query and get the data
result_set = session.execute("select * from your_keyspace.your_table_name limit 100;")
df = result_set._current_rows # a pandas dataframe with the information
gpu_df = cudf.DataFrame.from_pandas(df) # transform it into memory on the GPU

# do GPU-accelerated operations, such as SQL queries with blazingsql
bc = BlazingContext()
bc.create_table("gpu_table", gpu_df)
bc.describe_table("gpu_table")
result = bc.sql("SELECT * FROM gpu_table")
print(result)

Fetch data using the Cassandra driver directly into Arrow

This step is identical to the previous one, except we can switch out pandas_factory with the following arrow_factory:

Python:

def get_col(col):
    rtn = pa.array(col) # automatically detects the type of the array

    # for a full implementation, we would want to check which arrow types
    # need to be manually casted for compatibility with cudf
    if pa.types.is_decimal(rtn.type):
        return rtn.cast('float32')
    return rtn

def arrow_factory(colnames, rows):
    # convert from the row format passed by
    # CQL into the column format of arrow
    cols = [get_col(col) for col in zip(*rows)]
    table = pa.table({ colnames[i]: cols[i] for i in range(len(colnames)) })
    return table

session.row_factory = arrow_factory

We can then fetch the data and create the cuDF in the same way.
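Concretely, the tail end of the first snippet changes to something like the following sketch, which uses cudf.DataFrame.from_arrow to move the Arrow table onto the GPU (session and the table name are reused from the earlier code):

Python:

# run the CQL query; with arrow_factory installed, the driver now returns a pyarrow.Table
result_set = session.execute("select * from your_keyspace.your_table_name limit 100;")
table = result_set._current_rows

# load the Arrow table directly into GPU memory as a cuDF DataFrame
gpu_df = cudf.DataFrame.from_arrow(table)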

However, both of these approaches have a major drawback: they rely on querying the existing Cassandra cluster, which we don’t want to do because the read-heavy analytics workload might affect the transactional production workload, where real-time performance is key.

Instead, we want to see if there is a way to get the data directly from the SSTable files on the disk without going through the database. This brings us to the next three approaches.

Read SSTables from the disk using Cassandra server code

Probably the simplest way to read SSTables on disk is to use the existing Cassandra server technologies, namely SSTableLoader. Once we have a list of partitions from the SSTable, we can manually transform the data from Java objects into Arrow Vectors corresponding to the columns of the table. Then, we can serialize the collection of vectors into the Arrow IPC stream format and then stream it in this format across a socket.

The code here is more complex than the previous two approaches and less developed than the next approach, so I have not included it in this post. Another drawback is that although this approach can run in a separate process or machine from the Cassandra cluster, using SSTableLoader requires initializing embedded Cassandra in the client process first, which takes a considerable amount of time on a cold start.

Use a custom SSTable parser

To avoid initializing Cassandra, we developed our own custom implementation in C++ for parsing the binary SSTable data files. More information about this approach can be found in the next blog post. A guide to the Cassandra storage engine by The Last Pickle helped a lot when deciphering the data format. We decided to use C++ as the language for the parser to anticipate eventually bringing in CUDA, and also for the low-level control needed to handle binary data.

Integrate CUDA to speed up table reads

We plan to start working on this approach once the custom parsing implementation becomes more comprehensive. Taking advantage of GPU vectorization should greatly speed up the reading and conversion processes.

Comparison

At the current stage, we are mainly concerned with the time it takes to read the SSTable files. For approaches 1 and 2, we can’t actually measure this time fairly, because 1) the approach relies on additional hardware (the Cassandra cluster) and 2) there are complex caching effects at play within Cassandra itself. However, for approaches 3 and 4, we can perform simple introspection to track how much time the program takes to read the SSTable file from start to finish.

Here are the results against datasets with 1k, 5k, 10k, 50k, 100k, 500k, and 1M rows of data generated by NoSQLBench:

Figure 4: Time to read SSTable files with the Cassandra-based reader (approach 3) and the custom C++ parser (approach 4) across dataset sizes.

As the graph shows, the custom implementation is slightly faster than the existing Cassandra implementation, even without any additional optimizations such as multithreading.

Conclusion

Given that data access patterns for analytical use cases usually include large scans and often read entire tables, the most efficient way to get at this data is not through CQL but by accessing SSTables directly. We were able to implement an SSTable parser in C++ that can do this and convert the data to Apache Arrow so that it can be leveraged by analytics libraries, including NVIDIA’s GPU-powered RAPIDS ecosystem. The resulting open source (Apache 2 licensed) project is called sstable-to-arrow and it is available on GitHub and accessible through Docker Hub as an alpha release.

We will be holding a free online workshop, which will go deeper into this project with hands-on examples in mid-August! Sign up here if you are interested.

If you are interested in trying out sstable-to-arrow, look at the second blog post in this two-part series and feel free to reach out to seb@datastax.com with any feedback or questions.