DataBloom - Part 447

Misc

Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2

Post date July 27, 2021

In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.

GPU Big Data Benchmark

To measure the performance impact of the new stream-ordered allocator in a real application, here are results from the RAPIDS GPU Big Data Benchmark (gpu-bdb). gpu-bdb is a benchmark of 30 queries representing real-world data science and machine learning workflows at various scale factors: SF1000 is 1 TB of data and SF10000 is 10 TB. Each query is, in fact, a model workflow that can include SQL, user-defined functions, careful subsetting and aggregation, and machine learning.

Figure 1 shows the performance of cudaMallocAsync compared to cudaMalloc for a subset of gpu-bdb queries conducted at SF1000 on an NVIDIA DGX-2 across 16 V100 GPUs. As you can see, thanks to memory reuse and eliminating extraneous synchronization, there’s a 2–5x improvement in end-to-end performance when using cudaMallocAsync.

Figure 1. Speedup of cudaMallocAsync over cudaMalloc for various queries on the RAPIDS GPU Big Data Benchmark.

Interoperability with cudaMalloc and cudaFree

An application can use cudaFreeAsync to free a pointer allocated by cudaMalloc. The underlying memory is not freed until the next synchronization of the stream passed to cudaFreeAsync.

cudaMalloc(&ptr, size);
kernel<<<..., stream>>>(ptr);
cudaFreeAsync(ptr, stream);
cudaStreamSynchronize(stream); // The memory for ptr is freed at this point

Similarly, an application can use cudaFree to free memory allocated using cudaMallocAsync. However, cudaFree does not implicitly synchronize in this case, so the application must insert the appropriate synchronization to ensure that all accesses to the to-be-freed memory are complete. Any application code that may be intentionally or accidentally relying on the implicit synchronization behavior of cudaFree must be updated.

cudaMallocAsync(&ptr, size, stream);
kernel<<<..., stream>>>(ptr);
cudaStreamSynchronize(stream); // Must synchronize first
cudaFree(ptr);

Multi-GPU access

By default, memory allocated using cudaMallocAsync is accessible from the device associated with the specified stream. Accessing the memory from any other device requires enabling access to the entire pool from that other device. It also requires the two devices to be peer capable, as reported by cudaDeviceCanAccessPeer. Unlike cudaMalloc allocations, cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess have no effect on memory allocated from memory pools.

For example, consider enabling device 4 to access the memory pool of device 3:

cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, 3);
cudaMemAccessDesc desc = {};
desc.location.type = cudaMemLocationTypeDevice;
desc.location.id = 4;
desc.flags = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(mempool, &desc, 1 /* numDescs */);

Access from a device other than the device on which the memory pool resides can be revoked by using cudaMemAccessFlagsProtNone when calling cudaMemPoolSetAccess. Access from the memory pool’s own device cannot be revoked.

Interprocess communication support

Memory allocated using the default memory pool associated with a device cannot be shared with other processes. An application must explicitly create its own memory pools to share memory allocated using cudaMallocAsync with other processes. The following code sample shows how to create an explicit memory pool with interprocess communication (IPC) capabilities:

cudaMemPool_t exportPool;
cudaMemPoolProps poolProps = {};
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;
poolProps.location.type = cudaMemLocationTypeDevice;
poolProps.location.id = deviceId;
cudaMemPoolCreate(&exportPool, &poolProps);

The location type Device and location ID deviceId indicate that the pool memory must be allocated on a specific GPU. The allocation type Pinned indicates that the memory should be non-migratable, also known as non-pageable. The handle type PosixFileDescriptor indicates that the user intends to query a file descriptor for the pool to share it with another process.

The first step to share memory from this pool through IPC is to query the file descriptor that represents the pool:

int fd;
cudaMemAllocationHandleType handleType = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolExportToShareableHandle(&fd, exportPool, handleType, 0);

The application can then share the file descriptor with another process, for example through a UNIX domain socket. The other process can then import the file descriptor and obtain a process-local pool handle:

cudaMemPool_t importPool;
cudaMemAllocationHandleType handleType = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolImportFromShareableHandle(&importPool, &fd, handleType, 0);

The next step is for the exporting process to allocate memory from the pool:

cudaMallocFromPoolAsync(&ptr, size, exportPool, stream);

There is also an overloaded version of cudaMallocAsync that takes the same arguments as cudaMallocFromPoolAsync:

cudaMallocAsync(&ptr, size, exportPool, stream);

After memory is allocated from this pool through either of these two APIs, the pointer can then be shared with the importing process. First, the exporting process gets an opaque handle representing the memory allocation:

cudaMemPoolPtrExportData data;
cudaMemPoolExportPointer(&data, ptr);

This opaque data can then be shared with the importing process through any standard IPC mechanism, such as shared memory, pipes, and so on. The importing process then converts the opaque data into a process-local pointer:

cudaMemPoolImportPointer(&ptr, importPool, &data);

Now both processes share access to the same memory allocation. The memory must be freed in the importing process before it is freed in the exporting process. This ensures that the memory is not reused for another cudaMallocAsync request in the exporting process while the importing process is still accessing the previously shared allocation, which could cause undefined behavior.

The existing function cudaIpcGetMemHandle works only with memory allocated through cudaMalloc and cannot be used on any memory allocated through cudaMallocAsync, regardless of whether the memory was allocated from an explicit pool.

Changing a device pool

If the application expects to use an explicit memory pool most of the time, it can consider setting that as the current pool for the device through cudaDeviceSetMemPool. This enables the application to avoid having to specify the pool argument each time that it must allocate memory from that pool.

cudaDeviceSetMemPool(device, pool);
cudaMallocAsync(&ptr, size, stream); // This now allocates from the earlier pool set instead of the device’s default pool.

This has the advantage that any other function allocating with cudaMallocAsync now automatically uses the new pool as its default. The current pool associated with a device can be queried using cudaDeviceGetMemPool.

Library composability

In general, libraries should not change a device’s pool, as doing so affects the entire top-level application. If a library must allocate memory with different properties than those of the default device pool, it may create its own pool and then allocate from that pool using cudaMallocFromPoolAsync. The library could also use the overloaded version of cudaMallocAsync that takes the pool as an argument.

To make interoperability easier for applications, libraries should consider providing APIs for the top-level application to coordinate the pools used. For example, libraries could provide set or get APIs to enable the application to control the pool in a more explicit manner. The library could also take the pool as a parameter to individual APIs.

Code migration guide

When porting an existing application that uses cudaMalloc or cudaFree to the new cudaMallocAsync or cudaFreeAsync APIs, consider the following guidelines.

Guidelines for determining the appropriate pool:

- The initial default pool is suitable for many applications.
- Today, an explicitly constructed pool is only required to share pool memory across processes with CUDA IPC. This may change with future features.
- For convenience, consider making the explicitly created pool the device’s current pool to ensure that all cudaMallocAsync calls within the process use that pool. This must be done by the top-level application and not by libraries, so as to avoid conflicting with the goals of the top-level application.

Guidelines for setting the release threshold for all memory pools:

The choice of release threshold depends on whether and how a device is shared:
- Exclusive to a single process: Use the maximum release threshold.
- Shared among cooperating processes: Coordinate to use the same pool through IPC or set each process pool to an appropriate value to avoid any one process monopolizing all device memory.
- Shared among unknown processes: If known, set the threshold to the working set size of the application. Otherwise, leave it at zero and use a profiler to determine whether allocation performance is a bottleneck before using a nonzero value.

Guidelines for replacing cudaMalloc with cudaMallocAsync:

- Ensure that all memory accesses are ordered after the stream-ordered allocation.
- If peer access is required, use cudaMemPoolSetAccess, because cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess have no effect on pool memory.
- Unlike cudaMalloc allocations, cudaDeviceReset does not implicitly free pool memory, so it must be explicitly freed.
- If freeing with cudaFree, ensure that all accesses are complete through appropriate synchronization before freeing, as there is no implicit synchronization in this case. Any subsequent code that relied on the implicit synchronization may also have to be updated.
- If memory is shared with another process through IPC, allocate from an explicitly created pool with IPC support and remove all references to cudaIpcGetMemHandle, cudaIpcOpenMemHandle, and cudaIpcCloseMemHandle for that pointer.
- If the memory must be used with GPUDirect RDMA, continue to use cudaMalloc for now because memory allocated through cudaMallocAsync currently does not support it. CUDA aims to support this in the future.
- Unlike memory allocated with cudaMalloc, memory allocated with cudaMallocAsync is not associated with a CUDA context. This has the following implications:
  - Calling cuPointerGetAttribute with the attribute CU_POINTER_ATTRIBUTE_CONTEXT returns null for the context.
  - When calling cudaMemcpy with at least one of the source or destination pointers allocated using cudaMallocAsync, that memory must be accessible from the calling thread’s current context/device. If it’s not accessible from that context or device, use cudaMemcpyPeer instead.

Guidelines for replacing cudaFree with cudaFreeAsync:

- Ensure that all memory accesses are ordered before the stream-ordered deallocation.
- The memory may not be freed back to the system until the next synchronization operation. If the release threshold is set to a nonzero value, the memory may not be freed back to the system until the corresponding pool is explicitly trimmed.
- Unlike cudaFree, cudaFreeAsync does not implicitly synchronize the device. Any code relying on this implicit synchronization must be updated to synchronize explicitly.
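The effect of the release threshold described in these guidelines can be illustrated with a toy model. The sketch below is plain Python, not the CUDA API: the class ToyPool and its methods are hypothetical names, and the model only assumes that a pool keeps freed memory cached, gives unused memory above the threshold back at a synchronization point, and gives back more when it is explicitly trimmed.

```python
class ToyPool:
    """Toy model of a memory pool's release threshold (not the CUDA API)."""

    def __init__(self, release_threshold=0):
        self.release_threshold = release_threshold
        self.reserved = 0  # bytes the pool currently holds from the system
        self.used = 0      # bytes currently handed out to the application

    def malloc(self, size):
        self.used += size
        # Grow the reservation only when cached memory cannot cover the request.
        self.reserved = max(self.reserved, self.used)

    def free(self, size):
        # Freed memory goes back to the pool's cache, not to the system.
        self.used -= size

    def synchronize(self):
        # At a synchronization point, release unused memory above the threshold.
        if self.reserved > self.release_threshold:
            self.reserved = max(self.used, self.release_threshold)

    def trim_to(self, min_bytes_to_keep):
        # Explicit trim: release unused memory down to the requested size.
        self.reserved = min(self.reserved, max(self.used, min_bytes_to_keep))


def reserved_after_sync(release_threshold):
    pool = ToyPool(release_threshold)
    pool.malloc(100)
    pool.free(100)
    pool.synchronize()
    return pool.reserved


print(reserved_after_sync(0))          # threshold 0: memory returned at sync
print(reserved_after_sync(2**64 - 1))  # maximum threshold: memory stays cached
```

With a threshold of zero the toy pool hands everything back at the synchronization point, while with the maximum threshold the 100 bytes stay cached for reuse until trim_to is called.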

Conclusion

The stream-ordered allocator and the cudaMallocAsync and cudaFreeAsync API functions added in CUDA 11.2 extend the CUDA stream programming model by introducing memory allocation and deallocation as stream-ordered operations. This enables allocations to be scoped to the kernels that use them, while avoiding the costly device-wide synchronization that can occur with traditional cudaMalloc/cudaFree.

Furthermore, these API functions add the concept of memory pools to CUDA, enabling the reuse of memory to avoid costly system calls and improve performance. Use the guidelines to migrate your existing code and see how much your application performance improves!

Offsites

Advances in TF-Ranking

Posted by Michael Bendersky and Xuanhui Wang, Software Engineers, Google Research

In December 2018, we introduced TF-Ranking, an open-source TensorFlow-based library for developing scalable neural learning-to-rank (LTR) models, which are useful in settings where users expect to receive an ordered list of items in response to their query. LTR models — unlike standard classification models that classify one item at a time — receive an entire list of items as an input, and learn an ordering that maximizes the utility of the entire list. While search and recommendation systems are the most common applications of LTR models, since its release, we have seen TF-Ranking being applied in diverse domains beyond search, including e-commerce, SAT solvers, and smart city planning.

The goal of learning-to-rank (LTR) is to learn a function f() that takes as an input a list of items (documents, products, movies, etc.) and outputs the list of items in the optimal order (descending order of relevance). Here, green shade indicates item relevance level, and the red item marked with ‘x’ is non-relevant.
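As a rough illustration of this idea (plain Python, not TF-Ranking code), the sketch below orders a list by descending score and measures the utility of the whole list with NDCG, a common ranking metric. The document names, scores, and relevance labels are made-up values, and the choice of NDCG as the utility is an assumption rather than something this post specifies.

```python
import math


def rank_items(items, score):
    """Return the items sorted by descending score (the role of f above)."""
    return sorted(items, key=score, reverse=True)


def dcg(relevances):
    """Discounted cumulative gain of relevance labels in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical model scores and ground-truth relevance labels.
scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}
labels = {"doc_a": 0, "doc_b": 2, "doc_c": 1}

ranked = rank_items(scores, scores.get)
print(ranked)                              # ['doc_b', 'doc_c', 'doc_a']
print(ndcg([labels[d] for d in ranked]))   # 1.0, the ideal ordering
```

A list-level metric like this depends on the positions of all items at once, which is why LTR models are trained on whole lists rather than on one item at a time.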

In May 2021, we published a major release of TF-Ranking that enables full support for natively building LTR models using Keras, a high-level API of TensorFlow 2. Our native Keras ranking model has a brand-new workflow design, including a flexible ModelBuilder, a DatasetBuilder to set up training data, and a Pipeline to train the model with the provided dataset. These components make building a customized LTR model easier than ever, and facilitate rapid exploration of new model structures for production and research. If RaggedTensors are your tool of choice, TF-Ranking now works with them as well. In addition, our most recent release, which incorporates the Orbit training library, contains a long list of advances, the culmination of two and a half years of neural LTR research. Below we share a few of the key improvements available in the latest TF-Ranking version.

Workflow to build and train a native Keras ranking model. Blue modules are provided by TF-Ranking, and green modules are customizable.

Learning-to-Rank with TFR-BERT
Recently, pretrained language models like BERT have achieved state-of-the-art performance on various language understanding tasks. To capture the expressiveness of these models, TF-Ranking implements a novel TFR-BERT architecture that couples BERT with the power of LTR to optimize the ordering of list inputs. As an example, consider a query and a list of n documents that one might like to rank in response to this query. Instead of learning an independent BERT representation for each <query, document> pair, LTR models apply a ranking loss to jointly learn a BERT representation that maximizes the utility of the entire ranked list with respect to the ground-truth labels.

The figure below illustrates this process. First, we flatten a list of n documents to rank in response to a query into a list of <query, document> tuples. These tuples are fed into a pre-trained language model (e.g., BERT). The pooled BERT outputs for the entire document list are then jointly fine-tuned with one of the specialized ranking losses available in TF-Ranking. Our experience shows that this TFR-BERT architecture delivers significant improvements in pretrained language model performance, leading to state-of-the-art performance for several popular ranking tasks, especially when multiple pretrained language models are ensembled. Our users can now get started with TFR-BERT using this simple example.

An illustration of the TFR-BERT architecture, in which a joint LTR model over a list of n documents is constructed using BERT representations of individual <query, document> pairs.
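To make the notion of a ranking loss over an entire list concrete, here is a minimal sketch of a softmax cross-entropy listwise loss in plain Python. It is an illustration only, not TF-Ranking or TFR-BERT code: the function names are hypothetical, the scores stand in for per-document model outputs, and this post does not say which of the available ranking losses a given TFR-BERT model uses.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def softmax_listwise_loss(scores, labels):
    """Cross-entropy between normalized labels and softmax(scores) for one list."""
    total = sum(labels)
    if total == 0:
        return 0.0  # a list with no relevant items carries no signal here
    log_probs = log_softmax(scores)
    return -sum((y / total) * lp for y, lp in zip(labels, log_probs))


# One query with three documents; only the first document is relevant.
labels = [1, 0, 0]
good = softmax_listwise_loss([3.0, 1.0, 0.0], labels)  # relevant doc scored highest
bad = softmax_listwise_loss([0.0, 1.0, 3.0], labels)   # relevant doc scored lowest
print(good < bad)  # True
```

Because the softmax normalizes over the whole list, raising the score of one document changes the loss contribution of every other document in that list, which is what distinguishes a listwise loss from scoring each <query, document> pair independently.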

Interpretable Learning-to-Rank
Transparency and interpretability are important factors in deploying LTR models in ranking systems that can be involved in determining the outcomes of processes such as loan eligibility assessment, advertisement targeting, or guiding medical treatment decisions. In such cases, the contribution of each individual feature to the final ranking should be examinable and understandable to ensure transparency, accountability and fairness of the outcomes.

One possible way to achieve this is using generalized additive models (GAMs) — intrinsically interpretable machine learning models that are linearly composed of smooth functions of individual features. However, while GAMs have been extensively studied on regression and classification tasks, it is less clear how to apply them in a ranking setting. For instance, while GAMs can be straightforwardly applied to model each individual item in the list, modeling both item interactions and the context in which these items are ranked is a more challenging research problem. To this end, we have developed a neural ranking GAM — an extension of generalized additive models to ranking problems.

Unlike standard GAMs, a neural ranking GAM can take into account both the features of the ranked items and the context features (e.g., query or user profile) to derive an interpretable, compact model. This ensures that not only the contribution of each item-level feature is interpretable, but also the contribution of the context features. For example, in the figure below, using a neural ranking GAM makes visible how distance, price, and relevance, in the context of a given user device, contribute to the final ranking of the hotel. Neural ranking GAMs are now available as a part of TF-Ranking.

An example of applying neural ranking GAM for local search. For each input feature (e.g., price, distance), a sub-model produces a sub-score that can be examined, providing transparency. Context features (e.g., user device type) can be utilized to derive importance weights of submodels.
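The additive structure described here can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the TF-Ranking implementation: the sub-models are hand-written functions standing in for small learned networks, and the context logits, feature names, and hotel values are all made up.

```python
import math

# Hypothetical sub-models, one per item feature. In a neural ranking GAM each
# of these would be a small learned network over a single feature.
SUB_MODELS = {
    "price": lambda x: -x / 100.0,
    "distance": lambda x: -x,
    "relevance": lambda x: 2.0 * x,
}

# Hypothetical context-dependent logits from which importance weights are derived.
CONTEXT_LOGITS = {
    "mobile": {"price": 0.0, "distance": 1.0, "relevance": 0.0},
    "desktop": {"price": 1.0, "distance": 0.0, "relevance": 0.0},
}


def importance_weights(context):
    """Softmax over the context logits, so the weights sum to 1."""
    logits = CONTEXT_LOGITS[context]
    z = sum(math.exp(v) for v in logits.values())
    return {name: math.exp(v) / z for name, v in logits.items()}


def gam_score(item, context):
    """Return the final score and the per-feature contributions behind it."""
    weights = importance_weights(context)
    contributions = {name: weights[name] * f(item[name])
                     for name, f in SUB_MODELS.items()}
    return sum(contributions.values()), contributions


hotel = {"price": 120.0, "distance": 1.5, "relevance": 0.8}
for device in ("mobile", "desktop"):
    score, parts = gam_score(hotel, device)
    print(device, round(score, 3), {k: round(v, 3) for k, v in parts.items()})
```

Since the final score is just the sum of the weighted sub-scores, each feature's contribution can be read off directly, and changing the context changes only the weights.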

Neural Ranking or Gradient Boosting?
While neural models have achieved state-of-the-art performance in multiple domains, specialized gradient boosted decision trees (GBDTs) like LambdaMART remained the baseline to beat in a variety of open LTR datasets. The success of GBDTs in open datasets is due to several reasons. First, because these datasets are relatively small, neural models are prone to overfitting on them. Second, since GBDTs partition their input feature space using decision trees, they are naturally more resilient to variations in numerical scales in ranking data, which often contain features with Zipfian or otherwise skewed distributions. However, GBDTs do have their limitations in more realistic ranking scenarios, which often combine both textual and numerical features. For instance, GBDTs cannot be directly applied to large discrete feature spaces, such as raw document text. They are also, in general, less scalable than neural ranking models.

Therefore, since the TF-Ranking release, our team has significantly deepened the understanding of how best to leverage neural models in ranking with numerical features. This culminated in a Data Augmented Self-Attentive Latent Cross (DASALC) model, described in an ICLR 2021 paper, which is the first to establish parity, and in some cases statistically significant improvements, of neural ranking models over strong LambdaMART baselines on open LTR datasets. This achievement is made possible through a combination of techniques, which include data augmentation, neural feature transformation, self-attention for modeling document interactions, listwise ranking loss, and model ensembling similar to boosting in GBDTs. The architecture of the DASALC model was entirely implemented using the TF-Ranking library.
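This post does not spell out the neural feature transformation. One common way to tame features with skewed distributions before feeding them to a neural model is a sign-preserving logarithm, sketched below in plain Python. Treat it as an assumption for illustration; it is not necessarily the exact transformation used in DASALC, and the function name and sample values are hypothetical.

```python
import math


def symmetric_log1p(x):
    """Sign-preserving log transform: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


# Hypothetical raw feature values spanning several orders of magnitude.
raw = [0.0, 3.0, 250.0, 1_000_000.0, -40.0]
print([round(symmetric_log1p(v), 3) for v in raw])
```

The transform is monotonic, so it keeps the ordering of feature values that a decision tree split would rely on, while compressing the numerical range that a neural network has to cope with.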

Conclusion
All in all, we believe that the new Keras-based TF-Ranking version will make it easier to conduct neural LTR research and deploy production-grade ranking systems. We encourage everyone to try out the latest version and follow this introductory example for a hands-on experience. While we are very excited about this new release, our research and development journey is far from over, so we will continue to advance our understanding of learning-to-rank problems and share these advances with our users.

Acknowledgements
This project was only possible thanks to the current and past members of the TF-Ranking team: Honglei Zhuang, Le Yan, Rama Pasumarthi, Rolf Jagerman, Zhen Qin, Shuguang Han, Sebastian Bruch, Nathan Cordeiro, Marc Najork and Patrick McGregor. We also extend special thanks to our collaborators from the TensorFlow team: Zhenyu Tan, Goldie Gadde, Rick Chao, Yuefeng Zhou, Hongkun Yu, and Jing Li.

Offsites

Growing Together (함께 자라기): Can We Grow Together?

Post date July 27, 2021

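We live in an era in which collaboration matters more and more. Every domain and technology keeps getting more specialized and sophisticated, so it is becoming nearly impossible for one person to know all of it. That is why a great team now produces better results than a single genius.

Image source: pixabay

Communication skills are also weighed heavily in interviews. You have probably often heard questions like ‘What did you do when you had difficulty collaborating with a teammate?’ Personally, I prefer the question ‘While working as a team, what have you tried in order to help every team member grow?’ If individual growth is linear, growth as a team can be seen as exponential.

The author of the book introduced here also hopes that, as you read it, your thinking will move toward questions like the following.

Can we really grow together?
Can we really grow together every single day?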

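Growing Together: The Road to Agile (함께 자라기: 애자일로 가는 길)

Image source: Aladin (알라딘), ‘함께 자라기’

The book is ≪함께 자라기≫ (Growing Together) by 김창준, who is known for Agile Consulting. It collects the many pieces on effective learning and collaboration that he had been sharing on his blog, Facebook, and elsewhere. One of its distinguishing features is that it looks at growth and collaboration more concretely and analytically, drawing on research and papers.

Let's look at the contents in more detail. Chapter 1, ‘Growing’ (자라기), covers a variety of topics around growth.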

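Systems

I believe that systems and processes matter. Hiring the right people is important above all, but an organizational system that lets those people make full use of their abilities is just as important.

An organization should do as much as it can to help individuals develop and manage their expertise. That is the win-win path. There is something even more important than training people well after hiring them and helping them grow: the system. However outstanding the people you hire, if the organization's system and culture are flawed they are easily buried, and conversely, even people of ordinary skill can achieve outstanding results within a good system.

From ‘What matters more than hiring well’

Processes and systems correspond to the B and C levels in Douglas Engelbart's words below. By improving one level, or one dimension, higher in this way, an organization can keep developing. In any work, you need to think about what to focus on and when. For example, in a startup it is important to get A work done quickly, whereas a large company should focus on B work, that is, improving its processes so that it can scale faster.

Douglas divides work into three levels: A, B, and C work.
A work is doing what the organization is originally meant to do.
B work is improving A work. It improves time and quality in the product-making cycle.
C work is improving B work. It improves the time and quality of the improvement cycle itself. … In short, it means improving the ability to improve.
As Douglas put it, “The better we get at getting better, the better and faster we will get better.”

From ‘The secret of compound interest’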

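Deliberate Practice

Figure source: 함께 자라기, from ‘Breaking out of standing still’

Deliberate practice is one of the fastest ways to learn at your own skill level. As the figure above shows, the idea is to match ‘task difficulty’ and ‘skill’ at a similar level so that you can become immersed in the work. The book suggests that if a task is too easy, you can set yourself quests to make the problem harder, and if it is too hard, you can get help from the people around you or approach the problem structurally to lower its difficulty.

For practice to be deliberate, my skill and the difficulty of the task need to be similar. This is consistent with Mihaly Csikszentmihalyi's flow theory (what mattered was not which activity people did; whatever they did, satisfaction rose when they did it in a state of flow). … The part we should pay attention to is area C, where difficulty and skill roughly match. Csikszentmihalyi says that this is where people experience flow. It is also where they show their highest level of concentration, and thanks to that, performance and learning ability can reach their peak. He also discovered the interesting fact that people experience their highest level of happiness at that point. The linguist Krashen makes a similar point with his input hypothesis. It is called the i+1 theory: if a language learner's current level is i, language ability advances meaningfully only when the learner is given input at level i+1, exactly one step higher.

From ‘An essential condition of deliberate practice: appropriate difficulty’

Next, Chapter 2, ‘Together’ (함께), covers a variety of topics around collaboration.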

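Psychological Safety

Of the characteristics of successful teams, the one said to matter most is ‘psychological safety.’ There is even a book, ≪두려움 없는 조직≫ (The Fearless Organization), devoted entirely to this one topic. In a way it sounds obvious, but establishing psychological safety within a team is correspondingly hard.

Google, true to being a data-driven company, followed Project Oxygen, which used data to find the traits of outstanding managers, with a two-year effort to find the traits of outstanding teams. It was named Project Aristotle.

How team members interact with one another and how they view their own work mattered far more than who is on the team (experts, introverts or extroverts, intelligence, and so on).

Five characteristics of successful teams were found, and the variable with overwhelmingly the highest predictive power was the team's psychological safety.

Psychological safety could be improved through specially designed activities such as team discussions.

From ‘The secret of outstanding teams revealed by Google’

Psychological safety is usually said to rest on organizational culture, and within that culture it is tied especially to ‘transparency.’ As in the case below, this means disclosing mistakes openly so that everyone can move in a better direction. Beyond that, when information flows transparently inside a company, trust forms between people, and that trust leads directly to psychological safety.

Michael Frese studied error culture in companies. According to him, there are broadly two kinds of error culture: error prevention and error management. Error prevention tries to block the path from action to error. In other words, it demands that people not make mistakes. But this is in fact close to impossible. Even experts are said to make an average of 3 to 5 mistakes per hour. … In an error prevention culture, people who make mistakes are blamed and punished, so people hide their mistakes, are reluctant to discuss them, and cooperate less when problems arise. They do not get to learn from mistakes. In an error management culture, by contrast, people help each other recover quickly before a mistake leads to a bad outcome, disclose mistakes, and talk about them with one another, and an atmosphere of learning from them develops.
This part is extremely important. If you look at the history of error research, it first looked only at the technical side, then at the human side (for example, that 80% turns out to be human error), and now it talks about the cultural side. Psychological safety is part of this culture.

From ‘Two kinds of error culture’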

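Abstraction

Next is a passage about pair programming, which developers often do with one another. This was one of the cases where, although I had done it many times, I did not really know why it was effective until I read this book. Even short of pair programming, we sometimes find a good approach ourselves while explaining a problem to someone. I suspect this, too, is because abstracting in the course of explaining raises our own understanding.

Pair programming is two people programming together on one computer. The more I think about it, the more ingenious this arrangement is. Having two people raises the level of abstraction through conversation. Having one computer forces verification through concretization. Abstraction and concretization alternate frequently. And in between, an “aha” bursts out. … If you want to raise the level of abstraction of the code you write, don't agonize alone. Cooperate and talk with other people. Draw diagrams together and edit source code together. Humans have an amazing ability to communicate and cooperate with other humans. Conversation is a miracle.

From ‘Programming through conversation’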

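Introducing a New Methodology

Many of you have probably had this experience: while working with others, you introduce a new framework, a methodology such as agile, or a tool. Sometimes the adoption goes smoothly, and sometimes you run into objections you did not expect. I don't know what the single best approach is, but you do need to talk enough with your colleagues and understand their needs. Rather than explaining why the tool is good, how about finding out what your colleagues think?

And if, through such conversations, you can act as an intermediary, you will be able to go beyond simply trying to introduce a tool, properly understand what the team needs, and propose a better solution.

It is an illusion to think that spreading new ideas is easy when you are the team lead. … Some of these people have already been collecting objective figures of their own. When I meet them, I ask the following: “How well do you understand the other person? How much have you talked with them?” Nine times out of ten the answer is, “I haven't really talked with that person much.” If so, you should assume that the odds of persuading them will remain low.

From ‘The subjectivity of objectivity’

The more complex the field, the more the therapist effect will matter compared with the effect of any particular technique. So what should we do? We should find the supershrinks, study them, and cultivate them. … With software development methodologies, when we run a new project, isn't who participates overwhelmingly more important than which methodology we use? What do you think? Here is what I think. For example, a team lead who wants to introduce an agile methodology should perhaps first ask, “What kind of team lead am I?”

From ‘Why a new methodology does not work in your organization’

Next is a discussion of whether a team made up of experts is the most effective one. When their fields do not overlap, experts can trust one another's expertise and each produce the best result, but when experts in similar fields work together, people who are used to working individually have to collaborate. In that situation there inevitably seem to be moments when productivity drops, because collaboration takes practice.

How about all-stars in a company? According to research by Groysberg and colleagues, “each time one of these stars is added to a team, the team's additional performance gain shows diminishing marginal returns, and past a certain level it works in the negative direction (that is, it erodes the performance of the whole team).” … The tendency to erode performance was especially pronounced when the experts' areas of expertise were similar to one another. The study cites the experts' egos as one of the causes.

From ‘Why expert teams fail’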

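Agile

The final chapter, Chapter 3, briefly covers agile. In fact, Chapters 1 and 2 simply did not use the word ‘agile’; their topics were already part of agile.

In my experience at work so far, ‘customer participation,’ as in the case below, is more important than anything else. It can take many forms: the customer can help right beside you, you can receive feedback through customer support (CS), or you can conduct interviews. The eye for discerning what customers want is truly rare, so it is important to discover needs through customer participation and develop quickly.

For organizations with low maturity (maturity 4 or below): customer participation (0.94). There is exactly one statistically significant practice: customer participation. And its contribution is 0.94, higher than when we looked at all organizations earlier. It is almost 1. This means that even when maturity is low, doing customer participation well raises project success by one step. … Now look at organizations with high maturity. Short iterative development cycles rank first, contributing more than customer participation. That means that in high-maturity organizations, short iterative development cycles may help success more than customer participation does. It could also mean that short iterative development cycles can compensate to some extent when customer participation is not going well.

From ‘If maturity is low, customer participation is essential’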

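In Closing

Image source: 존잡생각 Ep.18, How to grow yourself quickly at a company – People Scaling

While writing this post and thinking about collaboration, I was reminded of something covered on 존잡생각, the YouTube channel of Sendbird CEO 김동선, which I have been watching often lately. I think the sentence in the image above is the core element of collaboration. Weaknesses that cause problems should be fixed, but fundamentally the idea is to build on each individual's strengths so that the team's combined result is as large as it can be.

I hope we can grow together like this, in the direction in which the whole team grows!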

Misc

Using the MNIST database to recognize numbers

Post date July 27, 2021

I'm trying to use the MNIST database of handwritten numbers to make a neural network that predicts a number. The issue is that the validation accuracy is very high, around 98%, but when I try to apply the model to numbers outside of the database, the accuracy is very low. I got an accuracy of 50% with a dense network and 60% with a CNN. I'm not sure what the issue is or how I can make the network work more consistently with new data that I give it. Any advice would be appreciated. If needed, I can send the code and test images.

submitted by /u/UnreadyDog

Misc

Self supervision using tensorflow and keras

Post date July 27, 2021

I have a self-supervised problem where I have a sequence of web page visits, each of which is denoted by some hash, and I can look up this hash to find a dense vector that represents the page.

I would like to train my model on various self supervised tasks, such as predicting a target page vector, given some context, or predicting the next vector in the sequence.

My question here is what is the best approach for implementation of such a problem? The naive approach would be to prep the data in a separate script.

I feel there must be a better approach, possibly using `tf.data.Dataset.map` within the pipeline?

Would love to hear some best practices for self supervision, and the creation of datasets using TF and Keras.

submitted by /u/ydennisy

Misc

Upcoming DL RecSys Summit: Develop and Optimize Deep Learning Recommender Systems

Post date July 26, 2021

The NVIDIA, Facebook, and TensorFlow recommender teams will be hosting a summit with live Q&A to dive into best practices and insights on how to develop and optimize deep learning recommender systems.

Develop and Optimize Deep Learning Recommender Systems
Thursday, July 29 at 10 a.m. PT

By joining this Deep Learning Recommender Summit, you will hear from fellow ML engineers and data scientists from NVIDIA, Facebook, and TensorFlow on best practices, learnings, and insights for building and optimizing highly effective DL recommender systems.

Sessions include:

High-Performance Recommendation Model Training at Facebook
In this talk, we will first analyze how model architecture affects GPU performance and efficiency, and then present the performance optimization techniques we applied to improve GPU utilization, which include an optimized PyTorch-based training stack supporting both model and data parallelism, high-performance GPU operators, efficient embedding table sharding, memory hierarchy, and pipelining.

RecSys2021 Challenge: Predicting User Engagements with Deep Learning Recommender Systems
The NVIDIA team, a collaboration of Kaggle Grandmasters and the NVIDIA Merlin team, won the RecSys2021 challenge. It was hosted by Twitter, which provided almost 1 billion tweet-user pairs as a dataset. The team will present their winning solution with a focus on deep learning architectures and how to optimize them.

Revisiting Recommender Systems on GPU
A new era of faster ETL, Training, and Inference is coming to the RecSys space and this talk will walk through some of the patterns of optimization that guide the tools we are building to make recommenders faster and easier to use on the GPU.

TensorFlow Recommenders
TensorFlow Recommenders is an end-to-end library for recommender system models: from retrieval, through ranking, to post-ranking. In this talk, we describe how TensorFlow Recommenders can be used to fit and safely deploy sophisticated recommender systems at scale.

Misc

Accelerating Volkswagen Connected Car Data Pipelines 100x Faster with NVIDIA RAPIDS

Post date July 26, 2021


Connected cars are vehicles that communicate with other vehicles using backend systems to enhance usability, enable convenient services, and keep distributed software maintained and up to date.

At Volkswagen, we are working with NVIDIA on connected cars to solve challenges such as Geospatial Indexing and K-Nearest Neighbors, which are computationally inefficient when implemented in native Python and pandas.

Processing driving and sensor data is critical for connected cars to understand their environment. It enables connected cars to perform tasks such as parking spot detection, location-based services, theft protection, route recommendation through real-time traffic, fleet management, and many more. Location information is key to most of these use cases and requires a fast processing pipeline to enable real-time services.

Global sales of connected cars are increasing rapidly, in turn, increasing the amount of data available. As per Gartner, the average connected vehicle will generate 280 petabytes of data annually, with four terabytes of data being generated in a day at the very least. The research also states that around 470 million connected vehicles will be deployed by 2025.

This blog post will focus on the data pipeline required to process location-based geospatial information and deliver necessary services for connected cars.

Challenges with connected cars data

Working with connected car data poses both technical and business challenges:

Fast processing of huge amounts of streaming data is needed because users expect a near real-time experience to make timely decisions. For example, if a user requests a parking spot and the system takes five minutes to respond, it’s likely the spot will already be taken by the time it answers. Faster processing and analyzing of the data is the key factor to overcome this challenge.

There are also data privacy issues to consider. Connected cars must satisfy the General Data Protection Regulation (GDPR). In short, GDPR requires that after data analysis, there should not be a chance to identify individual users from the analyzed data. Additionally, storage of data pertaining to individual users is prohibited (unless written consent is given by the user). Anonymization can meet these requirements by either masking the data that identifies the individual user or by grouping and aggregating the data so that traces of the user are not possible. For this purpose, we need to make sure that the software processing connected car data complies with the regulations required by GDPR on data anonymization, which adds additional compute requirements during the data processing.

Taking a data science approach

RAPIDS can address both the technical and business challenges of connected cars. The RAPIDS suite of open-source software (OSS) libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Licensed under Apache 2.0, RAPIDS was incubated by NVIDIA and based on extensive hardware and data science experience. RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

In the following sections, we discuss how RAPIDS (software) and NVIDIA GPUs (hardware) help tackle both the technical and business challenges on a prototype application using test data. Two different approaches will be evaluated, geospatial indexing and k-nearest neighbors.

Using RAPIDS, we are able to achieve a 100x speedup for this pipeline.

A brief introduction to geospatial indexing

Geospatial indexing is the basis for many algorithms in the domain of connected cars. It is the process of partitioning areas of the earth into identifiable grid cells. It is an effective way to prune the search space when querying the huge amount of data produced by connected cars.

Popular approaches include Military Grid Reference System (MGRS) and Uber’s Hexagonal Hierarchical Spatial Index (Uber H3).

In this data pipeline example, we use Uber H3 to split the records spatially into a set of smaller subsets.

The following conditions must be satisfied after splitting the records into subsets:

Each subset consists of a maximum of ‘N’ records. This ‘N’ is chosen based on computational capacity constraints. For this experiment, we consider ‘N’ equals 2500 records.
The subset is denoted by subset_id, which is an auto-increment number starting from 0.

The following is the sample input data, which has two columns – latitude and longitude:

A table containing a sample of the input data. It consists of two columns: latitude and longitude. — *Table 1: Sample input data.*

The following algorithm needs to be implemented to apply Uber H3 for this use case:

1. Iterate over latitude and longitude, and assign a hex_id starting from resolution 0.
2. If any hex_id comprises fewer than 2500 records, assign it a subset_id, incrementing from 0.
3. Identify the hex_ids that comprise more than 2500 records.
4. Split the records of those hex_ids at the next resolution, which is now 1.
5. Repeat steps 3 and 4 until all the records are assigned a subset_id and hex_id, or until the resolution reaches 15.
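
The splitting procedure above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the pipeline's implementation: toy_cell is a hypothetical stand-in for h3.geo_to_h3 that snaps coordinates to a square grid whose cells halve in size with each resolution step, and max_records plays the role of N.

```python
from collections import defaultdict

def toy_cell(lat, lon, resolution):
    """Hypothetical stand-in for h3.geo_to_h3: a square grid, not hexagons.

    The cell edge is 180 / 2**resolution degrees, so each step refines the grid.
    """
    size = 180.0 / (2 ** resolution)
    return (resolution, int((lat + 90.0) // size), int((lon + 180.0) // size))

def assign_subsets(points, max_records, max_resolution=15):
    """Return {point_index: subset_id}.

    Cells holding fewer than max_records points receive the next subset_id;
    larger cells are split again at the next, finer resolution. Points still
    unassigned after max_resolution are left out of the result.
    """
    assigned = {}
    pending = list(range(len(points)))
    subset_id = 0
    resolution = 0
    while pending and resolution <= max_resolution:
        # Group the still-unassigned points by cell at the current resolution.
        cells = defaultdict(list)
        for i in pending:
            lat, lon = points[i]
            cells[toy_cell(lat, lon, resolution)].append(i)
        pending = []
        for members in cells.values():
            if len(members) < max_records:
                for i in members:
                    assigned[i] = subset_id
                subset_id += 1
            else:
                pending.extend(members)
        resolution += 1
    return assigned

# Made-up sample coordinates (latitude, longitude).
points = [(48.1, 11.5), (48.2, 11.6), (52.5, 13.4), (-33.9, 151.2)]
result = assign_subsets(points, max_records=3)
```

Building the cell-to-members mapping once per resolution means each subset is assigned without rescanning the whole dataset for every cell.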

Once the preceding algorithm is applied, it produces the following output data:

*Table 2: Sample output data after applying geospatial indexing.*

Code snippets of Uber H3 implementation

The following code snippet shows the implementation of Uber H3 using pandas:

import h3             # h3-py v3 API (geo_to_h3)
import pandas as pd

#df, df_return, resolution, and subset_id are initialized beforehand
#while loop until all the records are assigned to subset_id
while resolution < 16 and df["subset_id"].isnull().any():
    #assignment of hex_id
    df['hex_id'] = df.apply(lambda row: h3.geo_to_h3(row["latitude"],
                   row["longitude"], resolution), axis = 1)
    df_aggreg = df.groupby(by = "hex_id").size().reset_index()
    df_aggreg.columns = ["hex_id", "value"]
    #filtering the hex_ids that hold fewer than 2500 records
    hex_id = df_aggreg[df_aggreg['value'] < 2500]['hex_id']
    #assignment of subset_id
    for index, value in hex_id.items():
        df.loc[df['hex_id'] == value, 'subset_id'] = subset_id
        subset_id += 1
    #collect the assigned records (DataFrame.append was removed in pandas 2.0)
    df_return = pd.concat([df_return, df[~df['subset_id'].isna()]],
                ignore_index=True)
    df = df[df['subset_id'].isna()]
    resolution += 1

The following code snippet shows the implementation of Uber H3 using PySpark:

#while loop until all the records are assigned to subset_id
while resolution < 16 and (len(df.head(1)) != 0):
    #assignment of hex_id
    df = df.rdd.map(lambda x: (x["latitude"], x["longitude"],
         x["subset_id"], h3.geo_to_h3(x["latitude"], x["longitude"],
         resolution)))
    df = sqlContext.createDataFrame(df, schema)
    df_aggreg = df.groupby("hex_id").count()
    df_aggreg = df_aggreg.withColumnRenamed("count", "value")
    #filtering the hex_ids that hold fewer than 2500 records
    hex_id = df_aggreg.filter(df_aggreg.value < 2500)
    var_hex_id = list(hex_id.select('hex_id').toPandas()['hex_id'])
    for i in var_hex_id:
        #assignment of subset_id
        df = df.withColumn('subset_id', F.when(df.hex_id==i, subset_id)
            .otherwise(df.subset_id)).select(df.latitude, df.longitude,
            'subset_id', df.hex_id)
        subset_id += 1
    #subset_id == 0 marks the records that are still unassigned
    df_return = df_return.union(df.filter(df.subset_id != 0))
    df = df.filter(df.subset_id == 0)
    resolution += 1

This pandas implementation of the Uber H3 model runs painfully slowly. The slow execution significantly reduces productivity, because only a few experiments can be run. The tangible goal is to reduce the execution time by 10x.

To accelerate the pipeline, we follow a step-by-step approach as follows.

Step 1: Simple CPU parallel version

The idea of this version is to implement a simple multiprocessing-based kernel for the H3 library processing. The second part of the processing, which assigns subsets according to the data, relies on pandas library functions that cannot be easily parallelized.

import h3
import numpy as np
import pandas as pd
from functools import partial
from multiprocessing import Pool

#Function to assign hex_id
def minikernel(df, resolution):
    df['hex_id'] = np.vectorize(lambda latitude, longitude:
                   h3.geo_to_h3(latitude, longitude, resolution))(
                   np.array(df['latitude']), np.array(df['longitude']))
    return df

#while loop until all the records are assigned to subset_id
while resolution < 16 and df["subset_id"].isnull().any():
    #CPU Parallelization
    df_chunk = np.array_split(df, n_cores)
    pool = Pool(n_cores)

    #assigning hex_id by calling the function minikernel()
    df_chunk_res = pool.map(partial(minikernel, resolution=resolution),
                   df_chunk)

    df = pd.concat(df_chunk_res)

    pool.close()
    pool.join()

    df_aggreg = df.groupby(by = "hex_id").size().reset_index()
    df_aggreg.columns = ["hex_id", "value"]

    #filtering the hex_ids that hold fewer than 2500 records
    hex_id = df_aggreg[df_aggreg['value'] < 2500]['hex_id']

    for index, value in hex_id.items():
        #assignment of subset_id is a pandas library function
        #which cannot be parallelized
        df.loc[df['hex_id'] == value, 'subset_id'] = subset_id
        subset_id += 1

    #collect the assigned records (DataFrame.append was removed in pandas 2.0)
    df_return = pd.concat([df_return, df[~df['subset_id'].isna()]],
                ignore_index=True)

    df = df[df['subset_id'].isna()]
    resolution += 1

By applying simple parallelization with a process pool, we can significantly reduce the run time of the first part of the code (H3 library), but the second part (pandas library) is completely single-threaded and extremely slow.

Step 2: Apply RAPIDS cuDF

The idea here is to use as many standard features from cuDF as possible (and hence make the smallest code change) to achieve the best performance. Because cuDF operates on CUDA unified memory, the first part (H3 library) cannot simply be parallelized as before: cuDF does not handle CPU partitioning. The following code operates on a cuDF dataframe.

import cudf
import h3
import numpy as np

#while loop until all the records are assigned to subset_id
while resolution < 16 and df["subset_id"].isnull().any():
    #assignment of hex_id
    #df is a cuDF; the coordinates are copied to the CPU for the H3 library
    to_hex_id = np.vectorize(lambda latitude, longitude:
                h3.geo_to_h3(latitude, longitude, resolution))
    df['hex_id'] = to_hex_id(df['latitude'].to_array(),
                   df['longitude'].to_array())

    df_aggreg = df.groupby('hex_id').size().reset_index()
    df_aggreg.columns = ["hex_id", "value"]

    #filtering the hex_ids that hold fewer than 2500 records
    hex_id = df_aggreg[df_aggreg['value'] < 2500]['hex_id']

    for index, value in hex_id.to_pandas().items():
        #assignment of subset_id
        df.loc[df['hex_id'] == value, 'subset_id'] = subset_id
        subset_id += 1

    #collect the assigned records
    df_return = cudf.concat([df_return, df[~df['subset_id'].isna()]],
                ignore_index=True)

    df = df[df['subset_id'].isna()]

    resolution += 1

Step 3: Executing simple CPU parallelism version and cuDF GPU version using larger data

In this step, we increase the data volume three times, from half a million records to 1.5 million records, and execute a simple CPU parallel version and its equivalent cuDF GPU version.

Step 4: One more experiment with a copy to pandas and back to cuDF

As discussed in step 2, cuDF operates on CUDA unified memory, and the first part (H3 library) cannot be parallelized because cuDF lacks CPU partitioning. Therefore, we did not use the function array_split there. To overcome this challenge, we first converted the cuDF dataframe to a pandas dataframe, applied array_split, and then converted the split chunks back to cuDF and proceeded with the H3 library processing.

#while loop until all the records are assigned to subset_id
while resolution < 16 and df["subset_id"].isnull().any():
    #copy to pandas
    df_temp = df.to_pandas()

    #CPU Parallelization
    df_chunk = np.array_split(df_temp, n_cores)
    pool = Pool(n_cores)
    df_chunk_res = pool.map(partial(minikernel, resolution=resolution),
                   df_chunk)

    pool.close()
    pool.join()

    df_temp = pd.concat(df_chunk_res)

    #Back to cuDF
    df = cudf.DataFrame(df_temp)

    #assignment of hex_id
    to_hex_id = np.vectorize(lambda latitude, longitude:
                h3.geo_to_h3(latitude, longitude, resolution))
    df['hex_id'] = to_hex_id(df['latitude'].to_array(),
                   df['longitude'].to_array())

    df_aggreg = df.groupby('hex_id').size().reset_index()
    df_aggreg.columns = ["hex_id", "value"]

    #filtering the hex_ids that hold fewer than 2500 records
    hex_id = df_aggreg[df_aggreg['value'] < 2500]['hex_id']

    for index, value in hex_id.to_pandas().items():
        #assignment of subset_id
        df.loc[df['hex_id'] == value, 'subset_id'] = subset_id
        subset_id += 1

    #collect the assigned records
    df_return = cudf.concat([df_return, df[~df['subset_id'].isna()]],
                ignore_index=True)

    df = df[df['subset_id'].isna()]

    resolution += 1

The following graphs show the execution times of all the preceding approaches:

*Figure 2: Execution times for various approaches with data size of 0.5 million.*

*Figure 3: Execution times for various approaches with data size of 1.5 million.*

Lessons learned on speeding up geospatial index computation

High-Performance: The preceding graphs make clear that the cuDF GPU version delivers the best performance, and the bigger the dataset, the bigger the speedup.
Code Adaptability and Easy Transition: Note that the code being ported is not the best scenario for GPU acceleration. We are running the comparison on a third-party library (Uber H3) that runs on the CPU. To use that library, we need to copy the data from GPU memory to CPU memory in each loop iteration, which is not optimal.
In addition, the subset_id calculation is done in a row-wise approach, which could potentially be sped up by changing the original code. We left the code unchanged because one of our main goals is to check code adaptability and easy transition between pandas and cuDF.
Reusable Code: As the preceding sections show, the pipeline is a set of standardized functions that can be reused to solve other use cases.

Working towards a CUDA accelerated K-Nearest Neighbors (KNN) classification

Rather than measuring the density of connected cars by indexing and grouping, as in the preceding scheme, another way is to perform geographical classification based on the earth distance between two locations.

The classification algorithm of our choice is K-Nearest Neighbors. The principle behind nearest neighbor methods is to find a predefined number of data points (K) closest in distance to the data point. We will be comparing a CPU-based implementation of KNN to the RAPIDS GPU-accelerated version of the same algorithm.

In our current use case, we work with anonymized streamed connected car data (as briefly described in the preceding business challenges). Here, grouping and aggregating data using KNN is chosen as part of the anonymization.

Because we are grouping and aggregating on geo-coordinates, we use the haversine metric, which measures the great-circle distance between two points given by latitude and longitude.
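
For reference, the haversine distance d between two points with latitudes \varphi_1, \varphi_2 and longitudes \lambda_1, \lambda_2 (in radians) on a sphere of radius r is commonly written as follows. The symbols are introduced here for illustration only and do not appear in the original pipeline.

```latex
d = 2r \arcsin\left( \sqrt{ \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right)
```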

In our pipeline, the inputs to KNN with haversine as the distance metric are the geo-coordinates (latitude, longitude) and the number of desired closest data points. In the example below, K = 7.
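
To make the idea concrete, here is a minimal brute-force sketch of a nearest-neighbor search with the haversine distance, written in plain Python. It is an illustration only, not the library implementation used in the pipeline: the function names, the sample coordinates, and the Earth radius value are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def k_nearest(points, query_index, k):
    """Brute-force search: (index, distance) of the k points closest to
    points[query_index]. The query point itself is included at distance 0."""
    qlat, qlon = points[query_index]
    dists = [(haversine_m(qlat, qlon, lat, lon), i)
             for i, (lat, lon) in enumerate(points)]
    dists.sort()
    return [(i, d) for d, i in dists[:k]]

# Made-up sample coordinates (latitude, longitude) in degrees.
points = [(52.520, 13.405), (52.521, 13.406), (52.530, 13.420), (48.137, 11.575)]
neighbors = k_nearest(points, query_index=0, k=3)
```

The brute-force search compares the query against every point, which is why the cost grows quickly with the dataset size and why an accelerated implementation pays off.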

The input data are the same tuples (longitude and latitude) as shown in the previous example.
Once KNN is applied, it calculates a cluster ID for each record. The clustered output data for the first two rows of input data looks as follows. To avoid confusion, we marked the cluster IDs with corresponding colors.

A table containing the output data after applying k-nearest neighbors classification to sample input data. Its columns include cluster_set_id, latitude, and longitude. — *Table 3: Sample output data after applying k-nearest neighbors classification.*

The following code snippet shows the CPU implementation of KNN using scikit-learn:

from sklearn.neighbors import NearestNeighbors

#coord_array_rad holds (latitude, longitude) pairs in radians
nbrs = NearestNeighbors(n_neighbors=7, algorithm='ball_tree',
       metric = "haversine").fit(coord_array_rad)

distances, indices = nbrs.kneighbors(coord_array_rad)

# Haversine distances are returned in radians; convert to meters
distances_m = earth_radius * distances

# Drop neighbor sets that exceed the maximum distance
compliant_distances_mask = (distances_m < KNN_MAX_DISTANCE).all(axis = 1)

compliant_indices = indices[compliant_distances_mask]

Here, KNN is used as a classification algorithm. A drawback of KNN is its computational cost, especially when the data size is large. Our intention is ultimately to use cuML's KNN implementation.

The preceding implementation worked on fairly small datasets but did not finish processing 3 million records within 1.5 days, so we stopped it.

In order to turn towards the CUDA accelerated KNN implementation, we had to mimic the haversine distance with an equivalent metric as shown below.

Step 1: Coordinate transformation to work around Haversine

When we ran this exercise, the haversine distance metric was not available natively in cuML’s KNN implementation, so Euclidean distance was used instead. The current version of RAPIDS KNN, however, already supports the haversine metric.

First, we converted the coordinates into distances in meters so that a distance metric can be calculated. [10] This is implemented through a function named df_geo(), which is used in the next step.

One caveat of Euclidean distance is that it does not work for coordinates on earth that are far apart: the straight line basically “digs a hole” through the earth instead of following its surface. However, for smaller distances, such as the 10 km limit used in this pipeline, the resulting error is acceptable.
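
The article does not show df_geo(), so the following is only a sketch of what such a conversion could look like. It substitutes a simple equirectangular projection around a reference point, which is an assumption and not necessarily the method used in the original pipeline; the function names and the Earth radius value are likewise hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed value)

def to_northing_easting(lat, lon, ref_lat, ref_lon):
    """Equirectangular projection around a reference point, in meters.

    An assumed stand-in for the article's df_geo(); reasonable only when
    the points lie close to (ref_lat, ref_lon).
    """
    northing = EARTH_RADIUS_M * math.radians(lat - ref_lat)
    easting = (EARTH_RADIUS_M * math.radians(lon - ref_lon)
               * math.cos(math.radians(ref_lat)))
    return northing, easting

def euclidean_m(p, q):
    """Straight-line distance between two (northing, easting) pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

ref = (52.520, 13.405)                          # made-up reference point
a = to_northing_easting(52.520, 13.405, *ref)   # the reference point itself
b = to_northing_easting(52.560, 13.405, *ref)   # 0.04 degrees further north
distance_ab = euclidean_m(a, b)                 # about 4.4 km
```

Once the coordinates are expressed in meters on a flat plane, a plain Euclidean metric can rank nearby neighbors, which is what makes the workaround usable for short distances.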

Step 2: Perform KNN algorithm

By now, we have converted all coordinates into a north-easting coordinate format and in this step, the actual KNN algorithm can be applied.

We used the CUDA-accelerated KNN with the following settings. This implementation runs extremely fast and is well worth adopting.

from cuml.neighbors import NearestNeighbors

#defining the hyperparameters
n_neighbors = 7
algorithm = "brute"
metric = "euclidean"

#Implementation of kNN on the output of df_geo(), which converts the
#coordinates into Northing and Easting coordinate format
nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm=algorithm,
       metric=metric).fit(df_geo[['northing', 'easting']])

distances, indices = nbrs.kneighbors(df_geo[['northing', 'easting']])

Step 3: Perform the distance masking and filtering

This part is done on the CPU again because no significant speedup is expected on the GPU.

import cupy as cp

distances = cp.asnumpy(distances.values)
indices = cp.asnumpy(indices.values)

#running on CPU
KNN_MAX_DISTANCE = 10000 # meters

# Drop neighbor sets that exceed the maximum distance
compliant_distances_mask = (distances < KNN_MAX_DISTANCE).all(axis = 1)
compliant_indices = indices[compliant_distances_mask]
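
The effect of the mask is easiest to see on a tiny example. The sketch below mirrors the same row-wise filter in plain Python; the distance and index values are made up for illustration.

```python
KNN_MAX_DISTANCE = 10000  # meters

# Toy neighbor distances in meters, one row per query point (made-up values).
distances = [
    [0.0, 120.5, 980.0],     # all neighbors within the limit -> kept
    [0.0, 450.0, 15000.0],   # one neighbor too far away -> dropped
    [0.0, 9999.9, 9999.9],   # just inside the limit -> kept
]
# Toy neighbor indices matching the rows above (made-up values).
indices = [
    [0, 4, 7],
    [1, 3, 9],
    [2, 5, 6],
]

# Plain-Python equivalent of (distances < KNN_MAX_DISTANCE).all(axis=1)
mask = [all(d < KNN_MAX_DISTANCE for d in row) for row in distances]
compliant_indices = [row for row, keep in zip(indices, mask) if keep]
```

A row is kept only if every one of its neighbors lies within the limit, so a single distant neighbor removes the whole neighbor set.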

Our result is a speedup of 800x when applied to a dataset with 3 million samples over the naive pandas implementation.

*Figure 4: Execution time for various approaches with data size of 3 million.*

Lessons learned for K-Nearest Neighbors (KNN) clustering

High-Performance: The preceding graph makes clear that the RAPIDS GPU version delivers the best performance. Even as the dataset grows, execution does not take as long as on the CPU.
Comparing KNN from cuML and scikit: The cuML-based implementation is lightning fast, but we had to go the extra mile to mimic the missing distance metric. Given the performance boost achieved, the extra effort was absolutely worth it. In the meantime, the haversine distance is supported in RAPIDS and is as convenient to use as in the scikit implementation.
We overcame the missing haversine distance by using the Euclidean distance with the Northing-Easting approach. As per the research, “Over fairly large distances–perhaps up to a few thousand kilometers or more, Euclidean starts erroneous calculation.” In our code, we limit the distance to 10 kilometers. Using Northing-Easting required first converting the coordinates. Because the overall performance is much better, we can accept the time taken for the conversion.
Code Adaptability and Easy Transition: Except for the Northing-Easting function, the code is similar to the CPU code and still achieves better performance. We did not change the code because one of our main goals is to check code adaptability and easy transition between pandas and cuDF.
Reusable Code: As the preceding sections show, the pipeline is a set of standardized functions that can be reused to solve other use cases.

Summary

This article summarized how RAPIDS helps accelerate data pipelines by 100x, evaluating it on two models: Geospatial Indexing (Uber H3) and K-Nearest Neighbors Classification (KNN). Furthermore, we analyzed the advantages and disadvantages of NVIDIA RAPIDS for these two models against criteria such as performance, code adaptability, and reusability. We conclude that RAPIDS is well suited to streaming data processing (connected car data). It provides faster data processing, which is the crucial factor for streaming data analysis. RAPIDS also supports a large number of machine learning algorithms. The APIs of the RAPIDS cuDF and cuML libraries are kept similar to pandas to enable an easy transition, so existing ML pipelines can be transformed to benefit from cuDF and cuML with little effort.

When to choose RAPIDS over standard Python and pandas:

When the application requires faster processing of data.
If you are sure that the code benefits from running on the GPU rather than the CPU.
If the recommended algorithms are available as part of cuML.

This article is aimed at automotive engineers, data engineers, big data architects, project managers, and industry consultants interested in exploring or dealing with the possibilities of data science and using Python to analyze data.

Listen to the GTC session: Accelerating VW Connected-Car Data Pipelines 100x Faster with RAPIDS [E31421]

Misc

NVIDIA and King’s College London Accelerate Brain Research with Synthetic Image Project

Post date July 26, 2021

NVIDIA and King’s College London today unveiled new details about one of the first projects on Cambridge-1, the United Kingdom’s most powerful supercomputer.

Misc

King’s College London Accelerates Synthetic Brain 3D Image Creation Using AI Models Powered by Cambridge-1 Supercomputer

Post date July 26, 2021


King’s College London, along with partner hospitals and university collaborators, unveiled new details today about one of the first projects on Cambridge-1, the United Kingdom’s most powerful supercomputer.

The Synthetic Brain Project is focused on building deep learning models that can synthesize artificial 3D MRI images of human brains. These models can help scientists understand what a human brain looks like across a variety of ages, genders, and diseases.

The AI models were developed by King’s College London, and NVIDIA data scientists and engineers, as part of The London Medical Imaging & AI Centre for Value Based Healthcare. The research was funded by UK Research and Innovation and a Wellcome Flagship Programme (in collaboration with University College London).

The aim of developing the AI models is to help diagnose neurological diseases based on brain MRI scans. They also could be used for predicting diseases a brain may develop over time, enabling preventative treatment.

The use of synthetic data has the additional benefit of ensuring patient privacy and gives King’s the ability to open the research to the broader UK healthcare community. Without Cambridge-1, the AI models would have taken months rather than weeks to train, and the resulting image quality would not have been as clear.

King’s and NVIDIA researchers used Cambridge-1 to scale the models to the necessary size using multiple GPUs, and then applied a process known as hyperparameter tuning, which dramatically improved the accuracy of the models.

“Cambridge-1 enables accelerated generation of synthetic data that gives researchers at King’s the ability to understand how different factors affect the brain, anatomy, and pathology,” said Jorge Cardoso, senior lecturer in Artificial Medical Intelligence at King’s College London. “We can ask our models to generate an almost infinite amount of data, with prescribed ages and diseases; with this, we can start tackling problems such as how diseases affect the brain and when abnormalities might exist.”

The introduction of the NVIDIA Cambridge-1 supercomputer opens new possibilities for groundbreaking research like the Synthetic Brain Project, and the system could be used to accelerate research in digital biology on disease, drug design, and the human genome.

As one of the world’s top 50 fastest supercomputers, Cambridge-1 is built on 80 DGX A100 systems, integrating NVIDIA A100 GPUs, Bluefield-2 DPUs, and NVIDIA HDR InfiniBand networking.

King’s College London is leveraging NVIDIA hardware and the open-source MONAI software framework supported by PyTorch, with cuDNN and Omniverse for their Synthetic Brain Project. MONAI is a freely available, community-supported PyTorch-based framework for deep learning in healthcare imaging. The CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library for deep neural networks. Omniverse is an open platform for virtual collaboration and real-time simulation. King’s has just begun using it to visualize brains, which can help physicians better understand the morphology and pathology of brain diseases.

The increasing efficiency of deep learning architectures, together with hardware improvements, has enabled complex and high-dimensional modelling of medical volumetric data at higher resolutions. Vector-Quantized Variational Autoencoders (VQ-VAE) offer an efficient generative unsupervised learning approach that can encode images to a representation substantially compressed compared to their initial size, while preserving the decoded fidelity.

King’s used a VQ-VAE inspired and 3D optimized network to efficiently encode a full-resolution brain volume, compressing the data to less than 1% of the original size while maintaining image fidelity, and outperforming the previous State-of-the-Art.
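
The quantization step at the heart of a VQ-VAE can be illustrated with a toy example: each latent vector is replaced by the index of its nearest entry in a codebook, so only small integers need to be stored. This is a minimal sketch with made-up codebook and latent values; it omits the encoder, decoder, and training entirely and is not the King's College London model.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

# Made-up code vectors and made-up encoder outputs, two dimensions each.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.7), (0.6, 0.7)]

codes = [nearest_code(z, codebook) for z in latents]   # compressed representation
reconstructed = [codebook[c] for c in codes]           # what a decoder would receive
```

Storing one integer per latent vector instead of the vector itself is where the compression comes from; the fidelity then depends on how well the learned codebook covers the data.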

A synthetic healthy human brain generated by King’s College London and NVIDIA AI models.

After the images are encoded by the VQ-VAE, the latent space is learned through a long-range transformer model optimized for the volumetric nature of the data and associated sequence length. The sequence length caused by the three-dimensional nature of the data requires unparalleled model sizes made possible by the multi-GPU and multinode scaling provided by Cambridge-1.

By sampling from these large transformer models, and conditioning on clinical variables of interest (such as age or disease), new latent space sequences can be generated and decoded into volumetric brain images using the VQ-VAE. Transformer AI models adopt the mechanism of attention, differentially weighing the significance of each part of the input data, and are used here to model these long sequences.

Creating generative brain images that are eerily similar to real-life neurological radiology studies helps understand how the brain forms, how trauma and disease affect it, and how to help it recover. Using synthetic data instead of real patient data mitigates problems with data access and patient privacy.

As part of the synthetic brain generation project from King’s College London, the code and models are open-source. NVIDIA has made open-source contributions to improve the performance of the fast-transformers project, on which the Synthetic Brain Project depends.

To learn more about Cambridge-1, watch the replay of the Cambridge-1 Inauguration featuring a special address from NVIDIA founder and CEO Jensen Huang, and a panel with UK healthcare experts from AstraZeneca, GSK, Guy’s and St Thomas’ NHS Foundation Trust, King’s College London and Oxford Nanopore.

Misc

TFLite Conversion Error: Element Shape Required to be 1D

Post date July 25, 2021

Currently trying to convert a TF mask rcnn model to TFLite, so I can use it on a TPU. When I try to run the quantization code, I get the following error:

 error: 'tf.TensorListReserve' op requires element_shape to be 1D tensor during TF Lite transformation pass

I’m not sure how to deal with the error, or how to fix it. Here’s the code:

 import tensorflow as tf
 import model as modellib
 import coco
 import os
 import sys

 # Enable eager execution
 tf.compat.v1.enable_eager_execution()

 class InferenceConfig(coco.CocoConfig):
     GPU_COUNT = 1
     IMAGES_PER_GPU = 1

 config = InferenceConfig()
 model = modellib.MaskRCNN(mode="inference", model_dir='logs', config=config)
 model.load_weights('mask_rcnn_coco.h5', by_name=True)
 model = model.keras_model
 tf.saved_model.save(model, "tflite")

 # Preparing before conversion - making the representative dataset
 ROOT_DIR = os.path.abspath("../")
 CARS = os.path.join(ROOT_DIR, 'Mask_RCNN\mrcnn\smallCar')
 IMAGE_SIZE = 224
 datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

 def representative_data_gen():
     dataset_list = tf.data.Dataset.list_files(CARS)
     for i in range(100):
         image = next(iter(dataset_list))
         image = tf.io.read_file(image)
         image = tf.io.decode_jpeg(image, channels=3)
         image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
         image = tf.cast(image / 255., tf.float32)
         image = tf.expand_dims(image, 0)
         yield [image]

 converter = tf.lite.TFLiteConverter.from_keras_model(model)
 # This enables quantization
 converter.optimizations = [tf.lite.Optimize.DEFAULT]
 # This sets the representative dataset for quantization
 converter.representative_dataset = representative_data_gen
 # This ensures that if any ops can't be quantized, the converter throws an error
 converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
 # For full integer quantization, though supported types defaults to int8 only, we explicitly declare it for clarity.
 converter.target_spec.supported_types = [tf.int8]
 # These set the input and output tensors to uint8 (added in r2.3)
 converter.inference_input_type = tf.uint8
 converter.inference_output_type = tf.uint8
 tflite_model = converter.convert()

 with open('modelQuantized.tflite', 'wb') as f:
     f.write(tflite_model)

Any help is appreciated!

submitted by /u/Tomatorumrum