DataBloom - Part 451

Mapping Africa’s Buildings with Satellite Imagery

Post author By
Post date July 28, 2021
No Comments on Mapping Africa’s Buildings with Satellite Imagery

Posted by John Quinn, Software Engineer, Google Research, Ghana

An accurate record of building footprints is important for a range of applications, from population estimation and urban planning to humanitarian response and environmental science. After a disaster, such as a flood or an earthquake, authorities need to estimate how many households have been affected. Ideally there would be up-to-date census information for this, but in practice such records may be out of date or unavailable. Instead, data on the locations and density of buildings can be a valuable alternative source of information.

A good way to collect such data is through satellite imagery, which can map the distribution of buildings across the world, particularly in areas that are isolated or difficult to access. However, detecting buildings with computer vision methods in some environments can be a challenging task. Because satellite imaging involves photographing the earth from several hundred kilometres above the ground, even at high resolution (30–50 cm per pixel), a small building or tent shelter occupies only a few pixels. The task is even more difficult for informal settlements, or rural areas where buildings constructed with natural materials can visually blend into the surroundings. There are also many types of natural and artificial features that can be easily confused with buildings in overhead imagery.

Objects that can confuse computer vision models for building identification (clockwise from top left) pools, rocks, enclosure walls and shipping containers.

In “Continental-Scale Building Detection from High-Resolution Satellite Imagery”, we address these challenges, using new methods for detecting buildings that work in rural and urban settings across different terrains, such as savannah, desert, and forest, as well as informal settlements and refugee facilities. We use this building detection model to create the Open Buildings dataset, a new open-access data resource containing the locations and footprints of 516 million buildings with coverage across most of the African continent. The dataset will support several practical, scientific and humanitarian applications, ranging from disaster response or population mapping to planning services such as new medical facilities or studying human impact on the natural environment.

Model Development
We built a training dataset for the building detection model by manually labelling 1.75 million buildings in 100k images. The figure below shows some examples of how we labelled images in the training data, taking into account confounding characteristics of different areas across the African continent. In rural areas, for example, it was necessary to identify different types of dwelling places and to disambiguate them from natural features, while in urban areas we needed to develop labelling policies for dense and contiguous structures.

(1) Example of a compound containing both dwelling places as well as smaller outbuildings such as grain stores. (2) Example of a round, thatched-roof structure that can be difficult for a model to distinguish from trees, and where it is necessary to use cues from pathways, clearings and shadows to disambiguate. (3) Example of several contiguous buildings for which the boundaries cannot be easily distinguished.

We trained the model to detect buildings in a bottom-up way, first by classifying each pixel as building or non-building, and then grouping these pixels together into individual instances. The detection pipeline was based on the U-Net model, which is commonly used in satellite image analysis. One advantage of U-Net is that it is a relatively compact architecture, and so can be applied to large quantities of imaging data without a heavy compute burden. This is critical, because the final task of applying this to continental-scale satellite imagery means running the model on many billions of image tiles.

Example of segmenting buildings in satellite imagery. Left: Source image; Center: Semantic segmentation, with each pixel assigned a confidence score that it is a building vs. non-building; Right: Instance segmentation, obtained by thresholding and grouping together connected components.

Initial experiments with the basic model had low precision and recall, for example due to the variety of natural and artificial features with building-like appearance. We found a number of methods that improved performance. One was the use of mixup as a regularisation method, where random training images are blended together by taking a weighted average. Though mixup was originally proposed for image classification, we modified it to be used for semantic segmentation. Regularisation is important in general for this building segmentation task, because even with 100k training images, the training data do not capture the full variation of terrain, atmospheric and lighting conditions that the model is presented with at test time, and hence, there is a tendency to overfit. This is mitigated by mixup as well as random augmentation of training images.

Another method that we found to be effective was the use of unsupervised self-training. We prepared a set of 100 million satellite images from across Africa, and filtered these to a subset of 8.7 million images that mostly contained buildings. This dataset was used for self-training using the Noisy Student method, in which the output of the best building detection model from the previous stage is used as a ‘teacher’ to then train a ‘student’ model that makes similar predictions from augmented images. In practice, we found that this reduced false positives and sharpened the detection output. The student model gave higher confidence to buildings and lower confidence to background.

Difference in model output between the student and teacher models for a typical image. In panel (d), red areas are those that the student model finds more likely to be buildings than the teacher model, and blue areas more likely to be background.

One problem that we faced initially was that our model had a tendency to create “blobby” detections, without clearly delineated edges and with a tendency for neighbouring buildings to be merged together. To address this, we applied another idea from the original U-Net paper, which is to use distance weighting to adapt the loss function to emphasise the importance of making correct predictions near boundaries. During training, distance weighting places greater emphasis at the edges by adding weight to the loss — particularly where there are instances that nearly touch. For building detection, this encourages the model to correctly identify the gaps in between buildings, which is important so that many close structures are not merged together. We found that the original U-Net distance weighting formulation was helpful but slow to compute. So, we developed an alternative based on Gaussian convolution of edges, which was both faster and more effective.

Distance weighting schemes to emphasise nearby edges: U-Net (left) and Gaussian convolution of edges (right).

Our technical report has more details on each of these methods.

Results
We evaluated the performance of the model on several different regions across the continent, in different categories: urban, rural, and medium-density. In addition, with the goal of preparing for potential humanitarian applications, we tested the model on regions with displaced persons and refugee settlements. Precision and recall did vary between regions, so achieving consistent performance across the continent is an ongoing challenge.

Precision-recall curves, measured at 0.5 intersection-over-union threshold.

When visually inspecting the detections for low-scoring regions, we noted various causes. In rural areas, label errors were problematic. For example, single buildings within a mostly-empty area can be difficult for labellers to spot. In urban areas, the model had a tendency to split large buildings into separate instances. The model also underperformed in desert terrain, where buildings were hard to distinguish against the background.

We carried out an ablation study to understand which methods contributed most to the final performance, measured in mean average precision (mAP). Distance weighting, mixup and the use of ImageNet pre-training were the biggest factors for the performance of the supervised learning baseline. The ablated models that did not use these methods had a mAP difference of -0.33, -0.12 and -0.07 respectively. Unsupervised self-training gave a further significant boost of +0.06 mAP.

Ablation study of training methods. The first row shows the mAP performance of the best model combined with self-training, and the second row shows the best model with supervised learning only (the baseline). By disabling each training optimization from the baseline in turn, we observe the impact on mAP test performance. Distance weighting has the most significant effect.

Generating the Open Buildings Dataset
To create the final dataset, we applied our best building detection model to satellite imagery across the African continent (8.6 billion image tiles covering 19.4 million km², 64% of the continent), which resulted in the detection of 516M distinct structures.

Each building’s outline was simplified as a polygon and associated with a Plus Code, which is a geographic identifier made up of numbers and letters, akin to a street address, and useful for identifying buildings in areas that don’t have formal addressing systems. We also include confidence scores and guidance on suggested thresholds to achieve particular precision levels.

The sizes of the structures vary as shown below, tending towards small footprints. The inclusion of small structures is important, for example, to support analyses of informal settlements or refugee facilities.

Distribution of building footprint sizes.

The data is freely available and we look forward to hearing how it is used. In the future, we may add new features and regions, depending on usage and feedback.

Acknowledgements
This work is part of our AI for Social Good efforts and was led by Google Research, Ghana. Thanks to the co-authors of this work: Wojciech Sirko, Sergii Kashubin, Marvin Ritter, Abigail Annkah, Yasser Salah Edine Bouchareb, Yann Dauphin, Daniel Keysers, Maxim Neumann and Moustapha Cisse. We are grateful to Abdoulaye Diack, Sean Askay, Ruth Alcantara and Francisco Moneo for help with coordination. Rob Litzke, Brian Shucker, Yan Mayster and Michelina Pallone provided valuable assistance with geo infrastructure.

Misc

An AI a Day Keeps Dr.Fill at Play: Matt Ginsberg on Building GPU-Powered Crossword Solver

Post author By
Post date July 28, 2021
No Comments on An AI a Day Keeps Dr.Fill at Play: Matt Ginsberg on Building GPU-Powered Crossword Solver

9 Down, 14 letters: Someone skilled in creating and solving crossword puzzles. This April, the fastest “cruciverbalist” at the American Crossword Puzzle Tournament was Dr.Fill, a crossword puzzle-solving AI program created by Matt Ginsberg. Dr.Fill perfectly solved the championship puzzle in 49 seconds. The first human champion, Tyler Hinman, filled the 15×15 crossword in exactly Read article >

The post An AI a Day Keeps Dr.Fill at Play: Matt Ginsberg on Building GPU-Powered Crossword Solver appeared first on The Official NVIDIA Blog.

Misc

YOLOR + DeepSORT Object Tracking

Post author By
Post date July 28, 2021
No Comments on YOLOR + DeepSORT Object Tracking

submitted by /u/NickFortez06
[visit reddit] [comments]

Misc

anime recommender model & analysis – I didn’t use TFRS but it seems to be working pretty decent. Any feedback?

Post author By
Post date July 28, 2021
No Comments on anime recommender model & analysis – I didn’t use TFRS but it seems to be working pretty decent. Any feedback?

anime recommender model & analysis - I didn't use TFRS but it seems to be working pretty decent. Any feedback?

submitted by /u/much_bad_gramer
[visit reddit] [comments]

Misc

Discovering New Features in CUDA 11.4

Post author By
Post date July 27, 2021
No Comments on Discovering New Features in CUDA 11.4

NVIDIA announces the newest release of the CUDA development environment, CUDA 11.4. This release includes GPU-accelerated libraries, debugging and optimization tools, programming language enhancements, and a runtime library to build and deploy your application on GPUs across the major CPU architectures: x86, Arm, and POWER. CUDA 11.4 is focused on enhancing the programming model and … Continued

CUDA 11.4 is focused on enhancing the programming model and performance of your CUDA applications. CUDA continues to push the boundaries of GPU acceleration and lay the foundation for new applications in HPC, graphics, CAE applications, AI and deep learning, automotive, healthcare, and data sciences.

CUDA 11.4 has several important features. This post offers an overview of the key capabilities:

CUDA Programming model enhancements:
- CUDA Graphs
- Multi-Process Service (MPS)
- Formalizing Asynchronous Data Movement
C++ Language support – CUDA
Compiler enhancements
CUDA Driver Enhancements

CUDA 11.4 ships with the R470 driver, which is a long-term support branch. GPUDirect RDMA and GPUDirect technology Storage (GDS) are now part of the CUDA Driver and Toolkit. This streamlines the workflow and enables our developers to leverage these technologies without the need for separate installation of additional packages. The driver enables new MIG configurations for the recently launched NVIDIA A30 GPU, which doubles the memory per MIG slice. This results in greater peak performance for various workloads on the A30 GPU, especially for AI inference workloads.

CUDA 11.4 is available to download.

CUDA programming model enhancements

This release introduced key enhancements to improve the performance of CUDA Graphs without requiring any modifications to the application or any other user intervention. It also improves the ease of use of Multi-Process Service (MPS). We formalized the asynchronous programming model in the CUDA Programming Guide.

CUDA Graphs

Reducing graph launch latency is a common request from the developer community, especially in applications that have real-time constraints, such as 5G telecom workloads or AI inference workloads. CUDA 11.4 delivers performance improvements in reducing the CUDA graph launch times. In addition, we also integrated the stream-ordered memory allocation feature that was introduced in CUDA 11.2.

For more information, see CUDA Graphs in the CUDA Toolkit Programming Guide and Getting Started with CUDA Graphs.

Performance improvements

CUDA graphs are ideal for workloads that are executed multiple times, so a key tradeoff in choosing graphs for a workload is amortizing the cost of creating a graph over repeated launches. The higher the number of repetitions or iterations, the larger the performance improvement.

In CUDA 11.4, we made a couple of key changes to CUDA graph internals that further improve the launch performance. CUDA graphs already sidesteps streams to enable lower latency runtime execution. We extended this, to bypass streams even at the launch phase, submitting a graph as a single block of work directly to the hardware. We’ve seen good performance gains from these host improvements, both for single-threaded and multithreaded applications.

Figure 1 shows the relative improvement in launch latency for the re-launch of different graph patterns. There is significant benefit for graphs that have a fork or join pattern.

Figure showing the relative launch latency improvements in CUDA 11.4 for repeat launches of a single graph for different graph patterns against CUDA 11.3. — *Figure 1.* *Launch latency performance improvement for a single graph for repeat launches with CUDA 11.3 baselines.*

Multithreaded launch performance is particularly affected by the resource contention that happens when launching multiple graphs in parallel. We’ve optimized the interthread locking to reduce contention, and so multithreaded launch is now significantly more efficient. Figure 2 shows the relative performance benefits of the changes in CUDA 11.4 to ease resource contention and how it scales with the number of threads.

Figure showing the relative launch latency improvements in CUDA 11.4 for multithreaded launch of a straight-line graph for different numbers of threads. — *Figure 2.* *Multithreaded launch latency improvement for a straight line for repeat launches with CUDA 11.3 baselines.*

Stream-ordered memory allocator support

The stream-ordered memory allocator enables applications to order memory allocation and deallocation with respect to other work launched into a CUDA stream. This also enables allocation re-use, which can significantly improve application performance. For more information about the feature and capabilities, see Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features.

In CUDA 11.4, CUDA Graphs now supports stream-ordered memory allocation both through stream capture or in native graph construction through new allocate and free node types, enabling the same efficient, deferred memory reuse logic within graphs.

These node types are collectively referred to as memory nodes. They can be created in several ways:

Using the explicit API
- Using cudaGraphAddMemAllocNode and cudaGraphAddMemFreeNode, respectively
Using stream capture
- Using cudaMallocAsync/cudaMallocFromPoolAsync and cudaFreeAsync, respectively

In the same way that stream-ordered allocation uses implicit stream ordering and event dependencies to reuse memory, graph-ordered allocation uses the dependency information defined by the edges of the graph to do the same.

Figure showing the new Memalloc and MemFree graph nodes in a simple graph example — *Figure 3. Intra-graph memory reuse. When a* MemAlloc node is created, it attempts to reuse memory, which was freed by MemFree nodes that it depends upon.

For more information, see Stream Ordered Memory Allocator.

Enhancements to MPS

The Multi-Process Service (MPS) is a binary-compatible client-server runtime implementation of the CUDA API designed to transparently enable co-operative multiprocess CUDA applications.

It consists of a control daemon process, client runtime, and server process. MPS enables better GPU utilization in cases where a single process does not use all the compute and memory-bandwidth capacity. MPS also reduces on-GPU context storage and context switching. For more information, see Multi-Process Service in the GPU Management and Deployment guide.

In this release, we made a couple of key enhancements to improve the ease of use of MPS.

Figure showing the schematic representation of MPS on pre-Volta and post-Volta GPUs — *Figure 4.* *Schematic representation of MPS with reduced on-GPU context storage and context switching to improve ease-of-use.*

Programmatic configuration of SM partitions

There are certain use cases that share the following characteristics:

They consist of kernels that have little to no interaction, which enables concurrent execution.
The ratio of SMs required by these workloads may change and requires flexibility in allocating the right number of SMs.

The MPS active thread percentage setting enables you to limit the execution to a portion of the SMs. Before CUDA 11.4, this was a fixed value that was set equally for all clients within the process. In CUDA 11.4, this has been extended to offer a mechanism to partition the SMs at a per-client level through a programmatic interface. This enables you to create contexts that have different SM partitions within the same application process.

A new resource type called CU_EXEC_AFFINITY_TYPE_SM_COUNT enables you to specify a minimum number N that the context requires. The system guarantees that at least this many SMs are assigned, although more may be reserved. CUDA 11.4 also introduces a related affinity API cuCtxGetExecAffinity, which queries the exact amount of a resource (such as the SM count) allocated for a context. For more information, see the cuCtxGetExecAffinity section in the API documentation.

Error reporting

To improve the error reporting and ease of diagnosing the root cause of MPS issues, we introduced new and detailed driver and runtime error codes. These error codes provide more specificity regarding the type of error. They supplement the common MPS error codes with additional information to help you trace down the cause of the failures. Use these error codes in your applications with the error messages in the server log, as part of the root cause analysis.

New error codes:

 CUDA_ERROR_MPS_CONNECTION_FAILED
 CUDA_ERROR_MPS_SERVER_NOT_READY
 CUDA_ERROR_MPS_RPC_FAILURE
 CUDA_ERROR_MPS_MAX_CLIENTS_REACHED
 CUDA_ERROR_MPS_MAX_CONNECTIONS_REACHED

Formalizing asynchronous data movement

In support of the asynchronous memory transfer operations, enabled by NVIDIA A100 GPU microarchitecture, in CUDA 11.4, we formalized and defined the asynchronous SIMT programming model. The asynchronous programming model defines the behavior and the APIs for C++ 20 barriers and cuda::memcpy_async on the GPU.

For more information about how you can use the asynchronous APIs to overlap memory operations from global memory, with computations in the streaming multiprocessors (SMs), see Asynchronous SIMT Programming Model.

Other enhancements

In addition to the key capabilities listed earlier, there are a few enhancements in CUDA 11.4 geared towards improving the mulit-thread submission throughput and extending the CUDA forward compatibility support to NVIDIA RTX GPUs.

Multithread submission throughput

In 11.4, we reduced the serialization of the CUDA API overheads between CPU threads. These changes are enabled by default. However, to assist with the triage of possible issues because of the underlying changes, we provide an environment variable, CUDA_REDUCE_API_SERIALIZATION, to turn off these changes. This was one of the underlying changes discussed earlier that contributed to the performance improvements for CUDA graphs.

CUDA forward compatibility

To enable use cases where you want to update your CUDA toolkit but stay on your current driver version, for example to reduce the risk or the overhead of additional validation needed to move to a new driver, CUDA offers the CUDA forward compatibility path. This was introduced in CUDA 10.0 but was initially limited to data-center GPUs. CUDA 11.4 eases those restrictions and you can now take advantage of the forward compatibility path for NVIDIA RTX GPUs as well.

Figure showing flowchart diagram to help developers decide which CUDA driver upgrade path best meets their needs — *Figure 5.* *Forward compatibility upgrade path between pre-CUDA 11.0 releases and current CUDA 11.x drivers.*

C++ language support for CUDA

Here are some key enhancements included with C++ language support in CUDA 11.4.

Major releases:
- NVIDIA C++ Standard Library (libcu++) 1.5.0 was released with CUDA 11.4.
- Thrust 1.12.0 has the new thrust::universal_vector API that enables you to use the CUDA unified memory with Thrust.
Bug fix release: The CUDA 11.4 toolkit release includes CUB 1.12.0.
New asynchronous thrust::async:exclusive_scan and inclusive_scan algorithms have been added, and the synchronous versions of these were updated to use cub::DeviceScan directly.

CUDA compiler enhancements

CUDA 11.4 NVCC C++ compiler has JIT LTO support in preview, offers more L1 and L2 cache control, and exposes a C++ symbol demangling static library along with NVIDIA Nsight debugger support for alloca.

JIT link time optimization

JIT link-time optimization (LTO) is a preview feature and is available only on CUDA Toolkit 11.4, not on embedded platforms. This feature enables LTO to be performed at runtime. Use NVRTC to generate NVVM IR, and then use the cuLink driver APIs to link the NVVM IR and do LTO.

The following code example shows how runtime JIT LTO can be used in your program.
Generate NVVM IR using nvrtcCompileProgram with the -dlto option and retrieve the generated NVVM IR using the newly introduced nvrtcGetNVVM. Existing cuLink APIs are augmented to take newly introduced JIT LTO options to accept NVVM IR as input and to perform JIT LTO. Pass the CU_JIT_LTO option to cuLinkCreate API to instantiate the linker and then use CU_JIT_INPUT_NVVM as option to cuLinkAddFile or cuLinkAddData API for further linking of NVVM IR.

 nvrtcProgram prog1, prog2;
 CUlinkState linkState;
 int err;
 void* cubin;
 size_t cubinSize;
 char *nvvmIR1, *nvvmIR2;
 
 
 NVRTC_SAFE_CALL(
      nvrtcCompileProgram(&prog1, ...);
 NVRTC_SAFE_CALL(
      nvrtcCompileProgram(&prog2, ...);
 
 const char* opts = (“--gpu-architecture=compute_80”, “--dlto”);
 
 nvrtcGetNVVM(prog1, &nvvmIR1);
 nvrtcGetNVVM(prog1, &nvvmIR2);
  
  
 options[0] = CU_JIT_LTO;
 values[0] = (void*)&walltime;
 ...
 cuLinkCreate(..., options, values, &linkState);
 err = cuLinkAddData(linkState, CU_JIT_INPUT_NVVM,
                     (void*)nvvmIR1, strlen(nvvmIR1) + 1, ...);
 ...
 err = cuLinkAddData(linkState, CU_JIT_INPUT_NVVM,
                     (void*)nvvmIR2, strlen(nvvmIR2) + 1, ...);
 ...
 cuLinkComplete(linkState, &cubin, &cubinSize);
 ...

Libcu++flt library support

The CUDA SDK now ships with libcu++filt, a static library that converts compiler-mangled C++ symbols into user-readable names. The following API, found in the nv_decode.h header file, is the entry point to the library:

char* __cu_demangle(const char* id, char *output_buffer, size_t *length, int *status)

The following C++ example code shows usage:

#include #include "/usr/local/cuda-14.0/bin/nv_decode.h" using namespace std; int main(int argc, char **argv) { const char* mangled_name = "_ZN6Scope15Func1Enez"; int status = 1; char* w = __cu_demangle(mangled_name,0,0,&status); if(status != 0) cout

This code example outputs as follows:

Demangling Succeeded: Scope1::Func1(__int128, long double, ...)
Demangling Succeeded: Scope1::Func1(__int128, long double, ...)

For more information, see Library Availability in the CUDA Binary Utilities documentation.

Configuring cache behavior in PTX

PTX ISA 7.4 gives you more control over caching behavior of both L1 and L2 caches. The following capabilities are introduced in this PTX ISA version:

Enhanced data prefetching: The new .level::prefetch_size qualifier can be used to prefetch additional data along with memory load or store operations. This enables exploiting the spatial locality of data.
Eviction priority control: PTX ISA 7.4 introduces four cache eviction priorities. These eviction priorities can be specified with the .level::eviction_priority qualifier on memory load or store operations (applicable to the L1 cache) and on the prefetch instruction (applicable to the L2 cache).
- evict_normal (default)
- evict_last (useful when the data should be kept in the cache longer)
- evict_first (useful for streaming data)
- no_allocate (avoid data from being cached at all)
Enhanced L2 cache control: This comes in two flavors:
- Cache control on specific addresses: The new discard instruction enables discarding data from cache without writing it back to memory. It should be only used when the data is no longer required. The new applypriority instruction sets the eviction priority of specific data to evict_normal. This is [articularly useful in downgrading the eviction priority from evict_last when the data no longer needs to be persistent in cache.
- Cache-hints on memory operations: The new createpolicy instruction enables creating a cache policy descriptor that encodes one or more cache eviction priorities for different data regions. Several memory operations including load, store, asynchronous copy, atom, red, and so on can accept cache policy descriptor as an operand when the .level::cache_hint qualifier is used.

These extensions are treated as performance hints only. The caching behavior specified using these extensions is not guaranteed by the caching system. For more information about usage, see the PTX ISA specification.

Other compiler enhancements in CUDA 11.4 include support for a new host compiler: ICC 2021. The diagnostics emitted by the CUDA frontend compiler are now ANSI colored and Nsight debugger can now correctly unwind CUDA applications with alloca calls, in the Call Stack view.

Nsight Developer Tools

New versions are now available for NVIDIA Nsight Visual Studio Code Edition (VSCE) and Nsight Compute 2021.2, adding enhancements to the developer experience for CUDA programming.

NVIDIA Nsight VSCE is an application development environment for heterogeneous platforms bringing CUDA development for GPUs into Microsoft Visual Studio Code. NVIDIA Nsight VSCE enables you to build and debug GPU kernels and native CPU code in the same session as well as inspect the state of the GPU and memory.

It includes IntelliSense code highlighting for CUDA applications and an integrated GPU debugging experience from the IDE with support for stepping through code, setting breakpoint, and inspecting memory states and system information in CUDA kernels. Now it’s easy to develop and debug CUDA applications directly from Visual Studio Code.

Nsight Compute 2021.2 adds new features that help detect more performance issues and make it easier to understand and fix them. The new register dependency visualization (Figure 6) helps identify long dependency chains and inefficient register usage that can limit performance. This release also adds a frequently requested feature to enable you to view the side-by-side assembly and correlated source code for CUDA kernels in the source view, without needing to collect a profile. This standalone source viewer feature enables you to open .cubin files directly from disk in the GUI to see the code correlation.

Example screen shot of Nsight Compute tool helping developers visualize registry dependency to help improve code performance — *Figure 6. Nsight Compute Register Dependency Visualization capability helps identify long dependency chains and inefficient register usage to help improve performance.*

Several features, including highlighted focus metrics, report cross-links, increased rule visibility, and documentation references all add to the built-in profile and optimization guided analysis that Nsight Compute provides to help you understand and fix performance bottlenecks.

This release also includes support for OptiX 7 resource tracking, a new Python interface for reading report data, and improvements to management of baseline reports, font settings, and CLI filters.

For overall updates, see NVIDIA Developer Tools Overview. Download the tools to your code.

For more information about the CUDA 11 generation toolkit capabilities and introductions, see CUDA 11 Features Revealed and follow all CUDA posts.

Acknowledgements

Thanks to the following key contributors: Stephen Jones, Arthy Sundaram, Fred Oh, and Sally Stevenson.

Misc

How Was NVIDIA’s 2021 GTC Keynote Made? Step Inside Our Kitchen Aug. 11 to Find Out

Post author By
Post date July 27, 2021
No Comments on How Was NVIDIA’s 2021 GTC Keynote Made? Step Inside Our Kitchen Aug. 11 to Find Out

Ever see a virtual kitchen materialize in real-time? If you caught NVIDIA CEO Jensen Huang’s keynote for our March 2021 GPU Technology Conference you’re no doubt wondering about more than a few of the presentation’s magic tricks. With the premiere of “Connecting in the Metaverse: The Making of the GTC Keynote,” Wednesday, Aug. 11, at Read article >

The post How Was NVIDIA’s 2021 GTC Keynote Made? Step Inside Our Kitchen Aug. 11 to Find Out appeared first on The Official NVIDIA Blog.

Misc

Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1

Post author By
Post date July 27, 2021
No Comments on Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1

Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren’t stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part … Continued

Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren’t stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations.

In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.

Stream ordering efficiency

The following code example on the left is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated upfront and sized to the larger of the two sizes, as shown on the right.

cudaMalloc(&ptrA, sizeA);
kernelA>>(ptrA);
cudaFree(ptrA); // Synchronizes the 
device before freeing memory
cudaMalloc(&ptrB, sizeB);
kernelB>>(ptrB);
cudaFree(ptrB);

cudaMalloc(&ptr,   max(sizeA, sizeB));
kernelA>>(ptr);
kernelB>>(ptr);
cudaFree(ptr);

This increases code complexity in the application because the memory management code is separated out from the business logic. The problem is exacerbated when other libraries are involved. For example, consider the case where kernelA is launched by a library function instead:

libraryFuncA(stream);
cudaMalloc(&ptrB, sizeB);
kernelB>>(ptrB);
cudaFree(ptrB);
  
void libraryFuncA(cudaStream_t stream) {
    cudaMalloc(&ptrA, sizeA);
    kernelA>>(ptrA);
    cudaFree(ptrA);
 }

This is much harder for the application to make efficient because it may not have complete visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when that function is invoked for the first time and never free it until the library is deinitialized. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application from using that memory.

Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.

CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need for synchronizing outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. Consider the following code example:

cudaMallocAsync(&ptrA, sizeA, stream);
kernelA>>(ptrA);
cudaFreeAsync(ptrA, stream); // No synchronization necessary
cudaMallocAsync(&ptrB, sizeB, stream); // Can reuse the memory freed previously
kernelB>>(ptrB);
cudaFreeAsync(ptrB, stream);

It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA.

libraryFuncA(stream);
cudaMallocAsync(&ptrB, sizeB, stream); // Can reuse the memory freed by the library call
kernelB>>(ptrB);
cudaFreeAsync(ptrB, stream);
  
void libraryFuncA(cudaStream_t stream) {
    cudaMallocAsync(&ptrA, sizeA, stream);
    kernelA>>(ptrA);
    cudaFreeAsync(ptrA, stream); // No synchronization necessary
}

Stream-ordered allocation semantics

All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses on all streams of that memory on the GPU.

In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB in the appropriate stream order.

The following example shows various valid usages.

auto err = cudaMallocAsync(&ptr, size, streamA);
// If cudaMallocAsync completes successfully, ptr is guaranteed to be
// a valid pointer to memory that can be accessed in stream order
  
assert(err == cudaSuccess);
  
// Work launched in the same stream can access the memory because 
// operations within a stream are serialized by definition
  
kernel>>(ptr);
  
// Work launched in another stream can access the memory as long as
// the appropriate dependencies are added
  
cudaEventRecord(event, streamA);
cudaStreamWaitEvent(streamB, event, 0);
kernel>>(ptr);
 
 
// Synchronizing the stream at a point beyond the allocation operation 
// also enables any stream to access the memory
  
cudaEventSynchronize(event);
kernel>>(ptr);
  
// Deallocation requires joining all the accessing streams. Here, 
// streamD will be deallocating.
// Adding an event dependency on streamB ensures that all accesses in 
// streamB will be done before the deallocation
  
cudaEventRecord(event, streamB);
cudaStreamWaitEvent(streamD, event, 0);
  
// Synchronizing streamC also ensures that all its accesses are done before 
// the deallocation
  
cudaStreamSynchronize(streamC);
cudaFreeAsync(ptr, streamD);

Figure 1 shows the various dependencies specified in the earlier code example. As you can see, all kernels are ordered to execute after the allocation operation and complete before the deallocation operation.

Figure showing how to correctly access memory allocated using cudaMallocAsync. — *Figure 1. Various ways to insert dependencies between streams to ensure correctness when accessing memory allocated using* cudaMallocAsync.

Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order.

err = cudaMallocAsync(&ptr, size, stream);
if (err != cudaSuccess) {
    return err;
}
// Now you’re guaranteed that ‘ptr’ is valid when the kernel executes on stream
kernel>>(ptr);
cudaFreeAsync(ptr, stream);

The CUDA driver uses memory pools to achieve the behavior of returning a pointer immediately.

Memory pools

The stream-ordered memory allocator introduces the concept of memory pools to CUDA. A memory pool is a collection of previously allocated memory that can be reused for future allocations. In CUDA, a pool is represented by a cudaMemPool_t handle. Each device has a notion of a default pool whose handle can be queried using cudaDeviceGetDefaultMemPool.

You can also explicitly create your own pools and either use them directly or set them as the current pool for a device and use them indirectly. Reasons for explicit pool creation include custom configuration, as described later in this post. When no explicitly created pool has been set as the current pool for a device, the default pool acts as the current pool.

When called without an explicit pool argument, each call to cudaMallocAsync infers the device from the specified stream and attempts to allocate memory from that device’s current pool. If the pool has insufficient memory, the CUDA driver calls into the OS to allocate more memory. Each call to cudaFreeAsync returns memory to the pool, which is then available for re-use on subsequent cudaMallocAsync requests. Pools are managed by the CUDA driver, which means that applications can enable pool sharing between multiple libraries without those libraries having to coordinate with each other.

If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.

By default, unused memory accumulated in the pool is returned to the OS during the next synchronization operation on an event, stream, or device, as the following code example shows.

cudaMallocAsync(ptr1, size1, stream); // Allocates new memory into the pool
kernel>>(ptr);
cudaFreeAsync(ptr1, stream); // Frees memory back to the pool
cudaMallocAsync(ptr2, size2, stream); // Allocates existing memory from the pool
kernel>>(ptr2);
cudaFreeAsync(ptr2, stream); // Frees memory back to the pool
cudaDeviceSynchronize(); // Frees unused memory accumulated in the pool back to the OS
// Note: cudaStreamSynchronize(stream) achieves the same effect here

Retaining memory in the pool

Returning memory from the pool to the system can affect performance in some cases. Consider the following code example:

for (int i = 0; i >>(ptr);
    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream);
}

By default, stream synchronization causes any pools associated with that stream’s device to release all unused memory back to the system. In this example, that would happen at the end of every iteration. As a result, there is no memory to reuse for the next cudaMallocAsync call and instead memory must be allocated through an expensive system call.

To avoid this expensive reallocation, the application can configure a release threshold to enable unused memory to persist beyond the synchronization operation. The release threshold specifies the maximum amount of memory the pool caches. It releases all excess memory back to the OS during a synchronization operation.

By default, the release threshold of a pool is zero. This means that allunused memory in the pool is released back to the OS during every synchronization operation. The following code example shows how to change the release threshold.

cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, device);
uint64_t threshold = UINT64_MAX;
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);
for (int i = 0; i >>(ptr);
    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream);    // Only releases memory down to “threshold” bytes
}

Using a nonzero release threshold enables reusing memory from one iteration to the next. This requires only simple bookkeeping and makes the performance of cudaMallocAsync independent of the size of the allocation, which results in dramatically improved memory allocation performance (Figure 2).

Figure showing differences in cost of memory allocation with and without a release threshold. — *Figure 2. Cost of allocating memory using* cudaMallocAsync with and without setting a release threshold (all values relative to performance of 0.4MB with threshold allocation).

The pool threshold is just a hint. Memory in the pool can also be released implicitly by the CUDA driver to enable an unrelated memory allocation request in the same process to succeed. For example, a call to cudaMalloc or cuMemCreate could cause CUDA to free unused memory from any memory pool associated with the device in the same process to serve the request.

This is especially helpful in scenarios where an application makes use of multiple libraries, some of which use cudaMallocAsync and some that do not. By automatically freeing up unused pool memory, those libraries do not have to coordinate with each other to have their respective allocation requests succeed.

There are limitations to when the CUDA driver automatically reassigns memory from a pool to unrelated allocation requests. For example, the application may be using a different interface, like Vulkan or DirectX, to access the GPU, or there may be more than one process using the GPU at the same time. Memory allocation requests in those contexts do not cause automatic freeing of unused pool memory. In such cases, the application may have to explicitly free unused memory in the pool, by invoking cudaMemPoolTrimTo.

size_t bytesToKeep = 0;
cudaMemPoolTrimTo(mempool, bytesToKeep);

The bytesToKeep argument tells the CUDA driver how many bytes it can retain in the pool. Any unused memory that exceeds that size is released back to the OS.

Better performance through memory reuse

The stream parameter to cudaMallocAsync and cudaFreeAsync helps CUDA reuse memory efficiently and avoid expensive calls into the OS. Consider the following trivial code example.

cudaMallocAsync(&ptr1, size1, stream);
kernelA>>(ptr1);
cudaFreeAsync(ptr1, stream);
cudaMallocAsync(&ptr2, size2, stream);
kernelB>>(ptr2);

Figure showing how memory can be reused within a stream. — *Figure 3. Memory reuse within the same stream*.

In this code example, ptr2 is allocated in stream order after ptr1 is freed. The ptr2 allocation could reuse some, or all, of the memory that was used for ptr1 without any synchronization, because kernelA and kernelB are launched in the same stream. So, stream-ordering semantics guarantee that kernelB cannot begin execution and access the memory until kernelA has completed. This way, the CUDA driver can help keep the memory footprint of the application low while also improving allocation performance.

The CUDA driver can also follow dependencies between streams inserted through CUDA events, as shown in the following code example:

cudaMallocAsync(&ptr1, size1, streamA);
kernelA>>(ptr1);
cudaFreeAsync(ptr1, streamA);
cudaEventRecord(event, streamA);
cudaStreamWaitEvent(streamB, event, 0);
cudaMallocAsync(&ptr2, size2, streamB);
kernelB>>(ptr2);

Figure showing how memory can be reused across dependent streams. — *Figure 4. Memory reuse across streams with an event dependency between them*.

As the CUDA driver is aware of the dependency between streams A and B, it can reuse the memory used by ptr1 for ptr2. The dependency chain between streams A and B can contain any number of streams, as shown in the following code example.

cudaMallocAsync(&ptr1, size1, streamA);
kernelA>>(ptr1);
cudaFreeAsync(ptr1, streamA);
cudaEventRecord(event, streamA);
for (int i = 0; i >>(ptr2);

If necessary, the application can disable this feature on a per-pool basis:

int enable = 0;
cudaMemPoolSetAttribute(mempool, cudaMemPoolReuseFollowEventDependencies, &enable);

The CUDA driver can also reuse memory opportunistically in the absence of explicit dependencies specified by the application. While such heuristics may help improve performance or avoid memory allocation failures, they can add nondeterminism to the application and so can be disabled on a per-pool basis. Consider the following code example:

cudaMallocAsync(&ptr1, size1, streamA);
kernelA>>(ptr1);
cudaFreeAsync(ptr1);
cudaMallocAsync(&ptr2, size2, streamB);
kernelB>>(ptr2);
cudaFreeAsync(ptr2);

In this scenario, there are no explicit dependencies between streamA and streamB. However, the CUDA driver is aware of how far each stream has executed. If, on the second call to cudaMallocAsync in streamB, the CUDA driver determines that kernelA has finished execution on the GPU, then it can reuse some or all of the memory used by ptr1 for ptr2.

Figure showing how memory can be reused opportunistically across streams. — *Figure 5. Opportunistic memory reuse across streams.*

If kernelA has not finished execution, the CUDA driver can add an implicit dependency between the two streams such that kernelB does not begin executing until kernelA finishes.

Figure showing how memory can be reused across streams through implicit dependencies added by the CUDA driver. — *Figure 6. Memory reuse through internal dependencies*.

The application can disable these heuristics as follows:

int enable = 0;
cudaMemPoolSetAttribute(mempool, cudaMemPoolReuseAllowOpportunistic, &enable);
cudaMemPoolSetAttribute(mempool, cudaMemPoolReuseAllowInternalDependencies, &enable);

Summary

In part 1 of this series, we introduced the new API functions cudaMallocAsync and cudaFreeAsync , which enable memory allocation and deallocation to be stream-ordered operations. Use them to avoid expensive calls to the OS through memory pools maintained by the CUDA driver.

In part 2 of this series, we share some benchmark results to show the benefits of stream-ordered memory allocation. We also provide a step-by-step recipe for modifying your existing applications to take full advantage of this advanced CUDA capability.

Misc

Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2

Post author By
Post date July 27, 2021
No Comments on Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2

In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also … Continued

In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.

GPU Big Data Benchmark

To measure the performance impact of the new stream-ordered allocator in a real application, here are results from the RAPIDS GPU Big Data Benchmark (gpu-bdb). gpu-bdb is a benchmark of 30 queries representing real-world data science and machine learning workflows at various scale factors: SF1000 is 1 TB of data and SF10000 is 10 TB. Each query is, in fact, a model workflow that can include SQL, user-defined functions, careful subsetting and aggregation, and machine learning.

Figure 1 shows the performance of cudaMallocAsync compared to cudaMalloc for a subset of gpu-bdb queries conducted at SF1000 on an NVIDIA DGX-2 across 16 V100 GPUs. As you can see, thanks to memory reuse and eliminating extraneous synchronization, there’s a 2–5x improvement in end-to-end performance when using cudaMallocAsync.

Figure showing the performance improvement using stream-ordered memory allocation API functions in the GPU big data benchmark. — *Figure 1. Speed up of* cudaMallocAsync over cudaMalloc for various queries on the RAPIDS GPU Big Data Benchmark.

Interoperability with cudaMalloc and cudaFree

An application can use cudaFreeAsync to free a pointer allocated by cudaMalloc. The underlying memory is not freed until the next synchronization of the stream passed to cudaFreeAsync.

cudaMalloc(&ptr, size);
kernel>>(ptr);
cudaFreeAsync(ptr, stream);
cudaStreamSynchronize(stream); // The memory for ptr is freed at this point

Similarly, an application can use cudaFree to free memory allocated using cudaMallocAsync. However, cudaFree does not implicitly synchronize in this case, so the application must insert the appropriate synchronization to ensure that all accesses to the to-be-freed memory are complete. Any application code that may be intentionally or accidentally relying on the implicit synchronization behavior of cudaFree must be updated.

cudaMallocAsync(&ptr, size, stream);
kernel>>(ptr);
cudaStreamSynchronize(stream); // Must synchronize first
cudaFree(ptr);

Multi-GPU access

By default, memory allocated using cudaMallocAsync is accessible from the device associated with the specified stream. Accessing the memory from any other device requires enabling access to the entire pool from that other device. It also requires the two devices to be peer capable, as reported by cudaDeviceCanAccessPeer. Unlike cudaMalloc allocations, cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess have no effect on memory allocated from memory pools.

For example, consider enabling device 4access to the memory pool of device 3:

cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, 3);
cudaMemAccessDesc desc = {};
desc.location.type = cudaMemLocationTypeDevice;
desc.location.id = 4;
desc.flags = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(mempool, &desc, 1 /* numDescs */);

Access from a device other than the device on which the memory pool resides can be revoked by using cudaMemAccessFlagsProtNone when calling cudaMemPoolSetAccess. Access from the memory pool’s own device cannot be revoked.

Interprocess communication support

Memory allocated using the default memory pool associated with a device cannot be shared with other processes. An application must explicitly create its own memory pools to share memory allocated using cudaMallocAsync with other processes. The following code sample shows how to create an explicit memory pool with interprocess communication (IPC) capabilities:

cudaMemPool_t exportPool;
cudaMemPoolProps poolProps = {};
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;
poolProps.location.type = cudaMemLocationTypeDevice;
poolProps.location.id = deviceId;
cudaMemPoolCreate(&exportPool, &poolProps);

The location type Device and location ID deviceId indicate that the pool memory must be allocated on a specific GPU. The allocation type Pinned indicates that the memory should be non-migratable, also known as non-pageable. The handle type PosixFileDescriptor indicates that the user intends to query a file descriptor for the pool to share it with another process.

The first step to share memory from this pool through IPC is to query the file descriptor that represents the pool:

int fd;
cudaMemAllocationHandleType handleType = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolExportToShareableHandle(&fd, exportPool, handleType, 0);

The application can then share the file descriptor with another process, for example through a UNIX domain socket. The other process can then import the file descriptor and obtain a process-local pool handle:

cudaMemPool_t importPool;
cudaMemAllocationHandleType handleType = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolImportFromShareableHandle(&importPool, &fd, handleType, 0);

The next step is for the exporting process to allocate memory from the pool:

cudaMallocFromPoolAsync(&ptr, size, exportPool, stream);

There is also an overloaded version of cudaMallocAsync that takes the same arguments as cudaMallocFromPoolAsync:

cudaMallocAsync(&ptr, size, exportPool, stream);

After memory is allocated from this pool through either of these two APIs, the pointer can then be shared with the importing process. First, the exporting process gets an opaque handle representing the memory allocation:

cudaMemPoolPtrExportData data;
cudaMemPoolExportPointer(&data, ptr);

This opaque data can then be shared with the importing process through any standard IPC mechanism, such as through shared memory, pipes, and so on The importing process then converts the opaque data into a process-local pointer:

cudaMemPoolImportPointer(&ptr, importPool, &data);

Now both processes share access to the same memory allocation. The memory must be freed in the importing process before it is freed in the exporting process. This is to ensure that the memory does not get reutilized for another cudaMallocAsync request in the exporting process while the importing process is still accessing the previously shared memory allocation, potentially causing undefined behavior.

The existing function cudaIpcGetMemHandle works only with memory allocated through cudaMalloc and cannot be used on any memory allocated through cudaMallocAsync, regardless of whether the memory was allocated from an explicit pool.

Changing a device pool

If the application expects to use an explicit memory pool most of the time, it can consider setting that as the current pool for the device through cudaDeviceSetMemPool. This enables the application to avoid having to specify the pool argument each time that it must allocate memory from that pool.

cudaDeviceSetMemPool(device, pool);
cudaMallocAsync(&ptr, size, stream); // This now allocates from the earlier pool set instead of the device’s default pool.

This has the advantage that any other function allocating with cudaMallocAsync now automatically uses the new pool as its default. The current pool associated with a device can be queried using cudaDeviceGetMemPool.

Library composability

In general, libraries should not change a device’s pool, as doing so affects the entire top-level application. If a library must allocate memory with different properties than those of the default device pool, it may create its own pool and then allocate from that pool using cudaMallocFromPoolAsync. The library could also use the overloaded version of cudaMallocAsync that takes the pool as an argument.

To make interoperability easier for applications, libraries should consider providing APIs for the top-level application to coordinate the pools used. For example, libraries could provide set or get APIs to enable the application to control the pool in a more explicit manner. The library could also take the pool as a parameter to individual APIs.

Code migration guide

When porting an existing application that uses cudaMalloc or cudaFree to the new cudaMallocAsync or cudaFreeAsync APIs, consider the following guidelines.

Guidelines for determining the appropriate pool:

The initial default pool is suitable for many applications.
Today, an explicitly constructed pool is only required to share pool memory across processes with CUDA IPC. This may change with future features.
For convenience, consider making the explicitly created pool the device’s current pool to ensure that all cudaMallocAsync calls within the process use that pool. This must be done by the top-level application and not by libraries, so as to avoid conflicting with the goals of the top-level application.

Guidelines for setting the release threshold for all memory pools:

The choice of release threshold depends on whether and how a device is shared:
- Exclusive to a single process: Use the maximum release threshold.
- Shared among cooperating processes: Coordinate to use the same pool through IPC or set each process pool to an appropriate value to avoid any one process monopolizing all device memory.
- Shared among unknown processes: If known, set the threshold to the working set size of the application. Otherwise, leave it at zero and use a profiler to determine whether allocation performance is a bottleneck before using a nonzero value.

Guidelines for replacing cudaMalloc with cudaMallocAsync:

Ensure that all memory accesses are ordered after the stream-ordered allocation.
If peer access is required, use cudaMemPoolSetAccess as cudaEnablePeerAccess and cudaDisablePeerAccesss have no effect on pool memory.
Unlike cudaMalloc allocations, cudaDeviceReset does not implicitly free pool memory, so it must be explicitly freed.
If freeing with cudaFree, ensure that all accesses are complete through appropriate synchronization before freeing, as there is no implicit synchronization in this case. Any subsequent code that relied on the implicit synchronization may also have to be updated.
If memory is shared with another process through IPC, allocate from an explicitly created pool with IPC support and remove all references to cudaIpcGetMemHandle, cudaIpcOpenMemHandle, and cudaIpcCloseMemHandle for that pointer.
If the memory must be used with GPUDirect RDMA, continue to use cudaMalloc for now because memory allocated through cudaMallocAsync currently does not support it. CUDA aims to support this in the future.
Unlike memory allocated with cudaMalloc, memory allocated with cudaMallocAsync is not associated with a CUDA context. This has the following implications:
- Calling cuPointerGetAttribute with the attribute CU_POINTER_ATTRIBUTE_CONTEXT returns null for the context.
- When calling cudaMemcpy with at least one of source or destination pointers allocated using cudaMallocAsync, that memory must be accessible from the calling thread’s current context/device. If it’s not accessible from that context or device, use cudaMemcpyPeer instead.

Guidelines for replacing cudaFree with cudaFreeAsync:

Ensure that all memory accesses are ordered before the stream-ordered deallocation.
The memory may not be freed back to the system until the next synchronization operation. If the release threshold is set to a nonzero value, the memory may not be freed back to the system until the corresponding pool is explicitly trimmed.
Unlike cudaFree, cudaFreeAsync does not implicitly synchronize the device. Any code relying on this implicit synchronization must be updated to synchronize explicitly.

Conclusion

The stream-ordered allocator and cudaMallocAsync and cudaFreeAsync API functions added in CUDA 11.2 extend the CUDA stream programming model by introducing memory allocation and deallocation as stream-ordered operations. This enables allocations to be scoped to the kernels, which use them while avoiding costly device-wide synchronization that can occur with traditional cudaMalloc/cudaFree.

Furthermore, these API functions add the concept of memory pools to CUDA, enabling the reuse of memory to avoid costly system calls and improve performance. Use the guidelines to migrate your existing code and see how much your application performance improves!

Offsites

Advances in TF-Ranking

Posted by Michael Bendersky and Xuanhui Wang, Software Engineers, Google Research

In December 2018, we introduced TF-Ranking, an open-source TensorFlow-based library for developing scalable neural learning-to-rank (LTR) models, which are useful in settings where users expect to receive an ordered list of items in response to their query. LTR models — unlike standard classification models that classify one item at a time — receive an entire list of items as an input, and learn an ordering that maximizes the utility of the entire list. While search and recommendation systems are the most common applications of LTR models, since its release, we have seen TF-Ranking being applied in diverse domains beyond search, including e-commerce, SAT solvers, and smart city planning.

The goal of learning-to-rank (LTR) is to learn a function f() that takes as an input a list of items (documents, products, movies, etc.) and outputs the list of items in the optimal order (descending order of relevance). Here, green shade indicates item relevance level, and the red item marked with ‘x’ is non-relevant.

In May 2021, we published a major release of TF-Ranking that enables full support for natively building LTR models using Keras, a high-level API of TensorFlow 2. Our native Keras ranking model has a brand-new workflow design, including a flexible ModelBuilder, a DatasetBuilder to set up training data, and a Pipeline to train the model with the provided dataset. These components make building a customized LTR model easier than ever, and facilitate rapid exploration of new model structures for production and research. If RaggedTensors are your tool of choice, TF-Ranking is now working with them as well. In addition, our most recent release, which incorporates the Orbit training library, contains a long list of advances — the culmination of two and half years of neural LTR research. Below we share a few of the key improvements available in the latest TF-Ranking version.

Workflow to build and train a native Keras ranking model. Blue modules are provided by TF-Ranking, and green modules are customizable.

Learning-to-Rank with TFR-BERT
Recently, pretrained language models like BERT have achieved state-of-the-art performance on various language understanding tasks. To capture the expressiveness of these models, TF-Ranking implements a novel TFR-BERT architecture that couples BERT with the power of LTR to optimize the ordering of list inputs. As an example, consider a query and a list of n documents that one might like to rank in response to this query. Instead of learning an independent BERT representation for each <query, document> pair, LTR models apply a ranking loss to jointly learn a BERT representation that maximizes the utility of the entire ranked list with respect to the ground-truth labels.

The figure below illustrates this process. First, we flatten a list of n documents to rank in response to a query into a list <query, document> tuples. These tuples are fed into a pre-trained language model (e.g., BERT). The pooled BERT outputs for the entire document list are then jointly fine-tuned with one of the specialized ranking losses available in TF-Ranking. Our experience shows that this TFR-BERT architecture delivers significant improvements in pretrained language model performance, leading to state-of-the-art performance for several popular ranking tasks, especially when multiple pretrained language models are ensembled. Our users can now get started with TFR-BERT using this simple example.

An illustration of the TFR-BERT architecture, in which a joint LTR model over a list of n documents is constructed using BERT representations of individual <query, document> pairs.

Interpretable Learning-to-Rank
Transparency and interpretability are important factors in deploying LTR models in ranking systems that can be involved in determining the outcomes of processes such as loan eligibility assessment, advertisement targeting, or guiding medical treatment decisions. In such cases, the contribution of each individual feature to the final ranking should be examinable and understandable to ensure transparency, accountability and fairness of the outcomes.

One possible way to achieve this is using generalized additive models (GAMs) — intrinsically interpretable machine learning models that are linearly composed of smooth functions of individual features. However, while GAMs have been extensively studied on regression and classification tasks, it is less clear how to apply them in a ranking setting. For instance, while GAMs can be straightforwardly applied to model each individual item in the list, modeling both item interactions and the context in which these items are ranked is a more challenging research problem. To this end, we have developed a neural ranking GAM — an extension of generalized additive models to ranking problems.

Unlike standard GAMs, a neural ranking GAM can take into account both the features of the ranked items and the context features (e.g., query or user profile) to derive an interpretable, compact model. This ensures that not only the contribution of each item-level feature is interpretable, but also the contribution of the context features. For example, in the figure below, using a neural ranking GAM makes visible how distance, price, and relevance, in the context of a given user device, contribute to the final ranking of the hotel. Neural ranking GAMs are now available as a part of TF-Ranking,

An example of applying neural ranking GAM for local search. For each input feature (e.g., price, distance), a sub-model produces a sub-score that can be examined, providing transparency. Context features (e.g., user device type) can be utilized to derive importance weights of submodels.

Neural Ranking or Gradient Boosting?
While neural models have achieved state of the art performance in multiple domains, specialized gradient boosted decision trees (GBDTs) like LambdaMART remained the baseline to beat in a variety of open LTR datasets. The success of GBDTs in open datasets is due to several reasons. First, due to their relatively small size, neural models are prone to overfitting on these datasets. Second, since GBDTs partition their input feature space using decision trees, they are naturally more resilient to variations in numerical scales in ranking data, which often contain features with Zipfian or otherwise skewed distributions. However, GBDTs do have their limitations in more realistic ranking scenarios, which often combine both textual and numerical features. For instance, GBDTs cannot be directly applied to large discrete feature spaces, such as raw document text. They are also, in general, less scalable than neural ranking models.

Therefore, since the TF-Ranking release, our team has significantly deepened the understanding of how best to leverage neural models in ranking with numerical features. This culminated in a Data Augmented Self-Attentive Latent Cross (DASALC) model, described in an ICLR 2021 paper, which is the first to establish parity, and in some cases statistically significant improvements, of neural ranking models over strong LambdaMART baselines on open LTR datasets. This achievement is made possible through a combination of techniques, which include data augmentation, neural feature transformation, self-attention for modeling document interactions, listwise ranking loss, and model ensembling similar to boosting in GBDTs. The architecture of the DASALC model was entirely implemented using the TF-Ranking library.

Conclusion
All in all, we believe that the new Keras-based TF-Ranking version will make it easier to conduct neural LTR research and deploy production-grade ranking systems. We encourage everyone to try out the latest version and follow this introductory example for a hands-on experience. While we are very excited about this new release, our research and development journey is far from over, so we will continue to advance our understanding of learning-to-rank problems and share these advances with our users.

Acknowledgements
This project was only possible thanks to the current and past members of the TF-Ranking team: Honglei Zhuang, ‎Le Yan, Rama Pasumarthi, Rolf Jagerman, Zhen Qin, Shuguang Han, Sebastian Bruch, Nathan Cordeiro, Marc Najork and Patrick McGregor. We also extend special thanks to our collaborators from the Tensorflow team: Zhenyu Tan, Goldie Gadde, Rick Chao, Yuefeng Zhou‎, Hongkun Yu, and Jing Li.

Offsites

함께 자라기: 우리는 함께 성장할 수 있을까?

Post author By
Post date July 27, 2021
No Comments on 함께 자라기: 우리는 함께 성장할 수 있을까?

우리는 점점 협업이 중요해지는 시대에 살고 있습니다. 도메인과 기술, 각각의 분야는 갈수록 세밀해지고 고도화되고 있기 때문에, 혼자서 이 모든 것을 다 알기란 불가능에 가까워지고 있습니다. 그래서 한명의 천재보다는 훌륭한 팀이 더 좋은 결과들을 만들어 내는 시대입니다.

images

출처: pixabay

면접에서 커뮤니케이션 스킬 역시 중요하게 평가되고 있죠. ‘팀원과의 협업에서 어려움이 있을 때 어떻게 하셨나요?’ 이런 질문들은 흔하게 접하셨을 것 같습니다. 여기에서 저는 개인적으로 ‘팀으로 일하면서 팀원 모두의 성장을 위해서 무엇을 해보았나요?’ 이 질문을 좋아합니다. 개인이 성장하는 것이 선형적이라면, 팀으로 성장하는 것은 기하급수적으로 볼 수 있기 때문입니다.

이번에 소개하는 책의 저자께서도 이 책을 읽으며, 다음과 같은 질문들로 생각이 나아갈 수 있기를 기대하고 있습니다.

우리가 정말 함께 자랄 수 있을까?
우리가 정말 매일매일 함께 자랄 수 있을까?

함께 자라기 : 애자일로 가는 길

images

출처: 알라딘 ‘함께 자라기’

이번 책은 애자일 컨설팅으로 알려져 있는 김창준님의 ≪함께 자라기≫ 입니다. 이 책은 그 동안 블로그와 페이스 북 등에서 공유해오시던 효과적으로 배우는 방법과 협업에 대한 다양한 글들을 엮은 결과입니다. 이 책의 특징 중 하나는 연구, 논문 등의 자료를 기반으로 조금 더 구체적이고 분석적으로 성장과 협업에 대해서 바라 본다는 것 입니다.

그럼 책의 내용들을 조금 더 살펴보겠습니다. 1장 자라기 에서는 성장을주제로 다양한 이야기를 하고 있습니다.

시스템

저는 시스템과 프로세스가 중요하다고 생각을 합니다. 적합한 사람들을 뽑는 것이 무엇보다 중요하지만, 이 사람들이 마음껏 능력을 펼칠 수 있는 조직의 시스템도 그에 못지 않게 중요합니다.

조직은 개인이 자신의 전문성을 좀 더 발전시키고 관리할 수 있게 최대한 지원을 해야 합니다. 그것이 윈윈하는 길입니다. 뽑고 나서 잘 교육하고 성장하게 도와주는 것 이상으로 중요한 것이 또 있습니다. 시스템입니다. 아무리 훌륭한 사람을 뽑아도 조직의 시스템과 문화에 문제가 있으면 그런 사람은 묻혀버리기 쉽고, 반대로 실력이 평범한 사람일지라도 좋은 시스템 속에서 뛰어난 성과를 낼 수도 있습니다.

잘 뽑는 것 이상으로 중요한 것 중에서

프로세스와 시스템은 아래 더글러스의 말에서 B와 C단계에 해당하는 일 입니다. 이렇게 한 단계 혹은 한 차원 높게 개선을 함으로써 그 조직은 계속해서 발전할 수 있는 것이죠. 항상 일을 함에 있어서 언제 무엇에 집중해야 할지를 생각하는 것이 필요합니다. 일례로 스타트업에서는 빠르게 A 작업을 해내는 것이 중요한 반면, 대기업에서는 더 빠르게 확장할 수 있도록 B작업, 즉 프로세스를 개선하는데 집중해야 하는 것이죠.

더글러스는 작업을 세 가지 수준으로 구분합니다. A, B, C 작업입니다.
A 작업은 원래 그 조직이 하기로 되어 있는 일을 하는 걸 말합니다.
B 작업은 A 작업을 개선하는 걸 말합니다. 제품을 만드는 사이클에서 시간과 품질을 개선하는 것이죠
C 작업은 B 작업을 개선하는 것 입니다. 개선 사이클 자체의 시간과 품질을 개선하는 것입니다. … 한마디로 개선하는 능력을 개선하는 걸 말합니다.
더글러스는 “우리가 더 잘하는 것을 더 잘하게 될수록 우리는 더 잘하는 걸 더 잘 그리고 더 빨리 하게 될 것이다”

복리의 비밀 중에서

의도적 수련

images

출처: 함께자라기 ‘제자리걸음에서 벗어나기’ 중에서

의도적 수련은 자신의 실력에 맞춰서 가장 빠르게 배울 수 있는 방법 중에 하나입니다. 위 그림처럼, ‘작업 난이도’ 와 ‘실력’ 을 유사한 수준으로 맞춰서 일에 몰입할 수 있도록 하는 것이죠. 너무 쉬운 일이라면, 스스로 퀘스트를 부여하면서 더 문제를 어렵게 만들거나 어려운 일의 경우에는 주변의 도움을 받기도 하고, 문제를 구조적으로 접근함녀서 난이도를 낮추는 방법 등을 제시하고 있습니다.

의도적 수련이 되려면 나의 실력과 작업의 난이도가 비슷해야 합니다. 이것은 미하이 칙센트미하이의 몰입이론(무슨 활동을 하냐가 중요한게 아니라 뭘 하든지 몰입해서 하면 만족도가 올라갔다)과도 일치하는 부분인데요, … 우리가 주목해야 할 부분은 C 영역입니다. 난이도와 실력이 엇비슷하게 맞는 부분이죠. 미하이는 이 부분에서 인간이 몰입을 경험한다고 합니다. 그리고 바로 이때 최고 수준의 집중력을 보이고, 그 덕분에 퍼포먼스나 학습 능력이 최대치가 될 수 있다고 합니다. 또한 그때 최고 수준의 행복감을 경험한다는 흥미로운 사실을 발견하기도 했습니다. 비슷한 이야기를 언어학자인 크라센이 입력가설을 통해 말합니다. i+1 이론이라고 하는데, 현재 언어 학습자의 언어 수준을 i라고 할 때 딱 한 단계 높은 i+1 수준의 입력이 주어질 때에만 언어 능력이 유의미하게 진전한다는 이론이죠.

의도적 수련의 필수조건, 적절한 난이도 중에서

다음으로 2장 함께 에서는 협업에 대한 다양한 주제들을 다루고 있습니다.

심리적 안전감

성공적인 팀의 특징들 중에서 가장 중요하다고 이야기 되는 요소가 바로 ‘심리적 안전감’ 입니다. 이 ‘심리적 안전감’ 하나의 주제만을 가지고 다양한 이야기하는 ≪두려움 없는 조직≫ 이라는 책도 있죠. 어떻게 보면 뻔하게 보이기도 하지만 그 만큼 심리적 안전감을 팀 내에 정착시키는 것은 어렵기도 합니다.

구글은 데이터 중심 회사답게 데이터 기반으로 뛰어난 관리자의 특징을 찾는 옥시전 프로젝트 이후에도 뛰어난 팀의 특징을 찾기 위해 2년간 노력했습니다. 이름하여 아리스토텔레스 프로젝트 입니다.

팀에 누가 있는지 (전문가, 내향/외향, 지능 등) 보다 팀원들이 서로 어떻게 상호작용하고 자신의 일을 어떻게 바라보는지가 훨씬 중요했다.

5가지 성공적 팀의 특징을 찾았는데, 그중 압도적으로 높은 예측력을 보인 변수는 팀의 심리적 안전감이었다.

팀 토론 등 특별히 고안된 활동을 통해 심리적 안전감을 개선할 수 있었다.

구글이 밝힌 탁월한 팀의 비밀 중에서

심리적 안전감은 보통 조직문화를 기반으로 하고 있다고 이야기합니다. 조직문화 중에서도 특히 ‘투명성’ 에 연결이 됩니다. 아래 사례처럼, 실수를 투명하게 공개하고 더 나은 방향으로 모두 나아갈 수 있는 것. 그 외에도 회사 내에서 정보가 투명하게 흐르게 되면 서로 간의 신뢰가 생기기 때문입니다. 이 신뢰가 곧 심리적 안전감으로 직결되게 되죠.

마이클 프레제는 회사에서의 실수 문화에 대해 연구를 했습니다. 그에 따르면 실수 문화에는 크게 두 가지가 있습니다. 실수 예방과 실수 관리. 실수 예방은 행동에서 실수로 가는 경로를 차단하려고 합니다. 즉, 실수를 저지르지 말라고 요구합니다. 근데, 사실 이것이 불가능에 가깝습니다. 전문가도 1시간에 평균 3~5개의 실수를 저지른다고 합니다. … 실수 예방 문화에서는 실수를 한 사람을 비난하고, 처벌하고, 따라서 실수를 감추고 그에 대해 논의하기 꺼리며 문제가 생겼을 때 협력도 덜하게 됩니다. 실수에서 배우지 못하겠지요. 반대로 실수 관리 문화에서는 실수가 나쁜 결과를 내기 전에 빨리 회복하도록 돕고, 실수를 공개하고, 실수에 대해 서로 이야기하고 거기에서 배우는 분위기가 생깁니다.
이 부분이 굉장히 중요합니다. 실수 연구의 역사를 보면, 초기에는 기술적인 부분만 보다가 그 다음에는 인간적인 부분 (결국 80%가 사람 실수라든지)을 보다가 이제는 문화적인 부분을 이야기합니다. 심리적 안전감이라고 하는 것이 이 문화의 일부입니다.

두 가지의 실수 문화 중에서

추상화

다음은 개발자들끼리 많이 진행하는 짝 프로그래밍에 대한 이야기 입니다. 그 동안 많이 해봤음에도, 왜 효과적인지 잘 모르고 있다가 이 책을 읽으면서 깨닫게 되는 사례 중에 하나였습니다. 짝 프로그래밍까지 가지 않더라도 문제에 대해서 설명하다가 스스로 좋은 방법을 찾기도 하는데, 이것 역시 설명의 과정에서 추상화를 시키면서 스스로 이해도가 높아지기 때문이 아닐까 싶습니다.

짝 프로그래밍은 두 사람이 한 컴퓨터를 사용해 함께 프로그래밍하는 것입니다. 생각할수록 짝 프로그래밍의 구성은 절묘합니다. 두 사람이라는 구성은 대화를 통해 추상화를 높이게 합니다. 한 컴퓨터라는 구성은 구체화를 통해 검증하게 합니다. 미루고 헤아리는 것) 이 빈번히 교차합니다. 그리고 그 사이에서 “아하”가 터져 나옵니다. … 자신이 작성하는 코드의 추상성을 높이고 싶다면 혼자서 고민하지 말고 다른 사람들과 협동하고, 대화하세요. 같이 그림도 그려보고 함께 소스코드를 편집하세요. 인간에게는 다른 인간과 소통하고 협력할 수 있는 놀라운 능력이 있습니다. 대화는 기적입니다.

대화하는 프로그래밍 중에서

새로운 방법론의 도입

아마 많은 이런 경험이 많이 있으실 것 같습니다. 같이 일을 하면서 새로운 프레임워크 혹은 애자일 등의 방법론 혹은 도구를 도입하는 것이죠. 무난하게 도입을 한 경우도 있을 것이고, 생각하지 못한 반대의견을 맞닥뜨린 경우도 있을 것 입니다. 어떻게 하는 것이 가장 좋은 방법인지 모르겠지만, 동료분들과 이야기를 충분히 하고 니즈를 이해해야 한다는 것 입니다. 이 도구가 왜 좋은지 보다는 동료분들이 어떤 생각을 가지고 있는지 알아보는 것이 어떨까요?

그리고 이렇게 대화를 하면서, 중간의 매개체가 될 수 있다면 단순히 도구를 도입하려는 시도에서 더 나아가 팀에서 필요로 하는 것이 무엇인지 제대로 이해하고 더 좋은 방안을 제시할 수 있을 것 입니다.

팀장 자리에 있으면 새로운 아이디어 전파가 쉬울 거라고 생각하는 것은 환상입니다. … 그 중 어떤 분들은 이미 나름의 객관적 수치들을 수집하고 계시죠. 그런 분들을 만나면 저는 다음과 같은 질문을 던집니다. “상대방에 대해 얼마나 이해를 하고 계신가요? 얼마나 대화를 해보셨나요?” 십중팔구는 “그분이랑은 별로 이야기 못 해봤습니다.” 란 답이 돌아옵니다. 만약 그렇다면 앞으로도 설득에 성공할 확률은 낫다고 봐야 합니다.

객관성의 주관성 중에서

복잡한 분야일수록 어떤 특정 기법의 효과보다도 치료자 효과가 더 큰 영향을 미칠 것입니다. 그렇다면 어떻게 해야 할까요? 슈퍼슈링크들을 찾고 그들을 연구하고 육성해야 합니다. … 소프트웨어 개발 방법론, 새 프로젝트를 진행할 때에 우리가 어떤 방법론을 쓰느냐는 문제보다도 누가 참여하는가가 훨씬 더 압도적으로 중요한 문제가 아닐까요? 여러분은 어떻게 생각하시나요? 저는 이렇게 생각합니다. 예를 들어 애자일 방법론 도입을 원하는 팀장이라면 “나는 어떤 팀장인가”를 먼저 자문해봐야 하지 않을까 싶습니다.

당신의 조직에 새 방법론이 먹히지 않는 이유 중에서

다음은 전문가들끼리 팀이 구성되었을 때, 가장 효과적일지에 대한 이야기가 있습니다. 분야가 겹치지 않는 상황에서는 전문가들이 서로의 전문성을 믿고 각자 최고의 결과를 만들어 낼 수 있지만, 비슷한 분야에서 전문가들이 같이 일을 하는 것은 개인에서 협업을 하게되는 상황이기도 합니다. 이때에는 필연적으로 생산성이 떨어지는 순간들이 있게 되는 것 같습니다. 협업에는 연습이 필요하기 때문이죠.

회사에서의 올스타는 어떨까요? 그로이스버그(Groysberg) 등의 연구에 따르면 이런 스타들이 한 명씩 팀에 추가될 때마다 팀의 추가적 성과 향상은 한계효용(점차 줄어듬)을 보이며 어느 수준을 지나면 음의 방향으로 작용한다(즉, 전체 팀의 성과를 깎아먹음)”고 합니다. … 성과를 깎아먹는 경향은 특히 전문가들이 전문성이 서로 유사할 때 도드라졌습니다. 이 연구는 그 원인 중 하나로 전문가들의 에고(ego)를 꼽습니다.

전문가팀이 실패하는 이유 중에서

애자일

마지막 3장에서는 애자일에 대한 이야기가 간단하게 다루어집니다. 사실 앞의 1장, 2장에서도 ‘애자일’ 이라는 용어만 쓰지 않았지, 주제는 애자일에 포함되는 이야기였기 때문이죠.

그 동안 일을 해오면서, 아래의 사례처럼 ‘고객 참여’는 무엇보다 중요한 요소 입니다. 고객 참여에는 다양한 방식이 있을 것 입니다. 고객이 바로 옆에서 도움을 줄 수도 있고, CS를 통해서 피드백을 받을 수도 있고, 인터뷰를 진행할 수도 있습니다. 고객이 무엇을 원하는지 알아볼 수 있는 선구안은 정말 흔하지 않기 때문에, 고객 참여를 통해서 니즈를 발견하고 빠르게 개발해나가는 것이 중요하죠.

성숙도가 낮은 조직의 경우 (성숙도 4 이하), 고객 참여 (0.94), 통계적으로 유의미한 실천법 딱 하나입니다. 고객 참여. 그리고 기여도는 0.94로 아까 전체로 볼 때보다 더 높습니다. 거의 1 입니다. 성숙도가 낮아도 고객 참여를 잘하면 프로젝트 성공도가 한 칸 올라간다는 뜻 입니다. … 성숙도가 높은 조직을 보시죠. 짧은 반복 개발 주기가 1등입니다. 고객 참여보다 더 기여도가 높습니다. 그 말은 성숙도가 높은 조직에서는 고객 참여보다 짧은 반복 개발 주기가 성공에 더 도움이 될 수 있다는 뜻입니다. 그만큼 짧은 반복 개발 주기를 통해 고객 참여가 잘 안 될 때를 어느 정도 보완할 수 있다는 뜻일 수도 있겠습니다.

성숙도가 낮다면 고객 참여는 필수 중에서

끝으로

images

출처: 존잡생각 Ep.18 회사에서 본인을 빠르게 성장시키는 방법 – People Scaling

포스트를 작성하면서 협업에 대해서 생각을 하다보니, 최근에 자주 보고 있는 존잡생각 이라는 샌드버드 CEO인 김동선 대표님의 유투브 채널에서 다뤘던 내용이 생각났습니다. 저 문장이 협업의 측면에서 핵심이 되는 요소라고 생각합니다. 문제가 되는 약점은 고쳐야 하지만, 기본적으로 개개인이 가진 강점을 기반으로 팀으로서의 합이 최대치가 되도록 하는 것이죠.

이렇게 팀이 성장하는 방향으로, 함께 자랄 수 있기를 바랍니다!