Categories
Offsites

Challenges in Multi-objective Optimization for Automatic Wireless Network Planning

Economics, combinatorics, physics, and signal processing conspire to make it difficult to design, build, and operate high-quality, cost-effective wireless networks. The radio transceivers that communicate with our mobile phones, the equipment that supports them (such as power and wired networking), and the physical space they occupy are all expensive, so it’s important to be judicious in choosing sites for new transceivers. Even when the set of available sites is limited, there are exponentially many possible networks that can be built. For example, given only 50 sites, there are 2⁵⁰ (over a million billion) possibilities!

Further complicating things, for every location where service is needed, one must know which transceiver provides the strongest signal and how strong it is. However, the physical characteristics of radio propagation in an environment containing buildings, hills, foliage, and other clutter are incredibly complex, so accurate predictions require sophisticated, computationally-intensive models. Building all possible sites would yield the best coverage and capacity, but even if this were not prohibitively expensive, it would create unacceptable interference among nearby transceivers. Balancing these trade-offs is a core mathematical difficulty.

The goal of wireless network planning is to decide where to place new transceivers to maximize coverage and capacity while minimizing cost and interference. Building an automatic network planning system (a.k.a., auto-planner) that quickly solves national-scale problems at fine-grained resolution without compromising solution quality has been among the most important and difficult open challenges in telecom research for decades.

To address these issues, we are piloting network planning tools built using detailed geometric models derived from high-resolution geographic data, that feed into radio propagation models powered by distributed computing. This system provides fast, high-accuracy predictions of signal strength. Our optimization algorithms then intelligently sift through the exponential space of possible networks to output a small menu of candidate networks that each achieve different desirable trade-offs among cost, coverage, and interference, while ensuring enough capacity to meet demand.

Example auto-planning project in Charlotte, NC. Blue dots denote selected candidate sites. The heat map indicates signal strength from the propagation engine. The inset emphasizes the non-isotropic path loss in downtown.


Radio Propagation
The propagation of radio waves near Earth’s surface is complicated. Like ripples in a pond, they decay with distance traveled, but they can also penetrate, bounce off, or bend around obstacles, further weakening the signal. Computing radio wave attenuation across a real-world landscape (called path loss) is a hybrid process combining traditional physics-based calculations with learned corrections accounting for obstruction, diffraction, reflection, and scattering of the signal by clutter (e.g., trees and buildings).
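To make the physics baseline concrete: before any clutter corrections, the dominant term is free-space path loss, given by the Friis formula FSPL(dB) = 20·log10(4πdf/c). Below is a minimal, illustrative sketch of that baseline only; the production engine described here layers learned, clutter-dependent corrections on top of terms like this.

```python
import math

def free_space_path_loss_db(distance_m: float, freq_hz: float) -> float:
    """Friis free-space path loss in dB: 20 * log10(4 * pi * d * f / c)."""
    c = 299_792_458.0  # speed of light in m/s
    return 20.0 * math.log10(4.0 * math.pi * distance_m * freq_hz / c)

# Example: a 2 km link at 3.5 GHz loses about 109 dB in free space alone;
# buildings and foliage add further, environment-specific attenuation.
print(f"{free_space_path_loss_db(2_000, 3.5e9):.1f} dB")
```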

We have developed a radio propagation modeling engine that leverages the same high-res geodata that powers Google Earth, Maps and Street View to map the 3D distribution of vegetation and buildings. While accounting for signal origin, frequency, broadcast strength, etc., we train signal correction models using extensive real-world measurements, which account for diverse propagation environments — from flat to hilly terrain and from dense urban to sparse rural areas.

While such hybrid approaches are common, using detailed geodata enables accurate path loss predictions below one-meter resolution. Our propagation engine provides fast point-to-point path loss calculations and scales massively via distributed computation. For instance, computing coverage for 25,000 transceivers scattered across the continental United States can be done at 4 meter resolution in only 1.5 hours, using 1000 CPU cores.

Photorealistic 3D model in Google Earth (top-left) and corresponding clutter height model (top-right). Path profile through buildings and trees from a source to destination in the clutter model (bottom). Gray denotes buildings and green denotes trees.

Auto-Planning Inputs
Once accurate coverage estimates are available, we can use them to optimize network planning, for example, deciding where to place hundreds of new sites to maximize network quality. The auto-planning solver addresses large-scale combinatorial optimization problems such as these, using a fast, robust, scalable approach.

Formally, an auto-planning input instance contains a set of demand points (usually a square grid) where service is to be provided, a set of candidate transceiver sites, and predicted signal strengths from candidate sites to demand points (supplied by the propagation model). Each demand point includes a demand quantity (e.g., estimated from the population of wireless users), and each site includes a cost and capacity. Signal strengths below some threshold are omitted. Finally, the input may include an overall cost budget.
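One plausible way to represent such an instance in code is sketched below; the field names and types are illustrative, not the actual system’s schema.

```python
from dataclasses import dataclass

@dataclass
class DemandPoint:
    location: tuple[float, float]  # e.g., coordinates of a grid cell
    demand: float                  # demand quantity, e.g., estimated users

@dataclass
class CandidateSite:
    location: tuple[float, float]
    cost: float
    capacity: float

@dataclass
class PlanningInstance:
    demand_points: list[DemandPoint]
    sites: list[CandidateSite]
    # Sparse map: (site index, demand point index) -> predicted signal
    # strength; pairs below the service threshold are simply absent.
    signal: dict[tuple[int, int], float]
    budget: float                  # overall cost budget
```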

Data Summarization for Large Instances
Auto-planning inputs can be huge, not just because of the number of candidate sites (tens of thousands) and demand points (billions), but also because they include signal strengths from every candidate site to every nearby demand point. Simple downsampling is insufficient because population density may vary widely over a given region. Therefore, we apply methods like priority sampling to shrink the data. This technique produces a low-variance, unbiased estimate of the original data, preserving an accurate view of the network traffic and interference statistics, and shrinking the input data enough that a city-size instance fits into memory on one machine.
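For illustration, here is a minimal sketch of priority sampling (in the style of Duffield, Lund, and Thorup), not the production implementation: each item with weight w gets priority w/u for a uniform random u in (0, 1], the k highest-priority items are kept, and each kept item’s weight estimate is max(w, τ), where τ is the (k+1)-th largest priority. The resulting estimates are unbiased for the total weight.

```python
import random

def priority_sample(weights: list[float], k: int) -> list[tuple[int, float]]:
    """Returns k (index, estimated weight) pairs forming a low-variance,
    unbiased summary of the weighted input."""
    # Each item gets priority w_i / u_i with u_i uniform in (0, 1].
    priorities = [(w / random.uniform(1e-12, 1.0), i, w)
                  for i, w in enumerate(weights)]
    priorities.sort(reverse=True)
    if len(priorities) <= k:
        return [(i, w) for _, i, w in priorities]
    tau = priorities[k][0]  # (k+1)-th largest priority is the threshold
    # Each kept item's estimate is max(w_i, tau), which is unbiased:
    # the expected sum of estimates equals the true total weight.
    return [(i, max(w, tau)) for _, i, w in priorities[:k]]

# Example: summarize a million demand weights with 10,000 representatives.
demand = [random.lognormvariate(0, 2) for _ in range(1_000_000)]
sample = priority_sample(demand, 10_000)
print(sum(est for _, est in sample), sum(demand))  # close in expectation
```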

Multi-objective Optimization via Local Search
Combinatorial optimization remains a difficult task, so we created a domain-specific local search algorithm to optimize network quality. The local search algorithmic paradigm is widely applied to address computationally-hard optimization problems. Such algorithms move from one solution to another through a search space of candidate solutions by applying small local changes, stopping at a time limit or when the solution is locally optimal. To evaluate the quality of a candidate network, we combine the different objective functions into a single one, as described in the following section.

The number of local steps to reach a local optimum, number of candidate moves we evaluate per step, and time to evaluate each candidate can all be large when dealing with realistic networks. To achieve a high-quality algorithm that finishes within hours (rather than days), we must address each of these components. Fast candidate evaluation benefits greatly from dynamic data structures that maintain the mapping between each demand point and the site in the candidate solution that provides the strongest signal to it. We update this “strongest-signal” map efficiently as the candidate solution evolves during local search. The following observations help limit both the number of steps to convergence and evaluations per step.
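One plausible shape for such a dynamic structure is sketched below, assuming the sparse signal map from earlier; this is an illustration of the idea, not the actual implementation. Adding a site only touches the demand points that site reaches; removing a site recomputes the best server only for points it was serving.

```python
class StrongestSignalMap:
    """Tracks, for each demand point, the selected site with the strongest
    signal, updated incrementally as local search adds and removes sites."""

    def __init__(self, signal: dict[tuple[int, int], float]):
        self.site_to_points: dict[int, list[tuple[int, float]]] = {}
        self.point_to_sites: dict[int, list[tuple[int, float]]] = {}
        for (s, p), strength in signal.items():
            self.site_to_points.setdefault(s, []).append((p, strength))
            self.point_to_sites.setdefault(p, []).append((s, strength))
        self.selected: set[int] = set()
        self.best: dict[int, tuple[int, float]] = {}  # point -> (site, strength)

    def add_site(self, s: int) -> None:
        self.selected.add(s)
        # Only demand points that s reaches can change their best server.
        for p, strength in self.site_to_points.get(s, []):
            if p not in self.best or strength > self.best[p][1]:
                self.best[p] = (s, strength)

    def remove_site(self, s: int) -> None:
        self.selected.discard(s)
        for p, _ in self.site_to_points.get(s, []):
            if self.best.get(p, (None,))[0] == s:
                # Recompute the best among remaining selected sites reaching p.
                cands = [(t, st) for t, st in self.point_to_sites[p]
                         if t in self.selected]
                if cands:
                    self.best[p] = max(cands, key=lambda c: c[1])
                else:
                    del self.best[p]
```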

Bipartite graph representing candidate sites (left) and demand points (right). Selected sites are circled in red, and each demand point is assigned to its strongest available connection. The topmost demand point has no service because the only site that can reach it was not selected. The third and fourth demand points from the top may have high interference if the signal strengths attached to their gray edges are only slightly lower than the ones on their red edges. The bottommost site has high congestion because many demand points get their service from that site, possibly exceeding its capacity.

Selecting two nearby sites is usually not ideal because they interfere. Our algorithm explicitly forbids such pairs of sites, thereby steering the search toward better solutions while greatly reducing the number of moves considered per step. We identify pairs of forbidden sites based on the demand points they cover, as measured by the weighted Jaccard index. This captures their functional proximity much better than simple geographic distance does, especially in urban or hilly areas where radio propagation is highly non-isotropic.
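The weighted Jaccard index of two sites’ coverage is the sum of element-wise minima divided by the sum of element-wise maxima over the demand points they reach. A small illustrative helper (the 0.5 threshold in the comment is hypothetical, not the system’s actual cutoff):

```python
def weighted_jaccard(cov_a: dict[int, float], cov_b: dict[int, float]) -> float:
    """Weighted Jaccard index of two coverage maps, where each map sends a
    demand point to the signal strength (or covered demand) at that point."""
    points = cov_a.keys() | cov_b.keys()
    num = sum(min(cov_a.get(p, 0.0), cov_b.get(p, 0.0)) for p in points)
    den = sum(max(cov_a.get(p, 0.0), cov_b.get(p, 0.0)) for p in points)
    return num / den if den else 0.0

# Two sites might be marked as a forbidden pair when their coverage
# overlaps too much, e.g. weighted_jaccard(cov_a, cov_b) > 0.5.
```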

Breaking the local search into epochs also helps. The first epoch mostly adds sites to increase the coverage area while avoiding forbidden pairs. As we approach the cost budget, we begin a second epoch that includes swap moves between forbidden pairs to fine-tune the interference. This restriction limits the number of candidate moves per step, while focusing on those that improve interference with less change to coverage.

Three candidate local search moves. Red circles indicate selected sites and the orange edge indicates a forbidden pair.

Outputting a Diverse Set of Good Solutions
As mentioned before, auto-planning must balance three competing objectives: maximizing coverage, while minimizing interference and capacity violations, subject to a cost budget. There is no single correct tradeoff, so our algorithm delegates the final decision to the user by providing a small menu of candidate networks with different emphases. We apply a multiplier to each objective and optimize the sum. Raising the multiplier for a component guides the algorithm to emphasize it. We perform grid search over multipliers and budgets, generating a large number of solutions, filter out any that are worse than another solution along all four components (including cost), and finally select a small subset that represent different tradeoffs.
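A minimal sketch of this scalarization and dominance filtering follows; the solution representation and field names are illustrative only.

```python
def scalarize(sol: dict, multipliers: dict) -> float:
    """Single objective: weighted combination of coverage (maximized) and
    the interference and capacity-violation penalties (minimized)."""
    return (multipliers["coverage"] * sol["coverage"]
            - multipliers["interference"] * sol["interference"]
            - multipliers["capacity_violation"] * sol["capacity_violation"])

def pareto_filter(solutions: list[dict]) -> list[dict]:
    """Drop any solution that another solution beats or matches on all four
    components: coverage, interference, capacity violation, and cost."""
    def dominates(a: dict, b: dict) -> bool:
        return (a != b
                and a["coverage"] >= b["coverage"]
                and a["interference"] <= b["interference"]
                and a["capacity_violation"] <= b["capacity_violation"]
                and a["cost"] <= b["cost"])
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]
```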

Menu of candidate solutions, one per row, displaying metrics. Clicking on a solution turns the selected sites pink and displays a plot of the interference distribution across covered area and demand. Sites not selected are blue.

Conclusion
We described our efforts to address the most vexing challenges facing telecom network operators. Using combinatorial optimization in concert with geospatial and radio propagation modeling, we built a scalable auto-planner for wireless telecommunication networks. We are actively exploring how to expand these capabilities to best meet the needs of our customers. Stay tuned!

For questions and other inquiries, please reach out to wireless-network-interest@google.com.

Acknowledgements
These technological advances were enabled by the tireless work of our collaborators: Aaron Archer, Serge Barbosa Da Torre, Imad Fattouch, Danny Liberty, Pishoy Maksy, Zifei Tong, and Mat Varghese. Special thanks to Corinna Cortes, Mazin Gilbert, Rob Katcher, Michael Purdy, Bea Sebastian, Dave Vadasz, Josh Williams, and Aaron Yonas, along with Serge and especially Aaron Archer for their assistance with this blog post.

Categories
Misc

Urban Jungle: AI-Generated Endangered Species Mix With Times Square’s Nightlife

Bengal tigers, red pandas and mountain gorillas are among the world’s most familiar endangered species, but tens of thousands of others — like the Karpathos frog, the Perote deer mouse or the Mekong giant catfish — are largely unknown. Typically perceived as lacking star quality, these species are now roaming massive billboards in one of…

The post Urban Jungle: AI-Generated Endangered Species Mix With Times Square’s Nightlife appeared first on NVIDIA Blog.

Categories
Misc

Colab is slow?

I’m running CIFAR-100 on a ResNet-50 model. On my local machine, I have a 1050 Ti (and on Colab, I do have the GPU turned on).

My train time per epoch on my local machine is half of what I get via Colab, even though Colab is running on a K80. Is this normal?

This is my code: https://colab.research.google.com/drive/1HU-vPLy0VLMVe7JebHJjeD4Mmw6MgTVl?usp=sharing

(it’s just used for testing)

submitted by /u/Mayfieldmobster

Categories
Misc

GFN Thursday Gets Groovy As ‘Evil Dead: The Game’ Marks 1,300 Games on GeForce NOW

Good. Bad. You’re the Guy With the Gun this GFN Thursday. Get ready for some horrifyingly good fun with Evil Dead: The Game streaming on GeForce NOW tomorrow at release. It’s the 1,300th game to join GeForce NOW, arriving on Friday the 13th. And it’s part of eight total games joining the GeForce NOW library…

The post GFN Thursday Gets Groovy As ‘Evil Dead: The Game’ Marks 1,300 Games on GeForce NOW appeared first on NVIDIA Blog.

Categories
Misc

Question about tensorflow models

I’m currently an intern trying to wrap Face-api.js as a component. My problem is – and I think I know too little about this to fully understand it – the models.
I was wondering if there are other ways to save and load (already trained) models, other than using script tags. I need to look into it to get a better performance rate, but I honestly don’t know if this is even a possibility.

submitted by /u/KawaiiFromOuterSpace

Categories
Misc

NVIDIA Transitioning To Official, Open-Source Linux GPU Kernel Driver

submitted by /u/nikniuq
Categories
Misc

TensorFlow performing worse on better hardware

I am migrating from a laptop with an RTX 3080 Mobile to a desktop with an RTX 3080 Ti. I am using the TensorFlow Docker image from the NGC Catalog (tensorflow:22.03-tf2-py3) in both instances and the same code/dataset. The laptop was running Pop!_OS 20.04 LTS and the desktop Ubuntu 20.04 LTS, basically the same setup.

The laptop took about 2 seconds per epoch (I can’t re-test because I have to turn in the device, but the epoch times are stored in the Jupyter notebook) while the desktop (with a better GPU) takes between 4 and 5 seconds. I already re-installed the drivers and the whole Docker engine with no luck. Checking nvidia-smi, GPU usage is currently only about 30%.

Has anyone encountered a problem like this? Thanks.

submitted by /u/CESARIUX2596

Categories
Misc

Accelerating AI Inference Workloads with NVIDIA A30 GPU

A30 enables researchers, engineers, and data scientists to deliver real-world results and deploy solutions into production at scale.

The NVIDIA A30 GPU is built on the latest NVIDIA Ampere Architecture to accelerate diverse workloads like AI inference at scale, enterprise training, and HPC applications for mainstream servers in data centers. The A30 PCIe card combines the third-generation Tensor Cores with large HBM2 memory (24 GB) and fast GPU memory bandwidth (933 GB/s) in a low-power envelope (maximum 165 W).

A30 supports a broad range of math precisions:

  • double-precision (FP64)
  • single-precision (FP32)
  • half-precision (FP16)
  • Brain Float 16 (BF16)
  • Integer (INT8)

It also supports innovations such as Tensor Float 32 (TF32) and Tensor Core FP64, providing a single accelerator to speed up every workload.

Figure 1 shows TF32, which has the range of FP32 and precision of FP16. TF32 is the default option in PyTorch, TensorFlow, and MXNet, so no code change is needed to achieve speedup over the last-generation NVIDIA Volta Architecture.
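As an illustration, and assuming reasonably recent framework versions, TF32 execution can also be toggled explicitly; defaults have varied across releases, so setting it explicitly is the safe option.

```python
import torch
import tensorflow as tf

# PyTorch: allow TF32 for matmuls and cuDNN convolutions on Ampere GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# TensorFlow: global TF32 execution toggle.
tf.config.experimental.enable_tensor_float_32_execution(True)
```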

Figure 1. TF32 and other precisions, in bits: FP32 has 1 sign bit, 8 range (exponent) bits, and 23 precision (mantissa) bits; TF32 has 1, 8, and 10; FP16 has 1, 5, and 10; BF16 has 1, 8, and 7.

Another important feature of A30 is its Multi-Instance GPU (MIG) capability. MIG can maximize GPU utilization across workloads big and small and ensure quality of service (QoS). A single A30 can be partitioned into up to four MIG instances to run four applications simultaneously, each fully isolated with its own streaming multiprocessors (SMs), memory, L2 cache, DRAM bandwidth, and decoder. For more information, see Supported MIG Profiles.

For interconnection, A30 supports both PCIe Gen4 (64 GB/s) and the high-speed third-generation NVLink (maximum 200 GB/s). Each A30 can support one NVLink bridge connection with a single adjacent A30 card. Wherever an adjacent pair of A30 cards exists in the server, the pair should be connected by the NVLink bridge that spans two PCIe slots for best bridging performance and balanced bridge topology.

Feature | NVIDIA T4 | NVIDIA A30
Design | Small-footprint data center & edge inference | AI inference & mainstream compute
Form factor | x16 PCIe Gen3, 1-slot LP | x16 PCIe Gen4, 2-slot FHFL, 1 NVLink bridge
Memory | 16 GB GDDR6 | 24 GB HBM2
Memory bandwidth | 320 GB/s | 933 GB/s
Multi-Instance GPU | No | Up to 4 instances
Media acceleration | 1 video encoder, 2 video decoders | 4 video decoders, 1 JPEG decoder
Fast FP64 | No | Yes
Ray tracing | Yes | No
Power | 70 W | 165 W
Table 1. Summary of the features of A30 and T4

In addition to the hardware benefits summarized in Table 1, A30 can achieve higher performance per dollar compared to the T4 GPU. A30 also supports end-to-end software stack solutions:

  • Libraries
  • GPU-accelerated deep learning frameworks like PyTorch, TensorFlow, and MXNet
  • Optimized deep learning models
  • Over 2,000 HPC and AI applications, which can be obtained from NGC containers

Performance analysis

To analyze the performance improvement of A30 over T4 and CPUs, we benchmarked six models from MLPerf Inference v1.1 with the following datasets:

  • ResNet-50 v1.5 (ImageNet)
  • SSD-Large ResNet-34 (COCO)
  • 3D-Unet (BraTS 2019)
  • DLRM (1TB Click Logs, offline scenario)
  • BERT (SQuAD v1.1, seq-len: 384)
  • RNN-T (LibriSpeech)

The MLPerf benchmark suite covers a broad range of inference use cases, from image classification and object detection to recommenders and natural language processing (NLP).

Figure 2 shows the results of the performance comparison of A30 with T4 and CPU on AI inference workloads. A30 is around 300x faster than a CPU for BERT inference.

Compared to T4, A30 delivers around 3-4x performance speedup for inference across the six models. The speedup comes from A30’s larger memory size, which enables larger batch sizes for the models, and its faster GPU memory bandwidth (almost 3x that of T4), which delivers data to the compute cores in much less time.

Bar chart using T4 as the baseline: A30 achieves 2.6x performance on ResNet-50 (CPU: 0.20x), 3.5x on SSD-Large (CPU: 0.13x), 4.1x on 3D-UNet, 3.9x on DLRM (CPU: 0.11x), 3.7x on BERT (CPU: 0.01x), and 4.3x on RNN-T (CPU: 0.04x).
Figure 2. Performance comparison of A30 over T4 and CPU using MLPerf.
CPU: 8380H (no submission on 3D-Unet)

In addition to AI inference, A30 can rapidly pre-train AI models such as BERT Large with TF32, as well as accelerate HPC applications using FP64 Tensor Cores. A30 Tensor Cores with TF32 provide up to 10x higher performance over the T4 without requiring any changes in your code. They also provide an additional 2x boost with automatic mixed precision, delivering a combined 20x throughput increase.
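As a sketch of the automatic-mixed-precision workflow in PyTorch (the synthetic `loader` below stands in for a real data pipeline, and the tiny model is purely illustrative):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid FP16 underflow

# Synthetic stand-in for a real data pipeline.
loader = [(torch.randn(64, 1024), torch.randn(64, 1024)) for _ in range(10)]

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # eligible ops run in reduced precision
        loss = torch.nn.functional.mse_loss(model(inputs.cuda()),
                                            targets.cuda())
    scaler.scale(loss).backward()    # backward pass on the scaled loss
    scaler.step(optimizer)           # unscales gradients, then steps
    scaler.update()
```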

Hardware decoders

While building a video analytics or video processing pipeline, there are several operations that must be considered:

  • Compute requirements for your model or preprocessing steps. This comes down to the Tensor Cores, GPU DRAM, and other hardware components that accelerate the models or frame preprocessing kernels.
  • Video stream encoding before transmission. This is done to minimize the bandwidth required on the network; decoding the streams on arrival can then be accelerated with NVIDIA hardware decoders.
Bar chart of the total throughput of combined video decoding operations and model inference. A30 can process up to 76 1080p streams.
Figure 3. The number of streams being processed on different GPUs

Performance measured with DeepStream 5.1. It represents end-to-end performance with video capture and decode, preprocessing, batching, inference, and post-processing. Output rendering was turned off for optimal performance, running ResNet10, ResNet18, and ResNet50 networks for inference on H.264 1080p30 video streams.

A30 is designed to accelerate intelligent video analysis (IVA) by providing four video decoders, one JPEG decoder, and one optical flow decoder.

To make use of these decoders along with the compute resources for analyzing videos, use the NVIDIA DeepStream SDK, which delivers a complete streaming analytics toolkit for AI-based, multisensor processing, video, audio, and image understanding. For more information, see TAO Toolkit Integration with DeepStream or Building a Real-time Redaction App Using NVIDIA DeepStream, Part 1: Training.

What’s next?

Representing the most powerful end-to-end AI and HPC platform for data centers, A30 enables researchers, engineers, and data scientists to deliver real-world results and deploy solutions into production at scale. For more information, see the NVIDIA A30 Tensor Core GPU datasheet and NVIDIA A30 GPU Accelerator product brief.

Categories
Misc

Advanced API Performance: Clears

Surface clearing is a widely used accessory operation. This post covers best practices for clears on NVIDIA GPUs.

To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Recommended

  • Use clear functions from the graphics API to clear resources.
  • Use any clear color to clear render targets.
    • Hardware optimizations improve most clear operations.
  • Use a suitable clear value when clearing the depth buffer.
    • Prefer clear values within the range [0.0, 0.5) when using depth test functions D3D12_COMPARISON_FUNC_GREATER or D3D12_COMPARISON_FUNC_GREATER_EQUAL.
    • Prefer clear values within the range [0.5, 1.0] when using depth test functions D3D12_COMPARISON_FUNC_LESS or D3D12_COMPARISON_FUNC_LESS_EQUAL.
  • Group clear operations into as few batches as possible.
    • Batching reduces the performance overhead of each clear.

Not recommended

  • Avoid using more than a few different clear colors for surface clearing.
    • Clearing optimization is limited to 25 clear colors per frame on NVIDIA Ampere Architecture GPUs.
    • Clearing optimization is limited to 10 clear colors per frame on NVIDIA Turing GPUs.
  • Avoid interleaving single clear calls with rendering work. 
    • Group clears into batches whenever possible.
  • Never use clear-shaders as a replacement for API clears.
    • It disables hardware optimizations and negatively impacts both CPU and GPU performance.
    • Exception: Overlapping a compute clear with neighboring compute work may give better performance.

Acknowledgments

Thanks to Michael Murphy, Maurice Harris, Dmitry Zhdan, and Patric Neil for their advice and feedback.

Categories
Misc

Tensorflow Introduces Depth API To Convert Individual Images To 3D Photos


A depth map is an image channel in computer graphics and computer vision that encodes, for each pixel, the distance from the viewpoint to the surface of the objects in the scene. Because of its wide range of applications in augmented reality, portrait mode, and 3D reconstruction, research into depth-sensing capabilities is ongoing (particularly with the release of the ARCore Depth API). The web community is also showing increasing interest in combining depth capabilities with JavaScript to enhance existing web applications with real-time AR effects. Despite these recent improvements, the scarcity of photos paired with depth maps remains a concern.

To drive the next generation of web applications, TensorFlow released its first depth estimation API, called Depth API, along with ARPortraitDepth, a model for estimating a depth map from a portrait. They also published 3D Photo, a computational photography application that uses the predicted depth to create a 3D parallax effect on a given portrait image, further demonstrating the potential of depth information. TensorFlow has also launched a live demo so that people can try converting their photographs into 3D versions.

Github: https://github.com/tensorflow/tfjs-models/blob/master/depth-estimation/README.md

https://i.redd.it/62gtdb52bwy81.gif

submitted by /u/No_Coffee_4638