Categories
Offsites

Holistic Video Scene Understanding with ViP-DeepLab

People are able to retrieve the visual information about 3D environments from a picture quite easily — we can identify objects, determine instance sizes, and reconstruct 3D scene layout, all using the limited signals contained in 2D images. This ability is commonly known as the inverse projection problem, which refers to reconstructing the ambiguous mapping from the retinal images to the sources of retinal stimulation. Real-world computer vision applications, such as autonomous driving, heavily rely on these capabilities to localize and identify 3D objects, which require vision models to infer the spatial location, semantic class, and instance label for each 3D point projected to the 2D images. The ability to reconstruct the 3D world from images can be decomposed into two disjoint computer vision tasks: monocular depth estimation (predicting depth from a single image) and video panoptic segmentation (the unification of instance segmentation and semantic segmentation, in the video domain). However, research has generally considered each task separately. Tackling these tasks jointly with a unified computer vision model could result in easier deployment and greater efficiency by sharing computation among multiple tasks.

Driven by the potential value of a model that predicts depth and video panoptic segmentation at the same time, we present “ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation”, accepted to CVPR 2021. In this work, we propose a new task, depth-aware video panoptic segmentation, that aims to simultaneously tackle monocular depth estimation and video panoptic segmentation. For the new task, we present two derived datasets accompanied by a new evaluation metric called depth-aware video panoptic quality (DVPQ). This new metric includes the metrics for depth estimation and video panoptic segmentation, requiring a vision model to simultaneously tackle the two sub-tasks. To this end, we extend Panoptic-DeepLab by adding network branches for depth and video predictions to create ViP-DeepLab, a unified model that jointly performs video panoptic segmentation and monocular depth estimation for each pixel on the image plane, and achieves state-of-the-art performance on several academic datasets for the sub-tasks. This video demonstrates the new task and shows the results of ViP-DeepLab.

Depth-aware video panoptic segmentation results obtained by ViP-DeepLab. Top-left: Video frames used as input. Top-right: Video panoptic segmentation results. Bottom-left: Estimated depth. Bottom-right: Reconstructed 3D points. Each object instance has a unique and temporally consistent label, e.g., pedestrain_1, pedestrain_2, etc. Input images are from the Cityscapes dataset.

Overview
While Panoptic-DeepLab is able to output semantic segmentation, center prediction, and center regression for a single frame, it lacks the capability of depth estimation and temporally consistent instance ID prediction for multiple frames. However, ViP-DeepLab accomplishes this by performing additional predictions from two consecutive frames as input. The first additional output is depth estimation for the first frame, for which it assigns an estimated depth to each pixel. In addition, ViP-DeepLab also performs center regression for two consecutive frames for only the object centers that appear in the first frame. This process is called center offset prediction, and allows ViP-DeepLab to group all the pixels in the two frames to the same object that appears in the first frame. New instances emerge if they are not grouped to the previously detected instances. This process continues for every two consecutive frames (with one overlapping frame) in a video sequence, stitching panoptic predictions together to form predictions with temporally consistent instance IDs. That is, it stitches together where objects are and how they move in a video scene with time.

Outputs of ViP-DeepLab for video panoptic segmentation. Two consecutive frames are concatenated as input. The semantic segmentation output associates each pixel with its semantic classes, while the instance segmentation outputs identify the pixels from two frames associated with an individual object in the first frame. Input images are from the Cityscapes dataset.
Visualization of stitching video panoptic predictions. ViP-DeepLab propagates IDs based on mask intersection-over-union between region pairs. It is capable of tracking objects with large movements, e.g., the cyclist in the image.

Neural Network Design
Building on top of Panoptic-DeepLab, ViP-DeepLab additionally contains two prediction branches: (1) a depth prediction branch, and (2) a next-frame instance branch. Specifically, the depth prediction head is a simple design that predicts depth regression for every pixel, while the next-frame instance branch predicts the center offsets for the pixels in the second frame with respect to the centers in the first frame.

Results
We have tested ViP-DeepLab on multiple popular benchmarks, including Cityscapes-VPS, KITTI Depth Prediction, and KITTI Multi-Object Tracking and Segmentation (MOTS).

Specifically, ViP-DeepLab achieves state-of-the-art (SOTA) results, significantly outperforming previous methods by 5.1% video panoptic quality (VPQ) on the Cityscapes-VPS test set.

Method VPQAll VPQThings VPQStuff
VPSNet 57.4% 45.8% 64.8%
ViP-DeepLab          62.5% (+5.1%)       50.2% (+4.4%)       70.3% (+5.5%)   
VPQ comparison on Cityscapes-VPS test set.

ViP-DeepLab ranks 1st on the KITTI depth prediction benchmark, improving over previous methods by 0.65 SILog (the smaller the better).

Method    SILog       sqErrorRel       absErrorRel       iRMSE   
PWA 11.45 2.30 9.05 12.32
ViP-DeepLab       10.80 2.19 8.94 11.77
Monocular depth estimation comparison on KITTI Depth Prediction benchmark. Note for the depth estimation metrics, the smaller the values, the better the performance. While differences may appear small, the top-performing method on this benchmark usually has a gap in SILog smaller than 0.1.

Additionally, ViP-DeepLab was also 1st on KITTI MOTS pedestrians and 3rd on KITTI MOTS cars ranked by the metric sMOTSA, and now is 3rd for both pedestrians and cars ranked by a newer metric HOTA.

Class Method HOTA
Car PointTrack 62.0%
ViP-DeepLab 76.4% (+14.4%)
Pedestrian       PointTrack 54.4%
ViP-DeepLab          64.3% (+9.9%)   
Performance comparison on KITTI Multi-Object Tracking and Segmentation.

Finally, we also present two new datasets for the new task, depth-aware video panoptic segmentation, and test ViP-DeepLab on them. We hope our ViP-DeepLab results on these two new datasets will serve as a strong baseline for the community to compare against. The results are shown below.

Dataset    DVPQAll       DVPQThings       DVPQStuff   
Cityscapes-DVPS       55.1% 43.3% 63.6%
SemKITTI-DVPS 45.6% 36.6% 52.2%
ViP-DeepLab performance for the task of depth-aware video panoptic segmentation on two new datasets.

Conclusion
With a simple architecture, ViP-DeepLab achieves state-of-the-art performance on video panoptic segmentation, monocular depth estimation, and multi-object tracking and segmentation. We hope that along with MaX-DeepLab, which proposes an efficient dual-path transformer module that allows for end-to-end image panoptic segmentation, ViP-DeepLab is useful to the community and furthers research into a more holistic understanding of scenes in the real world.

Acknowledgements
We would like to thank the support and valuable discussions with Yukun Zhu, Hartwig Adam, and Alan Yuille (co-authors of ViP-DeepLab), as well as Maxwell Collins, and the Mobile Vision team.

Categories
Misc

MLOps Made Simple & Cost Effective with Google Kubernetes Engine and NVIDIA A100 Multi-Instance GPUs

Google Cloud and NVIDIA collaborated to make MLOps simple, powerful, and cost-effective by bringing together the solution elements to build, serve and dynamically scale your end-to-end ML pipelines with the right-sized GPU acceleration in one place.

Building, deploying, and managing end-to-end ML pipelines in production, particularly for applications like recommender systems is challenging. Operationalizing ML models, within enterprise applications, to deliver business value involves a lot more than developing the machine learning algorithms and models themselves – it’s a continuous process of data collection and preparation, model building, training or retraining with newer data, model validation, inference serving, and monitoring model performance to ensure the relevance of the results. 

Figure 1. Elements for ML systems. Adapted from Hidden Technical Debt in Machine Learning Systems. [Source]

In addition to the challenge of developing the pipeline, you also need to secure and manage the right compute infrastructure to accelerate these steps for a guaranteed quality-of-service (QoS) for your customers. And, each step in the pipeline is unique too so your compute requirements for data preparation and training might be completely different from what’s required to service multiple disparate inference requests. This is both a development and infrastructure management challenge also commonly referred to as the MLOps challenge. 

Google Cloud and NVIDIA have collaborated to make MLOps simple, powerful, and cost-effective by bringing together the solution elements to build, serve and dynamically scale your end-to-end ML pipelines with the right-sized GPU acceleration in one place. You can focus on delivering the best value for your end customers while maximizing infrastructure utilization and minimizing operational costs for deploying your AI-enabled services.

GKE + MIG = Portability, Scalability & Productivity for MLOps 

Google Kubernetes Engine (GKE) is a managed environment for deploying, scaling and managing containerized ML applications in a secure Google infrastructure. GKE facilitates easy cluster creation, load balancing, autoscaling compute requirements based on demand among other things. Most importantly GKE frees up users from having to manage their own workstations, servers, and VMs while building and deploying ML pipelines – you can focus on the most important value-add tasks of building and training your ML models for your business use-case.

The Google Kubernetes Engine (GKE) now supports the Multi-Instance GPU (MIG) feature enabling each NVIDIA A100 Tensor Core GPU in the new A2 VM instance to be partitioned into as many as seven independent GPU instances, each with its own high-bandwidth memory, cache, and compute cores. GKE can then provision GPU resources for your workloads with greater granularity, share a single GPU for multi-user, multi-model use-cases and automatically scale up or down based on changing needs of your ML pipelines. 

Figure 2. Multiple AI inference requests on a single NVIDIA A100 GPU with NVIDIA Triton Inference Server and GKE

For example, GKE can provision multiple A100 GPU MIG instances to process inference requests for multiple models to be simultaneously executed on the independent MIG partitions within a single A100 GPU to maximize utilization. As the compute required for your deployed ML pipelines increase (e.g. a sudden surge in inference requests to service), GKE can automatically scale to additional node-pools with MIG partitions. In addition, the NVIDIA Collective Communication Library (NCCL) further optimizes multi-GPU, multi-node communications within the GKE cluster to ensure high bandwidth, high throughput, and low latency.

NVIDIA Solution Stack to Develop & Deploy End-to-End Machine Learning Pipelines

To develop ML application pipelines that are scalable and are optimized to leverage the full benefits of the MIG capabilities for GPU utilization on Google Cloud, NVIDIA offers several GPU-accelerated end-to-end application-specific frameworks – NVIDIA Merlin for end-to-end recommendation systems, NVIDIA Jarvis for multimodal conversational AI services, and NVIDIA RAPIDS for data analytics pipelines. All NVIDIA optimized frameworks, SDKs, pre-trained models, and performance-optimized libraries can be accessed from NGC Catalog, a hub for GPU-accelerated software.

Deploying the ML pipelines into production at scale on a GKE managed cluster is further simplified with NVIDIA Triton Inference Server software. This open-source inference serving software lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework), from local storage or Google Cloud’s managed storage products on any GPU- or CPU-based infrastructure. Triton Inference Server software is now directly available on the GCP Marketplace to seamlessly deploy, serve, monitor performance and dynamically scale multiple AI inference requests on MIG-enabled GKE clusters.

Bringing it All Together 

GKE’s managed Kubernetes services combined with the flexibility of the A100 MIG feature and NVIDIA’s GPU-optimized solution stack for accelerating ML pipelines helps address both the development and infrastructure management challenges of productizing end-to-end ML pipelines.

To see these technologies in action with a real example, check out this GTC21 Session – “Gain Competitive Advantage using MLOps: Kubeflow and NVIDIA Merlin and Google Cloud” to learn how GKE, NVIDIA A100 MIG, and NVIDIA’s GPU-optimized solution stack can be used to build and deploy an end-to-end recommender system. 

Categories
Misc

Run a GitHub project on google Colabratory

[removed]

submitted by /u/richard-romex
[visit reddit] [comments]

Categories
Misc

Machine Learning with ML.NET – Sentiment Analysis

Machine Learning with ML.NET - Sentiment Analysis submitted by /u/RubiksCodeNMZ
[visit reddit] [comments]
Categories
Misc

How to Build a Winning Recommendation System – Part 1

Recommender systems (RecSys) have become a key component in many online services, such as e-commerce, social media, news service, or online video streaming. However with their growth in importance,  the growth in scale of industry datasets, and more sophisticated models, the bar has been raised for computational resources required for recommendation systems.  After NVIDIA introduced Merlin … Continued

Recommender systems (RecSys) have become a key component in many online services, such as e-commerce, social media, news service, or online video streaming. However with their growth in importance,  the growth in scale of industry datasets, and more sophisticated models, the bar has been raised for computational resources required for recommendation systems. 

After NVIDIA introduced Merlin – a Framework for Deep Recommender Systems – to meet the computational demands for large-scale DL recommender systems, and a NVIDIA team won the ACM RecSys Challenge 2020,  now a NVIDIA team has won the  WSDM WebTour 21 Challenge organized by Booking.com.  The Booking.com challenge focused on predicting the last city destination for a traveler’s trip given their previous booking history within the trip. NVIDIA’s interdisciplinary team included colleagues from NVIDIA’s KGMON (Kaggle Grandmasters), NVIDIA’s RAPIDS (Data Science), and NVIDIA’s Merlin (Recommender Systems) who collaborated on the winning solution.

This post is the first of a three-part series that gives an overview of the NVIDIA team’s first place solution for the booking.com challenge. This first post gives an overview of recommender system concepts. The second post will discuss deep learning for recommender systems.  The third post will discuss the winning solution, the steps involved, and also what made a difference in the outcome.

What is a Recommendation System?

Recommender systems are trained to understand the preferences, previous decisions, and characteristics of people and products, using data gathered about their interactions, which include impressions, clicks, likes, and purchases. Recommender systems help solve information overload by helping users find relevant products from a wide range of selections by providing personalized content.  Because of their capability to predict consumer interests and desires on a highly personalized level, recommender systems are a favorite with content and product providers because they drive consumers to just about any product or service that interests them, from books to videos to health classes to clothing.

He image shows a user, items,  and a question mark representing which item to show the user.
Figure 1 A recommendation system filters items and only shows those most likely to induce an interaction.

Types of Recommendation Systems

Traditionally, recommender systems approaches could be divided into these broad categories:  collaborative filtering,  content filtering, and hybrid recommenders systems. More recently, some variations have been proposed to leverage explicitly the user context (context-aware recommendation), the sequence of user interactions (sequential recommendation) and the interactions of the current user session for next-click prediction (session-based recommendation).

Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part). This approach uses similarity of user preference behavior,  given previous interactions between users and items, recommender algorithms learn to predict future interaction. These recommender systems build a model from a user’s past behavior, such as items purchased previously or ratings given to those items and similar decisions by other users. The idea is that if some people have made similar decisions and purchases in the past, like a movie choice, then there is a high probability they will agree on additional future selections. For example, if a collaborative filtering recommender knows you and another user share similar tastes in movies, it might recommend a movie to you that it knows this other user already likes.

The image shows a movie watched by similar users being recommended.
Figure 2: collaborative filtering recommends items based on how similar users liked the item.

Content filtering, by contrast, uses the attributes or features of an item  (this is the content part) to recommend other items similar to the user’s preferences. This approach is based on similarity of items and user features,  given information about a user and items they have interacted with, (e.g. a user’s demographics, like age or gender, the category of a restaurant’s cuisine, the average review for a movie), model the likelihood of a new interaction.  For example, if a content filtering recommender sees you liked the movies “You’ve Got Mail” and “Sleepless in Seattle,” it might recommend another movie to you with the same genres and/or cast, such as “Joe Versus the Volcano.”

The image shows a movie with features similar to what the user has watched before being recommended.
Figure 3: Content filtering recommends items with features similar to the users’ preferences.

Collaborative filtering is straightforward to apply, as it only requires as input the user id and item id for each interaction. However, it requires a minimum number of interactions by user and by item before starting to provide meaningful recommendations, which is characterized as the cold-start problem. On the other hand, as content-based filtering only leverages the interactions of each user, it deals nicely with the user cold-start problem. But it tends to create a filter bubble, recommending only items very similar to those the user has interacted with before.

Hybrid recommender systems combine the advantages of the types above to create a more comprehensive recommending system.

Session or sequence-based recommender systems use the sequence of user item interactions within a session in the recommendation process. Examples include predicting the next item in an online shopping cart, the next video to watch, or in the booking.com example, the next travel destination of a traveler.

Netflix spoke at NVIDIA GTC about making better recommendations by framing a recommendation as a contextual sequence prediction. Their approach uses a sequence of user actions, plus the current context, to predict the probability of the next action. In the Netflix example, given one sequence for each user—the country, device, date, and time when they watched a movie—they trained a model to predict what to watch next. 

The image shows a sequence of Netflix user context and movie watched and a question for  the next movie watched.
Figure 4: Netflix uses a sequence of contextual user actions, plus the current context, to predict the probability of the next movie a user will want to watch.

How Recommenders Work

Recommender systems are trained using data gathered about the users, items, and their interactions, which include impressions, clicks, likes, mentions, and so on. How a recommender model makes recommendations will depend on the type of data you have.  If you only have data about which interactions have occurred in the past, you’ll probably be interested in collaborative filtering. If you have data describing the user and items they have interacted with (e.g. a user’s age, the category of a restaurant’s cuisine, the average review for a movie), you can model the likelihood of a new interaction given these properties at the current moment by adding content and context filtering.

The image shows a recommender function using user and product data to rank products by user preference, to propose new products by product similarity to propose products by user’s similarity,  in order to predict a user rating.
Figure 5: Recommenders use data gathered about the users, items, and their interactions to rank products by user preference, and then propose new products by product similarity and or to propose products by user’s similarity.

Matrix Factorization for Recommendation

Matrix factorization (MF) techniques are the core of many popular algorithms, including word embedding and topic modeling, and have become a dominant methodology within the collaborative-filtering-based recommendations. MF can be used to calculate the similarity in user’s ratings or interactions to provide recommendations. In the simple user-item matrix below, Ted and Carol like movies B and C. Bob likes movie B. To recommend a movie to Bob, matrix factorization calculates that users who liked B also liked C, so C is a possible recommendation for Bob.

The images shows a user item matrix with users as rows, Items as columns and a user rating for an item as the cell value.
Figure 6: A user-item matrix with users as rows, Items as columns, and a user rating for an item as the cell value.

Matrix factorization using the  alternating least squares (ALS) algorithm  approximates the sparse user item rating matrix u-by-i as the product of two dense matrices, user and item factor matrices of size u × f and f × i  (where u is the number of users, i the number of items and f the number of latent features) . The factor matrices represent latent or hidden features which the algorithm tries to discover. One matrix tries to describe the latent or hidden features of each user, and one tries to describe latent properties of each movie. For each user and for each item, the ALS algorithm iteratively learns (f) numeric “factors” that represent the user or item. In each iteration, the algorithm alternatively holds one factor matrix fixed and optimizes for the other by minimizing the loss function with respect to the other. This process continues until it converges. 

The image shows 3 matrices, a sparse user item rating matrix u-by-i as the product of two dense matrices, user and item factor matrices of size u × f and f × i
Figure 7: Matrix factorization factors a sparse ratings matrix R (u-by-i) into a u-by-f matrix (U) and an f-by-i matrix (I ).

Conclusion

In this blog, we gave an overview of recommender system concepts and matrix factorization. In part two we will go over deep learning models for recommender systems and in part three we will go over the booking.com winning solution. To learn more, be sure to: 

Categories
Misc

GAN for All Seasons: AI-Generated Art Accompanies Pandemic Poetry in The Washington Post

A recent National Poetry Month feature in The Washington Post presented AI-generated artwork alongside five original poems reflecting on seasons of the past year. 

A recent National Poetry Month feature in The Washington Post presented AI-generated artwork alongside five original poems reflecting on seasons of the past year. 

Created by the Lede Lab — an experimental news team at The Post dedicated to exploring emerging technologies and new storytelling techniques — the artwork combined the output of machine learning models including NVIDIA StyleGAN2. Developed by NVIDIA Research, StyleGAN is a popular AI for high-res image generation that’s been adopted for art exhibits, manga illustrations and reimagined historical portraits.

Running on NVIDIA GPUs in the cloud, StyleGAN2 was trained on scanned images of brush strokes and palette knife textures painted by the group’s designer, Shikha Subramaniam.

The team also used the open-source AttnGAN model to create generative art that responded line by line to each of the five commissioned poems in the piece. Combined, the outputs from both models created a series of abstract videos to accompany the text.

As readers scroll through the interactive feature, the dynamic AI-generated artwork morphs to reflect each line of the poems — in one case shifting from colorful to monochrome and back again.

“Anxiously watching the coronavirus spread across the globe, we missed sharing so much with others, including the four seasons with their shifts in color and temperature,” wrote Suzette Moyer, senior design editor at The Post. The poems — authored by Mary Szybist, Dorianne Laux, Ada Limón, Kazim Ali and Willie Perdomo — are “hopeful works about the seasons we missed and the days we can look forward to.”

View the interactive piece in The Post >>

For more AI-inspired artwork, visit the AI Art Gallery featured at the recent NVIDIA GPU Technology Conference.

 

Categories
Misc

Implementing backprop in python and comparing it to tensorflow

submitted by /u/jben_hun
[visit reddit] [comments]

Categories
Misc

Crop bounding box from an image

import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices(‘GPU’)
if len(physical_devices) > 0:
tf.config.experimental.set_memory_growth(physical_devices[0], True)
from absl import app, flags, logging
from absl.flags import FLAGS
import core.utils as utils
from core.yolov4 import filter_boxes
from tensorflow.python.saved_model import tag_constants
from PIL import Image
import cv2
import numpy as np
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
flags.DEFINE_string(‘framework’, ‘tf’, ‘(tf, tflite, trt’)
flags.DEFINE_string(‘weights’, ‘./checkpoints/yolov4-416’,
‘path to weights file’)
flags.DEFINE_integer(‘size’, 416, ‘resize images to’)
flags.DEFINE_boolean(‘tiny’, False, ‘yolo or yolo-tiny’)
flags.DEFINE_string(‘model’, ‘yolov4’, ‘yolov3 or yolov4’)
flags.DEFINE_string(‘image’, ‘./data/kite.jpg’, ‘path to input image’)
flags.DEFINE_string(‘output’, ‘result.png’, ‘path to output image’)
flags.DEFINE_float(‘iou’, 0.45, ‘iou threshold’)
flags.DEFINE_float(‘score’, 0.25, ‘score threshold’)
def main(_argv):
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
STRIDES, ANCHORS, NUM_CLASS, XYSCALE = utils.load_config(FLAGS)
input_size = FLAGS.size
image_path = FLAGS.image
original_image = cv2.imread(image_path)
original_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB)
# image_data = utils.image_preprocess(np.copy(original_image), [input_size, input_size])
image_data = cv2.resize(original_image, (input_size, input_size))
image_data = image_data / 255.
# image_data = image_data[np.newaxis, …].astype(np.float32)
images_data = []
for i in range(1):
images_data.append(image_data)
images_data = np.asarray(images_data).astype(np.float32)
if FLAGS.framework == ‘tflite’:
interpreter = tf.lite.Interpreter(model_path=FLAGS.weights)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details)
print(output_details)
interpreter.set_tensor(input_details[0][‘index’], images_data)
interpreter.invoke()
pred = [interpreter.get_tensor(output_details[i][‘index’]) for i in range(len(output_details))]
if FLAGS.model == ‘yolov3’ and FLAGS.tiny == True:
boxes, pred_conf = filter_boxes(pred[1], pred[0], score_threshold=0.25, input_shape=tf.constant([input_size, input_size]))
else:
boxes, pred_conf = filter_boxes(pred[0], pred[1], score_threshold=0.25, input_shape=tf.constant([input_size, input_size]))
else:
saved_model_loaded = tf.saved_model.load(FLAGS.weights, tags=[tag_constants.SERVING])
infer = saved_model_loaded.signatures[‘serving_default’]
batch_data = tf.constant(images_data)
pred_bbox = infer(batch_data)
for key, value in pred_bbox.items():
boxes = value[:, :, 0:4]
pred_conf = value[:, :, 4:]
boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
boxes=tf.reshape(boxes, (tf.shape(boxes)[0], -1, 1, 4)),
scores=tf.reshape(
pred_conf, (tf.shape(pred_conf)[0], -1, tf.shape(pred_conf)[-1])),
max_output_size_per_class=50,
max_total_size=50,
iou_threshold=FLAGS.iou,
score_threshold=FLAGS.score
)
pred_bbox = [boxes.numpy(), scores.numpy(), classes.numpy(), valid_detections.numpy()]
image = utils.draw_bbox(original_image, pred_bbox)
image = Image.fromarray(image.astype(np.uint8))
image.show()
image = cv2.cvtColor(np.array(image), cv2.COLOR_BGR2RGB)
cv2.imwrite(FLAGS.output, image)
if __name__ == ‘__main__’:
try:
app.run(main)
except SystemExit:
pass
PLZ HELP ME CROP THE BOUNDING BOX IN ORDER TO PERFORM A TESSERACT TO READ WHAT IS INSIDE THE BOUNDING BOX (DIGITS) .This is my work but it doesn’t crop

import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices(‘GPU’)
if len(physical_devices) > 0:
tf.config.experimental.set_memory_growth(physical_devices[0], True)
from absl import app, flags, logging
from absl.flags import FLAGS
import core.utils as utils
from core.yolov4 import filter_boxes
from tensorflow.python.saved_model import tag_constants
from PIL import Image
import cv2
import numpy as np
import os
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
flags.DEFINE_string(‘framework’, ‘tf’, ‘(tf, tflite, trt’)
flags.DEFINE_string(‘weights’, ‘./checkpoints/yolov4-416’,
‘path to weights file’)
flags.DEFINE_integer(‘size’, 416, ‘resize images to’)
flags.DEFINE_boolean(‘tiny’, False, ‘yolo or yolo-tiny’)
flags.DEFINE_string(‘model’, ‘yolov4’, ‘yolov3 or yolov4’)
flags.DEFINE_string(‘image’, ‘./data/kite.jpg’, ‘path to input image’)
flags.DEFINE_string(‘output’, ‘result.png’, ‘path to output image’)
flags.DEFINE_float(‘iou’, 0.45, ‘iou threshold’)
flags.DEFINE_float(‘score’, 0.25, ‘score threshold’)
flags.DEFINE_boolean(‘crop’, False, ‘crop detections from images’)
def crop_objects (img, data, path){
boxes, scores = data
class_name = “Compteur”
# get box coords
xmin, ymin, xmax, ymax = boxes[i]
# crop detection from image (take an additional 5 pixels around all edges)
cropped_img = img[int(ymin)-5:int(ymax)+5, int(xmin)-5:int(xmax)+5]
# construct image name and join it to path for saving crop properly
img_name = class_name +’.png’
img_path = os.path.join(path, img_name )
# save image
cv2.imwrite(img_path, cropped_img)
}
# helper function to convert bounding boxes from normalized ymin, xmin, ymax, xmax —> xmin, ymin, xmax, ymax
def format_boxes(bboxes, image_height, image_width):
for box in bboxes:
ymin = int(box[0] * image_height)
xmin = int(box[1] * image_width)
ymax = int(box[2] * image_height)
xmax = int(box[3] * image_width)
box[0], box[1], box[2], box[3] = xmin, ymin, xmax, ymax
return bboxes
def draw_bbox(image, bboxes, info = False, counted_classes = None, show_label=True, allowed_classes=list(read_class_names(cfg.YOLO.CLASSES).values()), read_plate = False):
classes = read_class_names(cfg.YOLO.CLASSES)
num_classes = len(classes)
image_h, image_w, _ = image.shape
hsv_tuples = [(1.0 * x / num_classes, 1., 1.) for x in range(num_classes)]
colors = list(map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples))
colors = list(map(lambda x: (int(x[0] * 255), int(x[1] * 255), int(x[2] * 255)), colors))
random.seed(0)
random.shuffle(colors)
random.seed(None)
out_boxes, out_scores, out_classes, num_boxes = bboxes
for i in range(num_boxes):
if int(out_classes[i]) < 0 or int(out_classes[i]) > num_classes: continue
coor = out_boxes[i]
fontScale = 0.5
score = out_scores[i]
class_ind = int(out_classes[i])
class_name = classes[class_ind]
if class_name not in allowed_classes:
continue
else:
if read_plate:
height_ratio = int(image_h / 25)
plate_number = recognize_plate(image, coor)
if plate_number != None:
cv2.putText(image, plate_number, (int(coor[0]), int(coor[1]-height_ratio)),
cv2.FONT_HERSHEY_SIMPLEX, 1.25, (255,255,0), 2)
bbox_color = colors[class_ind]
bbox_thick = int(0.6 * (image_h + image_w) / 600)
c1, c2 = (coor[0], coor[1]), (coor[2], coor[3])
cv2.rectangle(image, c1, c2, bbox_color, bbox_thick)
if info:
print(“Object found: {}, Confidence: {:.2f}, BBox Coords (xmin, ymin, xmax, ymax): {}, {}, {}, {} “.format(class_name, score, coor[0], coor[1], coor[2], coor[3]))
if show_label:
bbox_mess = ‘%s: %.2f’ % (class_name, score)
t_size = cv2.getTextSize(bbox_mess, 0, fontScale, thickness=bbox_thick // 2)[0]
c3 = (c1[0] + t_size[0], c1[1] – t_size[1] – 3)
cv2.rectangle(image, c1, (np.float32(c3[0]), np.float32(c3[1])), bbox_color, -1) #filled
cv2.putText(image, bbox_mess, (c1[0], np.float32(c1[1] – 2)), cv2.FONT_HERSHEY_SIMPLEX,
fontScale, (0, 0, 0), bbox_thick // 2, lineType=cv2.LINE_AA)
if counted_classes != None:
height_ratio = int(image_h / 25)
offset = 15
for key, value in counted_classes.items():
cv2.putText(image, “{}s detected: {}”.format(key, value), (5, offset),
cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0, 255, 0), 2)
offset += height_ratio
return image

def main(_argv):
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
STRIDES, ANCHORS, NUM_CLASS, XYSCALE = utils.load_config(FLAGS)
input_size = FLAGS.size
image_path = FLAGS.image
original_image = cv2.imread(image_path)
original_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB)
# image_data = utils.image_preprocess(np.copy(original_image), [input_size, input_size])
image_data = cv2.resize(original_image, (input_size, input_size))
image_data = image_data / 255.
# image_data = image_data[np.newaxis, …].astype(np.float32)
images_data = []
for i in range(1):
images_data.append(image_data)
images_data = np.asarray(images_data).astype(np.float32)
if FLAGS.framework == ‘tflite’:
interpreter = tf.lite.Interpreter(model_path=FLAGS.weights)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details)
print(output_details)
interpreter.set_tensor(input_details[0][‘index’], images_data)
interpreter.invoke()
pred = [interpreter.get_tensor(output_details[i][‘index’]) for i in range(len(output_details))]
if FLAGS.model == ‘yolov3’ and FLAGS.tiny == True:
boxes, pred_conf = filter_boxes(pred[1], pred[0], score_threshold=0.25, input_shape=tf.constant([input_size, input_size]))
else:
boxes, pred_conf = filter_boxes(pred[0], pred[1], score_threshold=0.25, input_shape=tf.constant([input_size, input_size]))
else:
saved_model_loaded = tf.saved_model.load(FLAGS.weights, tags=[tag_constants.SERVING])
infer = saved_model_loaded.signatures[‘serving_default’]
batch_data = tf.constant(images_data)
pred_bbox = infer(batch_data)
for key, value in pred_bbox.items():
boxes = value[:, :, 0:4]
pred_conf = value[:, :, 4:]
boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
boxes=tf.reshape(boxes, (tf.shape(boxes)[0], -1, 1, 4)),
scores=tf.reshape(
pred_conf, (tf.shape(pred_conf)[0], -1, tf.shape(pred_conf)[-1])),
max_output_size_per_class=50,
max_total_size=50,
iou_threshold=FLAGS.iou,
score_threshold=FLAGS.score
)
# format bounding boxes from normalized ymin, xmin, ymax, xmax —> xmin, ymin, xmax, ymax
original_h, original_w, _ = original_image.shape
bboxes = format_boxes(boxes.numpy()[0], original_h, original_w)

# hold all detection data in one variable
pred_bbox = [bboxes, scores.numpy()[0], classes.numpy()[0], valid_detections.numpy()[0]]
image = utils.draw_bbox(original_image, pred_bbox)
# image = utils.draw_bbox(image_data*255, pred_bbox)
image = Image.fromarray(image.astype(np.uint8))
image.show()
image = cv2.cvtColor(np.array(image), cv2.COLOR_BGR2RGB)
cv2.imwrite(FLAGS.output, image)
# if crop flag is enabled, crop each detection and save it as new image
if FLAGS.crop:
crop_path = os.path.join(os.getcwd(), ‘detections’, ‘crop’, image_name)
try:
os.mkdir(crop_path)
except FileExistsError:
pass
crop_objects(cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB), pred_bbox, crop_path)
if __name__ == ‘__main__’:
try:
app.run(main)
except SystemExit:
pass

submitted by /u/artificialYolov4
[visit reddit] [comments]

Categories
Misc

Why could I be getting high validation loss after loading a model?

I’m basically following the fine tuning instructions for efficientnet here. I use the model without top weights, freezing the rest, and train the model some, training and validation loss both ended around 1.7, and both accuracies were around 0.34. Then I saved the model, and loaded it again, and unfreezed the top 20 layers except for batch normalization ones. When I start training again, training loss is going down, between 1 and 2, and training accuracy is going up. But validation loss is in the hundreds, going up and down, and validation accuracy is oscillating between values of 0.01-0.3. Sometimes the validation accuracy goes down but the loss goes down also. Any ideas why the validation loss would be so high? I’m using ImageDataGenerator with a validation split of 0.2, and train_datagen.flow_from_dataframe with subsets for training and validation for each, for each training run. I saved the model with h5 format.

submitted by /u/Sea_Ad5023
[visit reddit] [comments]

Categories
Misc

cuSPARSELt v0.1.0 Now Available: Arm and Windows Support

NVIDIA announced the availability of cuSPARSELt version 0.1.0. This software can be downloaded now free for members of the NVIDIA Developer Program.

Today, NVIDIA is announcing the availability of cuSPARSELt version 0.1.0. This software can be downloaded now free for members of the NVIDIA Developer Program.

Download Now

What’s New

  • Support for Window 10 (x86_64)
  • Support for Linux ARM
  • Introduced SM 8.6 Compatibility
  • Support for TF32 compute type
  • Better performance for SM 8.0 kernels (up to 90% SOL)
  • Position independent sparseA / sparseB
  • New APIs for compression and pruning
    • Decoupled from cusparseLtMatmulPlan_t

See the cuSPARSELt Release Notes for more information

About cuSPARSELt

NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a sparse matrix:

D=alpha op(A) cdot op(B) + beta op(C)

In this formula, op(A) and op(B) refer to in-place operations such as transpose/non-transpose.

The cuSPARSELt APIs allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.

Key features:

  • NVIDIA Sparse MMA tensor core support
  • Mixed-precision computation support:
    • FP16 input/output, FP32 Tensor Core accumulate
    • BFLOAT16 input/output, FP32 Tensor Core accumulate
    • INT8 input/output, INT32 Tensor Core compute
    • FP32 input/output, TF32 Tensor Core compute
    • TF32 input/output, TF32 Tensor Core compute
  • Matrix pruning and compression functionalities
  • Auto-tuning functionality (see cusparseLtMatmulSearch())

Learn more:

Recent Developer Blog posts: