Categories
Misc

Looking for people to test my new GPU/Ubuntu virtual machine “cloud” service!

Hi everyone! I’ve spent the last couple of months building and configuring a virtual GPU “cloud/instance” service. I’m looking for anyone with ML/DL/TensorFlow/Ubuntu/GPU experience to put my VMs to the test and let me know what you like & dislike about it. It’s still in the early beta stages, so I’d like to know how training times and latency compare to what you’re currently used to. Absolutely free of charge. SSH and VNC connections are available through the web. For security purposes, you must connect to my VPN to gain access. Let me know if you’re willing to try this out. All constructive criticism is greatly appreciated!

submitted by /u/GPUaccelerated

Categories
Offsites

Reproducibility in Deep Learning and Smooth Activations

Ever queried a recommender system and found that the same search only a few moments later or on a different device yields very different results? This is not uncommon and can be frustrating if a person is looking for something specific. For the designer of such a system, it is also not uncommon for the measured metrics to change from design and testing to deployment, calling into question the utility of the experimental testing phase. Some level of such irreproducibility can be expected as the world changes and new models are deployed. However, this also happens regularly as requests hit duplicates of the same model or models are being refreshed.

Lack of replicability, where researchers are unable to reproduce published results with a given model, has been identified as a challenge in the field of machine learning (ML). Irreproducibility is a related but more elusive problem, where multiple instances of a given model are trained on the same data under identical training conditions, but yield different results. Only recently has irreproducibility been identified as a difficult problem, but due to its complexity, theoretical studies to understand this problem are extremely rare.

In practice, deep network models are trained in highly parallelized and distributed environments. Nondeterminism in training (from random initialization, parallelism, distributed training, data shuffling, quantization errors, hardware types, and more), combined with objectives that have multiple local optima, contributes to the problem of irreproducibility. Some of these factors, such as initialization, can be controlled, but it is impractical to control others. Optimization trajectories can diverge early in training by following training examples in the order seen, leading to very different models. Several recently published solutions [1, 2, 3] based on advanced combinations of ensembling, self-ensembling, and distillation can mitigate the problem, but usually at the cost of accuracy and with increased complexity, maintenance, and improvement costs.

In “Real World Large Scale Recommendation Systems Reproducibility and Smooth Activations”, we consider a different practical solution to this problem that does not incur the costs of other solutions, while still improving reproducibility and yielding higher model accuracy. We discover that the Rectified Linear Unit (ReLU), which is very popular as the nonlinearity function (i.e., activation function) used to transform values in neural networks, exacerbates the irreproducibility problem. On the other hand, we demonstrate that smooth activation functions, which have derivatives that are continuous for the whole domain, unlike those of ReLU, are able to substantially reduce irreproducibility levels. We then propose the Smooth reLU (SmeLU) activation function, which gives comparable reproducibility and accuracy benefits to other smooth activations but is much simpler.

The ReLU function (left) as a function of the input signal, and its gradient (right) as a function of the input.

Smooth Activations
An ML model attempts to learn the best model parameters that fit the training data by minimizing a loss, which can be imagined as a landscape with peaks and valleys, where the lowest point attains an optimal solution. For deep models, the landscape may consist of many such peaks and valleys. The activation function used by the model governs the shape of this landscape and how the model navigates it.

ReLU, which is not a smooth function, imposes an objective whose landscape is partitioned into many regions with multiple local minima, each providing different model predictions. With this landscape, the order in which updates are applied is a dominant factor in determining the optimization trajectory, providing a recipe for irreproducibility. Because of its non-continuous gradient, functions expressed by a ReLU network will contain sudden jumps in the gradient, which can occur internally in different layers of the deep network, affecting updates of different internal units, and are likely strong contributors to irreproducibility.

Suppose a sequence of model updates attempts to push the activation of some unit down from a positive value. The gradient of the ReLU function is 1 for positive unit values, so with every update it pushes the unit to become smaller and smaller (to the left in the panel above). At the point the activation of this unit crosses the threshold from a positive value to a negative one, the gradient suddenly changes from magnitude 1 to magnitude 0. Training attempts to keep moving the unit leftwards, but due to the 0 gradient, the unit cannot move further in that direction. Therefore, the model must resort to updating other units that can move.
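
A toy sketch (ours, not from the paper) makes this stall concrete: gradient descent pushes a unit's pre-activation down through zero, and the ReLU path freezes the moment it crosses the threshold, while a smooth activation (Softplus here) keeps moving.

import math

def relu_grad(z):
    return 1.0 if z > 0 else 0.0        # jumps from 1 to 0 at the threshold

def softplus_grad(z):
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid: decays smoothly toward 0

z_relu = z_smooth = 0.5
lr = 0.3
for step in range(5):
    z_relu -= lr * relu_grad(z_relu)          # stalls once z_relu <= 0
    z_smooth -= lr * softplus_grad(z_smooth)  # keeps inching leftwards
    print(f"step {step}: z_relu={z_relu:+.3f}  z_smooth={z_smooth:+.3f}")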

We find that networks with smooth activations (e.g., GELU, Swish and Softplus) can be substantially more reproducible. They may exhibit a similar objective landscape, but with fewer regions, giving a model fewer opportunities to diverge. Unlike the sudden jumps with ReLU, for a unit with decreasing activations, the gradient gradually reduces to 0, which gives other units opportunities to adjust to the changing behavior. With equal initialization, moderate shuffling of training examples, and normalization of hidden layer outputs, smooth activations are able to increase the chances of converging to the same minimum. Very aggressive data shuffling, however, loses this advantage.

The rate that a smooth activation function transitions between output levels, i.e., its “smoothness”, can be adjusted. Sufficient smoothness leads to improved accuracy and reproducibility. Too much smoothness, though, approaches linear models with a corresponding degradation of model accuracy, thus losing the advantages of using a deep network.

Smooth activations (top) and their gradients (bottom) for different smoothness parameter values β as a function of the input values. β determines the width of the transition region between 0 and 1 gradients. For Swish and Softplus, a greater β gives a narrower region; for SmeLU, a greater β gives a wider region.

Smooth reLU (SmeLU)
Activations like GELU and Swish require complex hardware implementations to support exponential and logarithmic functions. Further, GELU must be computed numerically or approximated. These properties can make deployment error-prone, expensive, or slow. GELU and Swish are not monotonic (they start by slightly decreasing and then switch to increasing), which may interfere with interpretability (or identifiability); nor do they have a full stop or a clean slope-1 region, properties that simplify implementation and may aid in reproducibility.

The Smooth reLU (SmeLU) activation function is designed as a simple function that addresses the concerns with other smooth activations. It connects a 0 slope on the left with a slope 1 line on the right through a quadratic middle region, constraining continuous gradients at the connection points (as an asymmetric version of a Huber loss function).
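
Concretely, a minimal NumPy sketch of SmeLU following this piecewise definition (zero to the left of −β, identity to the right of β, and a quadratic in between):

import numpy as np

def smelu(x, beta=1.0):
    # 0 for x <= -beta; (x + beta)^2 / (4 beta) for |x| <= beta; x for x >= beta.
    return np.where(x <= -beta, 0.0,
                    np.where(x >= beta, x, (x + beta) ** 2 / (4.0 * beta)))

def smelu_grad(x, beta=1.0):
    # The gradient ramps linearly from 0 to 1 across the transition region,
    # so it is continuous at both connection points.
    return np.clip((x + beta) / (2.0 * beta), 0.0, 1.0)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(smelu(x))       # [0.    0.    0.25  1.    2.  ]
print(smelu_grad(x))  # [0.    0.    0.5   1.    1.  ]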

SmeLU can be viewed as a convolution of ReLU with a box. It provides a cheap and simple smooth solution that is comparable in reproducibility-accuracy tradeoffs to more computationally expensive and complex smooth activations. The figure below illustrates the transition of the loss (objective) surface as we gradually transition from a non-smooth ReLU to a smoother SmeLU. A transition of width 0 is the basic ReLU function for which the loss objective has many local minima. As the transition region widens (SmeLU), the loss surface becomes smoother. If the transition is too wide, i.e., too smooth, the benefit of using a deep network wanes and we approach the linear model solution — the objective surface flattens, potentially losing the ability of the network to express much information.

Loss surfaces (as functions of a 2D input) for two sample loss functions (middle and right) as the activation function’s transition region widens, going from ReLU to an increasingly smoother SmeLU (left). The loss surface becomes smoother as the smoothness of the SmeLU function increases.

Performance
SmeLU has benefited multiple systems, specifically recommendation systems, increasing their reproducibility by reducing, for example, recommendation swap rates. While the use of SmeLU results in accuracy improvements over ReLU, it also replaces other costly methods to address irreproducibility, such as ensembles, which mitigate irreproducibility at the cost of accuracy. Moreover, replacing ensembles in sparse recommendation systems reduces the need for multiple lookups of model parameters that are needed to generate an inference for each of the ensemble components. This substantially improves training and inference efficiency.

To illustrate the benefits of smooth activations, we plot the relative prediction difference (PD) as a function of change in some loss for the different activations. We define relative PD as the ratio between the absolute difference in predictions of two models and their expected prediction, averaged over all evaluation examples. We have observed that in large scale systems, it is sufficient, and inexpensive, to consider only two models for very consistent results.
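
As a concrete sketch of this metric (our own code, following the definition above):

import numpy as np

def relative_pd(preds_a, preds_b):
    # Absolute difference between two models' predictions, divided by the
    # pair's expected prediction, averaged over evaluation examples.
    diff = np.abs(preds_a - preds_b)
    expected = (preds_a + preds_b) / 2.0
    return np.mean(diff / expected)

p1 = np.array([0.10, 0.40, 0.80])
p2 = np.array([0.12, 0.35, 0.82])
print(relative_pd(p1, p2))  # ~0.11, i.e., ~11% average relative disagreement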

The figure below shows curves on the PD-accuracy loss plane. For reproducibility, being lower on the curve is better, and for accuracy, being on the left is better. Smooth activations can yield a ballpark 50% reduction in PD relative to ReLU, while still potentially resulting in improved accuracy. SmeLU yields accuracy comparable to other smooth activations, but is more reproducible (lower PD) while still outperforming ReLU in accuracy.

Relative PD as a function of percentage change in the evaluation ranking loss, which measures how accurately items are ranked in a recommendation system (higher values indicate worse accuracy), for different activations.

Conclusion and Future Work
We demonstrated the problem of irreproducibility in real-world practical systems, and how it affects users as well as system and model designers. While this particular issue has been given very little attention compared with the lack of replicability of research results, irreproducibility can be a critical problem. We demonstrated that a simple solution of using smooth activations can substantially reduce the problem without degrading other critical metrics like model accuracy. We also presented a new smooth activation function, SmeLU, which has the added benefits of mathematical simplicity and ease of implementation, making it cheap and less error-prone to deploy.

Understanding reproducibility, especially in deep networks, where objectives are not convex, is an open problem. An initial theoretical framework for the simpler convex case has recently been proposed, but more research must be done to gain a better understanding of this problem as it applies to practical systems that rely on deep networks.

Acknowledgements
We would like to thank Sergey Ioffe for early discussions about SmeLU; Lorenzo Coviello and Angel Yu for help in early adoptions of SmeLU; Shiv Venkataraman for sponsorship of the work; Claire Cui for discussion and support from the very beginning; Jeremiah Willcock, Tom Jablin, and Cliff Young for substantial implementation support; Yuyan Wang, Mahesh Sathiamoorthy, Myles Sussman, Li Wei, Kevin Regan, Steven Okamoto, Qiqi Yan, Todd Phillips, Ed Chi, Sunita Verna, and many many others for many discussions, and for integrations in many different systems; Matt Streeter and Yonghui Wu for feedback on the paper and this post; Tom Small for help with the illustrations in this post.

Categories
Misc

Green Teams Achieve the Dream: NVIDIA Announces NPN Americas Partners of the Year

A dozen companies today received NVIDIA’s highest award for partners, recognizing their impact on AI education and adoption across such industries as education, federal, healthcare, and technology. The winners of the 2021 NPN Americas Partner of the Year Awards have created a profound impact on AI by helping customers meet the demands of recommender systems…

The post Green Teams Achieve the Dream: NVIDIA Announces NPN Americas Partners of the Year appeared first on NVIDIA Blog.

Categories
Misc

Unreal Engine and NVIDIA: From One Generation to the Next

Square Enix presents the fictional city of Midgar in Final Fantasy VII Remake at a filmic level of detail. Epic’s Fortnite bathes its environments in ray-traced sunlight, simulating how light bounces in the real world. And artists at Lucasfilm revolutionized virtual production techniques in The Mandalorian, using synchronized NVIDIA RTX GPUs to drive pixels on LED…

The post Unreal Engine and NVIDIA: From One Generation to the Next appeared first on NVIDIA Blog.

Categories
Misc

Shaping the Future of Graphics with NVIDIA Technologies in Unreal Engine 5

With the launch of Unreal Engine 5, NVIDIA announces support with key RTX technologies for developers to propel their games and experiences to the next level.

Unreal Engine is an open and advanced real-time 3D creation platform. Having evolved from state-of-the-art use in game engines to a multitude of industries, it lets creators deliver cutting-edge content, interactive experiences, and immersive virtual worlds. NVIDIA strives to simplify adoption of our technologies so developers can get hands-on with leading-edge RTX features.

Tens of thousands of developers leverage NVIDIA technologies and Unreal Engine to propel their games and experiences to the next level. You can get started building applications today.

We’ve created a series of introductory videos for NVIDIA technologies in UE5; you’ll find them embedded in the sections below.

The video below provides an overview for implementing ray tracing in Unreal Engine 5.

Figure 1. Learn how to set up Hardware Ray Tracing in Unreal Engine 5.

Deep Learning Super Sampling 

NVIDIA Deep Learning Super Sampling (DLSS) is a plug-in that uses deep learning algorithms to upscale or “super sample” an image, which helps during GPU-heavy workloads like ray tracing. NVIDIA DLSS takes a lower-resolution image and increases its resolution.

Core benefits include uncompromised quality with higher performance. DLSS uses advanced AI rendering to produce image quality comparable to native resolution, and in some cases even better quality, while only conventionally rendering a fraction of the pixels. A new temporal feedback technique gives incredibly sharp image details and improved stability from frame to frame.

With DLSS, developers can choose among several image quality modes, from Quality to Ultra Performance. These modes balance quality and performance by controlling the game’s internal rendering resolution.

Additional benefits include:

  • NVIDIA Image Sharpening, a spatial upscaler and sharpening algorithm that provides cross-platform support for non-RTX GPUs.
  • Deep Learning Anti-Aliasing (DLAA), an AI-based anti-aliasing mode for users with spare GPU headroom who want higher levels of image quality.

Get started with DLSS

To get started, head to the DLSS download page, scroll down to the “Download UE Plugin” section, accept the terms of agreement, and launch the UE5 download link. Then you’re all set!

For a more visual walkthrough of how to install and implement DLSS, view the video below.

Figure 2. Learn how to use DLSS, DLAA, and NVIDIA Image Scaling for Unreal Engine 5.

RTX Global Illumination

NVIDIA RTX Global Illumination (RTXGI) is a fast, high quality, and scalable real-time global illumination solution. It uses ray tracing to provide infinite bounce in indirect lighting, without the need to bake lighting or create multiple light setups for scenes. 

You can customize RTXGI to your needs on any DXR-enabled GPU, including the GeForce RTX 30 series, RTX 20 series, GTX 1660 series, and GTX 10 series. RTXGI’s scalable design gives you the control to decide when and where you want to crank up performance or max out image quality.

RTXGI is available as a plugin for UE5; however, if you want more advanced features, such as indirect lighting in reflections and translucency support, you can download the NVRTX branch.

Get started with RTXGI

To get started, head to the download page, scroll down to the “Download UE Plugin” section, accept the terms of agreement to start the UE5 download, and you’re set.

View how to install and implement RTXGI with the video below.

Figure 3. RTXGI for Unreal Engine 5 offers indirect lighting, infinite colored bounces, and soft shadows while being fast and scalable.

NVIDIA Reflex

The NVIDIA Reflex plug-in reduces system latency, which is key for any title where a responsive experience is required. With native support in Unreal Engine 5, simply navigate to the plug-ins folder in UE5, search for NVIDIA Reflex, and enable it.

Key features include:

  • Low Latency Mode for reducing latency.
  • Reflex Stats and Latency Markers. With Reflex Stats, gamers can get per-frame PC Latency without any special hardware. This metric is ideal for tweaking game, OS, and GPU settings. 
  • Automatic Configuration and Flash Indicator for the Reflex Latency Analyzer, which detects clicks coming from your mouse and measures the time it takes for the resulting pixels (such as a gun muzzle flash) to change on screen. The results are displayed through GeForce Experience.

Review the video walkthrough below for information on installing and implementing NVIDIA Reflex.

Figure 4. Learn how to reduce system latency by optimizing the mouse-to-screen chain of events using NVIDIA Reflex.

NVIDIA Omniverse Connector for Unreal Engine 5 preview

NVIDIA Omniverse is a 3D design collaboration and virtual world simulation platform for creators to connect and enhance 3D workflows. Developers can also easily build advanced 3D tools and expand their ecosystem reach.

At GTC 2022, NVIDIA introduced an updated Omniverse Connector for Unreal Engine including the ability to export the source geometry of Nanite meshes from Unreal Engine 5.

Review the Omniverse Connect documentation for information on installing and using Unreal Engine Omniverse Connector 104.1. For more information on Unreal Engine and NVIDIA technologies, visit our Unreal Engine developer page.

*Disclaimer: All versions of these plug-ins have been tested on UE 5.0 Preview 2 and may not be compatible with the full release of Unreal Engine 5.0.

Categories
Offsites

Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

In recent years, large neural networks trained for language understanding and generation have achieved impressive results across a wide range of tasks. GPT-3 first showed that large language models (LLMs) can be used for few-shot learning and can achieve impressive results without large-scale task-specific data collection or model parameter updating. More recent LLMs, such as GLaM, LaMDA, Gopher, and Megatron-Turing NLG, achieved state-of-the-art few-shot results on many tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources. Yet much work remains in understanding the capabilities that emerge with few-shot learning as we push the limits of model scale.

Last year Google Research announced our vision for Pathways, a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators. In “PaLM: Scaling Language Modeling with Pathways”, we introduce the Pathways Language Model (PaLM), a 540-billion parameter, dense decoder-only Transformer model trained with the Pathways system, which enabled us to efficiently train a single model across multiple TPU v4 Pods. We evaluated PaLM on hundreds of language understanding and generation tasks, and found that it achieves state-of-the-art few-shot performance across most tasks, by significant margins in many cases.

As the scale of the model increases, the performance improves across tasks while also unlocking new capabilities.

Training a 540-Billion Parameter Language Model with Pathways
PaLM demonstrates the first large-scale use of the Pathways system to scale training to 6144 chips, the largest TPU-based system configuration used for training to date. The training is scaled using data parallelism at the Pod level across two Cloud TPU v4 Pods, while using standard data and model parallelism within each Pod. This is a significant increase in scale compared to most previous LLMs, which were either trained on a single TPU v3 Pod (e.g., GLaM, LaMDA), used pipeline parallelism to scale to 2240 A100 GPUs across GPU clusters (Megatron-Turing NLG), or used multiple TPU v3 Pods (Gopher) with a maximum scale of 4096 TPU v3 chips.

PaLM achieves a training efficiency of 57.8% hardware FLOPs utilization, the highest yet achieved for LLMs at this scale. This is due to a combination of the parallelism strategy and a reformulation of the Transformer block that allows for attention and feedforward layers to be computed in parallel, enabling speedups from TPU compiler optimizations.
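
Schematically, the reformulation looks like this (a sketch of the idea, not PaLM’s actual implementation):

# Standard Transformer block (serial):
#   y = x + mlp(norm(x + attention(norm(x))))
# Parallel formulation described in the PaLM paper:
#   y = x + attention(norm(x)) + mlp(norm(x))

def parallel_block(x, attention, mlp, norm):
    h = norm(x)                        # one shared normalization
    return x + attention(h) + mlp(h)   # branches can run concurrently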

PaLM was trained using a combination of English and multilingual datasets that include high-quality web documents, books, Wikipedia, conversations, and GitHub code. We also created a “lossless” vocabulary that preserves all whitespace (especially important for code), splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit.
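
As a toy illustration of the digit-splitting idea (our own regex, not the actual PaLM tokenizer):

import re

def number_aware_pieces(text):
    # Runs of non-digit characters stay together; every digit stands alone.
    return re.findall(r"\d|[^\d\s]+", text)

print(number_aware_pieces("Price: 2048 dollars"))
# ['Price:', '2', '0', '4', '8', 'dollars']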

Breakthrough Capabilities on Language, Reasoning, and Code Tasks
PaLM shows breakthrough capabilities on numerous very difficult tasks. We highlight a few examples for language understanding and generation, reasoning, and code-related tasks below.

Language Understanding and Generation
We evaluated PaLM on 29 widely-used English natural language processing (NLP) tasks. PaLM 540B surpassed few-shot performance of prior large models, such as GLaM, GPT-3, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA, on 28 of the 29 tasks that span question-answering tasks (open-domain closed-book variant), cloze and sentence-completion tasks, Winograd-style tasks, in-context reading comprehension tasks, common-sense reasoning tasks, SuperGLUE tasks, and natural language inference tasks.

PaLM 540B performance improvement over prior state-of-the-art (SOTA) results on 29 English-based NLP tasks.

In addition to English NLP tasks, PaLM also shows strong performance on multilingual NLP benchmarks, including translation, even though only 22% of the training corpus is non-English.

We also probe emerging and future capabilities of PaLM on the Beyond the Imitation Game Benchmark (BIG-bench), a recently released suite of more than 150 new language modeling tasks, and find that PaLM achieves breakthrough performance. We compare the performance of PaLM to Gopher and Chinchilla, averaged across a common subset of 58 of these tasks. Interestingly, we note that PaLM’s performance as a function of scale follows a log-linear behavior similar to prior models, suggesting that performance improvements from scale have not yet plateaued. PaLM 540B 5-shot also does better than the average performance of people asked to solve the same tasks.

Scaling behavior of PaLM on a subset of 58 BIG-bench tasks. 

PaLM demonstrates impressive natural language understanding and generation capabilities on several BIG-bench tasks. For example, the model can distinguish cause and effect, understand conceptual combinations in appropriate contexts, and even guess the movie from an emoji.

Examples that showcase PaLM 540B 1-shot performance on BIG-bench tasks: labeling cause and effect, conceptual understanding, guessing movies from emoji, and finding synonyms and counterfactuals.

Reasoning
By combining model scale with chain-of-thought prompting, PaLM shows breakthrough capabilities on reasoning tasks that require multi-step arithmetic or common-sense reasoning. Prior LLMs, like Gopher, saw less benefit from model scale in improving performance.

Standard prompting versus chain-of-thought prompting for an example grade-school math problem. Chain-of-thought prompting decomposes the prompt for a multi-step reasoning problem into intermediate steps (highlighted in yellow), similar to how a person would approach it.
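
To make the format concrete, here is a 1-shot prompt in this style (a hypothetical illustration using the canonical chain-of-thought exemplar, not an excerpt from the paper):

prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: A juggler has 16 balls. Half of the balls are golf balls, and half of
the golf balls are blue. How many blue golf balls are there?
A:"""
# The model is expected to continue with intermediate steps, e.g.:
# "16 / 2 = 8 golf balls. 8 / 2 = 4 blue golf balls. The answer is 4."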

We observed strong performance from PaLM 540B combined with chain-of-thought prompting on three arithmetic datasets and two commonsense reasoning datasets. For example, with 8-shot prompting, PaLM solves 58% of the problems in GSM8K, a benchmark of thousands of challenging grade school level math questions, outperforming the prior top score of 55% achieved by fine-tuning the GPT-3 175B model with a training set of 7500 problems and combining it with an external calculator and verifier.

This new score is especially interesting, as it approaches the 60% average of problems solved by 9-12 year olds, who are the target audience for the question set. We suspect that separate encoding of digits in the PaLM vocabulary helps enable these performance improvements.

Remarkably, PaLM can even generate explicit explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding. For example, it can provide high quality explanations for novel jokes not found on the web.

PaLM explains an original joke with two-shot prompts.

Code Generation
LLMs have also been shown [1, 2, 3, 4] to generalize well to coding tasks, such as writing code given a natural language description (text-to-code), translating code from one language to another, and fixing compilation errors (code-to-code).

PaLM 540B shows strong performance across coding tasks and natural language tasks in a single model, even though it has only 5% code in the pre-training dataset. Its few-shot performance is especially remarkable because it is on par with the fine-tuned Codex 12B while using 50 times less Python code for training. This result reinforces earlier findings that larger models can be more sample efficient than smaller models because they better transfer learning from other programming languages and natural language data.

Examples of a fine-tuned PaLM 540B model on text-to-code tasks, such as GSM8K-Python and HumanEval, and code-to-code tasks, such as Transcoder.

We also see a further increase in performance by fine-tuning PaLM on a Python-only code dataset, which we refer to as PaLM-Coder. For an example code repair task called DeepFix, where the objective is to modify initially broken C programs until they compile successfully, PaLM-Coder 540B demonstrates impressive performance, achieving a compile rate of 82.1%, which outperforms the prior 71.7% state of the art. This opens up opportunities for fixing more complex errors that arise during software development.

An example from the DeepFix Code Repair task. The fine-tuned PaLM-Coder 540B fixes compilation errors (left, in red) to a version of code that compiles (right).

Ethical Considerations
Recent research has highlighted various potential risks associated with LLMs trained on web text. It is crucial to analyze and document such potential undesirable risks through transparent artifacts such as model cards and datasheets, which also include information on intended use and testing. To this end, our paper provides a datasheet, model card and Responsible AI benchmark results, and it reports thorough analyses of the dataset and model outputs for biases and risks. While the analysis helps outline some potential risks of the model, domain- and task-specific analysis is essential to truly calibrate, contextualize, and mitigate possible harms. Further understanding of risks and benefits of these models is a topic of ongoing research, together with developing scalable solutions that can put guardrails against malicious uses of language models.

Conclusion and Future Work
PaLM demonstrates the scaling capability of the Pathways system to thousands of accelerator chips across two TPU v4 Pods by training a 540-billion parameter model efficiently with a well-studied, well-established recipe of a dense decoder-only Transformer model. Pushing the limits of model scale enables breakthrough few-shot performance of PaLM across a variety of natural language processing, reasoning, and code tasks.

PaLM paves the way for even more capable models by combining the scaling capabilities with novel architectural choices and training schemes, and brings us closer to the Pathways vision:

“Enable a single AI system to generalize across thousands or millions of tasks, to understand different types of data, and to do so with remarkable efficiency.”

Acknowledgements
PaLM is the result of a large, collaborative effort by many teams within Google Research and across Alphabet. We’d like to thank the entire PaLM team for their contributions: Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Mishra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, and Jason Wei. PaLM builds on top of work by many, many teams at Google and we would especially like to recognize the T5X team, the Pathways infrastructure team, the JAX team, the Flaxformer team, the XLA team, the Plaque team, the Borg team, and the Datacenter networking infrastructure team. We’d like to thank our co-authors on this blog post, Alexander Spiridonov and Maysam Moussalem, as well as Josh Newlan and Tom Small for the images and animations in this blog post. Finally, we would like to thank our advisors for the project: Noah Fiedel, Slav Petrov, Jeff Dean, Douglas Eck, and Kathy Meier-Hellstern.

Categories
Misc

Meet the Omnivore: Videographer Makes Digital Walls, Virtual Homes Pop With NVIDIA Omniverse

Pekka Varis’s artistry has come a long way from his early days as a self-styled “punk activist” who spray painted during the “old school days of hip hop in Finland.”

The post Meet the Omnivore: Videographer Makes Digital Walls, Virtual Homes Pop With NVIDIA Omniverse appeared first on NVIDIA Blog.

Categories
Misc

options to incorporate implicit feedback in tf recommender (retrieval) system + how to apply k-means clustering to trained user embeddings

hey,

I have a large database of hundreds of thousands of users interacting with thousands of products for a given amount of time, with more time indicating more interest. My company wants to understand whether there are particular subgroups of similar consumers. In order to discover if that’s the case, I’ve built a 2-stage ML approach:

– using a tfrs model based on the basic retrieval tutorial (https://www.tensorflow.org/recommenders/examples/basic_retrieval), I’ve trained embeddings to represent my users and products.

– using k-means clustering on the user embeddings, I classify a particular user as a member of a particular cluster.

With this approach, I run into 2 challenges:

– the basic retrieval does not take into account the implicit feedback of the amount of time. This seems like a recurring theme in this space – to weight the user-item interactions by some measure of implicit feedback. I can’t seem to find any TF implementations though – any tips? (One possible approach is sketched after this list.)

– my trained user embedding layer does not seem well suited to k-means clustering, since its measure of inter- vs. intra-cluster distance does not meaningfully decrease over training iterations, and (more importantly) decreases linearly(!) with a higher value for k, making it impossible to use the elbow method to determine an objectively good trade-off between k and explained variance.
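
One possible approach for the first challenge, as a sketch (this assumes the tfrs Retrieval task’s sample_weight argument, available in recent releases; feature names like seconds_spent are placeholders):

import tensorflow as tf
import tensorflow_recommenders as tfrs

class WeightedRetrievalModel(tfrs.Model):
    def __init__(self, user_model, product_model, task):
        super().__init__()
        self.user_model = user_model
        self.product_model = product_model
        self.task = task  # e.g. tfrs.tasks.Retrieval(...)

    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        product_embeddings = self.product_model(features["product_id"])
        # log1p tames heavy-tailed dwell times so a single long session
        # doesn't dominate the batch loss.
        weights = tf.math.log1p(features["seconds_spent"])
        return self.task(user_embeddings, product_embeddings,
                         sample_weight=weights)

For the second challenge, L2-normalizing the user embeddings before k-means and scoring k with silhouette rather than the elbow method are common workarounds when inertia falls near-linearly in k.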

What would you advise to tackle both of these issues? Thanks for thinking along!
I know some of these questions are more ‘applied machine learning’ than ‘tensorflow’ per se, but I didn’t know where else to take this question, so apologies if this is in the wrong category.

submitted by /u/the_Wallie

Categories
Misc

Please help…..ValueError: A target array with shape (1288, 1) was passed for an output of shape (None, 256, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output.

Hi everyone,

I am pretty new to ML and tensorflow and am getting stuck. I am trying to do text classification.

My dataset is in the form where each row has 2 columns: text and polarity

text = string/tweet

polarity = can be 0 or 1

I am generating BERT embeddings following this code https://github.com/strongio/keras-bert/blob/master/keras-bert.ipynb

I want to add a Bi-LSTM between Bert Layer and the Dense layer. I have done it like this:

bert_output = BertLayer(n_fine_tune_layers=3, pooling="mean")(bert_inputs)
bert_output = tf.keras.layers.Reshape((max_seq_length, embedding_size))(bert_output)
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))(bert_output)
output = tf.keras.layers.Dense(1, activation="softmax")(bilstm)
model = tf.keras.models.Model(inputs=bert_inputs, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

It gives an error:

ValueError: A target array with shape (1288, 1) was passed for an output of shape (None, 256, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output. 

This is the link of the notebook in colab:

https://colab.research.google.com/drive/13g1ccE_cbSwEyUlKBxFRnvxtJ4gt38-g?usp=sharing

What can I do to resolve this? Does it have something to do with what activation or loss is being used? How can the shape be matched?
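
One way to make the target and output shapes agree, as a sketch (assuming a single 0/1 polarity label per tweet; it continues the snippet above):

# Return only the final LSTM state so the output is (batch, 1) instead of
# (batch, 256, 1), and use sigmoid: softmax over a single unit always
# outputs 1.0, so the model could never learn.
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2,
                         return_sequences=False))(bert_output)
output = tf.keras.layers.Dense(1, activation="sigmoid")(bilstm)
model = tf.keras.models.Model(inputs=bert_inputs, outputs=output)
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])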

Any help will be appreciated.

submitted by /u/No-Life-8250

Categories
Misc

Is this kind of transfer learning possible in TensorFlow?

Hello, I’m doing a CNN project for classification of corn disease images (4 classes). It uses VGG16 as its base model. I have created and saved the model. Now, is it possible to use that model as a base for another transfer learning task to classify cotton leaf disease images (4 classes), retaining the knowledge gained from the corn disease images? If so, how should I modify the corn disease model? Should I make the output layer 8 neurons (4 for cotton, 4 for corn disease)?
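
For reference, a minimal Keras sketch of one option (the path, layer index, and head name are placeholders; it assumes the saved corn model’s penultimate layer is the shared feature output):

import tensorflow as tf

corn_model = tf.keras.models.load_model("corn_disease_model.h5")

# Keep the corn model's feature extractor, drop its 4-class head.
base = tf.keras.Model(corn_model.input, corn_model.layers[-2].output)
base.trainable = False  # retain corn-learned features; unfreeze later to fine-tune

cotton_output = tf.keras.layers.Dense(4, activation="softmax",
                                      name="cotton_head")(base.output)
cotton_model = tf.keras.Model(base.input, cotton_output)
cotton_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# An 8-neuron head (4 corn + 4 cotton) is only needed if one model must
# classify both crops at once; for separate tasks, a separate 4-way head
# per crop is simpler.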

submitted by /u/kudoshinichi-8211