Categories
Offsites

MediaPipe Holistic — Simultaneous Face, Hand and Pose Prediction, on Device

Real-time, simultaneous perception of human pose, face landmarks and hand tracking on mobile devices can enable a variety of impactful applications, such as fitness and sport analysis, gesture control and sign language recognition, augmented reality effects and more. MediaPipe, an open-source framework designed specifically for complex perception pipelines leveraging accelerated inference (e.g., GPU or CPU), already offers fast and accurate, yet separate, solutions for these tasks. Combining them all in real-time into a semantically consistent end-to-end solution is a uniquely difficult problem requiring simultaneous inference of multiple, dependent neural networks.

Today, we are excited to announce MediaPipe Holistic, a solution to this challenge that provides a novel state-of-the-art human pose topology that unlocks novel use cases. MediaPipe Holistic consists of a new pipeline with optimized pose, face and hand components that each run in real-time, with minimum memory transfer between their inference backends, and added support for interchangeability of the three components, depending on the quality/speed tradeoffs. When including all three components, MediaPipe Holistic provides a unified topology for a groundbreaking 540+ keypoints (33 pose, 21 per-hand and 468 facial landmarks) and achieves near real-time performance on mobile devices. MediaPipe Holistic is being released as part of MediaPipe and is available on-device for mobile (Android, iOS) and desktop. We are also introducing MediaPipe’s new ready-to-use APIs for research (Python) and web (JavaScript) to ease access to the technology.

Top: MediaPipe Holistic results on sport and dance use-cases. Bottom: “Silence” and “Hello” gestures. Note, that our solution consistently identifies a hand as either right (blue color) or left (orange color).

Pipeline and Quality
The MediaPipe Holistic pipeline integrates separate models for pose, face and hand components, each of which are optimized for their particular domain. However, because of their different specializations, the input to one component is not well-suited for the others. The pose estimation model, for example, takes a lower, fixed resolution video frame (256×256) as input. But if one were to crop the hand and face regions from that image to pass to their respective models, the image resolution would be too low for accurate articulation. Therefore, we designed MediaPipe Holistic as a multi-stage pipeline, which treats the different regions using a region appropriate image resolution.

First, MediaPipe Holistic estimates the human pose with BlazePose’s pose detector and subsequent keypoint model. Then, using the inferred pose key points, it derives three regions of interest (ROI) crops for each hand (2x) and the face, and employs a re-crop model to improve the ROI (details below). The pipeline then crops the full-resolution input frame to these ROIs and applies task-specific face and hand models to estimate their corresponding keypoints. Finally, all key points are merged with those of the pose model to yield the full 540+ keypoints.

MediaPipe Holistic pipeline overview.

To streamline the identification of ROIs, a tracking approach similar to the one used for the standalone face and hand pipelines is utilized. This approach assumes that the object doesn’t move significantly between frames, using an estimation from the previous frame as a guide to the object region in the current one. However, during fast movements, the tracker can lose the target, which requires the detector to re-localize it in the image. MediaPipe Holistic uses pose prediction (on every frame) as an additional ROI prior to reduce the response time of the pipeline when reacting to fast movements. This also enables the model to retain semantic consistency across the body and its parts by preventing a mixup between left and right hands or body parts of one person in the frame with another.

In addition, the resolution of the input frame to the pose model is low enough that the resulting ROIs for face and hands are still too inaccurate to guide the re-cropping of those regions, which require a precise input crop to remain lightweight. To close this accuracy gap we use lightweight face and hand re-crop models that play the role of spatial transformers and cost only ~10% of the corresponding model’s inference time.

 MEH   FLE 
 Tracking pipeline (baseline)   9.8%   3.1% 
 Pipeline without re-crops   11.8%   3.5% 
 Pipeline with re-crops   9.7%   3.1% 
Hand prediction quality.The mean error per hand (MEH) is normalized by the hand size. The face landmarks error (FLE) is normalized by the inter-pupillary distance.

Performance
MediaPipe Holistic requires coordination between up to 8 models per frame — 1 pose detector, 1 pose landmark model, 3 re-crop models and 3 keypoint models for hands and face. While building this solution, we optimized not only machine learning models, but also pre- and post-processing algorithms (e.g., affine transformations), which take significant time on most devices due to pipeline complexity. In this case, moving all the pre-processing computations to GPU resulted in ~1.5 times overall pipeline speedup depending on the device. As a result, MediaPipe Holistic runs in near real-time performance even on mid-tier devices and in the browser.

 Phone   FPS 
 Google Pixel 2 XL   18 
 Samsung S9+   20 
 15-inch MacBook Pro 2017   15 
Performance on various mid-tier devices, measured in frames per second (FPS) using TFLite GPU.

The multi-stage nature of the pipeline provides two more performance benefits. As models are mostly independent, they can be replaced with lighter or heavier versions (or turned off completely) depending on the performance and accuracy requirements. Also, once pose is inferred, one knows precisely whether hands and face are within the frame bounds, allowing the pipeline to skip inference on those body parts.

Applications
MediaPipe Holistic, with its 540+ key points, aims to enable a holistic, simultaneous perception of body language, gesture and facial expressions. Its blended approach enables remote gesture interfaces, as well as full-body AR, sports analytics, and sign language recognition. To demonstrate the quality and performance of the MediaPipe Holistic, we built a simple remote control interface that runs locally in the browser and enables a compelling user interaction, no mouse or keyboard required. The user can manipulate objects on the screen, type on a virtual keyboard while sitting on the sofa, and point to or touch specific face regions (e.g., mute or turn off the camera). Underneath it relies on accurate hand detection with subsequent gesture recognition mapped to a “trackpad” space anchored to the user’s shoulder, enabling remote control from up to 4 meters.

This technique for gesture control can unlock various novel use-cases when other human-computer interaction modalities are not convenient. Try it out in our web demo and prototype your own ideas with it.

In-browser touchless control demos. Left: Palm picker, touch interface, keyboard. Right: Distant touchless keyboard. Try it out!

MediaPipe for Research and Web
To accelerate ML research as well as its adoption in the web developer community, MediaPipe now offers ready-to-use, yet customizable ML solutions in Python and in JavaScript. We are starting with those in our previous publications: Face Mesh, Hands and Pose, including MediaPipe Holistic, with many more to come. Try them directly in the web browser: for Python using the notebooks in MediaPipe on Google Colab, and for JavaScript with your own webcam input in MediaPipe on CodePen!

Conclusion
We hope the release of MediaPipe Holistic will inspire the research and development community members to build new unique applications. We anticipate that these pipelines will open up avenues for future research into challenging domains, such as sign-language recognition, touchless control interfaces, or other complex use cases. We are looking forward to seeing what you can build with it!

Complex and dynamic hand gestures. Videos by Dr. Bill Vicars, used with permission.

Acknowledgments
Special thanks to all our team members who worked on the tech with us: Fan Zhang, Gregory Karpiak, Kanstantsin Sokal, Juhyun Lee, Hadon Nash, Chuo-Ling Chang, Jiuqiang Tang, Nikolay Chirkov, Camillo Lugaresi, George Sung, Michael Hays, Tyler Mullen, Chris McClanahan, Ekaterina Ignasheva, Marat Dukhan, Artsiom Ablavatski, Yury Kartynnik, Karthik Raveendran, Andrei Vakunov, Andrei Tkachenka, Suril Shah, Buck Bourdon, Ming Guang Yong, Esha Uboweja, Siarhei Kazakou, Andrei Kulik, Matsvei Zhdanovich, and Matthias Grundmann.

Categories
Offsites

Portrait Light: Enhancing Portrait Lighting with Machine Learning

Professional portrait photographers are able to create compelling photographs by using specialized equipment, such as off-camera flashes and reflectors, and expert knowledge to capture just the right illumination of their subjects. In order to allow users to better emulate professional-looking portraits, we recently released Portrait Light, a new post-capture feature for the Pixel Camera and Google Photos apps that adds a simulated directional light source to portraits, with the directionality and intensity set to complement the lighting from the original photograph.

Example image with and without Portrait Light applied. Note how Portrait Light contours the face, adding dimensionality, volume, and visual interest.

In the Pixel Camera on Pixel 4, Pixel 4a, Pixel 4a (5G), and Pixel 5, Portrait Light is automatically applied post-capture to images in the default mode and to Night Sight photos that include people — just one person or even a small group. In Portrait Mode photographs, Portrait Light provides more dramatic lighting to accompany the shallow depth-of-field effect already applied, resulting in a studio-quality look. But because lighting can be a personal choice, Pixel users who shoot in Portrait Mode can manually re-position and adjust the brightness of the applied lighting within Google Photos to match their preference. For those running Google Photos on Pixel 2 or newer, this relighting capability is also available for many pre-existing portrait photographs.

Pixel users can adjust a portrait’s lighting as they like in Google Photos, after capture.

Today we present the technology behind Portrait Light. Inspired by the off-camera lights used by portrait photographers, Portrait Light models a repositionable light source that can be added into the scene, with the initial lighting direction and intensity automatically selected to complement the existing lighting in the photo. We accomplish this by leveraging novel machine learning models, each trained using a diverse dataset of photographs captured in the Light Stage computational illumination system. These models enabled two new algorithmic capabilities:

  1. Automatic directional light placement: For a given portrait, the algorithm places a synthetic directional light in the scene consistent with how a photographer would have placed an off-camera light source in the real world.
  2. Synthetic post-capture relighting: For a given lighting direction and portrait, synthetic light is added in a way that looks realistic and natural.

These innovations enable Portrait Light to help create attractive lighting at any moment for every portrait — all on your mobile device.

Automatic Light Placement
Photographers usually rely on perceptual cues when deciding how to augment environmental illumination with off-camera light sources. They assess the intensity and directionality of the light falling on the face, and also adjust their subject’s head pose to complement it. To inform Portrait Light’s automatic light placement, we developed computational equivalents to these two perceptual signals.

First, we trained a novel machine learning model to estimate a high dynamic range, omnidirectional illumination profile for a scene based on an input portrait. This new lighting estimation model infers the direction, relative intensity, and color of all light sources in the scene coming from all directions, considering the face as a light probe. We also estimate the head pose of the portrait’s subject using MediaPipe Face Mesh.

Estimating the high dynamic range, omnidirectional illumination profile from an input portrait. The three spheres at the right of each image, diffuse (top), matte silver (middle), and mirror (bottom), are rendered using the estimated illumination, each reflecting the color, intensity, and directionality of the environmental lighting.

Using these clues, we determine the direction from which the synthetic lighting should originate. In studio portrait photography, the main off-camera light source, or key light, is placed about 30° above the eyeline and between 30° and 60° off the camera axis, when looking overhead at the scene. We follow this guideline for a classic portrait look, enhancing any pre-existing lighting directionality in the scene while targeting a balanced, subtle key-to-fill lighting ratio of about 2:1.

Data-Driven Portrait Relighting
Given a desired lighting direction and portrait, we next trained a new machine learning model to add the illumination from a directional light source to the original photograph. Training the model required millions of pairs of portraits both with and without extra light. Photographing such a dataset in normal settings would have been impossible because it requires near-perfect registration of portraits captured across different lighting conditions.

Instead, we generated training data by photographing seventy different people using the Light Stage computational illumination system. This spherical lighting rig includes 64 cameras with different viewpoints and 331 individually-programmable LED light sources. We photographed each individual illuminated one-light-at-a-time (OLAT) by each light, which generates their reflectance field — or their appearance as illuminated by the discrete sections of the spherical environment. The reflectance field encodes the unique color and light-reflecting properties of the subject’s skin, hair, and clothing — how shiny or dull each material appears. Due to the superposition principle for light, these OLAT images can then be linearly added together to render realistic images of the subject as they would appear in any image-based lighting environment, with complex light transport phenomena like subsurface scattering correctly represented.

Using the Light Stage, we photographed many individuals with different face shapes, genders, skin tones, hairstyles, and clothing/accessories. For each person, we generated synthetic portraits in many different lighting environments, both with and without the added directional light, rendering millions of pairs of images. This dataset encouraged model performance across diverse lighting environments and individuals.

Photographing an individual as illuminated one-light-at-a-time in the Google Light Stage, a 360° computational illumination rig.
Left: Example images from an individual’s photographed reflectance field, their appearance in the Light Stage as illuminated one-light-at-a-time. Right: The images can be added together to form the appearance of the subject in any novel lighting environment.

Learning Detail-Preserving Relighting Using the Quotient Image
Rather than trying to directly predict the output relit image, we trained the relighting model to output a low-resolution quotient image, i.e., a per-pixel multiplier that when upsampled can be applied to the original input image to produce the desired output image with the contribution of the extra light source added. This technique is computationally efficient and encourages only low-frequency lighting changes, without impacting high-frequency image details, which are directly transferred from the input to maintain image quality.

Supervising Relighting with Geometry Estimation
When photographers add an extra light source into a scene, its orientation relative to the subject’s facial geometry determines how much brighter each part of the face appears. To model the optical behavior of light sources reflecting off relatively matte surfaces, we first trained a machine learning model to estimate surface normals given the input photograph, and then applied Lambert’s law to compute a “light visibility map” for the desired lighting direction. We provided this light visibility map as input to the quotient image predictor, ensuring that the model is trained using physics-based insights.

The pipeline of our relighting network. Given an input portrait, we estimate per-pixel surface normals, which we then use to compute a light visibility map. The model is trained to produce a low-resolution quotient image that, when upsampled and applied as a multiplier to the original image, produces the original portrait with an extra light source added synthetically into the scene.

We optimized the full pipeline to run at interactive frame-rates on mobile devices, with total model size under 10 MB. Here are a few examples of Portrait Light in action.

Portrait Light in action.

Getting the Most Out of Portrait Light
You can try Portrait Light in the Pixel Camera and change the light position and brightness to your liking in Google Photos. For those who use Dual Exposure Controls, Portrait Light can be applied post-capture for additional creative flexibility to find just the right balance between light and shadow. On existing images from your Google Photos library, try it where faces are slightly underexposed, where Portrait Light can illuminate and highlight your subject. It will especially benefit images with a single individual posed directly at the camera.

We see Portrait Light as the first step on the journey towards creative post-capture lighting controls for mobile cameras, powered by machine learning.

Acknowledgements
Portrait Light is the result of a collaboration between Google Research, Google Daydream, Pixel, and Google Photos teams. Key contributors include: Yun-Ta Tsai, Rohit Pandey, Sean Fanello, Chloe LeGendre, Michael Milne, Ryan Geiss, Sam Hasinoff, Dillon Sharlet, Christoph Rhemann, Peter Denny, Kaiwen Guo, Philip Davidson, Jonathan Taylor, Mingsong Dou, Pavel Pidlypenskyi, Peter Lincoln, Jay Busch, Matt Whalen, Jason Dourgarian, Geoff Harvey, Cynthia Herrera, Sergio Orts Escolano, Paul Debevec, Jonathan Barron, Sofien Bouaziz, Clement Ng, Rachit Gupta, Jesse Evans, Ryan Campbell, Sonya Mollinger, Emily To, Yichang Shih, Jana Ehmann, Wan-Chun Alex Ma, Christina Tong, Tim Smith, Tim Ruddick, Bill Strathearn, Jose Lima, Chia-Kai Liang, David Salesin, Shahram Izadi, Navin Sarma, Nisha Masharani, Zachary Senzer.


1  Work conducted while at Google. 

Categories
Misc

The Metaverse Begins: NVIDIA Omniverse Open Beta Now Available

Explore virtual collaboration and photorealistic simulation with NVIDIA Omniverse open beta, available now. NVIDIA Omniverse is an open, cloud-native platform that makes it easy to accelerate design workflows and collaborate in real time. Omniverse allows creators, engineers and researchers to collaborate in virtual worlds that are all connected — the beginnings of the term Neal Read article >

The post The Metaverse Begins: NVIDIA Omniverse Open Beta Now Available appeared first on The Official NVIDIA Blog.

Categories
Misc

NVIDIA Rolls Out New Drivers for Vulkan Ray Tracing, Upgrades Quake II RTX

Continuing its industry-leading leading support for Vulkan Ray Tracing, NVIDIA is today rolling out production Vulkan drivers bringing Vulkan Ray Tracing support to GeForce and Quadro for both Windows (version 460.89) and Linux (version 460.27.04).

Vulkan is the industry’s first open, cross-vendor standard ray tracing API, enabling portable ray tracing acceleration across diverse platforms.

In November 2020, The Khronos Group released the final versions of the Vulkan Ray Tracing extension specifications that seamlessly integrate ray tracing into the existing Vulkan framework so that developers can reach more platforms and customers with less development and porting costs. Today, Khronos released an upgraded version of the Vulkan SDK with full Vulkan Ray Tracing support, enabling Vulkan developers to easily integrate ray tracing functionality into their applications for the first time.

Continuing its industry-leading support for Vulkan Ray Tracing, NVIDIA is today rolling out production Vulkan drivers bringing Vulkan Ray Tracing support to GeForce and Quadro for both Windows (version 460.89) and Linux (version 460.27.04).  All RTX GPUs are supported, together with GeForce GTX 1660 with 6GB+ of memory and GeForce GTX 1060+ with 6GB+ of memory. Together with the support for Vulkan Ray tracing in the NVIDIA Nsight Systems 2020.5 and  Nsight Graphics 2020.6 developer tools, all developers are now enabled to integrate portable ray tracing into software ranging from real-time games to professional applications.

NVIDIA Contributions to the Development of Vulkan Ray Tracing

Bringing ray tracing functionality into the Vulkan standard has been a multi-year effort by many companies and NVIDIA has taken an active leadership position in each stage of its evolution. We were elected to chair the Vulkan Ray Tracing subgroup at Khronos, we contributed the design of our VKRay vendor extension to Khronos to help the Vulkan working group make rapid progress, and we shipped beta drivers for the provisional version of the Vulkan Ray Tracing extensions to enable developer feedback. 

NVIDIA has also implemented Vulkan Ray Tracing support in Microsoft’s open source DXC HLSL compiler. As outlined earlier this year, production-ready use of HLSL in Vulkan has been achieved through integrating a SPIR-V backend into DXC. Now, NVIDIA has extended that SPIR-V support to include Vukan Ray Tracing functionality, enabling developers to use HLSL shaders in Vulkan Ray Tracing applications instead of GLSL if they prefer. This also makes porting DirectX 12 ray tracing (DXR) functionality to Vulkan far easier to enable applications on a far wider diversity of platforms.

Vulkan is used extensively as a backend for layered implementations of APIs such as DirectX 12 to enable Windows games on platforms such as Linux. Vulkan Ray Tracing has been carefully designed to support the efficient layering of DirectX 12 ray tracing to enable tools such as Valve’s vkd3d-Proton to support the execution of applications that use DXR on Linux. NVIDIA is actively contributing to the development of translation tools such as Wine, whose upcoming 6.0 release supports Vulkan specification version 1.2.162 which includes Vulkan Ray Tracing.

Vulkan Ray Tracing extensions being used in Quake II RTX

In 2019, NVIDIA worked to bring ray tracing to Quake II. The Quake II RTX demo significantly enhances the visual quality of this well-loved classic running on Vulkan with ray-traced lighting, shadows, and reflections. NVIDIA released the full source code on GitHub serving as a great example for developers who want to dive into the details of how this remastering was achieved. Today, with the release of Quake II RTX 1.4.0, NVIDIA has added support for the final Vulkan Ray Tracing extensions, enabling dynamic selection between the pre-existing NVIDIA VKRay and the new Khronos extension backends. This means the game can now run on GPUs from any vendors that support the `VK_KHR_ray_tracing_pipeline` extension, making Quake II RTX the world’s first cross-vendor ray tracing Vulkan game!

To learn more about Vulkan Ray Tracing and how you can use it in your own applications check out Khronos-hosted resources including: how to use the Vulkan Ray Tracing extensions, a deeper dive into the technical details of the final Vulkan Ray Tracing specifications and best practices for blending Vulkan rasterization and ray tracing techniques. The Khronos Group is actively monitoring developer feedback on the Vulkan Ray Tracing extension through the Vulkan issues tracker on GitHub.

A glTF model with NVIDIA’s sample open source ray tracing viewer

NVIDIA has also created a deep dive Vulkan Ray Tracing Tutorial, and a new tutorial that steps through how to create a complete mini-path tracer using the Vulkan Ray Tracing API, and a Vulkan-based glTF ray tracing viewer with open source on GitHub. Keep up to date with NVIDIA’s ongoing support for Vulkan on the NVIDIA Vulkan Developer Page.

Vulkan Ray Tracing is a critical step to making ray tracing pervasive across the computer graphics ecosystem, and it is now easily accessible to developers everywhere.

We can’t wait to see how you use it!

Categories
Misc

NVIDIA Chief Scientist Highlights New AI Research in GTC Keynote

in a keynote released this week for a virtual GTC event, NVIDIA’s chief scientist Bill Dally described how his team is driving an annual doubling of AI performance.

in a keynote released this week for a virtual GTC China event, NVIDIA’s chief scientist Bill Dally described how his team is driving an annual doubling of AI performance.

Dally delves into NVIDIA’s domain-specific platforms for a variety of industries such as healthcare, self-driving cars and robotics.

The keynote is just one of more than 220 sessions at GTC China. All the sessions are free and most are conducted in Mandarin.

Read the full recap on the NVIDIA Blog and watch the keynote, available in multiple languages, in the GTC Keynote page.

Categories
Misc

Battlefleet Gothic: Armada 2 – Prologue (Remastered 8K 60FPS) Resolution increased using neural networks to 8K 60FPS


Battlefleet Gothic: Armada 2 - Prologue (Remastered 8K 60FPS) Resolution increased using neural networks to 8K 60FPS
submitted by /u/stepanmetior

[visit reddit]

[comments]
Categories
Misc

sparklyr 1.5: better dplyr interface, more sdf_* functions, and RDS-based serialization routines

We are thrilled to announce sparklyr 1.5 is now available on

CRAN
!

To install sparklyr 1.5 from CRAN, run

install.packages("sparklyr")

In this blog post, we will highlight the following aspects of
sparklyr 1.5:

Better dplyr interface

A large fraction of pull requests that went into the sparklyr
1.5 release were focused on making Spark dataframes work with
various dplyr verbs in the same way that R dataframes do. The full
list of dplyr-related bugs and feature requests that were resolved
in sparklyr 1.5 can be found in
here
.

In this section, we will showcase three new dplyr
functionalities that were shipped with sparklyr 1.5.

Stratified sampling

Stratified sampling on an R dataframe can be accomplished with a
combination of dplyr::group_by() followed by dplyr::sample_n() or
dplyr::sample_frac(), where the grouping variables specified in the
dplyr::group_by() step are the ones that define each stratum. For
instance, the following query will group mtcars by number of
cylinders and return a weighted random sample of size two from each
group, without replacement, and weighted by the mpg column:

mtcars %>% dplyr::group_by(cyl) %>% dplyr::sample_n(size = 2, weight = mpg, replace = FALSE) %>% print()
## # A tibble: 6 x 11 ## # Groups: cyl [3] ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 ## 2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 ## 3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 5 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 ## 6 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2

Starting from sparklyr 1.5, the same can also be done for Spark
dataframes with Spark 3.0 or above, e.g.,:

library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") mtcars_sdf <- copy_to(sc, mtcars, replace = TRUE, repartition = 3) mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::sample_n(size = 2, weight = mpg, replace = FALSE) %>% print()
# Source: spark<?> [?? x 11] # Groups: cyl mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 3 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 5 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 6 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2

or

mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::sample_frac(size = 0.2, weight = mpg, replace = FALSE) %>% print()
## # Source: spark<?> [?? x 11] ## # Groups: cyl ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 ## 4 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 ## 5 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 ## 6 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 ## 7 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 ## 8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3

Row sums

The rowSums() functionality offered by dplyr is handy when one
needs to sum up a large number of columns within an R dataframe
that are impractical to be enumerated individually. For example,
here we have a six-column dataframe of random real numbers, where
the partial_sum column in the result contains the sum of columns b
through d within each row:

ncols <- 6 nums <- seq(ncols) %>% lapply(function(x) runif(5)) names(nums) <- letters[1:ncols] tbl <- tibble::as_tibble(nums) tbl %>% dplyr::mutate(partial_sum = rowSums(.[2:5])) %>% print()
## # A tibble: 5 x 7 ## a b c d e f partial_sum ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.781 0.801 0.157 0.0293 0.169 0.0978 1.16 ## 2 0.696 0.412 0.221 0.941 0.697 0.675 2.27 ## 3 0.802 0.410 0.516 0.923 0.190 0.904 2.04 ## 4 0.200 0.590 0.755 0.494 0.273 0.807 2.11 ## 5 0.00149 0.711 0.286 0.297 0.107 0.425 1.40

Beginning with sparklyr 1.5, the same operation can be performed
with Spark dataframes:

library(sparklyr) sc <- spark_connect(master = "local") sdf <- copy_to(sc, tbl, overwrite = TRUE) sdf %>% dplyr::mutate(partial_sum = rowSums(.[2:5])) %>% print()
## # Source: spark<?> [?? x 7] ## a b c d e f partial_sum ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.781 0.801 0.157 0.0293 0.169 0.0978 1.16 ## 2 0.696 0.412 0.221 0.941 0.697 0.675 2.27 ## 3 0.802 0.410 0.516 0.923 0.190 0.904 2.04 ## 4 0.200 0.590 0.755 0.494 0.273 0.807 2.11 ## 5 0.00149 0.711 0.286 0.297 0.107 0.425 1.40

As a bonus from implementing the rowSums feature for Spark
dataframes, sparklyr 1.5 now also offers limited support for the
column-subsetting operator on Spark dataframes. For example, all
code snippets below will return some subset of columns from the
dataframe named sdf:

# select columns `b` through `e` sdf[2:5]
# select columns `b` and `c` sdf[c("b", "c")]
# drop the first and third columns and return the rest sdf[c(-1, -3)]

Weighted-mean summarizer

Similar to the two dplyr functions mentioned above, the
weighted.mean() summarizer is another useful function that has
become part of the dplyr interface for Spark dataframes in sparklyr
1.5. One can see it in action by, for example, comparing the output
from the following

library(sparklyr) sc <- spark_connect(master = "local") mtcars_sdf <- copy_to(sc, mtcars, replace = TRUE) mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::summarize(mpg_wm = weighted.mean(mpg, wt)) %>% print()

with output from the equivalent operation on mtcars in R:

mtcars %>% dplyr::group_by(cyl) %>% dplyr::summarize(mpg_wm = weighted.mean(mpg, wt)) %>% print()

both of them should evaluate to the following:

## cyl mpg_wm ## <dbl> <dbl> ## 1 4 25.9 ## 2 6 19.6 ## 3 8 14.8

New additions to the sdf_* family of functions

sparklyr provides a large number of convenience functions for
working with Spark dataframes, and all of them have names starting
with the sdf_ prefix.

In this section we will briefly mention four new additions and
show some example scenarios in which those functions are
useful.

sdf_expand_grid()

As the name suggests, sdf_expand_grid() is simply the Spark
equivalent of expand.grid(). Rather than running expand.grid() in R
and importing the resulting R dataframe to Spark, one can now run
sdf_expand_grid(), which accepts both R vectors and Spark
dataframes and supports hints for broadcast hash joins. The example
below shows sdf_expand_grid() creating a 100-by-100-by-10-by-10
grid in Spark over 1000 Spark partitions, with broadcast hash join
hints on variables with small cardinalities:

library(sparklyr) sc <- spark_connect(master = "local") grid_sdf <- sdf_expand_grid( sc, var1 = seq(100), var2 = seq(100), var3 = seq(10), var4 = seq(10), broadcast_vars = c(var3, var4), repartition = 1000 ) grid_sdf %>% sdf_nrow() %>% print()
## [1] 1e+06

sdf_partition_sizes()

As sparklyr user @sbottelli suggested here, one
thing that would be great to have in sparklyr is an efficient way
to query partition sizes of a Spark dataframe. In sparklyr 1.5,
sdf_partition_sizes() does exactly that:

library(sparklyr) sc <- spark_connect(master = "local") sdf_len(sc, 1000, repartition = 5) %>% sdf_partition_sizes() %>% print(row.names = FALSE)
## partition_index partition_size ## 0 200 ## 1 200 ## 2 200 ## 3 200 ## 4 200

sdf_unnest_longer() and sdf_unnest_wider()

sdf_unnest_longer() and sdf_unnest_wider() are the equivalents
of tidyr::unnest_longer() and tidyr::unnest_wider() for Spark
dataframes. sdf_unnest_longer() expands all elements in a struct
column into multiple rows, and sdf_unnest_wider() expands them into
multiple columns. As illustrated with an example dataframe
below,

library(sparklyr) sc <- spark_connect(master = "local") sdf <- copy_to( sc, tibble::tibble( id = seq(3), attribute = list( list(name = "Alice", grade = "A"), list(name = "Bob", grade = "B"), list(name = "Carol", grade = "C") ) ) )
sdf %>% sdf_unnest_longer(col = record, indices_to = "key", values_to = "value") %>% print()

evaluates to

## # Source: spark<?> [?? x 3] ## id value key ## <int> <chr> <chr> ## 1 1 A grade ## 2 1 Alice name ## 3 2 B grade ## 4 2 Bob name ## 5 3 C grade ## 6 3 Carol name

whereas

sdf %>% sdf_unnest_wider(col = record) %>% print()

evaluates to

## # Source: spark<?> [?? x 3] ## id grade name ## <int> <chr> <chr> ## 1 1 A Alice ## 2 2 B Bob ## 3 3 C Carol

RDS-based serialization routines

Some readers must be wondering why a brand new serialization
format would need to be implemented in sparklyr at all. Long story
short, the reason is that RDS serialization is a strictly better
replacement for its CSV predecessor. It possesses all desirable
attributes the CSV format has, while avoiding a number of
disadvantages that are common among text-based data formats.

In this section, we will briefly outline why sparklyr should
support at least one serialization format other than arrow,
deep-dive into issues with CSV-based serialization, and then show
how the new RDS-based serialization is free from those issues.

Why arrow is not for everyone?

To transfer data between Spark and R correctly and efficiently,
sparklyr must rely on some data serialization format that is
well-supported by both Spark and R. Unfortunately, not many
serialization formats satisfy this requirement, and among the ones
that do are text-based formats such as CSV and JSON, and binary
formats such as Apache Arrow, Protobuf, and as of recent, a small
subset of RDS version 2. Further complicating the matter is the
additional consideration that sparklyr should support at least one
serialization format whose implementation can be fully
self-contained within the sparklyr code base, i.e., such
serialization should not depend on any external R package or system
library, so that it can accommodate users who want to use sparklyr
but who do not necessarily have the required C++ compiler tool
chain and other system dependencies for setting up R packages such
as arrow
or protolite.
Prior to sparklyr 1.5, CSV-based serialization was the default
alternative to fallback to when users do not have the arrow package
installed or when the type of data being transported from R to
Spark is unsupported by the version of arrow available.

Why is the CSV format not ideal?

There are at least three reasons to believe CSV format is not
the best choice when it comes to exporting data from R to
Spark.

One reason is efficiency. For example, a double-precision
floating point number such as .Machine$double.eps needs to be
expressed as “2.22044604925031e-16” in CSV format in order to not
incur any loss of precision, thus taking up 20 bytes rather than 8
bytes.

But more important than efficiency are correctness concerns. In
a R dataframe, one can store both NA_real_ and NaN in a column of
floating point numbers. NA_real_ should ideally translate to null
within a Spark dataframe, whereas NaN should continue to be NaN
when transported from R to Spark. Unfortunately, NA_real_ in R
becomes indistinguishable from NaN once serialized in CSV format,
as evident from a quick demo shown below:

original_df <- data.frame(x = c(NA_real_, NaN)) original_df %>% dplyr::mutate(is_nan = is.nan(x)) %>% print()
## x is_nan ## 1 NA FALSE ## 2 NaN TRUE
csv_file <- "/tmp/data.csv" write.csv(original_df, file = csv_file, row.names = FALSE) deserialized_df <- read.csv(csv_file) deserialized_df %>% dplyr::mutate(is_nan = is.nan(x)) %>% print()
## x is_nan ## 1 NA FALSE ## 2 NA FALSE

Another correctness issue very much similar to the one above was
the fact that “NA” and NA within a string column of an R dataframe
become indistinguishable once serialized in CSV format, as
correctly pointed out in this Github
issue
by @caewok and
others.

RDS to the rescue!

RDS format is one of the most widely used binary formats for
serializing R objects. It is described in some detail in chapter 1,
section 8 of this
document
. Among advantages of the RDS format are efficiency and
accuracy: it has a reasonably efficient implementation in base R,
and supports all R data types.

Also worth noticing is the fact that when an R dataframe
containing only data types with sensible equivalents in Apache
Spark (e.g., RAWSXP, LGLSXP, CHARSXP, REALSXP, etc) is saved using
RDS version 2, (e.g., serialize(mtcars, connection = NULL, version
= 2L, xdr = TRUE)), only a tiny subset of the RDS format will be
involved in the serialization process, and implementing
deserialization routines in Scala capable of decoding such a
restricted subset of RDS constructs is in fact a reasonably simple
and straightforward task (as shown in
here
).

Last but not least, because RDS is a binary format, it allows
NA_character_, “NA”, NA_real_, and NaN to all be encoded in an
unambiguous manner, hence allowing sparklyr 1.5 to avoid all
correctness issues detailed above in non-arrow serialization use
cases.

Other benefits of RDS serialization

In addition to correctness guarantees, RDS format also offers
quite a few other advantages.

One advantage is of course performance: for example, importing a
non-trivially-sized dataset such as nycflights13::flights from R to
Spark using the RDS format in sparklyr 1.5 is roughly 40%-50%
faster compared to CSV-based serialization in sparklyr 1.4. The
current RDS-based implementation is still nowhere as fast as
arrow-based serialization though (arrow is about 3-4x faster), so
for performance-sensitive tasks involving heavy serialization,
arrow should still be the top choice.

Another advantage is that with RDS serialization, sparklyr can
import R dataframes containing raw columns directly into binary
columns in Spark. Thus, use cases such as the one below will work
in sparklyr 1.5

library(sparklyr) sc <- spark_connect(master = "local") tbl <- tibble::tibble( x = list(serialize("sparklyr", NULL), serialize(c(123456, 789), NULL)) ) sdf <- copy_to(sc, tbl)

While most sparklyr users probably won’t find this capability
of importing binary columns to Spark immediately useful in their
typical sparklyr::copy_to() or sparklyr::collect() usages, it does
play a crucial role in reducing serialization overheads in the
Spark-based foreach
parallel backend that was first introduced in sparklyr 1.2. This is
because Spark workers can directly fetch the serialized R closures
to be computed from a binary Spark column instead of extracting
those serialized bytes from intermediate representations such as
base64-encoded strings. Similarly, the R results from executing
worker closures will be directly available in RDS format which can
be efficiently deserialized in R, rather than being delivered in
other less efficient formats.

Acknowledgement

In chronological order, we would like to thank the following
contributors for making their pull requests part of sparklyr
1.5:

We would also like to express our gratitude towards numerous bug
reports and feature requests for sparklyr from a fantastic
open-source community.

Finally, the author of this blog post is indebted to @javierluraschi, @batpigandme, and @skeydan for their valuable
editorial inputs.

If you wish to learn more about sparklyr, check out sparklyr.ai, spark.rstudio.com, and some of the
previous release posts such as
sparklyr 1.4
and sparklyr
1.3
.

Thanks for reading!

Categories
Offsites

sparklyr 1.5: better dplyr interface, more sdf_* functions, and RDS-based serialization routines

We are thrilled to announce sparklyr 1.5 is now available on

CRAN
!

To install sparklyr 1.5 from CRAN, run

install.packages("sparklyr")

In this blog post, we will highlight the following aspects of
sparklyr 1.5:

Better dplyr interface

A large fraction of pull requests that went into the sparklyr
1.5 release were focused on making Spark dataframes work with
various dplyr verbs in the same way that R dataframes do. The full
list of dplyr-related bugs and feature requests that were resolved
in sparklyr 1.5 can be found in
here
.

In this section, we will showcase three new dplyr
functionalities that were shipped with sparklyr 1.5.

Stratified sampling

Stratified sampling on an R dataframe can be accomplished with a
combination of dplyr::group_by() followed by dplyr::sample_n() or
dplyr::sample_frac(), where the grouping variables specified in the
dplyr::group_by() step are the ones that define each stratum. For
instance, the following query will group mtcars by number of
cylinders and return a weighted random sample of size two from each
group, without replacement, and weighted by the mpg column:

mtcars %>% dplyr::group_by(cyl) %>% dplyr::sample_n(size = 2, weight = mpg, replace = FALSE) %>% print()
## # A tibble: 6 x 11 ## # Groups: cyl [3] ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 ## 2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 ## 3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 5 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 ## 6 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2

Starting from sparklyr 1.5, the same can also be done for Spark
dataframes with Spark 3.0 or above, e.g.,:

library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") mtcars_sdf <- copy_to(sc, mtcars, replace = TRUE, repartition = 3) mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::sample_n(size = 2, weight = mpg, replace = FALSE) %>% print()
# Source: spark<?> [?? x 11] # Groups: cyl mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 3 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 5 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 6 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2

or

mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::sample_frac(size = 0.2, weight = mpg, replace = FALSE) %>% print()
## # Source: spark<?> [?? x 11] ## # Groups: cyl ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 ## 4 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 ## 5 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 ## 6 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 ## 7 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 ## 8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3

Row sums

The rowSums() functionality offered by dplyr is handy when one
needs to sum up a large number of columns within an R dataframe
that are impractical to be enumerated individually. For example,
here we have a six-column dataframe of random real numbers, where
the partial_sum column in the result contains the sum of columns b
through d within each row:

ncols <- 6 nums <- seq(ncols) %>% lapply(function(x) runif(5)) names(nums) <- letters[1:ncols] tbl <- tibble::as_tibble(nums) tbl %>% dplyr::mutate(partial_sum = rowSums(.[2:5])) %>% print()
## # A tibble: 5 x 7 ## a b c d e f partial_sum ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.781 0.801 0.157 0.0293 0.169 0.0978 1.16 ## 2 0.696 0.412 0.221 0.941 0.697 0.675 2.27 ## 3 0.802 0.410 0.516 0.923 0.190 0.904 2.04 ## 4 0.200 0.590 0.755 0.494 0.273 0.807 2.11 ## 5 0.00149 0.711 0.286 0.297 0.107 0.425 1.40

Beginning with sparklyr 1.5, the same operation can be performed
with Spark dataframes:

library(sparklyr) sc <- spark_connect(master = "local") sdf <- copy_to(sc, tbl, overwrite = TRUE) sdf %>% dplyr::mutate(partial_sum = rowSums(.[2:5])) %>% print()
## # Source: spark<?> [?? x 7] ## a b c d e f partial_sum ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.781 0.801 0.157 0.0293 0.169 0.0978 1.16 ## 2 0.696 0.412 0.221 0.941 0.697 0.675 2.27 ## 3 0.802 0.410 0.516 0.923 0.190 0.904 2.04 ## 4 0.200 0.590 0.755 0.494 0.273 0.807 2.11 ## 5 0.00149 0.711 0.286 0.297 0.107 0.425 1.40

As a bonus from implementing the rowSums feature for Spark
dataframes, sparklyr 1.5 now also offers limited support for the
column-subsetting operator on Spark dataframes. For example, all
code snippets below will return some subset of columns from the
dataframe named sdf:

# select columns `b` through `e` sdf[2:5]
# select columns `b` and `c` sdf[c("b", "c")]
# drop the first and third columns and return the rest sdf[c(-1, -3)]

Weighted-mean summarizer

Similar to the two dplyr functions mentioned above, the
weighted.mean() summarizer is another useful function that has
become part of the dplyr interface for Spark dataframes in sparklyr
1.5. One can see it in action by, for example, comparing the output
from the following

library(sparklyr) sc <- spark_connect(master = "local") mtcars_sdf <- copy_to(sc, mtcars, replace = TRUE) mtcars_sdf %>% dplyr::group_by(cyl) %>% dplyr::summarize(mpg_wm = weighted.mean(mpg, wt)) %>% print()

with output from the equivalent operation on mtcars in R:

mtcars %>% dplyr::group_by(cyl) %>% dplyr::summarize(mpg_wm = weighted.mean(mpg, wt)) %>% print()

both of them should evaluate to the following:

## cyl mpg_wm ## <dbl> <dbl> ## 1 4 25.9 ## 2 6 19.6 ## 3 8 14.8

New additions to the sdf_* family of functions

sparklyr provides a large number of convenience functions for
working with Spark dataframes, and all of them have names starting
with the sdf_ prefix.

In this section we will briefly mention four new additions and
show some example scenarios in which those functions are
useful.

sdf_expand_grid()

As the name suggests, sdf_expand_grid() is simply the Spark
equivalent of expand.grid(). Rather than running expand.grid() in R
and importing the resulting R dataframe to Spark, one can now run
sdf_expand_grid(), which accepts both R vectors and Spark
dataframes and supports hints for broadcast hash joins. The example
below shows sdf_expand_grid() creating a 100-by-100-by-10-by-10
grid in Spark over 1000 Spark partitions, with broadcast hash join
hints on variables with small cardinalities:

library(sparklyr) sc <- spark_connect(master = "local") grid_sdf <- sdf_expand_grid( sc, var1 = seq(100), var2 = seq(100), var3 = seq(10), var4 = seq(10), broadcast_vars = c(var3, var4), repartition = 1000 ) grid_sdf %>% sdf_nrow() %>% print()
## [1] 1e+06

sdf_partition_sizes()

As sparklyr user @sbottelli suggested here, one
thing that would be great to have in sparklyr is an efficient way
to query partition sizes of a Spark dataframe. In sparklyr 1.5,
sdf_partition_sizes() does exactly that:

library(sparklyr) sc <- spark_connect(master = "local") sdf_len(sc, 1000, repartition = 5) %>% sdf_partition_sizes() %>% print(row.names = FALSE)
## partition_index partition_size ## 0 200 ## 1 200 ## 2 200 ## 3 200 ## 4 200

sdf_unnest_longer() and sdf_unnest_wider()

sdf_unnest_longer() and sdf_unnest_wider() are the equivalents
of tidyr::unnest_longer() and tidyr::unnest_wider() for Spark
dataframes. sdf_unnest_longer() expands all elements in a struct
column into multiple rows, and sdf_unnest_wider() expands them into
multiple columns. As illustrated with an example dataframe
below,

library(sparklyr) sc <- spark_connect(master = "local") sdf <- copy_to( sc, tibble::tibble( id = seq(3), attribute = list( list(name = "Alice", grade = "A"), list(name = "Bob", grade = "B"), list(name = "Carol", grade = "C") ) ) )
sdf %>% sdf_unnest_longer(col = record, indices_to = "key", values_to = "value") %>% print()

evaluates to

## # Source: spark<?> [?? x 3] ## id value key ## <int> <chr> <chr> ## 1 1 A grade ## 2 1 Alice name ## 3 2 B grade ## 4 2 Bob name ## 5 3 C grade ## 6 3 Carol name

whereas

sdf %>% sdf_unnest_wider(col = record) %>% print()

evaluates to

## # Source: spark<?> [?? x 3] ## id grade name ## <int> <chr> <chr> ## 1 1 A Alice ## 2 2 B Bob ## 3 3 C Carol

RDS-based serialization routines

Some readers must be wondering why a brand new serialization
format would need to be implemented in sparklyr at all. Long story
short, the reason is that RDS serialization is a strictly better
replacement for its CSV predecessor. It possesses all desirable
attributes the CSV format has, while avoiding a number of
disadvantages that are common among text-based data formats.

In this section, we will briefly outline why sparklyr should
support at least one serialization format other than arrow,
deep-dive into issues with CSV-based serialization, and then show
how the new RDS-based serialization is free from those issues.

Why arrow is not for everyone?

To transfer data between Spark and R correctly and efficiently,
sparklyr must rely on some data serialization format that is
well-supported by both Spark and R. Unfortunately, not many
serialization formats satisfy this requirement, and among the ones
that do are text-based formats such as CSV and JSON, and binary
formats such as Apache Arrow, Protobuf, and as of recent, a small
subset of RDS version 2. Further complicating the matter is the
additional consideration that sparklyr should support at least one
serialization format whose implementation can be fully
self-contained within the sparklyr code base, i.e., such
serialization should not depend on any external R package or system
library, so that it can accommodate users who want to use sparklyr
but who do not necessarily have the required C++ compiler tool
chain and other system dependencies for setting up R packages such
as arrow
or protolite.
Prior to sparklyr 1.5, CSV-based serialization was the default
alternative to fallback to when users do not have the arrow package
installed or when the type of data being transported from R to
Spark is unsupported by the version of arrow available.

Why is the CSV format not ideal?

There are at least three reasons to believe CSV format is not
the best choice when it comes to exporting data from R to
Spark.

One reason is efficiency. For example, a double-precision
floating point number such as .Machine$double.eps needs to be
expressed as “2.22044604925031e-16” in CSV format in order to not
incur any loss of precision, thus taking up 20 bytes rather than 8
bytes.

But more important than efficiency are correctness concerns. In
a R dataframe, one can store both NA_real_ and NaN in a column of
floating point numbers. NA_real_ should ideally translate to null
within a Spark dataframe, whereas NaN should continue to be NaN
when transported from R to Spark. Unfortunately, NA_real_ in R
becomes indistinguishable from NaN once serialized in CSV format,
as evident from a quick demo shown below:

original_df <- data.frame(x = c(NA_real_, NaN)) original_df %>% dplyr::mutate(is_nan = is.nan(x)) %>% print()
## x is_nan ## 1 NA FALSE ## 2 NaN TRUE
csv_file <- "/tmp/data.csv" write.csv(original_df, file = csv_file, row.names = FALSE) deserialized_df <- read.csv(csv_file) deserialized_df %>% dplyr::mutate(is_nan = is.nan(x)) %>% print()
## x is_nan ## 1 NA FALSE ## 2 NA FALSE

Another correctness issue very much similar to the one above was
the fact that “NA” and NA within a string column of an R dataframe
become indistinguishable once serialized in CSV format, as
correctly pointed out in this Github
issue
by @caewok and
others.

RDS to the rescue!

RDS format is one of the most widely used binary formats for
serializing R objects. It is described in some detail in chapter 1,
section 8 of this
document
. Among advantages of the RDS format are efficiency and
accuracy: it has a reasonably efficient implementation in base R,
and supports all R data types.

Also worth noticing is the fact that when an R dataframe
containing only data types with sensible equivalents in Apache
Spark (e.g., RAWSXP, LGLSXP, CHARSXP, REALSXP, etc) is saved using
RDS version 2, (e.g., serialize(mtcars, connection = NULL, version
= 2L, xdr = TRUE)), only a tiny subset of the RDS format will be
involved in the serialization process, and implementing
deserialization routines in Scala capable of decoding such a
restricted subset of RDS constructs is in fact a reasonably simple
and straightforward task (as shown in
here
).

Last but not least, because RDS is a binary format, it allows
NA_character_, “NA”, NA_real_, and NaN to all be encoded in an
unambiguous manner, hence allowing sparklyr 1.5 to avoid all
correctness issues detailed above in non-arrow serialization use
cases.

Other benefits of RDS serialization

In addition to correctness guarantees, RDS format also offers
quite a few other advantages.

One advantage is of course performance: for example, importing a
non-trivially-sized dataset such as nycflights13::flights from R to
Spark using the RDS format in sparklyr 1.5 is roughly 40%-50%
faster compared to CSV-based serialization in sparklyr 1.4. The
current RDS-based implementation is still nowhere as fast as
arrow-based serialization though (arrow is about 3-4x faster), so
for performance-sensitive tasks involving heavy serialization,
arrow should still be the top choice.

Another advantage is that with RDS serialization, sparklyr can
import R dataframes containing raw columns directly into binary
columns in Spark. Thus, use cases such as the one below will work
in sparklyr 1.5

library(sparklyr) sc <- spark_connect(master = "local") tbl <- tibble::tibble( x = list(serialize("sparklyr", NULL), serialize(c(123456, 789), NULL)) ) sdf <- copy_to(sc, tbl)

While most sparklyr users probably won’t find this capability
of importing binary columns to Spark immediately useful in their
typical sparklyr::copy_to() or sparklyr::collect() usages, it does
play a crucial role in reducing serialization overheads in the
Spark-based foreach
parallel backend that was first introduced in sparklyr 1.2. This is
because Spark workers can directly fetch the serialized R closures
to be computed from a binary Spark column instead of extracting
those serialized bytes from intermediate representations such as
base64-encoded strings. Similarly, the R results from executing
worker closures will be directly available in RDS format which can
be efficiently deserialized in R, rather than being delivered in
other less efficient formats.

Acknowledgement

In chronological order, we would like to thank the following
contributors for making their pull requests part of sparklyr
1.5:

We would also like to express our gratitude towards numerous bug
reports and feature requests for sparklyr from a fantastic
open-source community.

Finally, the author of this blog post is indebted to @javierluraschi, @batpigandme, and @skeydan for their valuable
editorial inputs.

If you wish to learn more about sparklyr, check out sparklyr.ai, spark.rstudio.com, and some of the
previous release posts such as
sparklyr 1.4
and sparklyr
1.3
.

Thanks for reading!

Categories
Misc

Problem trying to run object detection training

Hello everyone, I am trying to train the
ssd_mobilenet_v2_coco_2018_03_29 network using learning transfer
and the code to train (train.py) is the one in the models /
research / object_detection / legacy folder of the tensorflow api,
the following is the command I use for it

python3 train.py –logtostderr –train_dir = $ TRAIN_DIR
–pipeline_config_path = $PIPELINE_CONFIG_PATH

however I get the following problem

https://pastebin.pl/view/e1cf5d7e

can someone give me an idea what is going on, thanks

Notes: tensorflow version 1.14.

submitted by /u/legendarypegasus

[visit reddit]

[comments]

Categories
Misc

Does anyone have some tf/Keras best practices lists or examples?

I’m mostly looking for things like how to properly define your
models (for example how to define your model classes or functions
in order to have a cleaner project, how to manage hyperparameters
as inputs for the definition of your models,…), how to do the
train (should I implement a function/class for the training part?).
Thanks.

submitted by /u/convnetto

[visit reddit]

[comments]