
Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

In recent years, large neural networks trained for language understanding and generation have achieved impressive results across a wide range of tasks. GPT-3 first showed that large language models (LLMs) can be used for few-shot learning and can achieve impressive results without large-scale task-specific data collection or model parameter updating. More recent LLMs, such as GLaM, LaMDA, Gopher, and Megatron-Turing NLG, achieved state-of-the-art few-shot results on many tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources. Yet much work remains in understanding the capabilities that emerge with few-shot learning as we push the limits of model scale.

Last year Google Research announced our vision for Pathways, a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators. In “PaLM: Scaling Language Modeling with Pathways”, we introduce the Pathways Language Model (PaLM), a 540-billion parameter, dense decoder-only Transformer model trained with the Pathways system, which enabled us to efficiently train a single model across multiple TPU v4 Pods. We evaluated PaLM on hundreds of language understanding and generation tasks, and found that it achieves state-of-the-art few-shot performance across most tasks, by significant margins in many cases.

As the scale of the model increases, the performance improves across tasks while also unlocking new capabilities.

Training a 540-Billion Parameter Language Model with Pathways
PaLM demonstrates the first large-scale use of the Pathways system to scale training to 6144 chips, the largest TPU-based system configuration used for training to date. The training is scaled using data parallelism at the Pod level across two Cloud TPU v4 Pods, while using standard data and model parallelism within each Pod. This is a significant increase in scale compared to most previous LLMs, which were either trained on a single TPU v3 Pod (e.g., GLaM, LaMDA), used pipeline parallelism to scale to 2240 A100 GPUs across GPU clusters (Megatron-Turing NLG) or used multiple TPU v3 Pods (Gopher) with a maximum scale of 4096 TPU v3 chips.

PaLM achieves a training efficiency of 57.8% hardware FLOPs utilization, the highest yet achieved for LLMs at this scale. This is due to a combination of the parallelism strategy and a reformulation of the Transformer block that allows for attention and feedforward layers to be computed in parallel, enabling speedups from TPU compiler optimizations.
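
Concretely, instead of computing the feedforward layer on the output of the attention layer, both branches are computed from the same layer input. A rough, self-contained sketch of the structural difference (layer_norm, attention, and mlp below are simple stand-ins, not the actual PaLM layers):

import numpy as np

# Stand-ins only, to make the structural difference concrete; the real
# blocks use learned weights.
def layer_norm(x): return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
def attention(x): return x          # stand-in for multi-head attention
def mlp(x): return np.tanh(x)       # stand-in for the feedforward layer

def serial_block(x):
    # Standard formulation: y = x + mlp(norm(x + attention(norm(x))))
    return x + mlp(layer_norm(x + attention(layer_norm(x))))

def parallel_block(x):
    # Parallel reformulation: both branches read the same normalized input,
    # so the two large matrix multiplications can be fused by the compiler.
    h = layer_norm(x)
    return x + attention(h) + mlp(h)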

PaLM was trained using a combination of English and multilingual datasets that include high-quality web documents, books, Wikipedia, conversations, and GitHub code. We also created a “lossless” vocabulary that preserves all whitespace (especially important for code), splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit.
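
For illustration, the per-digit handling of numbers can be sketched as a pre-tokenization step (a hypothetical sketch, not the actual vocabulary implementation):

import re

# Hypothetical sketch: insert spaces so each digit becomes its own token.
def split_digits(text):
    return re.sub(r"(\d)", r" \1 ", text).split()

print(split_digits("7500 problems"))
# ['7', '5', '0', '0', 'problems']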

Breakthrough Capabilities on Language, Reasoning, and Code Tasks
PaLM shows breakthrough capabilities on numerous very difficult tasks. We highlight a few examples for language understanding and generation, reasoning, and code-related tasks below.

Language Understanding and Generation
We evaluated PaLM on 29 widely used English natural language processing (NLP) tasks. PaLM 540B surpassed the few-shot performance of prior large models, such as GLaM, GPT-3, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA, on 28 of the 29 tasks, which span question-answering tasks (open-domain closed-book variant), cloze and sentence-completion tasks, Winograd-style tasks, in-context reading comprehension tasks, common-sense reasoning tasks, SuperGLUE tasks, and natural language inference tasks.

PaLM 540B performance improvement over prior state-of-the-art (SOTA) results on 29 English-based NLP tasks.

In addition to English NLP tasks, PaLM also shows strong performance on multilingual NLP benchmarks, including translation, even though only 22% of the training corpus is non-English.

We also probe emerging and future capabilities of PaLM on the Beyond the Imitation Game Benchmark (BIG-bench), a recently released suite of more than 150 new language modeling tasks, and find that PaLM achieves breakthrough performance. We compare the performance of PaLM to Gopher and Chinchilla, averaged across a common subset of 58 of these tasks. Interestingly, we note that PaLM’s performance as a function of scale follows a log-linear behavior similar to prior models, suggesting that performance improvements from scale have not yet plateaued. PaLM 540B 5-shot also does better than the average performance of people asked to solve the same tasks.

Scaling behavior of PaLM on a subset of 58 BIG-bench tasks. 

PaLM demonstrates impressive natural language understanding and generation capabilities on several BIG-bench tasks. For example, the model can distinguish cause and effect, understand conceptual combinations in appropriate contexts, and even guess the movie from an emoji.

Examples that showcase PaLM 540B 1-shot performance on BIG-bench tasks: labeling cause and effect, conceptual understanding, guessing movies from emoji, and finding synonyms and counterfactuals.

Reasoning
By combining model scale with chain-of-thought prompting, PaLM shows breakthrough capabilities on reasoning tasks that require multi-step arithmetic or common-sense reasoning. Prior LLMs, like Gopher, saw less benefit from model scale in improving performance.

Standard prompting versus chain-of-thought prompting for an example grade-school math problem. Chain-of-thought prompting decomposes the prompt for a multi-step reasoning problem into intermediate steps (highlighted in yellow), similar to how a person would approach it.
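
As a made-up illustration of the format (not an example from the paper's figure), a chain-of-thought exemplar might look like:

Q: A baker made 12 muffins and sold 5. Then she baked 8 more. How many muffins does she have now?
A: The baker started with 12 muffins. After selling 5, she had 12 - 5 = 7 muffins. Baking 8 more gives 7 + 8 = 15 muffins. The answer is 15.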

We observed strong performance from PaLM 540B combined with chain-of-thought prompting on three arithmetic datasets and two commonsense reasoning datasets. For example, with 8-shot prompting, PaLM solves 58% of the problems in GSM8K, a benchmark of thousands of challenging grade school level math questions, outperforming the prior top score of 55% achieved by fine-tuning the GPT-3 175B model with a training set of 7500 problems and combining it with an external calculator and verifier.

This new score is especially interesting, as it approaches the 60% average of problems solved by 9-12 year olds, who are the target audience for the question set. We suspect that separate encoding of digits in the PaLM vocabulary helps enable these performance improvements.

Remarkably, PaLM can even generate explicit explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding. For example, it can provide high quality explanations for novel jokes not found on the web.

PaLM explains an original joke with two-shot prompts.

Code Generation
LLMs have also been shown [1, 2, 3, 4] to generalize well to coding tasks, such as writing code given a natural language description (text-to-code), translating code from one language to another, and fixing compilation errors (code-to-code).

PaLM 540B shows strong performance across coding tasks and natural language tasks in a single model, even though it has only 5% code in the pre-training dataset. Its few-shot performance is especially remarkable because it is on par with the fine-tuned Codex 12B while using 50 times less Python code for training. This result reinforces earlier findings that larger models can be more sample efficient than smaller models because they better transfer learning from other programming languages and natural language data.

Examples of a fine-tuned PaLM 540B model on text-to-code tasks, such as GSM8K-Python and HumanEval, and code-to-code tasks, such as Transcoder.

We also see a further increase in performance by fine-tuning PaLM on a Python-only code dataset, which we refer to as PaLM-Coder. For an example code repair task called DeepFix, where the objective is to modify initially broken C programs until they compile successfully, PaLM-Coder 540B demonstrates impressive performance, achieving a compile rate of 82.1%, which outperforms the prior 71.7% state of the art. This opens up opportunities for fixing more complex errors that arise during software development.

An example from the DeepFix Code Repair task. The fine-tuned PaLM-Coder 540B fixes compilation errors (left, in red) to a version of code that compiles (right).

Ethical Considerations
Recent research has highlighted various potential risks associated with LLMs trained on web text. It is crucial to analyze and document such potential undesirable risks through transparent artifacts such as model cards and datasheets, which also include information on intended use and testing. To this end, our paper provides a datasheet, model card and Responsible AI benchmark results, and it reports thorough analyses of the dataset and model outputs for biases and risks. While the analysis helps outline some potential risks of the model, domain- and task-specific analysis is essential to truly calibrate, contextualize, and mitigate possible harms. Further understanding of risks and benefits of these models is a topic of ongoing research, together with developing scalable solutions that can put guardrails against malicious uses of language models.

Conclusion and Future Work
PaLM demonstrates the scaling capability of the Pathways system to thousands of accelerator chips across two TPU v4 Pods by training a 540-billion parameter model efficiently with a well-studied, well-established recipe of a dense decoder-only Transformer model. Pushing the limits of model scale enables breakthrough few-shot performance of PaLM across a variety of natural language processing, reasoning, and code tasks.

PaLM paves the way for even more capable models by combining the scaling capabilities with novel architectural choices and training schemes, and brings us closer to the Pathways vision:

“Enable a single AI system to generalize across thousands or millions of tasks, to understand different types of data, and to do so with remarkable efficiency.”

Acknowledgements
PaLM is the result of a large, collaborative effort by many teams within Google Research and across Alphabet. We’d like to thank the entire PaLM team for their contributions: Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Mishra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, and Jason Wei. PaLM builds on top of work by many, many teams at Google and we would especially like to recognize the T5X team, the Pathways infrastructure team, the JAX team, the Flaxformer team, the XLA team, the Plaque team, the Borg team, and the Datacenter networking infrastructure team. We’d like to thank our co-authors on this blog post, Alexander Spiridonov and Maysam Moussalem, as well as Josh Newlan and Tom Small for the images and animations in this blog post. Finally, we would like to thank our advisors for the project: Noah Fiedel, Slav Petrov, Jeff Dean, Douglas Eck, and Kathy Meier-Hellstern.


Meet the Omnivore: Videographer Makes Digital Walls, Virtual Homes Pop With NVIDIA Omniverse

Pekka Varis’s artistry has come a long way from his early days as a self-styled “punk activist” who spray painted during the “old school days of hip hop in Finland.”



options to incorporate implicit feedback in tf recommender (retrieval) system + how to apply k-means clustering to trained user embeddings

hey,

I have a large database of hundreds of thousands of users interacting with thousands of products for a given amount of time, with more time indicating more interest. My company wants to understand if there are particular subgroups of similar consumers. In order to discover if that’s the case, I’ve built a 2-stage ML approach:

– using a tfrs model based on the basic retrieval tutorial (https://www.tensorflow.org/recommenders/examples/basic_retrieval), I’ve trained embeddings to represent my users and products.

– using k-means clustering on the user embeddings, I classify a particular user as a member of a particular cluster (roughly as in the sketch below).
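
For reference, my clustering step looks roughly like this sketch (user_embeddings is random here just to keep the snippet self-contained; in practice it’s the matrix I extract from the trained user tower):

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the (num_users, embedding_dim) matrix from the user tower.
user_embeddings = np.random.rand(100_000, 32).astype("float32")

kmeans = KMeans(n_clusters=8, random_state=42, n_init=10)
cluster_ids = kmeans.fit_predict(user_embeddings)

# inertia_ is the within-cluster sum of squares I plot for the elbow method.
print(kmeans.inertia_)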

With this approach, I run into 2 challenges:

– the basic retrieval does not take into account the implicit feedback of the amount of time. This seems like a recurring theme in this space – to weigh the user-item interactions by some measure of implicit feedback. I can’t seem to find any TF implementations though – any tips? (see the sketch after this list for the kind of thing I have in mind)

– my trained user embedding layer does not seem suitable for k-means clustering, in the sense that its measure of inter- vs. intra-cluster distance does not meaningfully decrease over training iterations, and (more importantly) decreases linearly(!) with a higher value for k, making it impossible to use the elbow method to determine an objectively good trade-off between k and explained variance.
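
For the weighting, something like the following is what I have in mind, but it’s untested – I believe tfrs.tasks.Retrieval accepts a sample_weight argument, but I haven’t verified it, and "interaction_seconds" is just my placeholder feature name:

import tensorflow_recommenders as tfrs

class WeightedRetrievalModel(tfrs.Model):
    # Assumes self.user_model, self.product_model, and self.task were
    # created in __init__ as in the basic retrieval tutorial.
    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        product_embeddings = self.product_model(features["product_id"])
        # Weight each (user, product) pair by interaction time, so longer
        # interactions contribute more to the retrieval loss. (Unverified
        # assumption: the Retrieval task supports sample_weight.)
        return self.task(user_embeddings, product_embeddings,
                         sample_weight=features["interaction_seconds"])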

what would you advise to tackle both of these issues? Thanks for thinking along!
I know some of these questions are more ‘applied machine learning’ than ‘tensorflow’ per se, but I didn’t know where else to take this question, so apologies if this is in the wrong category.

submitted by /u/the_Wallie


Please help…..ValueError: A target array with shape (1288, 1) was passed for an output of shape (None, 256, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output.

Hi everyone,

I am pretty new to ML and tensorflow and am getting stuck. I am trying to do text classification.

My dataset is in the form where each row has 2 columns: text and polarity

text = string/tweet

polarity = can be 0 or 1

I am generating BERT embeddings following this code https://github.com/strongio/keras-bert/blob/master/keras-bert.ipynb

I want to add a Bi-LSTM between Bert Layer and the Dense layer. I have done it like this:

bert_output = BertLayer(n_fine_tune_layers=3, pooling="mean")(bert_inputs)
bert_output = tf.keras.layers.Reshape((max_seq_length, embedding_size))(bert_output)
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2,
                         return_sequences=True))(bert_output)
output = tf.keras.layers.Dense(1, activation="softmax")(bilstm)
model = tf.keras.models.Model(inputs=bert_inputs, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

It gives an error:

ValueError: A target array with shape (1288, 1) was passed for an output of shape (None, 256, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output. 

This is the link of the notebook in colab:

https://colab.research.google.com/drive/13g1ccE_cbSwEyUlKBxFRnvxtJ4gt38-g?usp=sharing

What can I do to resolve this? Does it have something to do with what activation or loss is being used? How can the shape be matched?
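
Would something like this be the right direction (untested)? Dropping return_sequences so the Bi-LSTM outputs a single vector per example, which makes the Dense output (None, 1) to match the (1288, 1) targets, and switching to sigmoid, since softmax over a single unit always outputs 1:

# Untested sketch of what I mean, modifying my snippet above:
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2,
                         return_sequences=False))(bert_output)
# One probability per example, shape (None, 1), matching the targets.
output = tf.keras.layers.Dense(1, activation="sigmoid")(bilstm)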

Any help will be appreciated.

submitted by /u/No-Life-8250


Is this kind of transfer learning possible in TensorFlow?

Hello, I’m doing a CNN project for classification of corn disease images (4 classes). It uses VGG16 as its base model. I have created and saved the model. Now, is it possible to use that model as a base for another transfer learning task to classify cotton leaf disease images (4 classes), while retaining the knowledge gained from the corn disease images? If so, how should I modify the corn disease model? Do I need to make the output layer 8 neurons (4 for cotton, 4 for corn disease)?

submitted by /u/kudoshinichi-8211


error while quantizing a tensorflow2 keras model

File "/opt/vitis_ai/conda/envs/vitis-ai-tensorflow2/lib/python3.7/site-packages/keras/engine/functional.py", line 1163, in _deserialize_keras_tensor
    layer = layer_map[layer_name]
KeyError: 'activation_4'

Does anyone know how to solve this problem and what it is about?
The model works fine without quantization.

submitted by /u/Beginner_Journey


Tflite model gives different results in android application

I have a trained image classification model that works fine in Python, but when I convert it to TFLite and deploy it on Android, it gives different results. Can anyone help me here?

submitted by /u/shubham1300


Is it possible to use transfer learning to create a CNN model to classify Alzheimer’s MRI images (4 classes) by using a pre-trained model for pneumonia X-rays (2 classes)?

Hello, I have been trying to create two TensorFlow models to experiment with transfer learning. I have trained a CNN model on lung X-ray images for pneumonia (2 classes) using the Kaggle chest X-ray dataset.

Here is my code

import tensorflow as tf
import numpy as np
from tensorflow import keras
import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image
import matplotlib.pyplot as plt

gen = ImageDataGenerator(rescale=1./255)

train_data = gen.flow_from_directory("/Users/saibalaji/Downloads/chest_xray/train",
                                     target_size=(500, 500), batch_size=32, class_mode='binary')
test_data = gen.flow_from_directory("/Users/saibalaji/Downloads/chest_xray/test",
                                    target_size=(500, 500), batch_size=32, class_mode='binary')

model = keras.Sequential()

# Convolutional layer and maxpool layer 1
model.add(keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(500, 500, 3)))
model.add(keras.layers.MaxPool2D(2, 2))

# Convolutional layer and maxpool layer 2
model.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(keras.layers.MaxPool2D(2, 2))

# Convolutional layer and maxpool layer 3
model.add(keras.layers.Conv2D(128, (3, 3), activation='relu'))
model.add(keras.layers.MaxPool2D(2, 2))

# Convolutional layer and maxpool layer 4
model.add(keras.layers.Conv2D(128, (3, 3), activation='relu'))
model.add(keras.layers.MaxPool2D(2, 2))

# This layer flattens the resulting feature maps to a 1D array
model.add(keras.layers.Flatten())

# Hidden layer with 512 neurons and ReLU activation
model.add(keras.layers.Dense(512, activation='relu'))

# Output layer with a single neuron; sigmoid keeps the output between
# 0 and 1 (normal vs. pneumonia)
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit_generator(train_data,
                    steps_per_epoch=163,
                    epochs=4,
                    validation_data=test_data)

I have saved the model in .h5 format.

Then I created a new notebook, loaded the Kaggle data for Alzheimer’s disease, and loaded my saved pneumonia model. I copied its layers to a new model except the last one, then froze all the layers in the new model as non-trainable. Then I added an output dense layer with 4 neurons for the 4 classes and trained only that last layer for 5 epochs. But the problem is that the validation accuracy remains constant at 35%. How can I improve it?

Here is my code for the Alzheimer’s model

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image
import matplotlib.pyplot as plt
import numpy as np

gen = ImageDataGenerator(rescale=1./255)

traindata = gen.flow_from_directory('/Users/saibalaji/Documents/TensorFlowProjects/ad/train',
                                    target_size=(500, 500), batch_size=32)
testdata = gen.flow_from_directory('/Users/saibalaji/Documents/TensorFlowProjects/ad/test',
                                   target_size=(500, 500), batch_size=32)

model = keras.models.load_model('pn.h5')

nmodel = keras.models.Sequential()

# Add all layers except the last one
for layer in model.layers[0:-1]:
    nmodel.add(layer)

# Freeze the copied layers so only the new output layer is trained
for layer in nmodel.layers:
    layer.trainable = False

nmodel.add(keras.layers.Dense(units=4, name='dense_last'))

nmodel.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
               loss='categorical_crossentropy', metrics=['accuracy'])

nmodel.fit(x=traindata, validation_data=testdata, epochs=5, steps_per_epoch=160)

submitted by /u/kudoshinichi-8211


Introducing CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

Automatic translation of speech from one language to speech in another language, called speech-to-speech translation (S2ST), is important for breaking down the communication barriers between people speaking different languages. Conventionally, automatic S2ST systems are built with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, so that the system overall is text-centric. Recently, work on S2ST that doesn’t rely on intermediate text representation is emerging, such as end-to-end direct S2ST (e.g., Translatotron) and cascade S2ST based on learned discrete representations of speech (e.g., Tjandra et al.). While early versions of such direct S2ST systems obtained lower translation quality compared to cascade S2ST models, they are gaining traction as they have the potential both to reduce translation latency and compounding errors, and to better preserve paralinguistic and non-linguistic information from the original speech, such as voice, emotion, tone, etc. However, such models usually have to be trained on datasets with paired S2ST data, but the public availability of such corpora is extremely limited.
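
Schematically, the cascade approach composes the three sub-systems, as in the following conceptual sketch (asr, mt, and tts are placeholders for trained components, not real APIs):

# Conceptual sketch of a cascade S2ST pipeline; asr, mt, and tts are
# placeholders for the three trained sub-systems.
def cascade_s2st(source_speech, source_lang, target_lang, asr, mt, tts):
    transcript = asr(source_speech, source_lang)             # speech -> text
    translation = mt(transcript, source_lang, target_lang)   # text -> text
    return tts(translation, target_lang)                     # text -> speech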

To foster research on such a new generation of S2ST, we introduce a Common Voice-based Speech-to-Speech translation corpus, or CVSS, which includes sentence-level speech-to-speech translation pairs from 21 languages into English. Unlike existing public corpora, CVSS can be directly used for training such direct S2ST models without any extra processing. In “CVSS Corpus and Massively Multilingual Speech-to-Speech Translation”, we describe the dataset design and development, and we demonstrate the effectiveness of the corpus by training baseline direct and cascade S2ST models, showing that the performance of a direct S2ST model approaches that of a cascade S2ST model.

Building CVSS
CVSS is directly derived from the CoVoST 2 speech-to-text (ST) translation corpus, which is further derived from the Common Voice speech corpus. Common Voice is a massively multilingual transcribed speech corpus designed for ASR in which the speech is collected by contributors reading text content from Wikipedia and other text corpora. CoVoST 2 further provides professional text translation for the original transcript from 21 languages into English and from English into 15 languages. CVSS builds on these efforts by providing sentence-level parallel speech-to-speech translation pairs from 21 languages into English (shown in the table below).

To facilitate research with different focuses, two versions of translation speech in English are provided in CVSS, both synthesized using state-of-the-art TTS systems, with each version providing unique value that doesn’t exist in other public S2ST corpora:

  • CVSS-C: All the translation speech is in a single canonical speaker’s voice. Despite being synthetic, the speech is highly natural, clean, and consistent in speaking style. These properties ease the modeling of the target speech and enable trained models to produce high quality translation speech suitable for general user-facing applications where speech quality is of higher importance than accurately reproducing the speakers’ voices.
  • CVSS-T: The translation speech captures the voice from the corresponding source speech. Each S2ST pair has a similar voice on the two sides, despite being in different languages. Because of this, the dataset is suitable for building models where accurate voice preservation is desired, such as for movie dubbing.

Together with the source speech, the two S2ST datasets contain 1,872 and 1,937 hours of speech, respectively.

Source Language   Code   Source speech (X)   CVSS-C target speech (En)   CVSS-T target speech (En)
French            fr     309.3               200.3                       222.3
German            de     226.5               137.0                       151.2
Catalan           ca     174.8               112.1                       120.9
Spanish           es     157.6               94.3                        100.2
Italian           it     73.9                46.5                        49.2
Persian           fa     58.8                29.9                        34.5
Russian           ru     38.7                26.9                        27.4
Chinese           zh     26.5                20.5                        22.1
Portuguese        pt     20.0                10.4                        11.8
Dutch             nl     11.2                7.3                         7.7
Estonian          et     9.0                 7.3                         7.1
Mongolian         mn     8.4                 5.1                         5.7
Turkish           tr     7.9                 5.4                         5.7
Arabic            ar     5.8                 2.7                         3.1
Latvian           lv     4.9                 2.6                         3.1
Swedish           sv     4.3                 2.3                         2.8
Welsh             cy     3.6                 1.9                         2.0
Tamil             ta     3.1                 1.7                         2.0
Indonesian        id     3.0                 1.6                         1.7
Japanese          ja     3.0                 1.7                         1.8
Slovenian         sl     2.9                 1.6                         1.9
Total                    1,153.2             719.1                       784.2
Amount of source and target speech of each X-En pair in CVSS (hours).

In addition to translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (on numbers, currencies, acronyms, etc., see data samples below, e.g., where “100%” is normalized as “one hundred percent” or “King George II” is normalized as “king george the second”), which can benefit both model training as well as standardizing the evaluation.

CVSS is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and it can be freely downloaded online.

Data Samples

Example 1:
Source audio (French)
Source transcript (French): Le genre musical de la chanson est entièrement le disco.
CVSS-C translation audio (English)
CVSS-T translation audio (English)
Translation text (English): The musical genre of the song is 100% Disco.
Normalized translation text (English): the musical genre of the song is one hundred percent disco

Example 2:
Source audio (Chinese)
Source transcript (Chinese): 弗雷德里克王子，英国王室成员，为乔治二世之孙，乔治三世之幼弟。
CVSS-C translation audio (English)
CVSS-T translation audio (English)
Translation text (English): Prince Frederick, member of British Royal Family, Grandson of King George II, brother of King George III.
Normalized translation text (English): prince frederick member of british royal family grandson of king george the second brother of king george the third

Baseline Models
On each version of CVSS, we trained a baseline cascade S2ST model as well as two baseline direct S2ST models and compared their performance. These baselines can be used for comparison in future research.

Cascade S2ST: To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU on all 21 language pairs (detailed in the paper) when trained on the corpus without using extra data. This ST model is connected to the same TTS models used for constructing CVSS to compose very strong cascade S2ST baselines (ST → TTS).

Direct S2ST: We built two baseline direct S2ST models using Translatotron and Translatotron 2. When trained from scratch with CVSS, the translation quality of Translatotron 2 (8.7 BLEU) approaches that of the strong cascade S2ST baseline (10.6 BLEU). Moreover, when both use pre-training, the gap decreases to only 0.7 BLEU on ASR-transcribed translation. These results verify the effectiveness of using CVSS to train direct S2ST models.
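
This evaluation can be sketched roughly as follows (a simplified illustration assuming a placeholder asr function; sacrebleu is used here only as an example scorer, not necessarily the exact tooling from the paper):

import sacrebleu

# Simplified sketch of BLEU-on-ASR evaluation for S2ST: transcribe the
# model's translated speech with an English ASR system, then score the
# transcripts against the reference translation texts. The asr argument
# is a placeholder for a trained ASR model.
def evaluate_s2st(translated_speech, reference_texts, asr):
    hypotheses = [asr(speech) for speech in translated_speech]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score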

Translation quality of baseline direct and cascade S2ST models built on CVSS-C, measured by BLEU on ASR transcription from speech translation. The pre-training was done on CoVoST 2 without other extra data sets.

Conclusion
We have released two versions of multilingual-to-English S2ST datasets, CVSS-C and CVSS-T, each with about 1.9K hours of sentence-level parallel S2ST pairs, covering 21 source languages. The translation speech in CVSS-C is in a single canonical speaker’s voice, while the translation speech in CVSS-T is in voices transferred from the source speech. Each of these datasets provides unique value not found in other public S2ST corpora.

We built baseline multilingual direct S2ST models and cascade S2ST models on both datasets, which can be used for comparison in future work. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU when trained on the corpus without extra data. Nevertheless, the performance of the direct S2ST models approaches that of the strong cascade baselines when trained from scratch, with only a 0.7 BLEU difference on ASR-transcribed translation when pre-training is utilized. We hope this work helps accelerate research on direct S2ST.

Acknowledgments
We acknowledge the volunteer contributors and the organizers of the Common Voice and LibriVox projects for their contribution and collection of recordings, the creators of Common Voice, CoVoST, CoVoST 2, Librispeech and LibriTTS corpora for their previous work. The direct contributors to the CVSS corpus and the paper include Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, Heiga Zen. We also thank Ankur Bapna, Yiling Huang, Jason Pelecanos, Colin Cherry, Alexis Conneau, Yonghui Wu, Hadar Shemtov and Françoise Beaufays for helpful discussions and support.


Merge Sort Explained: A Data Scientist’s Algorithm Guide

The article includes a step-by-step explanation of the merge sort algorithm and code snippets illustrating the implementation of the algorithm itself.

Data scientists deal with algorithms daily. However, the data science discipline as a whole has developed into a role that does not typically involve implementing sophisticated algorithms. Nonetheless, practitioners can still benefit from building an understanding and repertoire of algorithms.

In this article, the sorting algorithm merge sort is introduced, explained, evaluated, and implemented. The aim of this post is to provide you with robust background information on the merge sort algorithm, which acts as foundational knowledge for more complicated algorithms.

Although merge sort is not considered to be complex, understanding this algorithm will help you recognize what factors to consider when choosing the most efficient algorithm to perform data-related tasks. John von Neumann developed the merge sort algorithm in 1945, using the divide-and-conquer approach.

Divide and conquer

To understand the merge sort algorithm, you must be familiar with the divide-and-conquer paradigm, alongside the programming concept of recursion. Recursion, within the computer science domain, occurs when a method defined to solve a problem invokes itself within its own implementation body.

In other words, the function calls itself repeatedly.

Figure 1. Visual illustration of recursion. Image by author.
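
A minimal, self-contained example of recursion is a function that computes a factorial by invoking itself on a smaller input:

def factorial(n):
    # Base case: stops the chain of self-invocations.
    if n <= 1:
        return 1
    # Recursive case: the function calls itself on a smaller subproblem.
    return n * factorial(n - 1)

print(factorial(5))  # 120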

Divide-and-conquer algorithms (of which merge sort is one) employ recursion within their approach to solve specific problems. Divide-and-conquer algorithms decompose complex problems into smaller sub-parts, where a defined solution is applied recursively to each sub-part. Each sub-part is then solved separately, and the solutions are recombined to solve the original problem.

The divide-and-conquer approach to algorithm design combines three primary elements:

  • Decomposition of the larger problem into smaller subproblems. (Divide)
  • Recursive utilization of functions to solve each of the smaller subproblems. (Conquer)
  • The final solution is a composition of the solutions to the smaller subproblems of the larger problem. (Combine)

Other algorithms use the divide-and-conquer paradigm, such as Quicksort, Binary Search, and Strassen’s algorithm.

Merge sort

In the context of sorting a list’s elements in ascending order, the merge sort method divides the list into halves, then iterates through the new halves, continually dividing them further into their smaller parts.

Subsequently, comparisons of the smaller halves are conducted, and the results are combined to form the final sorted list.

Steps and implementation

Implementation of the merge sort algorithm is a three-step procedure. Divide, conquer, and combine.

The divide component of the divide-and-conquer approach is the first step. This initial step separates the overall list into two smaller halves. Then, the lists are broken down further until they can no longer be divided, leaving only one element in each halved list.

The recursive loop in merge sort’s second phase is concerned with sorting the list’s elements in a particular order. For this scenario, the initial array is sorted in ascending order.

In the following illustration, you can see the division, comparison, and combination steps involved in the merge sort algorithm.

Figure 2. Divide component illustration of the merge sort algorithm. Image by author.
Figure 3. Conquer and combine components. Image by author.

To implement this yourself:

  • Create a function called merge_sort that accepts a list of integers as its argument. All following instructions presented are within this function.
  • Start by dividing the list into halves. Record the initial length of the list.
  • Check whether the recorded length is less than or equal to 1. If the condition evaluates to true, return the list, as there is at most one element within the list and therefore no requirement to divide it.
  • Obtain the midpoint for a list with a number of elements greater than 1. When using the Python language, the // operator performs floor division: it divides and rounds the result down to the nearest whole number, discarding any remainder.
  • Using the midpoint as a reference point, split the list into two halves. This is the divide aspect of the divide-and-conquer algorithm paradigm.
  • Recursion is leveraged at this step to facilitate the division of lists into halved components. The variables ‘left_half’ and ‘right_half’ are assigned to the invocation of the ‘merge_sort’ function, accepting the two halves of the initial list as parameters.
  • The ‘merge_sort’ function returns the invocation of a function that merges two lists to return one combined, sorted list.
def merge_sort(list: [int]):
    list_length = len(list)

    # Base case: a list of zero or one elements is already sorted.
    if list_length <= 1:
        return list

    # Floor division gives the index of the midpoint.
    mid_point = list_length // 2

    # Recursively sort each half (the divide step).
    left_half = merge_sort(list[:mid_point])
    right_half = merge_sort(list[mid_point:])

    # Merge the two sorted halves (the conquer and combine steps).
    return merge(left_half, right_half)
  • Create a ‘merge’ function that accepts two lists of integers as its arguments. This function contains the conquer and combine aspects of the divide-and-conquer algorithm paradigm. All following steps are executed within the body of this function.
  • Assign an empty list to the variable ‘output’ that holds the sorted integers.
  • The pointers ‘i’ and ‘j’ are used to index the left and right lists, respectively.
  • Within the while loop, the current elements of the left and right lists are compared. After each comparison, the output list is populated with the smaller of the two compared elements, and the pointer for the list containing the appended element is incremented.
  • The remaining elements to be added to the sorted list are the elements obtained from the current pointer value to the end of the respective list.
def merge(left, right):
    output = []
    i = j = 0

    # Compare the front elements of each half and append the smaller one.
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            output.append(left[i])
            i += 1
        else:
            output.append(right[j])
            j += 1

    # Append any remaining elements from whichever half is not exhausted.
    output.extend(left[i:])
    output.extend(right[j:])

    return output
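
For example, sorting a small list end to end:

print(merge_sort([9, 3, 7, 1, 8, 2]))
# Output: [1, 2, 3, 7, 8, 9]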

Performance and complexity

Big O notation is a standard for describing the performance of algorithms in terms of their space requirements and execution time.

Merge sort’s time complexity is the same for its best, worst, and average scenarios. For a list of size n, the expected number of steps, minimum number of steps, and maximum number of steps for the merge sort algorithm to complete are all the same.

As noted earlier in this article, the merge sort algorithm is a three-step process: divide, conquer, and combine. The ‘divide’ step involves the computation of the midpoint of the list, which, regardless of the list size, takes a single operational step. Therefore the notation for this operation is denoted as O(1).

The ‘conquer’ step involves dividing and recursively solving subarrays; because the list is halved at each level of recursion, there are log n such levels. The ‘combine’ step consists of merging the sorted results at each level; this operation’s execution time depends on the list size and is denoted O(n).

Putting this together, merge sort performs O(n) merging work at each of the log n levels of recursion, i.e., n * log n operations in total. In Big O notation, low-order terms and constants are negligible, meaning the final notation for the merge sort algorithm is O(n log n). For a detailed analysis of the merge sort algorithm, refer to this article.

Evaluation

Merge sort performs well when sorting large lists, but its operation time is slower than other sorting solutions when used on smaller lists. Another disadvantage of merge sort is that it will execute the operational steps even if the initial list is already sorted. In the use case of sorting linked lists, merge sort is one of the fastest sorting algorithms to use. Merge sort can be used in file sorting within external storage systems, such as hard drives.

Key takeaways

This article describes the merge sort technique by breaking it down in terms of its constituent operations and step-by-step processes.

The merge sort algorithm is commonly used, and the intuition and implementation behind it are rather straightforward in comparison to other sorting algorithms. This article includes the implementation steps of the merge sort algorithm in Python.

You should also know that merge sort’s time complexity remains the same, O(n log n), for the best, worst, and average scenarios. It is recommended that the merge sort algorithm be applied in the following scenarios:

  • When dealing with larger sets of data, use the merge sort algorithm. Merge sort performs poorly on small arrays when compared to other sorting algorithms.
  • Elements within a linked list have a reference to the next element within the list. This means that, within the merge sort operation, the pointers are modifiable, allowing elements to be compared and inserted with constant time and space overhead.
  • Use merge sort when you have some certainty that the array is unsorted; merge sort will execute its operations even on already sorted arrays, which wastes computing resources.
  • Use merge sort when there is a consideration for the stability of data. Stable sorting involves maintaining the relative order of identical values: compared with the unsorted input, identical values appear in the same relative positions in the sorted output.