Drug discovery startup Insilico Medicine—alongside researchers from Harvard Medical School, Johns Hopkins School of Medicine, the Mayo Clinic, and others—used AI to identify more than two dozen gene targets related to amyotrophic lateral sclerosis (ALS). The research findings, which included 17 high-confidence and 11 novel therapeutic targets, were recently published in Frontiers in Aging Neuroscience.
Using Insilico’s AI-driven target discovery engine, called PandaOmics, the researchers analyzed massive datasets to discover genes that new drugs could target to improve outcomes for ALS, also known as Lou Gehrig’s disease. Today, patients typically face an average life expectancy of between two and five years after symptom onset.
The research team used NVIDIA GPUs to train the deep learning models for target identification. The PandaOmics AI engine uses a combination of omics AI scores, text-based AI scores, financial scores, and more to rank gene targets.
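PandaOmics' actual scoring pipeline is proprietary, but the general idea of ranking candidate targets by combining several evidence scores can be illustrated with a minimal, purely hypothetical Python sketch; the gene names, score values, and weights below are invented for illustration and do not come from PandaOmics.

# Hypothetical illustration of ranking gene targets by combining evidence scores.
# All names, scores, and weights are invented and do not reflect PandaOmics.
candidate_targets = {
    "GENE_A": {"omics_ai": 0.91, "text_ai": 0.72, "financial": 0.40},
    "GENE_B": {"omics_ai": 0.65, "text_ai": 0.88, "financial": 0.55},
    "GENE_C": {"omics_ai": 0.80, "text_ai": 0.35, "financial": 0.70},
}
weights = {"omics_ai": 0.6, "text_ai": 0.3, "financial": 0.1}  # assumed weighting

def combined_score(scores: dict) -> float:
    # Weighted sum of the individual evidence scores.
    return sum(weights[name] * value for name, value in scores.items())

ranked = sorted(candidate_targets.items(), key=lambda kv: combined_score(kv[1]), reverse=True)
for gene, scores in ranked:
    print(f"{gene}: {combined_score(scores):.3f}")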
ALS is a debilitating disease. Patients rapidly lose voluntary muscle movement, affecting the ability to walk, talk, eat, and breathe. The five existing FDA-approved therapies for the disease are unable to halt or reverse this loss of function, which affects more than 700,000 people around the world.
“The results of this collaborative research effort show what is possible when we bring together human expertise with AI tools to discover new targets for diseases where there is a high unmet need,” said Alex Zhavoronkov, founder and CEO of Insilico Medicine, in a press release. “This is only the beginning.”
Insilico Medicine is a Premier member of NVIDIA Inception, a global program designed to support cutting-edge startups with co-marketing, expertise, and technology.
AI uncovers new paths to treat untreatable diseases
The research team used Quiver, a distributed graph learning library, to accelerate its AI models on multiple NVIDIA GPUs. They used natural language processing models including BioBERT, GPT, and OPT, as well as text recognition models including PaddleOCR and docTR.
To help identify the genes related to ALS, the researchers used public datasets as well as data from Answer ALS, a global project with clinical data consisting of 2.6 trillion data points from around 1,000 ALS patients. In a preclinical animal model, the team validated that 18 of the 28 identified gene targets were functionally correlated to ALS—and that in eight of them, suppression would strongly reduce neurodegeneration.
The researchers are now working to advance some of these targets toward clinical trials for ALS. The targets will be shared on ALS.AI to help accelerate drug discovery.
Earlier this year, Insilico began a Phase 1 clinical trial for an AI-discovered, AI-designed drug to treat pulmonary fibrosis, another fast-progressing, hard-to-treat disease.
Over the last decades, organizations of all sizes across the world have flocked to implement video management systems (VMS) that tie together the components of a video network infrastructure. By allowing businesses to easily capture, record, store, retrieve, view, and analyze video collected from their cameras, VMS can improve their operations, increase visibility, and enhance safety.
VMS infrastructure is now so pervasive that enterprises can no longer monitor the firehose of video streaming day and night. The growing need for scalable and real-time analysis of video is possibly the greatest driver today of AI in the enterprise. With vast amounts of video data to be analyzed in real time, smart video analytics call for edge AI technology, where the heavy computation executes in the field near sensors like video cameras.
Organizations across all industries are eager to add AI to their existing VMS to maximize the return on their initial investments and take advantage of this valuable data. Unfortunately, it is a difficult task.
Organizations must partner with an independent software vendor who provides an intelligent video analytics (IVA) application. The vendor must then develop, deploy, manage, and support their own integration for every application that the organization wants to run. It is a painstaking process that requires significant time, energy, and expertise to execute.
Milestone Systems, itself an NVIDIA Metropolis partner and a global leader in VMS, is helping to address this challenge and making it easier for hundreds of other Metropolis IVA partners to bring valuable vision AI applications to more users.
John Madsen, a senior research engineer at Milestone, explains, “When you have thousands of cameras that are recording 24/7, how do you find the relevant data? With AI, our end users can find recorded events in their logs that they want to find in minutes instead of combing through hours and hours of footage. We want to help our end users find the relevant video footage and run live analytics.”
Introducing AI Bridge
Milestone has embarked on a mission to help their customers get the most out of their existing VMS platforms. The result is Milestone AI Bridge.
AI Bridge is an API gateway that eases the integration of intelligent video analytics (IVA) applications with the Milestone XProtect VMS.
How AI Bridge works (a purely illustrative sketch follows these steps):
A camera sends video data to the VMS site.
The VMS site is connected to AI Bridge and exchanges video data with it.
AI Bridge connects the video from the VMS site to the GPU-accelerated IVA applications to run AI analytics and generate insights.
The insights are then fed back into the VMS so that actions can be taken based on whatever insight is provided from the AI application.
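Milestone's AI Bridge API itself is not shown in this article, so the following Python sketch only mirrors the four steps above with hypothetical placeholder functions; none of these names belong to a real Milestone or NVIDIA SDK.

# Purely illustrative pseudocode of the VMS -> AI Bridge -> IVA application loop.
# Every function below is a hypothetical placeholder, not a real API.

def fetch_frames_from_vms(camera_id: str):
    # Pull the latest video frames for one camera from the VMS site.
    ...

def run_iva_application(frames):
    # Run a GPU-accelerated analytics model and return structured insights.
    ...

def push_insights_to_vms(camera_id: str, insights) -> None:
    # Write detections and events back into the VMS so operators can act on them.
    ...

def bridge_loop(camera_id: str) -> None:
    while True:
        frames = fetch_frames_from_vms(camera_id)
        insights = run_iva_application(frames)
        push_insights_to_vms(camera_id, insights)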
With AI Bridge, Milestone users can now instantly integrate third-party AI models into their own video systems. These users are typically application providers or independent software vendors that help organizations create IVA applications.
To get access to AI Bridge from Milestone, create an account with the NGC catalog.
AI Bridge in action
Another NVIDIA Metropolis partner, DataFromSky is using AI Bridge to provide AI solutions for smart parking, traffic control, and retail.
One of their customers, the Køge Nord Train Station, located near Copenhagen, was experiencing heavy commuter congestion. For many commuters, a lack of parking spots and traffic congestion can lead to frustration, wasted time, accidents, and even missed trains or buses.
To solve this, DataFromSky built an intelligent parking application that monitors parking lots for occupancy, enables mobile payments, and navigates drivers to empty parking spots. With the addition of AI, each camera installed on the parking lot is able to monitor up to 400 parking spots in real-time. All this results in commuters having smoother and better travel experiences.
Thanks to AI Bridge, DataFromSky is able to integrate AI solutions into their customers’ existing camera infrastructure easily. This results in a significantly faster installation time, especially critical for larger deployments that may span hundreds of cameras.
Bringing AI Bridge to life
In building AI Bridge, Milestone knew that they needed to work with a partner that had deep roots in the AI community. That is why they chose NVIDIA.
“Our VMS works on a Windows platform, which is very different from the AI community, which uses modern software such as Linux, Kubernetes, and Docker,” says Madsen. “Working with NVIDIA allows us to modernize our stack and makes it extremely easy for us to work with the AI community.”
Milestone leveraged a wide array of NVIDIA AI products to make AI Bridge possible.
NVIDIA-Certified Systems provide enterprises with optimized hardware to enable quick and efficient video processing and inference that can be scaled across many cameras.
The NVIDIA Metropolis platform is an application framework that simplifies the development and scaling of IVA applications and connects them to the broader AI ecosystem.
NVIDIA Fleet Command is a managed platform for container orchestration that streamlines the provisioning and deployment of systems and AI applications at the edge.
Milestone leverages Fleet Command to deploy the AI Bridge API remotely onto dozens or even thousands of edge systems within minutes.
“A big challenge is not just the integration, but deploying the analytics on-premises and how you manage it,” added Madsen. “This is why we turned to NVIDIA Fleet Command.”
Fleet Command also provides a single control plane for IT administrators to securely manage all AI applications through one dashboard. This makes it the ideal way to accelerate deployments, POCs, and edge infrastructure management.
The use cases of IVA
IVA promises to bring new, intelligent, and transformational use cases to every industry.
You’re invited to connect with NVIDIA experts through a new exclusive series of Ask Me Anything (AMA) sessions. During these live Q&As, members of the NVIDIA Developer Program can submit questions to our experts, brainstorm about common challenges that developers are facing, and engage in online discussions about NVIDIA technologies. The series will also provide guidance on integrating NVIDIA SDKs.
The AMA series kicks off on July 28 at 10:00 AM, Pacific time. Attendees can get tips on incorporating real-time rendering across their projects from the editors of Ray Tracing Gems II:
Adam Marrs is a principal engineer in the Game Engines and Core Technology group at NVIDIA. He holds a Ph.D. in computer science and has shipped graphics code in various AAA games and commercial game engines. He has written for GPU Zen 2, Ray Tracing Gems, and recently served as the editor-in-chief of Ray Tracing Gems II.
Peter Shirley is a distinguished engineer in the Research group at NVIDIA. He holds a Ph.D. in computer science and has worked in academics, startup companies, and industry. He is the author of several books, including the recent Ray Tracing in One Weekend series.
Ingo Wald is a director of ray tracing at NVIDIA. He holds a Ph.D. in computer science, has a long history of research related to ray tracing in both academia and industry, and is known for authoring and co-authoring various papers and open-source software projects on rendering, visualization, and data structures.
Eric Haines currently works at NVIDIA on interactive ray tracing. He co-authored the books Real-Time Rendering, 4th Edition and An Introduction to Ray Tracing. He edited The Ray Tracing News, and co-founded the Journal of Graphics Tools and the Journal of Computer Graphics Techniques. Most recently, he co-edited Ray Tracing Gems.
Each of these exclusive Q&A sessions will offer the developer community a chance to get answers from experts in real time, along with a forum for collaboration after the event.
To participate, you must be a member of the NVIDIA Developer Program. Sign up if you’re not already a member. Post questions to the dedicated online forum before the event and during the 60-minute live session.
Mark your calendars for the second AMA in the series scheduled for October 26, 2022. We’ll dive into best practices for building, training, and deploying recommender systems.
Effective and robust visual question answering (VQA) systems cannot exist without high-quality, semantically and stylistically diverse, large-scale training data of image-question-answer triplets. But creating such data is time consuming and onerous. Perhaps unsurprisingly, the VQA community has focused more on sophisticated model development than on scalable data creation.
In “All You May Need for VQA are Image Captions,” published at NAACL 2022, we explore VQA data generation by proposing “Visual Question Generation with Question Answering Validation” (VQ2A), a pipeline that works by rewriting a declarative caption into multiple interrogative question-answer pairs. More specifically, we leverage two existing assets — (i) large-scale image-text data and (ii) large-capacity neural text-to-text models — to achieve automatic VQA data generation. As the field has progressed, the research community has been making these assets larger and stronger in isolation (for general purposes such as learning text-only or image-text representations); together, they can achieve more and we adapt them for VQA data creation purposes. We find our approach can generate question-answer pairs with high precision and that this data can successfully be used for training VQA models to improve performance.
The VQ2A technique enables VQA data generation at scale from image captions by rewriting each caption into multiple question-answer pairs.
VQ2A Overview
The first step of the VQ2A approach is to apply heuristics based on named entity recognition, part-of-speech tagging, and manually defined rules to generate answer candidates from the image caption. These generated candidates are small pieces of information that may be relevant subjects about which to ask questions. We also add to this list two default answers, “yes” and “no”, which allow us to generate Boolean questions.
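The paper's extraction heuristics are not released as code, but a rough approximation of this step, using spaCy as a stand-in for the NER and part-of-speech tooling, might look like the following sketch.

# Rough approximation of candidate answer extraction from a caption,
# using spaCy as a stand-in for the NER/POS tooling described above.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_answer_candidates(caption: str) -> list[str]:
    doc = nlp(caption)
    candidates = set()
    # Named entities (people, places, numbers, ...) make natural answers.
    candidates.update(ent.text for ent in doc.ents)
    # Noun chunks cover objects and subjects not tagged as entities.
    candidates.update(chunk.text for chunk in doc.noun_chunks)
    # Simple POS-based rule: standalone adjectives and numerals are also useful answers.
    candidates.update(tok.text for tok in doc if tok.pos_ in {"ADJ", "NUM"})
    # Default Boolean answers enable yes/no question generation.
    candidates.update({"yes", "no"})
    return sorted(candidates)

print(extract_answer_candidates("A brown dog jumps over two wooden fences."))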
Then, we use a T5 model that was fine-tuned to generate questions for the candidate, resulting in [question, candidate answer] pairs. We then filter for the highest quality pairs using another T5 model (fine-tuned to answer questions) by asking it to answer the question based on the caption. That is, we compare the candidate answer to the output of this model and, if the two answers are similar enough, we define this question as high quality and keep it. Otherwise, we filter it out.
The idea of using both question answering and question generation models to check each other for their round-trip consistency has been previously explored in other contexts. For instance, Q2 uses this idea to evaluate factual consistency in knowledge-grounded dialogues. In the end, the VQ2A approach, as illustrated below, can generate a large number of [image, question, answer] triplets that are high-quality enough to be used as VQA training data.
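The fine-tuned T5 checkpoints used in the paper are not public, so the sketch below illustrates the generate-then-validate loop with the generic t5-base checkpoint from Hugging Face as a placeholder for the question-generation and question-answering models, and a crude token-overlap score standing in for the paper's answer-comparison metric.

# Sketch of the question-generation plus round-trip answer-validation filter.
# In the paper, qg_model and qa_model would be T5 checkpoints fine-tuned for
# question generation and question answering; "t5-base" is only a placeholder.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
qg_model = T5ForConditionalGeneration.from_pretrained("t5-base")  # placeholder for a QG checkpoint
qa_model = T5ForConditionalGeneration.from_pretrained("t5-base")  # placeholder for a QA checkpoint

def generate(model, prompt: str) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=32)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def similarity(a: str, b: str) -> float:
    # Crude token-overlap F1, standing in for the paper's answer-comparison metric.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(ta & tb) / max(len(ta) + len(tb), 1)

def make_qa_pairs(caption: str, candidates: list[str], threshold: float = 0.9):
    pairs = []
    for answer in candidates:
        # Step (ii): generate a question whose answer should be `answer`.
        question = generate(qg_model, f"answer: {answer} context: {caption}")
        # Step (iii): round-trip check; answer the generated question from the caption.
        predicted = generate(qa_model, f"question: {question} context: {caption}")
        if similarity(predicted, answer) >= threshold:
            pairs.append((question, answer))
    return pairs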
VQ2A consists of three main steps: (i) candidate answer extraction, (ii) question generation, (iii) question answering and answer validation.
Results
Two examples of our generated VQA data are shown below, one based on human-written COCO Captions (COCO) and the other on automatically-collected Conceptual Captions (CC3M), which we call VQ2A-COCO and VQ2A-CC3M, respectively. We highlight the variety of question types and styles, which are critical for VQA. Overall, the cleaner the captions (i.e., the more closely related they are to their paired image), the more accurate the generated triplets. Based on 800 samples each, 87.3% of VQ2A-COCO and 66.0% of VQ2A-CC3M are found by human raters to be valid, suggesting that our approach can generate question-answer pairs with high precision.
Generated question-answer pairs based on COCO Captions (top) and Conceptual Captions (bottom). Grey highlighting denotes questions that do not appear in VQAv2, while green highlighting denotes those that do, indicating that our approach is capable of generating novel questions that an existing VQA dataset does not have.
Finally, we evaluate our generated data by using it to train VQA models (highlights shown below). We observe that our automatically-generated VQA data is competitive with manually-annotated target VQA data. First, our VQA models achieve high performance on target benchmarks “out-of-the-box”, when trained only on our generated data (light blue and light red vs. yellow). Once fine-tuned on target data, our VQA models outperform target-only training slightly on large-scale benchmarks like VQAv2 and GQA, but significantly on the small, knowledge-seeking OK-VQA (dark blue/red vs. light blue/red).
VQA accuracy on popular benchmark datasets.
Conclusion
All we may need for VQA are image captions! This work demonstrates that it is possible to automatically generate high-quality VQA data at scale, serving as an essential building block for VQA and vision-and-language models in general (e.g., ALIGN, CoCa). We hope that our work inspires other work on data-centric VQA.
Acknowledgments
We thank Roee Aharoni, Idan Szpektor, and Radu Soricut for their feedback on this blogpost. We also thank our co-authors: Xi Chen, Nan Ding, Idan Szpektor, and Radu Soricut. We acknowledge contributions from Or Honovich, Hagai Taitelbaum, Roee Aharoni, Sebastian Goodman, Piyush Sharma, Nassim Oufattole, Gal Elidan, Sasha Goldshtein, and Avinatan Hassidim. Finally, we thank the authors of Q2, whose pipeline strongly influences this work.
Join this digital conference from August 2-4 to learn how science is being advanced through the work done at Open Hackathons or accelerated using OpenACC.
A new approach to data
The convergence of AI and IoT has shifted the center of gravity for data away from the cloud and to the edge of the network. In retail stores, factories, fulfillment centers, and other distributed locations, thousands of sensors are collecting petabytes of data that power insights for innovative AI use cases. Because the most valuable insights are generated at the edge, organizations have quickly adopted new technologies and processes to better capitalize on this new center of gravity.
One of the major technologies adopted is edge computing, the process of bringing the computing power for an application to the same physical location where sensors are collecting information. When this computing method is used to power AI applications at the edge, it’s referred to as edge AI.
To ensure that these edge locations harvesting valuable insights do not exist in isolated silos, organizations are increasingly working to integrate their edge computing solutions into their existing workflows to develop, test, and optimize applications. By having a seamless path from the development process to the deployment process, teams are able to simultaneously have strong visibility into how applications are operating in production environments while also taking advantage of the data and insights collected by the applications at edge locations.
This process will only become more important as AI models are quickly and constantly retrained and iterated on based on new data collected at edge locations.
Machine learning operations and edge AI
Machine learning operations (MLOps) is a system of processes to streamline the development, deployment, monitoring, and ongoing management of machine learning models. It allows organizations to quickly scale the development process for applications and enables rapid iterations between data science and IT teams. MLOps platforms organize that philosophy into a set of tools that can be used cross-functionally in an organization to speed up the rate of innovation.
Integrating MLOps platforms and edge computing solutions allows for a seamless and rapid workflow for data scientists and IT teams to collaboratively develop and deploy applications in production environments. With a complete workflow, teams can significantly increase the rate of innovation as they constantly iterate, test, deploy, and retrain based on insights and information collected at edge sites. And for organizations diligently working to capitalize on the new data paradigm, innovation is paramount.
Integrating Domino Data Lab and NVIDIA Fleet Command
The Domino Data Lab Enterprise MLOps Platform and NVIDIA Fleet Command are now integrated to provide data scientists and IT teams with a consistent, simplified flow from model development to deployment.
Domino Data Lab provides an enterprise MLOps platform that powers model-driven businesses to accelerate the development and deployment of data science work while increasing collaboration and governance. It allows data scientists to experiment, research, test, and validate AI models before deploying them into production.
NVIDIA Fleet Command is a managed platform for container orchestration that streamlines provisioning and deployment of systems and AI applications at the edge. It simplifies the management of distributed computing environments with the scale and resiliency of the cloud, turning every site into a secure, intelligent location.
From development to deployment
The integration with NVIDIA Fleet Command provides Domino Data Lab users an easy avenue to deploy models they are working on to edge locations. The integration bridges the gap between the data scientist team developing applications and IT teams deploying them, allowing both teams access to the entire application lifecycle.
“The integration with NVIDIA Fleet Command is the last piece in the puzzle to give data scientists access to the complete workflow for developing and deploying AI applications to the edge,” says Thomas Robinson, VP of Strategic Partnerships and Corporate Development at Domino Data Lab. “Full visibility into production deployments is critical for teams to take advantage of the data and insights generated at the edge, ultimately producing better applications faster.”
Data scientists can use the Domino Data MLOps Platform to quickly iterate on models they are working on. Through the same interface, users have the ability to load their new models onto Fleet Command, making them available to deploy to any connected location. Once deployed, administrators have remote access to the applications for monitoring and troubleshooting, providing critical feedback that can be used in the next iteration of the model.
A data scientist working on a quality inspection application for a beverage manufacturing plant is one example of this integration used in production environments. The application is used to visually catch dents and defects on cans to prevent them from reaching consumers. The challenge is that the packaging on the cans changes frequently as new designs are tested, seasonal products are released, and event-based packages go to market. The application needs to be able to learn new designs quickly and frequently while still maintaining precise levels of success. This requires a high rate of innovation in order to keep up with the frequent changes in packaging.
To achieve this, the data scientist uses the Domino Data Lab Enterprise MLOps Platform and NVIDIA Fleet Command to create a fast and seamless flow from the development and iteration efforts to the deployment and monitoring efforts. By doing so, they are able to increase the rate of innovation by easily deploying new models with limited disruption in service as products change. Additionally, model monitoring ensures that the data scientist catches any issues with the quality or predictive power of their models.
Watch an end-to-end demo of model development, deployment, and monitoring in the oil and gas space using Domino and NVIDIA Fleet Command.
Get started with Domino on NVIDIA Fleet Command
Deploying applications on NVIDIA Fleet Command is currently available to Domino users. The Domino Enterprise MLOps Platform is also accessible on NVIDIA LaunchPad, which provides free short-term access to a catalog of hands-on labs. Quickly test AI initiatives and get practical experience with scaling data science workloads.
Join this webinar and Metropolis meetup on July 20 and 21 to learn how NVIDIA Jetson Orin and NVIDIA Launchpad boost your go-to-market efforts for vision AI applications.
With the increasing demand for access to pretrained large language model (LLM) weights, the climate around LLM sharing is changing. Recently, Meta released Open Pretrained Transformer, a language model with 175 billion parameters. BigScience is on schedule to release its multilingual language model with 176 billion parameters in a few months.
As more LLMs become available, industries need techniques for solving real-world natural language tasks. Model prompting methods have been shown to elicit good zero– and few-shot performance from LLMs and to yield quality results on various downstream natural language processing (NLP) tasks. Prompting has been proposed as a way to make general, pretrained LLMs practically useful under the pretrain, prompt, and predict paradigm that is becoming increasingly popular in the NLP field.
However, when you are applying prompting methods to industrial NLP applications, there are other challenges to consider. For any downstream NLP task, you must collect labeled data to instruct the language model on how to produce the expected results.
Although for many tasks there is plenty of labeled English data, there are few benchmark-worthy, non-English, downstream datasets. Scarcity of labeled data is the number one challenge for industry to perform NLP tasks in low-resource language environments.
Furthermore, companies usually must dynamically solve multiple downstream NLP tasks that can evolve over time. Continuous learning for new tasks without forgetting previously learned tasks is still a hot research topic. A nice and clean solution means lower model maintenance, lower deployment costs, and fast development.
In this post, we show you how to adapt p-tuning, a prompt learning method, to low-resource language settings. We use an improved version of p-tuning implemented in NVIDIA NeMo that enables the continuous multitask learning of virtual prompts. In particular, we focus on adapting our English p-tuning workflow to Swedish. Learn more about how a consortium in Sweden plans to make the language model available in Nordic regions.
Our proposed workflow is generic and can easily be modified for other languages.
Why large language models?
As shown in the language model scaling law study by OpenAI, language model performance improves as the language model size increases. This has led to a race to train larger and larger language models.
NVIDIA recently trained a Megatron Turing NLG 530B model, which has superior zero– and few-shot learning performance. To access LLMs, researchers can use paid model APIs such as the ones provided by OpenAI or deploy publicly released models locally.
When you have an LLM that understands language well, you can apply prompt learning methods to make the model solve a plethora of NLP downstream tasks.
A short overview of prompt learning and p-tuning
Instead of selecting discrete text prompts in a manual or automated fashion, prompt learning uses virtual prompt embeddings that can be optimized using gradient descent. These virtual embeddings get automatically inserted among the discrete token embeddings from a text prompt.
During prompt learning, the entire GPT model is frozen and only these virtual token embeddings are updated at each training step. The prompt learning process results in a small number of virtual token embeddings that can be combined with a text prompt to improve task performance at inference time.
In p-tuning specifically, a small long short-term memory (LSTM) model is used as a prompt encoder. The input to the prompt encoder is a task name and the outputs are task-specific virtual token embeddings that are passed into the LLM along with the text prompt embeddings.
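The NeMo implementation is more involved, but a minimal PyTorch sketch of such an LSTM prompt encoder, with made-up dimensions, might look like this.

# Minimal sketch of an LSTM-based prompt encoder (not the actual NeMo code).
# It maps a task ID to a sequence of virtual token embeddings that are inserted
# alongside the frozen GPT model's text prompt embeddings.
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    def __init__(self, num_tasks: int, num_virtual_tokens: int, hidden_size: int):
        super().__init__()
        self.num_virtual_tokens = num_virtual_tokens
        # One learned seed embedding per (task, virtual token position).
        self.seed = nn.Embedding(num_tasks * num_virtual_tokens, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self, task_id: torch.Tensor) -> torch.Tensor:
        # task_id: (batch,) -> virtual token embeddings: (batch, num_virtual_tokens, hidden)
        positions = torch.arange(self.num_virtual_tokens, device=task_id.device)
        idx = task_id.unsqueeze(1) * self.num_virtual_tokens + positions
        lstm_out, _ = self.lstm(self.seed(idx))
        return self.mlp(lstm_out)

encoder = PromptEncoder(num_tasks=2, num_virtual_tokens=10, hidden_size=64)
virtual_tokens = encoder(torch.tensor([0]))  # embeddings for task 0
print(virtual_tokens.shape)  # torch.Size([1, 10, 64])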
A multitask continuous learning solution
Figure 2 shows that p-tuning uses a prompt encoder to generate virtual token embeddings. In the original p-tuning paper, the prompt encoder can only work for one task. We extended it in our NeMo implementation so that the prompt encoder can be conditioned on different tasks’ names.
When the prompt encoder is trained, it maps the task names to a set of virtual token embeddings. This enables you to build an embedding table that stores the mapping between task names and virtual token embeddings for each task. Using this embedding table enables you to continuously learn new tasks and avoid catastrophic forgetting. For example, you can start p-tuning with tasks A and B.
After training, you can save the virtual token embeddings for tasks A and B in the table and freeze them. You can proceed to train task C with another fresh prompt encoder. Similarly, after the training, you save the virtual token embeddings for task C in the prompt table. During the inference, the model can look up the prompt table and use the correct virtual token embeddings for different tasks.
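Conceptually, the prompt table behaves like a dictionary of frozen virtual token embeddings keyed by task name. The following sketch uses stand-in tensors and shapes, not the NeMo internals, to illustrate the add-then-freeze workflow.

# Conceptual sketch of the prompt table used for continual multitask learning.
# The tensors and shapes are stand-ins for illustration only.
import torch

prompt_table: dict[str, torch.Tensor] = {}  # task name -> frozen virtual token embeddings

def add_task(task_name: str, virtual_tokens: torch.Tensor) -> None:
    # Store the trained virtual tokens and freeze them by detaching from the graph.
    prompt_table[task_name] = virtual_tokens.detach().clone()

# After p-tuning tasks A and B, their prompts are frozen in the table...
add_task("sentiment-task", torch.randn(10, 64))   # stand-in for trained embeddings
add_task("intent_and_slot", torch.randn(10, 64))

# ...and a new task can be trained later with a fresh prompt encoder,
# without touching (or forgetting) what was stored for earlier tasks.
def lookup(task_name: str) -> torch.Tensor:
    return prompt_table[task_name]

print(lookup("sentiment-task").shape)  # torch.Size([10, 64])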
Second, p-tuning requires only a few labeled data points to give reasonable results. For example, on the FIQA sentiment analysis task, p-tuning achieved 92% accuracy with only 1,000 labeled examples.
Third, p-tuning as described in the original paper, and even more so in our specific implementation, is extremely parameter-efficient. During p-tuning, an LSTM with parameters equal to a small fraction of the original GPT model’s parameters is tuned while the GPT model weights remain frozen. At the end of training, the LSTM network can be discarded and only the virtual prompts themselves need to be saved. This means parameters totaling less than ~0.01% of the GPT model’s size must be stored and used during inference to achieve dramatically improved task performance compared to zero– and few-shot inference.
Fourth, p-tuning is also more resource-efficient during training. Freezing the GPT model means that we didn’t have to store optimizer states for those model parameters and we didn’t have to spend time updating GPT model weights. This saved a considerable amount of GPU memory.
Lastly, the virtual prompt token parameters are decoupled from the GPT model. This yields the ability to distribute small virtual token parameter files that can be plugged into a shared access GPT model without the need for also sharing updated GPT model weights, as would be required if the GPT model were fine-tuned.
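As a rough back-of-the-envelope check of the parameter-efficiency point above, the stored prompt parameters amount to a tiny fraction of the model; the virtual token count and hidden size below are assumed values, not the published GPT-SW3 configuration.

# Back-of-the-envelope estimate of stored prompt parameters vs. model size.
# num_virtual_tokens and hidden_size are assumed values for illustration only.
gpt_params = 3_600_000_000      # GPT-SW3 with 3.6 billion parameters
num_virtual_tokens = 10         # assumed number of virtual prompt tokens
hidden_size = 4096              # assumed embedding dimension

prompt_params = num_virtual_tokens * hidden_size
print(f"stored prompt parameters: {prompt_params:,}")                 # 40,960
print(f"fraction of model size:   {prompt_params / gpt_params:.6%}")  # roughly 0.001%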
Creating Swedish downstream task datasets
To apply p-tuning to non-English downstream tasks, we needed labeled data in the target language. Because there is an abundance of labeled English downstream task data, we used a machine translation model to translate this English labeled data into the target low-resource language. For this post, we translated our English data into Swedish. Thanks to p-tuning’s low labeled data requirements, we didn’t have to translate many labeled data points.
To have complete control of the translation model, we chose to use an in-house translation model trained from scratch. This model was trained in the English-to-Swedish/Norwegian (one-to-many) direction using the NeMo NMT toolkit. The training data (parallel corpus) was obtained from Opus. The English-to-Swedish translation quality was manually evaluated by a native bilingual English and Swedish speaker.
We also used other translation models to help check the quality of our translation model. We translated a handful of random samples from the original English benchmark data and manually compared the quality of the other models’ translations with our own. We used DeepL, the Google translation API, and DeepTranslator.
Apart from some clock-and-time systematic errors, the overall translation quality was good enough for us to proceed with converting the English-labeled data into Swedish. With the training and verification of our NeMo NMT English-to-Swedish translation model complete, we used the model to translate two English benchmark datasets: FIQA (sentiment analysis) and Assistant (intent and slot classification).
For convenience, we use svFIQA and svAssistant to distinguish between the original English and the translated Swedish benchmark datasets.
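As a sketch of how such a translation step can be scripted with the NeMo NMT toolkit, assuming a hypothetical in-house .nemo checkpoint and the record format shown below, the conversion might look like this.

# Sketch of translating labeled English records to Swedish with a NeMo NMT model.
# The checkpoint path, file names, and field handling are assumptions for illustration.
import json
from nemo.collections.nlp.models import MTEncDecModel

model = MTEncDecModel.restore_from("en_sv_nmt.nemo")  # hypothetical in-house checkpoint

def translate_record(record: dict) -> dict:
    translated = dict(record)
    # Translate the free-text fields; keep the task name untouched.
    translated["sentence"], translated["label"] = model.translate(
        [record["sentence"], record["label"]], source_lang="en", target_lang="sv"
    )
    return translated

with open("fiqa_en.jsonl") as src, open("fiqa_sv.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(translate_record(json.loads(line)), ensure_ascii=False) + "\n")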
Here are randomly selected examples of training records from FIQA and svFIQA, respectively:
English:
{"taskname": "sentiment-task", "sentence": "Barclays PLC & Lloyds Banking Group PLC Are The 2 Banks I'd Buy Today. Sentiment for Lloyds ", "label": "positive"}
Swedish:
{"taskname": "sentiment-task", "sentence": "Barclays PLC & Lloyds Banking Group PLC är de 2 banker jag skulle köpa idag.. Känslor för Lloyds", "label": "positiva"}
The translated dataset should preserve the correct grammatical structure of the actual English source data. Because the sentiment refers to the two banks, it’s plural. The ground truth label translated to Swedish should also reflect the correct Swedish grammar, that is, “positiva”.
For completeness, we also randomly selected one example each from Assistant and svAssistant:
English:
{"taskname": "intent_and_slot", "utterance": "will you please get the coffee machine to make some coffee", "label": "nIntent: iot_coffeenSlots: device_type(coffee machine)"}
Swedish:
{"taskname": "intent_and_slot", "utterance": "kommer du snälla få kaffemaskinen för att göra lite kaffe", "label": "Intent: iot _ kaffe Slots: enhet _ typ (kaffemaskin)"}
GPT models
The Swedish GPT-SW3 checkpoints used in the following experiments were a result of a partnership between AI Sweden and NVIDIA. More specifically, AI Sweden’s GPT-SW3 checkpoint with 3.6 billion parameters is pretrained using Megatron-LM. This model was used to conduct the Swedish multitask p-tuning experiments described in this post.
Multitask p-tuning experiments
To simulate the typical enterprise customer use case, we imagined a scenario where a user first needs to solve a sentiment analysis NLP task with high accuracy. Later, as the business evolves, the user needs to continue to solve a virtual assistant task with the same model to reduce cost.
We ran p-tuning twice in a continuous learning setup for Swedish. We used the svFIQA dataset for the first NLP task. We then used the svAssistant dataset for the second NLP task.
We could have p-tuned both tasks simultaneously. However, we chose to do two rounds of p-tuning consecutively to showcase the continuous prompt learning capability in NeMo.
We first conducted a series of short hyperparameter tuning experiments for svFIQA and svAssistant using a slightly modified version of this p-tuning tutorial notebook. In these experiments, we identified the optimal number of virtual tokens and best virtual token placements for each task.
To manipulate the total number of virtual tokens and their positions within a text prompt, we modified the sentiment task template within the p-tuning model’s training config file. For more information about the p-tuning configuration file, see the prompt learning config section of our NeMo documentation.
This prompt template is language-specific. Apart from the virtual tokens’ placement and the number of virtual tokens used, it is important to translate the words within each prompt template into the target language. Here, the term “sentiment” (added between the final virtual prompt tokens and the label) should be translated into Swedish.
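In NeMo, this template lives in the prompt learning config. The entry below is only a sketch of what a Swedish sentiment task template could look like; the template string, the Swedish word standing in for “sentiment”, the token counts, and the splits are assumptions, not the configuration we finally used.

# Sketch of a NeMo prompt learning task template for the Swedish sentiment task.
# The template string, token counts, and splits are illustrative assumptions only.
sentiment_task_template = {
    "taskname": "sentiment-task",
    # Virtual prompt tokens surround the translated sentence; "Känsla:" is an
    # assumed Swedish stand-in for the word "sentiment" discussed above.
    "prompt_template": "<|VIRTUAL_PROMPT_0|> {sentence} <|VIRTUAL_PROMPT_1|> Känsla: {label}",
    "total_virtual_tokens": 10,
    "virtual_token_splits": [7, 3],
    "truncate_field": "sentence",
    "answer_only_loss": True,
    "answer_field": "label",
}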
In our experiments, we used 10-fold cross-validation to calculate performance metrics. During our hyperparameter search, we p-tuned the Swedish GPT-SW3 model on the first fold until the validation loss plateaued after 10-20 epochs.
After a few rounds of experimentation in this manner, we settled on a single prompt template for all 10 folds of the svFIQA dataset.
The term “sentiment” was removed from the prompt template and was instead directly included in the {sentence} part of the prompt. This allowed us to easily translate “sentiment” into Swedish along with the rest of the English sentence:
{"taskname": "sentiment-task", "sentence": "Barclays PLC & Lloyds Banking Group PLC är de 2 banker jag skulle köpa idag.. Känslor för Lloyds", "label": "positiva"}
After finding an optimal configuration for training, we p-tuned our Swedish GPT-SW3 model on each of the 10 svFIQA folds. We evaluated the p-tuned checkpoint for every fold on its corresponding test split. We added intent and slot prediction capability to our GPT-SW3 model by repeating the same steps with the svAssistant dataset, this time restoring our checkpoints trained on svFIQA and adding the intent and slot task.
Results
To establish a baseline, and because there are no existing benchmarks for Swedish in this context, we used the original AI Sweden GPT-SW3 model’s zero–, one–, and few-shot learning performance as the baseline (Figure 3).
As Figure 3 shows, the one– and few-shot learning performance on svFIQA is 42-52%. Understandably, the zero-shot performance is significantly worse because the GPT model receives zero labeled examples and generates tokens that are most likely unrelated to the given task.
Given the binary nature of this sentiment analysis task, we mapped all Swedish grammatical variants of the words “positiv” and “negativ” to the same format before calculating task accuracy.
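A minimal normalization step, assuming a hand-built list of grammatical variants, could look like this.

# Minimal sketch of mapping Swedish grammatical variants to a canonical label.
# The list of variants is an assumption; extend it as needed for your data.
VARIANTS = {
    "positive": {"positiv", "positivt", "positiva"},
    "negative": {"negativ", "negativt", "negativa"},
}

def normalize_label(generated: str) -> str:
    token = generated.strip().lower()
    for canonical, forms in VARIANTS.items():
        if token in forms:
            return canonical
    return token  # leave anything unrecognized unchanged

assert normalize_label("positiva") == "positive"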
With this re-mapping mechanism, we achieved fairly good results: 82.65%. The p-tuning performance on the svFIQA test is averaged across all 10 folds.
Table 1. First round of p-tuning performance on svFIQA (10-fold average accuracy)
Table 2 shows the results for the second round of p-tuning on the svAssistant dataset (intent and slot classification). Scores are averaged across all 10 folds as well.
         Precision   Recall   F1-Score
Average  88.00%      65.00%   73.00%
Table 2. Second round of p-tuning performance on the svAssistant dataset
Next, we further explored the question, “How much can we reduce the total amount of training data without decreasing performance?”
For the svFIQA dataset, we discovered that we could get away with as little as one-tenth of the training data in each training run and still maintain acceptable performance. However, at 5% of the training data (only 47 data points for training), we started to see steep degradation, and performance became unstable at around 1% (as few as nine data points for training, averaged across six training runs, each with nine randomly sampled data points).
Future work
We noticed that the results for intent and slot classification can be improved. They are heavily dependent on the translation model’s ability to translate non-natural text from English to Swedish. In the following example, the English intent and slot prompt formatting were difficult for the translation model to translate accurately, compromising the quality of the Swedish translations.
The label for English is “Intent: alarm_set Slots: date(sunday), time (eight am)”.
When translated to Swedish, this label became “tid (åtta am)”.
The translation model skipped the words “Intent:” and “Slot:” completely. It also dropped the translation for alarm_set in the intent as well as date(sunday) in the slot.
In the future, we will formulate source language data as natural language before translating it into the target language. We are also experimenting with a pretrained mT5 model that can skip the translation steps completely. The early results are promising, so stay tuned for the full results.
Lastly, we also plan to compare prompt learning methods against full fine-tuning of the base GPT models. This will enable us to compare trade-offs between the two task adaptation approaches.
Conclusion
In this post, we demonstrated a parameter-efficient solution to solving multiple NLP tasks in a low-resource language setting. Focusing on the Swedish language, we translated English sentiment classification and intent/slot classification datasets into Swedish. We then p-tuned the Swedish GPT-SW3 model on these datasets and achieved good performance compared to our few-shot learning baselines.
We showed that our approach can train the prompt encoder with as little as one-tenth of the original training data, tuning less than 0.1% of the model’s original parameters, while still maintaining performance.
Because the LLM is frozen during training, p-tuning requires fewer resources and the whole training process can be done efficiently and quickly, which democratizes LLM access for anyone. You can bring your own data and tune the model for your own use cases.
In our NeMo p-tuning implementation, we make lightweight, continuous learning easy as well. You can use our approach to continuously learn and deploy new tasks without degrading the performance of previously added tasks.