
STUDY: Socially aware temporally causal decoder recommender systems

Reading has many benefits for young students, such as better linguistic and life skills, and reading for pleasure has been shown to correlate with academic success. Furthermore, students have reported improved emotional wellbeing from reading, as well as better general knowledge and better understanding of other cultures. With the vast amount of reading material available both online and offline, finding age-appropriate, relevant, and engaging content can be a challenging task, but helping students do so is a necessary step to engage them in reading. Effective recommendations that present students with relevant reading material help keep students reading, and this is where machine learning (ML) can help.

ML has been widely used in building recommender systems for various types of digital content, ranging from videos to books to e-commerce items. Recommender systems are used across a range of digital platforms to help surface relevant and engaging content to users. In these systems, ML models are trained to suggest items to each user individually based on user preferences, user engagement, and the items under recommendation. These data provide a strong learning signal for models to be able to recommend items that are likely to be of interest, thereby improving user experience.

In “STUDY: Socially Aware Temporally Causal Decoder Recommender Systems”, we present a content recommender system for audiobooks in an educational setting that takes into account the social nature of reading. We developed the STUDY algorithm in partnership with Learning Ally, an educational nonprofit aimed at promoting reading in students with dyslexia, which provides audiobooks to students through a school-wide subscription program. Leveraging the wide range of audiobooks in the Learning Ally library, our goal is to help students find the right content to boost their reading experience and engagement. Motivated by the fact that what a person’s peers are currently reading has significant effects on what they would find interesting to read, we jointly process the reading engagement history of students who are in the same classroom. This allows our model to benefit from live information about what is currently trending within the student’s localized social group, in this case, their classroom.

Data

Learning Ally has a large digital library of curated audiobooks targeted at students, making it well suited for building a social recommendation model to help improve student learning outcomes. We received two years of anonymized audiobook consumption data. All students, schools, and groupings in the data were anonymized, identified only by randomly generated IDs not traceable back to real entities by Google. Furthermore, all potentially identifiable metadata was shared only in aggregated form, to protect students and institutions from being re-identified. The data consisted of time-stamped records of students’ interactions with audiobooks. For each interaction we have an anonymized student ID (which includes the student’s grade level and anonymized school ID), an audiobook identifier, and a date. While many schools distribute students in a single grade across several classrooms, we leverage this metadata to make the simplifying assumption that all students in the same school and in the same grade level are in the same classroom. While this provides the foundation needed to build a better social recommender model, it’s important to note that this does not enable us to re-identify individuals, class groups, or schools.

The STUDY algorithm

We framed the recommendation problem as a click-through rate prediction problem, where we model the conditional probability of a user interacting with each specific item conditioned on both 1) user and item characteristics and 2) the item interaction history sequence for the user at hand. Previous work suggests Transformer-based models, a widely used model class developed by Google Research, are well suited for modeling this problem. When each user is processed individually this becomes an autoregressive sequence modeling problem. We use this conceptual framework to model our data and then extend this framework to create the STUDY approach.
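
Written out for a single user u with interaction history y_1, …, y_T, this autoregressive view factorizes the sequence probability into per-step conditionals (a standard formulation stated here for clarity, not quoted from the paper):

P(y_{1:T} \mid u) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, u)

Each per-step conditional is the click-through probability that the model estimates for every candidate item, given the user's characteristics and everything they have interacted with so far.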

While this approach to click-through rate prediction can model dependencies between past and future item preferences for an individual user and can learn patterns of similarity across users at train time, it cannot model dependencies across different users at inference time. To recognize the social nature of reading and remedy this shortcoming, we developed the STUDY model, which concatenates multiple sequences of books read by each student into a single sequence that collects data from multiple students in a single classroom.

However, this data representation requires careful handling if it is to be modeled by transformers. In transformers, the attention mask is the matrix that controls which inputs can be used to inform the predictions of which outputs. The pattern of using all prior tokens in a sequence to inform the prediction of an output leads to the upper triangular attention matrix traditionally found in causal decoders. However, since the sequence fed into the STUDY model is not temporally ordered, even though each of its constituent subsequences is, a standard causal decoder is no longer a good fit. When predicting each token, the model cannot simply attend to every token that precedes it in the sequence: some of those tokens might have later timestamps and contain information that would not be available at deployment time.

In this figure we show the attention mask typically used in causal decoders. Each row represents an input and each column represents an output. A value of 1 (shown as blue) for a matrix entry at a particular position denotes that the model can observe the input of that row when predicting the output of the corresponding column, whereas a value of 0 (shown as white) denotes the opposite.

The STUDY model builds on causal transformers by replacing the triangular matrix attention mask with a flexible attention mask with values based on timestamps to allow attention across different subsequences. Compared to a regular transformer, which would not allow attention across different subsequences and would have a triangular matrix mask within sequence, STUDY maintains a causal triangular attention matrix within a sequence and has flexible values across sequences with values that depend on timestamps. Hence, predictions at any output point in the sequence are informed by all input points that occurred in the past relative to the current time point, regardless of whether they appear before or after the current input in the sequence. This causal constraint is important because if it is not enforced at train time, the model could potentially learn to make predictions using information from the future, which would not be available for a real world deployment.

In (a) we show a sequential autoregressive transformer with causal attention that processes each user individually; in (b) we show an equivalent joint forward pass that results in the same computation as (a); and finally, in (c) we show that by introducing new nonzero values (shown in purple) to the attention mask we allow information to flow across users. We do this by allowing a prediction to condition on all interactions with an earlier timestamp, irrespective of whether the interaction came from the same user or not.
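
As a rough illustration of this masking rule (a minimal sketch, not the authors' implementation), the mask can be built directly from token timestamps: an output may attend to an input only if the input's timestamp is not in the future relative to the output.

import numpy as np

def study_attention_mask(timestamps):
    # timestamps: one entry per token of a classroom's concatenated sequence.
    # Each student's subsequence is ordered, but the full sequence is not.
    t = np.asarray(timestamps)
    # mask[i, j] is True when output i may attend to input j, i.e., when
    # input j's timestamp is not later than output i's timestamp.
    # (How ties in timestamps are handled is a detail omitted here.)
    return t[None, :] <= t[:, None]

# Example: two students' subsequences concatenated back to back.
mask = study_attention_mask([1, 4, 7, 2, 3, 8])

With a single student per sequence this reduces to the usual triangular causal mask; with concatenated classmates it produces the cross-subsequence pattern described above.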


Experiments

We used the Learning Ally dataset to train the STUDY model along with multiple baselines for comparison. We implemented an autoregressive click-through rate transformer decoder, which we refer to as “Individual”, a k-nearest neighbor baseline (KNN), and a comparable social baseline, social attention memory network (SAMN). We used the data from the first school year for training and we used the data from the second school year for validation and testing.

We evaluated these models by measuring the percentage of the time the next item the user actually interacted with was in the model’s top n recommendations, i.e., hits@n, for different values of n. In addition to evaluating the models on the entire test set we also report the models’ scores on two subsets of the test set that are more challenging than the whole data set. We observed that students will typically interact with an audiobook over multiple sessions, so simply recommending the last book read by the user would be a strong trivial recommendation. Hence, the first test subset, which we refer to as “non-continuation”, is where we only look at each model’s performance on recommendations when the students interact with books that are different from the previous interaction. We also observe that students revisit books they have read in the past, so strong performance on the test set can be achieved by restricting the recommendations made for each student to only the books they have read in the past. Although there might be value in recommending old favorites to students, much value from recommender systems comes from surfacing content that is new and unknown to the user. To measure this we evaluate the models on the subset of the test set where the students interact with a title for the first time. We name this evaluation subset “novel”.
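
For reference, hits@n for a single test interaction is simply whether the ground-truth next item appears in the model's top-n list; the reported numbers average this over the test set or subset (a minimal sketch, not the evaluation code itself):

def hits_at_n(ranked_items, true_item, n=5):
    # 1 if the ground-truth next item is among the top-n recommendations, else 0.
    return int(true_item in ranked_items[:n])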

We find that STUDY outperforms all other tested models across almost every slice we evaluated.

In this figure we compare the performance of four models: STUDY, Individual, KNN, and SAMN. We measure performance with hits@5, i.e., how likely the model is to suggest the next title the user read within its top 5 recommendations. We evaluate the models on the entire test set (all) as well as on the novel and non-continuation splits. STUDY consistently outperforms the other three models across all splits.

Importance of appropriate grouping

At the heart of the STUDY algorithm is organizing users into groups and doing joint inference over multiple users who are in the same group in a single forward pass of the model. We conducted an ablation study to examine how the choice of grouping affects model performance. In our presented model, we group together all students who are in the same grade level and school. We then experiment with groups defined by all students in the same grade level and district, and with placing all students in a single group from which a random subset is drawn for each forward pass. We also compare these models against the Individual model for reference.

We found that using groups that were more localized was more effective, with the school and grade level grouping outperforming the district and grade level grouping. This supports the hypothesis that the STUDY model is successful because of the social nature of activities such as reading — people’s reading choices are likely to correlate with the reading choices of those around them. Both of these models outperformed the other two models (single group and Individual) where grade level is not used to group students. This suggests that data from users with similar reading levels and interests is beneficial for performance.

Future work

This work is limited to modeling recommendations for user populations where the social connections are assumed to be homogeneous. In the future it would be beneficial to model a user population where relationships are not homogeneous, i.e., where categorically different types of relationships exist or where the relative strength or influence of different relationships is known.

Acknowledgements

This work involved collaborative efforts from a multidisciplinary team of researchers, software engineers and educational subject matter experts. We thank our co-authors: Diana Mincu, Lauren Harrell, and Katherine Heller from Google. We also thank our colleagues at Learning Ally, Jeff Ho, Akshat Shah, Erin Walker, and Tyler Bastian, and our collaborators at Google, Marc Repnyek, Aki Estrella, Fernando Diaz, Scott Sanner, Emily Salkey and Lev Proleev.


Create Custom Character Detection and Recognition Models with NVIDIA TAO, Part 1

Optical Character Detection (OCD) and Optical Character Recognition (OCR) are computer vision techniques used to extract text from images. Use cases vary across industries and include extracting data from scanned documents or forms with handwritten texts, automatically recognizing license plates, sorting boxes or objects in a fulfillment center based on serial numbers, identifying components for inspection on assembly lines based on part numbers, and more. 

OCR is used in many industries, including financial services, healthcare, logistics, industrial inspection, and smart cities. OCR improves productivity and increases operational efficiency for businesses by automating manual tasks. 

To be effective, OCR must achieve or exceed human-level accuracy. It is inherently complicated due to the variety of use cases it must handle. For example, the text OCR analyzes can vary in font, size, color, shape, and orientation, and can be handwritten or contain other noise such as partial occlusion. Fine-tuning the model on the test environment therefore becomes extremely important for maintaining high accuracy and reducing the error rate.

NVIDIA TAO Toolkit is a low-code AI toolkit that can help developers customize and optimize models for many vision AI applications. NVIDIA introduced new models and features for automating character detection and recognition in TAO 5.0. These models and features will accelerate the creation of custom OCR solutions. For more details, see Access the Latest in Vision AI Model Development Workflows with NVIDIA TAO Toolkit 5.0.

This post is part of a series on using NVIDIA TAO and pretrained models to create and deploy custom AI models to accurately detect and recognize handwritten texts. This part explains the training and fine-tuning of character detection and recognition models using TAO. Part 2 walks you through the steps to deploy the model using NVIDIA Triton. The steps presented can be used with any other OCR tasks.

NVIDIA TAO OCD/OCR workflow

Figure 1. Character recognition pipeline: OCDNet generates bounding boxes around areas of text in an image, a text rectifier corrects text that is distorted or at extreme angles, and OCRNet recognizes the resulting sequences of text

A pretrained model has been trained on large datasets and can be further fine-tuned with additional data to accomplish a specific task. The Optical Character Detection Network (OCDNet) is a TAO pretrained model that detects text in images with complex backgrounds. It uses a process called differentiable binarization to help accurately locate text of various shapes, sizes, and fonts. The result is a bounding box with the detected text.

A text rectifier is middleware that serves as a bridge between character detection and character recognition during the inference phase. Its primary function is to improve the accuracy of recognizing characters on texts that are at extreme angles. To achieve this, the text rectifier takes the vertices of polygons that cover the text area and the original images as inputs. 

The Optical Character Recognition Network (OCRNet) is another TAO pretrained model that can be used to recognize the characters of text that reside in the detected bounding box regions. This model takes the image as network input and produces a sequence of characters as output.

Prerequisites

To follow along with the tutorial, you will need the following:

Download the dataset

This tutorial fine-tunes the OCD and OCR model to detect and recognize handwritten letters. It works with the IAM Handwriting Database, a large dataset containing various handwritten English text documents. These text samples will be used to train and test handwritten text recognizers for the OCD and OCR models.

Figure 2. The handwritten word ‘have’ from the IAM dataset

To gain access to this dataset, register your email address on the IAM registration page.

Once registered, download the following datasets from the downloads page:

  1. data/ascii.tgz
  2. data/formsA-D.tgz
  3. data/formsE-H.tgz
  4. data/formsI-Z.tgz

The following sections explore the Jupyter notebook in more depth, walking through the fine-tuning of OCDNet and OCRNet for detecting and recognizing handwritten characters.

Note that this dataset may be used for noncommercial research purposes only. For more details, review the terms of use on the IAM Handwriting Database page.

Run the notebook

The OCDR Jupyter notebook showcases how to fine-tune the OCD and OCR models to the IAM handwritten dataset. It also shows how to run inference on the trained models and perform deployment.

Set up environment variables

Set up the following environment variables in the Jupyter notebook to match your current directory, then execute:

%env LOCAL_PROJECT_DIR=home//ocdr_notebook
%env NOTEBOOK_DIR=home//ocdr_notebook

# Set this path if you don't run the notebook from the samples directory.
%env NOTEBOOK_ROOT=home//ocdr_notebook

The following folders will be generated:

  • HOST_DATA_DIR contains the train/test split data for model training.
  • HOST_SPECS_DIR houses the specification files that contain the hyperparameters used by TAO to perform training, inference, evaluation, and model deployment.
  • HOST_RESULTS_DIR contains the results of the fine-tuned OCD and OCR models.
  • PRE_DATA_DIR is where the downloaded handwritten dataset files will be located. This path will be called to preprocess the data for OCD/OCR model training.

TAO Launcher uses Docker containers when running tasks. For data and results to be visible to Docker, map the location of your local folders to the Docker container using the ~/.tao_mounts.json file. Run the cell in the Jupyter notebook to generate the ~/.tao_mounts.json file.

The environment is now ready for use with the TAO Launcher. The next steps will prepare the handwritten dataset to be in the correct format for TAO OCD model training.

Prepare the dataset for OCD and OCR

Preprocess the IAM handwritten dataset to match the TAO image format following the steps below. Note that in the folder structure for OCD and OCR model training in TAO, /img houses the handwritten image data, and /gt contains ground truth labels of the characters found in each image. 

|── train
|   ├──img
|   ├──gt
|── test
|   ├──img
|   ├──gt

Begin by moving the four downloaded .tgz files to the location of your $PRE_DATA_DIR directory. If you are following the same steps as above, the .tgz files will be placed in /data/iamdata.

Extract the images and ground truth labels from these files. The subsequent cells will extract the image files and move them to the proper folder format when run.

!tar -xf $PRE_DATA_DIR/ascii.tgz --directory $PRE_DATA_DIR/ words.txt

# Create directories to hold the image data and ground truth files.
!mkdir -p $PRE_DATA_DIR/train/img
!mkdir -p $PRE_DATA_DIR/test/img
!mkdir -p $PRE_DATA_DIR/train/gt
!mkdir -p $PRE_DATA_DIR/test/gt
# Unpack the images, let's use the first two groups of images for training, and the last for validation.

!tar -xzf $PRE_DATA_DIR/formsA-D.tgz --directory $PRE_DATA_DIR/train/img
!tar -xzf $PRE_DATA_DIR/formsE-H.tgz --directory $PRE_DATA_DIR/train/img
!tar -xzf $PRE_DATA_DIR/formsI-Z.tgz --directory $PRE_DATA_DIR/test/img

The data is now organized correctly. However, the ground truth label used by the IAM dataset is currently in the following format:

a01-000u-00-00 ok 154 1 408 768 27 51 AT A


#     a01-000u-00-00  -> word id for line 00 in form a01-000u
#     ok              -> result of word segmentation
#                            ok: word was correctly segmented
#                            er: segmentation of word can be bad
#
#     154            -> graylevel to binarize the line containing this word
#     1               -> number of components for this word
#     408 768 27 51   -> bounding box around this word in x,y,w,h format
#     AT            -> the grammatical tag for this word, see the
#                        file tagset.txt for an explanation
#     A               -> the transcription for this word

The words.txt file looks like this:

  		0				1
0	a01-000u-00-00	ok 154 408 768 27 51 AT A
1	a01-000u-00-01	ok 154 507 766 213 48 NN MOVE
2	a01-000u-00-02	ok 154 796 764 70 50 TO to
...

Currently, words.txt uses a four-point coordinate system for drawing a bounding box around the word in an image. TAO requires the use of an eight-point coordinate system to draw a bounding box around detected text. 

To convert the data to the eight-point coordinate system, use the extract_columns and process_text_file functions provided in section 2.1 of the notebook. words.txt will be transformed into the following DataFrame and will be ready for fine-tuning on an OCDNet model.


filename	   x	 y	x2	y2	x3	y3	x4	y4	word
0	gt_a01-000u.txt	  408	768	435	768	435	819	408	819	A
1	gt_a01-000u.txt	  507	766	720	766	720	814	507	814	MOVE
2	gt_a01-000u.txt	  796	764	866	764	866	814	796	814	to
...

To prepare the dataset for OCRNet, the raw image data and labels must be converted to LMDB format, which converts the images and labels into a key-value memory database.

# Convert the raw train dataset to lmdb
print("Converting the training set to LMDB.")
!tao model ocrnet dataset_convert -e $SPECS_DIR/ocr/experiment.yaml \
                        dataset_convert.input_img_dir=$DATA_DIR/train/processed \
                        dataset_convert.gt_file=$DATA_DIR/train/gt.txt \
                        dataset_convert.results_dir=$DATA_DIR/train/lmdb

# Convert the raw test dataset to lmdb
print("Converting the testing set to LMDB.")
!tao model ocrnet dataset_convert -e $SPECS_DIR/ocr/experiment.yaml \
                        dataset_convert.input_img_dir=$DATA_DIR/test/processed \
                        dataset_convert.gt_file=$DATA_DIR/test/gt.txt \
                        dataset_convert.results_dir=$DATA_DIR/test/lmdb

The data is now processed and ready to be fine-tuned on the OCDNet and OCRNet pretrained models.

Create a custom character detection (OCD) model

The NGC CLI will be used to download the pretrained OCDNet model. For more information, visit NGC and click on Setup in the navigation bar.

Download the OCDNet pretrained model

!mkdir -p $HOST_RESULTS_DIR/pretrained_ocdnet/

# Pulls pretrained models from NGC
!ngc registry model download-version nvidia/tao/ocdnet:trainable_resnet18_v1.0 --dest $HOST_RESULTS_DIR/pretrained_ocdnet/

You can check that the model has been downloaded to /pretrained_ocdnet/ using the following call:

print("Check that model is downloaded into dir.")
!ls -l $HOST_RESULTS_DIR/pretrained_ocdnet/ocdnet_vtrainable_resnet18_v1.0

OCD training specification

In the specs folder, you can find different files related to how you want to train, evaluate, infer, and export data for both models. For training OCDNet, you will use the train.yaml file in the specs/ocd folder. You can experiment with changing different hyperparameters, such as number of epochs, in this spec file. 

Below is a code example of some of the configs that you can experiment with:

num_gpus: 1

model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_deformable_resnet18.pth'
  backbone: deformable_resnet18

train:
  results_dir: /results/train
  num_epochs: 300
  checkpoint_interval: 1
  validation_interval: 1
...

Train the character detection model

Now that the specification files are configured, provide the paths to the spec file, the pretrained model, and the results:

#Train using TAO Launcher
#print("Run training with ngc pretrained model.")
!tao model ocdnet train \
        -e $SPECS_DIR/train.yaml \
        -r $RESULTS_DIR/train \
        model.pretrained_model_path=$DATA_DIR/ocdnet_deformable_resnet18.pth

Training output will resemble the following. Note that this step could take some time, depending on the number of epochs specified in train.yaml.

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name  | Type  | Params
--------------------------------
0 | model | Model | 12.8 M
--------------------------------
12.8 M    Trainable params
0         Non-trainable params
12.8 M    Total params
51.106    Total estimated model params size (MB)
Training: 0it [00:00, ?it/s]Starting Training Loop.
Epoch 0: 100%|█████████| 751/751 [19:57

Evaluate the model

Next, evaluate the OCDNet model trained on the IAM dataset.

# Evaluate on model
!tao model ocdnet evaluate \
        -e $SPECS_DIR/evaluate.yaml \
        evaluate.checkpoint=$RESULTS_DIR/train/model_best.pth

Evaluation output will look like the following:

test model: 100%|██████████████████████████████| 488/488 [06:44

OCD inference

The inference tool produces annotated image outputs and .txt files that contain prediction information. Run the inference tool below to generate inferences on OCDNet models and visualize the results for detected text.

# Run inference using TAO
!tao model ocdnet inference \
        -e $SPECS_DIR/ocd/inference.yaml \
        inference.checkpoint=$RESULTS_DIR/ocd/train/model_best.pth \
        inference.input_folder=$DATA_DIR/test/img \
        inference.results_dir=$RESULTS_DIR/ocd/inference

Figure 3 shows the OCDNet inference on a test sample image.

Figure 3. Output from OCDNet inference on handwritten text, with bounding boxes applied to detected words such as ‘discuss’ and ‘best’

Export the OCD model for deployment

The last step is to export the OCD model to ONNX format for deployment.

!tao model ocdnet export \
        -e $SPECS_DIR/export.yaml \
        export.checkpoint=$RESULTS_DIR/train/model_best.pth \
        export.onnx_file=$RESULTS_DIR/export/model_best.onnx

Create a custom character recognition (OCR) model

Now that you have the trained OCDNet model to detect and apply bounding boxes to areas of handwritten text, use TAO to fine-tune the OCRNet model to recognize and classify the detected letters.

Download the OCRNet pretrained model

Continuing in the Jupyter notebook, the OCRNet pretrained model will be pulled from NGC CLI.

!mkdir -p $HOST_RESULTS_DIR/pretrained_ocrnet/

# Pull pretrained model from NGC
!ngc registry model download-version nvidia/tao/ocrnet:trainable_v1.0 --dest $HOST_RESULTS_DIR/pretrained_ocrnet

OCR training specification

OCRNet will use the experiment.yaml spec file to perform training. You can change training hyperparameters such as batch size, number of epochs, and learning rate shown below:

dataset:
  train_dataset_dir: []
  val_dataset_dir: /data/test/lmdb
  character_list_file: /data/character_list
  max_label_length: 25
  batch_size: 32
  workers: 4

train:
  seed: 1111
  gpu_ids: [0]
  optim:
    name: "adadelta"
    lr: 0.1
  clip_grad_norm: 5.0
  num_epochs: 10
  checkpoint_interval: 2
  validation_interval: 1

Train the character recognition model

Train the OCRNet model on the dataset. You can also configure spec parameters like the number of epochs or learning rate within the train command, shown below.

!tao model ocrnet train -e $SPECS_DIR/ocr/experiment.yaml \
            train.results_dir=$RESULTS_DIR/ocr/train \
            train.pretrained_model_path=$RESULTS_DIR/pretrained_ocrnet/ocrnet_vtrainable_v1.0/ocrnet_resnet50.pth \
            train.num_epochs=20 \
            train.optim.lr=1.0 \
            dataset.train_dataset_dir=[$DATA_DIR/train/lmdb] \
            dataset.val_dataset_dir=$DATA_DIR/test/lmdb \
            dataset.character_list_file=$DATA_DIR/train/character_list.txt

The output will resemble the following:

...
Epoch 19: 100%|█| 3605/3605 [08:04

Evaluate the model

You can evaluate the OCRNet model based on the accuracy of its character recognition. Recognition accuracy is simply the percentage of characters in a text area that were recognized correctly.
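
As a rough sketch of that metric, assuming a character-level comparison of each predicted string against its label (which is how the post describes it):

def recognition_accuracy(predictions, labels):
    correct = total = 0
    for pred, label in zip(predictions, labels):
        total += len(label)
        # Count positions where the predicted character matches the label.
        correct += sum(pc == lc for pc, lc in zip(pred, label))
    return correct / max(total, 1)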

!tao model ocrnet evaluate -e $SPECS_DIR/ocr/experiment.yaml \
            evaluate.results_dir=$RESULTS_DIR/ocr/evaluate \
            evaluate.checkpoint=$RESULTS_DIR/ocr/train/best_accuracy.pth \
            evaluate.test_dataset_dir=$DATA_DIR/test/lmdb \
            dataset.character_list_file=$DATA_DIR/train/character_list.txt

The output should appear similar to the following:

data directory:	/data/iamdata/test/lmdb	 num samples: 37109
Accuracy: 77.8%

OCR inference

Inference on OCR will produce a sequence output of recognized characters from the bounding boxes, shown below.

!tao model ocrnet inference -e $SPECS_DIR/ocr/experiment.yaml \
            inference.results_dir=$RESULTS_DIR/ocr/inference \
            inference.checkpoint=$RESULTS_DIR/ocr/train/best_accuracy.pth \
            inference.inference_dataset_dir=$DATA_DIR/test/processed \
            dataset.character_list_file=$DATA_DIR/train/character_list.txt

+--------------------------------------+--------------------+--------------------+
| image_path                           | predicted_labels   |   confidence score |
|--------------------------------------+--------------------+--------------------|
| /data/test/processed/l04-012_28.jpg  | lelly              |             0.3799 |
| /data/test/processed/k04-068_26.jpg  | not                |             0.9644 |
| /data/test/processed/l04-062_58.jpg  | set                |             0.9542 |
| /data/test/processed/l07-176_39.jpg  | boat               |             0.4693 |
| /data/test/processed/k04-039_39.jpg  | .                  |             0.9286 |
+--------------------------------------+--------------------+--------------------+

Export OCR model for deployment

Finally, export the OCR model to ONNX format for deployment.

!tao model ocrnet export -e $SPECS_DIR/ocr/experiment.yaml \
            export.results_dir=$RESULTS_DIR/ocr/export \
            export.checkpoint=$RESULTS_DIR/ocr/train/best_accuracy.pth \
            export.onnx_file=$RESULTS_DIR/ocr/export/ocrnet.onnx \
            dataset.character_list_file=$DATA_DIR/train/character_list.txt

Results

Table 1 highlights the accuracy and performance of the two models featured in this post. The character detection model is fine-tuned from the OCDNet model pretrained on ICDAR, and the character recognition model is fine-tuned from the OCRNet model pretrained on Uber-Text. ICDAR and Uber-Text are publicly available datasets that we used to pretrain the OCDNet and OCRNet models, respectively. Both models are available on NGC.

                                              OCDNet                      OCRNet
Dataset                                       IAM Handwritten Dataset     IAM Handwritten Dataset
Backbone                                      Deformable Conv ResNet18    ResNet50
Accuracy                                      90%                         78%
Inference resolution                          1024×1024                   1x32x100
Inference performance (FPS) on NVIDIA L4 GPU  125 FPS (BS=1)              8030 FPS (BS=128)
Table 1. Performance and accuracy data for OCDNet and OCRNet

Summary

This post explains the end-to-end workflow for creating custom character detection and recognition models in NVIDIA TAO. You can start with a pretrained model for character detection (OCDNet) and character recognition (OCRNet) from NGC. Then fine-tune it on your custom dataset using TAO and export the model for inference. 

Continue reading Part 2 for a step-by-step walkthrough on deploying this model into production using NVIDIA Triton.


Create Custom Character Detection and Recognition Models with NVIDIA TAO, Part 2

NVIDIA Triton Inference Server streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained ML or DL models from any framework on any GPU- or CPU-based infrastructure. It helps developers deliver high-performance inference across cloud, on-premises, edge, and embedded devices. 

The nvOCDR library is integrated into Triton for inference. The nvOCDR library wraps the entire inference pipeline for optical character detection and recognition (OCD/OCR). This library consumes OCDNet and OCRNet models that are trained on TAO Toolkit. For more details, refer to the nvOCDR documentation.

This post is part of a series on using NVIDIA TAO and pretrained models to create and deploy custom AI models to accurately detect and recognize handwritten texts. Part 1 explains the training and fine-tuning of character detection and recognition models using TAO. This part walks you through the steps to deploy the model using NVIDIA Triton. The steps presented can be used with any other OCR tasks.

Build the Triton sample with OCD/OCR models

The following steps show the simple and recommended way to build and use OCD/OCR models in Triton Inference Server with Docker images.

Step 1: Prepare the ONNX models

Once you follow ocdnet.ipynb and ocrnet.ipynb to finish the model training and export, you will have two ONNX models, ocdnet.onnx and ocrnet.onnx. (In ocdnet.ipynb, the exported ONNX is named model_best.onnx. In ocrnet.ipynb, the exported ONNX is named best_accuracy.onnx.)

# bash commands
$ mkdir onnx_models
$ cd onnx_models
$ cp /export/model_best.onnx ./ocdnet.onnx
$ cp /export/best_accuracy.onnx ./ocrnet.onnx

The character list file, generated in ocrnet.ipynb, is also needed:

$ cp /character_list ./

Step 2: Get the nvOCDR repository

To get the nvOCDR repository, use the following script:

$ git clone https://github.com/NVIDIA-AI-IOT/NVIDIA-Optical-Character-Detection-and-Recognition-Solution.git

Step 3: Build the Triton server Docker image

The building process of Triton server and client Docker images can be launched automatically by running related scripts:

$ cd NVIDIA-Optical-Character-Detection-and-Recognition-Solution/triton

# bash setup_triton_server.sh [input image height] [input image width] [OCD input max batchsize] [DEVICE] [ocd onnx path] [ocr onnx path] [ocr character list path]
$ bash setup_triton_server.sh 1024 1024 4 0 ~/onnx_models/ocd.onnx ~/onnx_models/ocr.onnx ~/onnx_models/ocr_character_list

Step 4: Build the Triton client Docker image

Use the following script to build the Triton client Docker image:

$ cd NVIDIA-Optical-Character-Detection-and-Recognition-Solution/triton
$ bash setup_triton_client.sh

Step 5: Run nvOCDR Triton server

After building the Triton server and Triton client Docker images, create a container and launch the Triton server:

$ docker run -it --net=host --gpus all --shm-size 8g nvcr.io/nvidian/tao/nvocdr_triton_server:v1.0 bash

Next, modify the config file of the nvOCDR lib, which can support high-resolution input images (4000 x 4000 or larger). If your input images are large, change the config file at /opt/nvocdr/ocdr/triton/models/nvOCDR/spec.json in the Triton server container to support high-resolution image inference.

# to support high resolution images
$ vim /opt/nvocdr/ocdr/triton/models/nvOCDR/spec.json

   "is_high_resolution_input": true,
   "resize_keep_aspect_ratio": true,

resize_keep_aspect_ratio is set to True automatically if you set is_high_resolution_input to True. If you are going to infer images with smaller resolutions (640 x 640 or 960 x 1280, for example), you can set is_high_resolution_input to False.

In the container, run the following command to launch the Triton server:

$ CUDA_VISIBLE_DEVICES= tritonserver --model-repository /opt/nvocdr/ocdr/triton/models/

Step 6: Send an inference request

In a separate console, launch the nvOCDR example from the Triton client container:

$ docker run -it --rm -v :  --net=host nvcr.io/nvidian/tao/nvocdr_triton_client:v1.0 bash

Launch the inference:

$ python3 client.py -d  -bs 1

Figure 1. Predicted output from OCDNet and OCRNet on a sample handwritten image, with bounding boxes around words such as ‘stairs’ and ‘rushed’

Conclusion

NVIDIA TAO 5.0 introduced several features and models for Optical Character Detection (OCD) and Optical Character Recognition (OCR). This post walks through the steps to customize and fine-tune the pretrained model to accurately recognize handwritten texts on the IAM dataset. This model achieves 90% accuracy for character detection and about 80% for character recognition. All the steps mentioned in the post can be run from the provided Jupyter notebook, making it easy to create custom AI models with minimal coding. 

For more information, see:


Release: NVIDIA DeepStream SDK version 6.3

Explore the latest streaming analytics features and advancements with this new release.


Better 3D Meshes, from Reconstruction to Generative AI

Next-generation AI pipelines have shown incredible success in generating high-fidelity 3D models, ranging from reconstructions that produce a scene matching given images to generative AI pipelines that produce assets for interactive experiences.

These generated 3D models are often extracted as standard triangle meshes. Mesh representations offer many benefits, including support in existing software packages, advanced hardware acceleration, and support for physics simulation. However, not all meshes are equal, and these benefits are only realized on a high-quality mesh.

Recent NVIDIA research introduced a new approach called FlexiCubes for generating high-quality meshes in 3D pipelines, improving quality across a range of applications.

FlexiCubes mesh generation

Figure 1. Example mesh of a hand statue reconstructed by FlexiCubes

The common ingredient across AI pipelines from reconstruction to simulation is that meshes are generated from an optimization process. At each step of the process, the representation is updated to match the desired output better.

The new idea of FlexiCubes mesh generation is to introduce additional, flexible parameters that precisely adjust the generated mesh. By updating these parameters during optimization, mesh quality is greatly improved.

Those familiar with mesh-based pipelines might have used marching cubes in the past to extract meshes. FlexiCubes can be used as a drop-in replacement for marching cubes in optimization-based AI pipelines.

Figure 2. High-quality meshes of digitally generated motorcycles extracted with FlexiCubes

FlexiCubes generates high-quality meshes from neural workflows like photogrammetry and generative AI.

Better meshes, better AI

FlexiCubes mesh extraction improves the results of many recent 3D mesh generation pipelines, producing higher-quality meshes that do a better job at representing fine details in complex shapes.

The generated meshes are also well suited for physics simulation, where mesh quality is especially important for making simulations efficient and robust. The generated tetrahedral meshes are ready to use in physics simulations out of the box.

Figure 3. FlexiCubes tetrahedral mesh example: a 3D pretzel bouncing in a physics simulation

Explore FlexiCubes now

This research is being presented as part of NVIDIA advancements at SIGGRAPH 2023 in Los Angeles. For more information about the new approach, see Flexible Isosurface Extraction for Gradient-Based Mesh Optimization. Explore more results on the FlexiCubes project page.


Strength in Numbers: NVIDIA and Generative Red Team Challenge Unleash Thousands to Vet Security at DEF CON

Thousands of hackers will tweak, twist and probe the latest generative AI platforms this week in Las Vegas as part of an effort to build more trustworthy and inclusive AI.


Challenge Accepted: GeForce NOW Fires Up the Cloud With Ultimate Challenge and First Bethesda Games

Rise and shine, it’s time to quake up — the GeForce NOW Ultimate KovaaK’s challenge kicks off at the QuakeCon gaming festival today, giving gamers everywhere the chance to play to their ultimate potential with ultra-high 240 frames per second streaming. On top of bragging rights, top scorers can win some sweet prizes — including a 240Hz gaming monitor.


Visual Effects Multiplier: Wylie Co. Goes All in on GPU Rendering for 24x Returns

Visual effects studios have long relied on render farms — vast numbers of servers — for computationally intensive, complex special effects, but that landscape is rapidly changing.


NVIDIA Jetson Project of the Month: This Autonomous Soccer Robot Can Aim, Shoot, and Score

Soccer is considered one of the most popular sports around the world. And with good reason: the action is often intense, and the game combines physicality and skill from the players in a way that can be thrilling to watch. So it should come as no surprise that there are folks out there who are working to teach robots the finer points of the game, including how to gather the ball, line up a shot, pass, and score a goal.

In fact, an entire competition is devoted to this very idea. The RoboCup Small Size League (SSL) Vision Blackout Technical Challenge encourages teams to “explore local sensing and processing rather than the typical approach of an off-board computer and a global set of cameras sensing the environment.” Student João Guilherme, his instructor Edna Barros, and other SSL teammates from the Federal University of Pernambuco in Recife, Brazil built an omnidirectional robot powered by the NVIDIA Jetson Nano Developer Kit to execute soccer tasks autonomously. 

The team built their omnidirectional robot with a monocular camera that can autonomously perform the following tasks:

  • Localization
  • Soccer ball detection and grabbing
  • Coordinate calculation
  • Passing the ball to other team robots
  • Scoring on an empty goal

The team built the robot with an AI software pipeline running at an average processing speed of 30 FPS, with the hardware consuming only around 10.8 W of power.

The robot is a four-wheeled omnidirectional platform with a kicking device on its front. Figure 1 shows the geometry of the robot.

Figure 1. The movement capabilities of the omnidirectional robot, powered by the NVIDIA Jetson Nano Developer Kit to execute soccer tasks autonomously

“We evaluate our system on three soccer tasks: grabbing a ball, scoring a goal, and passing the ball, achieving 80%, 80%, and 46.7% success rates, respectively,” the team explains in Towards an Autonomous RoboCup Small Size League Robot.

During tournament play, teams will use off-field computers to execute most of the computation, receiving the position of the ball and gathering field geometry information and referee commands. The matches are played between teams of six (division B) and 11 (division A) robots, and the robots receive navigation commands through RF communication with minimal bandwidth. The diameter and height of the robots are limited to 180 millimeters (division B) and 150 millimeters (division A), hence the name Small Size League. 

The SSL RoboCup competitions include four stages:

  1. Grab a stationary ball somewhere on the field
  2. Score with the ball on an empty goal
  3. Move the robot to specific coordinates
  4. Score an indirect goal (two robots required) 

In addition, this challenge requires the robot to detect objects in the field, estimate their position, compute navigation paths, and keep records of past trajectories.

“SSL matches are highly dynamic environments with extremely resource-constrained robots, requiring solutions to consider size, power consumption, accuracy, and processing speed trade-offs. This work presents an architecture that enables these robots to execute basic soccer tasks autonomously, that is, without receiving any external information,” according to Guilherme and his teammates in Towards an Autonomous RoboCup Small Size League Robot.

Project hardware

The team used the following hardware in their project: 

  • A Jetson Nano Developer Kit, to perform embedded vision and decision making 
  • An omnidirectional robot
  • A Logitech C922 camera, to provide monocular vision 
  • Inertial sensors, to implement odometry estimation 
  • An STM32F767ZI microcontroller unit (MCU), to receive target relative positions and navigation flags from the Nano and execute low-level control and trajectory estimation using inertial odometry

Figure 2. The AI detection pipeline and movement planning of the soccer robot

For more information about the hardware used, see RobôCIn 2020 Team Description Paper.

Technical challenges 

During the competition’s Vision Blackout Challenge, the winning robot must be able to complete a variety of soccer-based skills, including grabbing a stationary ball, scoring on an empty goal, moving to specific coordinates, and scoring an indirect goal (passing to another robot). 

The robot must be able to perform these skills using only embedded sensing and processing. There are no height restrictions for this challenge, so the team added an onboard camera, the Jetson Nano, and a power supply board on top of their typical robot. 

Figure 3. The team’s soccer-playing robot modified for the Vision Blackout Challenge, with an onboard camera and power supply board (left), and their original robot (right)

The SSL soccer matches make use of external cameras and offboard computers for perceiving the environment and sending commands to the robots.

According to the researchers, the SSL Vision architecture “presents limitations such as the camera’s field-of-view, color segmentation, software latency, and communication dropouts, forcing teams to develop solutions for dealing with complex conditions. For example, one common problem during matches is ball occlusion, which occurs when a robot’s projection on the camera image overlaps the ball. Another issue is that the ball and robot position flicks, occasionally not detecting or falsely detecting them.”

In SSL contests, the robots and balls reach velocities of up to 3.7 m/s and 6.5 m/s, respectively, resulting in a fast-moving game that requires high-throughput solutions. Additionally, the size limitations, coupled with using a battery as a power source, require solutions with low power consumption. Precise kicks and passes over long distances are also performed during matches, requiring accurate position estimation.

The team also noted the importance of accurate motor control, so the robot can move across the soccer field and keep its measured position accurate. The team needed a way to reduce the rate at which the robot’s internal understanding of its position diverges from its actual physical position. For more details, see Towards an Autonomous RoboCup Small Size League Robot.

Figure 4. The soccer robot’s camera aids object detection along with field of vision for decision making and path planning

Project software and AI

The team used OpenCV (cv2) calibration and pose-computation techniques to extract the intrinsic and extrinsic parameters of the monocular camera fixed to the robot. They used SSD MobileNet v2 to detect objects’ 2D bounding boxes on camera frames. They then applied linear regression to the bounding box coordinates produced by SSD MobileNet, together with the precalibrated camera parameters, to assign each detection a point on the field corresponding to the object’s bottom center. This yields the object’s position relative to the camera, and therefore to the robot, too.
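
One common way to implement this kind of image-to-field mapping, shown here only as an illustrative sketch (the team's own pipeline uses its calibrated parameters and a learned regression rather than exactly this), is to project the bottom-center pixel of each detection through a ground-plane homography:

import numpy as np

def box_bottom_center(xmin, ymin, xmax, ymax):
    # Bottom-center pixel of an SSD MobileNet bounding box.
    return (xmin + xmax) / 2.0, ymax

def pixel_to_field(u, v, H):
    # H is a 3x3 homography from image pixels to the field (ground) plane,
    # derived from the camera's intrinsic and extrinsic calibration.
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]  # (x, y) position relative to the robot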

Results 

The team is pleased with how their robot played in this year’s challenge. Highlights include: 

  • Grabbing a stationary ball: In 12 out of 15 attempts, the robot was able to stop with the ball touching its dribbler, an 80% success rate. 
  • Scoring a goal: A goal was scored in 12 of the 15 runs.
  • Passing: The robot passed the ball in 7 of the 15 tries, resulting in a 46.7% success rate. 

Visit RoboCup 2023 Results to see the full list of results. The team has participated in the RoboCup Small Size League since 2019, winning their first world title in 2022 (Division B). They are currently a three-time Latin American champion. The RobôCIn Small Size League Extended Team Description Paper for RoboCup 2023 presents the improvements the team made for the SSL Division B title at RoboCup 2023 in Bordeaux, France in late July, where they took first place.

Figure 5. The robot grabbing a stationary ball (left) and scoring a goal (right)

Future plans

Guilherme shared some insights about challenges their team encountered in competition, and opportunities for improvement for future events. He noted that most of the failures were due to false-positive detections from objects outside the field. “We are working on a solution for detecting the field boundaries and applying a mask to discard those objects,” he said. 

The team needs faster object detection solutions. “Even though we are able to execute basic skills so far, 30 FPS is still a low processing speed for the SSL environment. At the main competition, cameras usually operate at 70 FPS,” he said. 

The robot’s skills were implemented using only relative positions from detected objects, that is, without knowledge of the robot’s self-localization on the field. “We believe this information might be useful for optimizing our performance in the soccer tasks, while also allowing us to avoid penalties,” Guilherme noted. For example, the robot should not enter the goalkeeper’s area. “We are working on a self-localization algorithm based on Monte Carlo Localization (MCL) and will share it in the coming months.”

The team plans to add more features to the robot’s system in the future (such as field line detection, localization algorithms, and path planning), and they will be working to optimize each part of the system for those needs. 

In addition, the team continues to work on solutions for detecting field boundaries and lines, and estimating the robot’s self-localization. They also plan to replace the Jetson Nano with a Jetson Orin Nano so they can achieve faster processing speeds with their robot. That upgrade should help the team compete more effectively in league play. 

To learn more about the team’s original project, visit the Developer Forum and GitHub. Explore Jetson Community Projects for more ideas and inspiration from your fellow robotics developers.


Pro Tips for Building Multilingual Recommender Systems

Picture this: You’re browsing through an online store, looking for the perfect pair of running shoes. But with thousands of options available, where do you even begin? Suddenly, a section catches your eye: “Recommended for You.” Intrigued, you click and, within seconds, a curated list of running shoes tailored to your unique preferences appears. It’s as if the website understands your tastes, needs, and style.

Welcome to the world of recommendation systems, where cutting-edge technology combines data analysis, artificial intelligence (AI), and a touch of magic to transform our digital experiences.

This post dives deep into the fascinating realm of recommendation systems and explores the modeling approach for building a two-stage candidate reranker. I provide pro tips on how to overcome data scarcity in underrepresented languages, along with a technical walkthrough of how to implement these best practices.

Overview of building a two-stage candidate reranker

For each user, a recommender system must predict a few items that this user will be interested in from possibly millions of items. This is a daunting task. A powerful modeling approach for this problem is the two-stage candidate reranker.

Figure 1 shows the two stages. In the first stage, the model identifies hundreds of candidate items that the user may be interested in. In the second stage, the model ranks this list from most likely to least likely. Finally, the model suggests the most likely items to the user.

Figure 1. Flow of a two-stage candidate reranker recommendation system, from candidate generation to ranking, showing the relative pool size at each stage

Stage 1: Candidate generation

There are many ways to generate candidates, including statistical methods and deep learning methods. One statistical technique to generate candidates is building a co-visitation matrix. You iterate through all user historical sessions and maintain a cumulative tally of how often each pair of items coexists within user sessions. As a result, you know the top 100 items that are frequently paired with each item.

Now, given a specific user, you can generate candidate items by iterating through their user history and combining all top 100 lists associated with each item in their history. Many items appear multiple times. The candidates are the most common items in this concatenated list of hundreds of items.
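
A minimal sketch of both steps in plain Python (assuming sessions are simply lists of item IDs; production versions are typically vectorized):

from collections import Counter, defaultdict
from itertools import combinations

def build_covisitation(sessions, top_k=100):
    # Count how often each pair of items coexists within a user session.
    pair_counts = defaultdict(Counter)
    for items in sessions:
        for a, b in combinations(set(items), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    # Keep the top_k most frequently co-visited partners per item.
    return {item: [i for i, _ in c.most_common(top_k)]
            for item, c in pair_counts.items()}

def generate_candidates(user_history, covis, top_n=100):
    # Merge the partner lists of every item in the user's history and
    # return the most common items as candidates.
    merged = Counter()
    for item in user_history:
        merged.update(covis.get(item, []))
    return [i for i, _ in merged.most_common(top_n)]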

Stage 2: Ranking

Using candidates from stage 1, build a tabular dataframe (Figure 2), which you use to train a reranker. Imagine that stage 1 produces 100 candidates per user. Then your tabular dataframe has 100 rows for each training user. One column is the user and another column is the candidate item. Add a third column for the target. Each row with a candidate item that is a correct match for that row’s user has a 1 in the target column and 0 otherwise.

Figure 2. Reranker dataframe table, with rows and columns describing user sessions, candidate items, and features
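
A minimal sketch of building that dataframe (assuming candidate lists and ground-truth next items keyed by user; the column names are illustrative, not from the post's code):

import pandas as pd

def build_reranker_frame(candidates_per_user, next_item_per_user):
    # One row per (user, candidate) pair; target is 1 when the candidate is
    # the item the user actually interacted with next, otherwise 0.
    rows = [{"user": user, "item": item,
             "target": int(item == next_item_per_user[user])}
            for user, candidates in candidates_per_user.items()
            for item in candidates]
    return pd.DataFrame(rows)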

Next, add columns that describe the user sessions and items called feature columns. These feature columns are what the reranker uses to learn patterns and predict the target column. You train your reranker with either a binary classification objective or a pairwise or listwise ranking objective. Afterward, you use this trained model to predict items for unseen test user sessions.
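
For the binary classification objective, a gradient-boosted tree model is a common choice. A hedged sketch, assuming a dataframe like the one above with hypothetical feature columns already joined on (the feature names here are illustrative):

from xgboost import XGBClassifier

FEATURES = ["item_popularity", "covisit_score", "user_session_length"]  # hypothetical

def train_and_rank(train_df, test_df, top_n=10):
    # Binary classification objective: predict the target column from features.
    model = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1)
    model.fit(train_df[FEATURES], train_df["target"])
    # Score each user's candidates and keep the highest-scoring items.
    test_df = test_df.assign(score=model.predict_proba(test_df[FEATURES])[:, 1])
    return (test_df.sort_values(["user", "score"], ascending=[True, False])
                   .groupby("user").head(top_n))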

Data scarcity in underrepresented languages

The two-stage candidate reranker approach (and any other approach) requires a large amount of training data to train the machine learning or deep learning model properly. Popular languages typically have lots of existing data, but this is not true for historically underrepresented languages.

Advocating for underserved languages is crucial for several reasons, such as promoting inclusivity, increasing global reach, and improving online user engagement and satisfaction.

To build recommender systems for underrepresented languages, I recommend using transfer learning. By leveraging datasets for common languages, models can recognize existing patterns and apply these learnings to support languages that are not widely spoken. This helps you overcome small dataset challenges and create a more inclusive digital world.

Pro tips for developing multilingual recommendation systems

To overcome data scarcity, use transfer learning to apply information from one language to another for stages 1 and 2. Many items have equivalents in multiple languages. Therefore, user-item interaction behavior in one language can be translated to another language.

Here are the top tips for speeding up the development process for multilingual recommendation engines.

Tips for candidate generation

  • First, create co-visitation matrices for underrepresented languages by using user histories that exist in both popular languages and underrepresented languages.
  • Be sure to represent items with pretrained multilingual large language model (LLM) embeddings. Then, use cosine similarity to find candidate items in underrepresented languages.
  • Initialize neural network (NN) embeddings with pretrained multilingual LLM embeddings. Then, fine-tune and use cosine similarity between user and item embeddings to find candidate items in the underrepresented languages.

Tips for ranking

  • You can use item features from popular languages as item features for underrepresented languages in the tabular dataframe for the reranker.
  • Create user-item interaction features by transferring user-item patterns learned from popular languages to underrepresented languages.
  • Finally, train an underrepresented language’s reranker using user-item dataframe rows from popular languages.

Tutorial: Multilingual recommender system

To help you test these methods out, I walk you through an optimized process for building a multilingual recommender system.

Candidate generation implementation

The goal of candidate generation is to generate hundreds of item suggestions per user. Two popular techniques are using co-visitation matrices and using representation learning. Using transfer learning with co-visitation matrices is straightforward.

Earlier in this post, I discussed how co-visitation candidate generation is based on counting the coexisting pairs of product IDs within user histories. As many product IDs exist in multiple languages, you can use pairs from a German user’s history as counts in a Spanish co-visitation matrix. In Figure 3, the top German row is from the training data. You then “translate” it to Spanish, shown in the bottom row.

Diagram showing co-visitation matrices preprocessing for generating item suggestions using a transfer learning technique.
Figure 3. Transfer learning process

The procedure is as follows.

  • Given a pair of Spanish product IDs, you can iterate through users from the other five languages: English, German, Japanese, Italian, and French.
  • Whenever you observe the pair of Spanish product IDs in one of these users' histories, add 1 to the count for this Spanish item pair. Or you can use a different weight, such as adding 0.5 to the count.
  • After you accumulate counts for all Spanish item pairs, apply the new co-visitation matrix to each Spanish user's history to generate candidates for that user, just as before.

The fastest and most efficient way to create co-visitation matrices is to use RAPIDS cuDF. To follow along, see the Candidate ReRank Model using Handcrafted Rules Jupyter notebook with example code.

By merging a dataframe that contains all user histories (that is, a dataframe with columns user and history item) to itself on the key user, you create all historical pairs. Then group by item pairs and aggregate the counts.

import cudf

# ALL_USER_HISTORIES holds one row per interaction, with columns user and item
df = cudf.DataFrame(ALL_USER_HISTORIES)
# Self-merge on user to form every historical item pair within a user's sessions
df = df.merge(df, on='user')
df = df[df.item_x != df.item_y]  # drop self-pairs
df['wgt'] = 1
df = df.groupby(['item_x','item_y']).wgt.sum().reset_index()
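
Building on that snippet, here is one possible sketch of the cross-language counting described in Figure 3. It assumes a histories dataframe with columns user, item, and lang, and a spanish_items series listing the product IDs available in Spanish; the 0.5 weight is just one choice.

# Count item pairs observed in other languages' histories, keeping only pairs
# where both product IDs also exist in the Spanish catalog
other = histories[histories.lang != 'ES']
pairs = other.merge(other, on='user')
pairs = pairs[pairs.item_x != pairs.item_y]
pairs = pairs[pairs.item_x.isin(spanish_items) & pairs.item_y.isin(spanish_items)]
pairs['wgt'] = 0.5  # down-weight cross-language evidence
covisit_es = pairs.groupby(['item_x','item_y']).wgt.sum().reset_index()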

Representation learning, LLMs, and deep learning embeddings are hot and current topics. Besides co-visitation matrices, an alternative way to generate candidate items for each user is to create meaningful distance embeddings. If you have meaningful distance embeddings for each item, then you can use a model that predicts an embedding for each user. Next, find the 100 closest embeddings (through cosine similarity) to this predicted embedding and use these as your candidates (Figure 4).

Visual representation of a varied approach to generating item recommendations using distance embeddings.
Figure 4. Compute distance between embeddings
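
As a rough sketch of the lookup shown in Figure 4, assume you already have an item_embeddings matrix, a matching item_ids list, and a predicted user_embedding (all hypothetical names).

import numpy as np

def nearest_candidates(user_embedding, item_embeddings, item_ids, k=100):
    # Normalize so that a dot product equals cosine similarity
    items = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    user = user_embedding / np.linalg.norm(user_embedding)
    scores = items @ user
    # Indices of the k most similar items become the candidates
    top_k = np.argsort(-scores)[:k]
    return [item_ids[i] for i in top_k]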

The process of training meaningful distance embeddings for items is called representation learning. Embeddings are N-dimensional vectors in N-dimensional space. During training, embeddings of similar items are modified to be closer together (through some distance metric) while embeddings of dissimilar items are modified to have at least a predefined gap distance (margin) between them.

One way to use transfer learning during representation learning is to pre-initialize the embeddings with multilingual sentence embeddings. Each item has a title, whether it's in English, German, Japanese, Spanish, Italian, or French. You can pre-initialize each item with its title embedding from Hugging Face's stsb-xlm-r-multilingual model, for example. This model has been trained on many different languages and transfers learning from all of them. Afterward, you can fine-tune the embeddings using your training data with the model shown in Figure 5.

Diagram showing the convolution and embedding layers workflow in representation learning.
Figure 5. Representation learning
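
Here is a minimal sketch of that pre-initialization step, assuming the sentence-transformers and PyTorch libraries and a hypothetical item_titles list with one title string per item.

import torch
from sentence_transformers import SentenceTransformer

# Encode every item title with the multilingual sentence model
encoder = SentenceTransformer('sentence-transformers/stsb-xlm-r-multilingual')
title_embeddings = encoder.encode(item_titles, convert_to_numpy=True)

# Start a trainable item embedding table from the multilingual title vectors
item_embedding = torch.nn.Embedding.from_pretrained(
    torch.tensor(title_embeddings), freeze=False)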

Fine-tune your model using all user histories in the training data. Every three consecutive history items are paired with one positive item target, which is the next consecutive item. Each triplet is also paired with 4096 negative item targets, which are randomly chosen items. Backpropagation maximizes cosine similarity between the predicted embedding and the positive target and minimizes it between the predicted embedding and each negative target. Afterward, you have meaningful distance embeddings for each item and a predicted embedding for each user.
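
The snippet below is a deliberately simplified, single-example sketch of that objective rather than the exact model in Figure 5; the projection layer is just an assumed trainable head, and in practice a temperature scale on the scores is often added.

import torch
import torch.nn.functional as F

# item_embedding is the pre-initialized table from the previous snippet.
# triplet holds three consecutive item IDs, positive the next item ID, and
# negatives 4096 randomly sampled item IDs (all LongTensors).
emb_dim = item_embedding.embedding_dim
projection = torch.nn.Linear(emb_dim, emb_dim)  # assumed trainable head

def training_loss(triplet, positive, negatives):
    user_vec = projection(item_embedding(triplet).mean(dim=0))
    candidates = item_embedding(torch.cat([positive, negatives]))    # [4097, dim]
    scores = F.cosine_similarity(user_vec.unsqueeze(0), candidates)  # [4097]
    # Cross-entropy with the positive at index 0 raises its cosine similarity
    # and lowers the similarities of the 4096 negatives
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))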

A quick and easy way to create transformer-based, session-aware recommender systems that can use pretrained embeddings is to use the NVIDIA Merlin framework. For more information, see the Session-Based Next Item Prediction for Fashion E-Commerce and Training With Pretrained Embeddings Jupyter notebooks.

You can also feed your models with the NVIDIA Merlin Dataloader.

Ranking implementation

The goal of stage 2 is to train a reranker that predicts the likelihood of each candidate item being correct among all possible candidate items for each user. To train a model successfully, you need feature columns in addition to a user, item, and target column. There are three types of feature columns:

  • Item features
  • User features
  • User-item interaction features

Diagram shows different types of features: item, user, and user-item interaction.
Figure 6. Reranker dataframe with features

Item features describe items. For example, you can add an item price feature. Then, every row in your reranker dataframe with item A has a corresponding price A in the item price column (Figure 6).

Using transfer learning on item features is easy. To transfer learning from German to Spanish, you can create item features from the German user history data and then merge them onto the Spanish items.

For example, for each item product ID, count how often it appears in all German user histories. Then every row in your reranker dataframe with Spanish item A has a corresponding German popularity A in the German item popularity column. This works because many item product IDs exist in both German and Spanish. If a certain Spanish product ID does not exist in German, then you insert NaN in the German item popularity column.
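
For instance, a minimal sketch of this German-popularity transfer could look like the following, assuming a german_histories cudf dataframe with user and item columns.

# Count how often each product ID appears in German user histories
german_popularity = german_histories.groupby('item').agg({'user': 'count'})
german_popularity.columns = ['german_item_popularity']

# Attach it to the Spanish reranker rows; Spanish product IDs that never
# appear in the German data end up with nulls (NaN) in the new column
df = df.merge(german_popularity, left_on='item',
    right_index=True, how='left')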

User feature columns and item feature columns are generally created with dataframe groupby commands. Create a property for each user or item and then merge it into your dataframe. The quickest and most efficient method is to use RAPIDS cuDF.

import cudf

# data holds raw interactions with columns item, user, and price
item_features = data.groupby('item').agg(
    {'user': ['count', 'nunique'], 'price': 'first'})
item_features.columns = ['item_count', 'item_users', 'item_price']
df = df.merge(item_features, left_on='item',
    right_index=True, how='left')
user_features = data.groupby('user').agg(
    {'item': ['count', 'nunique']})
user_features.columns = ['user_count', 'user_items']
df = df.merge(user_features, left_on='user',
    right_index=True, how='left')

User-item interaction features describe the relationship between a row’s candidate item and that row’s user. These features have a different value for each row. A common way to generate user-item interaction features is to describe the relationship between a user’s last history item and their candidate item.

One way to use transfer learning from popular languages to underrepresented languages is to create meaningful distance embeddings for all items using multilingual information. Then a user-item interaction feature can be the cosine similarity score between a user’s last history item and candidate item based on the embeddings.
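
One hedged sketch of such a feature follows, assuming an embeddings lookup (item ID to vector) built as described earlier and a reranker_df pandas dataframe with last_history_item and item columns (call .to_pandas() first if it lives on the GPU).

import numpy as np

# Cosine similarity between each row's last history item and its candidate item
last_vecs = np.stack([embeddings[i] for i in reranker_df['last_history_item']])
cand_vecs = np.stack([embeddings[i] for i in reranker_df['item']])
norms = np.linalg.norm(last_vecs, axis=1) * np.linalg.norm(cand_vecs, axis=1)
reranker_df['last_item_similarity'] = (last_vecs * cand_vecs).sum(axis=1) / norms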

Figure 7 shows extracting item embeddings from a multilingual LLM. You concatenate all the text for each item and input it into your LLM. Extract the last hidden layer activations as your embedding.

Architecture of LLM embeddings with an output layer, hidden layers, and input layer used to extract embeddings.
Figure 7. Large language model embeddings
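
A minimal sketch of that extraction follows, using xlm-roberta-base as a stand-in multilingual checkpoint and a simple mean pool over the last hidden layer; item_texts is the assumed list of concatenated text per item.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModel.from_pretrained('xlm-roberta-base')

with torch.no_grad():
    batch = tokenizer(item_texts, padding=True, truncation=True, return_tensors='pt')
    hidden = model(**batch).last_hidden_state  # [items, tokens, dim]
    # Mean pooling (ignoring padding) is enough for a sketch
    item_embeddings = hidden.mean(dim=1)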

A third way to use information from popular languages to improve underrepresented-language recommendations is to train your underrepresented language's gradient-boosted tree (GBT) reranker using dataframe rows from popular languages. First, use the same feature columns for all language dataframes, then concatenate all of them into one new dataframe. Afterward, your dataframe is large.
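
Concatenating the per-language dataframes can be as simple as the sketch below, with hypothetical df_es, df_de, and so on that share identical feature columns.

import cudf

# Stack the per-language reranker dataframes into one training dataframe
combined_df = cudf.concat([df_es, df_de, df_en, df_ja, df_it, df_fr],
    ignore_index=True)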

The best way to train a GBT model with millions of rows is to use XGBoost with RAPIDS Dask cuDF, which distributes training across multiple GPUs. For more information, see the KDD Cup solution code.

The key lines of the code are as follows:

import xgboost as xgb
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per available GPU
cluster = LocalCUDACluster()
client = Client(cluster)

df = dask_cudf.read_parquet(FILES).persist()
# QuantileDMatrix keeps GPU memory usage low with the 'hist' tree method
dtrain = xgb.dask.DaskQuantileDMatrix(client, df[FEATURES], df[TARGET])
xgb.dask.train(client, xgb_parms, dtrain)

Conclusion

When you browse online, recommendation systems may seem magical, but as you learned throughout this post, the inner workings of a multilingual recommendation engine are deterministic and understandable.

In this post, I shared techniques that the Kaggle Grandmasters of NVIDIA and NVIDIA Merlin teams used to win the recent KDD Cup 2023 Multilingual Recommender System competition hosted by Amazon.

I also introduced the two-stage candidate reranker technique for recommendation systems. This is a powerful technique that helps solve many recommender system needs. Next, I gave you pro tips to help train recommendation systems for underrepresented languages. I shared how RAPIDS and NVIDIA Merlin frameworks can help you build recommender systems.

I hope that you can use some of these ideas in your next recommender system project. By improving online recommender systems for underrepresented languages, we can all make the Internet more inclusive, extend global reach, and improve user engagement and satisfaction.