This post is part of a series on accelerated data analytics.
Visualization brings data to life, unveiling hidden patterns and insights through accessible visuals, and empowering you and your organization to perceive the invisible, make informed decisions, and fully leverage your data.
Especially when working with large datasets, interaction can be difficult as render and compute times become prohibitive. Switching to RAPIDS libraries, such as cuDF, enables GPU acceleration that unlocks access to your data insights through a familiar pandas-like API. This post explains:
- Why speed matters for visualization, especially for large datasets
- How to use pandas-like features in RAPIDS for visualization
- How to use hvPlot, datashader, cuxfilter, and Plotly Dash
Why speed matters for visualization
While data visuals are an effective tool for explaining data insights at the end of a project, they should ideally be used throughout the data exploration and enriching process. Visualization excels at enhancing data understanding by finding outliers, anomalies, and patterns not easily surfaced by purely analytical methods. This has been demonstrated by Anscombe’s quartet and the infamous Datasaurus Dozen.
An effective chart applies data visualization design principles that take advantage of pre-attentive visual processing. This style of visualization is essentially a hack for the brain to understand large amounts of information quickly. However, interactions such as filtering, selecting, or rerendering that take longer than 7-10 seconds disrupt a user’s short-term memory and train of thought. This disruption creates friction in the analysis process. To learn more, see Powers of 10: Time Scales in User Experience.
Combining sub-second speed with easy integration, the RAPIDS suite of open-source software libraries is ideal for supplementing exploratory data analysis (EDA) work with visualization, driving fluid, consistent insights that lead to better outcomes during analysis projects.
Large data analysis workflows require more compute power
Pandas has made data work simpler, helping to build a strong Python visualization ecosystem. For example, tools like Bokeh, Plotly, and Matplotlib have enabled more people to regularly use visuals for data analysis.
But when an EDA workflow processes data larger than 2 GB and requires compute-intensive tasks, CPU-based solutions can start to constrain the iterative exploration process.
Accelerated data visualization with RAPIDS
Replacing CPU-based libraries with the pandas-like RAPIDS GPU-accelerated libraries (such as cuDF) means you can keep a swift pace for your EDA process as data sizes increase between 2 and 10 GB. Visualization compute and render times are brought down to interactive speeds, unblocking the discovery process. Moreover, as the RAPIDS libraries work seamlessly together, you can chart many types of data (time series, geospatial, graphs) with simple, familiar Python code to incorporate throughout your workflows.
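As a rough illustration of how small that switch is in practice (the file name and column names below are placeholders, not taken from the notebook), a typical pandas call maps directly to its cuDF equivalent:

# CPU workflow with pandas
import pandas as pd
df = pd.read_csv("trips.csv")
df["duration_min"] = df["duration_sec"] / 60

# GPU workflow with cuDF: the same pandas-like calls, accelerated on the GPU
import cudf
gdf = cudf.read_csv("trips.csv")
gdf["duration_min"] = gdf["duration_sec"] / 60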
RAPIDS Visualization Guide
The RAPIDS Visualization Guide on GitHub demonstrates the features and benefits of visualization libraries working together. Based on the publicly available Divvy bike share historical trip data, the notebook showcases how a visualization-focused approach can improve EDA using GPU-enabled libraries such as hvPlot, Datashader, cuxfilter, and Plotly Dash.
Use hvPlot for easy data interactivity
hvPlot offers a pandas-like plot API, but with built-in interactivity, as shown in Figure 1.
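Before calling .hvplot on a cuDF DataFrame, the plotting accessor has to be registered. A minimal setup sketch, assuming hvPlot’s cuDF integration is installed alongside RAPIDS:

import hvplot.cudf  # registers the .hvplot accessor on cuDF objects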
df.hvplot.hist(y='duration_min', bins=20, title="Trips Duration Histogram")
In this instance, the vast majority of bike trips appear to be under 20 minutes. Because you can zoom in, you can also inspect the long tail of durations without creating another query. Augmenting the data with RAPIDS cuSpatial to quickly calculate distances also shows that most trips are relatively short.
Some hvPlot extras
Charts in hvPlot can be displayed interactively using the Bokeh and Plotly extensions, or statically with the Matplotlib extension. Multiple charts can be overlaid on shared axes with the * operator, or placed side by side in a basic layout with the + operator. More complicated dashboard layouts can be created with HoloViz Panel.
You can also automatically add simple widgets. For example, when using the built-in group by operation:
df.hvplot.heatmap(x='day_of_week', y='hour', C='count', groupby='month', widget_location='left_top')
Adding a widget for interactivity enables scrubbing through the months to search for patterns over a full year (Figure 2). In visualization, “a slider is worth a thousand queries,” or in this case, 12.
Easy geospatial plotting
Geospatial charts with multiple options for the underlying tile maps can be shown by simply specifying geo=True:
df.hvplot.hexbin(x='start_lng', y='start_lat', geo=True, tiles="OSM")
Figure 3 shows the hexbin chart, which aggregates trip start and end locations into a manageable number of bins, verifying that the data matches the bike share system map. Setting two charts side by side with the plus operator illustrates the radiating nature of the bike network.
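A sketch of that side-by-side composition, assuming end_lng and end_lat columns hold the trip end coordinates (the actual column names in the notebook may differ):

start_hex = df.hvplot.hexbin(x='start_lng', y='start_lat', geo=True, tiles="OSM")
end_hex = df.hvplot.hexbin(x='end_lng', y='end_lat', geo=True, tiles="OSM")
start_hex + end_hex  # lay the two charts out next to each other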
Use Datashader for large data and high precision charts
The Datashader library directly supports cuDF and can rapidly render millions of aggregated points. You can use it by itself to render a variety of precise and high-density chart types. It is also easy to use in conjunction with other libraries, like hvPlot, by specifying datashade=True:
df.hvplot.points(x='start_lng', y='start_lat', geo=True, tiles="CartoDark", datashade=True, dynspread=True)
Rendering individual data points at high resolution to reveal fine-grained patterns is precisely what Datashader is designed for. In Figure 4, it clearly shows that while bikes tend to cluster, there is no guarantee that a bike will start or end a trip at a designated station.
Use cuxfilter for accelerated cross-filtered dashboards
Instead of creating several individual group by and query operations, a cuxfilter dashboard can simply cross-link numerous charts to quickly find patterns or anomalies (Figure 5).
A few lines of code is all it takes to get a dashboard up and running:
import cuxfilter
from bokeh.models import NumeralTickFormatter

cux_df = cuxfilter.DataFrame.from_dataframe(df)
# Specify charts
charts = [
    cuxfilter.charts.bar('dist_m', data_points=20, title='Distance in M'),
    cuxfilter.charts.bar('dur_min', data_points=20, title='Duration in Min'),
    cuxfilter.charts.bar('day_of_week', title='Day of Week'),
    cuxfilter.charts.bar('hour', title='Trips per Hour'),
    cuxfilter.charts.bar('day', title='Trips per Day'),
    cuxfilter.charts.bar('month', title='Trips per Month')
]
# Specify side panel widgets
widgets = [
    cuxfilter.charts.multi_select('year')
]
# Generate the dashboard and select a layout
d = cux_df.dashboard(charts, sidebar=widgets, layout=cuxfilter.layouts.two_by_three, theme=cuxfilter.themes.rapids, title='Bike Trips Dashboard')
# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")
# Show generates a full dashboard in another browser tab
d.show()
Using cuxfilter for quick, cross-filter-based exploration is another technique that can save time. This approach replaces dataframe queries with a GUI tool. As shown in Figure 5, a clear pattern emerges between weekday and weekend trips, as well as between daytime and evening.
Build powerful analytics applications with Plotly Dash
After data is properly formatted and augmented by an EDA process, making it more widely accessible and digestible for your organization can be a challenge. Plotly Dash enables data scientists to recast complex data and machine learning workflows as more accessible web applications.
For that reason, the findings from this notebook are encapsulated into a simple-to-use, accessible, and deployable app with Plotly Dash. The app uses the powerful analysis capabilities available with RAPIDS, but is controlled through an uncomplicated GUI.
This instance uses cuML K-means to cluster the bike start and stop points into nodes, and cuGraph PageRank to show each node’s relative importance. The latter is computed in real time for each of the weekday-weekend and day-night patterns discovered earlier. Starting from raw usage patterns, the app now provides interactive insights into specific user types and their preferred areas of town.
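A simplified sketch of that pipeline (not the app’s actual code; the column names, cluster count, and edge construction are assumptions for illustration):

import cugraph
from cuml.cluster import KMeans

# Cluster trip start and end points into nodes
km = KMeans(n_clusters=50)
df['start_node'] = km.fit_predict(df[['start_lng', 'start_lat']].astype('float32'))
df['end_node'] = km.predict(df[['end_lng', 'end_lat']].astype('float32'))

# Build an edge list between start and end nodes and rank nodes with PageRank
edges = df[['start_node', 'end_node']].rename(columns={'start_node': 'src', 'end_node': 'dst'})
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source='src', destination='dst')
ranks = cugraph.pagerank(G)  # relative importance of each node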
Subsecond interaction of 300M+ Census data points with Plotly Dash
For a more comprehensive Plotly Dash example, we updated the popular Census visualization with 2020 and migration data. Figure 7 shows the interactive performance benefits of using cuDF over pandas for millions of data points. For a sample demo, see the Census 2020 Visualization using Plotly-Dash + RAPIDS on Google Colab. For interaction with the full 300+ million point dataset, watch Visualizing Census Data with RAPIDS cuDF and Plotly Dash.
The 2020 and 2010 Census data were sourced with permission from IPUMS NHGIS, University of Minnesota. To more accurately represent the entire US population visually, the block-level data were expanded into per-individual points randomly placed within their block region and calculated to match the block-level distribution. Several views are tabulated for that data, including total population and net migration values. For more details about formatting the data, see the Plotly-Dash + RAPIDS Census 2020 Visualization GitHub page.
Using a powerful visualization, you can forget about the tool and become immersed in exploring the data. Some intriguing patterns emerge in this case:
- The Census block boundaries were changed, resulting in large roadways with their own separate blocks. This might be a result of a new push to better reflect unhoused populations.
- The eastern states show much less overall migration than the midwest and western states, except for a few hot spots.
- New developments, especially large ones, are particularly easy to spot and can serve as a quick visual comparison between regional policies affecting growth, land use, and population densities.
Data visualization at the speed of thought
By replacing pandas with RAPIDS libraries such as cuDF, and taking advantage of how easily accelerated visualization frameworks integrate with them, data analytics workflows can become faster, more insightful, more productive, and (just maybe) more enjoyable.
To learn more about speeding up your data science workflows, explore the other posts in this series.
This post is part of a series on accelerated data analytics.
If you are looking to take your machine learning (ML) projects to new levels of speed and scalability, GPU-accelerated data analytics can help you deliver insights quickly with breakthrough performance. From faster computation to efficient model training, GPUs bring many benefits to everyday ML tasks.
This post provides technical best practices for:
- Accelerating basic ML techniques, such as classification, clustering, and regression
- Preprocessing time series data and training ML models efficiently with RAPIDS, a suite of open-source libraries for executing data science and analytics pipelines entirely on GPUs
- Understanding algorithm performance and which evaluation metrics to use for each ML task
Accelerating data science pipelines with GPUs
GPU-accelerated data analytics is made possible with RAPIDS cuDF, a GPU DataFrame library, and RAPIDS cuML, a GPU-accelerated ML library.
cuDF is a Python GPU DataFrame library built on the Apache Arrow columnar memory format for loading, joining, aggregating, filtering, and manipulating data. It has an API similar to pandas, an open-source software library built on top of Python specifically for data manipulation and analysis. This makes it a useful tool for data analytics workflows, including data preprocessing and exploratory tasks to prepare dataframes for ML. For more information on how you can accelerate your data analytics pipeline with cuDF, refer to the series on accelerated data analytics.
Once your data is preprocessed, cuDF seamlessly integrates with cuML, which leverages GPU acceleration to provide a large set of ML algorithms that can help execute complex ML tasks at scale, much faster than CPU-based frameworks like scikit-learn.
cuML provides a straightforward API closely mirroring the scikit-learn API, making it easy to integrate into existing ML projects. With cuDF and cuML, data scientists and data analysts working on ML projects get the easy interactivity of the most popular open-source data science tools with the power of GPU acceleration across the data pipeline, minimizing the time it takes to adopt GPUs and push ML workflows forward.
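As a small illustration of that mirroring, the import is often the only line that changes (constructor arguments are not always identical, so check the cuML documentation for each estimator):

from sklearn.ensemble import RandomForestClassifier            # CPU
from cuml.ensemble import RandomForestClassifier as cuRFC      # GPU: same fit/predict interface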
Note: This resource serves as an introduction to ML with cuML and cuDF, demonstrating common algorithms for learning purposes. It’s not intended as a definitive guide for feature engineering or model building. Each ML scenario is unique and might require custom techniques. Always consider your problem specifics when building ML models.
Understanding the Meteonet dataset
Before diving into the analysis, it is important to understand the structure and content of the Meteonet dataset, which is well-suited for time series analysis. This dataset is a comprehensive collection of weather data that is immensely beneficial for researchers and data scientists in meteorology.
An overview of the Meteonet dataset and the meaning of each column is provided below:
- number_sta: A unique identifier for each weather station.
- lat and lon: Latitude and longitude of the weather station, representing its geographical location.
- height_sta: Height of the weather station above sea level in meters.
- date: Date and time of data recording, essential for time series analysis.
- dd: Wind direction in degrees, indicating the direction from which the wind is coming.
- ff: Wind speed, measured in meters per second.
- precip: Amount of precipitation measured in millimeters.
- hu: Humidity, represented as a percentage indicating the concentration of water vapor in the air.
- td: Dew point temperature in degrees Celsius, indicating when the air becomes saturated with moisture.
- t: Air temperature in degrees Celsius.
- psl: Atmospheric pressure at sea level in hPa (hectopascals).
Machine learning with RAPIDS
This tutorial covers the acceleration of three fundamental ML algorithms with cuDF and cuML: regression, classification, and clustering.
Installation
Before analyzing the Meteonet dataset, install and set up RAPIDS cuDF and cuML. Refer to the RAPIDS Installation Guide for instructions based on your system requirements.
Classification
Classification is a type of ML algorithm used to predict a categorical value based on a set of features. In this case, the goal is to predict weather conditions (such as sunny, cloudy, or rainy) and wind direction using temperature, humidity, and other factors.
Random forest is a powerful and versatile ML method capable of performing both regression and classification tasks. This section uses the cuML Random Forest Classifier to classify the weather conditions and wind direction at a certain time and location. The accuracy of the model can be used to evaluate its performance.
For this tutorial, 3 years of northwest station data has been consolidated into a single dataframe named NW_data.csv. To see the complete steps for combining the data, visit the Introduction to Machine Learning Using cuML notebook on GitHub.
import cudf, cuml
from cuml import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from cuml.metrics import accuracy_score
# Load data
df = cudf.read_csv('./NW_data.csv').dropna()
To prepare the data for classification, perform preprocessing tasks such as converting the date column to datetime format and extracting the hour.
# Convert date column to datetime and extract hour
df['date'] = cudf.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour
# Drop the original 'date' column
df = df.drop(['date'], axis=1)
Create two new categorical columns: wind_direction and weather_condition.
For wind_direction, discretize the dd column (wind direction in degrees) into four categories: north (0-90 degrees), east (90-180 degrees), south (180-270 degrees), and west (270-360 degrees).
# Discretize wind direction
df['wind_direction'] = cudf.cut(df['dd'], bins=[-0.1, 90, 180, 270, 360], labels=['N', 'E', 'S', 'W'])
For weather_condition, discretize the precip column (the amount of precipitation) into three categories: sunny (no rain), cloudy (little rain), and rainy (more rain).
# Discretize weather condition based on precipitation amount
df['weather_condition'] = cudf.cut(df['precip'], bins=[-0.1, 0.1, 1, float('inf')], labels=['sunny', 'cloudy', 'rainy'])
Then convert these categorical columns into numerical labels that the RandomForestClassifier can work with, using .cat.codes.
# Convert 'wind_direction' and 'weather_condition' columns to category
df['wind_direction'] = df['wind_direction'].astype('category').cat.codes
df['weather_condition'] = df['weather_condition'].astype('category').cat.codes
Model training
Now that preprocessing is done, the next step is to define a function to predict wind direction and weather conditions:
def train_and_evaluate(target):
    # Split into features and target
    X = df.drop(target, axis=1)
    y = df[target]

    # Split the dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Define the model
    model = cuRF()

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy for predicting {target} is {accuracy}")

    return model
Now that the function is ready, the next step is to train the models with the following calls, specifying the target variable:
# Train and evaluate models
weather_condition_model = train_and_evaluate('weather_condition')
wind_direction_model = train_and_evaluate('wind_direction')
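Once trained, the returned models can be used for inference in the usual scikit-learn style. A minimal sketch, reusing df and the weather_condition model from above:

# Predict the weather condition for the first few rows (features only)
sample = df.drop('weather_condition', axis=1).head(5)
predicted_codes = weather_condition_model.predict(sample)
print(predicted_codes)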
This tutorial uses the cuML Random Forest Classifier to classify weather conditions and wind direction in the northwest dataset. Preprocessing steps include converting the date column, discretizing wind direction and weather conditions, and converting categorical columns to numerical labels. The models were trained and evaluated using accuracy as the evaluation metric.
Regression
Regression is an ML algorithm used to predict a continuous value based on a set of features. For example, you could use regression to predict the price of a house based on its features, such as the number of bedrooms, the square footage, and the location.
Linear regression is a popular algorithm for predicting a quantitative response. For this tutorial, use the cuML implementation of linear regression to predict temperature, humidity, and precipitation at different times and locations. The R^2 score can be used to evaluate the performance of your regression models.
Start by importing the required libraries for this section:
import cudf
from cuml import make_regression, train_test_split
from cuml.linear_model import LinearRegression as cuLinearRegression
from cuml.metrics.regression import r2_score
from cuml.preprocessing.LabelEncoder import LabelEncoder
Next, load the NW dataset by reading the NW_data.csv file into a dataframe and dropping any rows with missing values:
# Load data
df = cudf.read_csv('/NW_data.csv').dropna()
For detailed steps on downloading NW_data.csv, see the Introduction to Machine Learning Using cuML notebook on GitHub.
For many ML algorithms, categorical input data must be converted to numeric forms. For this example, number_sta, which signifies ‘station number,’ is converted using LabelEncoder, which assigns unique numeric values to each category.
Next, numeric features must be normalized to prevent the model from being biased by the variable scales.
Then transform the ‘date’ column into an ‘hour’ feature, as weather patterns often correlate with the time of day. Finally, drop the ‘date’ column, as the models used cannot process this directly.
# Convert categorical variables to numeric variables
le = LabelEncoder()
df['number_sta'] = le.fit_transform(df['number_sta'])
# Normalize numeric features
numeric_columns = ['lat', 'lon', 'height_sta', 'dd', 'ff', 'hu', 'td', 't', 'psl']
for col in numeric_columns:
    if df[col].dtype != 'object':
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    else:
        print(f"Skipping normalization for non-numeric column: {col}")
# Convert date column to datetime and extract hour
df['date'] = cudf.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour
# Drop the original 'date' column
df = df.drop(['date'], axis=1)
Model training and performance
With preprocessing done, the next step is to define a function that trains two models to predict temperature and humidity from weather stations.
To evaluate the performance of the regression model, use R^2, the coefficient of determination. A higher R^2 indicates a model that better predicts the data.
def train_and_evaluate(target):
    # Split into features and target
    X = df.drop(target, axis=1)
    y = df[target]

    # Split the dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Define the model
    model = cuLinearRegression()

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, predictions)
    print(f"R^2 score for predicting {target} is {r2}")

    return model
Now that the function is written, the next step is to train the model with the following call, specifying the target variable:
# Train and evaluate models
temperature_model = train_and_evaluate('t')
humidity_model = train_and_evaluate('hu')
This example demonstrates how to use cuML linear regression to predict temperature and humidity using the northwest dataset. To evaluate the performance of the regression models, we used the R^2 score. It’s important to note that model performance can be further improved by exploring techniques such as feature selection, regularization, and advanced models.
Clustering
Clustering is an unsupervised machine learning (ML) technique used to group similar instances based on their characteristics. It helps identify patterns and structure within the data. This section explores the use of K-Means, a popular centroid-based clustering algorithm, to cluster weather conditions based on temperature and precipitation.
To begin, preprocess the dataset. Focus on two specific features: temperature (t) and precipitation (pp). Any rows with missing values will be removed for simplicity.
import cudf
from cuml import KMeans
# Load data
df = cudf.read_csv("/NW_data.csv").dropna()
# Select the features for clustering
features = ['t', 'pp']
df_kmeans = df[features]
Next, apply K-Means clustering to the data. The goal is to partition the data into a specified number of clusters, with each cluster represented by the mean of the data points within it.
# Initialize the KMeans model
kmeans = KMeans(n_clusters=5, random_state=42)
# Fit the model
kmeans.fit(df_kmeans)
After fitting the model, retrieve the cluster labels, indicating the cluster to which each data point belongs.
# Get the cluster labels
kmeans_labels = kmeans.labels_
# Add the cluster labels as new columns to the dataframe
df['KMeans_Labels_Temperature'] = cudf.Series(kmeans_labels)
df['KMeans_Labels_Precipitation'] = cudf.Series(kmeans_labels)
Model training and performance
To evaluate the quality of the clustering model, examine the inertia, which represents the sum of squared distances between each data point and its closest centroid. Lower inertia values indicate tighter and more distinct clusters.
# Print the inertia value of the fitted model
print("Inertia:")
print(kmeans.inertia_)
Determining the optimal number of clusters in K-Means is important. The Elbow Method helps to find the ideal number by plotting inertia values against different cluster numbers. The “elbow” point indicates the optimal balance between minimizing inertia and avoiding excessive clusters. For a detailed exploration of the Elbow Method, see the Introduction to Machine Learning Using cuML notebook on GitHub.
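A minimal sketch of the Elbow Method with the cuML KMeans used above, assuming df_kmeans from the previous step; the cluster count where the inertia curve flattens is a reasonable choice:

inertias = []
cluster_range = range(2, 11)
for k in cluster_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(df_kmeans)
    inertias.append(km.inertia_)

# Inspect how inertia decreases as the number of clusters grows
for k, inertia in zip(cluster_range, inertias):
    print(k, inertia)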
UMAP, available in cuML, is a powerful dimensionality reduction algorithm used for visualizing high-dimensional data and uncovering underlying patterns. While UMAP itself is not a dedicated clustering algorithm, its ability to project data into a lower-dimensional space often reveals clustering structures. It is widely used for cluster exploration and analysis, providing valuable insights into the data. Its efficient implementation in cuML enables advanced data analysis and pattern identification for clustering tasks.
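As a rough sketch of how cuML UMAP could be applied to the same features (the parameters here are illustrative defaults, not tuned values):

from cuml.manifold import UMAP

# Project the clustering features into two dimensions for visual inspection
embedding = UMAP(n_components=2, n_neighbors=15).fit_transform(df_kmeans)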
Deploying cuML models
Once you have trained your cuML model, you can deploy it to NVIDIA Triton. Triton is an open-source, scalable, and production-ready inference server that can be used to deploy cuML models to various platforms, including cloud, on-premises, and edge devices.
Deploying your trained cuML model effectively in a production environment is crucial to extract its full potential. For models trained with cuML, there are three primary methods:
- FIL backend for Triton
- Triton Python backend
- ONNX format
FIL backend for NVIDIA Triton
The FIL backend for Triton enables Triton users to take advantage of cuML’s Forest Inference Library (FIL) for accelerated inference of tree models, including decision forests and gradient-boosted forests. This Triton backend offers a highly-optimized method to deploy forest models, regardless of what framework was used to train them.
It offers native support for XGBoost and LightGBM models, as well as support for cuML and scikit-learn tree models using Treelite’s serialization format. While the FIL GPU mode offers state-of-the-art GPU-accelerated performance, it also provides an optimized CPU mode for prototype deployments, or deployments where extreme small-batch latency is more important than overall throughput.
To get started, see the Fraud Detection with XGBoost and Triton-FIL introductory tutorial. For a comprehensive look at deploying tree models on Triton, see the FIL Backend FAQ notebook.
Triton Python backend
Another flexible approach for deploying models uses the Triton Python backend. This backend enables you to directly invoke RAPIDS Python libraries. It is highly flexible, so you can write custom Python scripts for handling preprocessing and postprocessing.
To deploy a cuML model using Triton Python backend, you need to:
- Write a Python script that the Triton Server can call for inference. This script should handle any necessary preprocessing and postprocessing.
- Configure the Triton Inference Server to use this Python script for serving your model.
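As a rough illustration only, a model.py for the Python backend follows the general structure below; the tensor names, the model path, and the pickle-based loading are assumptions for this sketch and must match your own config.pbtxt and serialization choices:

import pickle

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the trained cuML model once per model instance (hypothetical path)
        with open("/models/cuml_model/1/model.pkl", "rb") as f:
            self.model = pickle.load(f)

    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT" must match the input name declared in config.pbtxt
            features = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            predictions = self.model.predict(features)
            out = pb_utils.Tensor("OUTPUT", np.asarray(predictions))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses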
In all cases, Triton Inference Server provides a unified interface to all models, regardless of their framework, making it easier to integrate into your existing services and infrastructure. It also enables dynamic batching of incoming requests, reducing compute resources and thereby lowering deployment costs.
Benchmarking RAPIDS
This post is a simplified walkthrough of the complete workflow from the Introduction to Machine Learning Using cuML notebook on GitHub. This workflow achieved a speedup of up to 44x for the combined workflow of data loading, preprocessing, and ML training. These results were obtained on an NVIDIA RTX 8000 GPU with RAPIDS 23.04, compared with an Intel Core i7-7800X CPU.
Conclusion
GPU-accelerated machine learning with cuDF and cuML can drastically speed up your data science pipelines. With faster data preprocessing using cuDF and the cuML scikit-learn-compatible API, it is easy to start leveraging the power of GPUs for machine learning.
For a hands-on deep dive into the concepts discussed in this post, check out the Introduction to Machine Learning Using cuML notebook on GitHub. Learn more about GPU-accelerated data science workflows.
Computer Architecture research has a long history of developing simulators and tools to evaluate and shape the design of computer systems. For example, the SimpleScalar simulator was introduced in the late 1990s and allowed researchers to explore various microarchitectural ideas. Computer architecture simulators and tools, such as gem5, DRAMSys, and many more have played a significant role in advancing computer architecture research. Since then, these shared resources and infrastructure have benefited industry and academia and have enabled researchers to systematically build on each other’s work, leading to significant advances in the field.
Nonetheless, computer architecture research is evolving, with industry and academia turning towards machine learning (ML) optimization to meet stringent domain-specific requirements, such as ML for computer architecture, ML for TinyML acceleration, DNN accelerator datapath, memory controllers, power consumption, security, and privacy. Although prior work has demonstrated the benefits of ML in design optimization, the lack of strong, reproducible baselines hinders fair and objective comparison across different methods and poses several challenges to their deployment. To ensure steady progress, it is imperative to understand and tackle these challenges collectively.
To alleviate these challenges, in “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design”, accepted at ISCA 2023, we introduced ArchGym, which includes a variety of computer architecture simulators and ML algorithms. Enabled by ArchGym, our results indicate that with a sufficiently large number of samples, any of a diverse collection of ML algorithms are capable of finding the optimal set of architecture design parameters for each target problem; no one solution is necessarily better than another. These results further indicate that selecting the optimal hyperparameters for a given ML algorithm is essential for finding the optimal architecture design, but choosing them is non-trivial. We release the code and dataset across multiple computer architecture simulations and ML algorithms.
Challenges in ML-assisted architecture research
ML-assisted architecture research poses several challenges, including:
- For a specific ML-assisted computer architecture problem (e.g., finding an optimal solution for a DRAM controller) there is no systematic way to identify optimal ML algorithms or hyperparameters (e.g., learning rate, warm-up steps, etc.). There is a wide range of ML and heuristic methods, from random walk to reinforcement learning (RL), that can be employed for design space exploration (DSE). While these methods have shown noticeable performance improvement over their choice of baselines, it is not evident whether the improvements are because of the choice of optimization algorithms or hyperparameters.
Thus, to ensure reproducibility and facilitate widespread adoption of ML-aided architecture DSE, it is necessary to outline a systematic benchmarking methodology.
- While computer architecture simulators have been the backbone of architectural innovations, there is an emerging need to address the trade-offs between accuracy, speed, and cost in architecture exploration. The accuracy and speed of performance estimation widely varies from one simulator to another, depending on the underlying modeling details (e.g., cycle–accurate vs. ML–based proxy models). While analytical or ML-based proxy models are nimble by virtue of discarding low-level details, they generally suffer from high prediction error. Also, due to commercial licensing, there can be strict limits on the number of runs collected from a simulator. Overall, these constraints exhibit distinct performance vs. sample efficiency trade-offs, affecting the choice of optimization algorithm for architecture exploration.
It is challenging to delineate how to systematically compare the effectiveness of various ML algorithms under these constraints.
- Finally, the landscape of ML algorithms is rapidly evolving and some ML algorithms need data to be useful. Additionally, rendering the outcome of DSE into meaningful artifacts such as datasets is critical for drawing insights about the design space.
In this rapidly evolving ecosystem, it is important to understand how to amortize the overhead of search algorithms for architecture exploration. It is not apparent, nor has it been systematically studied, how to leverage exploration data while remaining agnostic to the underlying search algorithm.
ArchGym design
ArchGym addresses these challenges by providing a unified framework for evaluating different ML-based search algorithms fairly. It comprises two main components: 1) the ArchGym environment and 2) the ArchGym agent. The environment is an encapsulation of the architecture cost model — which includes latency, throughput, area, energy, etc., to determine the computational cost of running the workload, given a set of architectural parameters — paired with the target workload(s). The ArchGym agent is an encapsulation of the ML algorithm used for the search and consists of hyperparameters and a guiding policy. The hyperparameters are intrinsic to the algorithm for which the model is to be optimized and can significantly influence performance. The policy, on the other hand, determines how the agent selects a parameter iteratively to optimize the target objective.
Notably, ArchGym also includes a standardized interface that connects these two components, while also saving the exploration data as the ArchGym Dataset. At its core, the interface entails three main signals: hardware state, hardware parameters, and metrics. These signals are the bare minimum to establish a meaningful communication channel between the environment and the agent. Using these signals, the agent observes the state of the hardware and suggests a set of hardware parameters to iteratively optimize a (user-defined) reward. The reward is a function of hardware performance metrics, such as performance, energy consumption, etc.
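To make the loop concrete, the following toy sketch shows the general agent-environment-reward pattern the interface standardizes; it is a self-contained random-walk example, not ArchGym’s actual API, and the cost model and parameter names are made up:

import random

def cost_model(params):
    # Toy stand-in for an architecture cost model: parameters -> latency
    return (params["cache_kb"] - 256) ** 2 + 10 * abs(params["issue_width"] - 4)

def reward_fn(metrics):
    return -metrics["latency"]  # lower latency means higher reward

best_params, best_reward = None, float("-inf")
for step in range(200):
    # Random-walk "agent": sample hardware parameters from the design space
    params = {"cache_kb": random.choice([64, 128, 256, 512, 1024]),
              "issue_width": random.randint(1, 8)}
    metrics = {"latency": cost_model(params)}  # environment evaluates the design
    reward = reward_fn(metrics)                # user-defined reward from metrics
    if reward > best_reward:
        best_params, best_reward = params, reward

print(best_params, best_reward)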
ML algorithms could be equally favorable to meet user-defined target specifications
Using ArchGym, we empirically demonstrate that across different optimization objectives and DSE problems, at least one set of hyperparameters exists that results in the same hardware performance as other ML algorithms. A poorly selected (random selection) hyperparameter for the ML algorithm or its baseline can lead to a misleading conclusion that a particular family of ML algorithms is better than another. We show that with sufficient hyperparameter tuning, different search algorithms, even random walk (RW), are able to identify the best possible normalized reward. However, note that finding the right set of hyperparameters may require exhaustive search or even luck to make it competitive.
With a sufficient number of samples, there exists at least one set of hyperparameters that results in the same performance across a range of search algorithms. Here the dashed line represents the maximum normalized reward. Cloud-1, cloud-2, stream, and random indicate four different memory traces for DRAMSys (DRAM subsystem design space exploration framework).
Dataset construction and high-fidelity proxy model training
Creating a unified interface using ArchGym also enables the creation of datasets that can be used to design better data-driven ML-based proxy architecture cost models to improve the speed of architecture simulation. To evaluate the benefits of datasets in building an ML model to approximate architecture cost, we leverage ArchGym’s ability to log the data from each run from DRAMSys to create four dataset variants, each with a different number of data points. For each variant, we create two categories: (a) Diverse Dataset (DD), which represents the data collected from different agents (ACO, GA, RW, and BO), and (b) ACO only, which shows the data collected exclusively from the ACO agent, both of which are released along with ArchGym. We train a proxy model on each dataset using random forest regression with the objective to predict the latency of designs for a DRAM simulator. Our results show that:
- As we increase the dataset size, the average normalized root mean squared error (RMSE) slightly decreases.
- However, as we introduce diversity in the dataset (e.g., collecting data from different agents), we observe 9× to 42× lower RMSE across different dataset sizes.
Diverse dataset collection across different agents using the ArchGym interface.
The impact of a diverse dataset and dataset size on the normalized RMSE.
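As a schematic illustration of this kind of proxy modeling (with synthetic data standing in for the released ArchGym dataset), a random forest regressor can be fit to logged parameter-latency pairs and scored with RMSE:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for logged exploration data: parameters -> observed latency
rng = np.random.default_rng(0)
X = rng.random((5000, 8))                       # e.g., normalized DRAM controller parameters
y = X @ rng.random(8) + 0.1 * rng.random(5000)  # surrogate latency values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
proxy = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, proxy.predict(X_test)))
print(f"Normalized RMSE: {rmse / (y_test.max() - y_test.min()):.4f}")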
The need for a community-driven ecosystem for ML-assisted architecture research
While ArchGym is an initial effort toward creating an open-source ecosystem that (1) connects a broad range of search algorithms to computer architecture simulators in a unified and easy-to-extend manner, (2) facilitates research in ML-assisted computer architecture, and (3) forms the scaffold to develop reproducible baselines, there are many open challenges that need community-wide support. Below we outline some of the open challenges in ML-assisted architecture design. Addressing these challenges requires a well-coordinated effort and a community-driven ecosystem.
Key challenges in ML-assisted architecture design.
We call this ecosystem Architecture 2.0. We outline the key challenges and a vision for building an inclusive ecosystem of interdisciplinary researchers to tackle the long-standing open problems in applying ML for computer architecture research. If you are interested in helping shape this ecosystem, please fill out the interest survey.
Conclusion
ArchGym is an open-source gymnasium for ML-assisted architecture DSE that provides a standardized interface that can be readily extended to suit different use cases. Additionally, ArchGym enables fair and reproducible comparison between different ML algorithms and helps to establish stronger baselines for computer architecture research problems.
We invite the computer architecture community as well as the ML community to actively participate in the development of ArchGym. We believe that the creation of a gymnasium-type environment for computer architecture research would be a significant step forward in the field and provide a platform for researchers to use ML to accelerate research and lead to new and innovative designs.
Acknowledgements
This blogpost is based on joint work with several co-authors at Google and Harvard University. We would like to acknowledge and highlight Srivatsan Krishnan (Harvard) who contributed several ideas to this project in collaboration with Shvetank Prakash (Harvard), Jason Jabbour (Harvard), Ikechukwu Uchendu (Harvard), Susobhan Ghosh (Harvard), Behzad Boroujerdian (Harvard), Daniel Richins (Harvard), Devashree Tripathy (Harvard), and Thierry Thambe (Harvard). In addition, we would also like to thank James Laudon, Douglas Eck, Cliff Young, and Aleksandra Faust for their support, feedback, and motivation for this work. We would also like to thank John Guilyard for the animated figure used in this post. Amir Yazdanbakhsh is now a Research Scientist at Google DeepMind and Vijay Janapa Reddi is an Associate Professor at Harvard.
Webinar: NVIDIA DLSS 3 and Unreal Engine 5.2
On July 26, walk through DLSS 3 features within Unreal Engine 5.2 and learn how to best use the latest updates.
Large language models (LLMs), such as GPT, have emerged as revolutionary tools in natural language processing (NLP) due to their ability to understand and generate human-like text. These models are trained on vast amounts of diverse data, enabling them to learn patterns, language structures, and contextual relationships. They serve as foundational models that can be customized to a wide range of downstream tasks, making them highly versatile.
Downstream tasks, such as classification, can include the analysis and categorization of text based on predefined criteria, aiding in tasks like sentiment analysis or spam detection. In closed question-answering (QA), they can provide precise answers based on the given context. In generation tasks, they can produce human-like text, such as story writing or poem composition. Even when it comes to brainstorming, LLMs can generate creative and coherent ideas by leveraging their vast knowledge base.
The adaptability and versatility of LLMs make them invaluable tools for a wide range of applications, empowering businesses, researchers, and individuals to accomplish various tasks with remarkable efficiency and accuracy.
This post shows you how LLMs can be adapted to downstream tasks using distributed datasets and federated learning to preserve privacy and enhance model performance.
Adaptation of LLMs to downstream tasks
Parameter-efficient fine-tuning of LLMs using task-specific modules has gained prominence. This approach involves keeping the pretrained LLM layers fixed while adapting a smaller set of additional parameters to the specific task at hand. Various techniques have been developed to facilitate this process, including prompt tuning, p-tuning, adapters, LoRA, and others.
For example, p-tuning involves freezing the LLM and learning to predict virtual token embeddings that are combined with the original input text, as shown in Figure 1. The task-specific virtual token embeddings are predicted by a prompt encoder network, which, along with the input word embeddings, are fed into the LLM to enhance performance on the downstream task at inference time. It is parameter efficient as only the prompt encoder parameters must be trained on the input text and labels, while the foundational LLM parameters can stay fixed.
Federated learning
Using private data for training AI models poses significant challenges due to regulatory constraints and complex bureaucratic processes. Privacy regulations and data protection laws often prohibit sharing sensitive information, limiting the feasibility of traditional data-sharing approaches. Moreover, data annotation, a crucial aspect of model training, incurs substantial costs and demands significant time and effort.
Recognizing data as a valuable asset, federated learning (FL) has emerged as a technology to address these concerns. FL bypasses the conventional model training process by sharing models instead of raw data. Participating clients train models using their respective private datasets locally, and the updated model parameters are aggregated. This preserves the privacy of the underlying data while collectively benefiting from the knowledge gained during the training process.
No direct data exchange is needed, which mitigates the compliance risks associated with data privacy regulations and distributes the burdensome data annotation cost among collaborators in the federation.
Figure 2 shows federated p-tuning with a global model and three clients. The LLM parameters stay fixed while the prompt encoder parameters are trained on the local data. After local training, the new parameters are aggregated on the server to update the global model for the next round of federated learning.
Federating the adaptation of LLMs to downstream tasks
FL enables this adaptation of LLMs to downstream tasks by leveraging decentralized data sources. By training LLMs collaboratively across multiple participants without sharing raw data, the accuracy, robustness, and generalizability of LLMs can be enhanced by leveraging collective knowledge and exposing models to a wider range of linguistic patterns (Figure 2). Additionally, FL offers various options for model adaptation and inference, including global models trained on aggregated data and personalized models tailored to individual clients.
Federated p-tuning for sentiment analysis
This section provides an example of federated adaptation of an LLM from NVIDIA NeMo framework for a downstream task with NVIDIA Flare using p-tuning. Both NeMo and NVIDIA Flare are open-source toolkits developed by NVIDIA. This fine-tuning process is efficient, as only a few dozen million parameters need to be exchanged, significantly reducing the communication burden.
In this sentiment analysis task, the NeMo Megatron-GPT model with 20 billion parameters can be efficiently fine-tuned using p-tuning. It uses the Financial PhraseBank dataset, which contains the sentiments for financial news headlines from a retail investor’s perspective. For more details, see Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts.
The example inputs and model predictions are shown in Figure 3. In total, this data contains 1,800 pairs of headlines and corresponding sentiment labels. In p-tuning, only 50 million parameters of a trainable prompt encoder network are updated (0.25% of the full 20B parameters). For FL experiments, the data is split into three sets, which correspond to 600 headlines and sentiment pairs for each site. The clients use the same validation set to enable a direct comparison.
Figure 4a compares training the model in a centralized fashion with federated training for 50 epochs (or FL rounds). In both settings, the adapted model performs comparably on the downstream task, achieving a similar low loss on the validation set. Figure 4b compares each client training on its local dataset alone with the model p-tuned using FL. One can see a clear advantage for the global model using federated p-tuning: by effectively making use of the larger training set available in the collaboration, it achieves a lower loss than clients training on their data alone.
Conclusion
Overall, this post highlights the potential of federated p-tuning in adapting LLMs to downstream tasks, emphasizing the benefits of FL in enabling collaborative learning for preserving privacy and enhancing model performance. Some key takeaways are:
- Large language models such as GPT have revolutionized NLP, offering versatility for various downstream tasks such as classification, question-answering, generation, and brainstorming.
- Federated learning addresses challenges related to private data by sharing model parameters instead of raw data, ensuring privacy and reducing compliance risks.
- Fine-tuning LLMs with task-specific modules, such as prompt-tuning or p-tuning, enables efficient adaptation to specific tasks.
- FL facilitates collaborative training and inference, leading to improved model performance.
For more information, see the NVIDIA Flare documentation and NVIDIA NeMo framework page. To replicate the experiments explained here and other LLM tasks, explore the Examples of NeMo-NVFlare Integration. The federated p-tuning approach presented here can be further combined with additional privacy-preserving solutions offered by NVIDIA Flare, such as homomorphic encryption and differential privacy. To learn more, see NVIDIA FLARE: Federated Learning from Simulation to Real-World.
If you are a DirectX 12 (DX12) game developer, you may have noticed that GPU times displayed in real time in your game HUD may change over time for a given pass. This may be the case even if nothing has changed on the application side.
One reason for GPU time variations may be GPU Boost dynamically changing the GPU core clock frequency. Still, even with GPU Boost disabled using the DX12 SetStablePowerState API, GPU timings measured in-game may still change unexpectedly from run to run, or from frame to frame. One factor to consider is whether background driver optimizations were engaged and when their resulting optimized shaders were deployed.
This post provides best practices for performing in-game GPU profiling while monitoring the state of the background driver optimizations, using the DX12 SetBackgroundProcessingMode API on NVIDIA GPUs.
Keep background driver optimizations always on
The DX12 driver automatically disables all of its background optimizations if it detects a risk that the CPU overhead may negatively impact the frame rate of the DX12 application. As a result, running with a Debug build of an application may result in less optimal GPU workloads, for instance. Even for a Release build, the driver background optimizations may be turned on and off dynamically from frame to frame.
To avoid getting inconsistent profiling results depending on the CPU load of your application, you can request that the driver background optimizations stay always on, even if it may degrade frame rate. Use the following call (calling it once is enough; there is no need to repeat it every frame):
if (FAILED(pDevice6->SetBackgroundProcessingMode(
        D3D12_BACKGROUND_PROCESSING_MODE_ALLOW_INTRUSIVE_MEASUREMENTS,
        D3D12_MEASUREMENTS_ACTION_KEEP_ALL,
        nullptr, nullptr))) {
    // handle error.
}
Wait for background driver optimization threads
Even with driver background optimizations always on, the optimizations typically require multiple frames to collect observations. The observations are then used to compile a shader asynchronously. In contrast, DX12 Create calls block until their compiles are complete. This asynchronous delivery of new binaries can result in GPU performance for one shader suddenly changing from one frame to the next without anything changing on the application side.
Understandably, this can cause a great deal of confusion in timing your shaders. You should still aim to measure these background-optimized shaders to avoid application optimization work that the driver is already providing.
To know when all background driver optimizations have completed so you can take GPU performance measurements in your in-game profiler, use the following code on Present. Continue to render frames until wantMoreFrames is returned as false.
On Present:
BOOL wantMoreFrames;
if (FAILED(pDevice6->SetBackgroundProcessingMode(
        D3D12_BACKGROUND_PROCESSING_MODE_ALLOW_INTRUSIVE_MEASUREMENTS,
        D3D12_MEASUREMENTS_ACTION_KEEP_ALL,
        nullptr,
        &wantMoreFrames))) {
    // handle error.
}
Notes:
- The wantMoreFrames return value combines two pieces of information from the driver: “are background compiles currently running” and “does the driver want more frames demonstrated to the optimizers.”
- We recommend that you display this Boolean in real time in your game HUD next to your in-game GPU timings.
- It is possible that wantMoreFrames never becomes false if the driver continues generating new binaries. We recommend that you pause your game time and do not move the camera to avoid this possibility.
- If the wantMoreFrames Boolean never turns false in your case, even after you have paused all simulations, you can fall back to looking at whether the GPU timings in your game HUD appear to have settled.
Reset the background processing mode to the default mode
Use the following call to return to the default mode of the DX12 driver. In this mode, the driver turns background optimizations on and off depending on internal heuristics.
if (FAILED(pDevice6->SetBackgroundProcessingMode(
        D3D12_BACKGROUND_PROCESSING_MODE_ALLOWED,
        D3D12_MEASUREMENTS_ACTION_KEEP_ALL,
        nullptr, nullptr))) {
    // handle error.
}
Conclusion
For more deterministic performance measurements on NVIDIA GPUs using your DX12 in-game GPU profiler, we recommend that you display the wantMoreFrames Boolean in your game HUD next to your in-game GPU timings to know whether background driver optimizations are in flight.
By using the DX12 SetBackgroundProcessingMode API in your game engine in this way during development, your in-game GPU profiler will provide more reliable information. By using the ALLOW_INTRUSIVE_MEASUREMENTS background processing mode, you should no longer get different GPU timings depending on the CPU load of your game. By waiting for wantMoreFrames to be false, you can make sure that you always look at the GPU performance of the fully optimized shaders.
Google at ACL 2023
This week, the 61st annual meeting of the Association for Computational Linguistics (ACL), a premier conference covering a broad spectrum of research areas that are concerned with computational approaches to natural language, is taking place online.
As a leader in natural language processing and understanding, and a Diamond Level sponsor of ACL 2023, Google will showcase the latest research in the field with over 50 publications, and active involvement in a variety of workshops and tutorials.
If you’re registered for ACL 2023, we hope that you’ll visit the Google booth to learn more about the projects at Google that go into solving interesting problems for billions of people. You can also learn more about Google’s participation below (Google affiliations in bold).
Board and Organizing Committee
Area chairs include: Dan Garrette
Workshop chairs include: Annie Louis
Publication chairs include: Lei Shu
Program Committee includes: Vinodkumar Prabhakaran, Najoung Kim, Markus Freitag
Spotlight papers
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, Ayu Purwarianti
Optimizing Test-Time Query Representations for Dense Retrieval
Mujeen Sung, Jungsoo Park, Jaewoo Kang, Danqi Chen, Jinhyuk Lee
PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition
Sihao Chen*, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, Tal Schuster
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Cheng-Yu Hsieh*, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
Large Language Models with Controllable Working Memory
Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar
OpineSum: Entailment-Based Self-Training for Abstractive Opinion Summarization
Annie Louis, Joshua Maynez
RISE: Leveraging Retrieval Techniques for Summarization Evaluation
David Uthus, Jianmo Ni
Follow the Leader(board) with Confidence: Estimating p-Values from a Single Test Set with Item and Response Variance
Shira Wein*, Christopher Homan, Lora Aroyo, Chris Welty
SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives
Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dornbach, Imed Zitouni, Enrique Alfonseca, Zhe Dong
Papers
Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability
Eleftheria Briakou, Colin Cherry, George Foster
Prompting PaLM for Translation: Assessing Strategies and Performance
David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, George Foster
Query Refinement Prompts for Closed-Book Long-Form QA
Reinald Kim Amplayo, Kellie Webster, Michael Collins, Dipanjan Das, Shashi Narayan
To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering
Dheeru Dua*, Emma Strubell, Sameer Singh, Pat Verga
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation (see blog post)
Parker Riley, Timothy Dozat, Jan A. Botha, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, Noah Constant
Conditional Generation with a Question-Answering Blueprint
Shashi Narayan, Joshua Maynez, Reinald Kim Amplayo, Kuzman Ganchev, Annie Louis, Fantine Huot, Anders Sandholm, Dipanjan Das, Mirella Lapata
Coreference Resolution Through a Seq2Seq Transition-Based System
Bernd Bohnet, Chris Alberti, Michael Collins
Cross-Lingual Transfer with Language-Specific Subnetworks for Low-Resource Dependency Parsing
Rochelle Choenni, Dan Garrette, Ekaterina Shutova
DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue
William Held*, Christopher Hidey, Fei Liu, Eric Zhu, Rahul Goel, Diyi Yang, Rushin Shah
RARR: Researching and Revising What Language Models Say, Using Language Models
Luyu Gao*, Zhuyun Dai, Panupong Pasupat, Anthony Chen*, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, Kelvin Guu
Benchmarking Large Language Model Capabilities for Conditional Generation
Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann
Crosslingual Generalization Through Multitask Fine-Tuning
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel
DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering
Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, Omri Abend
Resolving Indirect Referring Expressions for Entity Selection
Mohammad Javad Hosseini, Filip Radlinski, Silvia Pareti, Annie Louis
SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models
Akshita Jha*, Aida Mostafazadeh Davani, Chandan K Reddy, Shachi Dave, Vinodkumar Prabhakaran, Sunipa Dev
The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks
Nikil Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, Kai-Wei Chang
Character-Aware Models Improve Visual Text Rendering
Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, Noah Constant
Cold-Start Data Selection for Better Few-Shot Language Model Fine-Tuning: A Prompt-Based Uncertainty Propagation Approach
Yue Yu, Rongzhi Zhang, Ran Xu, Jieyu Zhang, Jiaming Shen, Chao Zhang
Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment
Roni Rabin, Alexandre Djerbetian, Roee Engelberg, Lidan Hackmon, Gal Elidan, Reut Tsarfaty, Amir Globerson
FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
Chen-Yu Lee, Chun-Liang Li, Hao Zhang, Timothy Dozat, Vincent Perot, Guolong Su, Xiang Zhang, Kihyuk Sohn, Nikolay Glushinev, Renshen Wang, Joshua Ainslie, Shangbang Long, Siyang Qin, Yasuhisa Fujii, Nan Hua, Tomas Pfister
Dialect-Robust Evaluation of Generated Text
Jiao Sun*, Thibault Sellam, Elizabeth Clark, Tu Vu*, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, Sebastian Gehrmann
MISGENDERED: Limits of Large Language Models in Understanding Pronouns
Tamanna Hossain, Sunipa Dev, Sameer Singh
LAMBADA: Backward Chaining for Automated Reasoning in Natural Language
Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, Deepak Ramachandran
LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction
Jeremiah Milbauer*, Annie Louis, Mohammad Javad Hosseini, Alex Fabrikant, Donald Metzler, Tal Schuster
Modular Visual Question Answering via Code Generation (see blog post)
Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, Huan Sun
Better Zero-Shot Reasoning with Self-Adaptive Prompting
Xingchen Wan*, Ruoxi Sun, Hanjun Dai, Sercan Ö. Arik, Tomas Pfister
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, Idan Szpektor
Natural Language to Code Generation in Interactive Data Science Notebooks
Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Oleksandr Polozov, Charles Sutton
Teaching Small Language Models to Reason
Lucie Charlotte Magister*, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn
Using Domain Knowledge to Guide Dialog Structure Induction via Neural Probabilistic Soft Logic
Connor Pryor*, Quan Yuan, Jeremiah Liu, Mehran Kazemi, Deepak Ramachandran, Tania Bedrax-Weiss, Lise Getoor
A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
Lining Zhang, Simon Mille, Yufang Hou, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Saad Mahamood, Sebastian Gehrmann, Miruna Clinciu, Khyathi Raghavi Chandu, João Sedoc
Industry Track papers
Federated Learning of Gboard Language Models with Differential Privacy
Zheng Xu, Yanxiang Zhang, Galen Andrew, Christopher Choquette, Peter Kairouz, Brendan McMahan, Jesse Rosenstock, Yuanbo Zhang
KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models
Zhiwei Jia*, Pradyumna Narayana, Arjun Akula, Garima Pruthi, Hao Su, Sugato Basu, Varun Jampani
ACL Findings papers
Multilingual Summarization with Factual Consistency Evaluation
Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, Mirella Lapata
Parameter-Efficient Fine-Tuning for Robust Continual Multilingual Learning
Kartikeya Badola, Shachi Dave, Partha Talukdar
FiDO: Fusion-in-Decoder Optimized for Stronger Performance and Faster Inference
Michiel de Jong*, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai, Fei Sha, William Cohen
A Simple, Yet Effective Approach to Finding Biases in Code Generation
Spyridon Mouselinos, Mateusz Malinowski, Henryk Michalewski
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, Jason Wei
QueryForm: A Simple Zero-Shot Form Entity Query Framework
Zifeng Wang*, Zizhao Zhang, Jacob Devlin, Chen-Yu Lee, Guolong Su, Hao Zhang, Jennifer Dy, Vincent Perot, Tomas Pfister
ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval
Yue Yu, Yuchen Zhuang, Rongzhi Zhang, Yu Meng, Jiaming Shen, Chao Zhang
Multilingual Sequence-to-Sequence Models for Hebrew NLP
Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty
Triggering Multi-Hop Reasoning for Question Answering in Language Models Using Soft Prompts and Random Walks
Kanishka Misra*, Cicero Nogueira dos Santos, Siamak Shakeri
Tutorials
Complex Reasoning in Natural Language
Wenting Zhao, Mor Geva, Bill Yuchen Lin, Michihiro Yasunaga, Aman Madaan, Tao Yu
Generating Text from Language Models
Afra Amini, Ryan Cotterell, John Hewitt, Clara Meister, Tiago Pimentel
Workshops
Simple and Efficient Natural Language Processing (SustaiNLP)
Organizers include: Tal Schuster
Workshop on Online Abuse and Harms (WOAH)
Organizers include: Aida Mostafazadeh Davani
Document-Grounded Dialogue and Conversational Question Answering (DialDoc)
Organizers include: Roee Aharoni
NLP for Conversational AI
Organizers include: Abhinav Rastogi
Computation and Written Language (CAWL)
Organizers include: Kyle Gorman, Brian Roark, Richard Sproat
Computational Morphology and Phonology (SIGMORPHON)
Speakers include: Kyle Gorman
Workshop on Narrative Understanding (WNU)
Organizers include: Elizabeth Clark
* Work done while at Google