Categories
Offsites

Progress and Challenges in Long-Form Open-Domain Question Answering

Open-domain long-form question answering (LFQA) is a fundamental challenge in natural language processing (NLP) that involves retrieving documents relevant to a given question and using them to generate an elaborate paragraph-length answer. While there has been remarkable recent progress in factoid open-domain question answering (QA), where a short phrase or entity is enough to answer a question, much less work has been done in the area of long-form question answering. LFQA is nevertheless an important task, especially because it provides a testbed to measure the factuality of generative text models. But are current benchmarks and evaluation metrics really suitable for making progress on LFQA?

In “Hurdles to Progress in Long-form Question Answering” (to appear at NAACL 2021), we present a new system for open-domain long-form question answering that leverages two recent advances in NLP: 1) state-of-the-art sparse attention models, such as the Routing Transformer (RT), which allow attention-based models to scale to long sequences, and 2) retrieval-based models, such as REALM, which facilitate retrieval of Wikipedia articles related to a given query. To encourage more factual grounding, our system combines information from several retrieved Wikipedia articles related to the given question before generating an answer. It achieves a new state of the art on ELI5, the only large-scale publicly available dataset for long-form question answering.

However, while our system tops the public leaderboard, we discover several troubling trends with the ELI5 dataset and its associated evaluation metrics. In particular, we find 1) little evidence that models actually use the retrievals on which they condition; 2) that trivial baselines (e.g., input copying) beat modern systems, like RAG / BART+DPR; and 3) that there is a significant train/validation overlap in the dataset. Our paper suggests mitigation strategies for each of these issues.

Text Generation
The main workhorse of NLP models is the Transformer architecture, in which each token in a sequence attends to every other token, resulting in a model that scales quadratically with sequence length. The RT model introduces a dynamic, content-based sparse attention mechanism that reduces the complexity of attention in the Transformer model from n^2 to n^1.5, where n is the sequence length, enabling it to scale to long sequences. This allows each word to attend to other relevant words anywhere in the entire piece of text, unlike methods such as Transformer-XL, where a word can only attend to words in its immediate vicinity.

The key insight of the RT work is that each token attending to every other token is often redundant, and may be approximated by a combination of local and global attention. Local attention allows each token to build up a local representation over several layers of the model, where each token attends to a local neighborhood, facilitating local consistency and fluency. Complementing local attention, the RT model also uses mini-batch k-means clustering to enable each token to attend only to a set of most relevant tokens.
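
To make the routing idea concrete, here is a toy NumPy sketch (an illustrative simplification: single head, no learned projections or causal masking, and not the actual RT implementation). Token vectors are clustered with k-means, and each token attends only within its own cluster:

import numpy as np

def routing_attention(x, n_clusters=4, iters=10, seed=0):
    # With roughly sqrt(n) clusters of roughly sqrt(n) tokens each, the
    # attention cost drops from n^2 toward n^1.5.
    n, d = x.shape
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(n, size=n_clusters, replace=False)]
    for _ in range(iters):  # plain k-means on the token vectors
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                centroids[c] = x[assign == c].mean(axis=0)
    out = np.zeros_like(x)
    for c in range(n_clusters):  # softmax attention within each cluster
        idx = np.where(assign == c)[0]
        if idx.size == 0:
            continue
        scores = x[idx] @ x[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[idx] = w @ x[idx]
    return out

out = routing_attention(np.random.randn(64, 32))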

Attention maps for the content-based sparse attention mechanism used in the Routing Transformer. The word sequence is represented by the diagonal, dark-colored squares. In the Transformer model (left), each token attends to every other token. The shaded squares represent the tokens in the sequence to which a given token (the dark square) is attending. The RT model uses both local attention (middle), where tokens attend only to other tokens in their local neighborhood, and routing attention (right), in which a token attends only to clusters of tokens most relevant to it in context. The dark red, green, and blue tokens attend only to the lightly shaded tokens of the corresponding color.

We pre-train an RT model on the Project Gutenberg (PG-19) dataset with a language modeling objective, i.e., the model learns to predict the next word given all the previous words, so that it can generate fluent, paragraph-long text.

Information Retrieval
To demonstrate the effectiveness of the RT model on the task of LFQA, we combine it with retrievals from REALM. The REALM model (Guu et al. 2020) is a retrieval-based model that uses maximum inner product search to retrieve Wikipedia articles relevant to a particular query or question. The model was fine-tuned for factoid question answering on the Natural Questions dataset. REALM utilizes the BERT model to learn good representations for a question and uses ScaNN to retrieve Wikipedia articles that have high topical similarity with the question representation. The whole system is then trained end-to-end to maximize the log-likelihood on the QA task.

We further improve the quality of REALM retrievals by using a contrastive loss. The idea is to encourage the representation of a question to move closer to its ground-truth answer and away from the other answers in its mini-batch. This ensures that when the system retrieves relevant items using this question representation, it returns articles that are “similar” to ground-truth answers. We call this retriever contrastive-REALM, or c-REALM.
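
As a rough sketch of this idea (illustrative code, not the actual c-REALM implementation), an in-batch contrastive loss over question and answer embeddings can be written as:

import tensorflow as tf

def contrastive_retrieval_loss(q_emb, a_emb):
    # Each question should score highest against its own gold answer; the
    # other answers in the mini-batch act as negatives.
    logits = tf.matmul(q_emb, a_emb, transpose_b=True)  # [B, B] similarities
    labels = tf.range(tf.shape(q_emb)[0])               # diagonal = positives
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits))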

The combined RT + c-REALM system for LFQA.

Evaluation
We test the model on long-form question answering using the ELI5 dataset, which is part of the KILT benchmark and is the only publicly available large-scale LFQA dataset. The KILT benchmark measures retrieval quality using R-Precision (R-Prec) and text generation quality using ROUGE-L. The two scores are combined to give a KILT R-L score, which determines a model’s ranking on the leaderboard. We fine-tune the pre-trained RT model together with retrievals from c-REALM on the ELI5 dataset from KILT.
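
As we understand it, the combination works roughly as follows (a sketch of our reading of the metric; see the KILT paper for the exact definition): an instance’s generation score only counts when its retrieval is correct.

def kilt_rl(r_prec, rouge_l):
    # ROUGE-L contributes only on instances whose retrieval is perfect
    # (R-Prec == 1); otherwise the instance contributes 0.
    return sum(r for p, r in zip(r_prec, rouge_l) if p == 1.0) / len(rouge_l)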

Our submission tops the KILT leaderboard for long-form question answering on ELI5 with a combined KILT R-L score of 2.36. It improves on the previous leaderboard entry of BART + DPR (KILT R-L score of 1.9), while having a similar number of parameters as the other models on the leaderboard. In terms of text generation quality, we see improvements of +4.11, +5.78, and +9.14 ROUGE-L over T5, BART + DPR, and RAG, respectively.

Our results on the KILT leaderboard for long-form question answering on ELI5.

Example Generations from the RT Model

  • Q: Why are almost all boats white?

    Boats are generally painted white for obvious reasons, as they are inexpensive and easy to clean. Ships are also often painted to provide a visible visual contrast to the sea. Boats also often have high visibility ports to allow for easy international trade. In addition, many boats have an air-breathing function that makes them easier to spot and stay afloat, which is a good thing if you’re a pirate or trying to get into some country’s waters. Lastly, many ships are built in the last few decades, so they are not as exotic looking as they were thirty years ago. And lastly, many ships are made out of materials that are easy to work with, and so many of them are painted white as a practical matter. There are many, many more reasons for why ships are white, but those are the ones I know of. I’m sure there are many other reasons, but those are the ones I know of. Edit: I forgot to mention that the other answers about visibility and easy to see ports are also true. Ships are painted to appear to be larger, and more expensive. That makes them appear more impressive, and so they are more likely to be bought by people who want to show off their wealth and status.

  • Q: What causes the trail behind jets at high altitude?

    The same thing that causes the clouds you see at the start of a storm to form in the first place. The air pressure changes as the jet moves through the air, causing it to cool down and condense into water droplets. These droplets are in fact a lot bigger than the water droplets in the clouds, so when the plane passes through them, they can cause the air pressure to change, causing the water droplets to condense into clouds. This is what you see when you look at a cloud formation from the ground. In addition to this, when the plane passes through the clouds, it also moves through the air, causing it to change its temperature, causing the water droplets to condense and form clouds. These clouds are often quite large, and can block out the sun. This is what you see when you look at a contrail. Finally, the water vapor that forms at high altitudes is cooled by the air above, and as it cools, it condenses into clouds. This is what you see when you look at a cumulonimbus cloud. These clouds are even larger than the water droplets in the clouds, and they block out the sun.

  • Q: How do shampoo and conditioner work?

    The key ingredients in shampoo and conditioner are usually surfactants, they tend to attract oil from your skin. Shampoo will usually contain an anti-dandruff agent (dandruff is caused by fungi in your skin which cause dandruff). They also contain some moisturizing agents (moisturizers are things that stay in your skin longer and protect your scalp from cleaning), and a lot of anti-dandruff (dandruff is a side effect of high doses of both surfactants and moisturizers) Conditioner will contain moisturizing agents, like a skin moisturizer. They tend to cause oils to be separated from your hair (e.g. wash your hair with conditioner instead of shampoo). They both have a foaming agent that makes them foam up (usually Hydrogen Peroxide). They also have a foaming agent that stops dirt from attaching to your hair. Some conditioner will also suppress the action of the shampoo (to help prevent dirt from attaching to your hair).

  • Q: When glancing at a clock, why does the first second after glancing at it sometimes feel longer than the rest?

    This is known as chronostasis. Your brain gets used to seeing the second hand move after the first, so it ignores the first one. If you were to look at the clock a lot, you’d see the second hand move a lot more than the first, and you’d be able to tell the difference between the second and the first. The second hand will appear to move much slower than the first hand because your brain is used to seeing the second hand move a lot more than the first. If you were to go on a long trip, you’d get used to the second hand moving a lot more than the first, and you’d have no problem telling the difference between the second and the first hand. It’s kind of like how you don’t notice your nose when you’re driving, but if you look at it for a while, you start to notice it. Your brain is just getting used to seeing the nose move a lot more than the other, and it can’t tell the difference between the two. It’s kind of like that with your eyes.

  • Q: Why do Britain and other English empire countries still bow to monarchs? What real purpose does the queen serve?

    Depends what you define as “purpose”. There is no specific purpose for the British monarch – rather, they are figureheads that serve a role. The role is to have a high degree of influence over a country, though the rules of the monarchy can somewhat vary from country to country, and the ceremony of “kissing the royal corpse” does not always involve kissing the monarch. Whether that’s the only reason for the custom, or if it was the only reason, I can’t say, but that’s the reasoning. When the U.K. was in full power, the monarch was a puppet of Parliament, and the powers of the monarch were transferred to the Prime Minister and thus to the Prime Minister’s deputy, who then became the Prime Minister. Since then, Parliament has been able to vote on legislation that goes through the monarch, although they may still act as the monarch’s representative in negotiating treaties, which can have very very deep consequences. The Queen’s role, as a representative of Britain, doesn’t necessarily involve her formal approval of any of the laws or legislation that goes through Parliament, though.

  • Q: What exactly is fire, in detail? How can light and heat come from something we can’t really touch?

    Fire is the chemical reaction of fuel (oxygen) with (what we call) impurities (ash, soot, oil etc). These impurities are created due to a number of things. If you were to study the chemical composition of a log (or a candle) and analyse it you would discover the chemical formula (A, C, and G) which is a simplified version of what actually happens. There is always fuel, and there is always the ash/soot. Wood is made up of mostly carbon and hydrogen and these are the fuels that create the fire. Hydrogen and oxygen are reactive, and they form water (H2O) and CO2 (CO2). If we split a log into smaller pieces, we start to see the process of combustion as it occurs. The carbon is burned, and with the explosion of CO, H2O and H2CO2, we get water and carbon dioxide (mostly). If we smash a log together, we start to see what happens when we also hit another log with another log…imagine smashing two sand castles together. We see how the sand castles and brick castles form and collapse at different rates…then it starts to burn. The smoke, ash and flames are the actual combustion of fuel.

  • Q: What goes on in those tall tower buildings owned by major banks?

    The actual buildings are not there for profit but for show. There are many real world historical buildings that have little to no effect other than being a symbol of an economic entity or symbol of an era or location, and nothing more. For example look at Sears, Sears didn’t care what went on inside, it was all about the _appearance_ of its location, the prestige of the location, the facilities and so on. It didn’t care about how long it took it to operate, it was about how much people would pay to go see it. Sears was a landmark as a cultural movement and other big companies followed suit, so if you want to see a building you’ve never seen before, you have to go see Sears, just like you have to see a Toyota Camry for Toyota Camry. They used to be all about building new factories, some of them if I recall, but now that they’re bigger, that means that more factory jobs are coming to them. You’ve probably seen them in stores as stores where people buy and sell stuff, so there aren’t that many places for them to come from. Instead, it’s just for show, a symbol of rich people.

Hurdles to Progress in LFQA
However, while the RT system described here tops the public leaderboard, a detailed analysis of the model and the ELI5 dataset reveals some concerning trends.

  • Many held-out questions are paraphrased in the training set. Simply selecting the best answer to a similar training question achieves 27.4 ROUGE-L.

  • Simply retrieving answers to random, unrelated training questions yields relatively high ROUGE-L scores, while actual gold answers underperform model generations.

  • Conditioning answer generation on random documents instead of relevant ones does not measurably impact factual correctness; longer outputs simply get higher ROUGE-L.

We find little to no evidence that the model is actually grounding its text generation in the retrieved documents — fine-tuning an RT model with random retrievals from Wikipedia (i.e., random retrieval + RT) performs nearly as well as the c-REALM + RT model (24.2 vs. 24.4 ROUGE-L). We also find significant overlap in the training, validation, and test sets of ELI5 (with several questions being paraphrases of each other), which may eliminate the need for retrievals. The KILT benchmark measures the quality of retrievals and generations separately, without making sure that the text generation actually uses the retrievals.

Trivial baselines get higher ROUGE-L scores than RAG and BART + DPR.

Moreover, we find issues with the ROUGE-L metric used to evaluate the quality of text generation: trivial nonsensical baselines, such as a random training-set answer and input copying, achieve relatively high ROUGE-L scores (even beating BART + DPR and RAG).
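
For example, the input-copying baseline can be scored with the rouge-score package (a sketch; the strings below are illustrative):

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"])
question = "Why are almost all boats white?"
gold = "Boats are usually painted white because white is cheap and reflects sunlight."
# Score the question itself as if it were the generated answer.
print(scorer.score(gold, question)["rougeL"].fmeasure)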

Conclusion
We proposed a system for long-form question answering based on Routing Transformers and REALM, which tops the KILT leaderboard on ELI5. However, a detailed analysis reveals several issues with the benchmark that preclude using it to inform meaningful modeling advances. We hope the community works together to solve these issues so that researchers can climb the right hills and make meaningful progress on this challenging but important task.

Acknowledgments
The Routing Transformer work has been a team effort involving Aurko Roy, Mohammad Saffar, Ashish Vaswani and David Grangier. The follow-up work on open-domain long-form question answering has been a collaboration involving Kalpesh Krishna, Aurko Roy and Mohit Iyyer. We wish to thank Vidhisha Balachandran, Niki Parmar and Ashish Vaswani for several helpful discussions, and the REALM team (Kenton Lee, Kelvin Guu, Ming-Wei Chang and Zora Tung) for help with their codebase and several useful discussions, which helped us improve our experiments. We are grateful to Tu Vu for help with the QQP classifier used to detect paraphrases in ELI5 train and test sets. We thank Jules Gagnon-Marchand and Sewon Min for suggesting useful experiments on checking ROUGE-L bounds. Finally we thank Shufan Wang, Andrew Drozdov, Nader Akoury and the rest of the UMass NLP group for helpful discussions and suggestions at various stages in the project.

Categories
Misc

Sweden’s AI Catalyst: 300-Petaflops Supercomputer Fuels Nordic Research

A Swedish physician who helped pioneer chemistry 200 years ago just got another opportunity to innovate. A supercomputer officially christened in honor of Jöns Jacob Berzelius aims to establish AI as a core technology of the next century. Berzelius (pronounced behr-zeh-LEE-us) invented chemistry’s shorthand (think H2O) and discovered a handful of elements, including silicon.

Categories
Misc

Flower Identifier

Hi folks, so I have followed this tutorial and I’m pretty new to TensorFlow, but basically what I need to know is: are there any tutorials similar to this one that teach you how to make the model run in an app, but instead of live detection through the camera, it detects from images in the user’s gallery/camera roll? Any links/advice would be great, thanks. https://codelabs.developers.google.com/codelabs/recognize-flowers-with-tensorflow-on-android/#0
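
For what it’s worth, here’s a rough Python sketch (not from the codelab; the file paths and 224x224 input size are guesses) of running a TFLite model on a single image file instead of a live camera feed:

import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path="flowers_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

img = Image.open("gallery_photo.jpg").convert("RGB").resize((224, 224))
x = np.expand_dims(np.asarray(img, dtype=np.float32) / 255.0, axis=0)

interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))  # per-class scores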

submitted by /u/Lostcause89

Categories
Misc

John Snow Labs Spark-NLP 3.0.0: Supporting Spark 3.x, Scala 2.12, more Databricks runtimes, more EMR versions, performance improvements & lots more

submitted by /u/dark-night-rises
Categories
Misc

GTC 21: Top 5 Game Development Technical Sessions

This year at GTC, we have a new track for Game Developers, where you can attend free sessions covering the latest in ray tracing, optimizing game performance, and content creation in NVIDIA Omniverse.

Check out our top sessions below for those working in the gaming industry:

  1. Ray Tracing in Cyberpunk 2077

    Learn how ray tracing was used to create the visuals in the game, and how the developers at CD Projekt RED used extensive ray tracing techniques to bring the bustling Night City to life.

    Evgeny Makarov, Developer Technology Engineer, NVIDIA
    Jakub Knapik, Art Director, CD Projekt RED

  2. Our Sniper Elite 4 Journey – Lessons in Porting AAA Action Games to the Nintendo Switch

    The Asura engine, entirely developed in-house by Rebellion, has allowed the independent developer/publisher maximum creative and technical freedom. Rebellion has overcome enormous technical challenges and built on years of Nintendo development experience to bring their flagship game, “Sniper Elite 4,” to the Switch platform. Learn how a crack team took a AAA game targeting PS4/XB1 and got it running on a Nintendo Switch. Through a journey of Switch releases, you’ll see how Rebellion optimized “Sniper Elite 4” beyond what anyone thought was possible to deliver a beautiful and smooth experience.

    Arden Aspinall, Studio Head, Rebellion North

  3. Ray Tracing in One Weekend

    This presentation assumes the audience knows nothing about ray tracing. It is a guide for the first day in-country. Rather than a broad survey, it digs deep on one way to make great-looking images (the one discussed in the free ebook Ray Tracing in One Weekend). No API or language will be discussed: all pseudocode. There will be no integrals, density functions, derivatives, or other topics inappropriate for polite company.

    Pete Shirley, Distinguished Research Engineer, NVIDIA

  4. LEGO Builder’s Journey: Rendering Realistic LEGO Bricks Using Ray Tracing in Unity

    Learn how we render realistic-looking LEGO dioramas in real time using the Unity High Definition Render Pipeline and ray tracing. Starting from a stylized look, we upgraded the game to use realistic rendering on PC to enhance immersion in the gameplay and story. From lighting and materials to geometry processing and post effects, you’ll get deep insight into what we’ve done to get as close to realism as possible with a small team in a limited time, all while still using the same assets for other versions of the game.

    Mikkel Fredborg, Technical Lead, Light Brick Studio

  5. Introduction to Real Time Ray Tracing with Minecraft

    This talk is aimed at graphics engineers who have little or no experience with ray tracing. It serves as a gentle introduction to many topics, including “What is ray tracing?”, “How many rays do you need to make an image?”, “The importance of [importance] sampling (and more importantly, what is importance sampling?)”, “Denoising”, and “The problem with small bright things”. Along the way, you will learn about specific implementation details from Minecraft.

    Oli Wright, GeForce DevTech, NVIDIA

Visit the GTC website to view the entire Game Development track and to register for the free conference.

Categories
Misc

Researchers Take Steps Towards Autonomous AI-Powered Exoskeleton Legs

University of Waterloo researchers are using deep learning and computer vision to develop autonomous exoskeleton legs to help users walk, climb stairs, and avoid obstacles.

The project, described in an early-access paper in IEEE Transactions on Medical Robotics and Bionics, fits users with wearable cameras. AI software processes the camera’s video stream and is being trained to recognize surrounding features, such as stairs and doorways, and then determine the best movements to take.

“Our control approach wouldn’t necessarily require human thought,” said Brokoslaw Laschowski, Ph.D. candidate in systems design engineering and lead author on the project. “Similar to autonomous cars that drive themselves, we’re designing autonomous exoskeletons that walk for themselves.”

People who rely on exoskeletons for mobility typically operate the devices using smartphone apps or joysticks. 

“That can be inconvenient and cognitively demanding,” said Laschowski, who works with engineering professor John McPhee, the Canada Research Chair in Biomechatronic System Dynamics. “Every time you want to perform a new locomotor activity, you have to stop, take out your smartphone and select the desired mode.”

The researchers are using NVIDIA TITAN GPUs for neural network training and real-time image classification of walking environments. They collected 923,000 images of human locomotion environments to create a database dubbed ExoNet — which was used to train the initial model, developed using the TensorFlow deep learning framework.
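
As an illustration of this kind of setup (a sketch under assumptions, not code from the paper; the class names and input size are made up), a transfer-learning classifier in TensorFlow might look like:

import tensorflow as tf

# Hypothetical walking-environment classes.
CLASSES = ["level_ground", "stairs_up", "stairs_down", "doorway"]

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # train only the new classification head first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])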


Still in development, the exoskeleton system must learn to operate on uneven terrain and avoid obstacles before becoming fully functional. To boost battery life, the team plans to use human motion to help charge the devices.

The recent paper analyzed how the power a person uses to go from a sitting to a standing position could create biomechanical energy usable to charge the robotic exoskeletons.

Read the University of Waterloo news release for more >> 

The researchers’ latest paper is available here. The original paper, published in 2019 at the IEEE International Conference on Rehabilitation Robotics, was a finalist for a best paper award.

Categories
Misc

Creating an MLP in TF, and extracting a single run’s seed.

Lurked Reddit for a while but need some help with something I’m programming. I’m trying to create a multilayer perceptron in TensorFlow – from what I can understand, an MLP is almost like a basic form of neural network that can be built upon to become other networks (adding in convolution layers turns it into a CNN). In TensorFlow/Keras I am creating a Sequential object and then adding layers to it – is this how an MLP is meant to be created with those libraries, or is there a more direct way?
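
For reference, a minimal sketch of that pattern (layer sizes here are arbitrary); as far as I know, a Sequential stack of Dense layers is exactly how a basic MLP is meant to be built in Keras:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),  # hidden layers = the "multilayer" part
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])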

Also, I know that whenever my model is compiled it generates random weight distributions from a seed – is there a way I can extract the seed used from a trained model so I can keep the one that produces the smallest loss value?
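
As far as I know, Keras doesn’t expose the seed it drew after the fact, so the usual pattern is to fix seeds up front and record them yourself (a sketch; 42 is an arbitrary choice):

import tensorflow as tf

tf.random.set_seed(42)  # fixes the global RNG used for weight init
layer = tf.keras.layers.Dense(
    128, activation="relu",
    kernel_initializer=tf.keras.initializers.GlorotUniform(seed=42))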

submitted by /u/Greedy-Snow808

Categories
Misc

MIT intro to deep learning: how to run the exercises locally on 4GB or less of GPU memory

Hello everybody,

that’s my first post here, so please be nice 🙂 I’m totally new to TensorFlow, so this is a beginner’s guide and not a deep dive.

As you may know, the new free MIT Intro to Deep Learning course is online. Some of the models given there are kinda memory hungry, so here’s the solution:

CAUTION: think while copying from online tutorials!

First of all, it is a blessing to work with the tensorflow/tensorflow:latest-gpu Docker container, so yeah, just do it.

First, some dependencies. The notebooks need python3-opencv, and lab 1 needs abcmidi and timidity:

apt install python3-opencv abcmidi timidity 

To edit the code in a personal directory and not in the container, you need a non-root user:

adduser nonroot 

Log in as that user:

su - nonroot 

Install your editor; for me it’s JupyterLab:

pip install jupyterlab 

Start JupyterLab on 0.0.0.0 in the bound directory:

jupyter lab --ip 0.0.0.0 

Add these lines at the top, before importing TensorFlow:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

and these after importing tensorflow as tf:

physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    # Invalid device or cannot modify virtual devices once initialized.
    pass

Tip: add

%config Completer.use_jedi = False 

if you have problems with autocomplete.

I hope that helps somebody!

submitted by /u/deep-and-learning

Categories
Misc

Power Your Big Data Analytics with the Latest NVIDIA GPUs in the Cloud

Dask is an accessible and powerful solution for natively scaling Python analytics. Using familiar interfaces, it allows data scientists comfortable with PyData tools to scale big data workloads easily. Dask is such a powerful tool that we have adopted it throughout a variety of projects at NVIDIA. When paired with RAPIDS, data practitioners can distribute big data workloads across massive NVIDIA GPU clusters.

To make it easier to leverage NVIDIA accelerated compute, we’ve added support for launching RAPIDS + Dask on the latest NVIDIA A100 GPUs in the cloud, allowing users and enterprises to get the most out of their data.

Spin-Up NVIDIA GPU Clusters Quickly with Dask Cloud Provider

While Dask makes scaling analytics workloads easy, distributing workloads in cloud environments can be tricky. Dask-CloudProvider is a package that provides native cloud integration, making it simple to get started on Amazon Web Services, Google Cloud Platform, or Microsoft Azure. Using native cloud tools, data scientists, machine learning engineers, and DevOps engineers can stand up infrastructure and start running workloads in no time.

RAPIDS builds upon Dask-CloudProvider to make spinning up the most powerful NVIDIA GPU instances easy with raw virtual machines. While AWS, GCP, and Azure have great managed services for data scientists, those implementations can take time to adopt new GPU architectures. With Dask-CloudProvider and RAPIDS, users and enterprises can leverage the latest NVIDIA A100 GPUs, providing 20x more performance than the previous generation. With 40GB of GPU memory each and 600GB/s NVLink connectivity, NVIDIA A100 GPUs are a supercharged workhorse for enterprise-scale data science workloads. Dask-CloudProvider and RAPIDS provide an easy way to get started with A100s without having to configure raw VMs from scratch.
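
As a rough sketch of what this looks like (assuming the Dask-CloudProvider EC2 API; the instance type, image tag, and worker settings below are placeholders, so check the RAPIDS and Dask-CloudProvider docs for currently supported values):

from dask.distributed import Client
from dask_cloudprovider.aws import EC2Cluster

cluster = EC2Cluster(
    instance_type="p4d.24xlarge",           # AWS instances with A100 GPUs
    docker_image="rapidsai/rapidsai:latest",
    worker_class="dask_cuda.CUDAWorker",    # one Dask worker per GPU
    n_workers=2,
)
client = Client(cluster)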

RAPIDS strives to make NVIDIA accelerated data science accessible to a broader data-driven audience. With Dask, RAPIDS allows data scientists to solve enterprise-scale problems in less time and with less pain. For a deeper understanding of the latest RAPIDS features and integrations, read more here.

Categories
Offsites

Leveraging Machine Learning for Game Development

Over the years, online multiplayer games have exploded in popularity, captivating millions of players across the world. This popularity has also exponentially increased demands on game designers, as players expect games to be well-crafted and balanced — after all, it’s no fun to play a game where a single strategy beats all the rest.

In order to create a positive gameplay experience, game designers typically tune the balance of a game iteratively:

  1. Stress-test through thousands of play-testing sessions from test users
  2. Incorporate feedback and re-design the game
  3. Repeat 1 & 2 until both the play-testers and game designers are satisfied

This process is not only time-consuming but also imperfect — the more complex the game, the easier it is for subtle flaws to slip through the cracks. Because games often have many different roles that can be played, with dozens of interconnecting skills, it is all the more difficult to hit the right balance.

Today, we present an approach that leverages machine learning (ML) to adjust game balance by training models to serve as play-testers, and demonstrate this approach on the digital card game prototype Chimera, which we’ve previously shown as a testbed for ML-generated art. By running millions of simulations using trained agents to collect data, this ML-based game testing approach enables game designers to more efficiently make a game more fun, balanced, and aligned with their original vision.

Chimera
We developed Chimera as a game prototype that would heavily lean on machine learning during its development process. For the game itself, we purposefully designed the rules to expand the possibility space, making it difficult to build a traditional hand-crafted AI to play the game.

The gameplay of Chimera revolves around the titular chimeras, creature mash-ups that players aim to strengthen and evolve. The objective of the game is to defeat the opponent’s chimera. These are the key points in the game design:

  • Players may play:
    • creatures, which can attack (through their attack stat) or be attacked (against their health stat), or
    • spells, which produce special effects.
  • Creatures are summoned into limited-capacity biomes, which are placed physically on the board space. Each creature has a preferred biome and will take repeated damage if placed on an incorrect biome or a biome that is over capacity.
  • A player controls a single chimera, which starts off in a basic “egg” state and can be evolved and strengthened by absorbing creatures. To do this, the player must also acquire a certain amount of link energy, which is generated from various gameplay mechanics.
  • The game ends when a player has successfully brought the health of the opponent’s chimera to 0.

Learning to Play Chimera
As an imperfect-information card game with a large state space, we expected Chimera to be a difficult game for an ML model to learn, especially as we were aiming for a relatively simple model. We used an approach inspired by earlier game-playing agents like AlphaGo, in which a convolutional neural network (CNN) is trained to predict the probability of a win when given an arbitrary game state. After training an initial model on games where moves were chosen at random, we set the agent to play against itself, iteratively collecting game data that was then used to train a new agent. With each iteration, the quality of the training data improved, as did the agent’s ability to play the game.

The ML agent’s performance against our best hand-crafted AI as training progressed. The initial ML agent (version 0) picked moves randomly.

For the actual game state representation that the model would receive as input, we found that passing an “image” encoding to the CNN resulted in the best performance, beating all benchmark procedural agents and other types of networks (e.g., fully connected). The chosen model architecture is small enough to run on a CPU in reasonable time, which allowed us to download the model weights and run the agent live in a Chimera game client using Unity Barracuda.
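
For illustration, a minimal model in this spirit might look like the following sketch (the 8x8x16 state shape is a made-up placeholder, not Chimera’s actual encoding):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 16)),   # image-like game-state encoding
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(win | state)
])
model.compile(optimizer="adam", loss="binary_crossentropy")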

An example game state representation used to train the neural network.

In addition to making decisions for the game AI, we also used the model to display the estimated win probability for a player over the course of the game.

Balancing Chimera
This approach enabled us to simulate millions more games than real players would be capable of playing in the same time span. After collecting data from games played by the best-performing agents, we analyzed the results to find imbalances between two of the player decks we had designed.

First, the Evasion Link Gen deck was composed of spells and creatures with abilities that generated extra link energy used to evolve a player’s chimera. It also contained spells that enabled creatures to evade attacks. In contrast, the Damage-Heal deck contained creatures of variable strength with spells that focused on healing and inflicting minor damage. Although we had designed these decks to be of equal strength, the Evasion Link Gen deck was winning 60% of the time when played against the Damage-Heal deck.

When we collected various stats related to biomes, creatures, spells, and chimera evolutions, two things immediately jumped out at us:

  1. There was a clear advantage in evolving a chimera — the agent won a majority of the games where it evolved its chimera more than the opponent did. Yet, the average number of evolves per game did not meet our expectations. To make it more of a core game mechanic, we wanted to increase the overall average number of evolves while keeping its usage strategic.
  2. The T-Rex creature was overpowered. Its appearances correlated strongly with wins, and the model would always play the T-Rex regardless of penalties for summoning into an incorrect or overcrowded biome.

From these insights, we made some adjustments to the game. To emphasize chimera evolution as a core mechanism in the game, we decreased the amount of link energy required to evolve a chimera from 3 to 1. We also added a “cool-off” period to the T-Rex creature, doubling the time it took to recover from any of its actions.

Repeating our ‘self-play’ training procedure with the updated rules, we observed that these changes pushed the game in the desired direction — the average number of evolves per game increased, and the T-Rex’s dominance faded.

One example comparison of the T-Rex’s influence before and after balancing. The charts present the number of games won (or lost) when a deck initiates a particular spell interaction (e.g., using the “Dodge” spell to benefit a T-Rex). Left: Before the changes, the T-Rex had a strong influence in every metric examined — highest survival rate, most likely to be summoned ignoring penalties, most absorbed creature during wins. Right: After the changes, the T-Rex was much less overpowered.

By weakening the T-Rex, we successfully reduced the Evasion Link Gen deck’s reliance on an overpowered creature. Even so, the win ratio between the decks remained at 60/40 rather than 50/50. A closer look at the individual game logs revealed that the gameplay was often less strategic than we would have liked. Searching through our gathered data again, we found several more areas to introduce changes in.

To start, we increased the starting health of both players as well as the amount of health that healing spells could replenish. This was to encourage longer games that would allow a more diverse set of strategies to flourish. In particular, this enabled the Damage-Heal deck to survive long enough to take advantage of its healing strategy. To encourage proper summoning and strategic biome placement, we increased the existing penalties on playing creatures into incorrect or overcrowded biomes. And finally, we decreased the gap between the strongest and weakest creatures through minor attribute adjustments.

With the new adjustments in place, we arrived at the final game balance stats for these two decks:

Deck                 Avg # evolves per game    Win % (1M games)
                     (before → after)          (before → after)
Evasion Link Gen     1.54 → 2.16               59.1% → 49.8%
Damage-Heal          0.86 → 1.76               40.9% → 50.2%

Conclusion
Normally, identifying imbalances in a newly prototyped game can take months of playtesting. With this approach, we were able not only to discover potential imbalances but also to introduce tweaks to mitigate them in a span of days. We found that a relatively simple neural network was sufficient to reach high-level performance against humans and traditional game AI. These agents could be leveraged in further ways, such as coaching new players or discovering unexpected strategies. We hope this work will inspire more exploration of the possibilities of machine learning for game development.

Acknowledgements
This project was conducted in collaboration with many people. Thanks to Ryan Poplin, Maxwell Hannaman, Taylor Steil, Adam Prins, Michal Todorovic, Xuefan Zhou, Aaron Cammarata, Andeep Toor, Trung Le, Erin Hoffman-John, and Colin Boswell. Thanks to everyone who contributed through playtesting, advising on game design, and giving valuable feedback.