Categories
Misc

My network was outputting the same value for every input, so I added BatchNormalization() as the final layer in the model and now it actually changes its output. I feel like I should not do this, but I don’t know why. Anyone know if this is ok?

Adding batch norm to the layer before doesn’t help.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, BatchNormalization, Dropout

def build_model(x_shape):
    model = Sequential()

    # Inputs: a 3D tensor with shape [batch, timesteps, feature].
    # https://keras.io/api/layers/recurrent_layers/lstm/
    dropout = 0.1
    recurrent_dropout = 0  # 0 for cuDNN
    model.add(LSTM(25, input_shape=(x_shape[1:]), activation='tanh',
                   return_sequences=True, dropout=dropout,
                   recurrent_dropout=recurrent_dropout))
    # model.add(BatchNormalization())  # (A)
    model.add(LSTM(15, activation='tanh', return_sequences=False,
                   dropout=dropout, recurrent_dropout=recurrent_dropout))
    # model.add(BatchNormalization())  # (B)
    model.add(Dense(10, activation='gelu'))
    model.add(BatchNormalization())  # (C)
    # model.add(Dropout(0.1))
    model.add(Dense(1, activation='sigmoid'))
    # model.add(BatchNormalization())  # (D)

    loss = tf.keras.losses.MeanAbsoluteError()
    model.compile(loss=loss,
                  metrics=["accuracy", tf.keras.metrics.MeanAbsoluteError()],
                  optimizer=tf.keras.optimizers.Adam())  # lr=LEARN_RATE_LSTM, decay=LEARN_RATE_LSTM_DECAY
    return model
```

submitted by /u/Abradolf–Lincler

Categories
Misc

(Cifar10) Loss continues but validation flat at 66%. Overfit?? More epochs?

submitted by /u/BlakeYerian
Categories
Offsites

Hidden Interfaces for Ambient Computing

As consumer electronics and internet-connected appliances are becoming more common, homes are beginning to embrace various types of connected devices that offer functionality like music control, voice assistance, and home automation. A graceful integration of devices requires adaptation to existing aesthetics and user styles rather than simply adding screens, which can easily disrupt a visual space, especially when they become monolithic surfaces or black screens when powered down or not actively used. Thus there is an increasing desire to create connected ambient computing devices and appliances that can preserve the aesthetics of everyday materials, while providing on-demand access to interaction and digital displays.

Illustration of how hidden interfaces can appear and disappear in everyday surfaces, such as a mirror or the wood paneling of a home appliance.

In “Hidden Interfaces for Ambient Computing: Enabling Interaction in Everyday Materials through High-Brightness Visuals on Low-Cost Matrix Displays”, presented at ACM CHI 2022, we describe an interface technology that is designed to be embedded underneath materials and our vision of how such technology can co-exist with everyday materials and aesthetics. This technology makes it possible to have high-brightness, low-cost displays appear from underneath materials such as textile, wood veneer, acrylic or one-way mirrors, for on-demand touch-based interaction.

Hidden interface prototypes demonstrate bright and expressive rendering underneath everyday materials. From left to right: thermostat under textile, a scalable clock under wood veneer, and a caller ID display and a zooming countdown under mirrored surfaces.

Parallel Rendering: Boosting PMOLED Brightness for Ambient Computing
While many of today’s consumer devices employ active-matrix organic light-emitting diode (AMOLED) displays, their cost and manufacturing complexity are prohibitive for ambient computing. Yet other display technologies, such as E-ink and LCD, do not have sufficient brightness to penetrate materials.

To address this gap, we explore the potential of passive-matrix OLEDs (PMOLEDs), which are based on a simple design that significantly reduces cost and complexity. However, PMOLEDs typically use scanline rendering, where active display driver circuitry sequentially activates one row at a time, a process that limits display brightness and introduces flicker.

Instead, we propose a system that uses parallel rendering, where as many rows as possible are activated simultaneously in each operation by grouping rectilinear shapes of horizontal and vertical lines. For example, a square can be shown with just two operations, in contrast to traditional scanline rendering that needs as many operations as there are rows. With fewer operations, parallel rendering can output significantly more light in each instant to boost brightness and eliminate flicker. The technique is not strictly limited to lines and rectangles even if that is where we see the most dramatic performance increase. For example, one could add additional rendering steps for antialiasing (i.e., smoothing of) non-rectilinear content.
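
For intuition, the sketch below (hypothetical helper functions, not the paper's renderer) decomposes an unfilled rectangle into two simultaneous row/column operations, versus one operation per row under scanline rendering:

```python
# Rough sketch of parallel rendering: each operation drives a set of rows and a
# set of columns at once, lighting every pixel at their intersections. An
# unfilled rectangle then needs two operations instead of one per row.

def unfilled_rect_ops(x, y, w, h):
    """Return (rows, cols) operations for an unfilled w x h rectangle at (x, y)."""
    top, bottom = y, y + h - 1
    left, right = x, x + w - 1
    return [
        ({top, bottom}, set(range(left, right + 1))),   # both horizontal edges at once
        (set(range(top, bottom + 1)), {left, right}),   # both vertical edges at once
    ]

def scanline_rect_ops(x, y, w, h):
    """Scanline equivalent: one operation per row of the rectangle."""
    ops = []
    for row in range(y, y + h):
        cols = set(range(x, x + w)) if row in (y, y + h - 1) else {x, x + w - 1}
        ops.append(({row}, cols))
    return ops

print(len(unfilled_rect_ops(10, 10, 32, 32)))  # 2 operations
print(len(scanline_rect_ops(10, 10, 32, 32)))  # 32 operations
```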

Illustration of scanline rendering (top) and parallel rendering (bottom) operations of an unfilled rectangle. Parallel rendering achieves bright, flicker-free graphics by simultaneously activating multiple rows.

Rendering User Interfaces and Text
We show that hidden interfaces can be used to create dynamic and expressive interactions. With a set of fundamental UI elements such as buttons, switches, sliders, and cursors, each interface can provide different basic controls, such as light switches, volume controls and thermostats. We created a scalable font (i.e., a set of numbers and letters) that is designed for efficient rendering in just a few operations. While we currently exclude letters “k, z, x” with their diagonal lines, they could be supported with additional operations. The per-frame-control of font properties coupled with the high frame rate of the display enables very fluid animations — this capability greatly expands the expressivity of the rectilinear graphics far beyond what is possible on fixed 7-segment LED displays.
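
For illustration, a rectilinear glyph can be stored as a handful of horizontal and vertical strokes on a unit grid and scaled with a single size parameter; the stroke coordinates and helper below are made up for this sketch, not the font described in the paper.

```python
# Hypothetical sketch: a rectilinear "2" as five strokes on a unit grid. Each
# stroke maps to one horizontal or vertical line operation, so the glyph costs
# roughly five operations at any scale.

DIGIT_2 = [
    ("h", 0.0, 0.0, 1.0),  # top bar:     y = 0.0, x from 0.0 to 1.0
    ("v", 1.0, 0.0, 0.5),  # upper-right: x = 1.0, y from 0.0 to 0.5
    ("h", 0.0, 0.5, 1.0),  # middle bar
    ("v", 0.0, 0.5, 1.0),  # lower-left
    ("h", 0.0, 1.0, 1.0),  # bottom bar
]

def scale_glyph(strokes, origin, size):
    """Convert unit-grid strokes into pixel-space line segments."""
    ox, oy = origin
    segments = []
    for kind, a, b, c in strokes:
        if kind == "h":  # horizontal: y = b fixed, x runs from a to c
            segments.append(((ox + a * size, oy + b * size),
                             (ox + c * size, oy + b * size)))
        else:            # vertical: x = a fixed, y runs from b to c
            segments.append(((ox + a * size, oy + b * size),
                             (ox + a * size, oy + c * size)))
    return segments

print(scale_glyph(DIGIT_2, origin=(10, 10), size=24))
```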

In this work, we demonstrate various examples, such as a scalable clock, a caller ID display, a zooming countdown timer, and a music visualizer.

Realizing Hidden Interfaces with Interactive Hardware
To implement proof-of-concept hidden interfaces, we use a PMOLED display with 128×96 resolution that has all row and column drivers routed to a connector for direct access. We use a custom printed circuit board (PCB) with fourteen 16-channel digital-to-analog converters (DACs) to directly interface those 224 lines from a Raspberry Pi 3 A+. The touch interaction is enabled by a ring-shaped PCB surrounding the display with 12 electrodes arranged in arc segments.
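
As a back-of-the-envelope check (the line-to-channel mapping below is hypothetical, not the actual PCB routing), the 128 row lines plus 96 column lines account for exactly the 14 × 16 = 224 DAC channels:

```python
# Hypothetical bookkeeping sketch: the 128 row and 96 column lines of the
# 128x96 PMOLED exactly fill fourteen 16-channel DACs.
ROWS, COLS = 128, 96
DACS, CHANNELS_PER_DAC = 14, 16

assert ROWS + COLS == DACS * CHANNELS_PER_DAC == 224

def line_to_dac(line_index):
    """Map a display line (rows first, then columns) to a (dac, channel) pair."""
    return divmod(line_index, CHANNELS_PER_DAC)

print(line_to_dac(0))    # (0, 0): first row line
print(line_to_dac(223))  # (13, 15): last column line
```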

Comparison to Existing Technologies
We compared the brightness of our parallel rendering to both scanline rendering on the same PMOLED and small and large state-of-the-art AMOLEDs. We tested brightness through six common materials, such as wood and plastic. The material thickness ranged from 0.2 mm for the one-way mirror film to 1.6 mm for basswood. We measured brightness in lux (lx = light intensity as perceived by the human eye) using a light meter near the display. The environmental light was kept dim, slightly above the light meter’s minimum sensitivity. For simple rectangular shapes, we observed a 5–40x brightness increase for the PMOLED in comparison to the AMOLED. The exception was the thick basswood, which didn’t let much light through for any rendering technology.

Example showing performance difference between parallel rendering on the PMOLED (this work) and a similarly sized modern 1.4″ AMOLED.

To validate the findings from our technical characterization with more realistic and complex content, we evaluate the number “2”, a grid of checkboxes, three progress bars, and the text “Good Life”. For this more complex content, we observed a 3.6–9.3x brightness improvement. These results suggest that our approach of parallel rendering on PMOLED enables display through several materials, and outperforms common state-of-the-art AMOLED displays, which seem to not be usable for the tested scenarios.

Brightness experiments with additional shapes that require different numbers of operations (ops). Measurements are shown in comparison to large state-of-the-art AMOLED displays.

What’s Next?
In this work, we enabled hidden interfaces that can be embedded in traditional materials and appear on demand. Our lab evaluation suggests unmet opportunities to introduce hidden displays with simple, yet expressive, dynamic and interactive UI elements and text in traditional materials, especially wood and mirror, to blend into people’s homes.

In the future, we hope to investigate more advanced parallel rendering techniques, using algorithms that could also support images and complex vector graphics. Furthermore, we plan to explore efficient hardware designs. For example, application-specific integrated circuits (ASICs) could enable an inexpensive and small display controller with parallel rendering instead of a large array of DACs. Finally, longitudinal deployment would enable us to go deeper into understanding user adoption and behavior with hidden interfaces.

Hidden interfaces demonstrate how control and feedback surfaces of smart devices and appliances could visually disappear when not in use and then appear when in the user’s proximity or touch. We hope this direction will encourage the community to consider other approaches and scenarios where technology can fade into the background for a more harmonious coexistence with traditional materials and human environments.

Acknowledgements
First and foremost, we would like to thank Ali Rahimi and Roman Lewkow for the collaboration, including providing the enabling technology. We also thank Olivier Bau, Aaron Soloway, Mayur Panchal and Sukhraj Hothi for their prototyping and fabrication contributions. We thank Michelle Chang and Mark Zarich for visual designs, illustrations and presentation support. We thank Google ATAP and the Google Interaction Lab for their support of the project. Finally, we thank Sarah Sterman and Mathieu Le Goc for helpful discussions and suggestions.

Categories
Misc

Tooth Tech: AI Takes Bite Out of Dental Slide Misses by Assisting Doctors

Your next trip to the dentist might offer a taste of AI. Pearl, a West Hollywood startup, provides AI for dental images to assist in diagnosis. It landed FDA clearance last month, the first to get such a go-ahead for dentistry AI. The approval paves the way for its use in clinics across the United States…

The post Tooth Tech: AI Takes Bite Out of Dental Slide Misses by Assisting Doctors appeared first on NVIDIA Blog.

Categories
Misc

GFN Thursday Is Fit for the Gods: ‘God of War’ Arrives on GeForce NOW

The gods must be smiling this GFN Thursday — God of War today joins the GeForce NOW library. Sony Interactive Entertainment and Santa Monica Studios’ masterpiece is available to stream from GeForce NOW servers, across nearly all devices and at up to 1440p and 120 frames per second for RTX 3080 members. Get ready to…

The post GFN Thursday Is Fit for the Gods: ‘God of War’ Arrives on GeForce NOW appeared first on NVIDIA Blog.

Categories
Misc

11+ Best Data Science Books for beginners to advance 2022 (Updated) –

submitted by /u/maneesh123456
Categories
Misc

Question – Neural network: same prediction for different inputs

I am getting the same prediction for different inputs. I am trying to use a regression neural network. Since the data is huge, I am training on one example at a time. Here is a simplified version of my code.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(10000, input_dim=212207, kernel_initializer='normal', activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
model.compile(loss='mean_squared_error', optimizer='adam')

for i in range(10000000):
    # X is an input with 212207 values
    # Y is an output value
    if i < 6000000:
        model.fit(X.transpose(), Y, epochs=30, batch_size=1, verbose=0)
    else:
        prediction = model.predict(X.transpose())
```

I made sure that I am training on different examples and trying predictions on different examples. I am still getting the same prediction value for all testing inputs. I think I made some mistake in defining the model for regression neural network. Can you please check if the code is correct?

submitted by /u/exoplanet_hunter

Categories
Misc

Why There is No Ideal Data Center Network Design

Is there an ideal network design that always works? Let’s find out. This blog covers the pros and cons of pure layer 3, layer 2 only, and VXLAN and EVPN.

Network administrators have a hard job. They are responsible for ensuring connectivity for all users, servers, and applications on their networks. They are often tasked with building a network design before getting application requirements, making a challenging project even more difficult. In these scenarios, it’s logical for networking admins to try to find an ideal network design they can use with any set of applications. 

There is no one-size-fits-all network solution that will work every time, and every design has benefits and drawbacks. In this post, we analyze three different network types that could be perceived as ideal. Then, we describe where each falls short, based on real-world factors.

The candidates are:

  • Pure layer 3
  • Layer 2 only
  • Overlay with VXLAN and EVPN

Ready? Let’s get started.

Pure layer 3 design

Many forward-thinking architects think pure layer 3 (L3) is the ideal design due to its simplicity and reliance on only one protocol stack. All traffic is routed and balanced at the L3 level using equal-cost multipath, and endpoint redundancy is achieved through a natively functional anycast address solution. It’s simple and elegant.
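
To illustrate the load-balancing idea in generic terms (a conceptual sketch, not any particular switch implementation), equal-cost multipath hashes a flow's 5-tuple and uses the result to pick one of several equal-cost next hops, so a given flow stays on one path while different flows spread across all of them:

```python
# Generic sketch of 5-tuple ECMP next-hop selection: hash the flow identifiers
# and index into the list of equal-cost next hops.
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

spines = ["spine1", "spine2", "spine3", "spine4"]
print(ecmp_next_hop("10.0.0.5", "10.0.1.9", 49152, 443, "tcp", spines))
```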

Many large web-scale IT companies choose it for its excellent operational efficiency. It also gives them robust control over their application environment to design applications that work within this design.

Applications that rely on network overlays or pure routing are optimized for the L3 architecture. Whether using a container-based solution that leverages routing as its mechanism to provide access to the environment, or a Container Network Interface to encapsulate the container-to-container communication, these solutions work great on this architecture. 

The advent of SmartNICs and DPUs makes L3 more user-friendly by providing host-based solutions to offload resource-intensive tasks such as storing routing tables, performing packet encapsulation, and doing NAT.

The biggest drawback with L3 is that it doesn’t allow any distribution of layer 2 (L2) adjacency. Over time, most enterprises must introduce an application that requires L2 adjacency, either within or between racks. Historically, developers have been unreliable in writing their applications to handle clustering using L3 capabilities. Instead of using DNS or other L3 discovery processes, many legacy applications use L2 broadcast domains to discover and detect nodes to join the cluster. A pure L3 solution struggles to service software that requires such an environment because each L2 domain is limited to one node or one server.

Layer 2 only design

The L2 only solution is the opposite of pure L3. L2 only primarily leverages VLANs for segregating its connectivity and relies on legacy features like MLAG and spanning tree protocol (STP) to provide a distributed solution. The L2-only solution still has a place in network environments, typically in simple, static environments that don’t require scale. 

People are comfortable with L2, as it uses tried-and-true technologies familiar to most people. It’s simple in the protocol stack, making all forwarding decisions based on only the first two layers of the OSI model. Also, most low-cost network devices on the market are capable of these feature sets.

However, L2 has gaps in scale and performance. Relying on STP across three tiers to prevent loops leads to inefficient redundant paths. To circumvent this limitation in spanning-tree convergence, you can try deploying back-to-back MLAG. MLAG is not as efficient as a pure layer 3 solution at handling device failures and synchronizing control planes, however. L2 networks also tend to be limited by broadcast and multicast traffic. These are just a few of the limitations that create a hidden cost of ownership around deploying an L2-only design.

Overlay design: VXLAN and EVPN

The most common design in the enterprise data center is VXLAN as the transport layer encapsulation technology, with EVPN as the control plane technology. This architecture provides the greatest flexibility, with all the benefits of a pure layer 3 solution, and provides the network administrator the adaptability to support applications requiring L2 to function. 

It provides the benefits of L2 adjacency without introducing inefficient protocols such as STP and MLAG. Leveraging EVPN as the L2 control plane and multihoming as the optimal alternative to MLAG, overlay solutions solve many inefficiencies with L2.

A one-size-fits-all solution like VXLAN and EVPN could be thought of as ideal, but even this has drawbacks. Its detractors point to the multiple layered protocols required to make it operate. The solution builds on a BGP-enabled underlay with EVPN configured between the tunnel endpoints. VXLAN tunnels are configured on top of the overlay with varying levels of complexity depending on tenancy requirements. This may include integrating with VRFs, introducing L3 VNIs for intersubnet communication, and relying on border leafs for intertenant communication through VRF route leaking. Combining all these technologies can create a level of complexity that makes troubleshooting and operations difficult.
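
To make the encapsulation layer itself concrete, here is a minimal sketch of the VXLAN header defined in RFC 7348 (in a real fabric this is built in switch or DPU hardware, and the VNI value below is arbitrary):

```python
# Minimal sketch of VXLAN encapsulation: an 8-byte header carrying a 24-bit VNI
# is prepended to the inner Ethernet frame, and the result is sent as a UDP
# datagram to port 4789 on the remote VTEP.
import struct

VXLAN_UDP_PORT = 4789

def vxlan_header(vni: int) -> bytes:
    """Flags byte (I bit set) + 24 reserved bits, then 24-bit VNI + 8 reserved bits."""
    assert 0 <= vni < 2**24
    flags_and_reserved = 0x08 << 24  # I flag in the top byte
    vni_and_reserved = vni << 8
    return struct.pack("!II", flags_and_reserved, vni_and_reserved)

def encapsulate(inner_ethernet_frame: bytes, vni: int) -> bytes:
    """UDP payload for the underlay: VXLAN header followed by the inner frame."""
    return vxlan_header(vni) + inner_ethernet_frame

payload = encapsulate(b"\x00" * 64, vni=10042)  # VNI chosen arbitrarily
print(len(payload), payload[:8].hex())
```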

Conclusion

Everything has tradeoffs, whether you’re sacrificing operational complexity for network simplicity or trading application control for flexibility. The upside of accepting that there is no perfect network design is that you are now free to pick and choose the architectures and workflow that best fit your network. Work with your application and infrastructure teams to identify the server requirements, optimize your workflows, and select the best solutions for your applications’ needs.

Learn about NVIDIA Networking Solutions.

Categories
Misc

Scaling AI for Financial Services with Full-Stack Solutions to Implementation Challenges

To scale AI for financial services, companies must face several resource hurdles. The full-stack solution from NVIDIA and VMware helps you leverage the competitive advantages of AI.

AI adoption has grown rapidly over the past few years due to its ability to automate repetitive tasks and increase revenue opportunities. Yet many companies still struggle with how to meaningfully scale AI in financial services. Increased data needs and a lack of internal talent are a few issues.

In this post, we provide a landscape overview of AI use cases and highlight some key scaling challenges, along with how leading banks and financial institutions are using end-to-end solutions to mobilize AI adoption.

What drives the use of AI in financial services?

The power of AI to solve complex business problems is widely recognized in the financial services industry, an early adopter of data science. Banking is among the top three industries in annual investment in big data and analytics. An NVIDIA survey found that 83% of C-suite leaders, IT architects, and developers in financial institutions believe that AI is vital to their future success.

It is easy to understand why: Business Insider estimates that the potential for AI-driven cost savings for banks will reach $447 billion by 2023. Across consumer finance and capital market sectors, firms are looking for ways to mine that potential.

Improved fraud detection and prevention rank highest among proven AI use cases, especially with online-payment fraud losses expected to hit $48 billion annually by 2023.

Other business drivers include maximizing client returns through portfolio optimization or creating personalized customer service through call center transcription. The list continues to grow.

A recent paper by McKinsey and Company underscores the coming sea change: to thrive, banks need to operate “AI first.” Therefore, realizing AI’s value does not come from single projects or teams but instead requires collaboration among data engineers, data scientists, and app developers across the financial institution, all supported by an AI-enabled IT infrastructure.

Finance industry use cases

The quickest returns often stem from improved fraud detection and prevention, the most cited use of AI in financial services. AI is also beneficial for creating efficiencies and reducing compliance costs related to anti-money laundering efforts and know-your-customer regulations.

When combining risk management with profit maximization, investment firms increasingly look to AI for building algorithmic trading systems, which are powered by advances in GPUs and cloud computing. Analysts involved in asset management and portfolio optimization can use natural language processing (NLP) to extract more relevant information from unstructured data faster than ever before, reducing the need for labor-intensive research.
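
As a hedged illustration of that kind of workflow (a generic Hugging Face pipeline with its default model as a placeholder, not any firm's production system), named-entity recognition can pull company names and other entities out of unstructured text such as filings or news:

```python
# Sketch only: generic NER over unstructured financial text. A real deployment
# would substitute a domain-tuned model; the pipeline's default model is just a
# stand-in, and the example sentence is fictitious.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")

text = ("Acme Capital raised its price target on Example Corp after the "
        "company reported stronger-than-expected quarterly revenue.")

for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```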

Within consumer finance, NLP helps retail banks gain a better understanding of customer needs and identify new marketing opportunities through call center transcription. Along with chatbots and virtual assistants, transcription falls within the larger category of conversational AI, which helps personalize service by creating 360° views of customers.

Another use of AI within customer engagement is recommendation engines. This technology improves cross-selling by leveraging credit usage, scoring, and balances to suggest fitting financial products.

Key challenges of scaling AI

Banking, trading, and online-payment firms that do not adopt AI and machine learning (ML) risk being left behind, but even companies that embrace the technology face hurdles. In fact, almost half of AI projects never make it to production.

Elevated costs

One fundamental concern: AI demands exceed the existing IT infrastructure for most companies. The tooling that data scientists require for AI workloads is often difficult to deploy onto legacy systems, and 28% of IT pros rate the lack of infrastructure as the primary obstacle to adoption. The proliferation of AI apps is also making IT management increasingly difficult.

Shadow IT

Financial institutions that invest in AI only within a research lab, innovation team, or line of business frequently have data scientists operating on customized bare-metal infrastructure. This setup produces silos in data centers and leads to shadow IT existing outside departmental IT control.

Given that structure, it is easy to see why 71% of banking executives struggle with how to scale AI, and 32% of IT pros rank data silos and data complexity as the top barrier to adoption.

These silos further hinder productivity. Rather than focusing on training ML models, data scientists spend more time as de facto IT managers for their AI fiefdoms.

Furthermore, if that infrastructure does not include accelerated computing and the parallel processing power of GPUs, deep learning models can take days or weeks to train, thereby wasting precious time for data scientists. It is like hiring world-class race car drivers and equipping them with Yugos.

Unused infrastructure

As expected, costs begin to spiral. Disparate IT increases operational overhead. Inefficient utilization of infrastructure from silos results in higher costs per workload when compared with the cloudlike efficiency achieved by allocating resources on demand through virtualization. Expensive data scientists are forced to manage IT, and ROI slows. AI projects languish.

AI-driven platforms offered by NVIDIA and VMware

So how can financial institutions fully adopt AI, scale what they have implemented, and maximize its value?

Built on NVIDIA GPUs and VMware’s vSphere with Tanzu, NVIDIA AI Enterprise offers an end-to-end platform with a suite of data science tools and frameworks to support diverse AI applications. Crucially, it also helps reduce time to ROI while overcoming problems posed by ad hoc implementations.

NVIDIA and VMware’s full-stack solution is interoperable from the hardware to the application layer, providing a single platform for both financial services apps and AI workloads. It eliminates separate AI infrastructure, as it is developed on common NVIDIA-Certified Systems with NVIDIA GPUs and VMware vSphere with Tanzu, a cloud computing platform that already exists in most data centers.

This enterprise suite offers the AI tooling of TensorFlow, PyTorch, and RAPIDS, while vSphere’s Tanzu Kubernetes Grid can run containerized apps orchestrated with Kubernetes alongside virtualized apps, creating consistent infrastructure across all apps and AI-enabled workloads.

That consistency pulls data science out of silos and brings deployment into the mainstream data center, joining the IT team with data scientists and developers. Expensive devices like GPUs and accelerators can be shared using NVIDIA virtual GPUs, driving down the total cost of ownership while providing the speed-up needed to train ML models quickly. Meanwhile, data scientists are freed from IT management to focus on their specialized work.

In other words, IT leads get simplicity, manageability, easy scaling, and deployment; data scientists are supported with infrastructure and tooling that facilitates their core work; and business executives see quicker and greater ROI.

Summary

These applications merely scratch the surface of what is possible with AI in financial services, especially as the technology matures. To scale AI for financial services, companies can increase resource capacity, hire domain-specific expertise, and find innovative ways to govern their data.

The full-stack solution from NVIDIA and VMware provides a platform to leverage the competitive advantages of AI across entire firms while overcoming common implementation challenges.

For more information about how NVIDIA AI Enterprise can activate those uses, see the NVIDIA AI-Ready Enterprise Platform for Financial Services solution brief and the AI and HPC solutions for financial services page.

See what the platform can do for you with immediate trial access to the NVIDIA AI Enterprise software suite LaunchPad, which runs on private accelerated computing infrastructure in VMware vSphere environments and includes a curated set of hands-on labs for AI practitioners and IT staff.

Categories
Offsites

FormNet: Beyond Sequential Modeling for Form-Based Document Understanding

Form-based document understanding is a growing research topic because of its practical potential for automatically converting unstructured text data into structured information to gain insight about a document’s contents. Recent sequence modeling, which is a self-attention mechanism that directly models relationships between all words in a selection of text, has demonstrated state-of-the-art performance on natural language tasks. A natural approach to handle form document understanding tasks is to first serialize the form documents (usually in a left-to-right, top-to-bottom fashion) and then apply state-of-the-art sequence models to them.

However, form documents often have more complex layouts that contain structured objects, such as tables, columns, and text blocks. Their variety of layout patterns makes serialization difficult, substantially limiting the performance of strict serialization approaches. These unique challenges in form document structural modeling have been largely underexplored in literature.

An illustration of the form document information extraction task using an example from the FUNSD dataset.

In “FormNet: Structural Encoding Beyond Sequential Modeling in Form Document Information Extraction”, presented at ACL 2022, we propose a structure-aware sequence model, called FormNet, to mitigate the sub-optimal serialization of forms for document information extraction. First, we design a Rich Attention (RichAtt) mechanism that leverages the 2D spatial relationship between word tokens for more accurate attention weight calculation. Then, we construct Super-Tokens (tokens that aggregate semantically meaningful information from neighboring tokens) for each word by embedding representations from their neighboring tokens through a graph convolutional network (GCN). Finally, we demonstrate that FormNet outperforms existing methods, while using less pre-training data, and achieves state-of-the-art performance on the CORD, FUNSD, and Payment benchmarks.

FormNet for Information Extraction
Given a form document, we first use the BERT-multilingual vocabulary and optical character recognition (OCR) engine to identify and tokenize words. We then feed the tokens and their corresponding 2D coordinates into a GCN for graph construction and message passing. Next, we use Extended Transformer Construction (ETC) layers with the proposed RichAtt mechanism to continue to process the GCN-encoded structure-aware tokens for schema learning (i.e., semantic entity extraction). Finally, we use the Viterbi algorithm, which finds a sequence that maximizes the posterior probability, to decode and obtain the final entities for output.
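
The final decoding step is standard Viterbi; a minimal sketch (with toy score matrices standing in for FormNet's actual per-token entity scores) looks like this:

```python
# Minimal Viterbi sketch: given per-token label scores (emissions) and label
# transition scores, recover the single highest-scoring label sequence.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, L) scores; transitions: (L, L) scores. Returns the best path."""
    T, L = emissions.shape
    score = emissions[0].copy()            # best score ending in each label so far
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)  # best previous label for each current label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(6, 4)), rng.normal(size=(4, 4))))  # toy scores
```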

Extended Transformer Construction (ETC)
We adopt ETC as the FormNet model backbone. ETC scales to relatively long inputs by replacing standard attention, which has quadratic complexity, with a sparse global-local attention mechanism that distinguishes between global and long input tokens. The global tokens attend to and are attended by all tokens, but the long tokens attend only locally to other long tokens within a specified local radius, reducing the complexity so that it is more manageable for long sequences.
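
A rough sketch of that sparsity pattern as a boolean attention mask (illustrative only, not the actual ETC implementation):

```python
# Sketch of ETC-style global-local sparsity: global tokens attend to (and are
# attended by) everything, while long tokens attend only within a local radius
# of other long tokens.
import numpy as np

def etc_attention_mask(num_global, num_long, local_radius):
    n = num_global + num_long
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_global, :] = True   # global tokens attend to all tokens
    mask[:, :num_global] = True   # all tokens attend to global tokens
    for i in range(num_long):
        lo = max(0, i - local_radius)
        hi = min(num_long, i + local_radius + 1)
        mask[num_global + i, num_global + lo:num_global + hi] = True
    return mask

print(etc_attention_mask(num_global=2, num_long=8, local_radius=2).astype(int))
```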

Rich Attention
Our novel architecture, RichAtt, avoids the deficiencies of absolute and relative embeddings by avoiding embeddings entirely. Instead, it computes the order of and log distance between pairs of tokens with respect to the x and y axes on the layout grid, and adjusts the pre-softmax attention scores of each pair as a direct function of these values.

In a traditional attention layer, each token representation is linearly transformed into a Query vector, a Key vector, and a Value vector. A token “looks” for other tokens from which it might want to absorb information (i.e., attend to) by finding the ones with Key vectors that create relatively high scores when matrix-multiplied (called Matmul) by its Query vector and then softmax-normalized. The token then sums together the Value vectors of all other tokens in the sentence, weighted by their score, and passes this up the network, where it will normally be added to the token’s original input vector.

However, other features beyond the Query and Key vectors are often relevant to the decision of how strongly a token should attend to another given token, such as the order they’re in, how many other tokens separate them, or how many pixels apart they are. In order to incorporate these features into the system, we use a trainable parametric function paired with an error network, which takes the observed feature and the output of the parametric function and returns a penalty that reduces the dot product attention score.

The network uses the Query and Key vectors to consider what value some low-level feature (e.g., distance) should take if the tokens are related, and penalizes the attention score based on the error.

At a high level, for each attention head at each layer, FormNet examines each pair of token representations, determines the ideal features the tokens should have if there is a meaningful relationship between them, and penalizes the attention score according to how different the actual features are from the ideal ones. This allows the model to learn constraints on attention using logical implication.
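
A simplified sketch of that idea (the linear predictor and squared-error penalty below are stand-ins for the paper's learned parametric and error networks, and a single combined distance is used rather than separate x and y features):

```python
# Illustrative RichAtt-style penalty: predict an "ideal" log-distance for each
# token pair from its query/key vectors, compare it with the observed
# log-distance on the layout grid, and subtract the squared error from the
# pre-softmax attention score.
import numpy as np

def rich_attention_scores(Q, K, xy, w_pred):
    """Q, K: (T, d) query/key matrices; xy: (T, 2) token coordinates on the page."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # standard dot-product scores
    diffs = xy[:, None, :] - xy[None, :, :]
    observed = np.log1p(np.linalg.norm(diffs, axis=-1))  # observed pairwise log-distance
    predicted = (Q @ w_pred)[:, None] + (K @ w_pred)[None, :]
    penalty = (observed - predicted) ** 2                # error-network stand-in
    return scores - penalty                              # penalized pre-softmax scores

rng = np.random.default_rng(0)
T, d = 5, 8
out = rich_attention_scores(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                            rng.uniform(0, 100, size=(T, 2)), rng.normal(size=d))
print(out.shape)  # (5, 5) penalized attention scores
```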

A visualization of how RichAtt might act on a sentence. There are three adjectives that the word “crow” might attend to. “Lazy” is to the right, so it probably does not modify “crow” and its attention edge is penalized. “Sly” is many tokens away, so its attention edge is also penalized. “Cunning” receives no significant penalties, so by process of elimination, it is the best candidate for attention.

Furthermore, if one assumes that the softmax-normalized attention scores represent a probability distribution, and the distributions for the observed features are known, then this algorithm — including the exact choice of parametric functions and error functions — falls out algebraically, meaning FormNet has a mathematical correctness to it that is lacking from many alternatives (including relative embeddings).

Super-Tokens by Graph Learning
The key to sparsifying attention mechanisms in ETC for long sequence modeling is to have every token only attend to tokens that are nearby in the serialized sequence. Although the RichAtt mechanism empowers the transformers by taking the spatial layout structures into account, poor serialization can still block significant attention weight calculation between related word tokens.

To further mitigate the issue, we construct a graph to connect nearby tokens in a form document. We design the edges of the graph based on strong inductive biases so that they have higher probabilities of belonging to the same entity type. For each token, we obtain its Super-Token embedding by applying graph convolutions along these edges to aggregate semantically relevant information from neighboring tokens. We then use these Super-Tokens as an input to the RichAtt ETC architecture. This means that even though an entity may get broken up into multiple segments due to poor serialization, the Super-Tokens learned by the GCN will have retained much of the context of the entity phrase.
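
A minimal sketch of that aggregation step (a single mean-aggregation graph convolution with toy adjacency and weights, not FormNet's actual GCN):

```python
# Minimal "Super-Token" sketch: one graph-convolution step that averages each
# token's embedding with its neighbors' (self-loops included) and applies a
# learned projection with a ReLU.
import numpy as np

def super_tokens(token_embeddings, adjacency, weight):
    """token_embeddings: (T, d); adjacency: (T, T) of 0/1; weight: (d, d_out)."""
    A = adjacency + np.eye(adjacency.shape[0])            # add self-loops
    A = A / A.sum(axis=1, keepdims=True)                  # mean aggregation
    return np.maximum(A @ token_embeddings @ weight, 0)   # ReLU(GCN layer)

T, d = 6, 16
rng = np.random.default_rng(0)
adj = np.zeros((T, T))
for i in range(T - 1):                                    # chain nearby tokens
    adj[i, i + 1] = adj[i + 1, i] = 1
print(super_tokens(rng.normal(size=(T, d)), adj, rng.normal(size=(d, d))).shape)  # (6, 16)
```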

An illustration of the word-level graph, with blue edges between tokens, of a FUNSD document.

Key Results
The Figure below shows model size vs. F1 score (the harmonic mean of the precision and recall) for recent approaches on the CORD benchmark. FormNet-A2 outperforms the most recent DocFormer while using a model that is 2.5x smaller. FormNet-A3 achieves state-of-the-art performance with a 97.28% F1 score. For more experimental results, please refer to the paper.

Model Size vs. Entity Extraction F1 Score on CORD benchmark. FormNet significantly outperforms other recent approaches in absolute F1 performance and parameter efficiency.

We study the importance of RichAtt and Super-Token by GCN on the large-scale masked language modeling (MLM) pre-training task across three FormNets. Both RichAtt and GCN components improve upon the ETC baseline on reconstructing the masked tokens by a large margin, showing the effectiveness of their structural encoding capability on form documents. The best performance is obtained when incorporating both RichAtt and GCN.

Performance of the Masked-Language Modeling (MLM) pre-training. Both the proposed RichAtt and Super-Token by GCN components improve upon ETC baseline by a large margin, showing the effectiveness of their structural encoding capability on large-scale form documents.

Using BertViz, we visualize the local-to-local attention scores for specific examples from the CORD dataset for the standard ETC and FormNet models. Qualitatively, we confirm that the tokens attend primarily to other tokens within the same visual block for FormNet. Moreover, for that model, specific attention heads attend to tokens that are aligned horizontally, which is a strong signal of meaning for form documents. No clear attention pattern emerges for the ETC model, suggesting that the RichAtt and Super-Token by GCN components enable the model to learn the structural cues and leverage layout information effectively.

The attention scores for ETC and FormNet (ETC+RichAtt+GCN) models. Unlike the ETC model, the FormNet model makes tokens attend to other tokens within the same visual blocks, along with tokens aligned horizontally, thus strongly leveraging structural cues.

Conclusion
We present FormNet, a novel model architecture for form-based document understanding. We determine that the novel RichAtt mechanism and Super-Token components help the ETC transformer excel at form understanding in spite of sub-optimal, noisy serialization. We demonstrate that FormNet recovers local syntactic information that may have been lost during text serialization and achieves state-of-the-art performance on three benchmarks.

Acknowledgements
This research was conducted by Chen-Yu Lee, Chun-Liang Li, Timothy Dozat, Vincent Perot, Guolong Su, Nan Hua, Joshua Ainslie, Renshen Wang, Yasuhisa Fujii, and Tomas Pfister. Thanks to Evan Huang, Shengyang Dai, and Salem Elie Haykal for their valuable feedback, and Tom Small for creating the animation in this post.