2023-12-08 ML Study Group

2023/12/6 21:29, 2024/6/11 9:30

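FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction / Summary / Motivation / Proposed Method / Rich Attention / Super-Tokens / Experiments / Datasets / Model Architecture / Experimental Setup / Experimental Results / Ablation Study / Impressions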

FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction

Chen-Yu Lee, Chun-Liang Li, Timothy Dozat, Vincent Perot, Guolong Su, Nan Hua, Joshua Ainslie, Renshen Wang, Yasuhisa Fujii, Tomas Pfister

Google Cloud AI Research, Google Research

ACL 2022

Summary

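Sequence models have also shown strong performance on document understanding tasks, but it is difficult to correctly arrange (serialize) into a single sequence the tokens of form-like documents, whose layout patterns vary widely and can include tables.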
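To make the serialization problem concrete, here is a minimal sketch of a naive reading-order sort. The `Token` type, the coordinates, and the form contents are made up for illustration and do not come from the paper. On a small two-column form where each value sits below its key, sorting top-to-bottom and then left-to-right separates every key from its value.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    x: float  # left edge of the token's bounding box (hypothetical units)
    y: float  # top edge of the token's bounding box


def serialize_row_major(tokens: List[Token]) -> List[str]:
    """Order tokens top-to-bottom, then left-to-right within a row."""
    return [t.text for t in sorted(tokens, key=lambda t: (t.y, t.x))]


# Hypothetical form: two columns, each with a key on top and its value below.
tokens = [
    Token("Name:", 0, 0), Token("Date:", 100, 0),
    Token("Alice", 0, 20), Token("2023-12-08", 100, 20),
]

print(serialize_row_major(tokens))
# ['Name:', 'Date:', 'Alice', '2023-12-08']
```

In the resulting sequence "Name:" is no longer adjacent to "Alice", even though the two are direct neighbors on the page.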

FormNet: proposes a structure-aware sequence model that mitigates the problem of suboptimal serialization of form-like documents

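Rich Attention: an attention mechanism whose score computation uses the spatial relationships between tokens in the form
Super-Tokens: per-word embeddings that incorporate information from neighboring tokens via graph convolution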
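The Super-Tokens idea can be illustrated with a rough sketch. It rests on assumptions that are not in these notes: the graph links each token to its `k` spatially nearest tokens, and the "convolution" is one unweighted mean-aggregation step with no learned parameters. The function names and the toy data are hypothetical, and the paper's actual graph construction and layers may differ.

```python
import math
from typing import List, Tuple

Vector = List[float]


def k_nearest_neighbors(centers: List[Tuple[float, float]], k: int) -> List[List[int]]:
    """For each token, return the indices of its k spatially closest tokens."""
    neighbors = []
    for i, (xi, yi) in enumerate(centers):
        others = [j for j in range(len(centers)) if j != i]
        others.sort(key=lambda j: math.hypot(centers[j][0] - xi, centers[j][1] - yi))
        neighbors.append(others[:k])
    return neighbors


def mean_aggregate(embeddings: List[Vector], neighbors: List[List[int]]) -> List[Vector]:
    """One message-passing step: average each token's vector with its neighbors'."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [embeddings[i]] + [embeddings[j] for j in nbrs]
        dim = len(embeddings[i])
        out.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return out


# Hypothetical 2x2 form: keys on the top row, each value directly below its key.
centers = [(0, 0), (100, 0), (0, 20), (100, 20)]
embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 3.0]]

neighbors = k_nearest_neighbors(centers, k=1)
super_tokens = mean_aggregate(embeddings, neighbors)
print(neighbors)     # [[2], [3], [0], [1]]
print(super_tokens)  # [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]
```

Each key ends up sharing a representation with the value beneath it, regardless of where the two land in the serialized sequence.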

The expectation is that the model recovers local syntactic information, expressed through spatial relationships, that simple serialization loses
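As a rough illustration of attention that takes spatial relationships into account, the sketch below subtracts a log-distance penalty from ordinary scaled dot-product scores. This is a hand-set stand-in, not the Rich Attention formulation itself, which these notes do not spell out. The function name `spatial_attention`, the `distance_weight` parameter, and the toy inputs are all assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def spatial_attention(
    queries: List[Vector],
    keys: List[Vector],
    centers: List[Tuple[float, float]],
    distance_weight: float = 1.0,
) -> List[List[float]]:
    """Row-wise softmax of q.k / sqrt(d) minus distance_weight * ln(1 + distance)."""
    dim = len(queries[0])
    weights = []
    for q, (qx, qy) in zip(queries, centers):
        scores = []
        for k, (kx, ky) in zip(keys, centers):
            content = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            distance = math.hypot(kx - qx, ky - qy)
            scores.append(content - distance_weight * math.log1p(distance))
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 2x2 form; identical embeddings, so only position affects the weights.
centers = [(0, 0), (100, 0), (0, 20), (100, 20)]
vectors = [[1.0, 1.0] for _ in centers]
weights = spatial_attention(vectors, vectors, centers)

# The token at (0, 0) attends more to the token directly below it (index 2)
# than to the token on the same row but far to the right (index 1).
print(weights[0][2] > weights[0][1])  # True
```

With content scores held equal, the weights depend only on page distance, so attention follows the layout and not the order in which the tokens were serialized.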

In experiments, FormNet outperforms existing methods with a smaller model and a smaller pretraining dataset, without using image features

Benchmarks used: CORD, FUNSD, and Payment

Abstract

Sequence modeling has demonstrated state-of-the-art performance on natural language and document understanding tasks. However, it is challenging to correctly serialize tokens in form-like documents in practice due to their variety of layout patterns. We propose FormNet, a structure-aware sequence model to mitigate the suboptimal serialization of forms. First, we design Rich Attention that leverages the spatial relationship between tokens in a form for more precise attention score calculation. Second, we construct Super-Tokens for each word by embedding representations from their neighboring tokens through graph convolutions. FormNet therefore explicitly recovers local syntactic information that may have been lost during serialization. In experiments, FormNet outperforms existing methods with a more compact model size and less pretraining data, establishing new state-of-the-art performance on CORD, FUNSD and Payment benchmarks.