2023-06-22 ML Study Group

2023/6/19 17:05, 2024/9/10 4:24

Unifying Vision, Text, and Layout for Universal Document Processing (CVPR2023)

New work from Microsoft

Code: https://github.com/microsoft/i-Code/tree/main/i-Code-Doc

Weights for the Encoder + Text decoder are publicly available

https://huggingface.co/ZinengTang/Udop/tree/main
MIT license

The Vision Decoder is apparently planned for release through an Azure API

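Summary

What could prior work not do?

Prior work could not handle the text, image, and layout modalities uniformly in a single consistent representation.

How did they try to solve it?

UDOP tries to solve this by exploiting the spatial correlation between text content and the document image, modeling the image, text, and layout modalities in one unified representation. It uses a new Vision-Text-Layout Transformer and unifies pretraining and multi-domain tasks into a prompt-based sequence generation scheme.

What was achieved as a result?

UDOP is pretrained on a large unlabeled document corpus and a variety of labeled data. It also learns to generate document images from the text and layout modalities. This arguably makes UDOP the first model to achieve high-quality neural document editing and content customization at the same time. UDOP reached state-of-the-art performance on 8 Document AI tasks (document understanding, QA, and so on) across data domains such as financial reports, academic papers, and websites. It also took first place on the Document Understanding Benchmark leaderboard.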

Details

Prior work

Recent models all follow roughly the same vision-language framework

Encode the image with a vision model + text → multimodal encoder (two-tower, three-tower)

DocFormer: End-to-end transformer for document understanding. ICCV, 2021.
UniDoc: Unified pretraining framework for document understanding. NeurIPS, 2021.
SelfDoc: Self-supervised document representation learning. CVPR, 2021.
Going full-TILT boogie on document understanding with text-image-layout transformer. ICDAR, 2021.
StrucTexT: Structured text understanding with multimodal transformers. ACM, 2021.
LayoutLM, LayoutXLM, LayoutLMv2

image + text → joint encoder

LayoutLMv3: Pre-training for document AI with unified text and image masking. arXiv, 2022.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML, 2021.
Perceiver-VL: Efficient vision-and-language modeling with iterative latent attention. arXiv, 2022.

text → encoder

LAMBERT: Layout-aware language modeling for information extraction. ICDAR, 2021.
XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding. CVPR, 2022.
BROS: A pre-trained language model focusing on text and layout for better key information extraction from documents. AAAI, 2022.
StructuralLM: Structural pre-training for form understanding. arXiv, 2022.
LiLT: A simple yet effective language-independent layout transformer for structured document understanding. ACL, 2022.

These models do not fully exploit the strong correlation between modalities
Many of them need task-specific heads, which is inefficient

Proposed model

Unifying Vision, Text, and Layout for Universal Document Processing (UDOP)
Does not treat the image and the text as independent inputs
Vision-Text-Layout (VTL) Transformer

Unified Encoder
Text-Layout Decoder
Vision Decoder

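New self-supervised learning objectives

Trained on 11M unlabeled public documents and 1.8M examples from 11 supervised datasets

Training it as an MAE (masked autoencoder) lets it generate document images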

Vision-Text-Layout Transformer

Split the image into patches of size P x P x C

Encode each patch into a D-dimensional vector to obtain N vectors (N = H/P x W/P)
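
As a quick check of the patch arithmetic above, here is a minimal pure-Python sketch. The image size (224) and patch size (16) are values picked for illustration, not numbers taken from these notes, and the helper names are hypothetical. The projection of each patch to a D-dimensional vector is left out.

```python
def patch_grid(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (rows, cols) of the patch grid, i.e. (H/P, W/P)."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def split_into_patches(image, patch):
    """Split an H x W x C nested list into N patches of P x P x C (row-major)."""
    height, width = len(image), len(image[0])
    rows, cols = patch_grid(height, width, patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            block = [
                row[c * patch:(c + 1) * patch]
                for row in image[r * patch:(r + 1) * patch]
            ]
            patches.append(block)
    return patches


# Assumed example: a blank 224 x 224 RGB image with P = 16.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image, 16)
print(len(patches))  # 196 = (224 / 16) * (224 / 16)
```

Each of the N patches would then be flattened and mapped to D dimensions by the encoder's patch embedding layer.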

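Text is likewise encoded into D-dimensional vectors to obtain M vectors (M is the number of words)
Build an indicator (φ) that says whether the center point of a text bbox falls inside a given image patch
If φ = 1, simply add the text vector and the image vector
Positional embeddings of the bbox are also used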
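
The indicator and the addition step can be sketched as follows. This is only an illustration under assumptions: bboxes are taken to be (x1, y1, x2, y2) in pixel coordinates, patches are indexed row-major, and the function names are hypothetical. The bbox positional embedding is omitted, and since the notes do not say what happens to patches that contain no text, the sketch just reports which patches were matched.

```python
def patch_index(bbox, image_w, image_h, patch):
    """Row-major index of the patch that contains the bbox center (where phi = 1)."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    cols, rows = image_w // patch, image_h // patch
    # Clamp so that a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx // patch), cols - 1)
    row = min(int(cy // patch), rows - 1)
    return row * cols + col


def fuse_text_with_patches(text_vecs, bboxes, patch_vecs, image_w, image_h, patch):
    """Add to each text vector the vector of the patch holding its bbox center."""
    fused, matched = [], set()
    for vec, bbox in zip(text_vecs, bboxes):
        j = patch_index(bbox, image_w, image_h, patch)
        matched.add(j)
        fused.append([t + v for t, v in zip(vec, patch_vecs[j])])
    return fused, matched


# Toy example: a 4 x 4 image with P = 2 gives 4 patches; D = 2.
patch_vecs = [[1, 0], [0, 1], [1, 1], [2, 2]]
text_vecs = [[10, 10]]
bboxes = [(2.2, 0.2, 3.8, 1.0)]  # center (3.0, 0.6) lies in the top-right patch
fused, matched = fuse_text_with_patches(text_vecs, bboxes, patch_vecs, 4, 4, 2)
print(fused)    # [[10, 11]]
print(matched)  # {1}
```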

Vision-Text-Layout Decoder

Vision-Decoder

Masked Auto Encoder

Text-Layout Decoder

Pretraining

Uses both labeled (supervised) and unlabeled data
Task Prompts

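For unlabeled data

Mask 15% of Text and BBox, then generate Text and BBox
Mask 75% of BBox, then generate BBox

Apparently the task became too easy with a smaller masking ratio

Mask 50% of Text, then generate Text
Masked Image Reconstruction with Text and Layout

Cross-attention between learnable per-character embeddings and the output of the Unified Encoder

Character information should help visual generation?
Apparently it improves image generation quality considerably

Two kinds of learnable placeholder embeddings, indicating whether the target image patch is masked or not, are fed into the decoder to generate the image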
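
The placeholder step for masked image reconstruction can be sketched like this. It is a toy illustration only: the two embeddings are fixed lists here, whereas in the model they would be trainable parameters, and the image masking ratio (0.75), the patch count, and all function names are assumptions not stated in these notes. The cross-attention with character embeddings is not shown.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans, True where the patch is masked."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def build_decoder_input(mask, masked_placeholder, visible_placeholder):
    """Pick one of the two placeholder embeddings for every patch position."""
    return [
        list(masked_placeholder) if is_masked else list(visible_placeholder)
        for is_masked in mask
    ]


# Stand-ins for the two learnable placeholder embeddings (D = 4 here).
MASKED = [1.0, 0.0, 0.0, 0.0]
VISIBLE = [0.0, 1.0, 0.0, 0.0]

mask = random_patch_mask(num_patches=196, mask_ratio=0.75)
decoder_input = build_decoder_input(mask, MASKED, VISIBLE)
print(sum(mask), len(decoder_input))  # 147 196
```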

For supervised data

As specified by the Task Prompts

Model architecture

Follows the T5-large encoder-decoder
The Vision Decoder is MAE-large
The tokenizer is the T5 tokenizer
The vocab is extended to include tokens such as <layout_0>
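
One way such layout tokens could be produced from a bounding box is sketched below. Only the `<layout_N>` token format comes from these notes. The layout vocabulary size (500), the assumption that coordinates are normalized to [0, 1], the rounding rule, and the function name are all illustrative assumptions.

```python
def bbox_to_layout_tokens(bbox, layout_vocab_size=500):
    """Quantize normalized (x1, y1, x2, y2) coordinates into layout tokens."""
    tokens = []
    for coord in bbox:
        if not 0.0 <= coord <= 1.0:
            raise ValueError("coordinates must be normalized to [0, 1]")
        # Clamping at the top is a choice made here so that 1.0 stays in range.
        index = min(round(coord * layout_vocab_size), layout_vocab_size - 1)
        tokens.append(f"<layout_{index}>")
    return tokens


print(bbox_to_layout_tokens((0.1, 0.2, 0.5, 0.6)))
# ['<layout_50>', '<layout_100>', '<layout_250>', '<layout_300>']
```

Representing coordinates as ordinary vocabulary entries is what lets the Text-Layout Decoder emit text and bboxes in a single output sequence.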

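Data

IIT-CDIP Test Collection 1.0

Research use is permitted. Commercial use is not explicitly addressed (distributing copies for commercial purposes is not allowed)

Curriculum learning

Gradually move from a small resolution (224) to a large one (1024)
Adam
Already achieves SoTA on the DUE-Benchmark at image size 224, and going up to 1024 improves accuracy further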

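Results

Comparison between the two-tower model and the unified one

It can also do things like edit images

Thoughts

It has not been trained on Japanese, so we would have to prepare a dataset

For now, even just the T + L model is worth trying

Error analysis looks hard for generative models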