Significance of Bridging Real-world Documents and NLP Technologies

Tadayoshi Hara (原 忠義); Goran Topic (トピチ ゴラン); Yusuke Miyao (宮尾 祐介); Akiko Aizawa (相澤 彰子)

https://ipsj.ixsq.nii.ac.jp/records/101884

Name / File	License
IPSJ-NL14217003 (712.2 kB)	Copyright (c) 2014 by the Information Processing Society of Japan
Open access

Item type

SIG Technical Reports(1)

Publication date

2014-06-26

Title

実文書を自然言語処理技術と適切に繋ぐ技術の重要性

Title (English)

Significance of Bridging Real-world Documents and NLP Technologies

Language

jpn

Keywords

Subject scheme

Other

Subject

Syntactic parsing and structural analysis (構文解析・構造解析)

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_18gh

Resource type

technical report

Author affiliation

National Institute of Informatics (国立情報学研究所), listed for each of the four authors

Authors

Tadayoshi Hara; Goran Topic; Yusuke Miyao; Akiko Aizawa

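Abstract (translated from Japanese)

Description type

Other

Description

Many natural language processing (NLP) tools assume plain text as input, whereas text in real-world documents is presented more expressively through diverse layouts, sentence structures, embedded objects, and the like. To analyze such text with an NLP tool, the user must first convert it into an input format that suits the tool. Moreover, input produced by a conversion the user is not used to performing is unlikely to let the tool deliver the performance it is supposed to have. The aim of this work is to raise awareness of this problem, taking as its subject the analysis of XML documents, in which text composition that plain text cannot fully express is represented with tags. We propose a general framework for conversion between XML-tagged text and the plain text that serves as input and output for NLP tools. We show that the text sequences obtained with this framework can be processed by syntactic parsers with higher coverage and higher efficiency than text obtained by simply removing the tags, which highlights the importance of developing technology that properly bridges real-world documents and NLP technologies.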

Abstract (English)

Description type

Other

Description

Most conventional natural language processing (NLP) tools assume plain text as their input, whereas real-world documents display text more expressively, using a variety of layouts, sentence structures, and inline objects, among others. When NLP tools are applied to such text, users must first convert the text into the input/output formats of the tools. Moreover, this awkwardly obtained input typically does not allow the expected maximum performance of the NLP tools to be achieved. This work attempts to raise awareness of this issue using XML documents, where textual composition beyond plain text is given by tags. We propose a general framework for data conversion between XML-tagged text and plain text used as input/output for NLP tools and show that text sequences obtained by our framework can be much more thoroughly and efficiently processed by parsers than naively tag-removed text. These results highlight the significance of bridging real-world documents and NLP technologies.
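
This record does not describe the framework itself, so the following is only a rough illustration of the contrast the abstract draws between naive tag removal and a tag-aware conversion. It is a minimal Python sketch under stated assumptions, not the authors' method: the tag classes `BLOCK_TAGS` and `DETACHED_TAGS`, the function names, and the sample document are all hypothetical, and the sketch covers only the XML-to-plain-text direction, not the mapping of tool output back onto the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag classes; a real document type would need its own lists.
BLOCK_TAGS = {"doc", "p", "title", "li"}   # each starts its own text sequence
DETACHED_TAGS = {"footnote", "figure"}     # pulled out of the running text


def naive_strip(xml_string):
    """Baseline: drop every tag and keep the character data in document order."""
    return "".join(ET.fromstring(xml_string).itertext())


def to_sequences(xml_string):
    """Split XML-tagged text into plain-text sequences.

    Returns a list of (tag, text) pairs, one per block or detached element
    that contains text. Inline tags are dropped but their text stays in the
    surrounding sequence.
    """
    root = ET.fromstring(xml_string)
    sequences = []

    def flush(tag, buf):
        text = " ".join("".join(buf).split())  # normalize whitespace
        if text:
            sequences.append((tag, text))

    def walk(elem, buf):
        # buf collects the text of the sequence currently being built
        if elem.text:
            buf.append(elem.text)
        for child in elem:
            if child.tag in BLOCK_TAGS or child.tag in DETACHED_TAGS:
                own = []               # the child gets a sequence of its own
                walk(child, own)
                flush(child.tag, own)
            else:
                walk(child, buf)       # inline tag: keep text in this sequence
            if child.tail:
                buf.append(child.tail)

    top = []
    walk(root, top)
    flush(root.tag, top)
    return sequences


if __name__ == "__main__":
    sample = (
        "<doc><p>The parser<footnote>See the appendix.</footnote> handles "
        "<b>tagged</b> text.</p><p>Second paragraph.</p></doc>"
    )
    print(naive_strip(sample))
    for tag, text in to_sequences(sample):
        print(tag, "|", text)
```

In this sample, naive stripping splices the footnote into the middle of the sentence and fuses the two paragraphs, while the tag-aware pass yields the footnote and each paragraph as separate sequences that a parser could take one at a time. A detached element is emitted before the sequence that contains it, which is an arbitrary choice of this sketch.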

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN10115061

Bibliographic information

IPSJ SIG Technical Reports: Natural Language Processing (NL) (研究報告自然言語処理（NL）)

Vol. 2014-NL-217, No. 3, pp. 1-9, issued 2014-06-26

Notice

SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc.

Publisher

Information Processing Society of Japan (情報処理学会)
