Automatic Extraction of Blog Posts and Comments from Blog Pages

Mitsuo Yoshida; Takashi Inui; Mikio Yamamoto

https://ipsj.ixsq.nii.ac.jp/records/96768

Name / File	License
IPSJ-JNL5412011.pdf (2.7 MB)	Copyright (c) 2013 by the Information Processing Society of Japan
Open access

Item type

Journal(1)

Publication date

2013-12-15

Title

ブログページ集合からのポストおよびコメント自動分離抽出手法

Title (English)

Automatic Extraction of Blog Posts and Comments from Blog Pages

Language

jpn

Keywords

Subject scheme

Other

Subject

[Regular Paper] content extraction, blog, HTML, element identifier

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Author affiliation

Graduate School of Systems and Information Engineering, University of Tsukuba (all three authors)

Authors

Mitsuo Yoshida (吉田 光男), Takashi Inui (乾 孝司), Mikio Yamamoto (山本 幹雄)

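Abstract

Description type

Other

Description

Blog pages contain parts that act as noise for systems that process pages mechanically, such as Web search engines. Using blog content therefore requires a content extraction step. Blog content can further be divided in two: posts, which are written by the blog's author, and comments, which are written by the blog's readers. The coexistence of posts and comments is one of the characteristics of blogs, and systems and studies that exploit those characteristics benefit from having posts and comments extracted separately. In this paper, we propose a method that automatically separates and extracts posts and comments by using a set of blog pages. The method was devised from the observation that, within a blog site containing multiple article pages, posts appear on every article page, whereas comments appear on only some of them. We also ran experiments with software implementing the algorithm, verified its effectiveness on Japanese-language blog sites, and confirmed that content can be separated into posts and comments.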

Abstract (English)

Description type

Other

Description

Content extraction is necessary to use blogs as data for Web search engines, because blog pages contain many noisy parts such as menus, advertisements, and copyright notices. Most blog content is text, and it can be divided into two parts: posts and comments. A post is content written by the blog owner, and a comment is a piece of text written by a reader in response to the owner's post. In this paper, we propose a simple method to extract posts and comments separately from a series of blog pages whose posts are all written by the same owner. The proposed method is based on the assumption that although posts appear in all blog pages, comments do not. We report experimental results on real Web pages from Japanese blog sites that show the proposed method performs well.
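
The assumption stated in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which is described only in the paper itself. It assumes, purely for illustration, that blocks are keyed by a tag name plus an `id` or `class` attribute (a guess prompted by the "element identifier" keyword), and it adds a third rule of its own: a block with identical text on every page is treated as noise. The names `BlockCollector`, `extract_blocks`, `classify`, and `SAMPLE_PAGES` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they are kept off the open-element stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockCollector(HTMLParser):
    """Collect text under the nearest ancestor that has an id or class."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # (tag, identifier or None) per open element
        self.blocks = defaultdict(list)   # identifier -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        if attributes.get("id"):
            ident = f"{tag}#{attributes['id']}"
        elif attributes.get("class"):
            ident = f"{tag}.{attributes['class']}"
        else:
            ident = None
        self.stack.append((tag, ident))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for _, ident in reversed(self.stack):
            if ident:
                self.blocks[ident].append(text)
                break


def extract_blocks(html):
    """Return {identifier: joined text} for one page."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return {ident: " ".join(parts) for ident, parts in collector.blocks.items()}


def classify(pages):
    """Label each identifier as 'post', 'comment', or 'noise' (illustrative rules)."""
    per_page = [extract_blocks(page) for page in pages]
    labels = {}
    for ident in set().union(*per_page):
        texts = [blocks[ident] for blocks in per_page if ident in blocks]
        if len(texts) < len(per_page):
            labels[ident] = "comment"   # present on only some pages
        elif len(set(texts)) == 1:
            labels[ident] = "noise"     # same text everywhere (assumed rule)
        else:
            labels[ident] = "post"      # present on every page, text varies
    return labels


# Three toy article pages from one hypothetical blog site.
SAMPLE_PAGES = [
    '<div id="menu">Home | About</div><div class="entry">First post</div>'
    '<div class="comment">Nice!</div>',
    '<div id="menu">Home | About</div><div class="entry">Second post</div>',
    '<div id="menu">Home | About</div><div class="entry">Third post</div>'
    '<div class="comment">Thanks</div>',
]

for ident, label in sorted(classify(SAMPLE_PAGES).items()):
    print(ident, label)
```

The sketch labels any block missing from at least one page as a comment candidate, so an advertisement shown on only some pages would be mislabeled; a practical system would need further checks that this record does not describe.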

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN00116647

Bibliographic information

情報処理学会論文誌 (IPSJ Journal)

Vol. 54, No. 12, pp. 2502-2512, issued 2013-12-15

ISSN

Source identifier type

ISSN

Source identifier

1882-7764
