Proposal and Evaluation of Improvements for Language-independent Tokenization in Bayesian Spam E-mail Filters

Takuya Fujita; Akiyo Matsumoto; Martin J. Dürst

https://ipsj.ixsq.nii.ac.jp/records/66471

Name / File	License
IPSJ-JNL5009022.pdf (596.5 kB)	Copyright (c) 2009 by the Information Processing Society of Japan
Open access

Item type

Journal(1)

Publication date

2009-09-15

Title (original Japanese)

ベイジアンフィルタにおける言語知識を用いないトークン抽出方式の提案と評価

Title (English)

Proposal and Evaluation of Improvements for Language-independent Tokenization in Bayesian Spam E-mail Filters

Language

jpn

Keyword

Subject scheme

Other

Subject

Special issue: Computer Security Technologies That Invigorate Society

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Author affiliation

Graduate School of Science and Engineering, Aoyama Gakuin University / Presently with Sony Corporation

Author affiliation

Department of Integrated Information Technology, College of Science and Engineering, Aoyama Gakuin University

Author affiliation

Department of Integrated Information Technology, College of Science and Engineering, Aoyama Gakuin University

Author names

Takuya Fujita; Akiyo Matsumoto; Martin J. Dürst

Abstract

Description type

Other

Description

In recent years, Bayesian filters, which apply Bayesian theory to spam filtering, have attracted attention as a means of combating spam e-mail, which has become a social problem. With globalization, spam filters that also work in multilingual environments are in demand, yet Bayesian filters that use no knowledge of language or character encoding have not been studied sufficiently. This paper proposes and evaluates language-independent token extraction methods optimized for Bayesian filters. Specifically, it shows that assigning attributes to tokens based on e-mail structure and using byte-level N-grams of appropriate token length yields a spam filter with discrimination accuracy high enough for practical use. Experiments on several e-mail corpora in different languages compare the proposed methods with existing methods that use knowledge of language or character encoding, and show the effectiveness of the proposed methods.
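
The token extraction described in the abstract (byte-level N-grams, each tagged with an attribute derived from e-mail structure) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the N-gram length of 4, the use of the lowercased header field name or "body" as the attribute, the "*" separator, and the function names byte_ngrams, split_message and extract_tokens are all assumptions made for this example. The record does not state the token length or attribute scheme the paper actually uses.

```python
import re


def byte_ngrams(data: bytes, n: int) -> list[bytes]:
    """Return all overlapping N-grams of n bytes; empty if data is shorter than n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def split_message(raw: bytes) -> tuple[list[tuple[bytes, bytes]], bytes]:
    """Split a raw message into (field name, field value) pairs and the body.

    Works purely on bytes: nothing is decoded, so no knowledge of the
    language or character encoding is needed.
    """
    raw = raw.replace(b"\r\n", b"\n")
    head, _, body = raw.partition(b"\n\n")
    # Unfold header continuation lines (lines that start with space or tab).
    head = re.sub(rb"\n[ \t]+", b" ", head)
    fields = []
    for line in head.split(b"\n"):
        name, sep, value = line.partition(b":")
        if sep:
            fields.append((name.strip().lower(), value.strip()))
    return fields, body


def extract_tokens(raw: bytes, n: int = 4) -> list[bytes]:
    """Return attribute-tagged byte N-gram tokens (n=4 is an assumed default)."""
    fields, body = split_message(raw)
    tokens = []
    for name, value in fields:
        # Attribute = header field name (assumed scheme, hypothetical separator "*").
        tokens += [name + b"*" + gram for gram in byte_ngrams(value, n)]
    tokens += [b"body*" + gram for gram in byte_ngrams(body, n)]
    return tokens


if __name__ == "__main__":
    message = b"Subject: Sale\r\n\r\nBuy now"
    for token in extract_tokens(message):
        print(token)
```

Because the same 4-byte sequence gets a different token in the Subject field than in the body, a Bayesian filter trained on such tokens can learn separate spam probabilities for each location. The same code applies unchanged to messages in any encoding, since multi-byte characters are simply covered by overlapping byte windows.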

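Bibliographic record ID

Source identifier type

NCID

Source identifier

AN00116647

Bibliographic information

IPSJ Journal (情報処理学会論文誌)

Vol. 50, No. 9, pp. 2182-2192, issued 2009-09-15

ISSN

Source identifier type

ISSN

Source identifier

1882-7764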
