A Method for Diagnosing the Causes of Performance Issues in Distributed Systems Based on the Interpretability of Machine Learning

Hirofumi Tsuruta; Yuuki Tsubouchi

https://ipsj.ixsq.nii.ac.jp/records/213873

Name / File	License	Action
IPSJ-IOTS2021004.pdf (2.4 MB)	Copyright (c) 2021 by the Information Processing Society of Japan
Open access

Item type

Symposium(1)

Publication date

2021-11-18

Title

分散システムの性能異常に対する機械学習の解釈性に基づく原因診断手法

Title (English)

A Method for Diagnosing the Causes of Performance Issues in Distributed Systems Based on the Interpretability of Machine Learning

Language

jpn

Keywords

Subject scheme

Other

Subject

General Session 2

Resource type

conference paper

Resource type identifier

http://purl.org/coar/resource_type/c_5794

Author affiliations

SAKURA internet Research Center, SAKURA internet Inc.

SAKURA internet Research Center, SAKURA internet Inc. / Graduate School of Informatics, Kyoto University

Authors

Hirofumi Tsuruta
Yuuki Tsubouchi

Abstract

Description type

Other

Description

Distributed systems that make up web services are becoming more complex in order to meet diverse user demands. Systems are also changed more frequently, so their configurations evolve faster. Both factors increase the time system administrators need to diagnose the cause of a performance issue, so a rapid diagnosis method is necessary. Previous methods apply machine learning models to metrics, which are time-series data that indicate system performance. However, these methods use deep learning, which takes a long time to train, so the model must be trained before an anomaly occurs for the diagnosis to be fast. Because the number of metric series fed to the model is fixed, a new model must be trained whenever a change in the system configuration increases or decreases the number of series. This makes it difficult for the diagnosis to keep up with configuration changes. One solution is to use a lightweight machine learning model that can be trained quickly and to train it after an anomaly is detected. However, lightweight models generally have less expressive power than deep learning models, which may lower diagnostic accuracy. Meanwhile, research on the interpretability of machine learning has produced methods for calculating the contribution of each feature to a model's prediction, and these have been shown to be useful for diagnosing causes. In this paper, we propose a method that trains a lightweight machine learning model after an anomaly is detected and diagnoses the cause using the Shapley value, which has attracted attention as an interpretation method. Because the model is trained after detection, the diagnosis always reflects the current configuration, even when the configuration changes frequently. We also investigate whether the Shapley value can improve diagnostic accuracy. Experimental results show that the proposed method ranks the causal metric series first with 44.8% accuracy and within the top 3 with 82.3% accuracy.
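The idea in the abstract (fit a lightweight model only after an anomaly is detected, then rank metric series by their Shapley values) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the "lightweight model" is a per-metric mean and standard deviation, the anomaly score is the Euclidean norm of the z-scores, a metric outside a coalition is replaced by its baseline mean, and the metric names and sample values are hypothetical. Exact enumeration over coalitions is only feasible for a handful of metrics; a real system with many series would likely need an approximation.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def fit_baseline(window):
    """Fit a per-metric (mean, std) pair on recent samples.

    Assumption: this stands in for the 'lightweight model'; it is cheap
    enough to fit after an anomaly has been detected.
    """
    stats = {}
    for name in window[0]:
        values = [row[name] for row in window]
        stats[name] = (mean(values), pstdev(values) or 1.0)  # avoid division by zero
    return stats


def anomaly_score(stats, sample):
    """Euclidean norm of the per-metric z-scores (higher means more abnormal)."""
    return math.sqrt(sum(((sample[m] - mu) / sd) ** 2 for m, (mu, sd) in stats.items()))


def shapley_values(stats, sample):
    """Exact Shapley value of each metric for the anomaly score of `sample`."""
    names = list(stats)
    n = len(names)

    def value(coalition):
        # Metrics outside the coalition are replaced by their baseline mean.
        masked = {m: (sample[m] if m in coalition else stats[m][0]) for m in names}
        return anomaly_score(stats, masked)

    phi = {}
    for target in names:
        others = [m for m in names if m != target]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {target}) - value(s))
        phi[target] = total
    return phi


def rank_metrics(stats, sample):
    """Return metric names ordered from most to least likely cause."""
    phi = shapley_values(stats, sample)
    return sorted(phi, key=phi.get, reverse=True)


# Hypothetical metrics: a window before the anomaly and one anomalous sample.
normal_window = [
    {"cpu": 0.30 + 0.01 * (i % 5), "memory": 0.50 + 0.02 * (i % 3), "latency_ms": 20 + (i % 4)}
    for i in range(60)
]
anomalous_sample = {"cpu": 0.95, "memory": 0.52, "latency_ms": 24}

stats = fit_baseline(normal_window)
phi = shapley_values(stats, anomalous_sample)
ranking = rank_metrics(stats, anomalous_sample)
print(ranking)
```

Because the baseline is refit from whatever metrics are present in the window, adding or removing a metric series changes only the dictionary keys, which mirrors the abstract's point about following configuration changes without a pre-trained, fixed-input model.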

Bibliographic information

Proceedings of the Internet and Operation Technology Symposium (インターネットと運用技術シンポジウム論文集)

Vol. 2021, pp. 24-31, issued 2021-11-18

Publisher

Information Processing Society of Japan (情報処理学会)
