Human-in-the-loop Outlier Detection

Chengliang Chai; Lei Cao; Guoliang Li; Jian Li; Yuyu Luo; Samuel Madden

doi:10.1145/3318464.3389772

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

48 Citations (Scopus)

Abstract

Outlier detection is critical to a large number of applications from finance fraud detection to health care. Although numerous approaches have been proposed to automatically detect outliers, such outliers detected based on statistical rarity do not necessarily correspond to the true outliers to the interest of applications. In this work, we propose a human-in-the-loop outlier detection approach HOD that effectively leverages human intelligence to discover the true outliers. There are two main challenges in HOD. The first is to design human-friendly questions such that humans can easily understand the questions even if humans know nothing about the outlier detection techniques. The second is to minimize the number of questions. To address the first challenge, we design a clustering-based method to effectively discover a small number of objects that are unlikely to be outliers (aka, inliers) and yet effectively represent the typical characteristics of the given dataset. HOD then leverages this set of inliers (called context inliers) to help humans understand the context in which the outliers occur. This ensures humans are able to easily identify the true outliers from the outlier candidates produced by the machine-based outlier detection techniques. To address the second challenge, we propose a bipartite graph-based question selection strategy that is theoretically proven to be able to minimize the number of questions needed to cover all outlier candidates. Our experimental results on real data sets show that HOD significantly outperforms the state-of-the-art methods on both human efforts and the quality of the discovered outliers.
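The abstract does not spell out the question selection algorithm, so the following is only an illustrative sketch, not the HOD method. It assumes a bipartite graph given as a mapping from hypothetical question ids to the outlier candidates each question covers, and it swaps in the standard greedy set-cover heuristic, which approximates a minimum cover but does not guarantee one. The function name and the input layout are assumptions made for this example.

```python
from typing import Dict, Hashable, List, Set


def greedy_question_cover(questions: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Select questions until every outlier candidate is covered.

    `questions` maps a (hypothetical) question id to the set of candidate
    ids it covers; these pairs are the edges of a bipartite graph with
    questions on one side and candidates on the other.
    Greedy set cover is an approximation, not a proven minimum.
    """
    # Every candidate that appears on at least one edge must be covered.
    uncovered: Set[Hashable] = set().union(*questions.values())
    chosen: List[Hashable] = []
    while uncovered:
        # Pick the question covering the most still-uncovered candidates.
        best = max(questions, key=lambda q: len(questions[q] & uncovered))
        chosen.append(best)
        uncovered -= questions[best]
    return chosen


if __name__ == "__main__":
    example = {
        "q1": {"a", "b"},
        "q2": {"b", "c"},
        "q3": {"c", "d", "e"},
    }
    print(greedy_question_cover(example))
```

In this toy input, two questions are enough to cover all five candidates, so one of the three questions never has to be asked. A real system would also need to decide how candidates and context inliers are grouped into questions, which this sketch takes as given.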

Original language	English
Title of host publication	SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	19-33
Number of pages	15
ISBN (Electronic)	9781450367356
DOIs	https://doi.org/10.1145/3318464.3389772
Publication status	Published - 14 Jun 2020
Externally published	Yes
Event	2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020 - Portland, United States (Duration: 14 Jun 2020 → 19 Jun 2020)

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Conference

Conference	2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
Country/Territory	United States
City	Portland
Period	14/06/20 → 19/06/20

Keywords

human-in-the-loop
outlier detection
question selection

Cite this

Chai, C., Cao, L., Li, G., Li, J., Luo, Y., & Madden, S. (2020). Human-in-the-loop Outlier Detection. In SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (pp. 19-33). (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3318464.3389772

@inproceedings{bfbbf68ed5014f2683058e838c038b8a,

title = "Human-in-the-loop Outlier Detection",

abstract = "Outlier detection is critical to a large number of applications from finance fraud detection to health care. Although numerous approaches have been proposed to automatically detect outliers, such outliers detected based on statistical rarity do not necessarily correspond to the true outliers to the interest of applications. In this work, we propose a human-in-the-loop outlier detection approach HOD that effectively leverages human intelligence to discover the true outliers. There are two main challenges in HOD. The first is to design human-friendly questions such that humans can easily understand the questions even if humans know nothing about the outlier detection techniques. The second is to minimize the number of questions. To address the first challenge, we design a clustering-based method to effectively discover a small number of objects that are unlikely to be outliers (aka, inliers) and yet effectively represent the typical characteristics of the given dataset. HOD then leverages this set of inliers (called context inliers) to help humans understand the context in which the outliers occur. This ensures humans are able to easily identify the true outliers from the outlier candidates produced by the machine-based outlier detection techniques. To address the second challenge, we propose a bipartite graph-based question selection strategy that is theoretically proven to be able to minimize the number of questions needed to cover all outlier candidates. Our experimental results on real data sets show that HOD significantly outperforms the state-of-the-art methods on both human efforts and the quality of the discovered outliers.",

keywords = "human-in-the-loop, outlier detection, question selection",

author = "Chengliang Chai and Lei Cao and Guoliang Li and Jian Li and Yuyu Luo and Samuel Madden",

note = "Publisher Copyright: {\textcopyright} 2020 Association for Computing Machinery.; 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020 ; Conference date: 14-06-2020 Through 19-06-2020",

year = "2020",

month = jun,

day = "14",

doi = "10.1145/3318464.3389772",

language = "English",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

pages = "19--33",

booktitle = "SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data",

}

Chai, C, Cao, L, Li, G, Li, J, Luo, Y & Madden, S 2020, Human-in-the-loop Outlier Detection. in SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, pp. 19-33, 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020, Portland, United States, 14/06/20. https://doi.org/10.1145/3318464.3389772

Human-in-the-loop Outlier Detection. / Chai, Chengliang; Cao, Lei; Li, Guoliang et al.
SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, 2020. p. 19-33 (Proceedings of the ACM SIGMOD International Conference on Management of Data).

TY - GEN

T1 - Human-in-the-loop Outlier Detection

AU - Chai, Chengliang

AU - Cao, Lei

AU - Li, Guoliang

AU - Li, Jian

AU - Luo, Yuyu

AU - Madden, Samuel

PY - 2020/6/14

Y1 - 2020/6/14

N2 - Outlier detection is critical to a large number of applications from finance fraud detection to health care. Although numerous approaches have been proposed to automatically detect outliers, such outliers detected based on statistical rarity do not necessarily correspond to the true outliers to the interest of applications. In this work, we propose a human-in-the-loop outlier detection approach HOD that effectively leverages human intelligence to discover the true outliers. There are two main challenges in HOD. The first is to design human-friendly questions such that humans can easily understand the questions even if humans know nothing about the outlier detection techniques. The second is to minimize the number of questions. To address the first challenge, we design a clustering-based method to effectively discover a small number of objects that are unlikely to be outliers (aka, inliers) and yet effectively represent the typical characteristics of the given dataset. HOD then leverages this set of inliers (called context inliers) to help humans understand the context in which the outliers occur. This ensures humans are able to easily identify the true outliers from the outlier candidates produced by the machine-based outlier detection techniques. To address the second challenge, we propose a bipartite graph-based question selection strategy that is theoretically proven to be able to minimize the number of questions needed to cover all outlier candidates. Our experimental results on real data sets show that HOD significantly outperforms the state-of-the-art methods on both human efforts and the quality of the discovered outliers.

KW - human-in-the-loop

KW - outlier detection

KW - question selection

UR - http://www.scopus.com/inward/record.url?scp=85086260731&partnerID=8YFLogxK

U2 - 10.1145/3318464.3389772

DO - 10.1145/3318464.3389772

M3 - Conference contribution

AN - SCOPUS:85086260731

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 19

EP - 33

BT - SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

PB - Association for Computing Machinery

T2 - 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020

Y2 - 14 June 2020 through 19 June 2020

ER -
