Marshall–Olkin power-law distributions in length-frequency of entities

Xiaoshi Zhong; Xiang Yu; Erik Cambria; Jagath C. Rajapakse

doi:10.1016/j.knosys.2023.110942

Xiaoshi Zhong, Xiang Yu, Erik Cambria^*, Jagath C. Rajapakse

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Entities involve important concepts with concrete meanings and play important roles in numerous linguistic tasks. Entities have different forms in different linguistic tasks and researchers treat those different forms as different concepts. In this paper, we are curious to know whether there are some common characteristics that connect those different forms of entities. Specifically, we investigate the underlying distributions of entities from different types and different languages, trying to figure out some common characteristics behind those diverse entities. After analyzing twelve datasets about different types of entities and eighteen datasets about entities in different languages, we find that while these entities are dramatically diverse from each other in many aspects, their length-frequencies can be well characterized by a family of Marshall–Olkin power-law (MOPL) distributions. We conduct experiments on those thirty datasets about entities in different types and different languages, and experimental results demonstrate that MOPL models characterize the length-frequencies of entities much better than two state-of-the-art power-law models and an alternative log-normal model. Experimental results also demonstrate that MOPL models are scalable to the length-frequency of entities in large-scale real-world datasets.

Original language	English
Article number	110942
Journal	Knowledge-Based Systems
Volume	279
DOIs	https://doi.org/10.1016/j.knosys.2023.110942
Publication status	Published - 4 Nov 2023

Keywords

Entities
Length-frequency of entities
Marshall–Olkin power-law (MOPL) model
Power-law distributions

Cite this

@article{287fdafc6af64d25a628643683bd6f17,
  title = "Marshall–Olkin power-law distributions in length-frequency of entities",
  abstract = "Entities involve important concepts with concrete meanings and play important roles in numerous linguistic tasks. Entities have different forms in different linguistic tasks and researchers treat those different forms as different concepts. In this paper, we are curious to know whether there are some common characteristics that connect those different forms of entities. Specifically, we investigate the underlying distributions of entities from different types and different languages, trying to figure out some common characteristics behind those diverse entities. After analyzing twelve datasets about different types of entities and eighteen datasets about entities in different languages, we find that while these entities are dramatically diverse from each other in many aspects, their length-frequencies can be well characterized by a family of Marshall–Olkin power-law (MOPL) distributions. We conduct experiments on those thirty datasets about entities in different types and different languages, and experimental results demonstrate that MOPL models characterize the length-frequencies of entities much better than two state-of-the-art power-law models and an alternative log-normal model. Experimental results also demonstrate that MOPL models are scalable to the length-frequency of entities in large-scale real-world datasets.",
  keywords = "Entities, Length-frequency of entities, Marshall–Olkin power-law (MOPL) model, Power-law distributions",
  author = "Xiaoshi Zhong and Xiang Yu and Erik Cambria and Rajapakse, {Jagath C.}",
  note = "Publisher Copyright: {\textcopyright} 2023 Elsevier B.V.",
  year = "2023",
  month = nov,
  day = "4",
  doi = "10.1016/j.knosys.2023.110942",
  language = "English",
  volume = "279",
  journal = "Knowledge-Based Systems",
  issn = "0950-7051",
  publisher = "Elsevier B.V.",
}

TY - JOUR
T1 - Marshall–Olkin power-law distributions in length-frequency of entities
AU - Zhong, Xiaoshi
AU - Yu, Xiang
AU - Cambria, Erik
AU - Rajapakse, Jagath C.
PY - 2023/11/4
Y1 - 2023/11/4
N2 - Entities involve important concepts with concrete meanings and play important roles in numerous linguistic tasks. Entities have different forms in different linguistic tasks and researchers treat those different forms as different concepts. In this paper, we are curious to know whether there are some common characteristics that connect those different forms of entities. Specifically, we investigate the underlying distributions of entities from different types and different languages, trying to figure out some common characteristics behind those diverse entities. After analyzing twelve datasets about different types of entities and eighteen datasets about entities in different languages, we find that while these entities are dramatically diverse from each other in many aspects, their length-frequencies can be well characterized by a family of Marshall–Olkin power-law (MOPL) distributions. We conduct experiments on those thirty datasets about entities in different types and different languages, and experimental results demonstrate that MOPL models characterize the length-frequencies of entities much better than two state-of-the-art power-law models and an alternative log-normal model. Experimental results also demonstrate that MOPL models are scalable to the length-frequency of entities in large-scale real-world datasets.

AB - Entities involve important concepts with concrete meanings and play important roles in numerous linguistic tasks. Entities have different forms in different linguistic tasks and researchers treat those different forms as different concepts. In this paper, we are curious to know whether there are some common characteristics that connect those different forms of entities. Specifically, we investigate the underlying distributions of entities from different types and different languages, trying to figure out some common characteristics behind those diverse entities. After analyzing twelve datasets about different types of entities and eighteen datasets about entities in different languages, we find that while these entities are dramatically diverse from each other in many aspects, their length-frequencies can be well characterized by a family of Marshall–Olkin power-law (MOPL) distributions. We conduct experiments on those thirty datasets about entities in different types and different languages, and experimental results demonstrate that MOPL models characterize the length-frequencies of entities much better than two state-of-the-art power-law models and an alternative log-normal model. Experimental results also demonstrate that MOPL models are scalable to the length-frequency of entities in large-scale real-world datasets.

KW - Entities
KW - Length-frequency of entities
KW - Marshall–Olkin power-law (MOPL) model
KW - Power-law distributions
UR - http://www.scopus.com/inward/record.url?scp=85171334963&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2023.110942
DO - 10.1016/j.knosys.2023.110942
M3 - Article
AN - SCOPUS:85171334963
SN - 0950-7051
VL - 279
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 110942
ER -
