TY - JOUR
T1 - dwMLCS
T2 - An Efficient MLCS Algorithm based on Dynamic and Weighted Directed Acyclic Graph
AU - Yu, Changyong
AU - Gao, Dekuan
AU - Guo, Xu
AU - Ma, Haitao
AU - Zhao, Yuhai
AU - Wang, Guoren
N1 - Publisher Copyright:
IEEE
PY - 2024
Y1 - 2024
N2 - The problem of finding the longest common subsequence (MLCS) for multiple sequences is a computationally intensive and challenging problem that has significant applications in various fields such as text comparison, pattern recognition, and gene diagnosis. Currently, the dominant point-based MLCS algorithms have become popular and extensively studied. Generally, they construct the directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in the DAG. Several improvements have been made, focusing on decreasing model size and reducing redundant computations. These include 1) hash methods for eliminating duplicated nodes, 2) dynamic structures for supporting smaller DAG and 3) path pruning strategy and so on. However, the algorithms are still too limited when facing large-scale MLCS problem due to 1) the dynamic structures are too time-consuming to maintain and 2) the path pruning relies heavily on the tightness of the lower and upper bound of the MLCS. These factors contribute to the large-scale MLCS problem remaining a challenge. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS. It is based on two models: one is a dynamic DAG model which is both space and time efficient. It can decrease the size of the DAG significantly. The other is a weighted DAG model with new successor strategies. With this model, we design the algorithm for finding a tighter lower bound of the MLCS. Then, the path pruning is conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method for improving the efficiency of the path pruning strategy. The experimental results demonstrate that the effectiveness and efficiency of the models and algorithms proposed are better than state-of-the-art algorithms. The source codes of dwMLCS can be downloaded from web site https://github.com/BioLab310/dwMLCS.
AB - The problem of finding the longest common subsequence (MLCS) for multiple sequences is a computationally intensive and challenging problem that has significant applications in various fields such as text comparison, pattern recognition, and gene diagnosis. Currently, the dominant point-based MLCS algorithms have become popular and extensively studied. Generally, they construct the directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in the DAG. Several improvements have been made, focusing on decreasing model size and reducing redundant computations. These include 1) hash methods for eliminating duplicated nodes, 2) dynamic structures for supporting smaller DAG and 3) path pruning strategy and so on. However, the algorithms are still too limited when facing large-scale MLCS problem due to 1) the dynamic structures are too time-consuming to maintain and 2) the path pruning relies heavily on the tightness of the lower and upper bound of the MLCS. These factors contribute to the large-scale MLCS problem remaining a challenge. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS. It is based on two models: one is a dynamic DAG model which is both space and time efficient. It can decrease the size of the DAG significantly. The other is a weighted DAG model with new successor strategies. With this model, we design the algorithm for finding a tighter lower bound of the MLCS. Then, the path pruning is conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method for improving the efficiency of the path pruning strategy. The experimental results demonstrate that the effectiveness and efficiency of the models and algorithms proposed are better than state-of-the-art algorithms. The source codes of dwMLCS can be downloaded from web site https://github.com/BioLab310/dwMLCS.
KW - Approximation algorithms
KW - Branch-and-bound algorithm
KW - Computational modeling
KW - Costs
KW - Directed acyclic graph
KW - Dynamic programming
KW - Heuristic algorithms
KW - Longest common subsequence
KW - Lower bound
KW - Search problems
KW - Sorting
UR - http://www.scopus.com/inward/record.url?scp=85199541501&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2024.3431558
DO - 10.1109/TCBB.2024.3431558
M3 - Article
C2 - 39037883
AN - SCOPUS:85199541501
SN - 1545-5963
SP - 1
EP - 14
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
ER -