Abstract
In the era of big data, traditional database techniques struggle to adapt to ever-expanding data volumes, complex and diverse application scenarios, heterogeneous hardware architectures, and different types of users. Machine learning, known for its learning ability, is therefore gradually showing potential and application prospects in databases. Based on a thorough investigation and analysis, we first summarize the requirements for machine learning in building an efficient, reliable, highly available, and adaptive database system, covering database operation and maintenance, data storage, optimizer and executor, query optimization, database workload management, database security and privacy, database self-management, and databases for machine learning. We then discuss the potential challenges of combining machine learning algorithms with database techniques from four aspects: lack of training data, long training time, limited generalization ability, and the difficulty of applying machine learning models to specific database problems. Next, we survey research on machine-learning-based techniques, including automatic parameter tuning, automatic cardinality estimation, automatic query plan selection, and automatic index and view selection. Automatic tuning techniques include heuristic algorithms, traditional machine learning, and deep reinforcement learning. 
Heuristic algorithms explore the optimal subspace by sampling from the discrete parameter space, which can effectively improve the efficiency of parameter tuning, but they struggle to find an appropriate configuration within the resource limit. Traditional machine learning algorithms learn the mapping between the system state and the specified workload template in a reduced-dimension parameter space, which improves the adaptability of the model. Deep reinforcement learning iteratively learns the optimization strategy in the high-dimensional parameter space and uses neural networks to improve the handling of high-dimensional data. This approach can effectively reduce the demand for training data. Automatic cardinality estimation includes query-oriented and query-plan-oriented methods. The former uses a convolutional neural network (CNN) to learn the relationships among data, filter conditions, and join conditions, but it generalizes poorly across datasets. The latter estimates the cardinality of physical operators in a cascade, which improves adaptability to different queries. Query plan selection includes deep learning and reinforcement learning methods. The deep learning method integrates estimated cost values and data characteristics, which improves the accuracy of cost estimation for each plan, but the results depend heavily on the accuracy of the estimator. The deep reinforcement learning method iteratively generates the query plan based on the final goal, which reduces the dependence on query cost. Automatic index selection includes classifiers, reinforcement learning, and genetic algorithms. The classification algorithm analyzes the cost of building indexes and the efficiency of different indexes based on table characteristics and, combined with a genetic algorithm, improves the efficiency of recommending composite indexes. Reinforcement learning realizes online index selection by incrementally recommending indexes. 
Automatic view selection includes heuristic algorithms, probability statistics, and reinforcement learning. Heuristic algorithms improve selection efficiency by greedily exploring the directed acyclic graph of candidate views, but their adaptability is poor. Statistics-based methods formalize view selection as a 0-1 selection problem, effectively reducing the cost of exploring the graph. Reinforcement learning methods model the creation and deletion of views as a dynamic selection process and further improve selection efficiency through a trial-and-error training pattern. Finally, we outline the revolutionary breakthroughs that machine learning technologies will bring to databases from eight perspectives.
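
To make the 0-1 formulation of view selection concrete, the following is a minimal sketch rather than the method of any surveyed system. It assumes that each candidate view has an independent, known benefit and an integer storage size, and that a fixed storage budget applies; none of these assumptions comes from the abstract. Real formulations typically also model interactions between views. The function name `select_views` and its parameters are hypothetical.

```python
from typing import List, Tuple


def select_views(benefits: List[float], sizes: List[int],
                 budget: int) -> Tuple[float, List[int]]:
    """Pick a subset of candidate views (0-1 decision per view) that
    maximizes total benefit without exceeding the storage budget.

    Assumption (illustrative only): benefits are independent of each other.
    """
    n = len(benefits)
    # best[i][b] = best total benefit using the first i views with capacity b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # decision x_i = 0: skip view i-1
            if sizes[i - 1] <= b:
                # decision x_i = 1: materialize view i-1
                candidate = best[i - 1][b - sizes[i - 1]] + benefits[i - 1]
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Backtrack to recover which views were selected.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= sizes[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


if __name__ == "__main__":
    total, picked = select_views([10.0, 7.0, 4.0], [5, 3, 3], budget=6)
    print(total, picked)
```

With the toy inputs in the usage example, the two smaller views together outweigh the single largest one, which is the kind of trade-off that a greedy pick by benefit alone would miss.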
Translated title of the contribution | A Survey of Machine Learning Based Database Techniques |
---|---|
Original language | Chinese (Traditional) |
Pages (from-to) | 2019-2049 |
Number of pages | 31 |
Journal | Jisuanji Xuebao/Chinese Journal of Computers |
Volume | 43 |
Issue number | 11 |
DOIs | |
Publication status | Published - Nov 2020 |
Externally published | Yes |