Provable Boolean interaction recovery from tree ensemble obtained via random forests

Published:2022-05-24 Issue:22 Volume:119
ISSN:0027-8424
Container-title:Proceedings of the National Academy of Sciences
language:en
Short-container-title:Proc. Natl. Acad. Sci. U.S.A.

Author:

Behr Merle¹, Wang Yu¹, Li Xiao¹, Yu Bin¹²³

Affiliation:

1. Department of Statistics, University of California, Berkeley, CA 94720

2. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720

3. Center for Computational Biology, University of California, Berkeley, CA 94720

Abstract

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequentially, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
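
To make the abstract's description of the LSS model concrete, here is a minimal Python sketch of a regression function built as a linear combination of piecewise constant Boolean interaction terms. It is an illustration under stated assumptions, not the authors' code: the function names (`lss_mean`, `simulate_lss`), the uniform features on [0, 1], the Gaussian noise, and the example thresholds and coefficients in `TERMS` are all hypothetical choices made here. The sketch covers only the data-generating model; it does not implement DWP or LSSFind.

```python
import random


def lss_mean(x, intercept, terms):
    """Evaluate an LSS-style regression function at one feature vector x.

    terms is a list of (coefficient, interaction) pairs, where each
    interaction is a list of signed features (index, threshold, sign).
    A signed feature is "on" when x[index] >= threshold for sign > 0,
    and when x[index] < threshold for sign < 0.
    (This encoding is an assumption made for illustration.)
    """
    total = intercept
    for coef, interaction in terms:
        # The Boolean interaction is active only if every signed feature is on.
        active = all(
            (x[k] >= t) if sign > 0 else (x[k] < t)
            for k, t, sign in interaction
        )
        if active:
            total += coef
    return total


def simulate_lss(n, p, intercept, terms, noise_sd=0.1, seed=0):
    """Draw n samples with p uniform features and additive Gaussian noise."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [lss_mean(x, intercept, terms) + rng.gauss(0.0, noise_sd) for x in X]
    return X, y


# Hypothetical example: two Boolean interactions on disjoint feature sets.
TERMS = [
    (2.0, [(0, 0.5, +1), (1, 0.5, +1)]),  # x0 high AND x1 high
    (1.0, [(2, 0.3, -1), (3, 0.7, +1)]),  # x2 low AND x3 high
]

if __name__ == "__main__":
    X, y = simulate_lss(n=5, p=6, intercept=0.0, terms=TERMS)
    for xi, yi in zip(X, y):
        print([round(v, 2) for v in xi], round(yi, 3))
```

Data simulated this way could be passed to any random forest implementation to examine which signed features tend to co-occur along decision paths, which is the behavior DWP is described as measuring.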

Funder

Deutsche Forschungsgemeinschaft

National Science Foundation

Center for Science of Information

Simons Foundation

Publisher

Proceedings of the National Academy of Sciences

Subject

Multidisciplinary

Link

https://pnas.org/doi/pdf/10.1073/pnas.2118636119

Reference: 45 articles.

1. Veridical data science

2. Random forests; Breiman L.; Mach. Learn., 2001

3. Greedy function approximation: A gradient boosting machine.

4. Unbiased split selection for classification trees based on the Gini Index

5. G. Louppe, L. Wehenkel, A. Sutera, P. Geurts, “Understanding variable importances in forests of randomized trees” in Advances in Neural Information Processing Systems, C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger, Eds. (Curran Associates, Inc., Red Hook, NY, 2013), vol. 26, pp. 431–439.

Cited by 2 articles.

1. Opportunities and Challenges for AI-Based Analysis of RWD in Pharmaceutical R&D: A Practical Perspective;KI - Künstliche Intelligenz;2023-10-09

2. Machine learning-based dynamic prediction of lateral lymph node metastasis in patients with papillary thyroid cancer;Frontiers in Endocrinology;2022-10-10