Dissertations

Novel Tree-Based Variable Importance Methods on Correlated Data

Author: Naiyuan Zhang

Date: 9/28/2022

Executive Summary:
The rapid development of data-driven science and a massive increase in data volume have encouraged the development of variable importance techniques in all fields of science. However, because of a lack of communication between disciplines, these methods have been developed independently and evaluated in distinct ways, which raises several questions. Do they share a unified formal definition of variable importance? What are the criteria for treating a variable as noise? How do they perform on different types of data? What challenges do current variable importance techniques face? In this dissertation, we summarize and organize definitions of noise variables and variable importance from different disciplines into mathematical frameworks, and we provide an in-depth discussion of the strengths and weaknesses of several popular variable importance methods.

Decision trees are widely used in statistics, data mining, and machine learning. Their robustness to messy data sets, flexibility with respect to variable types, and intuitive interpretability make them popular in many variable importance methods, such as the random forest variable importance proposed by Breiman and Cutler, drop variable importance, and set-0 variable importance. We propose two novel tree-based variable importance approaches, Weighted Tree variable importance and the Weighted Decrease Impurity (WDI) boosted forest, that eliminate the bias for highly correlated variables and perform well on high-dimensional data. We run multiple simulations to explore the performance of the two methods on correlated variables and high-dimensional data, and we compare them with Breiman-Cutler (BC) importance, gradient boosting importance, minimal depth, and Lasso techniques. The simulations show that the proposed methods have good practical operating properties.
