Hemant Ishwaran, Ph.D., M.Sc.
The NIGMS R35 grant will enable Dr. Ishwaran and his research team to continue developing machine learning methods for random forest and decision tree models. Their framework uses a different type of base learner, called super greedy trees (SGTs), in place of the widely used classification and regression tree (CART) base learner for big data analysis.
“It’s a completely different type of algorithm and surprisingly works because we try to over-train the algorithm on the data -- this is a key principle of super greediness,” said Dr. Ishwaran.
Addressing Big Data Challenges
The SGT framework is designed to address several challenges associated with big data, including unsupervised machine learning, highly imbalanced data, and survival analysis with time-varying covariates.
One common challenge is unsupervised learning, in which a machine learning algorithm has only input data available, with no labeled outcomes. The algorithm examines patterns to determine the underlying structure of the input data. One application of unsupervised learning is in evolutionary biology, where data clustering is used to group genes and species.
Another challenge in big data analysis is imbalanced data. The problem with an imbalanced dataset is that the algorithm develops a bias toward the majority class in the data. As a result, the algorithm does not learn the patterns that characterize the other class labels and is unable to distinguish between the classes.
This, in turn, results in an algorithm that performs poorly on the minority classes and fails to accurately predict the outcomes of interest. Examples of imbalanced data arise in anomaly detection, such as identifying a rare disease or fraudulent financial transactions.
Survival analysis is another focus of the research grant. An important application is large-scale data acquisition systems, such as electronic health record systems and personal health technologies (for example, wearable sensors), that track an individual’s data over time. These measurements can be used to build real-time warning systems for adverse outcomes and to construct individualized risk predictions that change dynamically over time.
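The majority-class bias described above can be illustrated with a minimal Python sketch. It is not the SGT method or any part of randomForestSRC. The data are hypothetical (990 healthy cases and 10 rare-disease cases), and the "model" is a deliberately naive stand-in that always predicts the most frequent label. The helper names `majority_baseline`, `accuracy`, and `recall` are chosen here for illustration only.

```python
def majority_baseline(labels):
    """Return the most frequent label (a stand-in for a majority-biased model)."""
    return max(set(labels), key=labels.count)


def accuracy(y_true, y_pred):
    """Fraction of all cases predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the true `positive` cases that the model actually found."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(1 for p in predictions_for_positives if p == positive)
    return hits / len(predictions_for_positives)


# Hypothetical screening data: 0 = healthy (majority), 1 = rare disease (minority).
y_true = [0] * 990 + [1] * 10

# The naive model predicts the majority label for every case.
majority = majority_baseline(y_true)
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive=1)
print(f"accuracy = {acc:.2f}, recall on rare class = {rec:.2f}")
```

Under these assumptions the naive model is right for almost every case overall, yet it never identifies a single rare-disease case, which is why overall accuracy alone can hide poor performance on the minority class.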
“The complexity of these models increases drastically with big data and the only hope is to use machine learning, but there are many challenges in survival analysis that still need to be overcome,” added Dr. Ishwaran.
Development of an Open-Source Software Package
Funding from the research grant will support not only the development of a unified SGT framework, but also the development of an open-source software package. The package will be scalable and extensible, and will serve as a resource for the scientific and research community. In addition, a new, robust and user-friendly website will be developed to share resources with randomForestSRC users.
randomForestSRC is an R package that implements a unified treatment of random forests. The research team plans to extend the software in many ways, including support for Python, Java, Spark, and other programming languages and platforms.
Dr. Hemant Ishwaran will serve as principal investigator of the study.
Copyright: 2024 University of Miami. All Rights Reserved.