NSF Grant Awarded for Data Science Project
I’m excited to share that my collaborative proposal to the NSF to study “Interactive Ensemble clustering for mixed data with application to mood disorders” has been funded. The project officially began on September 15, 2015, with funding for one year. This is a planning proposal, intended to foster preliminary work toward a larger funded effort in subsequent years.
Here is the official NSF abstract for our project:
The Big Data era has given rise to data of unprecedented size and complexity. However, fully leveraging Big Data resources for knowledge and discovery is an open challenge due to the fact that conventional methods of data processing and analysis often fail or are inappropriate. This project develops an innovative approach that utilizes Big Data to improve the classification of mood disorders for the purpose of improving diagnosis and outcomes for psychiatric patients. Big Data issues are inherently more severe for mental disorders because of their elusive nature. The psychiatric community has recognized the critical need for a more precise, evidence-based approach for the diagnosis and treatment of disease. In fact, recent studies funded by the National Institute of Mental Health (NIMH) have found that psychiatric interventions were effective in less than 25% of patients presenting with an acute episode. This low efficacy rate is especially problematic given the prevalence of mental disorders. Mood disorders alone (e.g., depression) will be experienced by 1 in 5 adults in the United States at some point in their lives. This project is motivated by the hypothesis that a more precise and personalized classification of mental health disease can be obtained through the development of novel clustering methods that identify clinically significant structures with these large population data sets. However, such an approach must overcome a large number of methodological challenges introduced by the complexity of the problem and the nature of large-scale real-world electronic health data. These challenges include, among others, complex and unknown structure, high dimensionality, heterogeneity, complex mixtures of variables, missing data, and sparsity.
This project is carried out by a team with interdisciplinary and complementary skill sets to develop methods for big data that address challenges inherent in the integration of biomedical data of this type. Collective expertise of the team spans the areas of biomedical informatics, biostatistics, computer and information science, electrical and computer engineering, mathematics, and psychiatry. A novel methodology is developed in a flexible and fully integrated framework that can be extended to other biomedical data and diseases. Within this framework, clustering methods that capture different aspects of relatedness in the data are integrated in a rigorous way that not only accounts for model uncertainty, but also results in an interactive visualization that is accessible with strong model interpretability for the non-expert. Specifically, the methodology will rely on novel modifications to bootstrap estimators of generalization error for the purpose of assembling a consensus over an ensemble of clusters inferred from topology-based and machine learning approaches. The framework also supports iterative refinement of the consensus solution based on user input (via the visualization) to incorporate domain expertise. The rigorous identification of sub-groups of individuals within heterogeneous populations will facilitate accurate and targeted diagnosis for mood disorders, and provide opportunity for personalized evidence-based interventions. Applications focus on clustering individuals with mood disorders (bipolar disorder and major depression) from data collected in the Bipolar Disorder Research Network (BDRN). Despite this focus, the methodology is generalizable to other diseases that face similar challenges for diagnosis and treatment. In fact, this project supports the first steps of a long-term vision of generalizing the methods to more complex and less curated data, such as electronic health records, social media, and other sources.
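For readers curious what "assembling a consensus over an ensemble of clusters" can look like in practice, here is a minimal sketch in Python. It is not the project's methodology: it swaps in a much simpler, well-known idea (a co-association matrix built from bootstrap resamples, sometimes called evidence accumulation), uses a single toy k-means as the only base clusterer, and skips mixed data types, missing data, and the generalization-error estimators mentioned in the abstract. All function names, parameters, and the toy data are my own illustrative assumptions.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Toy k-means with farthest-first initialization; returns one label per point."""
    k = min(k, len(points))
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def co_association(points, k, n_boot=50, seed=0):
    """Fraction of bootstrap runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_boot):
        # Bootstrap resample (with replacement), keeping the distinct indices drawn.
        idx = sorted({rng.randrange(n) for _ in range(n)})
        labels = kmeans([points[i] for i in idx], k, rng)
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def consensus_labels(coassoc, threshold=0.5):
    """Link pairs that co-cluster more often than `threshold`; return component labels."""
    n = len(coassoc)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if coassoc[i][j] > threshold:
            parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Made-up toy data: two well-separated groups of four 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
coassoc = co_association(points, k=2)
labels = consensus_labels(coassoc)
print(labels)
```

The co-association matrix is also a natural thing to visualize (for example as a heatmap), since each entry reads as "how often did these two individuals end up in the same group." That loosely echoes the interactive, interpretable view the abstract describes, though the real framework is considerably more involved.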