WiDS Puget Sound is independently organized by Diversity in Data Science.
Tuesday, May 14 • 10:35am - 11:00am
Novel semi-supervised clustering algorithm drastically improves consistency and interpretability in cancer drug development.

Single-cell RNA sequencing is an emerging, state-of-the-art technology that is revolutionizing genomic analysis in cancer treatment. The primary tool in its downstream analysis is unsupervised clustering, which detects and visualizes groups with common features and is used widely across the biomedical field to group cells by their genetic and proteomic profiles. However, many common clustering methods suffer from inconsistency and interpretability problems. For example, clustering outcomes depend heavily on the choice of algorithm and are sensitive to outliers and to variations in the input. It can also be challenging to determine an appropriate number of clusters and a label for each cluster. These issues are especially problematic for biological data, where the input has significant batch-to-batch variation by nature and where interpreting the cluster labels, or cell types, is crucial to understanding the underlying biological processes. Although advances in clustering methodology have improved areas such as high-dimensional analysis and outlier detection, inconsistency and interpretability remain key analytical challenges.

Here, we share a novel semi-supervised clustering method that addresses both problems. Originally developed by the Satija group at the New York Genome Center, the algorithm constructs a reference clustering map through supervised learning on biologically measured data, then anchors future clusters to that reference map. In our case, we applied the model to detect and classify different cell types in cancer based on their gene expression profiles. Because the model effectively controls for the variability caused by batch effects, we were able to compare and pool data from a variety of sources originating from different groups and environments. The reference map generated by supervised learning also provided a reliable way to label each cluster with a cell type, which drastically improved cluster interpretability for analysis and presentation. Collectively, this has given us a better understanding of the underlying biological processes, which can inform the development of future cancer treatments. While our application was limited to drug development, there is no reason the semi-supervised approach cannot be applied more broadly in other domains.
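
To make the idea of anchoring new data to a labeled reference concrete, here is a minimal sketch in plain Python. It is not the algorithm presented in the talk: it swaps in a simple nearest-centroid classifier, and it uses per-batch mean centering as a crude stand-in for batch correction, which assumes the batches have broadly similar cell-type composition. The function names (`center_batch`, `build_reference`, `assign_labels`), the two-gene expression values, and the cell-type labels are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def center_batch(cells):
    """Subtract each gene's batch mean (a crude stand-in for batch correction)."""
    n_genes = len(cells[0])
    gene_means = [mean(cell[g] for cell in cells) for g in range(n_genes)]
    return [[cell[g] - gene_means[g] for g in range(n_genes)] for cell in cells]


def build_reference(cells, labels):
    """Average the labeled reference cells into one centroid per cell type."""
    grouped = defaultdict(list)
    for cell, label in zip(cells, labels):
        grouped[label].append(cell)
    return {
        label: [mean(column) for column in zip(*members)]
        for label, members in grouped.items()
    }


def assign_labels(cells, reference):
    """Give each query cell the label of its nearest reference centroid."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(reference, key=lambda label: squared_distance(cell, reference[label]))
        for cell in cells
    ]


# Hypothetical reference batch: two genes per cell, with known cell types.
reference_cells = [[9, 1], [8, 2], [1, 9], [2, 8]]
reference_labels = ["T cell", "T cell", "B cell", "B cell"]
reference = build_reference(center_batch(reference_cells), reference_labels)

# Hypothetical query batch measured with a constant offset (a toy batch effect).
query_cells = [[19, 11], [11, 19], [18, 12]]
predicted = assign_labels(center_batch(query_cells), reference)
print(predicted)
```

Because every query cell receives a label drawn from the reference, results from different batches share one vocabulary of cell types, which is the consistency and interpretability benefit the abstract describes.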

Speakers
Marie Wang, PhD

Pfizer
Marie Wang is a Bioinformatics Scientist at Pfizer, one of the world’s premier biopharmaceutical companies. She applies rigorous statistical testing and machine learning methods to clinical data to understand the mechanism of action for drug candidates. Prior to joining Pfizer...


Tuesday May 14, 2024 10:35am - 11:00am PDT
Room 160, Student Center, 901 12th Ave, Seattle, WA 98122, USA