WiDS Puget Sound is independently organized by Diversity in Data Science.
Tuesday, May 14
 

9:00am PDT

Opening Ceremony
The opening ceremony will welcome the audience to WiDS Puget Sound 2024. We will share useful tips to optimize your conference experience and thank the volunteers and sponsors that help make the event possible.

Speakers

Kelly Stroh

Data Scientist, Trupanion

Yashaswini Agarwal

Data Analyst, Mount Sinai

Niwako Sugimura

People Analytics Lead, Deloitte


Tuesday May 14, 2024 9:00am - 9:15am PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

9:15am PDT

How to thrive in the world of AI
Woven through my own story in the world of AI for 3 decades, I’ll reveal in this presentation how AI is reshaping our world. We'll unpack how to stay agile and innovative with AI, boosting productivity and creativity in any job. Let's explore how to tackle challenges, protect our planet, and ensure humanity flourishes. No tech background needed—just a readiness to adapt and grow!

Speakers

Teresa Escrig, PhD

WhatMattersAcademy.com
Dr. Teresa Escrig, PhD in AI, has led robotics research, authored 100+ papers, 3 books, and spearheaded AI projects globally. At Microsoft, she led Responsible AI and Machine Teaching. Founder of What Matters Academy, she merges tech expertise with a passion for natural, empowered...


Tuesday May 14, 2024 9:15am - 9:55am PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

10:05am PDT

Advancing Retail Fraud Prevention with Apache Kafka and Apache Flink: A Real-Time Event-Streaming Approach

The retail sector is increasingly vulnerable to a variety of sophisticated fraud schemes that can lead to financial loss and erode consumer trust. This presentation will examine the escalating problem of retail fraud, identify the key challenges retailers face, and offer a comprehensive overview of an innovative solution leveraging the distributed event-streaming capabilities of Apache Kafka in conjunction with the real-time stream processing power of Apache Flink.

We will outline challenges in the retail industry, such as the need to process and analyze data in real-time to detect fraudulent patterns swiftly and the difficulty in scaling systems to handle peak transaction volumes.

The core of the session will provide an overview of the system's workflow, which employs Apache Kafka to efficiently handle high-volume data streams and Apache Flink for its low-latency processing capabilities. We will illustrate how these components interact to create a real-time fraud detection engine that can identify and act upon suspicious activities as they occur.

Next, we will delve into specific use cases, illustrating how the system addresses common fraud scenarios such as in-store return fraud, policy abuse and anomalous markdowns. Through these examples, attendees will gain insight into the system's versatility and its ability to mitigate various types of fraud across the retail domain.

The talk will also cover the results achieved by this system, including improved fraud detection rates, reduced false positives, and the ability to preempt fraud before it impacts the bottom line. We will highlight how this approach not only protects revenue but also enhances the customer experience by minimizing the intrusion of fraud checks on legitimate transactions.

Finally, we will explore other opportunities presented by this technology, including the incorporation of machine learning algorithms for enhanced predictive capabilities. These could utilize MLOps methodologies for model lifecycle management to ensure continuous adaptation and peak performance against emerging fraud tactics.

Attendees of this session will leave with a clear understanding of how the combined strengths of Apache Kafka and Apache Flink are shaping the future of real-time fraud prevention in the retail space, offering scalable, efficient, and adaptable solutions to this persistent industry challenge.
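
As a rough illustration of the kind of per-customer windowed rule a streaming fraud engine might evaluate (for example, for in-store return fraud), here is a minimal sketch in plain Python. It is not taken from the talk and does not use the Kafka or Flink APIs: an in-memory list stands in for the event stream, and the window length, threshold, and event fields are all hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600      # hypothetical look-back window
MAX_RETURNS_IN_WINDOW = 3  # hypothetical threshold

def flag_suspicious_returns(events):
    """Yield (customer_id, timestamp) whenever a customer exceeds the threshold.

    `events` is an iterable of (timestamp_seconds, customer_id, event_type)
    tuples in timestamp order, standing in for a stream of retail events.
    """
    recent = defaultdict(deque)  # customer_id -> timestamps of recent returns
    for ts, customer_id, event_type in events:
        if event_type != "return":
            continue
        window = recent[customer_id]
        window.append(ts)
        # Evict returns that have fallen out of the look-back window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_RETURNS_IN_WINDOW:
            yield customer_id, ts

stream = [
    (0, "c1", "return"), (600, "c1", "return"), (900, "c2", "purchase"),
    (1200, "c1", "return"), (1800, "c1", "return"), (9000, "c1", "return"),
]
alerts = list(flag_suspicious_returns(stream))
print(alerts)
```

In a real deployment the keyed state and window eviction would be handled by the stream processor rather than by a Python dictionary; the sketch only shows the shape of the rule.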

Speakers

Zhamilya Kruger

Nordstrom
Zhamilya Kruger is a data scientist who has cultivated a breadth of experience at Nordstrom over the past six years, contributing her analytical skills to several areas within the company. Her journey has taken her through Product Management, Search and Browse Optimization, Finance...


Tuesday May 14, 2024 10:05am - 10:30am PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

10:05am PDT

As easy as breathing - manage your workflows with Airflow!

Apache Airflow is an open source workflow management tool that's been called "cron on steroids". As a career data engineer, I've found this tool central to my success at orchestrating and maintaining data pipelines. But Airflow's applications have grown far beyond the intent for which it was originally built. What was once a machine learning training engine is now a tool I've used extensively over the last 6 years. I've used it across 3 jobs, in several different roles; for side projects and critical infrastructure; for manually triggered jobs and automated workflows; for IT (Ookla/Speedtest.net), science (Allen Institute for Cell Science), the commons (Openverse), and liberation (Orca Collective).

In this talk, I'll be sharing a brief overview of what Apache Airflow is and how it might be able to help manage *your* workflows too! As an Airflow user and contributor for the last 6 years, I've seen how this tool can quickly become the hammer for every nail you see. Part of what makes Airflow powerful is that you can define its workflows in pure Python; this means you can leverage all of the clever language features and libraries Python has to offer when setting up a job. No more pesky and repetitive YAML files (GitHub Actions) or domain-specific languages (Jenkins). Use the language and libraries you're familiar with while getting automatic retries, error handling, control flow, and so much more.
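
To make "workflows in pure Python with automatic retries" concrete, here is a small standard-library sketch of what an orchestrator does: order tasks by their dependencies and retry a task that fails. It deliberately does not use Airflow's own API; `run_with_retries`, `run_workflow`, and the three toy tasks are hypothetical names made up for this illustration.

```python
import time
from graphlib import TopologicalSorter

def run_with_retries(task, retries=2, delay_seconds=0.0):
    """Call `task`; if it raises, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay_seconds)

def run_workflow(tasks, dependencies):
    """Run callables in dependency order.

    `tasks` maps a task name to a zero-argument callable; `dependencies`
    maps a task name to the set of task names that must finish first.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {name: run_with_retries(tasks[name]) for name in order}
    return order, results

attempts = {"extract": 0}

def extract():
    # Fails on the first call to simulate a transient error.
    attempts["extract"] += 1
    if attempts["extract"] < 2:
        raise RuntimeError("transient failure")
    return [1, 2, 3]

def transform():
    return "transformed"

def load():
    return "loaded"

order, results = run_workflow(
    {"extract": extract, "transform": transform, "load": load},
    {"transform": {"extract"}, "load": {"transform"}},
)
print(order, results)
```

An orchestrator such as Airflow adds scheduling, logging, a UI, and distributed execution on top of this basic idea.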

Speakers

Madison Swain-Bowden

Senior Data Engineer, Automattic
Madison is a Senior Data Engineer & former Team Lead out of Seattle and an avid Python user/organizer. She is currently sponsored by Automattic to work on the open source project Openverse, and has worked at Ookla (Speedtest.net), the Allen Institute for Cell Science, and the Broad...


Tuesday May 14, 2024 10:05am - 10:30am PDT
Room 130, Student Center

10:05am PDT

GANs for Causal Inference: Harnessing Conditional Independence

This interdisciplinary talk introduces listeners to the power of Generative AI in the field of Causal Inference and its subsequent applications in Economics and Political Science. Our rigorous year-long research aims to develop a state-of-the-art Causal Inference technique: CausalGANs. Generative Adversarial Networks (GANs) are a popular deep learning method that dominates the field of image generation. We harness the essence of GANs to create, from scratch, a causal inference technique which modifies the architecture of GANs to solve the fundamental problem of missing counterfactuals in Causal Inference. In this research, we set up a new framework, develop the notation, write mathematical proofs, and produce robust results by running over 200 parallelised experiments for each set of parameters on high-performance computing resources.

The GANs algorithm simultaneously trains two models: a generator and a discriminator. The generator's objective is to find a data-generating process that generates fake data emulating the distribution of real data, and the discriminator's objective is to distinguish the real data from the fake data. This adversarial nature makes the framework a minimax game between its two components; the competition in this game drives both generator and discriminator to improve their methods until the simulated samples are indistinguishable from the observed samples. At the core of the GANs algorithm is the search for a neural network model that can generate fake data whose distribution is independent of the labeling of real versus fake data. Independence restrictions of this kind are front and center in causal inference models, where the distribution of potential outcomes under treatment and control, conditional on contextual variables, is independent of the realized treatment. This makes the GANs apparatus a good method for causal inference, where instead of pitting real versus fake data, we now strive to get distributions of potential outcomes for treated and non-treated as close as possible.

The ongoing research involves developing the method, proving its validity, and conducting empirical experiments. We confirm several intuitions as we test different aspects of the method, CausalGANs, with a robust evaluation strategy and compare it against traditional and other state-of-the-art methods in causal inference. We were able to empirically verify the mathematical theorems defined for the framework: 1) we can recover the parameters of the data-generating process through this adversarial framework, 2) the minimum of the loss function is attained close to the true data parameters, and 3) the minimizer provides the best estimator of the propensity score. Through this framework, we successfully obtain the treatment effects. Thus, the success of this method revolutionizes the field of economics through practical applications such as policy development, which often seeks to find the causal effect of interventions.
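
For reference, the minimax game described above is usually written in the standard GAN form below, followed by the conditional independence (unconfoundedness) restriction from causal inference. These are the textbook formulations, not the CausalGANs objective itself, which the abstract does not spell out. The notation is conventional and assumed here: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the real-data distribution, $p_z$ the noise distribution, $Y(0)$ and $Y(1)$ the potential outcomes, $T$ the realized treatment, and $X$ the contextual variables.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

\bigl(Y(0),\, Y(1)\bigr) \;\perp\!\!\!\perp\; T \;\mid\; X
```

The parallel the abstract draws is that both conditions ask for a distribution that carries no information about a binary label: real versus fake in the first case, treated versus control in the second.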

Speakers

Palak Bansal

New York University
Palak Bansal is an accomplished data science professional committed to promoting diversity and inclusion in technology. Currently pursuing her Master's degree in Data Science at New York University, Palak has over three years of experience in both software and data science projects...

Hoa Duong

New York University
Hoa is a data science professional whose interests lie in the intersection between data science, economics, and business. Hoa earned her B.A. in Mathematics and Economics with honors and worked as an Analyst and Researcher at NERA Economic Consulting, where she led teams to implement...


Tuesday May 14, 2024 10:05am - 10:30am PDT
Room 210, Student Center

10:35am PDT

Novel semi-supervised clustering algorithm drastically improves consistency and interpretability in cancer drug development.

Single-cell RNA sequencing is an emerging, state-of-the-art technology revolutionizing genomic analysis in cancer treatment. The primary tool used in its downstream analysis is unsupervised clustering, which helps to detect and visualize groups with common features and is leveraged more universally in the biomedical field to group cells based on their genetic and proteomic profiles. However, many common clustering methods suffer from inconsistency and interpretability problems. For example, clustering outcomes are heavily dependent on algorithm choice and are sensitive to variations in input and outliers. Additionally, it can be challenging to determine an appropriate number of clusters and a label for each cluster. These issues are especially problematic for biological data, in which the input data by nature has significant batch-to-batch variation and being able to interpret the clustering labels or cell types is crucial to understanding the biological processes. Although advances in clustering methodology have helped optimize areas such as high dimensionality analysis and outlier detection, inconsistency and interpretability remain key analytical challenges.

Here, we want to share a novel semi-supervised clustering method which addresses both problems. Originally developed by the Satija group at the New York Genome Center, the algorithm constructs a reference clustering map through supervised learning using biologically measured data, anchoring future clusters to the reference map. In our case, we applied the model to classify and detect different cell types in cancer based on their gene expression profiles. Because the model can effectively control for the variability caused by the batch-to-batch effect, we were able to compare and pool a variety of data sources originating from different groups and environments.

The reference map generated by supervised learning also provided a reliable way to label each cluster with a cell type, which drastically improved the cluster interpretability for analysis and presentation. Collectively, it has offered us a better understanding of the underlying biological processes, improving future cancer treatment medication. While our application was limited to drug development, there is no reason the semi-supervised approach cannot be applied more holistically to a variety of domains.
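
The idea of anchoring new data to a labeled reference map can be illustrated with a deliberately simplified nearest-centroid sketch. This is not the algorithm described in the talk (it does no batch correction and uses no anchors); it only shows how a labeled reference lets every new cell receive an interpretable cell-type label. The two-gene expression profiles and cell-type names are made up.

```python
import math
from collections import defaultdict

def build_reference_map(profiles, labels):
    """Average the expression profiles of each labeled cell type."""
    grouped = defaultdict(list)
    for profile, label in zip(profiles, labels):
        grouped[label].append(profile)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def assign_cell_type(profile, reference_map):
    """Label a new cell with the cell type of the nearest reference centroid."""
    return min(
        reference_map,
        key=lambda label: math.dist(profile, reference_map[label]),
    )

# Made-up two-gene expression profiles for a labeled reference set.
reference_profiles = [[9.0, 1.0], [11.0, 1.0], [1.0, 10.0], [1.0, 8.0]]
reference_labels = ["T cell", "T cell", "B cell", "B cell"]

reference_map = build_reference_map(reference_profiles, reference_labels)
new_cells = [[8.5, 2.0], [2.0, 7.0]]
assigned = [assign_cell_type(cell, reference_map) for cell in new_cells]
print(assigned)
```

Because labels come from the reference rather than from a fresh unsupervised run, the same cell type gets the same name across datasets, which is the interpretability benefit the abstract describes.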

Speakers

Marie Wang, PhD

Pfizer
Marie Wang is a Bioinformatics Scientist at Pfizer, one of the world’s premier biopharmaceutical companies. She applies rigorous statistical testing and machine learning methods to clinical data to understand the mechanism of action for drug candidates. Prior to joining Pfizer...


Tuesday May 14, 2024 10:35am - 11:00am PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

10:35am PDT

Trustworthy Automation: A Case Study in Explainable Generative AI for Driver Decision Making

There is a need for real-world driving data to understand how people drive in complex traffic situations. The scarcity of such data presents a significant challenge to improving driver and road safety while developing more trustworthy vehicle automation technology. This study addresses the gap by creating models that generate realistic driving scenarios from limited datasets to deepen our understanding of human driver decision-making. The models used in this research are based on explainable generative artificial intelligence, a combination of Generative Adversarial Networks (GANs) and explainable AI (xAI). This approach improves transparency and trustworthiness in understanding how the models operate. The goal is to simulate typical, rare, and critical driving scenarios, capturing a wide range of driver actions under various traffic conditions.

Speakers

Mayuree Binjolkar

Meili Technologies
Mayuree is a Research Scientist at Meili Tech working on AI for in-vehicle health monitoring. She has a Ph.D. in Transportation Engineering and Masters in CS and Intelligent Transportation from the University of Washington. Her expertise bridges AI, transportation, and HCI, focusing...


Tuesday May 14, 2024 10:35am - 11:00am PDT
Room 130, Student Center

10:35am PDT

Womansplaining the Journey: Empowering Women to Thrive in Data Science Careers

In this talk, we will explore the challenges women face in building successful data careers and navigating the persistent issue of mansplaining. As a woman data science manager, I will share personal experiences, insights, and practical strategies for empowering women in this male-dominated field.

The session will begin by examining the dynamics of mansplaining and its impact on women's confidence and professional growth. We will identify common instances of mansplaining in data science workplaces and discuss the underlying biases that perpetuate these behaviors. By shedding light on this issue, we aim to create awareness and drive change towards a more inclusive and equitable industry.

Furthermore, we will provide actionable advice on how women can thrive in their data science careers by building resilience and asserting themselves. We will explore effective communication techniques, strategies for establishing credibility, and methods for navigating challenging professional situations. By equipping women with these tools, we aim to empower them to navigate mansplaining and create a supportive environment that values their contributions.

Join us for an engaging discussion on how we can collectively address mansplaining and foster an inclusive culture that enables women to thrive in their data science careers. Together, we can work towards breaking down barriers, empowering women, and creating a more diverse and equitable data science community.

Speakers

Sara Riker

Nordstrom
Sara Riker is a Manager of Data Science and Analytics at Nordstrom. Starting on the salesfloor, she worked in various manager roles in stores before earning her Masters in Analytics from American University and moving to Seattle as a Data Analyst in 2017. In addition to her experience...


Tuesday May 14, 2024 10:35am - 11:00am PDT
Room 210, Student Center

10:50am PDT

Causal Insights From Observational Data: A Hands-On Python Workshop

Causal analysis is a powerful tool for understanding the mechanisms and effects of interventions in complex systems. While A/B experimentation is the gold standard for extracting causal insights, there are situations where experimenting isn’t possible – such as when the feature was already released or ethical restrictions prevent us from experimenting on certain populations. In such cases, we must rely on observational data, which pose many challenges for causal analysis, such as confounding, selection bias, and unmeasured variables. In this workshop, we will introduce the basic concepts and methods of causal discovery and causal inference as we guide the audience through a hands-on step-by-step causal analysis using common Python causal libraries, including DoWhy and EconML. We will provide a toy dataset for illustration. An internet-connected device is required.
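
To show why confounding matters in observational data, here is a minimal plain-Python sketch that compares a naive treated-versus-control difference with a stratified (backdoor-adjusted) estimate. It does not use DoWhy or EconML and is not the workshop's material; the toy rows are fabricated for illustration, and the sketch assumes the confounder is observed and that every stratum contains both treated and control units.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def naive_effect(rows):
    """Difference in mean outcome between treated and control, ignoring confounders."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def adjusted_effect(rows):
    """Backdoor adjustment: per-stratum effects, weighted by stratum size."""
    strata = defaultdict(list)
    for z, t, y in rows:
        strata[z].append((t, y))
    effect = 0.0
    for members in strata.values():
        treated = [y for t, y in members if t == 1]
        control = [y for t, y in members if t == 0]
        effect += (mean(treated) - mean(control)) * len(members) / len(rows)
    return effect

# Toy rows of (confounder z, treatment t, outcome y). The outcome is built as
# 2 + 5*z + 1*t, so the true treatment effect is 1.0, and units with z = 1
# are more likely to be treated.
rows = (
    [(0, 0, 2.0)] * 3 + [(0, 1, 3.0)]
    + [(1, 0, 7.0)] + [(1, 1, 8.0)] * 3
)
print(naive_effect(rows), adjusted_effect(rows))
```

The naive estimate is inflated because treated units disproportionately come from the high-outcome stratum; conditioning on the confounder recovers the effect built into the toy data.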

Speakers

Sarah Shy

Microsoft
Sarah is a data scientist at Microsoft where she works on applications of causal inference and builds ML models to power intelligent Windows features. Prior to joining Microsoft, she conducted research in the area of astrostatistics. Sarah is also passionate about mentoring newcomers...

Ganga Meghanath

Microsoft
Working on Causal Discovery in the Experimentation for Windows crew.


Tuesday May 14, 2024 10:50am - 12:15pm PDT
Campion Ballroom 914 East Jefferson Street, Seattle, WA, USA

11:05am PDT

Leveraging OpenAI: A Business Application Use Case

Cracking the Code: How Chatbots Get Stuff Done

Hey there! So, you know those super-smart digital helpers called Large Language Models (LLMs)? They’re like the James Bonds of the tech world—slick, resourceful, and ready to tackle any mission.

The Old-School Way: SQL SPs and Fuzzy Lookups

Now, picture this: You’ve got a task that needs some serious automation mojo. Maybe it’s sorting through messy data or fixing wonky addresses. Traditionally, we’d rely on SQL Stored Procedures (fancy term, right?) and fuzzy lookups (sounds like a jazz band). These methods are like rule-following robots—they do their thing, but they’re not exactly flexible. If the source system throws a curveball (like putting a company name in the city field), they’re stumped. Cue the dramatic music!

Enter LLMs: The Cool Cats of Automation

But wait! Here come the LLMs, striding in with confidence. They’re not bound by rules; they thrive on context and nuance. So, when that address mix-up happens, they channel their inner Sherlock. No human intervention needed! They’ll sleuth out the right city name faster than you can say “data wizard.”

Beyond Boring Automation: The Symphony of Imagination

But here’s the real magic: LLMs aren’t just taskmasters. They’re artists. Imagine a symphony where creativity meets logic. LLMs compose solutions that dance around the edges of certainty.

Get ready for an idea-packed conversation! Bring your questions, because we’re about to explore how you can harness the power of language models in your processes.
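
For a concrete picture of the rule-based side of that comparison, here is a small sketch of a fuzzy lookup using Python's standard `difflib` module. It is an illustration only, not the speaker's pipeline: the city list, the `fuzzy_city` helper, and the 0.8 cutoff are assumptions. It fixes a misspelled city but has nothing to offer when a company name lands in the city field.

```python
from difflib import get_close_matches

# Hypothetical list of valid city names.
KNOWN_CITIES = ["seattle", "tacoma", "spokane", "bellevue"]

def fuzzy_city(raw_value, cutoff=0.8):
    """Return the closest known city, or None when nothing is similar enough."""
    matches = get_close_matches(
        raw_value.strip().lower(), KNOWN_CITIES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

fixed_typo = fuzzy_city("Seatle")        # a misspelling the rule can repair
wrong_field = fuzzy_city("Acme Corp")    # a company name in the city field
print(fixed_typo, wrong_field)
```

String similarity can only compare the value against known cities; recognizing that the value belongs in a different field takes the kind of contextual reading the abstract attributes to LLMs.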


Speakers

Jyoti Vasudev

Sr Software Engineer, Microsoft
I hold an engineering position at Microsoft and am actively studying Data Science at Harvard University Extension School. I find joy in helping women and young girls recognize their abilities and reach their fullest potential. Interestingly, I gain even more from these interactions...



Tuesday May 14, 2024 11:05am - 11:30am PDT
Room 210, Student Center

11:05am PDT

Towards sustainability: leverage deep learning in electric vehicle (EV) charging demand prediction

With the accelerating global transition towards sustainable energy, the demand for Electric Vehicles (EVs) has surged, necessitating advancements in EV charging infrastructure. Please join me for an exciting tour of leveraging deep learning in electric vehicle (EV) charging demand prediction! I will present an application we developed, utilizing deep learning to predict EV charging demand and corresponding energy savings, a crucial aspect in optimizing energy distribution and promoting sustainable transportation. Our application explores various deep learning models, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and a transformer, to analyze historical EV charging data alongside external variables influencing charging behaviors. I’ll present the results from different deep learning models and how to turn them into practical solutions. I will also present some fun visualizations of our results! Our application is open to all and takes advantage of publicly available data. In summary, it can serve as a tool for policymakers and/or urban planners in anticipating peak usage periods, optimizing resource allocation, and minimizing strain on the power grid!
This paper is coauthored with Mayuree Binjolkar.
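
Whatever the model family, historical demand usually has to be reshaped into fixed-length input windows and next-step targets before a neural network can be trained on it. The sketch below shows one common way to do that in plain Python; it is an assumption about a typical preprocessing step, not the authors' code, and the hourly values are made up.

```python
def make_windows(series, n_lags, horizon=1):
    """Return (inputs, targets) for supervised forecasting.

    Each input is `n_lags` consecutive values; its target is the value
    `horizon` steps after the window ends.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_lags - horizon + 1):
        end = start + n_lags
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets

hourly_kwh = [5, 7, 12, 20, 18, 9]  # made-up hourly charging demand
X, y = make_windows(hourly_kwh, n_lags=3)
print(X, y)
```

External variables such as weather or day of week would be appended to each window as extra features.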

Speakers

Yuanjie Tu

University of Washington
Yuanjie (Tukey) is a PhD candidate in Transportation Engineering at University of Washington. She mainly works on research projects that aim to advance sustainability outcomes by employing statistical and deep learning models to investigate diverse aspects of transportation behavior...


Tuesday May 14, 2024 11:05am - 11:30am PDT
Room 130, Student Center

11:05am PDT

Training Efficient Open Source Large Language Models
What does it take to train a Large Language Model like ChatGPT? This talk will go over the training and design of DBRX, Databricks’ 132-billion-parameter flagship language model. We will also chat about all the ways in which Data Science helps out with training Large Language Models.

Speakers

Tessa Barton

Databricks
Tessa is a research scientist at Databricks working on retrieval augmented generation. Previously she was at the New York Times using computer vision for sports journalism. She has a master's degree in Computer Science from Brown University and has worked in data science roles at...


Tuesday May 14, 2024 11:05am - 11:30am PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

11:35am PDT

Ethical Implementation of Generative AI in Business Use Cases: A Practical Guide for Innovating Responsibly

In the past year generative AI applications have emerged as powerful tools capable of creating new, previously unseen content. However, with great power comes great responsibility. This talk will delve into the ethical implications and responsibilities associated with the development and deployment of these applications.
We will begin by defining generative AI and its various applications in business scenarios our teams have seen today. We will then explore the concept of Responsible AI, discussing its importance in ensuring the ethical use of AI technologies. We will highlight the potential risks and challenges posed by generative AI, such as the creation of deepfakes, the potential for bias in generated content, and issues related to data privacy.
The talk will also cover practical advice for measuring and mitigating risks in generative AI applications based on enterprise use cases. We will discuss strategies and current best practices for incorporating ethical considerations into the AI development process, such as transparency in AI decision-making, robustness against manipulation, and respect for user privacy. Specifically, we will talk about how businesses experiment with their generative AI solutions (e.g., prompt orchestration); how they evaluate their solutions to go to production (e.g., LLM-based metrics, red-teaming, use of synthetic data for testing); and how they continue monitoring the health of their solutions (e.g., performance metrics, data science metrics). We will also explore the importance of cross-functional collaboration in addressing these challenges, emphasizing the need for input from ethicists, legal experts, social scientists, and user experience professionals.
Finally, we will present case studies of responsible generative AI applications, demonstrating how businesses approach turning responsible AI principles into actionable items as they create solutions for their use cases. We will conclude with a discussion on the future of Responsible AI in the context of generative applications, considering both the opportunities and challenges that lie ahead.
This talk aims to equip data scientists with the knowledge and tools to develop generative AI applications responsibly, ensuring that these powerful technologies are used in a manner that respects user privacy, promotes fairness, and benefits society as a whole. Join us as we navigate the ethical landscape of generative AI, fostering a future where AI serves as a force for good.

Speakers

Meltem Gurcay-Morris, PhD

Microsoft
Meltem Gurcay-Morris, PhD is currently a user researcher in AI Platform at Microsoft, which builds products to aid data scientists and developers in creating AI applications for businesses. Meltem's work focuses on understanding and improving user experiences with respect to implementing...


Tuesday May 14, 2024 11:35am - 12:00pm PDT
Room 130, Student Center

11:35am PDT

Synthetic Data for Instrument Segmentation in Surgery (Syn-ISS)

Synthetic data is increasingly important and relevant in today's data-driven landscape. It addresses privacy concerns by providing a means to generate data that mimics real-world information without exposing sensitive personal details. This makes it particularly valuable in fields like healthcare, where data privacy is paramount. Additionally, synthetic data can be used to fill gaps in datasets where real-world data is scarce or biased, enabling more comprehensive and unbiased AI training. It also allows for the testing and validation of systems in a controlled environment, enhancing model robustness and accuracy. Furthermore, synthetic data is instrumental in scenarios where gathering real-world data is impractical or too expensive, thus accelerating research and development across various industries. Our simulators and expertise in surgical simulation enable us to generate synthetic data that significantly enhances AI applications in the medical field. This contribution is pivotal in advancing AI-driven innovations in surgery. By harnessing advanced algorithms and state-of-the-art simulation technologies, we can produce high-quality synthetic data that closely mimics real-world surgical scenarios.

The core of this presentation revolves around the Synthetic Data for Instrument Segmentation in Surgery (Syn-ISS) challenge, hosted at MICCAI 2023 in Vancouver, Canada. The Syn-ISS challenge highlights the innovative use of semantic image segmentation algorithms and synthetic data derived from our state-of-the-art surgical simulators. Specifically, the challenge focuses on segmenting surgical instruments within synthetic data images. Twelve teams competed. The dataset consisted of 3600 synthetic images generated from our FlexVR simulator. The winners were chosen by a composite ranking score built from two weighted metrics: the Dice Similarity Coefficient (DSC) and the Hausdorff Distance (HD). The challenge and its participants showed that synthetic data can be used in medical AI, benefiting medical education for humans and machines and ultimately improving patient outcomes.
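
For readers unfamiliar with the first of those metrics, the Dice Similarity Coefficient between a predicted mask A and a ground-truth mask B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical masks). The sketch below computes it for small binary masks in plain Python. It illustrates the metric's definition only; the challenge's actual evaluation code, weighting, and edge-case conventions are not given in the abstract, and treating two empty masks as a perfect match is an assumption made here.

```python
def dice_coefficient(predicted, truth):
    """Dice Similarity Coefficient for two binary masks given as nested lists."""
    pred_flat = [bool(value) for row in predicted for value in row]
    true_flat = [bool(value) for row in truth for value in row]
    if len(pred_flat) != len(true_flat):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred_flat, true_flat))
    total = sum(pred_flat) + sum(true_flat)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
truth_mask = [[0, 1, 0],
              [0, 1, 1]]
score = dice_coefficient(predicted_mask, truth_mask)
print(round(score, 3))
```

Dice rewards overlap in area, while the Hausdorff Distance penalizes the worst boundary error, which is one reason the two are often reported together.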

Speakers

Kimberly Glock, MS

Surgical Science
Kimberly serves as a Data Scientist at Surgical Science, where she is an integral part of the Research and Development Data Science team based in Seattle. Her expertise is primarily channeled into constructing machine learning models to support multiple projects across the company...


Tuesday May 14, 2024 11:35am - 12:00pm PDT
Room 210, Student Center

11:35am PDT

Empowerment Journeys: Entering, Excelling, and Exceeding Expectations in the Data Science Workforce

Join us for an insightful panel discussion, where we’ll explore career development, facing adversity and overcoming challenges as women in the fields of data science and AI. Our diverse panel of accomplished professionals will guide the attendees from the initial steps of launching their careers through rising to leadership positions and excelling amidst competition. Panelists will talk about the triumphs and obstacles encountered along their journeys, and share what their own version of empowerment looks like. We will delve into common challenges facing women in these fields, including equitable compensation, recognition, promotions, maintaining work-life balance, and navigating double standards. Whether you're contemplating a career in data science or seeking advancement in your current role, we invite you to join us for an honest & inspiring conversation with our amazing lineup of data science experts. They will share tips, strategies, and real-world anecdotes from their own journeys in tech & academia, providing invaluable insights and guidance for every stage of your career.

Speakers

Bernease Herman

Data Scientist, University of Washington eScience Institute
Bernease Herman is a data scientist and researcher at the University of Washington eScience Institute. Her research focuses on interpretable machine learning with work in fairness, accountability, and transparency. In her work, she collaborates with academic researchers, startups...

Madison Swain-Bowden

Senior Data Engineer, Automattic
Madison is a Senior Data Engineer & former Team Lead out of Seattle and an avid Python user/organizer. She is currently sponsored by Automattic to work on the open source project Openverse, and has worked at Ookla (Speedtest.net), the Allen Institute for Cell Science, and the Broad...

Diala Ezzeddine

Product Manager, DeepLearning.AI

Iswarya Murali

Principal Data Scientist, Microsoft
Iswarya is a Principal Data Scientist at Microsoft, where she is the technical architect for integrating Generative AI and LLMs in Microsoft's Security suite of products. She was previously at Google, working in the Risk and Fraud detection space. Her name is pronounced Aysh-Var...


Tuesday May 14, 2024 11:35am - 12:30pm PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

12:05pm PDT

Data Science at an Electric Company: Building and Validating an Electric Vehicle Detection Model

Washington has recently passed two pieces of legislation that impact electric companies in the state: the Clean Energy Transformation Act (CETA) and the Zero Emissions Vehicles Law. They require that the state’s electricity supply be free of greenhouse gas emissions by 2045 and for all new vehicles sold in the state to be zero-emission vehicles by 2035, respectively. In addition to moving away from coal-fired power plants, CETA states that utilities must consider the equity impact of these clean energy investments on vulnerable populations and highly impacted communities. To address this, utilities are developing energy efficiency incentives and community-based distributed energy resources. The changing nature of our state’s electric usage patterns due to these investments in community programs and an increase in electric vehicle (EV) adoption will put new stresses and strains on our electrical grid that we need to understand. In this talk I will focus on the model we developed at a Washington utility to detect EVs in order to understand their impact on our electrical grid.

In the coming decades, most electric vehicles are expected to be charged at single-family residences in the evening hours, rather than at public charging stations at all hours. To prepare for the increased load on our power grid during peak times, electric companies need to know when and where the charging is happening. Building an EV charging detection model is difficult because the expected share of EVs in the population is only around 1-5%, so the classification problem is highly imbalanced. Starting with a relatively small labeled dataset, we built an EV detection model using novel step-detection features in time-series data. Using a random forest classifier, we are able to achieve accuracy, precision, and recall metrics of over 80%. To validate our results against what we expect among our population of customers, we compared them to aggregated data from the Department of Licensing as well as survey results from our customers.
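The abstract does not say how its step-detection features are defined. As a rough illustration only, the sketch below counts abrupt jumps in a household load series, the kind of summary that could feed a classifier such as a random forest. The function name `step_features`, the 3 kW threshold, and the toy readings are all assumptions made for this example, not details from the talk.

```python
from statistics import median


def step_features(load_kw, min_step_kw=3.0):
    """Summarize abrupt jumps in a load series (hypothetical feature set).

    load_kw: consecutive meter readings in kW.
    min_step_kw: assumed minimum jump size that counts as a step.
    """
    # Change between each reading and the next one.
    diffs = [b - a for a, b in zip(load_kw, load_kw[1:])]
    ups = [d for d in diffs if d >= min_step_kw]
    downs = [d for d in diffs if d <= -min_step_kw]
    return {
        "n_up_steps": len(ups),
        "n_down_steps": len(downs),
        "median_up_kw": median(ups) if ups else 0.0,
        "max_abs_step_kw": max((abs(d) for d in diffs), default=0.0),
    }


# Toy evening profile: baseline load, then a sustained charger-like block.
evening = [0.8, 0.9, 1.0, 8.2, 8.1, 8.3, 8.2, 1.1, 0.9]
features = step_features(evening)
print(features)
```

With only 1-5% positives, a per-household feature dictionary like this would typically be paired with some form of class weighting or resampling when training; which approach the speakers used is not stated.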


Overall, our stakeholders are satisfied with our model and its ability to predict which customers are charging an EV. I will discuss how we work with our stakeholders to understand which metrics we need to optimize for in order to help them prioritize their maintenance work. I will also briefly discuss next steps for this model.

Speakers

Andrea Urban, PhD

Puget Sound Energy
Andrea Urban is an astronomer-turned-data scientist. When she isn't chasing the next total solar eclipse, she enjoys looking for patterns in data and building bespoke machine learning models. 


Tuesday May 14, 2024 12:05pm - 12:30pm PDT
Room 210, Student Center

12:05pm PDT

MLOps for the Lonely Data Scientist

As the lone data scientist on a very small team, it can be very difficult to know or implement best practices for putting code into production. Many aspects of MLOps are often controlled by larger data engineering teams that you or your company may not have access to, but there are still tools and practices that we can implement in our day-to-day work. In this talk, we will explore implementing practices such as version control, continuous integration and continuous deployment (CI/CD), and using development and production environments. This is an opportunity to borrow from the software engineering development cycle and make our data science work more resilient to change or failure.

Speakers

Kelley Hall

Data Scientist, Tableau
Kelley Hall is a Data Scientist at Salesforce working on the Tableau Global Sales Operations team where she uses ML to enable data driven decision making within the sales organization. Her projects range from sales forecasting to discount recommendation. She received her PhD from...


Tuesday May 14, 2024 12:05pm - 12:30pm PDT
Room 130, Student Center

12:30pm PDT

Lunch
Tuesday May 14, 2024 12:30pm - 1:30pm PDT
Campion Ballroom 914 East Jefferson Street, Seattle, WA, USA

1:40pm PDT

Empowering Women in Career Progression: A Comprehensive Evaluation of the Principal vs Manager Data Science Pathway

In this engaging talk we aim to provide women data science professionals with an in-depth understanding of two crucial career advancement paths: the Principal role and the Managerial role. We will start by explaining the distinct responsibilities, skills required, and potential influence of each role, with a specific focus on the experiences and challenges faced by women in these positions. We will draw from real-world case studies and industry trends to highlight the unique advantages and potential hurdles in each role. Next, we will delve into a detailed discussion of the pros and cons associated with both roles. This will encompass aspects such as job satisfaction, work-life balance, compensation, and growth opportunities, all from a woman's perspective. The objective is to equip participants with the necessary insights to make informed career decisions that align with their professional goals and personal aspirations. In the final segment, we will facilitate an interactive dialogue, providing attendees with the opportunity to share their perspectives, ask questions, and learn from the experiences of other women professionals. This talk is designed to be a strategic tool for women in charting their career path in data science, helping them understand whether a Principal or Manager role best aligns with their professional ambitions and personal needs.

Speakers

Katherine Ostbye, MPH

Pfizer
Kate Ostbye is the Director of Data Science and Machine Learning at Pfizer, leading AI/ML solution delivery for R&D and co-leading a coding CoP and a local women's resource group. Kate holds a BS in English and Anthropology from UW-Madison, and an MPH in Epidemiology and Biostatistics...


Tuesday May 14, 2024 1:40pm - 2:05pm PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

1:40pm PDT

Fireside Chat

In this informal chat, Catherine will answer questions on her journey from studying ancient volcanoes to writing data science books, via deploying production machine learning models! She’ll discuss her latest book, “Software Engineering for Data Scientists,” and give you some recommendations for applying software engineering best practices to data science code. She’ll also answer your questions on her move into data science.

Speakers

Catherine Nelson

Data Scientist, Freelance
Catherine Nelson is a freelance data scientist and writer. She is currently working on the forthcoming O’Reilly book “Software Engineering for Data Scientists”. Previously, she was a Principal Data Scientist at SAP Concur, where she delivered production machine learning applications...


Tuesday May 14, 2024 1:40pm - 2:05pm PDT
Room 130, Student Center

2:00pm PDT

Decoding Ethics: Perspectives on Responsible Data Science
This panel will explore the range of ethical issues that occupy the data science terrain. The diverse group of experts will bring insights from both academia and industry to confront issues of data bias, sustainability, and individual rights. The discussion will delve into how societal biases have infiltrated cutting-edge applications of data science, and strategies to address these biases. Panelists will reflect on concerns regarding the industry’s impacts on the environment, privacy, and intellectual property, as well as how data science can be utilized for social good.

Speakers

Tuesday May 14, 2024 2:00pm - 2:55pm PDT
Campion Ballroom 914 East Jefferson Street, Seattle, WA, USA

2:10pm PDT

At Home With Large Language Models: Identifying Social Determinants of Health in Clinical Data of Pregnant Women

Social Determinants of Health (SDoH) such as housing stability are known to be intricately linked to a patient’s health status, and pregnant women experiencing housing instability are known to have worse health outcomes. We compared the ability of Large Language Models (LLMs) including GPT-3.5 and GPT-4 in identifying instances of both current and past housing instability, as well as general housing status, from 25,217 notes from 795 pregnant women. Results were compared with manual annotation, a named entity recognition (NER) model, and regular expressions (RegEx). Compared with GPT-3.5 and the NER model, GPT-4 had the highest performance and had a much higher recall (0.924) than human annotators (0.702) in identifying patients experiencing current or past housing instability, although precision was lower (0.850) compared with human annotators (0.971). This work demonstrates that, while manual annotation is likely to yield slightly more accurate results overall, LLMs provide a scalable, cost-effective solution with the advantage of greater recall. More efficient methods for obtaining structured SDoH data can help accelerate inclusion of exposome variables in biomedical research, and support healthcare systems in identifying patients who could benefit from proactive outreach.
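To compare the two precision/recall pairs quoted above on a single scale, one option is the F1 score, the harmonic mean of precision and recall. F1 is not reported in the abstract, and it weights precision and recall equally, which may not match what the authors mean by overall accuracy; the sketch below is only a worked calculation from the published figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall exactly as quoted in the abstract.
gpt4_f1 = f1_score(precision=0.850, recall=0.924)
human_f1 = f1_score(precision=0.971, recall=0.702)

print(f"GPT-4 F1: {gpt4_f1:.3f}")
print(f"Human annotators F1: {human_f1:.3f}")
```

Which summary is appropriate depends on the use case: for proactive outreach, missed patients (low recall) and false alarms (low precision) may carry very different costs.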

Speakers

Alexandra Ralevski, PhD

Data Scientist II, Institute for Systems Biology
Alexandra Ralevski, PhD is a Data Scientist at the Institute for Systems Biology in Seattle, WA. She is currently leading a Generative AI team to explore the use of Large Language Models such as GPT-4 in extracting complex SDoH data from Electronic Health Records. She also led a...


Tuesday May 14, 2024 2:10pm - 2:35pm PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

2:10pm PDT

Finetune LLMs

Newer, smaller LLMs like Llama 2 and Mistral can be fine-tuned easily and offer more power than prompting alone. They utilize a new fine-tuning paradigm called parameter-efficient fine-tuning (PEFT). Prompting has been shown to have many restrictions and can limit the capabilities of LLMs; with PEFT, even smaller models perform very well.
This talk will introduce the audience to PEFT and show how to fine-tune LLMs to build custom solutions. It will be especially beneficial for audience members who want to learn NLP in the LLM era.

Speakers

Riya Joshi

Microsoft
Riya is a Data Scientist at Microsoft who specializes in NLP and machine learning. She holds a Master’s degree in CS from the University of Massachusetts, Amherst, which she completed in May 2022. Before joining Microsoft’s US team, she worked as a Data Engineer in India. She...


Tuesday May 14, 2024 2:10pm - 2:35pm PDT
Room 130, Student Center

2:40pm PDT

Overcoming challenges and pitfalls of A/B testing

This session will go beyond an overview of what A/B testing is. It will cover how to work with cross-functional partners to set up a test and analyse it. Finally, I will talk about how to make a decision based on the test results. I will go into depth about the most common challenges and pitfalls that I have experienced throughout my career and how to avoid making the most common mistakes. After the talk, you will know what to do when someone asks you to analyse an experiment you haven't designed, how to deal with partners asking for 'directional data' and how to work successfully with engineering to ensure each test is set up correctly.

Speakers

Kasia Rachuta

Data Science Tech Lead, Square
Kasia is a Data Science Tech Lead at Square, where she collaborates with her team to drive data-informed decision-making. Her expertise spans various domains, including identity verification, sales analytics, ecommerce, and infrastructure. Prior to her current role, she gained valuable...


Tuesday May 14, 2024 2:40pm - 3:05pm PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

2:40pm PDT

Reinforcement Learning for Model Bias Analysis

With the broad-scale application and massive growth of artificial intelligence (AI) and machine learning (ML) in all aspects of society, a question persists as to the robustness of these systems. In most cases, methods of investigating trustworthiness and explainability in ML models have focused on reactive methods designed to detect when a model has erred. These avenues of investigation are also limited to standard interrogation methods, which may be inadequate for sufficiently novel model architectures or data modalities. We, on the other hand, are developing a proactive method to anticipate possible failure states by simulating a unique and optimal adversarial attack using reinforcement learning (RL). We explore RL as a technique for evaluating model biases and robustness and propose an RL Optimizing Bias Elimination and Robustness Tool (ROBERT). The expected outcome of ROBERT is to learn how biases in a model can be exploited under potential adversarial attack.

In developing ROBERT, we train an image classification model on the MNIST dataset and construct an RL environment that perturbs input images which are then passed into this classification model. The reward of our system is designed to correlate with the impact of the perturbations on the model’s ability to correctly classify the image, with model error translating to higher reward, therefore teaching ROBERT the classifier model’s weaknesses. We validate ROBERT by means of a test wherein we train multiple image classification models with differing architectures and analyze ROBERT’s chosen actions to identify probable model biases. Additionally, we observe how extensible these methods are to the black-box adversarial case, which requires less information from the model to perform a successful attack. In conducting this experiment, we develop a novel RL-based methodology aimed at identifying unseen points of weakness and bias in existing image classification models.
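The abstract describes the reward only qualitatively: perturbations that cause more classifier error earn more reward. One hypothetical way to express that idea is sketched below. The function `perturbation_reward`, the confidence-drop term, and the `flip_bonus` value are assumptions for illustration and are not ROBERT's actual reward.

```python
def perturbation_reward(probs_before, probs_after, true_label, flip_bonus=1.0):
    """Score a perturbation by how much it hurts the classifier (hypothetical).

    probs_before / probs_after: class probabilities from the classifier for
    the clean and the perturbed image. true_label: index of the correct class.
    """
    # How much confidence in the correct class was lost.
    confidence_drop = probs_before[true_label] - probs_after[true_label]
    # Class the classifier now predicts for the perturbed image.
    predicted_after = max(range(len(probs_after)), key=probs_after.__getitem__)
    misclassified = predicted_after != true_label
    # Extra reward when the perturbation actually flips the prediction.
    return confidence_drop + (flip_bonus if misclassified else 0.0)


# Toy 3-class example: the perturbation shifts probability away from class 0.
before = [0.90, 0.05, 0.05]
after = [0.30, 0.60, 0.10]
reward = perturbation_reward(before, after, true_label=0)
print(reward)
```

Because this version needs only output probabilities, not gradients or weights, it would also fit the black-box setting the abstract mentions.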

Speakers

Rachel Wofford

Data Scientist, PNNL
Rachel Wofford is a Data Scientist at PNNL. Her research and interests involve reinforcement learning, adversarial machine learning, and development of big data analytics in the radio frequency and cybersecurity domains. Rachel holds an MS from Oregon State University and a BS from...

Anastasiya Usenko

Data Scientist, PNNL
Anastasiya Usenko is an early-career data scientist in the field of applied deep learning research, with bachelor's degrees in computer science and linguistics. At PNNL, she has worked with reinforcement learning, graph neural networks, and causal inference modeling, among others...


Tuesday May 14, 2024 2:40pm - 3:05pm PDT
Room 130, Student Center

3:15pm PDT

Data expansion to improve accuracy and availability of digital biomarkers for human health and performance

Advances in deep learning and sparse sensing have emerged as powerful tools to enable and expand human motion tracking. Motion tracking and analysis is essential for monitoring disease progression, guiding rehabilitation treatment, evaluating sports performance, and informing assistive device design. Biomechanists traditionally characterize motion, such as gait, by measuring biomechanical variables like joint kinematics, kinetics, and spatio-temporal parameters. Certain biomechanical variables have been established as biomarkers that correlate with meaningful outcomes, such as knee adduction angle for ACL injury or step width variability for aging/fall risk. In the US, with 1 in 7 individuals having a mobility disability and 1 in 2 adults living with a musculoskeletal condition, monitoring human motion 'in the wild' is vital for observing individuals' natural functionality and lifestyle. For motion to be observed in natural or uncontrolled environments, sensing devices must be portable, unobtrusive, reliable, and accurate. However, for sensing data to be meaningful, measurements must be converted to and contextualized as personalized biomechanical outcomes, a challenge not yet overcome in natural environments. Here, we present a deep learning algorithm -- originally developed for full state-space reconstruction of complex dynamical systems -- for personalized human motion tracking. Using this algorithm, we learn a mapping that transforms a low-dimensional sensor input into the full state-space dataset. By using as few as one sensor, we demonstrate that it is possible to reconstruct a comprehensive set of measures that are important for tracking and informing mobility-related health outcomes. As a concrete example, most smartwatches and smartphones contain an IMU (inertial measurement unit) sensor that monitors movement and is currently used for simple measures like daily step count or gesture control. 
We have demonstrated that our deep learning algorithm can use this single sensor to reconstruct not just the body segment where the sensor is worn, but the motion and – in some cases – the physiological state of the body. The basic premise of our approach that makes this powerful transformation possible is the leveraging of sensor measurement time histories to inform the mapping from low to high dimensional data. By expanding our datasets to unmeasured or unavailable quantities, this work can impact clinical trials, robotic/device control, and human performance. Additionally, this methodology may enable more efficient and cost-effective remote monitoring of patients, reducing the need for frequent visits to clinical settings. Overall, our work represents a major advance in personalized human motion sensing and has the potential to transform the way we monitor and manage movement-related health outcomes.

Speakers

Megan Ebers

Postdoctoral scholar, University of Washington
I am a postdoctoral scholar in Applied Mathematics with the NSF AI Institute in Dynamic Systems at the University of Washington. My postdoctoral research focuses on data-driven and reduced-order methods for complex systems. In my PhD research, I developed and applied machine learning...


Tuesday May 14, 2024 3:15pm - 3:40pm PDT
Room 130, Student Center

3:15pm PDT

Revolutionizing Search: The Integration of Generative AI and the Technical Challenges Ahead

In the rapidly evolving landscape of search engine technology, the integration of Generative AI has marked a paradigm shift from traditional keyword-based algorithms to advanced, intent-driven models. This talk aims to dissect this transformation, elucidating how models like GPT-4 are not just enhancing search engine capabilities but are redefining them. We begin by exploring the genesis of this change – the shift from simple keyword recognition to the complex understanding of user intent. This is a journey from linear algorithms to AI models that comprehend context, semantics, and the nuanced intricacies of human language. The talk will illuminate how these AI-driven engines are now capable of predicting user intent, thereby delivering search results that are not only accurate but also contextually relevant, making information retrieval more intuitive and efficient. However, this innovation is not without its challenges. The core of this discussion will pivot to the myriad technical hurdles encountered in blending Generative AI into existing search architectures. We'll delve into the computational demands these models impose, addressing the need for substantial processing power and advanced data handling capabilities. This segment will also cover the obstacles in adapting to the rapid pace of AI technology evolution, ensuring that search engines remain not just relevant but cutting-edge. Another crucial aspect is data privacy and security – paramount in an era where user data is both vital and sensitive. We'll examine the strategies to safeguard user privacy while leveraging AI for personalized search experiences. Furthermore, we'll address the challenge of linguistic dynamism – how AI models cope with the ever-changing nature of human language and the implications this has for search accuracy and relevance. 
This talk aims not only to highlight the revolutionary impact of Generative AI on search engines but also to provide insights into the practical solutions and strategies being developed to surmount the associated technical challenges. It's designed for an audience deeply entrenched in data science and technology, offering a blend of high-level understanding and technical detail that will resonate with professionals in the field.

Speakers

Akriti Chadda

Microsoft
Akriti is an accomplished applied scientist with a strong focus on search and relevance. She possesses a diverse skill set, having earned an undergraduate degree in biomedical engineering and a master's in computer science. Her expertise lies in developing advanced algorithms for...


Tuesday May 14, 2024 3:15pm - 3:40pm PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

3:15pm PDT

Landscape of Data Science Across Industries
Join us for a discussion of career insights and everyday experiences from talented women applying data science in various sectors. This panel discussion brings together a diverse group of seasoned data science professionals from big tech and beyond, including biomedical research, entertainment, and retail. Our panelists will shed light on the challenges and opportunities within their industries to innovate and solve problems with data science. Attendees will gain invaluable insights into commonly used tools and technical approaches, as well as glimpses into the daily roles of these data science experts.

Whether you're curious about the nuances of data science in different sectors, seeking direction for skill development, or pondering the next step in your career, this panel promises a broad exploration of the data science landscape, illustrated through the practical experiences of outstanding women in the field.

Speakers

Rebecca Hadi

Nordstrom
Rebecca holds a Master's degree in Applied Mathematics from Johns Hopkins University and a Bachelor's degree in Mathematics from the University of Washington. She loves automating and optimizing processes, building Shiny apps, and helping others learn to code.


Tuesday May 14, 2024 3:15pm - 4:10pm PDT
Campion Ballroom 914 East Jefferson Street, Seattle, WA, USA

3:50pm PDT

Is it worth it? Let me work(shop) it!

With the latest GenAI hype wave, many executives are asking data teams, "can't we just use AI to do this?". Execs and business leads often don't know enough about traditional ML or Generative AI to assess the utility of these tools, bogging down data scientists with unrealistic requests. This session provides data scientists with a foundation of questions to ask over-eager execs to evaluate and prioritize ML use cases through a series of workshops.

Depending on the maturity of the data science practice and the expertise of the business lead in question, we've found three different types of workshops valuable in helping to educate and inspire:
1. Use case workshop: identify business pain points and brainstorm connections to ML/GenAI solutions.
2. Prioritization workshop: optional follow-on to the use case workshop to identify the highest-ROI use cases.
3. Requirements workshop: deep dive into a specific problem to identify the core users, the proposed solution, and the expected impact.

Attendees will receive a sample workshop agenda, templates, and tips for effective virtual and in-person facilitation.


Tuesday May 14, 2024 3:50pm - 4:15pm PDT
Room 160, Student Center 901 12th Ave, Seattle, WA 98122, USA

3:50pm PDT

Race and Gender Bias in Generative AI Models

In this study, we set out to measure race and gender bias prevalent in text-to-image (TTI) AI image generation, focusing on the popular model Stable Diffusion from Stability AI. Previous investigations into the biases of word embedding models (which serve as the basis for image generation models) have demonstrated that models tend to overstate the relationship between semantic values and gender, ethnicity, or race. These biases are not limited to straightforward stereotypes; more deeply rooted biases may manifest as microaggressions or imposed opinions on policies, such as paid paternity leave decisions. In this analysis, we use the image captioning software OpenFlamingo together with Stable Diffusion to identify and classify bias within text-to-image models. Utilizing data from the Bureau of Labor Statistics, we engineer fifty prompts for professions and fifty prompts for actions in the interest of coaxing out shallow to systemic biases in the model. Prompts include generating images for “CEO”, “nurse”, “secretary”, “playing basketball”, and “doing homework”. After generating twenty images for each prompt, we document the model’s results, which show that biases do exist within the model across a variety of prompts. For example, 95% of the images generated for “playing basketball” depicted African American men. We then analyze our results by categorizing our prompts into a series of income and education levels corresponding to data from the Bureau of Labor Statistics. Ultimately, we find that racial and gender biases are present, though not drastic in every case.

Speakers

Tanisha Jauhari

Tanisha is a student from the San Francisco Bay Area. She has done machine learning research, primarily focusing on bias in generative artificial intelligence systems. Tanisha is passionate about supporting girls and women in STEM, and she serves as a 2024 ambassador for Women in...


Tuesday May 14, 2024 3:50pm - 4:15pm PDT
Room 130, Student Center

4:30pm PDT

Afternoon Keynote: Towards Robust and Responsible AI
The rise of foundation models has substantially reduced the barrier to adopting AI and enabled use cases which seemed like science fiction just 10 years ago. Now, the key challenge for AI practitioners is operating these AI-powered applications with the transparency and control necessary to deliver a lasting, positive customer and business impact. In this presentation, we will review key risks associated with operating AI applications and how to take advantage of the emerging AI tooling ecosystem to mitigate these risks. We will discuss the role of each AI practitioner in facilitating robust and responsible AI adoption.

Speakers

Alessya Visnjic

WhyLabs
Alessya Visnjic is the CEO of WhyLabs, the AI Observability company. Prior to WhyLabs, Alessya was a CTO-in-residence at the Allen Institute for AI and earlier she spent 9 years at Amazon leading AI initiatives. Alessya is the founder of Rsqrd AI, a global community of AI practitioners...


Tuesday May 14, 2024 4:30pm - 5:10pm PDT
Campion Ballroom 914 East Jefferson Street, Seattle, WA, USA

5:10pm PDT

Closing Ceremony
The closing ceremony will feature quick remarks from our volunteer team before adjourning for happy hour networking.

Speakers

Kelly Stroh

Data Scientist, Trupanion

Niwako Sugimura

People Analytics Lead, Deloitte

Yashaswini Agarwal

Data Analyst, Mount Sinai


Tuesday May 14, 2024 5:10pm - 5:30pm PDT
Campion Ballroom 914 East Jefferson Street, Seattle, WA, USA

5:30pm PDT

Happy Hour
Tuesday May 14, 2024 5:30pm - 6:00pm PDT
Campion Ballroom 914 East Jefferson Street, Seattle, WA, USA
 