Data Science Project Match, Aug 30 at 3pm, DL 220!

Aug 16, 2022 • Tyler S. on behalf of Dan S.

You’re invited to the annual fall Data Science Project Match, hosted by the Yale Institute for the Foundations of Data Science. Yale faculty (or their representatives) will present data science projects that students might be able to work on. This will be a great way for S&DS or AMTH majors to find senior projects, for S&DS Ph.D. students to find practical work projects, and for students in other majors to learn about opportunities to do research with Yale faculty.

We will meet on Tuesday, August 30th, from 3-4 pm, in room 220 of Dunham Labs at 10 Hillhouse Avenue. We will start promptly at 3:00 pm so that we have time to hear from all the faculty.


Hyojung Seo, Assistant Professor of Psychiatry and Neuroscience, Yale School of Medicine |
Understanding how the brain generates intelligent behavior via complex neural networks
Cognitive behavior is generated by coordinated activity across networks of neurons, but it remains poorly understood how complex spatiotemporal features of network activity mediate diverse elements of cognition. Exploiting recent advances in neurotechnology and computer science, we are interested in exploring new tools to analyze and model the neural activity underlying cognition. First, several statistical methods have been proposed to decompose and analyze high-dimensional population neural activity recorded simultaneously from many neurons. The first project aims to explore and assess these proof-of-concept methods within the dynamical systems approach by applying them to neural data recorded from diverse brain areas and under different behavioral contexts. A second project aims to use artificial neural networks and deep reinforcement learning to model how cognition, such as theory of mind, can emerge from a neural network interacting with other agents in simple games. Finally, as we plan to collect neural data from large-scale networks, we would like to explore network analysis methods to model and understand how distinctive cognitive and motivational functions are generated by the collective and interactive dynamics of the networks' constituent nodes, and how specific perturbations of the network dynamics could lead to the symptoms of psychiatric illnesses.

Forrest Crawford, Associate Professor, Biostatistics, Statistics & Data Science, Operations, EEB |
Dynamics of the January 6, 2021 insurrection at the US Capitol
On January 6, 2021, President Donald J. Trump led the “Stop the Steal” rally at the Ellipse outside the White House in Washington DC. During and immediately after the President’s speech, the crowd moved toward the US Capitol, breached the building security perimeter, and a riot ensued. Members of the crowd broke into the Capitol and attempted to disrupt the counting of electoral votes from the 2020 election. Hundreds of participants and police were injured, and at least 5 deaths occurred as a result of the riot; 884 individuals have been charged with crimes for their role in the events of January 6. The purpose of this project is to study the network and aggregate movement dynamics of participants during the Capitol riot using mobile device location data. Specifically, the analytic goals include understanding the flow dynamics of participants from the rally toward the Capitol and into secure areas, crowd density estimates within the riot zone, locations and times where participants breached the Capitol building, and proximity network patterns among subgroups of riot participants.

Claire Bowern, Professor of Linguistics |
Neural Network Classifier for Voynich Plant Illustrations
The Voynich Manuscript (MS 408) is a 15th-century cipher manuscript in Yale’s Beinecke Library. Two of its five sections include illustrations of plants and astrological diagrams. While there have been attempts to link the botanical illustrations to known plant species, the search space is large. This project builds on a prototype that matches Voynich illustrations with images from other medieval manuscripts. The prototype scrapes selected manuscript archives (e.g. the British Library and Bodleian Library collections), processes illustrations, and trains a neural network to classify the images and create a database of possible Voynich matches. I am looking to work with one or two students to scale up the prototype to full development. Experience with Python and neural network classification models is needed.

Amin Karbasi, Associate Professor of Electrical Engineering, Computer Science, and Statistics and Data Science |
Gaming the Learning
If you like games (e.g., chess) and statistical learning, then this might be of interest to you. Consider the task of learning an unknown concept from a given concept class: to what extent does interacting with a domain expert accelerate the learning process? It turns out that the answer lies in better understanding the game between an adversary (who tries to deceive) and a learner.

Margaret S. Clark, John M. Musser Professor of Psychology | Head, Trumbull College | Dean of Academic Affairs |
Emotional dynamics in close relationships
People experience and express (or suppress) emotions primarily in the context of their close relationships with friends, family, and romantic partners, and they constantly monitor their partners’ emotions within these same relationships. We have two data sets relevant to these processes which remain to be explored. First, we have longitudinal data from 108 couples (216 individuals), including personality measures, self-reports of their tendencies to experience and express a variety of positive and negative emotions, and their perceptions of their partner’s experiences and expressions of the same emotions, as well as two 5-day daily diary studies in which they report how they felt upon giving and receiving benefits from one another. We have a second data set from just over 200 couples (400 individuals) in which the same personality measures and self-report measures were collected and, in addition, couples engaged in four taped discussions of a positive and a negative event that occurred for each of them. These tapes have been coded for verbal and non-verbal expressions of a variety of positive and negative emotions (by objective observers, the expressor of the emotion, and that person’s partner). Do people project their own emotions onto what they see in their partners? Do people’s self-reports of their general tendencies to experience and express emotions match what they themselves report feeling in the moment, what their partners report, and what objective observers report? How do personality factors relate to emotion expression?
Is being romantically partnered linked to better mental health for people of all sexual identities?
Two well-established findings are (1) that gay, lesbian, and bisexual individuals, on average, suffer from greater depression and anxiety than do heterosexual individuals, and (2) that adults who are partnered, on average, experience less depression and anxiety than do those who are not partnered. However, studies supporting the latter finding have been done with exclusively (or primarily) heterosexual samples. We have collected a data set including partnered (for at least one year) and not-partnered (for at least one year) heterosexual, gay, lesbian, and bisexual individuals. They all filled out measures of anxiety, depression, life satisfaction, relationship satisfaction, and discrimination (experienced within the last year and over the course of their lifetimes). The data set can be used to explore not just whether partnering is associated with the same benefits for members of all sexual orientations but also, if so, why, and if not, why not.

Luke Sanford, Assistant Professor of Environmental Policy and Governance, Yale School of the Environment |
Satellite imagery and machine learning for causal impact evaluation
This project develops machine learning methods to measure economic development or environmental damage from satellite imagery. We show that many outcome variables, as measured with existing remote sensing and machine learning methods, can generate bias when used in causal impact evaluation. When standard machine learning methods minimize loss, they produce estimates that are unbiased on average across the training data. However, this unbiasedness is not likely to hold across important subsets of the data, including across the range of true values of the outcome variable or across important independent variables. We propose two strategies. First, we use adversarial debiasing algorithms (originally developed to ensure that machine learning methods do not encode racial or other demographic biases) to generate suitable measures. Second, we use an active learning labeling method to reduce bias in existing methods while reducing the total amount of labeling researchers have to conduct. We are looking for students who have a background in both statistics and machine learning. Any experience in the areas of adversarial methods, active learning, computer vision, or spatial data analysis is a plus!

Meg Urry, Israel Munson Professor of Physics; Director, Yale Center for Astronomy and Astrophysics

Presented by Aritra Ghosh, senior graduate student in Prof. Urry’s group |
Assessing the Shapes of Galaxies & AGN using Machine Learning
Have you ever wondered how many galaxies are out there in this universe? While the real answer is infinite, in our broad neighborhood of the universe, we estimate that number to be 100 billion! New telescopes, including the James Webb Space Telescope (JWST) and the upcoming Rubin Survey, are allowing a significant expansion in the distance (and number of galaxies) we can image. The sheer volume of this imaging dataset makes it very difficult to analyze using traditional astronomical tools. Our lab has developed two flagship algorithms, GaMorNet and GaMPEN, to help determine the shapes and sizes of ~10 million galaxies, spanning multiple surveys and redshifts (distance from Earth). We have also adapted another generative network, PSFGAN, to apply the above algorithms to Active Galactic Nuclei (AGN; galaxies where very massive black holes at their center release lots of energy in the form of electromagnetic radiation). Since galaxies and AGN with different shapes evolve differently over time, assessing the shapes of these objects allows us to infer how galaxies and black holes evolve and how their evolution is correlated.
Project options: a) developing a machine learning tool that can detect merging galaxies – an important subclass which represents 15-30% of all galaxies. Merging galaxies are astronomically interesting because merging is one of the mechanisms via which galaxies evolve and change shape, and mergers have also been shown to affect the rate of star formation and AGN activity. Due to their distorted or unusual shapes, merging galaxies often confuse our shape-determining algorithms. Thus we would like to develop an ML framework that can flag mergers and subsequently determine the shapes of the merging galaxies; b) improving the uncertainty quantification of GaMPEN – we are interested in using deep ensembles, stochastic weight averaging, or simulation-based inference to test whether any of these can produce better-calibrated uncertainties than our current approach; and c) improving the auto-cropping feature of GaMPEN – GaMPEN includes a Spatial Transformer Network to automatically crop input galaxies to an optimal size.

David van Dijk, Assistant Professor of Medicine and Computer Science |
Graph neural networks for brain dynamics and spatial genomics
Graph neural networks (or geometric deep learning) are revolutionizing machine learning and data science. They combine ideas from graph theory, geometry, topology, and deep learning to learn powerful non-linear models on graphical data. In the Van Dijk Lab we are developing several new types of graph neural networks, based on ideas from integral equations and self-attention models, and applying these to diverse biomedical applications. In one application, we are using graph neural networks to model spatiotemporal brain activity data, such as whole-cortex calcium imaging and fMRI recordings. In a second application, we are using graph neural networks to model spatial transcriptomic data – a new technology for the measurement of high-dimensional gene expression at the single-cell level with spatial resolution. Using our algorithm, we infer cell-cell interactions in measurements from kidney cancer and brain tissue of multiple sclerosis patients. In these projects, there is the opportunity to focus more on the algorithmic side or on the application, and you will work closely with postdocs and grad students in the lab.

I recommend attending even if you don’t need a project to work on. It’s a wonderful way to hear about exciting research going on in our community.