During my dissertation I investigated the relative benefits of multi-task neural network architectures over single-task architectures for automated speaker profiling. I argued that multi-task architectures are both more performant than and theoretically preferable to single-task architectures, in that they programmatically implement the sociolinguistic principle of intersectionality and thus better reflect current sociolinguistic theory of how social identity is performed linguistically. The speaker profiling tasks in question focused on predicting five social traits (sex, ethnicity, age, region, and education) from features extracted from snippets of conversational speech.
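The core architectural idea can be illustrated with a minimal shared-trunk sketch (written here in PyTorch, with placeholder feature dimensions and class counts rather than the exact configuration used in the dissertation): a single encoder produces a shared representation, and five task-specific heads predict the traits jointly, so each trait's gradients help shape the representation the other traits use.

```python
import torch
import torch.nn as nn

class MultiTaskProfiler(nn.Module):
    """Shared-trunk multi-task network: one encoder feeds five
    task-specific heads, one per speaker trait."""

    def __init__(self, n_features=512, hidden=256):
        super().__init__()
        # Shared representation learned jointly across all traits
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One classification head per trait (class counts are placeholders)
        self.heads = nn.ModuleDict({
            "sex": nn.Linear(hidden, 2),
            "ethnicity": nn.Linear(hidden, 4),
            "age": nn.Linear(hidden, 5),
            "region": nn.Linear(hidden, 6),
            "education": nn.Linear(hidden, 3),
        })

    def forward(self, x):
        shared = self.trunk(x)
        return {trait: head(shared) for trait, head in self.heads.items()}

# Joint training: the total loss sums the per-trait cross-entropy losses,
# so every task contributes gradients to the shared trunk.
model = MultiTaskProfiler()
x = torch.randn(8, 512)  # a batch of speech-derived feature vectors
outputs = model(x)
dummy_targets = torch.zeros(8, dtype=torch.long)  # placeholder labels
loss = sum(nn.functional.cross_entropy(logits, dummy_targets)
           for logits in outputs.values())
loss.backward()
```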
The key findings from this work are that multi-task models consistently outperform single-task models, that models are most accurate when given information from all three linguistic levels (acoustic, phonetic, and lexical), and that lexical features as a group contribute substantially more predictive power than either phonetic or acoustic features.
The multi-task neural network models trained in the course of this work outperform all previous automated speaker profiling systems trained on conversational speech data in terms of predictive accuracy for sex, ethnicity, and education.
During the summer and fall of 2019 I was part of a team in the Innovation Lab Unit of the International Monetary Fund working to develop a system to provide early warning indicators for impending financial crises across the globe.
The system we created was based on sentiment indices constructed from historical news corpora. These indices tracked the frequency over time of specific semantic/emotional term clusters (e.g. "fear" language, "risk" language, "hedge" language), which we found tend to spike in the news media of the affected country or region in the run-up to a financial crisis.
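The index construction itself is simple to sketch. The toy version below (with hypothetical term clusters standing in for the lexicons actually used in the Working Paper) measures, per period, the share of a news corpus's tokens that fall into each cluster; a sustained spike in, say, the "fear" index for a country's press is the kind of signal the early warning system watches for.

```python
from collections import Counter, defaultdict

# Hypothetical term clusters; the lexicons used in the Working Paper differ.
CLUSTERS = {
    "fear": {"fear", "panic", "afraid", "worried"},
    "risk": {"risk", "risky", "exposure", "vulnerable"},
    "hedge": {"hedge", "hedging", "uncertain", "perhaps"},
}

def sentiment_indices(articles):
    """articles: iterable of (period, text) pairs, e.g. ("2008-09", "...").
    Returns {cluster: {period: relative frequency}}: the share of each
    period's tokens that belong to the cluster."""
    cluster_counts = defaultdict(Counter)
    token_totals = Counter()
    for period, text in articles:
        tokens = text.lower().split()
        token_totals[period] += len(tokens)
        for name, terms in CLUSTERS.items():
            cluster_counts[name][period] += sum(1 for t in tokens if t in terms)
    return {name: {p: c / token_totals[p] for p, c in by_period.items()}
            for name, by_period in cluster_counts.items()}

demo = [("2008-09", "panic spreads as banks face risk and fear of collapse")]
print(sentiment_indices(demo))
```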
More details regarding this early warning system and our findings can be found in our 2019 IMF Working Paper: News-based Sentiment Indicators.
Up-to-date knowledge of street terms for high-risk, rapidly evolving recreational drugs is essential for healthcare professionals working with addicted and at-risk populations. Unfortunately, the time lag between the advent of a new term and its recognition by public health researchers is often measured in months, if not years.
From 2016 to 2017 I worked with a research team at the Center for Advanced Study of Language (CASL), in collaboration with the Center for Substance Abuse Research (CESAR), to develop automated methods for detecting novel drug terminology in social media discourse. During that time I developed a system to crawl continuous social media streams and draw on vector space semantics to search for unknown terms that fit the contextual linguistic profile of known drug terminology.
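The detection step can be sketched roughly as a centroid-similarity search (a simplified illustration, assuming term vectors have already been learned from the crawled posts, e.g. with word2vec; the production system's features and thresholds differed): terms whose contexts place them close to the centroid of the known drug lexicon are surfaced as candidate slang.

```python
import numpy as np

def candidate_drug_terms(embeddings, known_terms, top_k=20):
    """embeddings: dict mapping each vocabulary term (including terms newly
    harvested from the social media stream) to a context vector learned
    from the crawled posts. Ranks out-of-lexicon terms by cosine
    similarity to the centroid of the known drug terms' vectors."""
    known_vecs = np.array([embeddings[t] for t in known_terms if t in embeddings])
    centroid = known_vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    scored = []
    for term, vec in embeddings.items():
        if term in known_terms:
            continue  # only score terms outside the known lexicon
        sim = float(np.dot(vec, centroid) / np.linalg.norm(vec))
        scored.append((term, sim))
    return sorted(scored, key=lambda kv: -kv[1])[:top_k]
```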
Results from our experiments with this social media crawling system indicate that such an approach can identify previously unknown drug slang with a high degree of accuracy and cut the lag time between term introduction and detection to weeks, if not days. An article detailing this system and our initial results was published in 2018 in the Journal of Medical Internet Research.
From 2012 through 2018 I (along with many others) was part of the team that conducted the endangered language research presented in both the Catalogue of Endangered Languages (ELCat) and the Endangered Languages Project (ELP). An exciting and encouraging result of our work on the Catalogue is that our data do not support the oft-cited claim that a language dies roughly every two weeks. Rather, our data suggest that the current extinction rate is closer to one language every three months.
To hear more about this revised language extinction rate and other findings, please check out our 2018 publication from Taylor & Francis: Cataloguing the Endangered Languages of the World. Look also for the chapter that Anna Belew and I contributed to the 2018 Oxford Handbook of Endangered Languages, which draws heavily on the data we collected for the Catalogue.