The key bottleneck in deploying ML systems today is labeling training data. My research focuses on weak supervision: the idea of using higher-level, noisier input from domain experts to train complex state-of-the-art models. On the applications side, I’m interested in text relation extraction and image classification, particularly for biomedical applications, which I focus on as a Bio-X SIG Fellow. I’m very fortunate to work with Chris Ré and many other talented people in the Hazy, Info, StatsML, and DAWN labs.
Snorkel is a new system for quickly and cheaply generating training data based on user-provided labeling functions, which encode weak supervision signals like heuristics, patterns, and distant supervision sources. Snorkel automatically synthesizes and models these signals using our data programming approach, and is currently focused on training models for structured data extraction from text, PDFs, images, and more. Check out more tutorials and blog posts at snorkel.stanford.edu!
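To give a rough feel for what labeling functions look like, here is a minimal sketch in plain Python. It does not use Snorkel's actual API: the function names, the `KNOWN_PAIRS` set, and the example sentences are all hypothetical, and a simple majority vote stands in for the generative model that data programming would learn over the labeling function outputs.

```python
import re
from collections import Counter

# Label values are an assumption made for this sketch.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_causes_pattern(sentence):
    """Pattern heuristic: the verb 'cause(s)' hints at a causal relation."""
    return POSITIVE if re.search(r"\bcauses?\b", sentence, re.I) else ABSTAIN

def lf_negation(sentence):
    """Heuristic: negation cues hint that no relation is asserted."""
    return NEGATIVE if re.search(r"\b(not|no evidence)\b", sentence, re.I) else ABSTAIN

# Hypothetical stand-in for a distant supervision source (e.g. a knowledge base).
KNOWN_PAIRS = {("aspirin", "bleeding")}

def lf_distant_supervision(sentence):
    """Distant supervision: vote positive if a known pair co-occurs."""
    text = sentence.lower()
    hit = any(a in text and b in text for a, b in KNOWN_PAIRS)
    return POSITIVE if hit else ABSTAIN

LABELING_FUNCTIONS = [lf_causes_pattern, lf_negation, lf_distant_supervision]

def majority_vote(sentence):
    """Combine noisy votes by unweighted majority; abstain on ties or no votes."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

if __name__ == "__main__":
    for s in ["Aspirin causes bleeding in some patients.",
              "There is no evidence that coffee is linked to cancer.",
              "The weather was nice."]:
        print(majority_vote(s), s)
```

An unweighted vote treats every labeling function as equally reliable. The point of modeling the signals instead is to estimate how accurate each one is without ground-truth labels, so that better functions count for more.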
Selected highlights in bold.
A Kernel Theory of Modern Data Augmentation. Tri Dao, Albert Gu, Alex Ratner, Virginia Smith, Christopher De Sa, Christopher Ré.
Snorkel: Rapid Training Data Creation with Weak Supervision. Alex Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré. VLDB 2018. [Blog] [Project] [Poster] [Coverage: O'Reilly, EETimes, InfoWorld]
Cross-Modal Data Programming for Medical Images. Nishith Khandwala, Alex Ratner, Jared Dunnmon, Roger Goldman, Matt Lungren, Daniel Rubin, Christopher Ré. NIPS ML4H Workshop 2017.
Learning to Compose Domain-Specific Transformations for Data Augmentation. Alex Ratner*, Henry Ehrenberg*, Zeshan Hussain, Jared Dunnmon, Christopher Ré. NIPS 2017. [Blog] [Project] [Video] [Poster]
Learning the Structure of Generative Models without Labeled Data. Stephen Bach, Bryan He, Alex Ratner, Christopher Ré. ICML 2017. [Blog] [Tutorial]
DeepDive: Declarative Knowledge Base Construction. Ce Zhang, Christopher Ré, Michael Cafarella, Christopher De Sa, Alex Ratner, Jaeho Shin, Feiran Wang, Sen Wu. Communications of the ACM 2017.
Snorkel: Fast Training Set Generation for Information Extraction. Alex Ratner, Stephen Bach, Henry Ehrenberg, Christopher Ré. SIGMOD Demo 2017. [Project]
Snorkel: A System for Lightweight Extraction. Alex Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré. CIDR Abstract 2017.
Data Programming: Creating Large Training Sets, Quickly. Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré. NIPS 2016. [Blog] [Video] [Poster]
A Machine-Compiled Database of Genome-Wide Association Studies. Volodymyr Kuleshov, Braden Hancock, Alex Ratner, Christopher Ré, Serafim Batzoglou, Michael Snyder. NIPS ML4H Workshop 2016. [Poster]
Data Programming with DDLite: Putting Humans in a Different Part of the Loop. Henry Ehrenberg, Jaeho Shin, Alex Ratner, Jason Fries, Christopher Ré. HILDA @ SIGMOD 2016.
DeepDive: Declarative Knowledge Base Construction. Christopher De Sa, Alex Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang. ACM SIGMOD Record 2016.
[12/9/2017] Excited to be starting a workshop on weak supervision at NIPS 2017: Learning from Limited Labeled Data: Weak Supervision and Beyond.
[9/26/2017] Speaking about Data Programming + Snorkel at Strata Data Conference in NYC.
[7/12/2017] New blog post on weak supervision - send us your feedback!
[7/10/2017] Version 0.6 of Snorkel has been released!
[6/8/2017] Talking about data programming + Snorkel on the O'Reilly Data Show Podcast.