Workshop on Insights from Negative Results in NLP

May 5, 2023
(co-located with EACL)



Thursday, May 26, 2022

8:45–9:00 Opening remarks

9:00–10:00 Invited talk: Barbara Plank (LMU Munich)

Off the Beaten Track: On Serendipity and Turning “Failures” into Signal [Video]

In this talk, I’ll first reflect upon the research process in current NLP and discuss how the principle of serendipity can play an important role in the design of research projects. In the second part, I will provide a series of examples to illustrate how something perceived as “noise” can yield research opportunities. These include leveraging fortuitous data like meta-data for low-resource NLP and human disagreement in labelling. I will also present some puzzling results on an understudied BERT detail.

10:00–10:30 Thematic Session 1: Improving Evaluation Practices

  • Replicability under Near-Perfect Conditions – A Case-Study from Automatic Summarization
    Margot Mieskes [PDF], [Video]
  • On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets
    Hyounghun Kim, Aishwarya Padmakumar, Di Jin, Mohit Bansal and Dilek Hakkani-Tur [PDF], [Video]

10:30–11:30 Coffee Break

11:30–12:00 Thematic Session 2: Transformers

  • How Much Do Modifications to Transformer Language Models Affect Their Ability to Learn Linguistic Knowledge?
    Simeng Sun, Brian Dillon and Mohit Iyyer [PDF], [Video]
  • Pathologies of Pre-trained Language Models in Few-shot Fine-tuning
    Hanjie Chen, Guoqing Zheng, Ahmed Hassan Awadallah and Yangfeng Ji [PDF], [Video]
  • On Isotropy Calibration of Transformer Models
    Yue Ding, Karolis Martinkus, Damian Pascual, Simon Clematide and Roger Wattenhofer [PDF], [Video]

12:00–12:30 Thematic Session 3: Towards Better Data

  • Do Data-based Curricula Work?
    Maxim K. Surkov, Vladislav D. Mosin and Ivan P. Yamshchikov [PDF]
  • Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory
    Pedro Rodriguez, Phu Mon Htut, John P. Lalor and João Sedoc [PDF]
  • On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing
    Itsuki Okimura, Machel Reid, Makoto Kawano and Yutaka Matsuo [PDF], [Video]

12:30–14:00 Lunch

14:00–15:00 Panel Discussion: How Bad are Annotation Disagreements, Really? [Video]

Panelists: Margot Mieskes (University of Applied Sciences, Darmstadt), Barbara Plank (LMU Munich), Massimo Poesio (Queen Mary University of London), Bonnie Webber (University of Edinburgh)

Moderator: Anna Rogers (University of Copenhagen)

15:00–15:30 Coffee Break

15:30–16:00 Thematic Session 4: Linguistically Informed Analysis

  • Do Dependency Relations Help in the Task of Stance Detection?
    Alessandra Teresa Cignarella, Cristina Bosco and Paolo Rosso [PDF], [Video]
  • BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation
    Dipesh Kumar and Avijit Thawani [PDF], [Video]
  • Challenges in including extra-linguistic context in pre-trained language models
    Ionut Teodor Sorodoc, Laura Aina and Gemma Boleda [PDF], [Video]

16:00–17:00 Invited talk: Tal Linzen (NYU)

Sensitivity to Initial Weights in Out-of-distribution Generalization [Video]

The results of experiments that involve training neural networks can be sensitive to the networks’ initial weights. In this talk I will review work from my group and others that shows that such sensitivity can be quite dramatic when the network is evaluated on its out-of-distribution generalization accuracy, as is typically the case with the challenge datasets popular in the “interpretability” community. In one experiment, when we fine-tuned BERT 100 times on the same dataset, in-distribution test set accuracy was reasonably stable, but out-of-distribution behavior differed qualitatively across runs. The recent MultiBERTs project, where BERT was retrained 25 times, demonstrates that this variability persists across pretrained models as well. This variability makes it harder to interpret the results of a single fine-tuning run on a challenge dataset, and highlights a potentially underappreciated consequence of neural networks’ weak inductive biases.

17:00–18:00 Poster Session

  • Evaluating the Practical Utility of Confidence-score based Techniques for Unsupervised Open-world Classification
    Sopan Khosla, Rashmi Gangadharaiah [PDF], [Video]
  • Extending the Scope of Out-of-Domain: Examining QA models in multiple subdomains
    Chenyang Lyu, Jennifer Foster, Yvette Graham [PDF], [Video]
  • What Do You Get When You Cross Beam Search with Nucleus Sampling?
    Uri Shaham, Omer Levy [PDF], [Video]
  • Cross-lingual Inflection as a Data Augmentation Method for Parsing
    Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, David Vilares [PDF], [Video]
  • Is BERT Robust to Label Noise? A Study on Learning with Noisy Labels in Text Classification
    Dawei Zhu, Michael Hedderich, Fangzhou Zhai, David Adelani, Dietrich Klakow [PDF], [Video]
  • Ancestor-to-Creole Transfer is Not a Walk in the Park
    Heather Lent, Emanuele Bugliarello, Anders Søgaard [PDF]
  • What GPT Knows About Who is Who
    Xiaohan Yang, Eduardo Peynetti, Vasco Meerman, Chris Tanner [PDF], [Video]
  • Evaluating Biomedical Word Embeddings for Vocabulary Alignment at Scale in the UMLS Metathesaurus Using Siamese Networks
    Goonmeet Bajaj, Vinh Nguyen, Thilini Wijesiriwardene, Hong Yung Yip, Vishesh Javangula, Amit Sheth, Srinivasan Parthasarathy, Olivier Bodenreider [PDF]
  • Can Question Rewriting Help Conversational Question Answering?
    Etsuko Ishii, Yan Xu, Samuel Cahyawijaya, Bryan Wilie [PDF], [Video]
  • The Document Vectors Using Cosine Similarity Revisited
    Zhang Bingyu, Nikolay Arefyev [PDF], [Video]
  • Label Errors in BANKING77
    Cecilia Ying, Stephen Thomas [PDF], [Video]
  • An Empirical study to understand the Compositional Prowess of Neural Dialog Models
    Vinayshekhar Kumar, Vaibhav Kumar, Mukul Bhutani, Alexander Rudnicky [PDF], [Video]
  • Combining Extraction and Generation for Constructing Belief-Consequence Causal Links
    Maria Alexeeva, Allegra A. Beal Cohen, Mihai Surdeanu [PDF], [Video]
  • Pre-trained language models evaluating themselves - A comparative study
    Philipp Koch, Matthias Aßenmacher, Christian Heumann [PDF], [Video]

18:00–18:10 Closing Remarks