Workshop on Insights from Negative Results in NLP

Why negative results?

Publication of negative results is difficult in most fields, but in NLP the problem is exacerbated by the near-universal focus on improvements in benchmarks. This situation implicitly discourages hypothesis-driven research, and it turns the creation and fine-tuning of NLP models into an art rather than a science. Furthermore, it increases the time, effort, and carbon emissions spent on developing and tuning models, because researchers have no opportunity to learn what has already been tried and failed.

This workshop invites unexpected or negative results, both practical and theoretical, that have important implications for future research, highlight methodological issues with existing approaches, and/or point out pervasive misunderstandings or bad practices. In particular, the most successful NLP models currently rely on different kinds of pretrained meaning representations (from word embeddings to Transformer-based models like BERT and GPT-3). To complement all the success stories, it would be insightful to see where and possibly why they fail. Any NLP task is welcome: sequence labeling, question answering, inference, dialogue, machine translation, you name it.

A successful negative results paper would contribute one of the following:

broadly applicable recommendations for training/fine-tuning, especially if X that didn’t work is something that many practitioners would think reasonable to try, and if the demonstration of X’s failure is accompanied by some explanation/hypothesis;
ablation studies of components in previously proposed models, showing that their contributions are different from what was initially reported;
datasets or probing tasks showing that previous approaches do not generalize to other domains or language phenomena;
trivial baselines that work suspiciously well for a given task/dataset;
cross-lingual studies showing that a technique X is only successful for a certain language or language family;
experiments on the (in)stability of previously published results due to hardware, random initializations, preprocessing pipeline components, etc.;
theoretical arguments and/or proofs for why X should not be expected to work;
demonstration of issues with data processing/collection/annotation pipelines, especially if they are widely used;
demonstration of issues with evaluation metrics (e.g. accuracy, F1 or BLEU), which prevent their use for fair comparison of methods.
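
As a small illustration of two of the items above (trivial baselines that work suspiciously well, and issues with evaluation metrics), here is a minimal sketch in Python. The label distribution is invented for the example and does not come from any real dataset; the function names `majority_baseline` and `macro_f1` are hypothetical helpers, not part of any library.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * len(test_labels)


def accuracy(predictions, gold):
    """Fraction of items where the prediction matches the gold label."""
    return sum(p == y for p, y in zip(predictions, gold)) / len(gold)


def macro_f1(predictions, gold):
    """Unweighted mean of per-label F1; a label with no true positives scores 0."""
    scores = []
    for label in sorted(set(gold) | set(predictions)):
        pairs = list(zip(predictions, gold))
        tp = sum(p == label and y == label for p, y in pairs)
        fp = sum(p == label and y != label for p, y in pairs)
        fn = sum(p != label and y == label for p, y in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up label distribution: 90% "neg", 10% "pos".
train = ["neg"] * 90 + ["pos"] * 10
test = ["neg"] * 18 + ["pos"] * 2

preds = majority_baseline(train, test)
print(f"accuracy: {accuracy(preds, test):.3f}")
print(f"macro-F1: {macro_f1(preds, test):.3f}")
```

On this toy split the baseline never predicts the minority label, yet its accuracy is 0.9, while macro-F1 is roughly 0.47. A gap of this kind is one simple sign that a metric may not support a fair comparison of methods on a given dataset.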

The Workshop on Insights from Negative Results in NLP invites short papers as well as non-archival abstract submissions for papers published elsewhere (e.g. in one of the main conferences or in non-NLP venues). Our goal is to provide not only a publication venue, but also an opportunity to discuss the most urgent methodological issues and to think about where the field is going.

Thank you to all those who are sponsoring Insights Workshop 2024.