Insightful paper examples

Here are some examples of impactful papers that showcase negative results, flaws in methodology, misinterpretations of previous results, etc.

A promising approach that turns out to be not-so-promising.
- G. Lapesa and S. Evert, “Large-scale evaluation of dependency-based DSMs: Are they worth the effort?,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017, pp. 394–400. [PDF]
- M. Karpinska, B. Li, A. Rogers, and A. Drozd, “Subcharacter Information in Japanese Embeddings: When Is It Worth It?,” in Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, Melbourne, Australia, 2018, pp. 28–37. [PDF]
- Pham, T. H., Macháček, D., & Bojar, O. (2019). Promoting the Knowledge of Source Syntax in Transformer NMT Is Not Needed. Computación y Sistemas, 23(3), Article 3. [PDF]
The superior performance of a previously proposed model/component could not be reproduced because it is attributable to random initialization, a pipeline component, training data, or another element that is not the actual model architecture.
- M. Crane, “Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results,” Transactions of the Association for Computational Linguistics, vol. 6, pp. 241–252, 2018. [PDF]
- A. Kabbach, C. Ribeyre, and A. Herbelot, “Butterfly Effects in Frame Semantic Parsing: impact of data processing on model ranking,” in Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3158–3169. [PDF]
- Musgrave, K., Belongie, S., & Lim, S.-N. (2020). A Metric Learning Reality Check. arXiv:2003.08505 [cs]. [PDF]
A previously reported phenomenon or architectural success fails to generalize to other tasks/datasets.
- M. Yatskar, “A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 2318–2323. [PDF]
- A. Gladkova, A. Drozd, and S. Matsuoka, “Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t,” in Proceedings of the NAACL-HLT SRW, San Diego, California, June 12-17, 2016, pp. 47–54. [PDF]
A prior success on an NLP task was attributed to the wrong kind of mechanism.
- T. McCoy, E. Pavlick, and T. Linzen, “Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 3428–3448. [PDF]
- R. Jia and P. Liang, “Adversarial Examples for Evaluating Reading Comprehension Systems,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 2017, pp. 2021–2031. [PDF]
- N. S. Moosavi and M. Strube, “Lexical Features in Coreference Resolution: To be Used With Caution,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, 2017, pp. 14–19. [PDF]
- T. Niven and H.-Y. Kao, “Probing Neural Network Comprehension of Natural Language Arguments,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4658–4664. [PDF]
- Linzen, T. (2016). Issues in evaluating semantic spaces using word analogies. Proceedings of the First Workshop on Evaluating Vector Space Representations for NLP. [PDF]
- A. Rogers, A. Drozd, and B. Li, “The (Too Many) Problems of Analogical Reasoning with Word Vectors,” in Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), 2017, pp. 135–148. [PDF]
Trivial baselines bring into question the workings of state-of-the-art systems.
- Kondratyuk, D., Cardenas, R., & Bojar, O. (2019). Replacing Linguists with Dummies: A Serious Need for Trivial Baselines in Multi-Task Neural Machine Translation. The Prague Bulletin of Mathematical Linguistics, 113(1), 31–40. [PDF]
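- Thomason, J., Gordon, D., & Bisk, Y. (2019). Shifting the Baseline: Single Modality Performance on Visual Navigation & QA. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 1977–1983. [PDF]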
- Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018). Hypothesis Only Baselines in Natural Language Inference. Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (*SEM 2018), 180–191. [PDF]

If you know of more papers that should be in this list, please let us know!