Here are some examples of impactful papers that showcased negative results, methodological flaws, or misinterpretations of previous results.
A promising approach that turns out to be not-so-promising.
The superior performance of a previously proposed model or component could not be reproduced because it stemmed from random initialization, a pipeline component, the training data, or some other factor rather than from the model architecture itself.
A previously reported phenomenon or architectural success fails to generalize to other tasks or datasets.
A prior success on an NLP task was attributed to the wrong kind of mechanism.
If you know of more papers that should be in this list, please let us know!