Call for Papers

EMNLP 2020 Workshop

Evaluation and Comparison of NLP Systems

Email: evaluation.nlp.workshop2020@gmail.com — Twitter: @NLPEvaluation


Fair evaluations and comparisons are of fundamental importance to the NLP community for properly tracking progress, especially amid the current deep learning revolution, in which new state-of-the-art results are reported at ever shorter intervals. This concerns designing adequate metrics for evaluating performance in high-level generation tasks such as question answering, dialogue, summarization, machine translation, image captioning, and poetry generation; properly evaluating word and sentence embeddings; and rigorously determining whether, and under which conditions, one system is better than another.

Particular topics of interest of the workshop include (but are not limited to):

  • Novel individual evaluation metrics, particularly for text generation
    • with desirable properties, e.g., (i) high correlation with human judgments; (ii) the ability to distinguish high-quality outputs from mediocre or low-quality ones; (iii) robustness across input and output sequence lengths; (iv) speed; etc.
    • reference-free evaluation metrics, defined in terms of the source text(s) and system predictions only; 
    • cross-domain metrics that can reliably and robustly measure the quality of system outputs from heterogeneous modalities (e.g., image and speech), different genres (e.g., newspapers, Wikipedia articles, and scientific papers), different languages, and different time periods;
    • supervised, unsupervised, and semi-supervised metrics
  • Designing adequate evaluation methodology, particularly for text generation
    • statistics for the trustworthiness of results, via appropriate significance tests;
    • comparing score distributions instead of single-point estimates;
    • reproducibility;
    • comprehensive and fair comparisons;
    • comprehensive and unbiased error analyses and case studies, avoiding cherry-picking and sampling bias;
    • guidelines for adequate evaluation methodology
  • Creating adequate and correct evaluation data, particularly for text generation (ensuring and reporting:)
    • coverage of phenomena, representativeness/balance/distribution with respect to the task, etc.;
    • size of corpora, variability among data sources, eras, genres, etc.;
    • internal consistency of annotations;
    • agreement levels for annotation metrics;
    • system evaluation using appropriate human annotations;
    • cost-effective manual evaluations with good inter-annotator agreement (e.g., via crowdsourcing);
    • introspection and elimination of biases in the annotated data, e.g., via probing and adversarial attacks
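One of the methodology topics above, significance testing over score distributions rather than single-point estimates, can be sketched as a paired bootstrap test. This is a minimal illustration, not a method prescribed by the workshop; the function name and the per-item scores in the usage example are hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Estimate the probability that system A's mean-score advantage
    over system B arises by chance, by resampling test items with
    replacement and counting how often the advantage vanishes."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = sum(scores_a) / n - sum(scores_b) / n
    reversals = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        # count resamples where A's advantage disappears or reverses
        if delta <= 0:
            reversals += 1
    return observed, reversals / n_resamples
```

Given per-item metric scores for two systems on a shared test set (hypothetical numbers), `paired_bootstrap(scores_a, scores_b)` returns the observed mean difference and an approximate p-value; a small p-value suggests the difference is unlikely to be a sampling artifact. Pairing by test item, rather than bootstrapping each system independently, keeps the comparison fair when item difficulty varies.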



Ido Dagan Bar-Ilan University, Israel
William Wang University of California, Santa Barbara, USA
Goran Glavas University of Mannheim, Germany


Steering Committee

Ido Dagan Bar-Ilan University, Israel
Ani Nenkova University of Pennsylvania, USA
Robert West École polytechnique fédérale de Lausanne (EPFL), Switzerland
Mohit Bansal University of North Carolina (UNC) Chapel Hill, USA

Organizing Committee

Eduard Hovy Carnegie Mellon University, USA
Steffen Eger Technische Universität Darmstadt, Germany
Yang Gao Royal Holloway, University of London, UK
Maxime Peyrard École polytechnique fédérale de Lausanne (EPFL), Switzerland
Wei Zhao Technische Universität Darmstadt, Germany