Fair and adequate evaluations and comparisons are of fundamental importance to the NLP community for properly tracking progress, especially in the current deep learning era, in which new state-of-the-art results are reported at ever shorter intervals. This concerns designing adequate metrics for evaluating performance on high-level text generation tasks such as question and dialogue generation, summarization, machine translation, image captioning, and poetry generation; properly evaluating word and sentence embeddings; and rigorously determining whether, and under which conditions, one system is better than another.
Particular topics of interest of the workshop include (but are not limited to):
- Novel individual evaluation metrics, particularly for text generation
- with desirable properties, e.g., (i) high correlation with human judgments; (ii) the ability to distinguish high-quality outputs from mediocre or low-quality ones; (iii) robustness across input and output sequence lengths; (iv) speed; etc.
- reference-free evaluation metrics, defined in terms of the source text(s) and system predictions only;
- cross-domain metrics that can reliably and robustly measure the quality of system outputs across heterogeneous modalities (e.g., image and speech), genres (e.g., newspapers, Wikipedia articles, and scientific papers), languages, and time periods;
- supervised, unsupervised, and semi-supervised metrics
- Designing adequate evaluation methodology, particularly for text generation
- statistics for the trustworthiness of results, via appropriate significance tests;
- comparing score distributions instead of single-point estimates;
- comprehensive and fair comparisons;
- comprehensive and unbiased error analyses and case studies, avoiding cherry-picking and sampling bias;
- guidelines for adequate evaluation methodology
- appropriate forms of human annotation, e.g., Likert-scale ratings, rankings, preferences, bandit feedback, etc.
- methodologies for human evaluation
- validation of metrics against human evaluations
- Creating adequate and correct evaluation data, particularly for text generation
- coverage of phenomena, representativeness/balance/distribution with respect to the task, etc.;
- size of corpora, variability among data sources, eras, genres, etc.;
- internal consistency of annotations;
- agreement levels for annotation metrics;
- cost-effective manual evaluations with good inter-annotator agreements (e.g., crowdsourcing);
- introspection and elimination of biases in the annotated data, e.g., via probing and adversarial attacks
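To make the significance-testing and score-distribution topics above concrete, here is a minimal sketch of a paired bootstrap comparison between two systems, assuming per-example metric scores are already available (the function name and the data are hypothetical, for illustration only):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Estimate how often system A outscores system B under resampling.

    scores_a and scores_b hold one metric score per test example, aligned
    by index. Each bootstrap round resamples the test set with replacement
    and compares the two systems' mean scores on the same resampled set,
    so the comparison is over score distributions rather than a single
    point estimate.
    """
    assert len(scores_a) == len(scores_b), "scores must be paired"
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples  # fraction of resamples where A beats B
```

A value close to 1.0 (or 0.0) suggests the difference between the systems is stable under resampling; values near 0.5 indicate the single-point comparison is not trustworthy.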
| Name | Affiliation |
|---|---|
| Ido Dagan | Bar-Ilan University, Israel |
| Ani Nenkova | University of Pennsylvania, USA |
| Robert West | École polytechnique fédérale de Lausanne (EPFL), Switzerland |
| Mohit Bansal | University of North Carolina (UNC) Chapel Hill, USA |
| Steffen Eger | Technische Universität Darmstadt, Germany |
| Yang Gao | Royal Holloway, University of London, UK |
| Maxime Peyrard | École polytechnique fédérale de Lausanne (EPFL), Switzerland |
| Wei Zhao | Technische Universität Darmstadt, Germany |
| Eduard Hovy | Carnegie Mellon University, USA |
Friday, Nov 20 (CET time zone)
Invited Speaker: Goran Glavas
Title: Pitfalls in Evaluation of Multilingual Text Representations [slides] [video]
Abstract: Multilingual representation spaces, spanned by multilingual word embeddings or massively multilingual transformers, conceptually enable modeling of meaning across a wide range of languages and language transfer of task-specific NLP models from resource-rich to resource-lean languages. It is not yet clear, however, to what extent this conceptual promise holds in practice. Recent models, both cross-lingual word embedding models and multilingual transformers, have been praised for being able to induce multilingual representation spaces without any explicit supervision (i.e., without any word-level alignments or parallel corpora). In this talk, I will point to some prominent shortcomings and pitfalls of existing evaluations of multilingual representation spaces, which mask the limitations of state-of-the-art multilingual representation models. Remedying some of these evaluation shortcomings portrays the meaning representation and language transfer capabilities of current state-of-the-art multilingual representation models in a less favorable light.
Invited Speaker: Ido Dagan
Title: Think out of the box! Conceptions and misconceptions in NLP evaluation [slides]
Abstract: Continuity is an important principle in empirical evaluation; otherwise -- how can we compare performance along time and measure progress? Yet, in some cases, we may reconsider common evaluation practices and examine whether they deserve an extension or a change, due to being either limiting or misleading. In this talk I’ll describe four such recent experiences regarding data collection and evaluation practices: a controlled crowdsourcing methodology for more challenging data collection tasks, revisiting an old “same length” assumption for reference and system summaries, extending summarization evaluation to the interactive summarization setting, and examining the practice of considering singleton annotations in cross-document coreference evaluation.
Invited Speaker: Asli Celikyilmaz
Title: Meta-Evaluation of Text Generation
Abstract: Automatic text generation enables computers to summarize text, describe pictures to the visually impaired, write stories or articles about an event, hold customer-service conversations, chit-chat with individuals, and more. Recent advancements in deep learning have yielded tremendous improvements in many text generation tasks. While these models have dramatically improved the state of text generation, even state-of-the-art neural text generation models still face many challenges: a lack of diversity in generated text, commonsense violations in depicted situations, and difficulties in determining whether the generated text is coherent, factual, or makes any sense at all. In this talk I will briefly discuss current evaluation metrics for text generation, focusing on factuality evaluation. Then I will talk about our recent work on building standards for evaluating different types of factuality metrics, and conclude with our findings and avenues for future research in this direction.
Best paper awards, incl. short presentations
Invited Speaker: William Wang
Title: Rethinking NLG Evaluation: From Metric Learning, Sample Variance, to Fact-Checking [slides] [video]
Abstract: Existing metrics for natural language generation evaluation focus on counting lexical overlaps between generated sentences and references. However, it is well known that this evaluation paradigm has major issues in evaluating long stories, the stability of result reporting, and catching semantic correctness and faithfulness. In this talk, I will discuss some of our recent attempted solutions to mitigate these issues. Specifically, I will introduce (1) AREL, a new paradigm for joint metric learning and policy optimization in visual storytelling; (2) our in-depth study into the understanding of sample variance in visually-grounded language generation problems; (3) a new logicNLP task that requires logical reasoning to assess the faithfulness of generation results.
Rank 1: Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance, Xi Chen, Nan Ding, Tomer Levinboim and Radu Soricut [paper] [video]
Rank 2: Are Some Words Worth More than Others? Shiran Dudy and Steven Bedrick [paper] [video]
Rank 3: Fill in the BLANC: Human-free quality estimation of document summaries, Oleg Vasilyev, Vedant Dharnidharka and John Bohannon [paper] [video]