Fair and adequate evaluations and comparisons are of fundamental importance to the NLP community for properly tracking progress, especially in the current deep learning era, in which new state-of-the-art results are reported at ever shorter intervals. This concerns designing adequate metrics for evaluating performance on high-level text generation tasks such as question and dialogue generation, summarization, machine translation, image captioning, and poetry generation; properly evaluating word and sentence embeddings; and rigorously determining whether, and under which conditions, one system is better than another.
Particular topics of interest of the workshop include (but are not limited to):
- Novel individual evaluation metrics, particularly for text generation
- with desirable properties, e.g., (i) high correlation with human judgments; (ii) the ability to distinguish high-quality outputs from mediocre or low-quality ones; (iii) robustness across input and output sequence lengths; (iv) speed; etc.
- reference-free evaluation metrics, defined in terms of the source text(s) and system predictions only;
- cross-domain metrics that can reliably and robustly measure the quality of system outputs across heterogeneous modalities (e.g., image and speech), genres (e.g., newspapers, Wikipedia articles, and scientific papers), languages, and time periods;
- supervised, unsupervised, and semi-supervised metrics
- Designing adequate evaluation methodology, particularly for text generation
- statistics for the trustworthiness of results, via appropriate significance tests;
- comparing score distributions instead of single-point estimates;
- comprehensive and fair comparisons;
- comprehensive and unbiased error analyses and case studies, avoiding cherry-picking and sampling bias;
- guidelines for adequate evaluation methodology
- appropriate forms of human annotation, e.g., Likert-scale ratings, rankings, preferences, bandit feedback, etc.
- methodologies for human evaluation
- validation of metrics against human evaluations
- Creating adequate and correct evaluation data, particularly for text generation
- coverage of phenomena, representativeness/balance/distribution with respect to the task, etc.;
- size of corpora, variability among data sources, eras, genres, etc.;
- internal consistency of annotations;
- agreement levels for annotation metrics;
- cost-effective manual evaluations with good inter-annotator agreements (e.g., crowdsourcing);
- introspection and elimination of biases in the annotated data, e.g., via probing and adversarial attacks
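To make the significance-testing and score-distribution topics above concrete, here is a minimal sketch of a paired bootstrap comparison between two systems, assuming per-example metric scores are already available (the function name and the data are hypothetical, for illustration only):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Estimate how often system A outscores system B under resampling.

    scores_a and scores_b hold one metric score per test example, aligned
    by index. Each bootstrap round resamples the test set with replacement
    and compares the two systems' mean scores on the same resampled set,
    so the comparison is over score distributions rather than a single
    point estimate.
    """
    assert len(scores_a) == len(scores_b), "scores must be paired"
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples  # fraction of resamples where A beats B
```

A value close to 1.0 (or 0.0) suggests the difference between the systems is stable under resampling; values near 0.5 indicate the single-point comparison is not trustworthy.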
| Name | Affiliation |
|---|---|
| Ido Dagan | Bar-Ilan University, Israel |
| Ani Nenkova | University of Pennsylvania, USA |
| Robert West | École polytechnique fédérale de Lausanne (EPFL), Switzerland |
| Mohit Bansal | University of North Carolina (UNC) Chapel Hill, USA |
| Steffen Eger | Technische Universität Darmstadt, Germany |
| Yang Gao | Royal Holloway, University of London, UK |
| Maxime Peyrard | École polytechnique fédérale de Lausanne (EPFL), Switzerland |
| Wei Zhao | Technische Universität Darmstadt, Germany |
| Eduard Hovy | Carnegie Mellon University, USA |
Friday, Nov 20 (CET time zone)
Invited Speaker: Goran Glavas
Title: Pitfalls in Evaluation of Multilingual Text Representations [slides] [video]
Abstract: Multilingual representation spaces, spanned by multilingual word embeddings or massively multilingual transformers, conceptually enable modeling of meaning across a wide range of languages and language transfer of task-specific NLP models from resource-rich to resource-lean languages. It is not yet clear, however, to what extent this conceptual promise holds in practice. Recent models, both cross-lingual word embedding models and multilingual transformers, have been praised for being able to induce multilingual representation spaces without any explicit supervision (i.e., without any word-level alignments or parallel corpora). In this talk, I will point to some prominent shortcomings and pitfalls of existing evaluations of multilingual representation spaces, which mask the limitations of state-of-the-art multilingual representation models. Remedying some of these evaluation shortcomings portrays the meaning representation and language transfer capabilities of current state-of-the-art multilingual representation models in a less favorable light.
Invited Speaker: Ido Dagan
Title: Think out of the box! Conceptions and misconceptions in NLP evaluation [slides]
Abstract: Continuity is an important principle in empirical evaluation; otherwise -- how can we compare performance along time and measure progress? Yet, in some cases, we may reconsider common evaluation practices and examine whether they deserve an extension or a change, due to being either limiting or misleading. In this talk I’ll describe four such recent experiences regarding data collection and evaluation practices: a controlled crowdsourcing methodology for more challenging data collection tasks, revisiting an old “same length” assumption for reference and system summaries, extending summarization evaluation to the interactive summarization setting, and examining the practice of considering singleton annotations in cross-document coreference evaluation.
Invited Speaker: Asli Celikyilmaz
Title: Meta-Evaluation of Text Generation
Abstract: Automatic text generation enables computers to summarize text, describe pictures to the visually impaired, write stories or articles about an event, hold customer-service conversations, chit-chat with individuals, and more. Recent advancements in deep learning have yielded tremendous improvements in many text generation tasks. While these models have dramatically improved the state of text generation, even state-of-the-art neural text generation models still face many challenges: a lack of diversity in generated text, commonsense violations in depicted situations, and difficulties in determining whether the generated text is coherent, factual, or makes any sense at all. In this talk I will briefly discuss current evaluation metrics for text generation, focusing on factuality evaluation. Then I will talk about our recent work on building standards for evaluating different types of factuality metrics, and conclude with our findings and avenues for future research in this direction.
Best paper awards, incl. short presentations
Invited Speaker: William Wang
Title: Rethinking NLG Evaluation: From Metric Learning, Sample Variance, to Fact-Checking [slides] [video]
Abstract: Existing metrics for natural language generation evaluation focus on counting lexical overlaps between generated sentences and references. However, it is well known that this evaluation paradigm has major issues in evaluating long stories, the stability of result reporting, and catching semantic correctness and faithfulness. In this talk, I will discuss some of our recent attempted solutions to mitigate these issues. Specifically, I will introduce (1) AREL, a new paradigm for joint metric learning and policy optimization in visual storytelling; (2) our in-depth study into the understanding of sample variance in visually-grounded language generation problems; (3) a new logicNLP task that requires logical reasoning to assess the faithfulness of generation results.
Rank 1: Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance, Xi Chen, Nan Ding, Tomer Levinboim and Radu Soricut [paper] [video]
Rank 2: Are Some Words Worth More than Others? Shiran Dudy and Steven Bedrick [paper] [video]
Rank 3: Fill in the BLANC: Human-free quality estimation of document summaries, Oleg Vasilyev, Vedant Dharnidharka and John Bohannon [paper] [video]