Accepted Papers

Workshop Papers:

A Survey on Recognizing Textual Entailment as an NLP Evaluation, Adam Poliak [paper] [video]

Are Some Words Worth More than Others? Shiran Dudy and Steven Bedrick [paper] [video]

Artemis: A Novel Annotation Methodology for Indicative Single Document Summarization, Rahul Jha, Keping Bi, Yang Li, Mahdi Pakdaman, Asli Celikyilmaz, Ivan Zhiboedov and Kieran McDonald [paper] [video]

Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation, Neslihan Iskender, Tim Polzehl and Sebastian Möller [paper] [video]

BLEU Neighbors: A Reference-less Approach to Automatic Evaluation, Kawin Ethayarajh and Dorsa Sadigh [paper] [video]

ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation, Hanna Wecker, Annemarie Friedrich and Heike Adel [paper] [video]

Evaluating Word Embeddings on Low-Resource Languages, Nathan Stringham and Mike Izbicki [paper] [video]

Fill in the BLANC: Human-free quality estimation of document summaries, Oleg Vasilyev, Vedant Dharnidharka and John Bohannon [paper] [video]

Grammaticality and Language Modelling, Jingcheng Niu and Gerald Penn [paper] [video]

Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance, Xi Chen, Nan Ding, Tomer Levinboim and Radu Soricut [paper] [video]

Item Response Theory for Efficient Human Evaluation of Chatbots, João Sedoc and Lyle Ungar [paper] [video]

On Aligning OpenIE Extractions with Knowledge Bases: A Case Study, Kiril Gashteovski, Rainer Gemulla, Bhushan Kotnis, Sven Hertling and Christian Meilicke [paper] [video]

On the Evaluation of Machine Translation n-best Lists, Jacob Bremerman, Huda Khayrallah, Douglas Oard and Matt Post [paper] [video]

One of these words is not like the other: a reproduction of outlier identification using non-contextual word representations, Jesper Brink Andersen, Mikkel Bak Bertelsen, Mikkel Hørby Schou, Manuel R. Ciosici and Ira Assent [paper] [video]

Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models, Reda Yacouby and Dustin Axman [paper] [video]

Truth or Error? Towards systematic analysis of factual errors in abstractive summaries, Klaus-Michael Lux, Maya Sappelli and Martha Larson [paper] [video]

ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT, Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui and Kyomin Jung [paper] [video]

Findings Papers:

A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing, Rachel Bawden, Biao Zhang, Lisa Yankovskaya, Andre Tättar, Matt Post [paper] [video]

Automatically Identifying Gender Issues in Machine Translation using Perturbations, Hila Gonen, Kellie Webster [paper] [video]

An Evaluation Method for Diachronic Word Sense Induction, Ashjan Alsulaimani, Erwan Moreau, Carl Vogel [paper] [video]

KoBE: Knowledge-Based Machine Translation Evaluation, Zorik Gekhman, Roee Aharoni, Genady Beryozkin, Markus Freitag, Wolfgang Macherey [paper] [video]

CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems, Yiran Chen, Pengfei Liu, Ming Zhong, Zi-Yi Dou, Danqing Wang, Xipeng Qiu, Xuanjing Huang [paper] [video]

#Turki$hTweets: A Benchmark Dataset for Turkish Text Correction, Asiye Tuba Köksal, Özge Bozal, Emre Yürekli, Gizem Gezici [paper] [video]

Multichannel Generative Language Model: Learning All Possible Factorizations Within and Across Channels, Harris Chan [paper] [video]

GRUEN for Evaluating Linguistic Quality of Generated Text, Wanzheng Zhu, Suma Bhat [paper] [video]

Non-archival Papers:

An Open-Source Library for Using and Developing Summarization Evaluation Metrics, Daniel Deutsch and Dan Roth

Evaluating the Evaluation of Diversity in Natural Language Generation, Guy Tevet and Jonathan Berant