The 1st Workshop on Evaluation and Comparison of NLP Systems (Eval4NLP), co-located with the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), invites papers of a theoretical or experimental nature describing recent advances in system evaluation and comparison in NLP, particularly for text generation.
All deadlines are 11:59 PM GMT-12 (anywhere-on-earth time).
Anonymity period begins: July 15, 2020
Deadline for submission: August 28, 2020
Retraction of workshop papers accepted for EMNLP (main conference): September 15, 2020
Notification of acceptance: September 29, 2020
Deadline for camera-ready version: October 10, 2020
Delivery of workshop proceedings to Publications Committee: October 17, 2020
Workshop date: November 20, 2020
Authors should submit either a long paper of up to 8 pages or a short paper of up to 4 pages, each with up to 2 additional pages for references, following the EMNLP 2020 formatting requirements. The reported research should be substantially original. Reviewing will be double-blind: papers should include no author information, and self-references that identify the authors should be avoided or anonymized. Accepted papers will be presented as posters and will appear in the workshop proceedings. The submission site is here: https://www.softconf.com/emnlp2020/nlpevaluation2020/
Thanks to our generous sponsors, we will award three prizes ($400, $200, and $100) to the three best paper submissions, as nominated by our program committee. Both long and short submissions are eligible for prizes.
Mathur, N., Baldwin, T., & Cohn, T. (2020). Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics. In ACL.
Sellam, T., Das, D., & Parikh, A. (2020). BLEURT: Learning Robust Metrics for Text Generation. In ACL.
Zhao, W., Glavaš, G., Peyrard, M., Gao, Y., West, R., & Eger, S. (2020). On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation. In ACL.
Gao, Y., Zhao, W., & Eger, S. (2020). SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization. In ACL.
Zhao, W., Peyrard, M., Liu, F., Gao, Y., Meyer, C. M., & Eger, S. (2019). MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In EMNLP-IJCNLP.
Peyrard, M. (2019). Studying Summarization Evaluation Metrics in the Appropriate Scoring Range. In ACL.
Clark, E., Celikyilmaz, A., & Smith, N. A. (2019). Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts. In ACL.
Sun, S., & Nenkova, A. (2019). The Feasibility of Embedding Based Automatic Evaluation for Single Document Summarization. In EMNLP-IJCNLP.
Louis, A., & Nenkova, A. (2013). Automatically Assessing Machine Summary Content Without a Gold Standard. Computational Linguistics.
Dodge, J., Gururangan, S., Card, D., Schwartz, R., & Smith, N. A. (2019). Show Your Work: Improved Reporting of Experimental Results. In EMNLP-IJCNLP.
Gorman, K., & Bedrick, S. (2019). We Need to Talk About Standard Splits. In ACL.
Reimers, N., & Gurevych, I. (2017). Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In EMNLP.
Eger, S., Rücklé, A., & Gurevych, I. (2019). Pitfalls in the Evaluation of Sentence Embeddings. In RepL4NLP.
Van der Lee, C., Gatt, A., Van Miltenburg, E., Wubben, S., & Krahmer, E. (2019). Best Practices for the Human Evaluation of Automatically Generated Text. In INLG.
Reiter, E. (2018). A Structured Review of the Validity of BLEU. Computational Linguistics.
Glavaš, G., Litschko, R., Ruder, S., & Vulić, I. (2019). How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. In ACL.