Workshop in NLP Evaluation and Comparison (Eval4NLP)

Call For Papers

The 1st Workshop on Evaluation and Comparison for NLP Systems (Eval4NLP), co-located with the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), invites papers of a theoretical or experimental nature describing recent advances in system evaluation and comparison in NLP, particularly for text generation.

Important Dates (Tentative)

All deadlines are 11:59 PM GMT-12 (anywhere-on-earth time).

Anonymity period begins:	July 15, 2020
Deadline for submission:	August 28, 2020
Retraction of workshop papers accepted for EMNLP (main conference):	September 15, 2020
Notification of acceptance:	~~September 29, 2020~~ October 5, 2020
Deadline for camera-ready version:	~~October 10, 2020~~ October 12, 2020
Delivery of Workshop proceedings to Publications Committee:	October 17, 2020
Workshop Date:	November 20, 2020

Submission Guidelines:

Authors should submit a long paper of up to 8 pages, with unlimited pages for references, or a short paper of up to 4 pages, with unlimited pages for references, following the EMNLP 2020 formatting requirements. The reported research should be substantially original. Accepted papers will be presented as posters. Reviewing will be double-blind, and thus no author information should be included in the papers; self-reference that identifies the authors should be avoided or anonymized. Accepted papers will appear in the workshop proceedings. The submission site is here: https://www.softconf.com/emnlp2020/nlpevaluation2020/

Camera-ready Papers:

Final versions of long papers will be given one additional page of content (up to 9 pages). Short papers will be given 5 content pages. Authors are encouraged to use this additional page to address reviewers’ comments.

Best Paper Awards:

Thanks to our generous sponsors, we will award three prizes ($400, $200 and $100) to the three best paper submissions, as nominated by our program committee. Both long and short submissions will be eligible for prizes.
