The submission deadline has been extended to August 28th. Dual submissions to the EMNLP main conference and our workshop will be considered (please withdraw the workshop paper by September 15th if it is accepted at the main conference). Note also that non-archival submissions will be considered, but authors should inform us by email, as there is no dedicated option on the submission page.
Fair and adequate evaluations and comparisons are of fundamental importance to the NLP community for properly tracking progress, especially in the current deep learning revolution, in which new state-of-the-art results are reported at ever shorter intervals. This concerns, among other things, designing adequate metrics for evaluating performance in high-level text generation tasks such as question and dialogue generation, summarization, machine translation, image captioning, and poetry generation; properly evaluating word and sentence embeddings; and rigorously determining whether and under which conditions one system is better than another.
Topics of particular interest to the workshop include (but are not limited to):
- Novel individual evaluation metrics, particularly for text generation
  - metrics with desirable properties, e.g., (i) high correlation with human judgments; (ii) the ability to distinguish high-quality outputs from mediocre or low-quality outputs; (iii) robustness across lengths of input and output sequences; (iv) speed
  - reference-free evaluation metrics, defined in terms of the source text(s) and system predictions only
  - cross-domain metrics that can reliably and robustly measure the quality of system outputs from heterogeneous modalities (e.g., image and speech), different genres (e.g., newspapers, Wikipedia articles, and scientific papers), different languages, and different time periods
  - supervised, unsupervised, and semi-supervised metrics
- Designing adequate evaluation methodology, particularly for text generation
  - statistics for the trustworthiness of results, via appropriate significance tests
  - comparing score distributions instead of single-point estimates
  - comprehensive and fair comparisons
  - comprehensive and unbiased error analyses and case studies, avoiding cherry-picking and sampling bias
  - guidelines for adequate evaluation methodology
  - appropriate forms of human annotations, e.g., Likert scale ratings, rankings, preferences, and bandit feedback
  - methodologies for human evaluation
  - validation of metrics against human evaluations
- Creating adequate and correct evaluation data, particularly for text generation
  - coverage of phenomena, and representativeness, balance, and distribution with respect to the task
  - size of corpora and variability among data sources, eras, and genres
  - internal consistency of annotations
  - agreement levels for annotation metrics
  - cost-effective manual evaluations with good inter-annotator agreement (e.g., crowdsourcing)
  - introspection and elimination of biases in the annotated data, e.g., via probing and adversarial attacks

| Name | Affiliation |
| --- | --- |
| Ido Dagan | Bar-Ilan University, Israel |
| Ani Nenkova | University of Pennsylvania, USA |
| Robert West | École polytechnique fédérale de Lausanne (EPFL), Switzerland |
| Mohit Bansal | University of North Carolina (UNC) Chapel Hill, USA |
| Steffen Eger | Technische Universität Darmstadt, Germany |
| Yang Gao | Royal Holloway, University of London, UK |
| Maxime Peyrard | École polytechnique fédérale de Lausanne (EPFL), Switzerland |
| Wei Zhao | Technische Universität Darmstadt, Germany |
| Eduard Hovy | Carnegie Mellon University, USA |