
CoCo4MT 2023: The Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT)


Link: https://sites.google.com/view/coco4mt
 
When: Sep 4, 2023 - Sep 8, 2023
Where: Macau SAR, China
Submission Deadline: Jul 5, 2023
Notification Due: Jul 20, 2023
Final Version Due: Jul 31, 2023
 

Call For Papers

The Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) @MT-SUMMIT XIX
The 19th Machine Translation Summit
Sep 4-8, 2023, Macau SAR, China
https://sites.google.com/view/coco4mt

SCOPE

It is well known that machine translation systems, especially those that use deep learning, require massive amounts of data. For many languages, several kinds of resources are not available in human-created form. The types of resources that are available include monolingual and multilingual corpora, translation memories, and lexicons. Parallel resources are generally created for formal purposes, such as parliamentary collections, while monolingual resources tend to come from more informal settings. Resources created for formal purposes, including corpora, are generally of higher quality and more abundant than those created for informal purposes. Additionally, corpora for low-resource languages (languages with fewer digital resources available) tend to be less abundant and of lower quality.

CoCo4MT is a workshop centered on research into manual and automatic corpus creation, cleansing, and augmentation techniques specifically for machine translation. We accept work that covers any language (including sign language), but we are specifically interested in submissions that explicitly report on work with languages with limited existing resources (low-resource languages). Since techniques from high-resource languages are generally statistical in nature and could be used as generic solutions for any language, we also welcome submissions on high-resource languages.

CoCo4MT aims to encourage research on new and undiscovered techniques. We hope that the methods presented at this workshop will lead to the development of high-quality corpora that are publicly available for download and that, in turn, lead to high-performing MT systems. We also hope that this will encourage the creation of new datasets for multiple languages, so that the workshop becomes a general venue to consult for corpus needs in the future. The workshop’s success will be measured by the following key performance indicators:

- Promotes the ongoing increase in quality of machine translation systems as measured by standard metrics,
- Provides a meeting place for collaboration from several research areas to increase the availability of commonly used corpora and new corpora,
- Drives innovation to address the need for higher quality and abundance of low-resource language data.

Topics of interest include:

- Difficulties with using existing corpora (e.g., political considerations or domain limitations) and their effects on final MT systems,
- Strategies for collecting new MT datasets (e.g., via crowdsourcing),
- Data augmentation techniques,
- Data cleansing and denoising techniques,
- Quality control strategies for MT data,
- Exploration of datasets for pretraining or auxiliary tasks for training MT systems.


SHARED TASK

To encourage research on corpus construction for low-resource machine translation, we introduce a shared task focused on identifying high-quality instances that should be translated into a target low-resource language. Participants are given access to multi-way corpora in the high-resource languages English, Spanish, German, Korean, and Indonesian. Using these, they are required to identify beneficial instances that, when translated into the low-resource languages Cebuano, Gujarati, and Burmese, lead to high-performing MT systems. More details on data, evaluation, and submission can be found on the website (https://sites.google.com/view/coco4mt) or by emailing coco4mt-shared-task@googlegroups.com.

SUBMISSION INFORMATION

CoCo4MT will accept research, review, or position papers. Each paper should be at least four (4) and at most ten (10) pages long, plus unlimited pages for references. Submissions should be formatted according to the official MT Summit 2023 style templates (https://www.overleaf.com/latex/templates/mt-summit-2023-template/knrrcnxhkqxd). Accepted papers will be published in the MT Summit 2023 proceedings, which are included in the ACL Anthology, and will be presented at the conference either orally or as a poster.

Submissions must be anonymized and should be made to the workshop using the Softconf conference management system (https://softconf.com/mtsummit2023/CoCo4MT). Scientific papers that have been or will be submitted to other venues must be declared as such, and must be withdrawn from the other venues if accepted and published at CoCo4MT. The review will be double-blind.

We would like to encourage authors to cite papers written in ANY language that are related to the topics, as long as both original bibliographic items and their corresponding English translations are provided.

Registration will be handled by the main conference. (To be announced)

IMPORTANT DATES

May 18, 2023 - Call for papers released
May 19, 2023 - Shared task release of train, dev and test data
May 25, 2023 - Shared task release of baselines
June 5, 2023 - Second call for papers
June 20, 2023 - Third and final call for papers
July 5, 2023 - Paper submissions due
July 5, 2023 - Shared task deadline to submit results
July 20, 2023 - Notification of acceptance
July 20, 2023 - Shared task system description papers due
July 31, 2023 - Camera-ready due
September 4-5, 2023 - CoCo4MT workshop

CONTACT

CoCo4MT Workshop Organizers:
coco4mt-2023-organizers@googlegroups.com

CoCo4MT Shared Task Organizers:
coco4mt-shared-task@googlegroups.com

ORGANIZING COMMITTEE (listed alphabetically)

Ananya Ganesh University of Colorado Boulder
Constantine Lignos Brandeis University
John E. Ortega Northeastern University
Jonne Sälevä Brandeis University
Katharina Kann University of Colorado Boulder
Marine Carpuat University of Maryland
Rodolfo Zevallos Universitat Pompeu Fabra
Shabnam Tafreshi University of Maryland
William Chen Carnegie Mellon University

PROGRAM COMMITTEE (listed alphabetically, tentative)

Abteen Ebrahimi University of Colorado Boulder
Adelani David Saarland University
Ananya Ganesh University of Colorado Boulder
Alberto Poncelas ADAPT Centre at Dublin City University
Anna Currey Amazon
Amirhossein Tebbifakhr University of Trento
Atul Kr. Ojha National University of Ireland Galway
Ayush Singh Northeastern University
Barry Haddow University of Edinburgh
Bharathi Raja Chakravarthi National University of Ireland Galway
Beatrice Savoldi University of Trento
Bogdan Babych Heidelberg University
Constantine Lignos Brandeis University
Dossou Bonaventure Mila Quebec AI Institute
Duygu Ataman New York University
Eleftheria Briakou University of Maryland
Eleni Metheniti Université Toulouse - Paul Sabatier
Jasper Kyle Catapang University of Birmingham
John E. Ortega Northeastern University
Jonne Sälevä Brandeis University
Kalika Bali Microsoft
Katharina Kann University of Colorado Boulder
Kochiro Watanabe The University of Tokyo
Koel Dutta Chowdhury Saarland University
Liangyou Li Huawei
Manuel Mager University of Stuttgart
Maria Art Antonette Clariño University of the Philippines Los Baños
Marine Carpuat University of Maryland
Mathias Müller University of Zurich
Nathaniel Oco De La Salle University
Patrick Simianer Lilt
Rico Sennrich University of Zurich
Rodolfo Zevallos Universitat Pompeu Fabra
Sangjee Dondrub Qinghai Normal University
Santanu Pal Saarland University
Sardana Ivanova University of Helsinki
Shantipriya Parida Silo AI
Shiran Dudy Northeastern University
Surafel Melaku Lakew Amazon
Tommi A Pirinen University of Tromsø
Valentin Malykh Moscow Institute of Physics and Technology
Xing Niu Amazon
Xu Weijia University of Maryland
