
BUCC 2022 : 15th Workshop on Building and Using Comparable Corpora with Shared Task on Multilingual Terminology Extraction from Comparable Corpora


When Jun 25, 2022 - Jun 25, 2022
Where Marseille
Submission Deadline Apr 10, 2022
Notification Due May 3, 2022
Final Version Due May 23, 2022
Categories    computational linguistics   corpus linguistics   comparable corpora   multilinguality

Call For Papers



Co-located with LREC 2022 (Marseille)

Saturday, June 25, 2022

Paper submission deadline: April 10, 2022

Workshop website:

Shared task website:

LREC website:



In the language engineering and the linguistics communities, research in
comparable corpora has been motivated by two main reasons. In language
engineering, on the one hand, it is primarily motivated by the need to
use comparable corpora as training data for statistical NLP applications
such as statistical and neural machine translation or cross-lingual
information retrieval. In linguistics, on the other hand, comparable
corpora are of interest because they enable cross-language discoveries
and comparisons. It is generally accepted in both communities that
comparable corpora consist of documents that are comparable in content
and form in various degrees and dimensions across several languages,
dialects, or varieties. Parallel corpora are at one end of this
spectrum, unrelated corpora at the other.


We solicit contributions on all topics related to comparable (and
parallel) corpora, including but not limited to the following:

Building Comparable Corpora:

* Automatic and semi-automatic methods
* Methods to mine parallel and non-parallel corpora from the web
* Tools and criteria to evaluate the comparability of corpora
* Parallel vs non-parallel corpora, monolingual corpora
* Rare and minority languages, across language families
* Multi-media/multi-modal comparable corpora

Applications of comparable corpora:

* Human translation
* Language learning
* Cross-language information retrieval & document categorization
* Bilingual and multilingual projections
* (Unsupervised) machine translation
* Writing assistance
* Machine learning techniques using comparable corpora

Mining from Comparable Corpora:

* Cross-language distributional semantics and pre-trained multilingual
transformer models
* Creation of bilingual and multilingual embeddings from comparable corpora
* Methods to derive parallel from non-parallel corpora (e.g. to provide
for low-resource languages in neural machine translation)
* Extraction of bilingual and multilingual translations of single words,
multi-word expressions, proper names, named entities, sentences, and
paraphrases from comparable corpora, etc.
* Induction of morphological, grammatical, and translation rules from
comparable corpora
* Induction of multilingual word classes from comparable corpora

Comparable Corpora in the Humanities:

* Comparing linguistic phenomena across languages in contrastive linguistics
* Analyzing properties of translated language in translation studies
* Studying language change over time in diachronic linguistics
* Assigning texts to authors via authors' corpora in forensic linguistics
* Comparing rhetorical features in discourse analysis
* Studying cultural differences in sociolinguistics
* Analyzing language universals in typological research


April 10, 2022: Paper submission deadline
May 3, 2022: Notification of acceptance
May 23, 2022: Camera-ready final papers
June 25, 2022: Workshop date

For updates see the workshop website at


Registration for the workshop will be via the main conference website at


Please follow the style sheet and templates provided for the main
conference at
Papers should be submitted as a PDF file using the START conference
manager at
Submissions must describe original and unpublished work and range from 4
to 8 pages plus unlimited references.

It is the authors' choice whether or not to reveal their identities in
their manuscripts submitted for review. Accepted papers will be
published in the workshop proceedings.

Double submission policy: Parallel submission to other meetings or
publications is possible, but the workshop organizers must be notified
immediately by e-mail.

For further information and updates see the BUCC 2022 website:

In case of questions, please contact Reinhard Rapp: reinhardrapp (at)
gmx (dot) de

***** BUCC 2022 SHARED TASK: bilingual term alignment in comparable
specialized corpora

The BUCC 2022 shared task is on multilingual terminology alignment in
comparable corpora. Many research groups are working on this problem
using a wide variety of approaches. However, as there is no standard
way to measure the performance of the systems, the published results are
not comparable and the pros and cons of the various approaches are not
clear. The shared task aims at solving these problems by organizing a
fair comparison of systems. This is accomplished by providing corpora
and evaluation datasets for a number of language pairs and domains.

Moreover, the importance of dealing with multi-word expressions in
Natural Language Processing applications has been recognized for a long
time. In particular, multi-word expressions pose serious challenges for
machine translation systems because of their syntactic and semantic
properties. Furthermore, multi-word expressions tend to be more
frequent in domain-specific text, hence the need to handle them in tasks
with specialized-domain corpora.

Through the 2022 BUCC shared task, we seek to evaluate methods that
detect pairs of terms that are translations of each other in two
comparable corpora, with an emphasis on multi-word terms in specialized

Sample and training data release: 11 February 2022
Test data release: 16 March 2022

For further details see the shared task website at


* Reinhard Rapp (Athena R.C., Greece; Magdeburg-Stendal University of
Applied Sciences and University of Mainz, Germany)
* Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay, France)
* Serge Sharoff (University of Leeds, United Kingdom)

Contact workshop: reinhardrapp (at) gmx (dot) de
Contact shared task: pz (at) lisn (dot) fr


* Ahmet Aker (University of Duisburg-Essen, Germany)
* Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences, Iran)
* Thierry Etchegoyhen (Vicomtech, Spain)
* Hitoshi Isahara (Otemon Gakuin University, Japan)
* Kyo Kageura (University of Tokyo, Japan)
* Natalie Kübler (CLILLAC-ARP, Université de Paris, France)
* Philippe Langlais (Université de Montréal, Canada)
* Yves Lepage (Waseda University, Japan)
* Michael Mohler (Language Computer Corporation, USA)
* Emmanuel Morin (Université de Nantes, France)
* Dragos Stefan Munteanu (RWS, USA)
* Reinhard Rapp (Athena R.C., Greece; Magdeburg-Stendal University of
Applied Sciences and University of Mainz, Germany)
* Nasredine Semmar (CEA LIST, Paris, France)
* Serge Sharoff (University of Leeds, UK)
* Richard Sproat (OGI School of Science & Technology, USA)
* Ted Pedersen (University of Minnesota, Duluth, USA)
* Pierre Zweigenbaum (LISN, CNRS, Université Paris-Saclay, Orsay, France)
