
BUCC 2026 : 19th Workshop on Building and Using Comparable Corpora


Link: https://comparable.lisn.upsaclay.fr/bucc2026/
 
When: May 11, 2026
Where: Palma de Mallorca
Submission Deadline: Feb 28, 2026
Notification Due: Mar 22, 2026
Final Version Due: Mar 29, 2026
Categories: NLP, computational linguistics, corpus linguistics, comparable corpora
 

Call For Papers

**************************************************************

19th WORKSHOP ON BUILDING AND USING COMPARABLE CORPORA

Co-located with LREC 2026, Palma de Mallorca (in-person & online)

May 11, 2026

Paper submission deadline: February 28, 2026

Workshop website: https://comparable.lisn.upsaclay.fr/bucc2026/

Main conference website: https://lrec2026.info/

**************************************************************

MOTIVATION

In the language engineering and linguistics communities, research
on comparable corpora has had two main motivations. In language
engineering, it is chiefly motivated by the need to use comparable
corpora as training data for data-driven NLP applications such as
statistical and neural machine translation or cross-lingual
retrieval. In linguistics, comparable corpora are of interest because
they enable cross-language discoveries and comparisons. It is
generally accepted in both communities that comparable corpora consist
of documents that are comparable in content and form to various
degrees and along various dimensions across several languages.
Parallel corpora are at one end of this spectrum, and unrelated
corpora are at the other. Increasingly, these resources are not only
collected, but also augmented or even created synthetically, which
raises new questions about how to define and measure comparability.

In recent years, the use of comparable corpora for pre-training Large
Language Models (LLMs) has led to the impressive multilingual and
cross-lingual abilities of these models, which are relevant to a range
of applications, including information retrieval, machine translation,
and cross-lingual text classification. The linguistic definitions and
observations related to comparable corpora are crucial for improving
methods to mine such corpora, for assessing and documenting synthetic
data, and for improving the cross-lingual transfer of LLMs. It is
therefore of great interest to bring together builders and users of
such corpora.


PANEL DISCUSSION

The panel will discuss the impact of synthetic data on comparable corpora
research and address fundamental questions about how LLMs transform our
understanding and use of multilingual data.


TOPICS

We solicit contributions on all topics related to comparable (and parallel)
corpora, including but not limited to the following:

Building Comparable Corpora
- Automatic and semi-automatic methods, including generating
comparable corpora using LLMs
- Methods to mine parallel and non-parallel corpora from the web
- Tools and criteria to evaluate the comparability of corpora
- Parallel vs non-parallel corpora, monolingual corpora
- Rare and minority languages, within and across language families
- Multi-media/multi-modal comparable corpora

Synthetic Data for Comparable Corpora
- LLM generation of comparable/parallel data
- Improving comparability of synthetic data
- Incidental bilingualism & pre-training use of comparable data
- Comparability & cross-lingual consistency
- Detection & attribution of synthetic vs. human text
- English-centric effects & fairness across languages/scripts
- Evaluation & reproducibility for downstream tasks

Applications of Comparable Corpora
- Human translation
- Language learning
- Cross-language information retrieval & document categorization
- Bilingual and multilingual projections
- (Unsupervised) machine translation
- Writing assistance
- Machine learning techniques using comparable corpora

Mining from Comparable Corpora
- Cross-language distributional semantics, word embeddings and
pre-trained multilingual transformer models
- Extraction of parallel segments or paraphrases from comparable corpora
- Methods to derive parallel from non-parallel corpora (e.g. to provide
for low-resource languages in neural machine translation)
- Extraction of bilingual and multilingual translations of single words,
multi-word expressions, proper names, named entities, sentences,
paraphrases, etc. from comparable corpora
- Induction of morphological, grammatical, and translation rules from
comparable corpora
- Induction of multilingual word classes from comparable corpora

Comparable Corpora in the Humanities
- Comparing linguistic phenomena across languages in contrastive linguistics
- Analyzing properties of translated language in translation studies
- Studying language change over time in diachronic linguistics
- Assigning texts to authors via authors' corpora in forensic linguistics
- Comparing rhetorical features in discourse analysis
- Studying cultural differences in sociolinguistics
- Analyzing language universals in typological research


IMPORTANT DATES

28 Feb 2026: Paper Submission deadline
22 Mar 2026: Notification of acceptance
29 Mar 2026: Camera-ready final papers
14 Apr 2026: Workshop Programme final version
11 May 2026: Workshop date

All deadlines are 11:59PM UTC-12:00 (“anywhere on earth”).
For updates to the schedule, please see the workshop website.


PRACTICAL INFORMATION

The workshop is a hybrid event, held both in person and online. Registration is handled through the main conference registration site: https://lrec2026.info/

The workshop proceedings will be published in the ACL Anthology (https://aclanthology.org/).


SUBMISSION GUIDELINES

Please follow the style sheet and templates (for LaTeX, Overleaf and MS-Word) provided for the main conference at https://lrec2026.info/authors-kit/

Papers should be submitted as a PDF file using the START conference manager at https://softconf.com/lrec2026/BUCC2026/

Submissions must describe original and unpublished work and be 4 to 8 pages long, plus unlimited references. Reviewing will be double-blind, so papers must not reveal the authors' identity. Accepted papers will be published in the workshop proceedings.

Double submission policy: Parallel submission to other meetings or publications is possible, but the workshop organizers must be notified by e-mail immediately upon submission to another venue.

For further information and updates, please see the BUCC 2026 web page at https://comparable.lisn.upsaclay.fr/bucc2026/.


WORKSHOP ORGANIZERS

- Reinhard Rapp (University of Mainz, Germany)
- Ayla Rigouts Terryn (Université de Montréal & Mila, Canada)
- Serge Sharoff (University of Leeds, United Kingdom)
- Pierre Zweigenbaum (Université Paris-Saclay, CNRS, France)

Contact: reinhardrapp (at) gmx (dot) de


PROGRAMME COMMITTEE

- Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences, Iran)
- Eleftherios Avramidis (DFKI, Germany)
- Gabriel Bernier-Colborne (National Research Council, Canada)
- Kenneth Church (VecML.com, USA)
- Patrick Drouin (Université de Montréal, Canada)
- Alex Fraser (Technical University of Munich, Germany)
- Natalia Grabar (CNRS, University of Lille, France)
- Amal Haddad Haddad (Universidad de Granada, Spain)
- Kyo Kageura (University of Tokyo, Japan)
- Natalie Kübler (Université Paris Cité, France)
- Philippe Langlais (Université de Montréal, Canada)
- Yves Lepage (Waseda University, Japan)
- Shervin Malmasi (Amazon, USA)
- Michael Mohler (Language Computer Corporation, USA)
- Emmanuel Morin (Nantes Université, France)
- Dragos Stefan Munteanu (RWS, USA)
- Preslav Nakov (Mohamed bin Zayed University of AI, United Arab Emirates)
- Ted Pedersen (University of Minnesota, Duluth, USA)
- Reinhard Rapp (University of Mainz, Germany)
- Ayla Rigouts Terryn (Université de Montréal & Mila, Canada)
- Nasredine Semmar (CEA LIST, Paris, France)
- Serge Sharoff (University of Leeds, UK)
- Richard Sproat (Sakana.ai, Tokyo, Japan)
- Marko Tadić (University of Zagreb, Croatia)
- François Yvon (CNRS & Sorbonne Université, France)
- Pierre Zweigenbaum (Université Paris-Saclay, CNRS, France)


INFORMATION ABOUT THE LRE 2026 MAP AND THE "SHARE YOUR LRs!" INITIATIVE

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of the research.
Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and replicability of experiments (including evaluation ones).

