CMLC 2023 : 11th Workshop on the Challenges in the Management of Large Corpora


When Jul 2, 2023 - Jul 2, 2023
Where Lancaster, UK
Submission Deadline Apr 27, 2023
Notification Due May 11, 2023
Final Version Due Jun 4, 2023
Categories    NLP   computational linguistics   linguistics

Call For Papers

11th Workshop on the Challenges in the Management of Large Corpora (CMLC)
The next meeting of CMLC will be held as part of Corpus Linguistics 2023 in Lancaster, UK, on the 2nd of July, 2023.

See for up-to-date information.

Important dates
Deadline for abstract submission: the 27th of April 2023 (Thursday, 23:59 UTC)
Notification of acceptance: the 11th of May 2023 (Thursday)
Deadline for the submission of camera-ready papers: the 4th of June 2023 (Sunday)
Meeting: Sunday, the 2nd of July 2023 (room/hour TBA)
Abstract submission
We invite anonymised extended abstracts for oral presentations on the topics listed below. Abstracts should be 1000-1500 words long, excluding references, ideally prepared with the ACL-2023 templates, or else submitted as PDF with a preferred font size of 11 pt and line spacing of 1.5.
CMLC has always reserved a track for national corpus project reports, and to this end, we invite poster proposals of 500-750 words. National project reports need not be anonymised.
Submissions are accepted through the EasyChair submission system.

Please note that each CMLC event produces a volume of proceedings (published in Open Access before the meeting), where both oral and poster contributions have equal status. All final submissions to the 2023 proceedings volume will be expected to be formatted according to the ACLPUB guidelines and to pass the aclpubcheck.

Workshop description
The upcoming CMLC meeting continues the successful series of “Challenges in the management of large corpora” events, previously hosted at LREC (since 2012) and CL (since 2015) conferences. As in the previous meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, and data science.

Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A number of key themes and questions of interest to the contributing research communities emerge: (a) what can be done to deal with IPR and data protection issues? (b) what sampling techniques can we apply? (c) what quality issues should we be aware of? (d) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) what affordances do visualisation techniques offer for exploratory approaches to corpus analysis? (f) what kinds of APIs or other means of access would make the corpus data as widely usable as possible without interfering with legal restrictions? (g) how can we guarantee that corpus data remain available and usable in a sustainable way?

Motivation and topics of interest
This year’s event will cover the entire range of the standard CMLC themes, with some new additions:

New and hot topics
Language Models
What linguistic insights can we gain by post-hoc language model analysis in the age of ChatGPT?
How can we avoid the proliferation of stereotypes in terms of both linguistic surface form and content when using language models for linguistic analysis?
Societal and legal issues relevant for corpora and studies
political and sociological balance
social media bubbles, hate speech and fake news
proliferation of stereotypes via corpora and language models
corpora as archives of the past: evolution in mentalities or laws, personality rights
How to make corpora as accessible as possible despite big data issues, application heterogeneity, and IPR issues
What are the most interesting APIs and libraries to build, analyse and access very large corpora?
How can we encourage researchers to use existing research tools, infrastructures, libraries and APIs in research and teaching?
Linguistic content challenges
Dealing with the variety of language resources: multilinguality, historical texts, noisy OCR texts, user-generated content, etc.
Integration of human computation (crowdsourcing) and automatic annotation
Quality management of annotations
Technical challenges
Storage and retrieval solutions for big textual data corpora: primary data, metadata, and annotation data
Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks for language processing
Dealing with streaming (e.g. social media) and rapidly changing underlying data
Exploitation challenges
Legal and privacy issues
Query languages, data models, and standardisation
Licensing models of open and closed data, coping with intellectual property restrictions
Innovative approaches for aggregation and visualisation of text analytics
In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster along with a short presentation.

Programme Committee: TBA
Names will be added as Programme Committee members confirm their participation.

Organising Committee
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen

Berlin-Brandenburg Academy of Sciences
Adrien Barbaresi

Institute of Computational Linguistics, University of Zurich
Simon Clematide

CMLC series homepage is located at

