BabyLM Challenge 2023: CfP BabyLM, shared task hosted at CoNLL/CMCL 2023

When	Jan 1, 2023 - Sep 1, 2023
Where	CMCL+CONLL
Submission Deadline	TBD

Categories	machine learning, natural language processing, computational linguistics, pretraining

Call For Papers

Announcing the BabyLM Challenge, the shared task at CoNLL/CMCL 2023!

The goal of this shared task is to encourage researchers with an interest in pretraining and/or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining—which is typically thought to be practical only for large industry groups—by formulating an exciting open problem and establishing a community around it.

Considerable effort has gone into optimizing LM pretraining at massive scales over the last several years. While ever-larger models often get the most attention, datasets have also grown by orders of magnitude. For example, Chinchilla is exposed to 1.4 trillion words during training, well over 10,000 words for every one word a 13-year-old human has encountered in their entire life.
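The ratio quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the human figure of 100 million words is an assumption chosen to match the larger pre-released dataset described in this call, not a number the call states for a 13-year-old, and the function name is hypothetical.

```python
# Back-of-the-envelope check of the data gap between a large LM and a human learner.

CHINCHILLA_WORDS = 1_400_000_000_000  # 1.4 trillion words, as stated in this call
HUMAN_WORDS_BY_13 = 100_000_000       # ASSUMPTION: rough stand-in, equal to the 100M-word dataset size


def words_per_human_word(model_words: int, human_words: int) -> float:
    """Return how many words the model sees for each word the human sees."""
    return model_words / human_words


ratio = words_per_human_word(CHINCHILLA_WORDS, HUMAN_WORDS_BY_13)
print(f"{ratio:,.0f} model words per human word")
```

Under that assumption the ratio comes out above the 10,000 threshold mentioned in the text; a smaller estimate of human exposure would make the gap larger still.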

Focusing on scaled-down pretraining has several potential benefits. First, small-scale pretraining can be a sandbox for developing novel techniques for improving data efficiency. These techniques could then be scaled up to the sizes commonly seen in applied NLP or used to enhance current approaches to modeling low-resource languages. Second, improving our ability to train LMs on the same kinds and quantities of data that humans learn from will hopefully give us greater access to plausible cognitive models of humans and help us understand what allows humans to acquire language so efficiently.

The task has three tracks. Two of them restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to exploring approaches such as architectural variations, self-supervised objectives, and/or curriculum learning. The final track restricts only the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome). We will release a shared evaluation pipeline covering a variety of benchmarks and tasks, including targeted syntactic evaluations and natural language understanding.
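For participants assembling or subsampling their own text, a quick word count against the 10M and 100M limits can be sketched as follows. This is a minimal sketch, not the organizers' method: the call does not say how words are counted, so whitespace splitting is an assumption here, and the track labels "small" and "large" are hypothetical names used only for illustration.

```python
# Rough check of a corpus against a track's word budget.
# ASSUMPTION: words are counted by whitespace splitting; the official rule may differ.

# Hypothetical labels for the two fixed-size budgets named in the call.
TRACK_BUDGETS = {
    "small": 10_000_000,   # 10M words
    "large": 100_000_000,  # 100M words
}


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)


def within_budget(lines, track):
    """Return True if the corpus does not exceed the chosen track's budget."""
    return count_words(lines) <= TRACK_BUDGETS[track]


sample = ["the cat sat on the mat", "where is the ball ?"]
print(count_words(sample), within_budget(sample, "small"))
```

Because `count_words` accepts any iterable of strings, an open file handle can be passed directly, so a large corpus is read line by line rather than loaded into memory at once.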

Important dates:

January 2023: Training data released (see website for download)

March 2023: Evaluation pipeline released

July 15, 2023: Results due

August 1, 2023: Paper submissions due

Date TBA: Presentation at CoNLL

For more information, visit the BabyLM website https://babylm.github.io/ or consult our extended call for papers.
