
BabyLM 2024 : BabyLM Challenge 2024


Link: https://babylm.github.io/
 
When May 24, 2024 - Aug 20, 2024
Where Competition
Submission Deadline TBD
Categories    NLP   ML   CV
 

Call For Papers

Announcing the BabyLM Challenge 2024!



The goal of this shared task is to encourage researchers with an interest in pretraining and/or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining—which is typically thought to be practical only for large industry groups—by formulating an exciting open problem and establishing a community around it.



A huge effort has gone into optimizing LM pretraining at massive scales over the last several years. While increasingly large models often get the most attention, datasets have also grown by orders of magnitude. For example, Chinchilla is exposed to 1.4 trillion words during training, well over 10,000 words for every one word a 13-year-old human has encountered in their entire life.
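
To put that comparison in concrete terms, here is a rough back-of-the-envelope calculation in Python. The child exposure figure of roughly 100 million words (the same order of magnitude as the challenge's 100M-word tracks) is an assumption for illustration only, not an official number:

    # Rough back-of-the-envelope comparison; the child exposure figure is an assumed
    # order-of-magnitude estimate, not an official figure from the challenge.
    chinchilla_words = 1.4e12      # words Chinchilla is exposed to during training
    child_words_by_13 = 1.0e8      # ~100M words encountered by roughly age 13 (assumption)
    ratio = chinchilla_words / child_words_by_13
    print(f"Chinchilla sees ~{ratio:,.0f}x more words")   # ~14,000x, i.e. well over 10,000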



Focusing on scaled-down pretraining has several potential benefits. First, small-scale pretraining can be a sandbox for developing novel techniques that improve data efficiency. These techniques could then be scaled up to the much larger datasets common in applied NLP, or used to enhance current approaches to modeling low-resource languages. Second, improving our ability to train LMs on the same kinds and quantities of data that humans learn from will hopefully give us greater access to plausible cognitive models of humans and help us understand what allows humans to acquire language so efficiently.



The task has three fixed-data tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words. The third is the multimodal track, where the training set consists of 50M words of paired text-image data and 50M words of text-only data. There are also two tracks with no fixed dataset. One is the “bring-your-data” track, which caps the amount of text used at 100M words but allows innovation in the choice of data, its domain, and even its modality (i.e., data from sources other than text is welcome). The other is the paper-only track, which encourages contributions related to the goals of the challenge that do not involve direct competition entries. We will release a shared evaluation pipeline that evaluates models on a variety of benchmarks and tasks, including targeted syntactic evaluations, natural language understanding, and (for the multimodal track) visual-language understanding.
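
For participants preparing data under the 100M-word budget (for example, in the bring-your-data track), a minimal sketch of how one might sanity-check a corpus against that limit is shown below. The directory layout and whitespace tokenization are assumptions for illustration; consult the official rules and evaluation pipeline for how word counts are actually measured:

    # Minimal sketch: count whitespace-separated tokens across a set of training files.
    # The directory name and file pattern are hypothetical; whitespace splitting is an
    # assumption and may differ from the challenge's official word-counting procedure.
    import glob

    total_words = 0
    for path in glob.glob("train_100M/*.train"):   # hypothetical dataset layout
        with open(path, encoding="utf-8") as f:
            total_words += sum(len(line.split()) for line in f)

    print(f"{total_words:,} words")                # should stay at or below 100,000,000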



· March 30, 2024: Training data released (see website for download)

· April 30, 2024: Evaluation pipeline released

· September 13, 2024: Results due

· September 20, 2024: Paper submissions due

· October 8, 2024: Peer review begins

· October 30, 2024: Peer review ends; acceptances and leaderboard released

· Late 2024: Presentation of the workshop at an ML/NLP venue



Dates may change. For more information, visit the BabyLM website https://babylm.github.io/ or consult our extended call for papers.
