The International Working Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge, we call on everyone interested to apply their tools to a common dataset, bringing research and industry closer together. The challenge dares researchers and practitioners to put their mining tools and approaches to the test.
This year, the challenge focuses on large-scale repository mining using the Boa datasets from SourceForge and GitHub. We provide the metadata for almost 700,000 SourceForge projects and almost 8,000,000 GitHub repositories, as well as the full development histories with parsed abstract syntax trees for Java projects/repositories.
The breadth of the dataset enables participants to study research questions at an ultra-large scale. For example, you could study the influence of dataset size on the accuracy of data-driven approaches; you could evaluate how well existing and/or new approaches scale with increasing data sizes; you could categorize projects using textual descriptions and/or program elements in the source code.
The full development histories with parsed abstract syntax trees enable participants to study how projects have evolved over time, instead of only considering a project’s data at its latest snapshot or at specific points in time (such as releases). For example, you could study the use of certain Java libraries/features over time, such as testing frameworks and concurrency utilities; you could mine certain kinds of bugs/errors and their corresponding fixing patterns, such as concurrency errors and their fixes.
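As a rough illustration of the "library use over time" question, the following Python sketch computes, per year, the share of file revisions that import a given Java package. It assumes the history has already been exported into simple (Unix timestamp, list of imported names) pairs; that record layout, the function name `library_usage_by_year`, and the sample data are all hypothetical and are not the format of the Boa datasets.

```python
from collections import defaultdict
from datetime import datetime, timezone


def library_usage_by_year(revisions, package_prefix):
    """Return {year: fraction of revisions importing something under package_prefix}.

    `revisions` is an iterable of (unix_timestamp, imported_names) pairs.
    This record layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)  # revisions seen per year
    hits = defaultdict(int)    # revisions per year that import the package
    for timestamp, imports in revisions:
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        totals[year] += 1
        # Match the package itself or any name nested below it.
        if any(
            name == package_prefix or name.startswith(package_prefix + ".")
            for name in imports
        ):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample data: three file revisions with their import lists.
sample_revisions = [
    (1262304000, ["java.util.List"]),                         # 2010
    (1293840000, ["org.junit.Test", "java.util.List"]),       # 2011
    (1300000000, ["java.util.concurrent.ExecutorService"]),   # 2011
]

junit_usage = library_usage_by_year(sample_revisions, "org.junit")
concurrency_usage = library_usage_by_year(sample_revisions, "java.util.concurrent")
```

Matching on the package prefix plus a trailing dot avoids counting unrelated packages that merely share a leading string (for example, `org.junitx` when looking for `org.junit`). A real study would likely also need to decide whether to count revisions, files, or projects as the unit of analysis.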