GermEval Task 1 2019: Shared Task on hierarchical classification of German Blurbs
Call For Papers
GermEval 2019 Task 1 - Shared Task on hierarchical classification of German blurbs (short texts)
*Call for Participation*
We invite interested parties from academia and industry to participate in this shared task. Further information can be found here: https://competitions.codalab.org/competitions/21226.
Hierarchical multi-label classification (HMC) of blurbs is the task of assigning multiple labels to short descriptive texts of books, where each label is part of an underlying hierarchy of categories. The increasing amount of available digital documents and the need for more and finer-grained categories call for new, more robust and sophisticated text classification methods. Large datasets often incorporate a categorical hierarchy that can be used to organize documents at different levels of specificity. Traditional multi-class text classification approaches are thoroughly researched, but they fail to generalize adequately as the amount of available data grows and more specific hierarchies become necessary.
With this task we aim to foster research within the HMC context. The task focuses on classifying German books into their respective hierarchically structured categories using short advertisement texts (blurbs). The data contains additional metadata such as author, number of pages, release date, etc.
This shared task consists of two subtasks, described below. Participants may take part in either one of them or in both.
- *Subtask A*: The task is to classify German books into *one or multiple most general categories*. It can thus be considered a non-hierarchical multi-label classification task. Eight classes can be assigned in total: 'Literatur & Unterhaltung', 'Ratgeber', 'Kinderbuch & Jugendbuch', 'Sachbuch', 'Ganzheitliches Bewusstsein', 'Glaube & Ethik', 'Künste', 'Architektur & Garten'.
- *Subtask B*: The second task targets hierarchical multi-label classification, where the full hierarchy of labels should be assigned to a book. In addition to the most general category (Subtask A), further categories of different specificity can be assigned to a book. In total, 343 different classes can be assigned in a hierarchical structure of at most 4 levels.
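The relationship between the two subtasks can be illustrated with a minimal Python sketch. It is not the organizers' data format or evaluation code: the child category names and the `PARENT` map are hypothetical placeholders, and only three of the top-level classes from Subtask A are used. The sketch shows how a set of fine-grained labels could be closed upward so that the full label path is assigned (Subtask B), and how the top-level part of that set could be encoded as a multi-hot vector (Subtask A).

```python
from typing import Dict, List, Set

# A few of the top-level classes named in Subtask A (deliberately not the full list).
TOP_LEVEL: List[str] = ["Literatur & Unterhaltung", "Ratgeber", "Sachbuch"]

# Hypothetical child -> parent edges. The lower-level names are placeholders,
# not categories taken from the actual task hierarchy.
PARENT: Dict[str, str] = {
    "example-genre": "Literatur & Unterhaltung",
    "example-subgenre": "example-genre",
    "example-topic": "Sachbuch",
}


def with_ancestors(labels: Set[str]) -> Set[str]:
    """Return the label set extended by every ancestor of every label."""
    closed = set(labels)
    for label in labels:
        node = label
        while node in PARENT:  # walk up until a top-level class is reached
            node = PARENT[node]
            closed.add(node)
    return closed


def multi_hot(labels: Set[str], classes: List[str]) -> List[int]:
    """Encode which of the given classes occur in the label set."""
    return [1 if c in labels else 0 for c in classes]


# A book annotated with two fine-grained (hypothetical) labels.
book_labels = {"example-subgenre", "example-topic"}
full_labels = with_ancestors(book_labels)            # Subtask B style label set
top_level_vector = multi_hot(full_labels, TOP_LEVEL)  # Subtask A style target

print(sorted(full_labels))
print(top_level_vector)  # [1, 0, 1]
```

Closing the label set upward keeps predictions consistent with the hierarchy: a book never carries a specific category without also carrying its more general parents.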
The entire dataset consists of 20,784 examples in total. Sample data is provided to help participants become familiar with the structure of the data. 14,548 training samples have been released and can be downloaded after registering for the shared task. A validation set (2,079 samples) has been published, for which the gold labels have been held back. Submissions for the validation set via the CodaLab page are accepted and published on a leaderboard until June 1st. From June 1st, we will start the final evaluation phase of the task by providing the gold labels of the validation set, which can be used as additional training data. Additionally, the test set samples will be provided, for which we accept submissions until July 15th. More information can be found on the task's webpage: https://competitions.codalab.org/competitions/21226
- January 2019: Release of trial data
- February 01, 2019: Release of training data (train + validation)
- June 01, 2019: Release of gold labels for validation set + test data
- July 15, 2019: Final deadline for submissions of test results
- July 31, 2019: Submission of description papers
- August 20, 2019: Notification of acceptance
- September 15, 2019: Camera-ready deadline for system description papers
- October 08, 2019: Workshop in Erlangen, Germany
The shared task will be accompanied by a pre-conference workshop of the Conference on Natural Language Processing ("Konferenz zur Verarbeitung natürlicher Sprache", KONVENS) hosted on October 8, 2019 at FAU Erlangen-Nuremberg (http://2019.konvens.org/).
Description papers will appear in online workshop proceedings. Participants who submit a description paper will be asked to register for the workshop and present their system as a poster or in an oral presentation (depending on the number of submissions).
The task is organized by Rami Aly, Steffen Remus and Chris Biemann, Language Technology, Department of Informatics, Universität Hamburg, https://lt.informatik.uni-hamburg.de
GermEval is a series of shared task evaluation campaigns that focus on Natural Language Processing for the German language. GermEval has been conducted four times since 2014 in co-location with KONVENS/GSCL conferences. For an overview of the currently conducted tasks, please see http://2019.konvens.org/germeval. We highly encourage readers to also take note of task 2 (Identification of offensive language, https://projects.fzai.h-da.de/iggsa/) and task 3 (Lemmatization of German Web and Social Media Texts, https://fau-klue.github.io/empirist-lemmatization/).