[WWW 2021] FinSBD-3 Shared Task 2021 : Structure Boundary Detection, an extension of Sentence Boundary Detection in PDF Noisy Text in the Financial Domain

Link: https://sites.google.com/nlg.csie.ntu.edu.tw/finweb2021/shared-task-finsbd-3

When	Apr 19, 2021 - Apr 23, 2021
Where	Ljubljana, Slovenia
Submission Deadline	Feb 17, 2021

Categories segmentation/tokenization NLP text preprocessing machine learning

Call For Papers

Greetings,

We would like to invite you to submit to FinSBD-3, the 3rd shared task on Structure Boundary Detection in PDF Noisy Text in the Financial Domain, held in conjunction with The Web Conference 2021, April 19-23, 2021, in Ljubljana, Slovenia!

Call for Participation: FinSBD-3 https://sites.google.com/nlg.csie.ntu.edu.tw/finweb2021/shared-task-finsbd-3 [2]
Register here: https://forms.gle/FnVThgUbUa2x7Rr76 [4]
Collocated with the FinWeb-2021 workshop: https://sites.google.com/nlg.csie.ntu.edu.tw/finweb2021 [1]
Submission deadline: February 10, 2021
Workshop date: The Web Conference 2021, April 19-23, 2021, Ljubljana, Slovenia [3]

Motivation
==========
Sentences
Sentences are basic units of written language. Detecting the beginning and end of sentences, or sentence boundary detection (SBD), is the foundational first step in many Natural Language Processing (NLP) applications such as POS tagging; syntactic, semantic, and discourse parsing; information extraction; and machine translation.
Despite its important role in NLP, sentence boundary detection has so far not received enough attention. Previous research in the area has been confined to formal texts (news, European Parliament proceedings, etc.), where existing rule-based and machine learning approaches are extremely accurate as long as the data is perfectly clean. No sentence boundary detection research to date has addressed the problem in noisy texts extracted automatically from machine-readable files (generally in PDF format) such as financial documents.
One type of financial document is the prospectus. Financial prospectuses are official PDF documents in which investment funds precisely describe their characteristics and investment modalities. The most important step in extracting any information from these files is to parse them to get noisy unstructured text, clean the text, format the information (by adding several tags) and finally transform it into semi-structured text in which sentence and list boundaries are well marked.
These prospectuses also contain many visual demarcations, including bullets and numbering, that indicate a hierarchy of sections. They contain many sentence fragments and titles, not just complete sentences, and more often than not they contain punctuation errors. Lists are often used to present the dense information in a more readable format.

Lists
A list can be similar to a sentence that enumerates several items of the same category. For example, the “Simple List” in Figure 1 [6] can easily be read as one normal sentence. The list in Figure 2, however, cannot be read as one sentence, although it is one unit, because it contains multiple sentences and has a visible hierarchy of information. It is therefore important to distinguish between sentences and lists and, for lists, to create a hierarchy that organizes the items. Mastering this distinction and the item hierarchy can pave the way for more accurate information extraction.

Document structure elements: Footer, Header, Tables
This year, we have included the task of extracting document structure elements such as footers, headers and tables, because of their unique structure and common occurrence in financial documents.
Footers and headers are used in financial prospectuses, as shown in Figure 3, to hold information that the author wants to appear on every page, such as the title of the document or page numbers. Tables are also widely used to present text information and statistical data, as shown in Figure 4, and we often observe multi-page tables (see Figure 5) in financial documents.

Task Description
================
In the last edition, FinSBD-2, we focused on extracting well-segmented sentences, lists and list items from financial prospectuses in PDF format by detecting their beginning and end boundaries, in two languages: English and French. This year, we improve the previously proposed tasks and extend them to the detection of document structure boundaries.
The goal of FinSBD-3 is thus to extract the boundaries of sentences, lists and list items, as well as of structure elements such as footers, headers and tables. Given a set of textual documents extracted from PDF files, participants in this shared task have to extract a set of well-delimited sentences, lists, list items and structure elements (footers, headers and tables).
For each given PDF, a JSON file will be provided containing:
text extracted by us (key "text")
sentence boundaries (key "sentence")
list boundaries (key "list")
list item boundaries (key "item")
list item boundaries of level 1 (key "item1")
list item boundaries of level 2 (key "item2")
list item boundaries of level 3 (key "item3")
list item boundaries of level 4 (key "item4")
Item boundaries overlap with item boundaries of different levels. Each item level represents its depth within the list.
table boundaries (key "table")
footer boundaries (key "footer")
header boundaries (key "header")
Boundaries are represented by indexes of starting and ending characters that the system has to predict.
We also include the PDF coordinates of each boundary as metadata (which can be used for visualization on the PDF if needed).
Example:
{
  "text": "Ce document fournit des informations essentielles aux investisseurs ...",
  "sentence": [ {"start": 17, "end": 53, "coordinates": ... }, ... ],
  "list": [ {"start": 1080, "end": 1267, "coordinates": ... }, ... ],
  "item": [ ... ],
  "item1": [ ... ],
  "item2": [ ... ],
  "item3": [ ... ],
  "item4": [ ... ]
}
We are providing character indexes as well as coordinates of boundaries to allow different kinds of character or word tokenization and/or the use of spatial and visual cues. We hope this will encourage novel approaches based on multimodality, especially since lists are often spatially structured to convey information visually.
Participants can choose to work on both languages, or submit systems for one language only. This task is open to everyone. The only exceptions are the co-chairs of the organizing team, who cannot submit a system, and who will serve as an authority to resolve any disputes concerning ethical issues or completeness of system descriptions.
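To make the boundary format concrete, here is a minimal Python sketch of how the text of each annotated unit could be recovered from the character indexes. The miniature document and the helper name extract_spans are hypothetical and only for illustration, the "coordinates" metadata is omitted, and the sketch assumes that "end" is exclusive; the announcement does not say whether the ending index is inclusive, so this should be checked against the released training data.

```python
import json

# Hypothetical miniature document in the format described above.
# Offsets were chosen by hand for this example; "end" is ASSUMED
# to be exclusive (Python slice convention), which may not match
# the convention used in the official data.
raw = json.dumps({
    "text": "Fund overview. The fund invests in: equities; bonds.",
    "sentence": [{"start": 0, "end": 14}],
    "list": [{"start": 15, "end": 52}],
    "item": [{"start": 36, "end": 45}, {"start": 46, "end": 52}],
})


def extract_spans(doc, key):
    """Return the text covered by every boundary stored under `key`."""
    text = doc["text"]
    # A missing key is treated as "no boundaries of this type".
    return [text[b["start"]:b["end"]] for b in doc.get(key, [])]


doc = json.loads(raw)
sentences = extract_spans(doc, "sentence")
lists = extract_spans(doc, "list")
items = extract_spans(doc, "item")
print(sentences, lists, items)
```

Because the boundaries are plain character offsets into "text", the same helper works for any of the keys ("table", "footer", "header", "item1", and so on) without depending on a particular tokenizer.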

Evaluation
==========
For each sub-task, the evaluation metrics will be computed on boundaries, which are pairs of character indexes ("start" and "end"). The F-score will be the official metric, and an evaluation script will be provided to all teams.
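As an illustration only, an F-score over boundary pairs could be computed along the following lines. This sketch is not the official evaluation script: the function name boundary_f_score is hypothetical, and it assumes that a predicted boundary counts as correct only when both its "start" and "end" exactly match a gold boundary, and that precision and recall are weighted equally (F1). The official script may use different matching or averaging rules.

```python
def boundary_f_score(gold, predicted):
    """F1 over exact (start, end) matches; an ASSUMED matching rule."""
    gold_set = {(b["start"], b["end"]) for b in gold}
    pred_set = {(b["start"], b["end"]) for b in predicted}

    # A prediction is a true positive only if the whole pair matches.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and system boundaries for one sub-task.
gold = [{"start": 0, "end": 14}, {"start": 15, "end": 52}]
pred = [
    {"start": 0, "end": 14},   # exact match
    {"start": 15, "end": 40},  # correct start, wrong end: no credit
    {"start": 41, "end": 52},  # spurious boundary
]
score = boundary_f_score(gold, pred)
print(round(score, 3))
```

In this example one of three predictions matches one of two gold boundaries, so precision is 1/3, recall is 1/2, and the resulting score is 0.4. Under this strict rule, a boundary that is off by a single character earns no credit.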

Prize
=====
A prize of USD 1,000 will be awarded to the best-performing teams.

Important dates
===============
Dec 23, 2020 - First announcement of the shared task and beginning of registration
Jan 08, 2021 - Release of training data and scoring script
Feb 02, 2021 - Test set made available
Feb 10, 2021 - Registration deadline
Feb 10, 2021 - Systems' outputs collected
Feb 15, 2021 - Release of results
Feb 19, 2021 - Shared task title and abstract due
Feb 23, 2021 - Shared task paper submissions due
Mar 01, 2021 - Camera-ready version of shared task paper due
Apr 19-23, 2021 - FinWeb 2021 Workshop (Ljubljana, Slovenia)

Contact
=======
For any questions about the shared task, please contact us at fin.sbd.task@gmail.com [5]

Shared Task Organizing committee
================================
Abderrahim AIT-AZZI, Fortia Financial Solutions
Willy AU, Fortia Financial Solutions
Ismail EL MAAROUF, Fortia Financial Solutions
Juyeon KANG, Fortia Financial Solutions

Sincerely,

The FinSBD Organizers

The Web Conference 2021

[1] FinWeb-2021: https://sites.google.com/nlg.csie.ntu.edu.tw/finweb2021/
[2] FinSBD-3: https://sites.google.com/nlg.csie.ntu.edu.tw/finweb2021/shared-task-finsbd-3
[3] The Web Conference 2021: https://www2021.thewebconf.org/
[4] Registration form: https://forms.gle/FnVThgUbUa2x7Rr76
[5] Email: fin.sbd.task@gmail.com
[6] Figures: https://sites.google.com/nlg.csie.ntu.edu.tw/finweb2021/shared-task-finsbd-3