FinTOC 2023 : FNP-2023 Shared Task - FinTOC (Financial Document Structure Extraction)
Call For Papers
Call for participation:
FNP-2023 Shared Task: FinTOC - Financial Document Structure Extraction
To be held as part of the 5th Financial Narrative Processing Workshop (FNP 2023) during the 2023 IEEE International Conference on Big Data (IEEE BigData 2023) in Sorrento, Italy, from 15 to 18 December 2023. The workshop is a one-day event; the exact date is to be announced.
Shared Task URL: http://wp.lancs.ac.uk/cfie/fintoc2023/
Workshop URL: https://wp.lancs.ac.uk/cfie/fnp2023/
Participation Form: https://docs.google.com/forms/d/e/1FAIpQLSdqUKy3YGho0Cw2GF__VHilHZZbR75UDG3JRBC4k0Yxw4acWg/viewform?usp=pp_url
Shared Task Description:
A vast and continuously growing volume of financial documents is being created and published in machine-readable formats, predominantly PDF. Unfortunately, these documents often lack comprehensive structural information, presenting a challenge for efficient analysis and interpretation. Nevertheless, these documents play a crucial role in enabling firms to report their activities, financial situation, and investment plans to shareholders, investors, and the financial markets. They serve as corporate annual reports, offering detailed financial and operational information.
In certain countries like the United States and France, regulators such as the SEC (Securities and Exchange Commission) and the AMF (Financial Markets Authority) have implemented requirements for firms to adhere to specific reporting templates. These regulations aim to promote standardization and consistency across firms' disclosures. However, in various European countries, management typically possesses more flexibility in determining what, where, and how to report financial information, resulting in a lack of standardization among financial documents published within the same market.
Although some research has been conducted on recognizing the tables of contents (TOCs) of books and documents, most existing work has focused on small-scale, application-dependent and domain-specific datasets. This limited scope poses challenges when dealing with a vast collection of heterogeneous documents and books, where TOCs from different domains exhibit significant variations in visual layout and style. Consequently, recognizing and extracting TOCs becomes an intricate problem. Indeed, in comparison to regular books that are typically provided in a full-text format with limited structural information such as pages and paragraphs, financial documents possess a more complex structure. They consist of various elements, including parts, sections, sub-sections, and even sub-sub-sections, incorporating both textual and non-textual content. Moreover, TOC pages are not always present to help readers navigate the document, and when they are, they often only provide access to the main sections.
In this shared task, our objective is to analyze various types of financial documents, encompassing KIID (Key Investor Information Document), Prospectus (official PDF documents where investment funds meticulously describe their characteristics and investment modalities), Réglement and Financial Annual Reports/Financial Statements (that provide a detailed overview of a company's financial performance and operations over the course of a fiscal year). These documents play a vital role in providing crucial information to investors, stakeholders, and regulatory bodies. While the content they must contain is often prescribed and regulated, their format lacks standardization, leading to a significant degree of variability. The presentation styles range from plain text format to more visually rich and data-driven graphical and tabular representations. Notably, the majority of these documents are published without a table of contents. A TOC is typically essential for readers as it enables easy navigation within the document by providing a clear outline of headers and corresponding page numbers. Additionally, TOCs serve as a valuable resource for legal teams, facilitating the verification of the inclusion of all the required contents. Consequently, the automated analysis of these documents to extract their structure is becoming increasingly useful for numerous firms worldwide.
Our primary focus for this edition is to expand the extraction of table of contents to a wider variety of financial documents, and the task will involve developing highly efficient algorithms and methodologies to address the challenges associated with such a dataset. Our aim is to achieve a level of generalization ensuring that the developed system can be applied to different types of financial documents. This broader scope allows us to explore the applicability of our methodologies across a range of financial document categories, such as KIID, Prospectus, Réglement and Financial Annual Reports/Financial Statements. This way, we want to demonstrate the versatility and effectiveness of the ML algorithms used in TOC extraction, enabling a streamlined and consistent approach across various financial document types.
In addition, for this edition, we are excited to introduce a dataset that goes beyond textual annotations. Our proposed dataset will include visual (spatial) annotations that capture the coordinates of the titles and hierarchical structure of the documents. This comprehensive approach enables a more holistic analysis and understanding of financial documents.
By incorporating visual annotations, we can capture the visual cues and design elements that contribute to the overall structure and organization of the documents. This allows us to delve deeper into the visual representation of the table of contents and extract valuable insights from the visual hierarchy present in these financial documents. The combination of textual and visual annotations provides a richer and more nuanced dataset, making it possible to increase the accuracy and effectiveness of the machine learning algorithms and methodologies employed in TOC extraction.
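As an illustration only, the kind of record such textual and spatial annotations describe can be sketched in Python as follows. The official annotation format is not specified in this call, so the class name, the field names (text, page, depth, bbox) and the helper build_toc are all hypothetical; the sketch simply shows how a flat, reading-order list of titles with depths and coordinates could be nested into a hierarchy.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TitleAnnotation:
    """Hypothetical annotation record; not the official FinTOC format."""
    text: str                                 # title text as it appears in the PDF
    page: int                                 # page number on which the title appears
    depth: int                                # 1 = top-level section, 2 = sub-section, ...
    bbox: Tuple[float, float, float, float]   # assumed (x0, y0, x1, y1) page coordinates
    children: List["TitleAnnotation"] = field(default_factory=list)


def build_toc(titles: List[TitleAnnotation]) -> List[TitleAnnotation]:
    """Nest a flat, reading-order list of titles into a tree using their depth."""
    roots: List[TitleAnnotation] = []
    stack: List[TitleAnnotation] = []  # chain of currently open ancestors
    for title in titles:
        # Close every open title that is not a strict ancestor of this one.
        while stack and stack[-1].depth >= title.depth:
            stack.pop()
        if stack:
            stack[-1].children.append(title)
        else:
            roots.append(title)
        stack.append(title)
    return roots


if __name__ == "__main__":
    flat = [
        TitleAnnotation("1 Objectives", 1, 1, (50.0, 700.0, 300.0, 720.0)),
        TitleAnnotation("1.1 Investment policy", 1, 2, (60.0, 650.0, 320.0, 668.0)),
        TitleAnnotation("2 Risk profile", 3, 1, (50.0, 710.0, 280.0, 730.0)),
    ]
    for root in build_toc(flat):
        print(root.text, [child.text for child in root.children])
```

The stack keeps only the open ancestors of the current title, so one pass over the titles is enough to recover the nesting.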
Thanks to the contribution of the Autonomous University of Madrid (UAM, Spain), the fifth edition of the FinTOC Shared Task welcomes a specific track for Spanish documents, continuing from the previous edition.
In this edition, systems will be scored based on their performance in both Title detection and TOC generation using more precise evaluation metrics based on visual annotations.
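The exact evaluation metrics are not described in this call. Purely as a sketch of how title detection could be scored against visual annotations, the Python example below matches predicted title boxes to gold boxes on the same page by intersection-over-union (IoU) and reports precision, recall and F1. The matching rule, the 0.5 threshold and the function names are assumptions made for illustration, not the organizers' scoring script.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def title_detection_scores(
    predicted: List[Tuple[int, Box]],
    gold: List[Tuple[int, Box]],
    threshold: float = 0.5,  # hypothetical IoU threshold
) -> Tuple[float, float, float]:
    """Greedy one-to-one matching of (page, box) pairs; returns (precision, recall, F1)."""
    unmatched = list(gold)
    true_positives = 0
    for page, box in predicted:
        best_index, best_score = None, threshold
        for i, (gold_page, gold_box) in enumerate(unmatched):
            if gold_page != page:
                continue
            score = iou(box, gold_box)
            if score >= best_score:
                best_index, best_score = i, score
        if best_index is not None:
            unmatched.pop(best_index)  # each gold title can be matched only once
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1
```

A TOC generation score would additionally need to compare the hierarchy level assigned to each matched title, which this sketch does not attempt.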
Participants are required to register for the Shared Task. Once registered, all participating teams will receive a common training dataset consisting of PDF documents along with the associated TOC annotations.
To participate, please use the following registration form to add details about your team: https://docs.google.com/forms/d/e/1FAIpQLSdqUKy3YGho0Cw2GF__VHilHZZbR75UDG3JRBC4k0Yxw4acWg/viewform?usp=pp_url (now open as of 06/01/2023)
Important Dates:
1st Call for papers & shared task participants: June 12, 2023
2nd Call for papers & shared task participants: July 17, 2023
Final Call for papers & shared task participants: August 17, 2023
Training set release: August 21, 2023
Blind test set release: September 21, 2023
Systems submission: October 03, 2023
Release of results: October 09, 2023
Paper submission deadline: October 18, 2023 (anywhere in the world)
Notification of paper acceptance to authors: November 01, 2023
Camera-ready of accepted papers: November 15, 2023
Workshop date (one-day event): December 15-18, 2023 (exact date to be announced)
For any questions on the shared task please contact us on:
Shared Task Organizers:
- Abderrahim Ait Azzi, 3DS Outscale (ex Fortia), France
- Sandra Bellato, 3DS Outscale (ex Fortia), France
- Blanca Carbajo Coronado, Universidad Autónoma de Madrid
- Dr Ismail El Maarouf, Imprevicible
- Dr Juyeon Kang, 3DS Outscale (ex Fortia), France
- Prof. Ana Gisbert, Universidad Autónoma de Madrid
- Prof. Antonio Moreno Sandoval, Universidad Autónoma de Madrid