OSACT 2016 : The 2nd Workshop on Arabic Corpora and Processing Tools (2016 Theme: Social Media)
Call For Papers
In the NLP and CL communities, Arabic is considered relatively resource-poor compared to English, and this was thought to be the reason for the limited number of corpus-based studies of Arabic. However, recent years have witnessed the emergence of a considerable number of new, freely available Modern Standard Arabic (MSA) corpora and, to a lesser extent, Arabic processing tools. Over the same period, the use of Arabic in social media has increased dramatically, leading to an abundance of Arabic content that is formal or informal, MSA or dialectal, and written in Arabic script or Arabizi. Other phenomena include the use of emoticons, abbreviated words, decorations, etc. Despite the abundance of such content, there is a severe shortage of annotated corpora and processing tools tailored to it.
Available Arabic corpora can be divided into two groups. The first group contains large collections of Arabic text that are designed and constructed primarily for Arabic linguistic and NLP research, and that can be useful for a variety of tasks such as language modeling. These corpora cover diverse genres, and their sizes range from one million words to billions of words. The second group contains corpora designed primarily for specific Arabic NLP tasks such as text classification, clustering, POS tagging, etc.; they typically contain annotations at the clitic, word, sentence, paragraph, or document level. Most of the currently available corpora in this group are composed of newspaper articles and range in size from tens of thousands of words to millions of words. Annotated corpora derived from social media continue to be limited, and processing tools for such corpora are lacking. The required tools include corpus exploration tools that provide word/stem frequencies, concordances, collocations, etc., and processing tools for tokenization, normalization, word segmentation, morphological analysis, and part-of-speech tagging. Having proper exploration and processing tools can open the door to a variety of applications such as machine translation, opinion mining, text classification, and a variety of social applications.
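To make the kind of exploration tool mentioned above concrete, the following is a minimal sketch in Python of orthographic normalization, word frequencies, and a keyword-in-context concordance. It is only an illustration under simplifying assumptions: the function names (normalize, tokenize, word_frequencies, concordance) are hypothetical, the normalization rules (removing diacritics and tatweel, unifying alef variants, mapping alef maqsura to ya) are one common convention rather than a standard, and tokenization is plain matching of Arabic-letter runs with no clitic segmentation or dialect handling.

```python
import re
from collections import Counter

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640); assumed rule set.
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, hamza below -> bare alef (U+0627).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# Runs of Arabic letters in the basic block (hamza U+0621 .. ya U+064A).
_ARABIC_WORD = re.compile("[\u0621-\u064A]+")


def normalize(text):
    """Apply a simple, lossy orthographic normalization."""
    text = _DIACRITICS_AND_TATWEEL.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    # Alef maqsura (U+0649) -> ya (U+064A).
    return text.replace("\u0649", "\u064A")


def tokenize(text):
    """Normalize, then split into runs of Arabic letters."""
    return _ARABIC_WORD.findall(normalize(text))


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    keyword = normalize(keyword)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits


sample = "ذهب الولد إلى المدرسة ثم عاد الولد إلى البيت"
sample_tokens = tokenize(sample)
sample_counts = word_frequencies(sample_tokens)
sample_hits = concordance(sample_tokens, "الولد", window=1)
```

Social media text would need more than this, for example handling of Arabizi, emoticons, and elongated words, which is part of the gap the workshop addresses.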
Topics of interest
This half-day workshop aims to encourage researchers and developers to foster the use of freely available Arabic corpora, including social media corpora, and open source Arabic language processing tools, to help highlight the drawbacks of these resources, and to discuss techniques and approaches for improving them. The workshop topics include, but are not limited to:
Surveying and critiquing the design of freely available Arabic corpora, their associated tools, and stand-alone Arabic corpus processing tools.
Making available new annotated corpora for NLP applications such as named entity recognition, machine translation, part-of-speech tagging, sentiment analysis, text classification, and language learning.
Evaluating the use of crowdsourcing platforms (e.g., Mechanical Turk, CrowdFlower) for Arabic data annotation.
Tools and Technologies:
Language education, e.g., L1 and L2.
Language modeling and word embeddings.
Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, parsing, diacritization
Sentiment analysis, dialect identification, and text classification
Trend analysis and opinion mining
Measuring polarization and opinion shift
Religious and ideological discourse
Important dates
Submission deadline: 10 February 2016
Notification of acceptance: 10 March 2016
Final submission of manuscripts: 21 March 2016
Workshop date: Tuesday, 24 May 2016 (Morning session)
The language of the workshop is English, and submissions should follow the LREC 2016 paper submission instructions. All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format through the START system.
When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of your research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.), to enable their reuse, replicability of experiments, including evaluation ones, etc.
Distinguished papers, after further revisions, will be considered for publication in a special issue of the Journal of King Saud University - Computer and Information Sciences: http://ees.elsevier.com/jksu-cis/default.asp
Note: The LREC Proceedings have been accepted for inclusion in the Thomson Reuters Conference Proceedings Citation Index (CPCI).
Hend Al-Khalifa, King Saud University, KSA
Abdulmohsen Al-Thubaity, King Abdul Aziz City for Science and Technology, KSA
Walid Magdy, Qatar Computing Research Institute, Qatar
Kareem Darwish, Qatar Computing Research Institute, Qatar
Abdulrhman Almuhareb, KACST, KSA
Abdullah Alfaifi, Imam University, KSA
Abeer ALDayel, King Saud University, KSA
Areeb AlOwisheq, Imam University, KSA
Auhood Alfaries, King Saud University, KSA
Hamdy Mubarak, Qatar Computing Research Institute, Qatar
Hazem Hajj, American University of Beirut, Lebanon
Hind Al-Otaibi, King Saud University, KSA
Houda Bouamor, Carnegie Mellon University, Qatar
Kemal Oflazer, Carnegie Mellon University, Qatar
Khurshid Ahmad, Trinity College Dublin, Ireland
Maha Alrabiah, Imam University, KSA
Mohammad Alkanhal, KACST, KSA
Mohsen Rashwan, Cairo University, Egypt
Mona Diab, George Washington University, US
Muhammad M. Abdul-Mageed, Indiana University, US
Nizar Habash, New York University Abu Dhabi, UAE
Nora Al-Twairesh, King Saud University, KSA
Nouf Al-Shenaifi, King Saud University, KSA
Stephan Vogel, Qatar Computing Research Institute, Qatar
Tamer Elsayed, Qatar University, Qatar
Wajdi Zaghouani, Carnegie Mellon University in Qatar, Qatar