Turkish Preprocessing Operations Using Deep Learning Approaches

The first step in nearly all natural language processing (NLP) applications is to apply preprocessing operations [1] to the text. These operations include tokenization (segmenting the text into tokens), sentence splitting (dividing the text into sentences), and normalization (converting the text into a canonical form), among others. In this project, you will develop and implement algorithms for preprocessing Turkish text using deep learning approaches. First, a literature review will be conducted and similar systems for English (e.g. UDPipe [2], Stanza [3]) will be analyzed. Then, a deep learning model will be built for each preprocessing operation. The models will be adapted to Turkish based on the characteristics of the language (e.g. by using embeddings for suffixes). Finally, the system will be tested on Turkish corpora, most likely the Turkish treebanks in the Universal Dependencies (UD) framework [4].
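To make the three operations concrete, here is a minimal rule-based sketch in Python. It is not the deep learning system the project will build, only a simple baseline of the kind such models are usually compared against. The function names and the splitting rules are assumptions made for illustration, and normalization is reduced here to Turkish-aware lowercasing.

```python
import re
import unicodedata
from typing import List


def normalize(text: str) -> str:
    """Lowercase with Turkish casing rules, then apply Unicode NFC."""
    # Plain str.lower() maps "I" to "i", but in Turkish "I" lowercases to
    # dotless "ı" and "İ" lowercases to "i", so handle both explicitly.
    text = text.replace("İ", "i").replace("I", "ı")
    return unicodedata.normalize("NFC", text.lower())


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    # This rule fails on abbreviations such as "Dr." and on ordinals,
    # which is one reason a learned model is attractive.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Split into word tokens and single punctuation marks."""
    # An apostrophe inside a word is kept, since Turkish attaches suffixes
    # to proper nouns with an apostrophe (e.g. "İstanbul'a").
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


if __name__ == "__main__":
    text = "Işık İstanbul'a gitti. Hava güzel!"
    for sentence in split_sentences(text):
        print(tokenize(normalize(sentence)))
```

The explicit handling of "I" and "İ" is an example of the language-specific adaptation mentioned above: a generic lowercasing routine silently produces the wrong letter for Turkish text.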


[1] https://exchange.scale.com/public/blogs/preprocessing-techniques-in-nlp-...

[2] Straka, M., Hajic, J., Strakova, J. (2016) UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing, In Proc. of the Tenth International Conference on Language Resources and Evaluation (LREC), pp. 4290-4297

[3] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D. (2020) Stanza: A Python Natural Language Processing Toolkit for Many Human Languages, In Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 101-108

[4] https://universaldependencies.org/


Project Members: 

Umut Şener
Koray Tekin

Project Advisor: 

Tunga Güngör

Project Status: 

Project Year: 

  • Spring
