Named Entity Recognition For Turkish Microblog Texts Using Semi-Supervised Learning With Word Embeddings

Summary:

The growing popularity of social media and the value of the information contained in real-world data have drawn significant attention to information extraction from informal text types such as microblog texts, along with the challenges this task poses for the Natural Language Processing (NLP) research community. In this study, we focused on the Named Entity Recognition (NER) problem for Turkish, a morphologically rich language, on informal text types such as microblog texts. For this purpose, we used a semi-supervised learning approach composed of an unsupervised stage followed by a supervised stage based on neural networks. We applied a fast unsupervised method for learning continuous vector-space representations of Turkish words. We then used these word embeddings, together with language-independent features engineered to work better on informal text types, to build a Turkish NER system for microblog texts. To examine informal and short texts in Turkish, we focused on Twitter, the most popular microblogging environment, and evaluated our Turkish NER system on short, unstructured Twitter messages called tweets. Our NER system achieved better F-scores than the published results of previously proposed NER systems on Turkish tweets. More precisely, we outperformed the state-of-the-art F-score by up to 11% on the same Turkish Twitter data. The only language-dependent stage of our system is the normalization scheme we applied to Turkish microblog texts as a preprocessing step before NER, which improves the performance of our system on informal text types. Since we did not employ any language-dependent features other than this Turkish text normalization, we believe that our method can be easily adapted to microblog texts in other morphologically rich languages.

Özet (Abstract):

Given the growing popularity of social media use and the value of the information contained in the data shared on social media, extracting information from such unstructured texts has begun to attract great interest. This has also brought many challenges for natural language processing research. In this study, we focused on solving the named entity recognition problem for Turkish, a morphologically rich language, particularly on unstructured texts such as microblog texts. For this purpose, we used a semi-supervised learning technique that consists of supervised and unsupervised learning stages and is based on artificial neural networks. First, we obtained Turkish word representations in a multi-dimensional continuous vector space using a fast unsupervised learning method. Then, using both these word representations and language-independent features adapted to give better results on unstructured texts, we developed a Turkish named entity recognition system for such texts. To examine unstructured and short Turkish texts, we concentrated on Twitter, the most popular microblogging platform, and tested the system we developed on short Twitter messages called tweets. We observed that our system's performance on Turkish Twitter messages was better than that of systems previously published for this purpose. We surpassed the most advanced published system for named entity recognition on Turkish Twitter texts with an improvement of 11%. The only language-specific stage of our system is the Turkish text normalization that we apply to Turkish Twitter texts before named entities are recognized, and this stage improves performance on unstructured texts. Because we did not directly use language-specific features apart from the normalization stage, we believe that our method can also be easily adapted to unstructured texts in other morphologically rich languages.
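
The combination of word embeddings with language-independent features described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy 3-dimensional embedding table, the three surface features (initial capital, mention or hashtag prefix, digit), the window size, and the helper names `surface_features`, `token_vector`, and `window_vector` are hypothetical and are not taken from the study. The sketch only shows how an input vector for a supervised classifier might be assembled; it does not include the embedding training, the normalization step, or the neural network itself.

```python
from typing import Dict, List

# Toy 3-dimensional embeddings (hypothetical values). In a real system these
# would be learned from a large unlabeled corpus in the unsupervised stage.
EMBEDDINGS: Dict[str, List[float]] = {
    "ankara": [0.9, 0.1, 0.0],
    "yolundayım": [0.0, 0.8, 0.2],
}
DIM = 3
N_SURFACE = 3
UNKNOWN = [0.0] * DIM  # fallback for out-of-vocabulary tokens


def surface_features(token: str) -> List[float]:
    """Language-independent cues; the specific choices are assumptions."""
    return [
        1.0 if token[:1].isupper() else 0.0,                # initial capital
        1.0 if token.startswith(("@", "#")) else 0.0,       # mention or hashtag
        1.0 if any(ch.isdigit() for ch in token) else 0.0,  # contains a digit
    ]


def token_vector(token: str) -> List[float]:
    """Concatenate the token's embedding with its surface features."""
    return EMBEDDINGS.get(token.lower(), UNKNOWN) + surface_features(token)


def window_vector(tokens: List[str], i: int, size: int = 1) -> List[float]:
    """Build the classifier input for tokens[i] from a context window.

    Positions that fall outside the tweet are padded with zeros.
    """
    vec: List[float] = []
    for j in range(i - size, i + size + 1):
        if 0 <= j < len(tokens):
            vec += token_vector(tokens[j])
        else:
            vec += [0.0] * (DIM + N_SURFACE)
    return vec


tweet = ["Ankara", "yolundayım", "@ali"]
features = window_vector(tweet, 0)
print(len(features))
```

With a window of one token on each side, every position yields a vector of 3 × (3 + 3) = 18 values, which a supervised model could then map to entity labels. Note that Python's default `str.lower()` does not apply Turkish casing rules (for example for dotted and dotless i), so a real pipeline would need locale-aware lowercasing.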
