Obtaining Better Turkish Corpora with High Quality Language Identification

Multilingual corpora like mC4 and OSCAR include Turkish subsets, but they have two significant limitations:

Language Identification: These corpora rely on language identification models that cover hundreds of languages at once, which can lead to misclassified text in the Turkish subsets.
Content Extraction: Turkish web pages often carry headers, footers, and metadata, which the extraction pipelines behind these corpora struggle to separate from the main content.

This project aims to use a Turkish-specific language identification model and better scraping techniques to improve the extraction of main content in the OSCAR project, resulting in a higher-quality pretraining corpus for Turkish LLMs.
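
As a rough illustration of the two steps this project targets, the sketch below first drops lines that repeat across many pages (a simple stand-in for header and footer removal) and then keeps only pages that a toy Turkish scorer accepts. Everything here is an assumption made for illustration: the stopword list, the threshold values, and the function names are hypothetical, the scorer is a crude heuristic rather than a trained language identification model, and none of this reflects how OSCAR or Ungoliant is actually implemented.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources. A real system would use a
# trained Turkish-specific classifier instead of these hand-picked hints.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "çok", "ama",
                     "gibi", "olarak", "daha"}
TURKISH_CHARS = set("çğıöşü")  # some also occur in other languages


def turkish_score(text: str) -> float:
    """Fraction of tokens that look Turkish under the toy heuristic."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(
        1 for t in tokens
        if t in TURKISH_STOPWORDS or TURKISH_CHARS & set(t)
    )
    return hits / len(tokens)


def strip_repeated_lines(pages: list[str], max_share: float = 0.5) -> list[str]:
    """Drop lines that occur in more than max_share of the pages.

    Lines shared by most pages of a site are assumed to be menus,
    headers, or footers rather than main content.
    """
    line_pages = Counter()
    for page in pages:
        unique_lines = {ln.strip() for ln in page.splitlines() if ln.strip()}
        line_pages.update(unique_lines)

    limit = max_share * len(pages)
    cleaned = []
    for page in pages:
        kept = [
            ln.strip() for ln in page.splitlines()
            if ln.strip() and line_pages[ln.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def build_corpus(pages: list[str], threshold: float = 0.3) -> list[str]:
    """Clean the pages, then keep those scored as Turkish."""
    cleaned = strip_repeated_lines(pages)
    return [doc for doc in cleaned if doc and turkish_score(doc) >= threshold]
```

Cleaning before scoring matters in this sketch: navigation text left on a page would otherwise count toward (or against) its language score. In practice, both stages would be replaced by stronger components, such as a classifier trained specifically to separate Turkish from closely related languages.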

Relevant links:

https://github.com/oscar-project/ungoliant?tab=readme-ov-file

Suitable for Cmpe492
