Obtaining Better Turkish Corpora with High Quality Language Identification


Multilingual corpora like mC4 and OSCAR include Turkish subsets, but they have two significant limitations:

  1. Language Identification: These corpora rely on language identification models that cover hundreds of languages, which can lead to inaccurate identification of Turkish text.
  2. Content Extraction: Crawled Turkish web pages often include headers, footers, and metadata, which these pipelines struggle to separate from the main content.

This project aims to use a Turkish-specific language identification model and better scraping techniques to improve the extraction of main content in the OSCAR project, resulting in a higher-quality pretraining corpus for Turkish LLMs.
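As a rough illustration of the two steps described above, the following minimal Python sketch filters a crawled page by first dropping short, boilerplate-like lines and then scoring the remaining text for Turkish. It is not the project's method and does not use the OSCAR or Ungoliant code: the stopword-based scorer is only a stand-in for a trained Turkish-specific language identification model, and the function names, the word list, and both thresholds are assumptions chosen for the example.

```python
import re

# Hypothetical stand-in for a trained Turkish language identification model:
# a small set of common Turkish function words (an assumption for this sketch).
TURKISH_STOPWORDS = {
    "ve", "bir", "bu", "için", "ile", "de", "da",
    "çok", "gibi", "daha", "olarak", "ama",
}


def turkish_score(text: str) -> float:
    """Return the share of tokens that are common Turkish function words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in TURKISH_STOPWORDS)
    return hits / len(tokens)


def extract_main_content(page_text: str, min_words: int = 5) -> str:
    """Keep only lines with at least `min_words` words.

    Short lines such as menu entries, headers, and footers are dropped.
    This is a crude heuristic, not a full content extractor.
    """
    kept = [
        line.strip()
        for line in page_text.splitlines()
        if len(line.split()) >= min_words
    ]
    return "\n".join(kept)


def keep_document(page_text: str, threshold: float = 0.15) -> bool:
    """Decide whether a page belongs in the Turkish corpus.

    The threshold is an assumed value; a real pipeline would tune it
    on labeled data or use the model's own confidence score.
    """
    main = extract_main_content(page_text)
    return bool(main) and turkish_score(main) >= threshold
```

In a real pipeline, `turkish_score` would be replaced by the output of the Turkish-specific model, and the line-length heuristic by a proper main-content extractor; the sketch only shows how the two filters compose.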

 

Relevant links:

https://github.com/oscar-project/ungoliant?tab=readme-ov-file

 

Suitable for Cmpe492

Project Advisor: 

Suzan Üsküdarlı

Project Status: 

Project Year: 

2024
  • Fall
