Obtaining Better Turkish Corpora with High-Quality Language Identification
Multilingual corpora like mC4 and OSCAR include Turkish subsets, but they have two significant limitations:
- Language Identification: These corpora rely on general-purpose models that cover hundreds of languages, which can lead to inaccurate language identification for Turkish.
- Content Extraction: Turkish web pages often include headers, footers, and metadata, which the existing pipelines struggle to separate from the main content.
This project aims to combine a Turkish-specific language identification model with better scraping techniques to improve main-content extraction in the OSCAR project. The goal is a higher-quality pretraining corpus for Turkish LLMs.
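To make the two steps concrete, here is a minimal Python sketch of the intended filtering flow: strip likely boilerplate lines first, then keep the page only if the remaining text looks Turkish. Everything in it is an assumption for illustration. The function names (`extract_main_content`, `turkish_score`, `keep_document`), the stopword list, and the thresholds are hypothetical, and the stopword and character heuristic is only a toy stand-in for the trained Turkish-specific model the project would actually use. It does not reflect how the OSCAR pipeline is implemented.

```python
import re

# Toy stand-in for a Turkish-specific language identifier (assumption):
# a real system would use a trained classifier instead of these lists.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "de", "da",
                     "çok", "ama", "gibi", "olarak", "daha"}
TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")


def extract_main_content(page_text: str, min_words: int = 5) -> str:
    """Keep lines that look like running prose; drop menu/footer-like lines.

    Heuristic (assumption): a content line has at least `min_words` words
    and ends with sentence-final punctuation.
    """
    kept = []
    for line in page_text.splitlines():
        line = line.strip()
        if len(line.split()) >= min_words and line.endswith((".", "!", "?")):
            kept.append(line)
    return "\n".join(kept)


def turkish_score(text: str) -> float:
    """Return stopword ratio plus Turkish-specific letter ratio."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    stop_ratio = sum(t in TURKISH_STOPWORDS for t in tokens) / len(tokens)
    letters = [c for c in text if c.isalpha()]
    char_ratio = (sum(c in TURKISH_CHARS for c in letters) / len(letters)
                  if letters else 0.0)
    return stop_ratio + char_ratio


def keep_document(page_text: str, threshold: float = 0.15):
    """Return the cleaned body if it passes the language check, else None."""
    body = extract_main_content(page_text)
    if body and turkish_score(body) >= threshold:
        return body
    return None


if __name__ == "__main__":
    page = ("Ana Sayfa | Hakkımızda | İletişim\n"
            "Bu proje Türkçe için daha iyi bir derlem oluşturmayı amaçlıyor.\n"
            "© 2024 Tüm hakları saklıdır")
    print(keep_document(page))
```

Running extraction before language identification means the classifier scores only the main text, so navigation menus and copyright footers do not skew the decision. The threshold of 0.15 is arbitrary and would need tuning on labeled data.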
Relevant links:
https://github.com/oscar-project/ungoliant?tab=readme-ov-file
Suitable for Cmpe492