An Automated Pipeline for Extracting and Annotating Non-Standard Linguistic Data from Large-Scale Digital Text Corpora: A Digital Lexicographical System Architecture

Tri Edliani Lestari, Miftahulkhairah Anwar and Prihantoro

1 German Language Education Study Program, Universitas Negeri Surabaya, Surabaya, Indonesia
2 Pascasarjana, Universitas Negeri Jakarta, Jakarta, Indonesia
3 English Language and Literature Study Program, Universitas Negeri Diponegoro, Semarang, Indonesia

Abstract

The speed at which digital communication evolves gives rise to forms of expression, such as slang and codes, that existing lexicons cannot capture. This research is the first to describe the components of a fully automated system that extracts, processes, and "reads" social media data to uncover non-standard lexical items for a German-Indonesian slang dictionary. Data were collected with Apify, a web scraping tool, from four social media platforms, Threads, Instagram, TikTok, and X (formerly Twitter), to create a multilingual text data set. The data were compiled into a corpus in Sketch Engine and analyzed to identify frequent slang in both languages. The automated design frames lexicography as a repeatable process by integrating social media data collection, language detection, removal of duplicates from the captured data, and semi-automated tagging. The results show that the design adequately captures modern, platform-specific examples of slang not found in existing bilingual dictionaries. The methodology of corpus construction and data analysis supports scalable, transparent, and repeatable processes. This research design positions lexicography as an automated, scalable, repeatable, and robust discipline for capturing modern multilingual social media language.
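As an illustration only, the duplicate removal and language detection stages named above could be sketched as follows. This is not the authors' implementation: the record fields (`platform`, `text`), the function names, and the tiny stopword lists are all hypothetical, and the stopword-overlap heuristic stands in for whatever language identifier the actual system uses.

```python
import hashlib
import re
import unicodedata

# Hypothetical, deliberately tiny stopword lists (assumption for this sketch);
# a production pipeline would rely on a trained language identifier.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ich", "nicht", "ist", "ein", "du", "so"},
    "id": {"yang", "dan", "di", "aku", "tidak", "ini", "itu", "banget", "gue", "nggak"},
}


def normalize(text):
    """Lowercase, drop URLs and @mentions, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def detect_language(text):
    """Return 'de', 'id', or 'unknown' based on stopword overlap."""
    tokens = set(re.findall(r"\w+", normalize(text)))
    de_score = len(tokens & STOPWORDS["de"])
    id_score = len(tokens & STOPWORDS["id"])
    if de_score == id_score:  # covers ties and the no-evidence case
        return "unknown"
    return "de" if de_score > id_score else "id"


def deduplicate(posts):
    """Keep the first post for each distinct normalized text."""
    seen = set()
    unique = []
    for post in posts:
        key = hashlib.sha1(normalize(post["text"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


def build_records(posts):
    """Deduplicate posts, then attach a language label to each survivor."""
    return [{**post, "lang": detect_language(post["text"])}
            for post in deduplicate(posts)]


if __name__ == "__main__":
    # Invented sample posts, not data from the study.
    sample = [
        {"platform": "x", "text": "Das ist so cringe, ich kann nicht mehr"},
        {"platform": "threads", "text": "das ist so cringe,   ich kann nicht mehr https://t.co/abc"},
        {"platform": "tiktok", "text": "Gue nggak ngerti, ini mager banget"},
    ]
    for record in build_records(sample):
        print(record["platform"], record["lang"], record["text"])
```

Hashing the normalized text means cross-posted copies that differ only in case, spacing, or an appended link collapse to a single record before any tagging is done.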

Keywords: digital lexicography; slang dictionary; corpus linguistics; social media corpus; German-Indonesian; Sketch Engine; automated pipeline; non-standard language

Topic: Instrumentation and Computational Physics

IPS 2026 Conference | Conference Management System