IPS 2026
Conference Management System

An Automated Pipeline for Extracting and Annotating Non-Standard Linguistic Data from Large-Scale Digital Text Corpora: A Digital Lexicographical System Architecture
Tri Edliani Lestari, Miftahulkhairah Anwar and Prihantoro

1 German Language Education Study Program, Universitas Negeri Surabaya, Surabaya, Indonesia
2 Pascasarjana, Universitas Negeri Jakarta, Jakarta, Indonesia
3 English Language and Literature Study Program, Universitas Negeri Diponegoro, Semarang, Indonesia


Abstract

Digital communication evolves quickly, producing expression types such as slang and coded language that existing lexicons fail to capture. This research is the first to describe the components of a fully automated system that extracts, processes, and "reads" social media data to uncover non-standard lexical items for a German-Indonesian slang dictionary. Data were collected with Apify, a web scraping tool, from four social media platforms, Threads, Instagram, TikTok, and X (formerly Twitter), to create a multilingual text data set. The data were then organized into a corpus in Sketch Engine and analyzed to identify frequent slang in both languages. The automated design treats lexicography as a repeatable process by integrating social media data collection, language detection, deduplication of the captured data, and semi-automated tagging. The results show that the design adequately captures modern, platform-specific examples of slang not found in existing bilingual dictionaries. The methodology of corpus construction and data analysis supports scalable, transparent, and repeatable processes. This design positions lexicography as an automated, scalable, repeatable, and robust discipline for capturing modern multilingual social media language.
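
The stages named above (deduplication, language detection, and flagging of candidate slang) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tiny stopword lists, and the stopword-overlap language heuristic are all assumptions made for illustration, and the sketch does not call Apify or Sketch Engine. It only shows how already-collected posts might be cleaned and how tokens missing from a known lexicon could be counted as candidates for manual review.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, tiny stopword lists used only for this illustration.
# A real pipeline would rely on a trained language identifier instead.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ich", "nicht", "ist", "so"},
    "id": {"yang", "dan", "aku", "gak", "itu", "banget", "ini", "di"},
}

# Matches runs of Unicode letters (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def normalize(text):
    """Lowercase, strip URLs, mentions and hashtags, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text)
    return " ".join(text.split())


def tokenize(text):
    return TOKEN_RE.findall(text)


def deduplicate(posts):
    """Keep the first occurrence of each post after normalization."""
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def detect_language(text):
    """Guess 'de' or 'id' by stopword overlap; 'unknown' if nothing matches."""
    tokens = set(tokenize(text))
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def slang_candidates(posts, lexicon):
    """Count, per language, tokens absent from the known lexicon."""
    counts = {}
    for post in deduplicate(posts):
        lang = detect_language(post)
        if lang == "unknown":
            continue
        known = lexicon.get(lang, set()) | STOPWORDS[lang]
        counter = counts.setdefault(lang, Counter())
        counter.update(t for t in tokenize(post) if t not in known)
    return counts
```

Calling `slang_candidates(posts, lexicon)` with a list of post strings and a dictionary mapping language codes to sets of known words returns one frequency counter per detected language. In a semi-automated workflow, the most frequent entries would then be passed to a human annotator for tagging.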

Keywords: digital lexicography; slang dictionary; corpus linguistics; social media corpus; German-Indonesian; Sketch Engine; automated pipeline; non-standard language

Topic: Instrumentation and Computational Physics

Corresponding Author: Tri Edliani Lestari
