MedSeg: A Statistical Approach to Tokenization Assessment in Medical NLP
Abstract
With the rapid adoption of Large Language Models (LLMs) in healthcare, accurate tokenization of complex medical terms has become increasingly critical. Improper segmentation leads to a high rate of unidentified words and suboptimal performance, particularly in medical Natural Language Processing (NLP) tasks. While subword tokenization methods such as WordPiece and Byte Pair Encoding (BPE) have been widely used to mitigate Out-of-Vocabulary (OOV) issues, there remains a lack of specialized metrics for evaluating their effectiveness in the medical domain. In this study, we propose MedSeg, a novel statistical evaluation metric designed to assess tokenizer performance by analyzing the Token Split Rate and the OOV distribution across word lengths. MedSeg introduces a domain-aware, regression-based scoring mechanism that compares each tokenizer’s output to an estimated population distribution, quantified using the normalized root mean square error (NRMSE). Experimental results using BioBERT and BioLlama on CTCAE data demonstrate that MedSeg effectively captures the trade-off between segmentation granularity and medical vocabulary preservation. The proposed metric provides a robust and interpretable framework for assessing tokenization strategies in domain-specific NLP applications.
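To illustrate the kind of comparison the abstract describes, the sketch below computes an NRMSE score between a tokenizer's per-word-length statistic (e.g., Token Split Rate) and an estimated population curve. This is not the paper's implementation: the function name, the example values, and the choice of range normalization for NRMSE are all illustrative assumptions.

```python
import numpy as np

def nrmse(observed, expected):
    """NRMSE between a tokenizer's per-word-length statistic and an
    estimated population distribution. Range normalization is one
    common convention; the paper may normalize differently."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    rmse = np.sqrt(np.mean((observed - expected) ** 2))
    return rmse / (expected.max() - expected.min())

# Hypothetical token-split rates for word lengths 4..9 characters.
tokenizer_rates = [0.10, 0.22, 0.35, 0.48, 0.60, 0.71]
population_est  = [0.08, 0.20, 0.33, 0.47, 0.62, 0.74]

score = nrmse(tokenizer_rates, population_est)  # lower = closer to population
```

A lower score indicates that the tokenizer's segmentation behavior across word lengths tracks the estimated population distribution more closely.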