Thai Word Segmentation

  • Syllable segmentation is done by applying Thai syllable rules. Segmentation ambiguities are resolved with a trigram model of syllables, trained on a corpus of 630,000 syllables from a newspaper.
  • Word segmentation is performed using a maximum collocation approach (see the paper "Collocation and Thai Word Segmentation", submitted to the SNLP-COCOSDA2002 conference).
  • The dictionary used in the program is adapted from the Royal Institute Dictionary, which is made available by LINKS, with some obsolete words removed. There is no routine yet to handle proper names or abbreviations directly, so segmentation of sentences containing a proper name could be incorrect.
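
The trigram step described above can be illustrated with a small sketch. This is not the program's actual code: the add-one smoothing, the function names, and the three-sentence toy corpus are all assumptions made for illustration. It only shows the general idea of scoring each candidate syllable sequence with a syllable trigram model and keeping the most probable one.

```python
import math
from collections import Counter

def train_trigram_counts(sentences):
    """Count syllable trigrams and their two-syllable contexts."""
    tri, bi, vocab = Counter(), Counter(), set()
    for syls in sentences:
        padded = ["<s>", "<s>"] + list(syls) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, len(vocab)

def log_prob(syls, tri, bi, vocab_size):
    """Log-probability of a syllable sequence (add-one smoothing is an assumption)."""
    padded = ["<s>", "<s>"] + list(syls) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        t = tuple(padded[i - 2:i + 1])
        total += math.log((tri[t] + 1) / (bi[t[:2]] + vocab_size))
    return total

def best_segmentation(candidates, tri, bi, vocab_size):
    """Return the candidate segmentation with the highest trigram score."""
    return max(candidates, key=lambda c: log_prob(c, tri, bi, vocab_size))

# Hypothetical toy corpus; the real model was trained on 630,000 syllables.
corpus = [["ตา", "กลม"], ["ตา", "กลม", "โต"], ["ลม", "แรง"]]
tri, bi, vocab_size = train_trigram_counts(corpus)

# Two rule-based readings of the same string "ตากลม".
candidates = [["ตาก", "ลม"], ["ตา", "กลม"]]
print(best_segmentation(candidates, tri, bi, vocab_size))
```

With this toy corpus the reading "ตา|กลม" scores higher because its trigrams were seen in training, while "ตาก|ลม" relies only on smoothed counts.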

  • A standalone version running on Windows XP can be downloaded <here> (version 2.1).
  • A DOS version can be downloaded <here>. You will need to extract (unrar) all files into a directory of your choice. To run the program, type "thaiseg INPUTFILE OUTPUTFILE /w or /s (/vb)", choosing either /w or /s. The last option, /vb (verbose), is optional.
  • Version 2.2 for Windows 7 can be downloaded <here>.
  • This program can be used for non-commercial purposes. 
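
For batch use, the DOS command line above can be assembled from a script. The helper below is a hypothetical convenience wrapper, not part of the distribution. It only reproduces the argument order given above and assumes the "thaiseg" executable is reachable from the current directory.

```python
def build_thaiseg_command(input_file, output_file, mode="/w", verbose=False):
    """Build the argument list: thaiseg INPUTFILE OUTPUTFILE /w|/s [/vb]."""
    if mode not in ("/w", "/s"):
        raise ValueError("mode must be '/w' or '/s'")
    command = ["thaiseg", input_file, output_file, mode]
    if verbose:
        command.append("/vb")  # optional verbose flag
    return command

print(build_thaiseg_command("input.txt", "output.txt", mode="/s", verbose=True))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where the program has been extracted.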

  • The word segmentation program is now a function in the TLTK module for Python.
  • To use Thai word segmentation on Windows or OS X, first install Python 3, then add the TLTK module.
  • We recommend using Anaconda, as required packages (e.g. PyQt) are already included. Go to the Anaconda download page and select Anaconda for Python 3.6.
  • After installing Anaconda, open a command prompt and run “pip install tltk”.
  • GUI scripts to use Thai word segmentation is
  • Run “Python”
  • The input file must be plain text in UTF-8 encoding.
  • Word trigrams from the Thai National Corpus can be downloaded here: TNC.3g
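
As a rough sketch of calling the segmenter from your own script instead of the GUI, the helper below reads a UTF-8 text file, segments it, and writes the result. The name segment_file is hypothetical. The call to word_segment in tltk.nlp follows the TLTK package documentation as we understand it, so treat it as an assumption and check it against the version you installed.

```python
from pathlib import Path

def segment_file(input_path, output_path, segment=None):
    """Read UTF-8 plain text, segment it, and write the result as UTF-8."""
    if segment is None:
        # Imported lazily so the helper also works with a custom segmenter.
        # Assumption: TLTK exposes word_segment in its nlp module.
        from tltk import nlp
        segment = nlp.word_segment
    text = Path(input_path).read_text(encoding="utf-8")
    result = segment(text)
    Path(output_path).write_text(result, encoding="utf-8")
    return result
```

Calling segment_file("input.txt", "output.txt") uses TLTK when it is installed. Passing your own function as segment lets you try the file handling without it.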

This program is a part of a project supported by the Research Division of the Faculty of Arts, 2000-2.
Written by Wirote Aroonmanakun. Copyright 2002.

© Wirote 2012