Thai Word Segmentation


  • Syllable segmentation is done by applying Thai syllable rules. Segmentation ambiguities are resolved with a trigram model of syllables, trained on a corpus of 630,000 syllables from a newspaper.
  • Word segmentation is performed using a maximum collocation approach (see the paper "Collocation and Thai Word Segmentation", submitted to the SNLP-COCOSDA2002 conference).
  • The dictionary used in the program is adapted from the Royal Institute Dictionary, which is made available by LINKS, with some obsolete words removed. There is not yet a routine to handle proper names or abbreviations directly, so segmentation of sentences containing a proper name or abbreviation could be incorrect.
  • A standalone version running on Windows XP can be downloaded <here> (version 2.1).
  • A DOS version can be downloaded <here>. You will need to unrar all files into a directory of your choice. To run the program, type "thaiseg  INPUTFILE OUTPUTFILE  /w or /s  (/vb)". The last option, /vb (verbose), is optional.
  • Version 2.2 for Windows 7 is available <here>.
  • This program can be used for non-commercial purposes. 
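
The trigram disambiguation step described in the first point can be illustrated with a minimal Python sketch. It is not the program's actual code: the function names, the add-one smoothing, and the three-line toy corpus are assumptions made for illustration, whereas the real model was trained on 630,000 newspaper syllables. Given several candidate syllable segmentations of the same string, the sketch picks the one with the highest trigram probability.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # padding symbols (an assumption of this sketch)


def train_trigrams(corpus):
    """Count syllable trigrams and their two-syllable contexts."""
    tri, bi, vocab = Counter(), Counter(), set()
    for syllables in corpus:
        padded = [BOS, BOS] + list(syllables) + [EOS]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, len(vocab)


def log_prob(syllables, model):
    """Log-probability of a syllable sequence with add-one smoothing."""
    tri, bi, vocab_size = model
    padded = [BOS, BOS] + list(syllables) + [EOS]
    total = 0.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        trigram = tuple(padded[i - 2:i + 1])
        total += math.log((tri[trigram] + 1) / (bi[context] + vocab_size))
    return total


def best_segmentation(candidates, model):
    """Return the candidate segmentation the trigram model scores highest."""
    return max(candidates, key=lambda cand: log_prob(cand, model))


# Toy training data, invented for this example only.
TOY_CORPUS = [
    ["ตา", "กลม"],
    ["ตา", "กลม", "โต"],
    ["ลม", "แรง"],
]

# Two rule-valid syllable segmentations of the same string "ตากลม".
CANDIDATES = [["ตา", "กลม"], ["ตาก", "ลม"]]

model = train_trigrams(TOY_CORPUS)
best = best_segmentation(CANDIDATES, model)
print(" ".join(best))
```

In this sketch the syllable rules are assumed to have already produced the candidate list; only the ranking among candidates is shown.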

This program is a part of a project supported by the Research Division of the Faculty of Arts, 2000-2.
Written by Wirote Aroonmanakun. Copyright 2002.


© Wirote 2012