Search is one of the key functionalities of digital platforms and applications such as electronic dictionaries, search engines, and e-commerce sites.
However, search in the Khmer language faces the following challenges:
While we encourage people to write search queries in Khmer as correctly as possible, certain unintentional spelling mistakes are unavoidable. This is where technology comes to our rescue. Techo Startup Center (TSC) proposes the following solutions to the above-stated challenges:
Preliminary findings suggest that these solutions can address the Khmer search challenges stated above. Techo Startup Center (TSC) has implemented them and opened the API to the public free of charge in order to collect suggestions and feedback for further improvement. Meanwhile, our team is working on enlarging the corpus by incorporating texts from other domains such as law and history.
In addition, Techo Startup Center (TSC) provides API access to two additional functionalities: auto-completion and word segmentation.
Character Order Normalization
Some words, such as “ស្រី្ត”, can be typed with more than one order of characters. To ensure a consistent representation, words can be normalized by re-ordering their characters.
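The re-ordering idea can be sketched in Python as follows. This is only an illustration, not the TSC implementation: the function name `normalize_order` and the ranking used here (subscript consonants first, subscript RO after other subscripts, then dependent vowels, then signs) are assumptions based on the common Khmer encoding order.

```python
COENG = "\u17D2"  # KHMER SIGN COENG, marks the next consonant as a subscript
RO = "\u179A"     # KHMER LETTER RO

def _is_consonant(ch):
    return "\u1780" <= ch <= "\u17A2"

def _is_vowel(ch):
    return "\u17B6" <= ch <= "\u17C5"

def _is_sign(ch):
    return "\u17C6" <= ch <= "\u17D1" or ch in "\u17D3\u17DD"

def normalize_order(text):
    """Re-order the marks that follow each base consonant (assumed ranking)."""
    out, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if not _is_consonant(ch):
            out.append(ch)  # anything that is not a base consonant is copied as-is
            i += 1
            continue
        base, parts = ch, []
        i += 1
        while i < n:
            c = text[i]
            if c == COENG and i + 1 < n and _is_consonant(text[i + 1]):
                # a subscript is the two-character unit COENG + consonant
                rank = 2 if text[i + 1] == RO else 1
                unit = text[i:i + 2]
                i += 2
            elif _is_vowel(c):
                rank, unit = 3, c
                i += 1
            elif _is_sign(c):
                rank, unit = 4, c
                i += 1
            else:
                break
            parts.append((rank, unit))
        parts.sort(key=lambda p: p[0])  # stable sort keeps ties in typed order
        out.append(base + "".join(unit for _, unit in parts))
    return "".join(out)
```

Under these assumptions, the sequence SA, COENG RO, vowel II, COENG TA is re-ordered to SA, COENG TA, COENG RO, vowel II, and text that is already in that order is returned unchanged.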
There are two spell checkers: grapheme-based and phoneme-based. The grapheme-based spell checker searches for suggestions within a pre-defined edit distance computed on graphemes. Similarly, the phoneme-based spell checker computes edit distances on phonemes when searching for suggestions. As a result, the phoneme-based spell checker can identify different spellings of the same word, as in the case of “បម្លែង” and “បំលែង”.
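A minimal sketch of edit-distance-based suggestion is shown below. It is not the TSC implementation: `edit_distance` is the standard Levenshtein distance, while `suggest`, its `units` parameter, and the default threshold of 2 are hypothetical names and values chosen for illustration. The `units` callable turns a word into the sequence that is compared. By default this is the list of code points, as a rough stand-in for graphemes; a phoneme-based checker would instead pass a function that converts a word to its phoneme sequence.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def suggest(word, lexicon, max_distance=2, units=list):
    """Return lexicon entries within max_distance of word, closest first."""
    target = units(word)
    scored = []
    for candidate in lexicon:
        d = edit_distance(target, units(candidate))
        if d <= max_distance:
            scored.append((d, candidate))
    scored.sort()
    return [candidate for _, candidate in scored]
```

At the code-point level, the two spellings “បម្លែង” and “បំលែង” are two edits apart, so either is suggested for the other only when the threshold is at least 2. A `units` function that maps both spellings to the same phoneme sequence would bring that distance down to 0.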
One of the useful features of Google Search is the ability to auto-complete a user's query. The auto-completion API can auto-complete a user's query in Khmer in the same way.
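How the API completes queries internally is not described here. As a rough illustration only, prefix completion over a sorted word list can be sketched with Python's standard `bisect` module. The function name `complete` and the `limit` parameter are hypothetical.

```python
import bisect

def complete(prefix, sorted_words, limit=5):
    """Return up to `limit` words from a sorted list that start with `prefix`."""
    # In a sorted list, all words sharing a prefix form one contiguous run
    # that begins at the insertion point of the prefix itself.
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if len(matches) >= limit or not word.startswith(prefix):
            break
        matches.append(word)
    return matches

# Toy lexicon; sorted() orders the words by code point.
words = sorted(["ខោ", "អាវ", "ខោអាវ", "អាវធំ"])
```

With this toy lexicon, `complete("ខោ", words)` returns the words that begin with “ខោ”. A production service would typically also rank completions, for example by query frequency, which this sketch does not attempt.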
Semantically related words are not synonyms but words with related meanings, that is, words that often appear in the same context. For example, “សំលៀកបំពាក់” is semantically related to “ខោ”, “អាវ”, “ខោអាវ”, “អាវធំ”, “ស្រោមជើង” and the like.
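The idea that related words share contexts can be illustrated with simple co-occurrence counts. The sketch below is an assumption-laden toy, not the method used by TSC: it counts which words appear in the same sentence and compares the resulting count vectors with cosine similarity. The function names and the English placeholder tokens are made up for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences):
    """Map each word to counts of the other words seen in the same sentence."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Placeholder corpus of already-segmented sentences (hypothetical data).
corpus = [
    ["buy", "shirt", "market"],
    ["buy", "trousers", "market"],
    ["read", "book", "library"],
]
vectors = context_vectors(corpus)
```

Here "shirt" and "trousers" occur with the same neighbours, so their similarity is close to 1, while "shirt" and "book" share no context and score 0. Real systems usually learn dense word embeddings from a large corpus rather than using raw counts.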
The Khmer writing system uses no explicit word boundary delimiter. Therefore, word segmentation or, more broadly, tokenization is required to extract words for downstream tasks such as search, text classification and so forth. The segmentation API uses an in-house segmentation algorithm to segment the words in the input text.
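The in-house algorithm is not described here. As a stand-in that shows what segmentation produces, the following sketch uses greedy forward maximum matching against a word list, a simple dictionary-based baseline. The function name `segment` and the toy lexicon are hypothetical.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
        else:
            tokens.append(text[i])  # no dictionary word starts here
            i += 1
    return tokens

lexicon = {"ខោ", "អាវ", "ខោអាវ", "អាវធំ"}
tokens = segment("អាវធំ" + "ខោ", lexicon)
```

In this example `tokens` holds the two words “អាវធំ” and “ខោ”. Greedy matching can pick a wrong boundary when a longer dictionary word overlaps the start of the next word, which is one reason practical segmenters rely on statistical or learned models.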
FEEDBACK AND CONTACT US FOR API
We would love to hear your feedback about the API via this form. If you wish to use our API directly, please contact us via this form as well.
+855 92 83 49 89