KHMER WORD SEARCH BASED ON SEMANTIC RELATION



Search is one of the key functionalities of digital platforms and applications such as electronic dictionaries, search engines, and e-commerce sites.

However, searching in the Khmer language faces the following challenges:

  1. Stacked vs. unstacked syllables: If a user types "បម្លែង" as a search query and the database contains only "បំលែង", the search finds no match because the two strings are treated as different words, even though they are different spelling realizations of the same word.

  2. Spelling mistakes: Most spelling mistakes are due to typing errors. In some cases, a user may omit one or more vowels or diacritics. In other cases, a user may mistakenly type one character in place of a similar one, such as “ី” instead of “ិ”. Examples of such pairs are “ប្រសិទ្ធភាព” and “ប្រសិទ្ធិភាព”, and “សាលារៀន” and “សាឡារៀន”.

  3. Wrong part of speech: A user may type a form of the word that belongs to a different part of speech, e.g. “អភិវឌ្ឍ” (v.) and “អភិវឌ្ឍន៍” (n.).

  4. Character order variation: “ស្ត្រី” can be typed with its characters in different orders: “ស+្រ+្ត+ី”, “ស+្ត+្រ+ី”, or “ស+្រ+ី+្ត” (which gives “ស្រី្ត”). Because these are different character sequences, this is problematic for string matching during the search.

  5. No existing model to find semantically-related words: In some cases, users know what they are searching for but may not know the exact keywords. In such cases, being able to find semantically-related keywords is useful. For example, “សំលៀកបំពាក់” is semantically related to “ខោ”, “អាវ”, “ខោអាវ”, “អាវធំ”, “ស្រោមជើង” and the like.


While we encourage people to write search queries in Khmer as correctly as possible, certain unintentional spelling mistakes are unavoidable. This is where technology comes to our rescue. Techo Startup Center (TSC) proposes the following solutions to the above-stated challenges:

  1. Character Order Normalization: A user search query is normalized by re-ordering the characters in a specific order.

  2. Spell Checking: A user search query is checked for spelling mistakes using grapheme-based and/or phoneme-based spell checkers. These spell checkers suggest possible corrections within a pre-defined edit distance. The phoneme-based spell checker can also be used to identify different spelling realizations of the same word, as in the case of "បម្លែង" and "បំលែង".

  3. Semantic Modelling of Words: A word embedding model was trained using a machine learning algorithm so that it can locate semantically-related words. The model was trained on a corpus of 1 million sentences, which contains approximately 30 million words.


The preliminary findings suggest that the above solutions can address the stated challenges of Khmer search. Techo Startup Center (TSC) has implemented these solutions and opened the API to the public free of charge in order to collect suggestions and feedback for further improvement, while our team works on enlarging the corpus by incorporating texts from other domains such as law and history.

In addition, Techo Startup Center (TSC) provides API access to two additional functionalities: auto-completion and word segmentation.

OUR API






Character Order Normalization

Some words, such as “ស្រី្ត”, can be written with multiple orders of characters. To ensure a consistent representation, words can be normalized by re-ordering their characters.
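The re-ordering idea can be illustrated with a minimal Python sketch. It is not the TSC implementation: the function name `normalize` and the canonical order chosen here (base consonant, subscript consonants with coeng-ro last, register shifters, vowels, other signs) are assumptions made for illustration, and the source does not state which order the API actually uses.

```python
COENG = "\u17d2"  # KHMER SIGN COENG: stacks the consonant that follows it
RO = "\u179a"     # KHMER LETTER RO


def _is_consonant(ch):
    return "\u1780" <= ch <= "\u17a2"


def _is_shifter(ch):
    return ch in ("\u17c9", "\u17ca")  # register shifters


def _is_vowel(ch):
    return "\u17b6" <= ch <= "\u17c5"  # dependent vowel signs


def _is_sign(ch):
    return "\u17c6" <= ch <= "\u17d1" or ch == "\u17dd"


def normalize(text):
    """Re-order the marks that follow each base consonant (assumed order)."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not _is_consonant(ch):
            out.append(ch)  # anything else is copied through unchanged
            i += 1
            continue
        base = ch
        i += 1
        subs, shifters, vowels, signs = [], [], [], []
        while i < n:
            c = text[i]
            if c == COENG and i + 1 < n and _is_consonant(text[i + 1]):
                subs.append(c + text[i + 1])
                i += 2
            elif _is_shifter(c):
                shifters.append(c)
                i += 1
            elif _is_vowel(c):
                vowels.append(c)
                i += 1
            elif _is_sign(c):
                signs.append(c)
                i += 1
            else:
                break
        # Stable sort: subscripts keep their typed order, but coeng-ro goes last.
        subs.sort(key=lambda s: s[1] == RO)
        out.append(base + "".join(subs + shifters + vowels + signs))
    return "".join(out)


# The three typed orders of the same word collapse to one string.
variants = ["ស្ត្រី", "ស្រ្តី", "ស្រី្ត"]
print({normalize(v) for v in variants})
```

Normalizing both the query and the indexed words with the same function lets plain string comparison treat the variants as equal.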





Spell Checking

There are two spell checkers – grapheme-based and phoneme-based. The grapheme-based spell checker searches for the suggestions within a pre-defined edit distance which is computed on graphemes. Similarly, the phoneme-based spell checker uses phonemes to compute edit distances when searching for the suggestions. The phoneme-based spell checker can identify different spelling realizations for the same word as in the case of “បម្លែង” and “បំលែង”.
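As a rough illustration of the grapheme-based variant, the sketch below splits words into grapheme clusters with a simplified rule, computes the Levenshtein distance over those clusters, and returns lexicon entries within a maximum distance. The names `graphemes`, `edit_distance`, and `suggest`, the cluster rule, and the tiny lexicon are all assumptions for illustration; the source does not describe how the TSC checker is built. A phoneme-based variant would apply the same distance to phoneme sequences, which requires a grapheme-to-phoneme step not shown here.

```python
import unicodedata

COENG = "\u17d2"  # KHMER SIGN COENG: stacks the consonant that follows it


def graphemes(word):
    """Split a word into clusters: a base character plus its attached marks.

    Simplified rule (an assumption): combining marks, and any character
    that directly follows a coeng, are attached to the current cluster.
    """
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        after_coeng = bool(clusters) and clusters[-1].endswith(COENG)
        if clusters and (is_mark or after_coeng):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def edit_distance(a, b):
    """Levenshtein distance between two sequences of clusters."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def suggest(query, lexicon, max_distance=1):
    """Return lexicon words within max_distance of the query, closest first."""
    q = graphemes(query)
    scored = []
    for word in lexicon:
        d = edit_distance(q, graphemes(word))
        if d <= max_distance:
            scored.append((d, word))
    return [word for _, word in sorted(scored)]


# Hypothetical lexicon; the query is the misspelled form from the text above.
print(suggest("សាឡារៀន", ["សាលារៀន", "ខោអាវ"]))
```

Counting edits over clusters rather than code points means that a wrong consonant with its vowel attached counts as one substitution, not two.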





Auto-completion

One of the useful features of Google search is its ability to auto-complete a user's query. The auto-completion API can auto-complete a user's query in Khmer in the same way.
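The source does not say how the auto-completion API works internally. One simple way to get this behavior, shown here only as a sketch, is prefix matching over a sorted word list using binary search. The names `build_index` and `complete`, the `limit` parameter, and the sample words are assumptions for illustration.

```python
from bisect import bisect_left


def build_index(words):
    """Return a sorted, de-duplicated word list for prefix lookup."""
    return sorted(set(words))


def complete(prefix, index, limit=5):
    """Return up to `limit` indexed words that start with `prefix`."""
    start = bisect_left(index, prefix)  # first word >= prefix
    results = []
    for word in index[start:]:
        if not word.startswith(prefix) or len(results) == limit:
            break
        results.append(word)
    return results


# Hypothetical word list taken from the examples in this page.
index = build_index(["ខោ", "អាវ", "ខោអាវ", "អាវធំ", "ស្រោមជើង"])
print(complete("ខោ", index))
```

A production system would typically also rank completions, for example by query frequency, which this sketch does not attempt.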





Semantically-Related Words

Semantically-related words are not necessarily synonyms; they are words with related meanings that often appear in the same context. For example, “សំលៀកបំពាក់” is semantically related to “ខោ”, “អាវ”, “ខោអាវ”, “អាវធំ”, “ស្រោមជើង” and the like.
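With a word embedding model, related words are commonly found by ranking the other words by cosine similarity to the query word's vector. The sketch below shows that lookup in plain Python. The two-dimensional vectors are made-up placeholders, not values from the TSC model, and the names `cosine`, `most_similar`, and `TOY_VECTORS` are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=3):
    """Return the topn words whose vectors are closest to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:topn]]


# Made-up vectors for illustration only; a real model has many dimensions.
TOY_VECTORS = {
    "សំលៀកបំពាក់": [0.90, 0.10],
    "ខោ": [0.80, 0.20],
    "អាវ": [0.85, 0.15],
    "សាលារៀន": [0.10, 0.90],
}
print(most_similar("សំលៀកបំពាក់", TOY_VECTORS, topn=2))
```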





Word Segmentation

In the Khmer writing system, no explicit word boundary delimiter is used. Therefore, word segmentation or, more broadly, tokenization is required to extract words for downstream tasks such as search, text classification and so forth. The segmentation API uses an in-house segmentation algorithm to segment the words in the input text.
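The in-house algorithm is not described in the source. To make the task concrete, the sketch below uses forward maximum matching, a common dictionary-based baseline that is named here as a stand-in and not as the TSC method. The function name `segment` and the sample lexicon are assumptions for illustration.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation against a set of known words.

    At each position the longest lexicon word that matches is taken;
    if nothing matches, a single character is emitted as a fallback.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical lexicon built from words that appear in this page.
lexicon = {"ខោ", "អាវ", "ខោអាវ", "ស្រោមជើង"}
print(segment("ខោអាវស្រោមជើង", lexicon))
```

Greedy matching is easy to implement but can choose the wrong boundary when words overlap, which is one reason statistical or learned segmenters are often preferred in practice.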





FEEDBACK AND CONTACT US FOR API

We would love to hear your feedback on our API via this form. If you wish to use our API directly, please contact us via this form as well.

+855 92 83 49 89

sovisal.chenda@techostartup.center

RUPP's Compound Russian Federation Blvd., Toul Kork, Phnom Penh, Cambodia