Khmer NLP

KHMER WORD SEARCH BASED ON SEMANTIC RELATION

Search is a key function of digital platforms and applications such as electronic dictionaries, search engines, and e-commerce sites.

However, search in the Khmer language faces the following challenges:

Stacked vs unstacked syllable: If a user types "បម្លែង" as a search query and the database contains only "បំលែង", the search finds no match because "បម្លែង" and "បំលែង" are treated as different words, even though they are two spellings of the same word.

Spelling mistakes: Most spelling mistakes are due to typing errors. In some cases, a user may miss one or more vowels or diacritics. In other cases, a user may mistakenly write “ី” or “ឡ” instead of “ិ” or “ល”, respectively. For example, “ប្រសិទ្ធភាព” and “ប្រសិទ្ធិភាព”, or “សាលារៀន” and “សាឡារៀន”.

Wrong part of speech: A query may use a different word form from the one stored, for example “អភិវឌ្ឍ” (v.) and “អភិវឌ្ឍន៍” (n.).

Character order variation: The same word can be typed with its characters in different orders. For example, “ស្ត្រី” may be entered as “ស+្រ+្ត+ី”, “ស+្ត+្រ+ី”, or “ស+្រ+ី+្ត” (“ស្រី្ត”). Because each ordering is stored as a different character sequence, this is problematic for string matching during search.

No existing model to find semantically related words: In some cases, users know what they are searching for but may not know the exact keywords. Being able to find semantically related keywords is then useful. For example, “សំលៀកបំពាក់” is semantically related to “ខោ”, “អាវ”, “ខោអាវ”, “អាវធំ”, “ស្រោមជើង”, and the like.

While we encourage people to write search queries in Khmer as correctly as possible, certain unintentional spelling mistakes are unavoidable. This is where technology comes to our rescue. Techo Startup Center (TSC) proposes the following solutions to the above-stated challenges:

Character Order Normalization: A user's search query is normalized by reordering its characters into one fixed order, so that different typing orders of the same word produce the same string.
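The reordering idea can be illustrated with a minimal sketch. This is not TSC's implementation: the function names and the ranking used here (subscript consonants first, subscript “្រ” after other subscripts, then register shifters, dependent vowels, and other signs) are simplifying assumptions chosen only to show how differently typed sequences can be mapped to one string.

```python
COENG = "\u17d2"  # Khmer sign COENG, which turns the next consonant into a subscript
RO = "\u179a"     # Khmer letter RO


def is_consonant(ch: str) -> bool:
    return "\u1780" <= ch <= "\u17a2"


def is_shifter(ch: str) -> bool:
    return ch in ("\u17c9", "\u17ca")


def is_vowel(ch: str) -> bool:
    return "\u17b6" <= ch <= "\u17c5"


def is_sign(ch: str) -> bool:
    return "\u17c6" <= ch <= "\u17d1" or ch == "\u17dd"


def _rank(unit: str) -> int:
    """Assumed ordering of the units that follow a base consonant."""
    if unit[0] == COENG:
        return 1 if unit[1] == RO else 0  # subscript RO after other subscripts
    if is_shifter(unit):
        return 2
    if is_vowel(unit):
        return 3
    return 4  # remaining signs


def normalize_order(text: str) -> str:
    """Reorder the marks after each base consonant into one fixed order."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        out.append(ch)
        i += 1
        if not is_consonant(ch):
            continue
        units = []
        while i < n:
            if text[i] == COENG and i + 1 < n and is_consonant(text[i + 1]):
                units.append(text[i:i + 2])  # COENG + consonant stay together
                i += 2
            elif is_vowel(text[i]) or is_sign(text[i]):
                units.append(text[i])
                i += 1
            else:
                break
        units.sort(key=_rank)  # stable sort keeps the order of equal-rank units
        out.extend(units)
    return "".join(out)


# The three typing orders of the same word listed in the challenges above.
variants = [
    "\u179f\u17d2\u179a\u17d2\u178f\u17b8",  # ស + ្រ + ្ត + ី
    "\u179f\u17d2\u178f\u17d2\u179a\u17b8",  # ស + ្ត + ្រ + ី
    "\u179f\u17d2\u179a\u17b8\u17d2\u178f",  # ស + ្រ + ី + ្ត
]
print({normalize_order(v) for v in variants})
```

Under these assumptions, all three variants collapse to a single string, so an exact string match succeeds once both the query and the indexed words are normalized in the same way.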

Spell Checking: A user's search query is checked for spelling mistakes using grapheme-based and/or phoneme-based spell checkers. These spell checkers can suggest possible corrections within a pre-defined edit distance. A phoneme-based spell checker can also be used to identify different spelling realizations of the same word, as in the case of "បម្លែង" and "បំលែង".
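To make "corrections within a pre-defined edit distance" concrete, here is a minimal sketch using plain Levenshtein distance over Unicode code points. It is an assumption for illustration, not TSC's spell checker: the names `edit_distance` and `suggest` are hypothetical, and a real grapheme-based or phoneme-based checker would compare grapheme clusters or phoneme sequences rather than raw code points.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def suggest(query: str, vocabulary: list[str], max_distance: int = 1) -> list[str]:
    """Return vocabulary words within max_distance of the query, closest first."""
    scored = sorted((edit_distance(query, word), word) for word in vocabulary)
    return [word for distance, word in scored if distance <= max_distance]


# Toy vocabulary built from words that appear in this page.
vocabulary = ["សាលារៀន", "ខោ", "អាវ"]
print(suggest("សាឡារៀន", vocabulary, max_distance=1))
```

The query “សាឡារៀន” differs from “សាលារៀន” by one substituted character, so it falls within an edit distance of 1, while the unrelated words do not.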

Semantic Modelling of Words: A word embedding model was trained with a machine learning algorithm to locate semantically related words. The model was trained on a corpus of 1 million sentences containing approximately 30 million words.
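A word embedding model is typically queried by ranking words by the cosine similarity of their vectors. The sketch below shows that lookup step only. The source does not name the training algorithm, and the three-dimensional vectors here are invented toy values, not output of the TSC model; `cosine` and `most_similar` are hypothetical helper names.

```python
import math

CLOTHING = "សំលៀកបំពាក់"
TROUSERS = "ខោ"
SHIRT = "អាវ"
SCHOOL = "សាលារៀន"

# Invented toy vectors for illustration only (a real model has far more dimensions).
TOY_VECTORS = {
    CLOTHING: [0.90, 0.10, 0.00],
    TROUSERS: [0.80, 0.20, 0.10],
    SHIRT:    [0.85, 0.15, 0.05],
    SCHOOL:   [0.00, 0.10, 0.90],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_similar(word: str, vectors: dict[str, list[float]], top_n: int = 3) -> list[str]:
    """Return the top_n other words ranked by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]


print(most_similar(CLOTHING, TOY_VECTORS, top_n=2))
```

With these toy values, the two clothing items rank above the unrelated word, which is the behaviour a search feature would rely on to expand a query with related keywords.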

Preliminary findings suggest that these solutions can address the challenges described above. Techo Startup Center (TSC) has implemented them and opened the API to the public, free of charge, in order to collect suggestions and feedback for further improvement. Meanwhile, our team is working on enlarging the corpus by incorporating texts from other domains such as law and history.

In addition, Techo Startup Center (TSC) provides API access to two further functionalities: auto-completion and word segmentation.


CONTACT US FOR API

