3. NLP & ML Issues in Vietnamese¶

Introduction¶

Developing NLP systems and applying machine learning to Vietnamese poses several difficulties, including word segmentation, named-entity recognition, diacritics, and part-of-speech ambiguity.

Vietnamese Word Segmentation¶

The English word "friend" corresponds to the Vietnamese "bạn bè", which is written as two space-separated syllables (morphological units). If a tokenizer splits on whitespace, each syllable becomes its own token, and each means something different from the compound. The data therefore needs to be pre-processed so that the model treats the pair as a single word, for example by joining the syllables with an underscore ("bạn_bè"). Preparing a large, accurately segmented corpus by hand is laborious and costly, so one option is to run traditional statistics-based segmentation tools over the raw text.
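The underscore-joining step can be sketched with a greedy longest-match segmenter. This is only an illustration, not one of the statistics-based tools mentioned above: the `segment` function, the `LEXICON` set, and the sample sentence are hypothetical, and a real lexicon would be far larger.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


def segment(syllables, lexicon, max_len=3):
    """Greedy longest-match: join the longest run of syllables found in the lexicon."""
    known = {nfc(entry).lower() for entry in lexicon}
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = nfc(" ".join(syllables[i:i + n])).lower()
            if n == 1 or candidate in known:
                tokens.append("_".join(syllables[i:i + n]))
                i += n
                break
    return tokens


# Hypothetical toy lexicon of multi-syllable words.
LEXICON = {"bạn bè", "học sinh"}

tokens = segment("Tôi có nhiều bạn bè".split(), LEXICON)
print(tokens)
```

Greedy matching is simple but cannot resolve cases where two overlapping segmentations are both valid, which is one reason statistical or neural segmenters are used in practice.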

Named-Entity Recognition¶

Many common names of people, organizations, and locations in Vietnamese are also ordinary words. For example, "John" in English is almost always read as a name, whereas "Hoa" in Vietnamese can be a popular girl's name or the common word for "flower".
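A rough sketch of why this is hard: a capitalization heuristic helps in the middle of a sentence but gives no signal at the start, where every word is capitalized. The `AMBIGUOUS` set, the `guess_label` function, and the sample sentences are hypothetical and only illustrate the ambiguity; a real NER system would rely on context rather than this rule.

```python
# Hypothetical list of lowercase words that also occur as personal names.
AMBIGUOUS = {"hoa"}


def guess_label(tokens, i):
    """Naive guess for tokens[i] based only on capitalization and position."""
    token = tokens[i]
    if token.lower() not in AMBIGUOUS:
        return "other"
    if not token[0].isupper():
        return "common word"
    # A capitalized sentence-initial token could be either a name or a plain word.
    return "ambiguous" if i == 0 else "likely name"


print(guess_label("Tôi mua hoa".split(), 2))  # lowercase, mid-sentence
print(guess_label("Tôi gặp Hoa".split(), 2))  # capitalized, mid-sentence
print(guess_label("Hoa đi học".split(), 0))   # capitalized, sentence-initial
```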

Accent¶

Vietnamese words that share the same base letters can have different meanings depending on their diacritics. For example, "Bạn" means "Friend", and "Bàn" means "Table". A single base can carry many accented variations, each with its own meaning.

In written text especially, the tone mark can be placed on different vowels of the same syllable, which leads to inconsistent spellings of the same word.
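One possible way to handle inconsistent placement is to compare syllables by their letters and tone separately, ignoring which vowel carries the mark. The sketch below uses Python's standard `unicodedata` module; the `tone_signature` helper and the example pair "hòa" / "hoà" are assumptions added for illustration and do not come from the original text.

```python
import unicodedata

# Combining marks that encode Vietnamese tones:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def tone_signature(syllable):
    """Return (letters without tone marks, tone marks), ignoring mark placement."""
    decomposed = unicodedata.normalize("NFD", syllable)
    base = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    tones = tuple(sorted(ch for ch in decomposed if ch in TONE_MARKS))
    # Recompose so vowel-quality marks (circumflex, breve, horn) stay attached.
    return unicodedata.normalize("NFC", base), tones


a = "h\u00f2a"  # "hòa": tone mark written on the "o"
b = "ho\u00e0"  # "hoà": tone mark written on the "a"

print(a == b)                                  # the raw strings differ
print(tone_signature(a) == tone_signature(b))  # same letters and same tone
```

Two spellings with equal signatures can then be mapped to one canonical form before the text is passed to a model.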

These challenges need to be taken into account when preparing text for a machine learning model.

Part-of-Speech Tagging¶

A sentence can be interpreted in several ways depending on its context and the surrounding words and sentences. This ambiguity must be resolved before the sentence can be translated accurately.

Solution and Idea¶

To address these challenges, we could implement a small neural network such as a Transformer. We could also build encoder-decoder models that segment words, or that add diacritics to unaccented text to restore its meaning correctly.
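As a minimal sketch of the data side of diacritic restoration, training pairs for an encoder-decoder model can be generated by stripping diacritics from correctly accented text: the stripped string is the input and the original is the target. The helper names and the toy corpus below are hypothetical, and the model itself is not shown.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Remove all combining marks and map the letter d-with-stroke to plain d."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # "đ" has no canonical decomposition, so it is handled explicitly.
    return no_marks.replace("đ", "d").replace("Đ", "D")


def make_training_pairs(sentences):
    """Build (unaccented source, accented target) pairs."""
    return [(strip_diacritics(s), unicodedata.normalize("NFC", s)) for s in sentences]


def ambiguous_sources(pairs):
    """Find sources that map to more than one target."""
    targets = defaultdict(set)
    for source, target in pairs:
        targets[source].add(target)
    return {src: tgts for src, tgts in targets.items() if len(tgts) > 1}


corpus = ["bạn", "bàn", "bạn bè"]  # toy corpus, for illustration only
pairs = make_training_pairs(corpus)
print(pairs)
print(ambiguous_sources(pairs))
```

Sources that map to several targets, such as the unaccented form shared by "bạn" and "bàn", are the cases where the model has to rely on surrounding context to pick the right restoration.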

References¶

https://www.1stopasia.com/blog/challenges-developing-nlp-for-vietnamese/

https://arxiv.org/abs/1706.03762

https://kikaben.com/transformers-encoder-decoder/

https://duongnt.com/restore-vietnamese-diacritics-vie/

https://blog.luyencode.net/phan-loai-van-ban-tieng-viet/
