Project Dongcheng: The dangers of maximum match (Part 1)

In this post, we look more closely at one of the simplest algorithms for segmenting text, the maximum match algorithm.

Segmentation refers to taking a sentence, and breaking it up into (usually) words. A lot of Asian languages, such as Chinese, are written without boundaries between words, and this is something that will easily lead to problems for students.

A naive way to approach segmentation into words is the maximum match algorithm. It works by starting from the first character of a sentence, and see what the longest word we can find starting with that character is. This is marked as a word, and we move on to the next character.

A nice and simple example of why this does not always work is the English sentence "Thetabledownthere". A maximum match segmentation of this sentence (this is not really needed in practice, since there are spaces between words in English) results in "Theta bled own there". Was the right segmentation immediately obvious to you?

In Chinese, however, this rather simple algorithm turns out to work quite well, but even so, it is not state of the art, and it will make more mistakes than necessary.

Consider for example the below two sentences. Here, the character 的 alone is not possible to translate in isolation, but it works similarly to a possessive particle.

In the first sentence, 他的 means his in English, and it is natural to treat it as a single word, at least for reading purposes. Since 的 is a possessive particle, considering 的 as a single word and treating the sentence as (Belonging to him) - head would be acceptable as well. However, in general it will make sense to create groups of characters as long as it does not confuse or cause unnnecessary ambiguity.

她摸-了-摸他-的头。
Tā mō-le-mō tā-de tóu.
She - (gently) stroke - his -head
She gently stroked his head.

他受得了别人对他的责备。
Tā shòudeliǎo biéren duì tā de zébèi.
He - is able to endure (r.v.) - other people - against - him - (.) - reproach
He could endure the reproach from others.

In the next two sentences, it is instead the construction 的话, either meaning "if", in a special grammatical construction, or meaning "words of", as in the second sentence, with 的 having being a possessive as in the sentences above. Again, the first sentence will be possible to segment with maximum match, while the second will not.

假如你们不想让我早死的话。
Jiǎrú nǐmen bù-xiǎng ràng wǒ zǎo-sǐ dehuà.
If - you (plural) - don't want - let (passive) - me - die early - if (f.e.)
If you don't want me to die early.

由于夫妻二人本来可聊的话就不多，...
Yóuyú fūqī-èr-rén běnlái kě liáo de huà jiù bù-duō，...
Since - husband and wife - originally - can - talk about - (.) - words - (.) - not many
Since the married couple did not have much to talk about,...

In summary, the first problem with maximum match is that it will assume that the longest word is always the correct word. The problems are in general fewer because there are a lot of characters in Chinese, but some very common characters occur in several different functions, and this will easily lead to errors.

I should say also that one could start arguing about what is a word and what is not in the above. For example 早死, or "die early", should probably be considered as two words. However, modern Chinese is quite naturally divided up into units of (mostly) two or four characters nowadays, so treating it as a single unit is probably easier unless you have just started learning Chinese. More on that topic another day.

When this topic is revisited, we will look at covering ambiguities, another problem with the maximum match algorithm.

Project Dongcheng

onsdag 8 februari 2012

The dangers of maximum match (Part 1)

Inga kommentarer:

Skicka en kommentar