Wednesday 29 February 2012

How I study Chinese (part 1)


After quitting my job at the Embassy at the end of October last year, I have not been working. Naturally, this gives me a lot of time to study Chinese.

Over about 3 months, I estimate that I have spent 20-30 hours per week on Chinese. Taking an average of 25 hours per week, this adds up to about 300 hours. That is about 3% of the 10000 hours said to be required to become an expert in a subject.


1. Reading
I read a lot right now. My reading speed still feels rather low, but the improvement over the last few months has been encouraging.

Most of the time, I read texts produced by my reading tool (see last post). As I have a private tutor, I rarely look up words. Instead, I go through the difficult parts of the texts together with my teacher. My level is rather low especially when it comes to Chengyu (4-character proverbs), and this is one of the areas I have been concentrating on.

Last week, I spent some time before class reading the first two chapters of Yu Hua's new book, China in Ten Words. I had already downloaded it, so I converted it with my reading tool and started reading. Yu Hua's style is easier than that of most other authors I have read lately, so I did not have to stop that much.

2. Listening and speaking
During my private lessons, I try to combine reading with listening and speaking. After finishing a text, we will go through the important words and passages. More importantly, we will also discuss major points of the text, starting from what happened in the story, and going further into the historical context and setting, moods etc.

My overall goal with this is to increase my appreciation when reading. Another, closely linked goal is to understand what constitutes a "good" text in Chinese. This includes aspects such as word choice, sentence structure and flow.

Naturally, I also want my spoken Chinese to sound natural, for example during interviews, but this is not much of a problem. Chinese people have no problem understanding me, and it rarely happens that someone asks me to repeat what I said. This seemed to happen more often when I was speaking "Scandinavian" to Danish people in Denmark.

3. Presenting
I think this is part of the European scale on language acquisition as well. I would really like to improve my presentation skills, but I seem to be slightly too lazy to do any presenting.

So far, this goes into the listening and speaking part. Sometimes, in the first 15-30 minutes of a lesson, I will summarize something I have read, a topic I am interested in or some recent experience from China. I am not afraid of making mistakes when speaking Chinese (at least not anymore), even if I still make a lot of them. However, I belong to the school that believes uttering mostly correct sentences leads to faster language acquisition. So I usually let my teacher correct me on a sentence level, so that it becomes something in between a presentation and a dialogue.

4. Writing (not by hand)
My goal when it comes to writing is to be able to write a coherent text, such as a blog post or a summary of a book/chapter or news article that I have read. Because the quality of my writing is not very good yet, I spend more time reading and hope that a larger vocabulary will improve my writing as well.

Lately, the only "longer" piece I have written in Chinese is a summary of Bi Feiyu's Qingyi that I included in my last post.


This basically summarizes how I have been studying lately. Next time, I'd like to go into a bit more detail on what I read.

Tuesday 21 February 2012

2 pdf examples

I uploaded two example pdf files of segmented files to my Dropbox. The first one is 3 short news pieces, in ascending order of difficulty (3 short news pieces). The second one is a summary that I wrote about Qingyi, a short novel about the Beijing opera (Qingyi - Bi Feiyu).

The news pieces are accompanied by a word and name list. The second file uses the same markup but does not have a glossary.

I would say that they are for intermediate level - someone who knows about 1000 characters should be able to understand most of it. I have set the threshold for characters that are marked up a little lower, as knowing 1000 characters does not necessarily mean that you know the most common ones.

Is there anything that you would like to see improved with this markup? I would appreciate any comments about the files and the markup.

Monday 13 February 2012

Segmenting news

I have been working a bit with presentation this week. I segmented 3 recent news articles, among them the one included below, and created a pdf file from them. Unfortunately, I don't have any good place to upload pdf files for the moment, so I only include an image and a text file with one of the articles. I will try to get that working this week.
This article is about the opening of a new North Korean restaurant in Amsterdam.

I have made a few adjustments to the final segmented file, so that all words are segmented the way I want them to be. The program made about 5 errors on the above file.

I am pretty happy with the display settings I have already, so it looks pretty similar to before. Difficult words are marked in bold, and names are underlined. All characters above a certain threshold are marked with pinyin, and in names, some of the easier characters are marked with a tone but not a sound.

This is really useful for me, as it helps me when I read texts aloud. I generally find the tones much harder to remember than the sounds. The names are not found automatically (yet), but the difficult words are selected by their frequency. One thing that's missing so far is keeping closer track of the common characters that have several pronunciations. This is a tricky problem, so I will save it for a bit, but I think it is also interesting and important.

Some common characters to look out for are 地, 得, 中, 当, 为 and 了. Characters like 了 that basically only take another sound in composite words are a bit easier, as the dictionary can be used to find the pronunciations, but characters like 得 or 为 that can have multiple pronunciations even when they stand alone are really tricky.
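As a rough sketch of the easier case, where the dictionary settles the pronunciation, one could look up the whole word first and only fall back to a per-character default when the word is not listed. The tables below are tiny and hand-made for illustration, and the names WORD_PINYIN, CHAR_DEFAULT and pronounce are hypothetical, not taken from the actual tool. The hard case (得 or 为 standing alone) is not handled here.

```python
# Tiny hand-made tables for illustration only; a real tool would load
# a full dictionary. All names here are hypothetical.
WORD_PINYIN = {
    "了解": ["liǎo", "jiě"],
    "受得了": ["shòu", "de", "liǎo"],
}
CHAR_DEFAULT = {"了": "le", "他": "tā"}


def pronounce(word):
    """Return one pinyin syllable per character of `word`."""
    # Prefer the word-level entry, which disambiguates characters like 了.
    if word in WORD_PINYIN:
        return WORD_PINYIN[word]
    # Otherwise fall back to each character's default reading ("?" if unknown).
    return [CHAR_DEFAULT.get(ch, "?") for ch in word]


print(pronounce("了"))    # ['le']
print(pronounce("了解"))  # ['liǎo', 'jiě']
```

This only works if the text has already been segmented correctly, since the lookup is done per word.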

Otherwise, the tool already works pretty well, and I do feel that the program manages to find most of the difficult words. The only problem I have noticed so far is that it will pick out some easy words as well - especially when it comes to composite words.

At the same time, I don't think it is possible to create a tool that will find and explain all hard words. For example, composites such as 中央银行 (中央(central) + 银行(bank)) or 贫困地区 (贫困(poor) + 地区(region)) may occur with a much lower frequency than their constituents. However, it is difficult to know exactly when a composite word can easily be understood from its constituents.
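One possible heuristic, sketched below, is to treat a rare composite as easy when all of its constituents are common. This is only an assumption for illustration and not how the tool necessarily works: the frequency ranks are invented, the split into parts is given by hand, and the rule will be wrong for composites whose meaning cannot be guessed from their parts.

```python
# Hypothetical frequency ranks (1 = most common); the numbers are invented.
WORD_RANK = {
    "中央": 800, "银行": 600, "中央银行": 9000,
    "贫困": 2500, "地区": 400, "贫困地区": 12000,
}


def is_rare(word, threshold):
    """A word is rare if its rank is above the threshold or it is unlisted."""
    return WORD_RANK.get(word, float("inf")) > threshold


def is_difficult(word, parts, threshold):
    """Flag rare words, except composites whose parts are all common."""
    if not is_rare(word, threshold):
        return False
    # Assumed heuristic: a rare composite made of common parts is still easy.
    if parts and all(not is_rare(part, threshold) for part in parts):
        return False
    return True


print(is_difficult("中央银行", ["中央", "银行"], 2000))  # False
print(is_difficult("贫困地区", ["贫困", "地区"], 2000))  # True
```

With these made-up ranks, 贫困地区 is still flagged because 贫困 itself falls above the threshold, which is probably what a learner at that level would want.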

Therefore, I think of the work I am doing also as a way to create a workflow tool for a teacher to generate a text with a word list and comments in a simple way. Just reading through a text, a teacher may have an idea of which words will be hard for the student, but getting the aid of a frequency dictionary of words and characters to grade the difficulty of text will make it a lot faster and easier.

I managed to get the program to print a word list with definitions as well, but for now this is only in the pdf file. Hopefully, it will be useful for me as well - I personally spend too much time in front of the computer, so being able to print texts with pre-generated word lists will be really useful.

Of course, in the end, presentation is just a matter of choice. The above could be printed in a book or be presented in HTML. But so far, the pdf version is closest to the way I like to read a text.

Do you have any ideas for how to make the presentation better, or other preferences for how to make the presentation? Let us know in the comments.

Or do you have any interesting texts that you are struggling with, and want to try to read in this format? You can send them by email to carljohanr@gmail.com, and I will send back a segmented pdf version for you.


荷兰  首家  朝鲜  餐馆  开业  

据  韩联社  2月  5日  报道,1月  28日,首家  朝鲜  餐馆  在  荷兰  首都  阿姆斯特丹  开业。这家  餐馆  由  荷兰  企业家  同  朝方  合作  开办,正式  名称  为  “阿姆斯特丹  平壤  海棠花  餐馆”。餐馆  的  9名  工作人员  都  是  朝鲜人,包括  总经理  韩明姬  和  4名  厨师。

与  海外  其它  朝鲜  餐馆  一样,这里  也  有  几名  身穿  朝鲜  传统  服装  的  年轻  女服务员  为  顾客  提供  服务,还  会  演唱  朝鲜  歌曲。餐馆  里  悬挂着  几幅  朝鲜  画家  的  画作。

该  餐馆  提供  的  套餐  由  9种  料理  构成,价格  为  每  人  79欧元,与  当地  高级  餐馆  的  套餐  价格  水平  相当。

餐馆  总经理  韩明姬  2月  2日  在  接受  韩联社  采访  时  表示,原本  计划  开设  包括  餐馆  在内  的  文化中心,为  西方人  提供  了解  朝鲜  的  平台。这家  餐馆  可以  成为  朝鲜  与  世界  其它  国家  进行  交流  的  窗口。她  还  表示,餐馆  今后  将  通过  举办  演讲会、上映  电影、展览  美术  作品、宣传  朝鲜  旅游  商品  等  活动,起到  与  西方  国家  进行  沟通  的  桥梁  作用。

Wednesday 8 February 2012

The dangers of maximum match (Part 1)

In this post, we look more closely at one of the simplest algorithms for segmenting text, the maximum match algorithm.

Segmentation refers to taking a sentence, and breaking it up into (usually) words. A lot of Asian languages, such as Chinese, are written without boundaries between words, and this is something that will easily lead to problems for students.

A naive way to approach segmentation into words is the maximum match algorithm. It works by starting from the first character of a sentence and finding the longest word that starts with that character. This is marked as a word, and we continue from the first character after it.

A nice and simple example of why this does not always work is the English sentence "Thetabledownthere". A maximum match segmentation of this sentence (this is not really needed in practice, since there are spaces between words in English) results in "Theta bled own there". Was the right segmentation immediately obvious to you?
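A minimal sketch of the algorithm in Python reproduces this result. The function name max_match and the toy dictionary are made up for this example; a real segmenter would use a full word list.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest match."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no dictionary word starts at this position.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary, made up for this example.
toy_dictionary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", toy_dictionary))
# prints ['theta', 'bled', 'own', 'there']
```

Because "theta" is longer than "the", the algorithm commits to it at the very first step and never gets a chance to find "table".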

In Chinese, however, this rather simple algorithm turns out to work quite well, but even so, it is not state of the art, and it will make more mistakes than necessary.

Consider for example the two sentences below. Here, the character 的 cannot be translated in isolation, but it works similarly to a possessive particle.

In the first sentence, 他的 means "his" in English, and it is natural to treat it as a single word, at least for reading purposes. Since 的 is a possessive particle, considering 的 as a single word and treating the phrase as (Belonging to him) - head would be acceptable as well. However, in general it makes sense to group characters together as long as this does not confuse or cause unnecessary ambiguity.

她 摸-了-摸 他-的 头。
Tā mō-le-mō tā-de tóu.
She - (gently) stroke - his - head
She gently stroked his head.

他 受得了 别人 对 他 的 责备。
Tā shòudeliǎo biéren duì tā de zébèi.
He - is able to endure (r.v.) - other people - against - him - (.) - reproach
He could endure the reproach from others.

In the next two sentences, the construction is instead 的话, which either means "if" in a special grammatical construction, as in the first sentence, or "words of", as in the second sentence, with 的 being a possessive as in the sentences above. Again, the first sentence can be segmented with maximum match, while the second cannot.

假如 你们 不想 让 我 早死 的话。
Jiǎrú nǐmen bù-xiǎng ràng wǒ zǎo-sǐ dehuà.
If - you (plural) - don't want - let (passive) - me - die early - if (f.e.)
If you don't want me to die early.

由于 夫妻二人 本来 可 聊 的 话 就 不多,...
Yóuyú fūqī-èr-rén běnlái kě liáo de huà jiù bù-duō,...
Since - husband and wife - originally - can - talk about - (.) - words - (.) - not many
Since the married couple did not have much to talk about,...

In summary, the first problem with maximum match is that it will assume that the longest word is always the correct word. The problems are in general fewer because there are a lot of characters in Chinese, but some very common characters occur in several different functions, and this will easily lead to errors.
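The 的话 case can be reproduced with a small sketch. The function and the toy dictionary below are made up for illustration and only contain the words needed for this one sentence. The intended segmentation is the one given for the sentence above.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest match."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Longest candidate first; a single character is the fallback.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary containing both 的话 ("if") and its parts 的 and 话.
toy_dictionary = {"由于", "夫妻二人", "本来", "可", "聊", "的", "话", "的话", "就", "不多"}
sentence = "由于夫妻二人本来可聊的话就不多"
intended = ["由于", "夫妻二人", "本来", "可", "聊", "的", "话", "就", "不多"]

result = max_match(sentence, toy_dictionary)
print(result)
# prints ['由于', '夫妻二人', '本来', '可', '聊', '的话', '就', '不多']
print(result == intended)
# prints False
```

Since 的话 is in the dictionary and is longer than 的, it always wins, no matter which reading the sentence actually calls for.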

I should say also that one could start arguing about what is a word and what is not in the above. For example 早死, or "die early", should probably be considered as two words. However, modern Chinese is quite naturally divided up into units of (mostly) two or four characters nowadays, so treating it as a single unit is probably easier unless you have just started learning Chinese. More on that topic another day.

When this topic is revisited, we will look at covering ambiguities, another problem with the maximum match algorithm.

Monday 6 February 2012

Presenting Chinese in a pdf

Presenting Chinese in a way that is useful for learners is tricky. The main challenge is that the learner will often not know the pronunciation of a character. A second challenge is that words are not separated in standard Chinese text. However, just breaking up the characters one by one makes no sense either - one will somehow have to divide the text into words, or possibly "natural units", that make reading and understanding easier.

Automatically making this split is a challenging theoretical problem, but there are already methods which come pretty close to solving it. More about those challenges in later posts.

One of my early ideas was to put the tone of the characters above the characters, and the pinyin pronunciation below. In a pdf file, this becomes aesthetically rather pleasing, as below.

If it were possible, I would have liked to copyright this idea, as it is not something I have seen before. However, some people have used LaTeX to present tone marks above characters, just without the pinyin pronunciation.

In addition, I have chosen to highlight "difficult words" by marking them in bold and adding pinyin to them (only tones if they are common enough), and to underline all names. For this particular file, I have made some manual adjustments, but a lot of the work needed to generate files automatically in this format is already done.

The logic behind this is that the annotations should make it easier for the student, but without cluttering the page too much. For example, adding pinyin to all characters, as is done in some books, just makes reading more difficult for someone who already knows most of the characters.

All in all, this solution works pretty well for viewing a text in print, and I would argue that this display is more pleasing than that of most Chinese textbooks.

Sunday 5 February 2012

What is Project Dongcheng?

I have been interested in studying Chinese ever since I came to China in 2005 to study at Beiwai. Back then, I also met Erik, who was studying at Beiwai at the time.

We have been living in different cities in Sweden for the last 3 years, but almost by accident, we both happened to move back to Dongcheng (东城, a part of Beijing) last year.

I had the idea to write a teaching tool for Chinese (basically a word segmenter) as I do a lot of reading in Chinese (very slowly, but I try to improve). After Erik spent some time getting a basic prototype of this working during our Christmas vacation, we agreed to put our efforts together to create a useful teaching tool for Chinese.

And that was the birth of Project Dongcheng!

So far it is mainly for ourselves, but that is hopefully something that will change as our tool gets better.