Segmenting news

I have been working a bit on presentation this week. I segmented 3 recent news articles, among them the one included below, and created a pdf file from it. Unfortunately, I don't have a good place to upload pdf files at the moment, so I am only including an image and a text file with one of the articles. I will try to sort that out this week.
This article is about the opening of a new North Korean restaurant in Amsterdam.

I have made a few adjustments to the final segmented file, so that all words are segmented the way I want them to be. The program made about 5 errors on the above file.

I am pretty happy with the display settings I have already, so it looks pretty similar to before. Difficult words are marked in bold, and names are underlined. All characters above a certain threshold are marked with pinyin, and in names, some of the easier characters are marked with a tone but not a sound.
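
The pinyin rule can be sketched in a few lines of Python. This is only an illustration, not the actual program: I am assuming that "threshold" means a frequency rank (1 = most common character), and the names `CHAR_RANK`, `CHAR_PINYIN` and `annotate` as well as all the rank numbers are made up for the example.

```python
# Toy lookup tables. The ranks are invented for illustration only;
# a real tool would load them from a character frequency list.
CHAR_RANK = {"餐": 1500, "馆": 1400, "开": 94, "业": 130}
CHAR_PINYIN = {"餐": "cān", "馆": "guǎn", "开": "kāi", "业": "yè"}


def annotate(text, threshold=1000):
    """Add pinyin in parentheses after every character ranked above the threshold.

    Characters missing from the tables are left unmarked in this sketch.
    """
    out = []
    for ch in text:
        rank = CHAR_RANK.get(ch)
        if rank is not None and rank > threshold and ch in CHAR_PINYIN:
            out.append(f"{ch}({CHAR_PINYIN[ch]})")
        else:
            out.append(ch)
    return "".join(out)


print(annotate("餐馆开业"))
```

With these toy numbers, only 餐 and 馆 are annotated, while the more common 开 and 业 are left alone. Lowering the threshold marks more characters.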

This is really useful for me, as it helps me when I read texts aloud. I generally find the tones much harder to remember than the sounds. The names are not found automatically (yet), but the difficult words are selected by their frequency. One thing that's missing so far is closer tracking of the common characters that have several pronunciations. This is a tricky problem, so I will save it for later, but I think it is also interesting and important.

Some common characters to look out for are 地, 得, 中, 当, 为 and 了. Characters like 了 that basically only take another sound in composite words are a bit easier, as the dictionary can be used to find the pronunciations, but characters like 得 or 为 that can have multiple pronunciations even when they stand alone are really tricky.
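
A minimal sketch of that distinction, assuming a word-level dictionary is available: look the whole word up first, and only fall back to per-character readings (flagging the result as uncertain) when the word is not listed. The tables and the function `pronounce` are hypothetical and heavily simplified. They are not taken from the program.

```python
# Hypothetical mini-dictionaries; a real tool would load a full dictionary.
WORD_PINYIN = {"了解": ["liǎo", "jiě"], "为了": ["wèi", "le"]}
# Per-character readings, default reading first (simplified on purpose).
CHAR_READINGS = {
    "了": ["le", "liǎo"],
    "解": ["jiě"],
    "为": ["wéi", "wèi"],
    "得": ["de", "dé", "děi"],
}


def pronounce(word):
    """Return (readings, ambiguous) for one segmented word."""
    # Case 1: the word is in the dictionary, so its reading is fixed.
    if word in WORD_PINYIN:
        return WORD_PINYIN[word], False
    # Case 2: fall back to the default reading of each character and
    # flag the word if any character has more than one reading.
    readings = []
    ambiguous = False
    for ch in word:
        options = CHAR_READINGS.get(ch, ["?"])
        readings.append(options[0])
        if len(options) > 1:
            ambiguous = True
    return readings, ambiguous


print(pronounce("了解"))  # resolved by the word dictionary
print(pronounce("为"))    # stands alone, so it is flagged as ambiguous
```

The flag does not solve the hard case. It only marks the places where a stand-alone 得 or 为 needs a manual check.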

Otherwise, the tool already works pretty well, and I do feel that the program manages to find most of the difficult words. The only problem I have noticed so far is that it will pick out some easy words as well - especially when it comes to composite words.

At the same time, I don't think it is possible to create a tool that will find and explain all hard words, and only the hard words. For example, composites such as 中央银行 (中央(central) + 银行(bank)) or 贫困地区 (贫困(poor) + 地区(region)) may occur with a much lower frequency than their constituents, so they get picked out even though they are easy to understand. In general, though, it is difficult to know exactly when a composite word can easily be understood from its constituents.
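
One possible heuristic, sketched here under strong assumptions: if a word is rare but all of its constituents are common, put it in a separate "review" group instead of marking it as hard. The split into constituents is supplied by hand, the rank numbers are invented, and `WORD_RANK` and `classify` are hypothetical names rather than part of the program.

```python
# Invented word frequency ranks (1 = most common), for illustration only.
WORD_RANK = {
    "中央": 800, "银行": 600, "中央银行": 25000,
    "贫困": 3000, "地区": 300, "贫困地区": 30000,
}


def classify(word, parts, threshold=5000):
    """Classify a word as 'easy', 'review' or 'hard' by frequency rank.

    parts is a hand-supplied split of the word into constituents
    (an empty list if the word is not treated as a composite).
    """
    def rank(w):
        # Words missing from the list are treated as extremely rare.
        return WORD_RANK.get(w, float("inf"))

    if rank(word) <= threshold:
        return "easy"
    if parts and all(rank(p) <= threshold for p in parts):
        # Rare as a whole, but built from common words: possibly transparent.
        return "review"
    return "hard"


print(classify("中央银行", ["中央", "银行"]))
print(classify("贫困地区", ["贫困", "地区"]))
```

With these numbers both examples end up in the "review" group. Whether such a word is really transparent still has to be decided by a person.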

Therefore, I also think of the work I am doing as a way to create a workflow tool for a teacher, who can use it to generate a text with a word list and comments in a simple way. Just reading through a text, a teacher may have an idea of which words will be hard for the student, but getting the aid of a frequency dictionary of words and characters to grade the difficulty of a text will make this a lot faster and easier.

I managed to get the program to print a word list with definitions as well, but for now, this is only in the pdf file. Hopefully, it will be useful for me as well - I personally spend too much time in front of the computer, so being able to print texts with pre-generated word lists will be really useful.

Of course, in the end, presentation is just a matter of choice. The above could be printed in a book or be presented in HTML. But so far, the pdf version is closest to the way I like to read a text.

Do you have any ideas for how to make the presentation better, or other preferences for how a text should be presented? Let us know in the comments.

Or do you have any interesting texts that you are struggling with, and want to try to read in this format? You can send them by email to carljohanr@gmail.com, and I will send back a segmented pdf version for you.

荷兰  首家  朝鲜  餐馆  开业  

据  韩联社  2月  5日  报道,1月  28日,首家  朝鲜  餐馆  在  荷兰  首都  阿姆斯特丹  开业。这家  餐馆  由  荷兰  企业家  同  朝方  合作  开办,正式  名称  为  “阿姆斯特丹  平壤  海棠花  餐馆”。餐馆  的  9名  工作人员  都  是  朝鲜人,包括  总经理  韩明姬  和  4名  厨师。

与  海外  其它  朝鲜  餐馆  一样,这里  也  有  几名  身穿  朝鲜  传统  服装  的  年轻  女服务员  为  顾客  提供  服务,还  会  演唱  朝鲜  歌曲。餐馆  里  悬挂着  几幅  朝鲜  画家  的  画作。

该  餐馆  提供  的  套餐  由  9种  料理  构成,价格  为  每  人  79欧元,与  当地  高级  餐馆  的  套餐  价格  水平  相当。

餐馆  总经理  韩明姬  2月  2日  在  接受  韩联社  采访  时  表示,原本  计划  开设  包括  餐馆  在内  的  文化中心,为  西方人  提供  了解  朝鲜  的  平台。这家  餐馆  可以  成为  朝鲜  与  世界  其它  国家  进行  交流  的  窗口。她  还  表示,餐馆  今后  将  通过  举办  演讲会、上映  电影、展览  美术  作品、宣传  朝鲜  旅游  商品  等  活动,起到  与  西方  国家  进行  沟通  的  桥梁  作用。

