skishore / makemeahanzi

Free, open-source Chinese character data

Home Page: https://www.skishore.me/makemeahanzi/

License: Other

Python 11.94% JavaScript 87.58% Shell 0.48%

makemeahanzi's Introduction

Make Me a Hanzi Demo

New: Inkstone Chinese writing app

New: No more cut-off strokes (due to @chanind)!

Make Me a Hanzi provides dictionary and graphical data for over 9000 of the most common simplified and traditional Chinese characters. Among other things, this data includes stroke-order vector graphics for all these characters. You can see the project output at the demo site, where you can look up characters by drawing them. You can also download the data for use in your own site or app.

See the project site for general information and updates on the project.

Make Me a Hanzi data is split into two data files, dictionary.txt and graphics.txt, because the sources that the files are derived from have different licenses. In addition, we provide an experimental tarball of animated SVGs, svgs.tar.gz, which is licensed the same way as graphics.txt. See the Sources section and the COPYING file for more information.

Sources

dictionary.txt is derived from data from Unihan and CJKlib.
graphics.txt and svgs.tar.gz are derived from two free fonts: Arphic PL KaitiM GB and Arphic PL UKai.

This project would not have been possible without the generosity of Arphic Technology, a Taiwanese font forge that released their work under a permissive license in 1999.

In addition, I would like to thank Gábor Ugray for his thoughtful advice on the project and for verifying stroke data for most of the traditional characters in the two data sets. Gábor maintains Zydeo, a free and open-source Chinese dictionary.

Format

Both dictionary.txt and graphics.txt are '\n'-separated lists of lines, where each line is a JSON object. They differ in which keys are present, but the common key, 'character', can be used to join the two data sets. You can also rely on the fact that the two files will always come in the same order.
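As a rough illustration of the join described above, the following Python sketch parses both files as newline-separated JSON and merges entries on the 'character' key. The helper names and the two sample lines are made up for the example and are not real project data.

```python
import json


def parse_lines(text):
    """Parse newline-separated JSON objects, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def join_by_character(dictionary_rows, graphics_rows):
    """Merge dictionary and graphics entries that share a 'character' key."""
    graphics = {row["character"]: row for row in graphics_rows}
    joined = {}
    for row in dictionary_rows:
        match = graphics.get(row["character"])
        if match is not None:
            # Keys from both files end up in one combined record.
            joined[row["character"]] = {**row, **match}
    return joined


# Made-up sample lines in the documented shape (hypothetical values).
dictionary_text = '{"character": "一", "definition": "one"}\n'
graphics_text = (
    '{"character": "一", "strokes": ["M 0 0 L 10 10 Z"], '
    '"medians": [[[0, 0], [10, 10]]]}\n'
)

combined = join_by_character(parse_lines(dictionary_text), parse_lines(graphics_text))
print(sorted(combined["一"].keys()))
```

With the real files, the same functions could be fed the contents of dictionary.txt and graphics.txt read as UTF-8 text.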

dictionary.txt keys:

character: The Unicode character for this glyph. Required.
definition: A String definition targeted towards second-language learners. Optional.
pinyin: A comma-separated list of String pronunciations of this character. Required, but may be empty.
decomposition: An Ideograph Description Sequence decomposition of the character. Required, but invalid if it starts with a full-width question mark '？'.

Note that even if the first character is a proper IDS symbol, any component within the decomposition may be a wide question mark as well. For example, if we have a decomposition of a character into a top and bottom component but can only recognize the top component, we might have a decomposition like so: '⿱逢？'
etymology: An etymology for the character. This field may be null. If present, it will always have a "type" field, which will be one of "ideographic", "pictographic", or "pictophonetic". If the type is one of the first two options, then the etymology will always include a string "hint" field explaining its formation.

If the type is "pictophonetic", then the etymology will contain three other fields: "hint", "phonetic", and "semantic", each of which is a string and each of which may be null. The etymology should be read as: "${semantic} (${hint}) provides the meaning while ${phonetic} provides the pronunciation", with allowances for possible null values.
radical: Unicode primary radical for this character. Required.
matches: A list of mappings from strokes of this character to strokes of its components, as indexed in its decomposition tree. Any given entry in this list may be null. If an entry is not null, it will be a list of indices corresponding to a path down the decomposition tree.

This schema is a little tricky to explain without an example. Suppose that the character '俢' has the decomposition: '⿰亻⿱夂彡'

The third stroke in that character belongs to the radical '夂'. Its match would be [1, 0]. That is, if you think of the decomposition as a tree, it has '⿰' at its root with two children '亻' and '⿱', and '⿱' further has two children '夂' and '彡'. The path down the tree to '夂' is to take the second child of '⿰' and the first of '⿱', hence, [1, 0].

This field can be used to generate visualizations marking each component within a given character, or potentially for more exotic purposes.
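The path lookup described for the matches field can be sketched in Python as follows. This is a minimal illustration, not the project's own code: it assumes the decomposition is a well-formed Ideographic Description Sequence using the operators U+2FF0 to U+2FFB, of which ⿲ and ⿳ take three operands and the rest take two. The function names are hypothetical.

```python
# IDS operators U+2FF0..U+2FFB; U+2FF2 and U+2FF3 are the three-operand ones.
IDS_OPERATORS = {chr(code) for code in range(0x2FF0, 0x2FFC)}
TERNARY_OPERATORS = {"\u2ff2", "\u2ff3"}


def parse_ids(decomposition):
    """Parse an IDS string into a nested (symbol, children) tuple."""

    def parse(pos):
        symbol = decomposition[pos]
        pos += 1
        children = []
        if symbol in IDS_OPERATORS:
            arity = 3 if symbol in TERNARY_OPERATORS else 2
            for _ in range(arity):
                child, pos = parse(pos)
                children.append(child)
        return (symbol, children), pos

    tree, end = parse(0)
    if end != len(decomposition):
        raise ValueError("trailing characters in decomposition")
    return tree


def component_at(tree, path):
    """Follow a list of child indices down the tree and return the symbol."""
    node = tree
    for index in path:
        node = node[1][index]
    return node[0]


tree = parse_ids("⿰亻⿱夂彡")
print(component_at(tree, [1, 0]))
```

Since a component may be the wide question mark '？', a caller would probably want to treat that symbol as "unknown" rather than as a real component.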

graphics.txt keys:

character: The Unicode character for this glyph. Required.
strokes: List of SVG path data for each stroke of this character, ordered by proper stroke order. Each stroke is laid out on a 1024x1024 size coordinate system where:
- The upper-left corner is at position (0, 900).
- The lower-right corner is at position (1024, -124).
Note that the y-axis DECREASES as you move downwards, which is strange! To display these paths properly, you should render them as follows:
```
<svg viewBox="0 0 1024 1024">
  <g transform="scale(1, -1) translate(0, -900)">
    <path d="STROKE[0] DATA GOES HERE"></path>
    <path d="STROKE[1] DATA GOES HERE"></path>
    ...
  </g>
</svg>
```
medians: A list of stroke medians, in the same coordinate system as the SVG paths above. These medians can be used to produce a rough stroke-order animation, although it is a bit tricky. Each median is a list of pairs of integers. This list will be as long as the strokes list.
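To generate the wrapper shown above from a graphics.txt entry, something like the following Python sketch could be used. The function name and the size parameter are illustrative assumptions; only the viewBox and the transform come from the description above.

```python
from xml.sax.saxutils import quoteattr


def strokes_to_svg(strokes, size=200):
    """Wrap a list of SVG path strings in the documented flip transform."""
    # quoteattr adds the surrounding quotes and escapes special characters.
    paths = "\n".join(f"    <path d={quoteattr(d)}></path>" for d in strokes)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1024 1024" '
        f'width="{size}" height="{size}">\n'
        '  <g transform="scale(1, -1) translate(0, -900)">\n'
        f"{paths}\n"
        "  </g>\n"
        "</svg>\n"
    )


# Made-up path data standing in for the "strokes" list of one entry.
print(strokes_to_svg(["M 100 800 L 900 800 L 900 700 L 100 700 Z"]))
```

The strokes are emitted in list order, so the output preserves the stroke order of the entry.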

TODOs and Future Work

As an experimental next step, we have produced an animated SVG image for each character that we have data for (see the svgs directory). The SVGs are named by the Unicode codepoint of the character they correspond to. Using JavaScript, you can find the codepoint of a character x by calling x.charCodeAt(0); for characters outside the Basic Multilingual Plane, x.codePointAt(0) returns the full codepoint. It's easy to embed these SVGs in a website. A minimal example is as follows:
```
<body><embed src="31119.svg" width="200px" height="200px"/></body>
```
This feature is experimental because it is still tricky to work with these images beyond this basic example. For instance, it's not clear how to embed two of these images side-by-side and have the second start animating when the first is complete. However, the images are still the easiest way to make use of this data.
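The file-naming rule (decimal Unicode codepoint plus ".svg") can be sketched in Python, where ord and chr play the role of the JavaScript calls. The helper names are hypothetical; the value 28858 for 為 is the one quoted in the issue about that character further down.

```python
def svg_filename(character):
    """Return the SVG file name for a single character, e.g. '28858.svg'."""
    if len(character) != 1:
        raise ValueError("expected exactly one character")
    # ord() returns the full Unicode codepoint as a decimal integer.
    return f"{ord(character)}.svg"


def character_for(filename):
    """Recover the character from a file name such as '28858.svg'."""
    stem = filename.rsplit(".", 1)[0]
    return chr(int(stem))


print(svg_filename("為"))
print(character_for("28858.svg"))
```

This gives a direct mapping between entries in the data files and SVG file names, without relying on file ordering.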

There are quite a few clients using the Make Me a Hanzi data. Many of them have had to do additional preprocessing of it for their use case. If you might find this data useful, please feel free to contact me by email - I may be able to give tips or suggest algorithms for making use of it.

Related projects

This project is focused on building stroke order diagrams that follow the People's Republic of China (PRC) stroke order. Some characters are written with different stroke orders in Japan, Taiwan, and elsewhere. I don't have the time or knowledge to produce similar data for those orderings, but there are other resources that you can try:
- parsimonhi's animCJK project provides Japanese stroke order data: GitHub and Demo
- KanjiVG also has Japanese stroke order data, and isn't based on Arphic's font: Website
- chanind's Hanzi Writer JavaScript library supports animations and writing practice: Website
There are also some apps and websites that use this data:
- gugray maintains HanDeDict, a Chinese-German dictionary that uses these animations: GitHub and Website
- meshonline wrote a free iOS app for learning Chinese characters using this data: GitHub and App Store
- embermitre uses Make Me a Hanzi animations in Hanping Chinese Dictionary: Lite version and Pro version


makemeahanzi's Issues

How to use the tool branch

Hey skishore!

I can't figure out how to use the tool branch. I'm able to run the local server, but all I see is this page:

Are there any instructions? Please forgive me if I'm missing something obvious. I want to add data for some characters that are missing from graphics.txt, but I'm stuck at this screen! :)

Stroke order error for 覽

The stroke order for 覽 seems very strange, please check.

賴瀨懶籟癩 stroke order error

Hello,

The stroke order of the 刀 component in 賴瀨懶籟癩 seems incorrect.

Hope it helps!

What is the meaning of the number in SVG file names?

As the title asks, what is the meaning of the number in the SVG file names? I have no idea how to map this number to the character code, or to the character index in the word-list files.

Thanks!

Why does 'matches' data often contain null values?

Can you explain what this means? I find around 173 cases in the dictionary where at least one value in the 'matches' field is null.
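The README's format description says that any entry in the matches list may be null. To locate the affected characters, a count like the one quoted above could be reproduced with a short Python sketch along these lines. The sample lines are invented for the example, and the function names are hypothetical.

```python
import json


def has_null_match(entry):
    """True if at least one value in the entry's 'matches' list is null."""
    matches = entry.get("matches") or []
    return any(match is None for match in matches)


def count_entries_with_null_matches(lines):
    """Count JSON lines whose 'matches' list contains a null value."""
    count = 0
    for line in lines:
        if line.strip() and has_null_match(json.loads(line)):
            count += 1
    return count


# Invented sample lines in the documented shape (not real project data).
sample = [
    '{"character": "A", "matches": [[0], [1, 0], null]}',
    '{"character": "B", "matches": [[0], [1]]}',
    "",
]
print(count_entries_with_null_matches(sample))
```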

Remove svgs from source control and create releases instead

This git repository is enormously huge due to the fact that all svg files are stored in it. Thus a simple git clone takes a very very long time.
It is generally considered bad practice to commit binaries and such.
My suggestion:
Remove the svg files from git and upload them to the release section instead.

No stroke data for 〇

Please don't stone me 😆

But it is considered like a CJK sign for 0 by Wikipedia (right side, bottom of table).

So does the MOOC course I code for.

Do we have a source for this sign's stroke order ?
Do we accept the Korean way ?

What library does the demo use for pen input?

Hi! The demo website has the best browser based pen input I've ever seen. Could you explain what library this is?

No stroke data for 鍊 (HSK3)

The character 鍊 is often used as the traditional form of 炼 (锻炼 is HSK3)

Stroke error in 袤

Please check it

翰 wrong stroke order

The right part of 翰 is messy.

dictionary.txt improvements

From email:

Typo:

Character	Current definition	Proposed definition
檀	"ssandalwood, hardwood; surname"	"sandalwood, hardwood; surname"

Added clarity for paired characters:

Character	Current definition	Proposed definition
琵	"a guitar-like instrument"	"a guitar-like instrument (1)"
琶	"a guitar-like instrument"	"a guitar-like instrument (2)"
蚯	"earthworm"	"earthworm (1)"
蚓	"earthworm"	"earthworm (2)"
蟾	"toad"	"toad (1)"
蜍	"toad"	"toad (2)"
蟋	"cricket"	"cricket (1)"
蟀	"cricket"	"cricket (2)"
龌	"narrow, small-minded; dirty"	"narrow, small-minded; dirty (1)"
龊	"narrow, small-minded; dirty"	"narrow, small-minded; dirty (2)"

Distinguish similar characters/readings:

Character	Current definition	Proposed definition
飙	"whirlwind"	"whirlwind (1)"
飚	"whirlwind"	"whirlwind (2)"

Any plans for a MakeMeAHanzi app?

I'm currently accessing it on my mobile (Android) through the Hanping Chinese app, but it would be great if there was an official app from you guys! Nothing too far from what you already have, except maybe an extra option to search for a character either by typing or by drawing it.
Keep up with the good work!

Various tiny errors found in 題衆擊搖湧薦混轍韆

Hello,

There seem to be some errors in the characters below.

題: wrong stroke order (7th stroke)
衆: wrong stroke order (7th stroke)
擊: wrong stroke order (8th stroke)
搖: wrong stroke order (6th stroke)
湧: wrong stroke order (10th stroke)
薦: wrong stroke direction of the last stroke
混: last stroke is damaged (a tiny line can be seen). One can suppress L 934 105 L 896 112 and modify medians accordingly to clear the problem.
轍: the starting point of 19th stroke median is misplaced.
韆: a very small stroke is drawn at the end of the animation.

Hope it helps!

摟 error

In 摟, part of the 14th stroke is wrongly drawn with the 13th stroke.

What font is this?

Can I Convert SVG path data to medians

I want to add new characters like Hiragana to my project and I have SVG path data for them.

Can I generate the medians from SVG data? If so, how?

Say if I have the character 一

With Path Data "M 518 382 Q 572 385 623 389 Q 758 399 900 383 Q 928 379 935 390 Q 944 405 930 419 Q 896 452 845 475 Q 829 482 798 473 Q 723 460 480 434 Q 180 409 137 408 Q 130 408 124 408 Q 108 408 106 395 Q 105 380 127 363 Q 146 348 183 334 Q 195 330 216 338 Q 232 344 306 354 Q 400 373 518 382 Z"

How do you get

"medians":[
[
[121,393],
[193,372],
[417,402],
[827,434],
[920,401]
]
]

Many Thanks, and Thanks for this project!

Issues in

Stroke order for "嗡" should be "竖、横折、横、撇、捺、撇折、点、横折钩、点、提、横折钩、点、提", but when you write it as "竖、横折、横、撇、捺、撇折、点、横折钩、横折钩、点、提、点、提", you still get "嗡" as the first result. Other characters with complex stroke orders, like "嗡", have the same issue.

incorrect pronunciation for 耶

I'm sorry I can't fix it myself. I don't know how to do a pull request.

The correct pronunciation for "耶" is "yē", not "yé".

Strokes 4 and 5 in 再 should be swapped.

Horizontal first, then vertical.

Animation doesn't start on Firefox

Hello,

First of all, congratulations on the great job you did.

It seems that animation doesn't start when using Firefox and svg files (i tested it using a simple page containing <body><embed src="31119.svg" width="200px" height="200px"/></body>).

To solve the issue, I just removed the "scoped" attribute from the "style" tag in the svg files. Maybe Firefox has this strange behavior because it does something with this attribute and is buggy, while other browsers just ignore it.
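The workaround above could be applied in bulk with a small Python sketch like the following. It assumes the attribute appears as either a bare `scoped` or `scoped="..."` on the style tag; the exact markup in the generated files is not shown here, so the sample strings are hypothetical.

```python
import re

# Matches a "scoped" attribute (with or without a value) inside a <style> tag.
SCOPED_STYLE = re.compile(
    r"(<style\b[^>]*?)\s+scoped\b(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'))?",
    re.IGNORECASE,
)


def strip_scoped(svg_text):
    """Return the SVG text with 'scoped' removed from any <style> tag."""
    return SCOPED_STYLE.sub(r"\1", svg_text)


# Hypothetical markup used only to exercise the function.
print(strip_scoped('<style scoped="true" type="text/css">path {}</style>'))
```

Other attributes on the style tag are left untouched, and text without a scoped attribute passes through unchanged.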

Stroke order for traditional and simplified Chinese are sometimes different

Stroke orders for characters such as 問 are different for simplified and traditional Chinese.
See: https://en.wikipedia.org/wiki/Stroke_order

We probably need to adjust the format of graphics.txt to take this into account, or create a separate graphics.txt. It might be pretty easy to fix this if the traditional Chinese font has a different stroke order than the simplified Chinese font for the same character.

About 再 stroke order (again) and comparison between a simplified Chinese context and a traditional Chinese context

Sorry to open again a closed issue (see issue #51).

In a simplified Chinese context, in 再 (as many others such as 冉鞲遘 etc.) the vertical stroke should be drawn 4th and horizontal stroke 5th. See for instance:

bihua.51240.com
archchinese.com

In a traditional Chinese context, the horizontal stroke is drawn before the vertical one. See for instance:

http://stroke-order.learningweb.moe.edu.tw/character.do

Note that archchinese.com (which is displaying both simplified and traditional Chinese) is not always consistent for characters that contain this component. Some of them have the horizontal stroke drawn before the vertical one (especially those that are not simplified characters such as 購).

As a result, in makemeahanzi, i think that the vertical stroke should be written before the horizontal one since it assumes using a simplified Chinese context.

Deciding what counts as a simplified Chinese context or a traditional Chinese context is a difficult subject. Roughly, when the html lang attribute is "zh-hans", "zh-cn", or "zh-sg", this is a simplified Chinese context. Otherwise, it is a traditional Chinese context, except when the lang attribute is just "zh": in that case, the question is open :-) (otherwise, it would be too simple).
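The rough rule above can be written down as a short Python sketch. It follows the rule literally with exact, case-insensitive tag comparison, and it assumes the input is already known to be a Chinese lang tag; longer tags such as region-qualified ones are not specially handled. The function name and return labels are made up for the example.

```python
SIMPLIFIED_TAGS = {"zh-hans", "zh-cn", "zh-sg"}


def chinese_context(lang):
    """Classify a Chinese html lang value using the rough rule above."""
    tag = lang.strip().lower()
    if tag in SIMPLIFIED_TAGS:
        return "simplified"
    if tag == "zh":
        # The bare "zh" tag is the case the rule leaves open.
        return "undetermined"
    return "traditional"


for value in ["zh-Hans", "zh-CN", "zh-Hant", "zh-TW", "zh"]:
    print(value, chinese_context(value))
```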

What happens when one wants to draw a traditional character in a simplified context (or a simplified character in a traditional context) is another problem. It seems that in such a case, most of the time (I am still researching what is really done most of the time), the traditional character may be "modified" and get a simplified-like stroke order (i.e. vertical stroke before horizontal one for the 4th and 5th strokes of 再 when it appears as a component of a traditional character). See for instance https://bihua.51240.com/e8b3bc__bihuachaxun/ for the 購 character.

Also, the traditional character may have two different glyphs: one in a simplified Chinese context and another one in a traditional Chinese context (such as characters that contain the 艹 (艸) component). For instance, 莧 (decimal unicode 33703) is a traditional character. It has a simplified form: 苋 (decimal unicode 33483). When 莧 is drawn in a traditional context, the 艹 has (most of the time?) 4 strokes. But when it is drawn in a simplified context, it has (most of the time?) only 3 strokes, even when one uses the traditional character. You can test it using for instance the noto font, and displaying an html page containing <div lang="zh-hans">莧</div> and <div lang="zh-hant">莧</div>.

Finally, note that even people from mainland China, in a simplified Chinese context, draw the horizontal stroke before the vertical one as the 4th stroke of 再. I just tested that point "again" :-) with some Chinese relatives at home 10 minutes ago.

Hope it helps!

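Is there documentation for Makemeahanzi-tool?

Makemeahanzi-tool is a very useful tool.
Thank you very much.
But because there is no documentation, it is confusing to use.
Could you provide more detailed documentation? Thanks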

Stroke order for 黑 radical inconsistent (e.g. 點)

It's not clear to me whether the 6th stroke should be the horizontal or vertical one. Need a reference to determine the correct order here which we can then use consistently.

Wikimedia Commons Stroke Order project wants to use your data

Onboarding Wikimedia Commons members.

Hello there,
I come from the Stroke Order project and the Ancient Chinese Characters project on Wikimedia Commons. I just discovered your project and data and really love them; they have a much larger reach than our Stroke Order project.

The SO project was created by graphic designers and so has strong UX / design expertise. We discussed a lot about how to display stroke order efficiently for teaching purposes, and came up with a few elegant styles, naming conventions and file formats (300px):

I would like to work on your open data via nodejs to output png and gif images satisfying Wikimedia projects and our graphic guidelines.

Also, could I ask you for guidance while I explore your work ?

Personal notes:

Software to submit stroke order : https://www.skishore.me/makemeahanzi/
Dictionary : https://raw.githubusercontent.com/skishore/makemeahanzi/master/dictionary.txt
XML database: https://raw.githubusercontent.com/skishore/makemeahanzi/master/graphics.txt 👍
XML guideline : https://github.com/skishore/makemeahanzi#graphicstxt-keys 💯 ok, I have to play with that.

Wrong stroke orders for a traditional character

Hey,

The program has an issue when writing 候 (traditional character). It writes the two leftmost strokes, then goes to the two top-right strokes, then the vertical line, then continues normally. I know you have a place where you can correct individual characters, but I'm not sure how to do that. Could you correct it, or perhaps guide me?

Thanks!

Traditional 後面 uses simplified 后

Present in inkstone version 0.1.3

Stroke order error in 兜

Running 'Make Me a Hanzi Demo'

Apologies if I missed this somewhere, but how does one run the 'Make Me a Hanzi Demo' ?

I assume the code is on the 'tool' or 'demo' branches, but a quick step-by-step would be much appreciated...

background lines

is there any easy way to get rid of the background lines?

MakeMeAGlyph

It may sound silly...

But your technology could be used on simpler alphabets such as Latin, Greek, Arabic, Thai, etc. These alphabets may themselves have variants: lower case, upper case, cursive.

Downstream uses such as Hanzi-writer could be helpful for these languages as well. (cc @chanind )

As for doing it, most of them would require a modest workload, still to be defined (find a font, vectorise, break into strokes, define the path), but I wanted to raise the idea.

Incomplete components: 㳽

Hey, I'm trying to correct the stroke order for 爾 (swap fifth and sixth strokes) and I'm running in to a small issue.

I'm able to correct the stroke order, which marks a few downstream characters as incomplete. I'm able to complete all but one, 瀰, which gets stuck on Incomplete components: 㳽.

㳽 is not in either font, and I don't think it's part of another character, so I'm not sure how to proceed. Any suggestions?

Incorrect stroke order for 樂 and 灣

For these two characters: 樂 and 灣, I believe the central component (i.e. 白 and 言) should come before the components either side. See here.

SVG y-axes

Hi thanks for the fantastic library!!

Just wondering why the svg y axes are inverted. It would be nice to render them without applying a transform each time.
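Given the coordinate system in the README (upper-left at (0, 900), y decreasing downwards), the documented transform maps a point (x, y) to (x, 900 - y). One way to avoid the transform at render time would be to pre-flip the data, as in this Python sketch. It assumes path data uses only the absolute commands M, L, Q, C and Z with plain x y number pairs; that is an assumption for the example, not a statement about every path in graphics.txt.

```python
import re

TOKEN = re.compile(r"[A-Za-z]|-?\d+(?:\.\d+)?")
SUPPORTED_COMMANDS = {"M", "L", "Q", "C", "Z"}


def flip_point(x, y):
    """Map a data-space point to ordinary top-left-origin SVG space."""
    return x, 900 - y


def flip_path(d):
    """Flip every y coordinate of a path made of absolute M/L/Q/C/Z commands."""
    out = []
    index_in_command = 0
    for token in TOKEN.findall(d):
        if token.isalpha():
            if token not in SUPPORTED_COMMANDS:
                raise ValueError(f"unsupported path command: {token}")
            out.append(token)
            index_in_command = 0
            continue
        value = float(token) if "." in token else int(token)
        if index_in_command % 2 == 1:
            # Arguments alternate x, y; odd positions are y coordinates.
            value = 900 - value
        out.append(str(value))
        index_in_command += 1
    return " ".join(out)


print(flip_path("M 0 900 L 1024 -124 Z"))
print([flip_point(x, y) for x, y in [[121, 393], [920, 401]]])
```

Paths flipped this way can be drawn in a plain 0 0 1024 1024 viewBox without the scale and translate group; medians can be converted point by point with flip_point.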

Thanks

SVGs are missing paths

In the commit "Update SVGs to use capped strokes", it looks like path data was removed from the svgs. Now, the svgs are blank.

Also, it might be helpful to include your script that generates the svg files.

Thanks!

Stroke order error in 兜

Pls check it.

Generate non-animated stroke order diagrams

I find it much easier to work with non-animated stroke order diagrams because I can just take my time and don't have to watch an animation over and over again.
Would it be possible to generate diagrams like this with makemeahanzi?

Strokes for 為 in graphics.txt are actually 爲

In graphics.txt, the strokes for 為 actually produce the archaic 爲. So the character for this entry should be updated to 爲, and a new entry should be created for 為. (the decimal for 為 is 28858, take a look at 28858.svg)

I love this project btw, thanks for all of your hard work! 😄

How to run tool branch in Meteor?

I'm using Windows 10. I checked out the tool branch and installed MongoDB and Meteor, but I cannot run it normally. Can you help me?

http://localhost:3000/

F:\HSK\makemeahanzi-b-tool.meteor\local\build\programs\server\app\app.js:4808
if (error) throw error; // 68
^

Error: ENOENT: no such file or directory, open 'F:\HSK\makemeahanzi-b-tool.meteor\cjklib\characterdecomposition.csv'
at Error (native)
Exited with code: 1
Your application is crashing. Waiting for file change.

Demo site has been down

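Did you enter the points for the Chinese characters by hand, or were they generated by an algorithm?

I recently got a requirement to support digits and English letters, written stroke by stroke in the same way as Chinese characters, and I would like to know how to approach it.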

衢 wrong stroke order

衢 has to be drawn from left to right (4th, 5th and 6th must be moved to the end).

Other characters that have the same kind of decomposition (i.e. ⿲彳*彳) seem ok.

japanese characters

I made a derived project from makemeahanzi called animCJK (https://github.com/parsimonhi/animCJK). It contains svg files for the 2136 "jōyō kanji" in use in Japan and the 3500 "frequently used simplified hanzi".

Svg files of animCJK are completely different from svg files of makemeahanzi. However i made two files (graphicsJa.txt and graphicsZhHans.txt) that have the same format as your graphics.txt. So you can import what was changed very easily if you wish it.

Note that many characters (about one third) are not the same in Japanese and in Chinese even when they share the same unicode (different stroke order, different stroke direction, different glyph, ...). So don't merge the two files without care.

Note also that i recomputed all medians in order to be sure that a stroke-width of 128 is sufficient to cover all stroke shapes.

Hope it helps.

mapping between images and data, plus

is there any way to associate the character data with their corresponding svg? are the svgs named in the same order as they appear in the data files?

How did you generate the SVGs files ?

Hey,

I was wondering how did you generate the SVGs (from the "svgs" folder), from the graphics.txt ? Do you have some kind of script somewhere you could share ?

I tried to match all lines of graphics.txt to files in the svgs folder, but it seems that, when you order the svg files by "number" ASC, some files don't match the corresponding line in graphics.txt.

Regards,
Quentin

Stroke order error in 瑤謠遙

Wrong stroke order in the 𠂊⺀part of 瑤謠遙 and possibly in other characters that contain the same component.

搖 and 鷂 already have the correct stroke order.

Extrapolate stroke caps for overlapping strokes

Thank you for making this incredible library and open-sourcing it! One possible point of improvement:

Make Me a Hanzi currently clips the end of strokes if it's obscured by another overlapping stroke. It makes sense because this data was extracted from a real font, but this isn't ideal for a few reasons:

Since the SVG path points are all rounded to the nearest integer this can result in a really thin gap between strokes, especially visible when the character is drawn at a larger size. For example:
When showing stroke animations, it looks more natural if the full stroke can be drawn first and then a subsequent stroke can be animated on top of the end of a stroke since this is how strokes would be drawn in reality.
If stroke ends can be added, it makes it possible for some interesting interactive effects like breaking apart a character in 3d (http://wenlincdl.com/) or moving parts of the stroke on click/mouseover (http://wenlincdl.com/demos/explorer). These effects would look strange on strokes with clipped ends.

I think it should be possible to add a realistic stroke end to clipped strokes that won't change the way the character looks after all strokes are drawn, but should look natural when the stroke is viewed in isolation and fix the issues brought up above. Maybe it could work by trying to fit a stroke end from other similar-looking strokes such that the fitted end is fully obscured by the stroke drawn on top.

寝: strokes 4, 5, 6 are backward

Add Hanzi Writer to readme.md

Related project : https://chanind.github.io/hanzi-writer/
Quite impressive, as it provides Skritter-like quiz technology.
PS: Your project is awesome !
