2017-osnov-programm's Issues
Tags in wrong column
# sent_id = 1
# text = Cengiz Han.
( "Cenghis Khan" , "Çinggis Haan" ya da doğum adıyla Temuçin ( anlamı demirci ) , Moğolca "Чингис Хаан" ya da "Tengiz" ( anlamı deniz ) , ; d.
1162 – ö.
18 Ağustos 1227 ) , Moğol komutan , hükümdar ve Moğol İmparatorluğu'nun kurucusudur.
1 Cengiz _ _ PROPN _ _ _ _ _
2 Han. PROPN
( _ _ SYM _ _ _ _ _
3 "Cenghis _ _ PROPN _ _ _ _ _
4 Khan" _ _ PROPN _ _ _ _ _
5 , _ _ PUNCT _ _ _ _ _
6 "Çinggis _ _ PROPN _ _ _ _ _
7 Haan" _ _ PROPN _ _ _ _ _
8 ya _ _ SCONJ _ _ _ _ _
9 da _ _ SCONJ _ _ _ _ _
10 doğum _ _ ADJ _ _ _ _ _
11 adıyla _ _ NOUN _ _ _ _ _
12 Temuçin _ _ PROPN _ _ _ _ _
13 _ _ X _ _ _ _ _
14 ( _ _ SYM _ _ _ _ _
Tags should be in the fourth column (UPOS), not the fifth column (XPOS), e.g.
10 doğum _ ADJ _ _ _ _ _ _
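A minimal sketch of building a token line with the tag in the right place, assuming the standard ten CoNLL-U columns. The helper name `conllu_row` is hypothetical and is not taken from the repository's scripts.

```python
# Column order defined by the CoNLL-U format.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def conllu_row(token_id, form, upos="_"):
    """Return one tab-separated, 10-column line with the tag in UPOS."""
    # Start with every column empty ("_"), then fill in what we know.
    row = {name: "_" for name in CONLLU_COLUMNS}
    row["ID"] = str(token_id)
    row["FORM"] = form
    row["UPOS"] = upos
    return "\t".join(row[name] for name in CONLLU_COLUMNS)


print(conllu_row(10, "doğum", "ADJ"))
```

Filling columns by name rather than by position makes it harder to shift a value into the neighbouring column by accident.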
Abbreviations in segmenter
You probably don't want to treat "ö." and "d." as end-of-sentence markers.
Cengiz Han ("Cenghis Khan", "Çinggis Haan" ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: "Чингис Хаан" ya da "Tengiz" (anlamı: deniz), ; d. 1162 – ö. 18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur.
This should be one sentence, not three.
Tokeniser doesn't split punctuation
The current tokeniser outputs:
# sent_id = 1
# text =
Cengiz Han:
Cengiz Han ("Cenghis Khan", "Çinggis Haan" ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: "Чингис Хаан" ya da "Tengiz" (anlamı: deniz), ; d.
1162 – ö.
18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur.
1
Cengiz _ _ _ _ _ _ _ _
2 Han:
Cengiz _ _ _ _ _ _ _ _
3 Han _ _ _ _ _ _ _ _
4 ("Cenghis _ _ _ _ _ _ _ _
5 Khan", _ _ _ _ _ _ _ _
6 "Çinggis _ _ _ _ _ _ _ _
7 Haan" _ _ _ _ _ _ _ _
8 ya _ _ _ _ _ _ _ _
9 da _ _ _ _ _ _ _ _
10 doğum _ _ _ _ _ _ _ _
11 adıyla _ _ _ _ _ _ _ _
12 Temuçin _ _ _ _ _ _ _ _
13 (anlamı: _ _ _ _ _ _ _ _
14 demirci), _ _ _ _ _ _ _ _
15 Moğolca: _ _ _ _ _ _ _ _
16 "Чингис _ _ _ _ _ _ _ _
17 Хаан" _ _ _ _ _ _ _ _
18 ya _ _ _ _ _ _ _ _
19 da _ _ _ _ _ _ _ _
20 "Tengiz" _ _ _ _ _ _ _ _
21 (anlamı: _ _ _ _ _ _ _ _
22 deniz), _ _ _ _ _ _ _ _
23 ; _ _ _ _ _ _ _ _
24 d.
1162 _ _ _ _ _ _ _ _
25 – _ _ _ _ _ _ _ _
26 ö.
18 _ _ _ _ _ _ _ _
27 Ağustos _ _ _ _ _ _ _ _
28 1227), _ _ _ _ _ _ _ _
29 Moğol _ _ _ _ _ _ _ _
30 komutan, _ _ _ _ _ _ _ _
31 hükümdar _ _ _ _ _ _ _ _
32 ve _ _ _ _ _ _ _ _
33 Moğol _ _ _ _ _ _ _ _
34 İmparatorluğu'nun _ _ _ _ _ _ _ _
35 kurucusudur. _ _ _ _ _ _ _ _
Errors:
- Split '.' from a token unless the token is an abbreviation (see token 35)
- Split these characters from tokens: ( " ) , :
- Extra newline after '1' and '2'
- Extra newline between the comments and the sentence
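The splitting rules listed above can be sketched as follows. This is an illustration under assumptions, not the code in tokeniser.py: the names `SPLIT_CHARS`, `ABBREVIATIONS` and `tokenise` are hypothetical, and the abbreviation set holds only the two items mentioned in these issues.

```python
# Characters the issue asks to split off: ( " ) , :
SPLIT_CHARS = '()",:'
# Assumed abbreviation list; a period is kept on these tokens.
ABBREVIATIONS = {"d.", "ö."}


def tokenise(sentence):
    """Split a sentence into tokens, detaching punctuation from words."""
    tokens = []
    # str.split() with no argument treats newlines like spaces, so stray
    # line breaks in the input never end up inside a token.
    for chunk in sentence.split():
        leading, trailing = [], []
        # Peel punctuation off the front of the chunk.
        while chunk and chunk[0] in SPLIT_CHARS:
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Peel punctuation and a final '.' off the end, stopping as soon
        # as what remains is a known abbreviation.
        while (chunk and chunk not in ABBREVIATIONS
               and chunk[-1] in SPLIT_CHARS + "."):
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(tokenise('Han ("Cenghis Khan", d. 1162), kurucusudur.'))
```

Only leading and trailing characters are peeled, so word-internal characters such as the apostrophe in İmparatorluğu'nun are left alone.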
Transliterator doesn't transliterate
$ cat ../corpus/wiki.10k.txt | python3 segment.py | python3 tokeniser.py | python3 transliterate.py
# sent_id = 1
# text = # sent_id = 1
# text =
Cengiz Han:
Cengiz Han ("Cenghis Khan", "Çinggis Haan" ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: "Чингис Хаан" ya da "Tengiz" (anlamı: deniz), ; d.
1162 – ö.
18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur.
1
Cengiz _ _ _ _ _ _ _ _
2 Han:
Cengiz _ _ _ _ _ _ _ _
3 Han _ _ _ _ _ _ _ _
4 ("Cenghis _ _ _ _ _ _ _ _
5 Khan", _ _ _ _ _ _ _ _
6 "Çinggis _ _ _ _ _ _ _ _
7 Haan" _ _ _ _ _ _ _ _
8 ya _ _ _ _ _ _ _ _
9 da _ _ _ _ _ _ _ _
10 doğum _ _ _ _ _ _ _ _
11 adıyla _ _ _ _ _ _ _ _
12 Temuçin _ _ _ _ _ _ _ _
13 (anlamı: _ _ _ _ _ _ _ _
14 demirci), _ _ _ _ _ _ _ _
15 Moğolca: _ _ _ _ _ _ _ _
16 "Чингис _ _ _ _ _ _ _ _
17 Хаан" _ _ _ _ _ _ _ _
The transliteration should appear in the 10th column (MISC), like:
10 doğum _ _ _ _ _ _ _ Translit=doʲum
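A rough sketch of filling that column, assuming tab-separated token lines. The one-entry `TRANSLIT_TABLE` is inferred from the single example above (doğum to doʲum) and is only a placeholder; the function names are hypothetical and not taken from transliterate.py. Comment lines are passed through untouched.

```python
# Placeholder table inferred from the single example in this issue;
# a real table would have to cover the whole alphabet.
TRANSLIT_TABLE = {"ğ": "ʲ"}


def transliterate(form):
    """Map each character through the table, keeping unknown ones as-is."""
    return "".join(TRANSLIT_TABLE.get(ch, ch) for ch in form)


def add_translit(line):
    """Put Translit=... in the 10th column of a token line.
    Comments, blank lines and malformed lines are returned unchanged."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # not a well-formed token line; leave it alone
    entry = "Translit=" + transliterate(cols[1])
    # Keep any existing MISC content, separated by "|".
    cols[9] = entry if cols[9] == "_" else cols[9] + "|" + entry
    return "\t".join(cols)


print(add_translit("10\tdoğum\t_\t_\t_\t_\t_\t_\t_\t_"))
```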
Blank tokens
Tokens 13 and 18 are blank; there should be no blank tokens.
11 adıyla _ _ NOUN _ _ _ _ _
12 Temuçin _ _ PROPN _ _ _ _ _
13 _ _ X _ _ _ _ _
14 ( _ _ SYM _ _ _ _ _
15 anlamı _ _ NOUN _ _ _ _ _
16 demirci _ _ PROPN _ _ _ _ _
17 ) _ _ SYM _ _ _ _ _
18 _ _ X _ _ _ _ _
19 , _ _ PUNCT _ _ _ _ _
20 Moğolca _ _ PROPN _ _ _ _ _
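One possible fix, sketched under the assumption that the blank forms are empty strings produced while splitting, is to drop them before the token IDs are assigned. The function name `number_tokens` is hypothetical.

```python
def number_tokens(forms):
    """Return (id, form) pairs, skipping empty or whitespace-only forms."""
    kept = [form for form in forms if form.strip()]
    # Number only the tokens that survive, so IDs stay consecutive.
    return list(enumerate(kept, start=1))


print(number_tokens(["Temuçin", "", "(", "anlamı"]))
```

Filtering before numbering avoids gaps in the ID sequence.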