2017-osnov-programm's People

Contributors

sevilaybayatli

2017-osnov-programm's Issues

Tags in wrong column

# sent_id = 1
# text = Cengiz Han.
( "Cenghis Khan" , "Çinggis Haan" ya da doğum adıyla Temuçin  ( anlamı demirci )  , Moğolca "Чингис Хаан" ya da "Tengiz"  ( anlamı deniz )  , ; d.
1162 – ö.
18 Ağustos 1227 )  , Moğol komutan , hükümdar ve Moğol İmparatorluğu'nun kurucusudur.

1	Cengiz	_	_	PROPN	_	_	_	_	_
2	Han.                    PROPN
(	_	_	SYM	_	_	_	_	_
3	"Cenghis	_	_	PROPN	_	_	_	_	_
4	Khan"	_	_	PROPN	_	_	_	_	_
5	,	_	_	PUNCT	_	_	_	_	_
6	"Çinggis	_	_	PROPN	_	_	_	_	_
7	Haan"	_	_	PROPN	_	_	_	_	_
8	ya	_	_	SCONJ	_	_	_	_	_
9	da	_	_	SCONJ	_	_	_	_	_
10	doğum	_	_	ADJ	_	_	_	_	_
11	adıyla	_	_	NOUN	_	_	_	_	_
12	Temuçin	_	_	PROPN	_	_	_	_	_
13		_	_	X	_	_	_	_	_
14	(	_	_	SYM	_	_	_	_	_

The part-of-speech tags should be in the fourth column (UPOS), not the fifth (XPOS), e.g.

10	doğum	_	ADJ	_	_	_	_	_	_
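One way to avoid miscounting columns is to build each row from named fields rather than by position. The following is a minimal sketch, assuming the ten standard CoNLL-U columns; the helper name `conllu_line` is hypothetical and not taken from the project's scripts.

```python
# Column order as defined by the CoNLL-U format.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def conllu_line(token_id, form, upos="_"):
    """Build one tab-separated CoNLL-U row with the tag in the UPOS column."""
    row = dict.fromkeys(COLUMNS, "_")  # every field starts as the empty marker
    row["ID"] = str(token_id)
    row["FORM"] = form
    row["UPOS"] = upos
    return "\t".join(row[name] for name in COLUMNS)

print(conllu_line(10, "doğum", "ADJ"))
```

Because the tag is assigned by field name, moving it between columns is a one-word change rather than a matter of counting underscores.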

Abbreviations in segmenter

You probably don't want to treat the abbreviations "ö." and "d." as end-of-sentence markers.

Cengiz Han ("Cenghis Khan", "Çinggis Haan" ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: "Чингис Хаан" ya da "Tengiz" (anlamı: deniz), ; d. 1162 – ö. 18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur.

This should be one sentence, not three.
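A minimal sketch of an abbreviation-aware segmenter follows. It assumes sentences end at `.`, `!` or `?` followed by whitespace, and that a small abbreviation list is enough; the names `ABBREVIATIONS` and `segment` are hypothetical, and the second sentence in the sample text is made up for illustration.

```python
import re

# Hypothetical abbreviation list (assumption); extend it for the real corpus.
ABBREVIATIONS = {"d.", "ö."}

def segment(text):
    """Split text after ., ! or ? followed by whitespace, except after abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start:match.end()].strip()
        # Last whitespace-delimited word, ignoring opening brackets and quotes.
        last_word = candidate.split()[-1].lstrip("(\"'")
        if last_word in ABBREVIATIONS:
            continue  # an abbreviation, not a sentence boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

# "İkinci cümle." is an invented second sentence, added only to show a real split.
text = "Temuçin (; d. 1162 – ö. 18 Ağustos 1227), Moğol komutan. İkinci cümle."
for sentence in segment(text):
    print(sentence)
```

A fixed list will miss abbreviations it has not seen, so this is a starting point rather than a complete solution.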

Tokeniser doesn't split punctuation

The current tokeniser outputs:

# sent_id = 1
# text = 
Cengiz Han:
Cengiz Han ("Cenghis Khan", "Çinggis Haan" ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: "Чингис Хаан" ya da "Tengiz" (anlamı: deniz), ; d.
1162 – ö.
18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur.

1	
Cengiz	_	_	_	_	_	_	_	_
2	Han:
Cengiz	_	_	_	_	_	_	_	_
3	Han	_	_	_	_	_	_	_	_
4	("Cenghis	_	_	_	_	_	_	_	_
5	Khan",	_	_	_	_	_	_	_	_
6	"Çinggis	_	_	_	_	_	_	_	_
7	Haan"	_	_	_	_	_	_	_	_
8	ya	_	_	_	_	_	_	_	_
9	da	_	_	_	_	_	_	_	_
10	doğum	_	_	_	_	_	_	_	_
11	adıyla	_	_	_	_	_	_	_	_
12	Temuçin	_	_	_	_	_	_	_	_
13	(anlamı:	_	_	_	_	_	_	_	_
14	demirci),	_	_	_	_	_	_	_	_
15	Moğolca:	_	_	_	_	_	_	_	_
16	"Чингис	_	_	_	_	_	_	_	_
17	Хаан"	_	_	_	_	_	_	_	_
18	ya	_	_	_	_	_	_	_	_
19	da	_	_	_	_	_	_	_	_
20	"Tengiz"	_	_	_	_	_	_	_	_
21	(anlamı:	_	_	_	_	_	_	_	_
22	deniz),	_	_	_	_	_	_	_	_
23	;	_	_	_	_	_	_	_	_
24	d.
1162	_	_	_	_	_	_	_	_
25	–	_	_	_	_	_	_	_	_
26	ö.
18	_	_	_	_	_	_	_	_
27	Ağustos	_	_	_	_	_	_	_	_
28	1227),	_	_	_	_	_	_	_	_
29	Moğol	_	_	_	_	_	_	_	_
30	komutan,	_	_	_	_	_	_	_	_
31	hükümdar	_	_	_	_	_	_	_	_
32	ve	_	_	_	_	_	_	_	_
33	Moğol	_	_	_	_	_	_	_	_
34	İmparatorluğu'nun	_	_	_	_	_	_	_	_
35	kurucusudur.	_	_	_	_	_	_	_	_

Errors:

  • Split '.' from a token unless the token is an abbreviation (see token 35)
  • Split these characters off as separate tokens: ( " ) , :
  • Extra newline in tokens 1 and 2
  • Extra newline between the comments and the sentence text
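The splitting rules listed above could be sketched roughly as follows. This is an assumption-laden illustration, not the project's tokeniser: `ABBREVIATIONS`, `TOKEN_RE` and `tokenise` are hypothetical names, and keeping an apostrophe-joined suffix (as in İmparatorluğu'nun) inside one token is a design choice made here.

```python
import re

# Hypothetical abbreviation list (assumption); these keep their trailing full stop.
ABBREVIATIONS = {"d.", "ö."}

# Either a run of word characters (apostrophes allowed inside),
# or any single non-space character such as ( " ) , : .
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*|\S")

def tokenise(sentence):
    """Split a sentence into word and punctuation tokens."""
    tokens = []
    # split() with no argument splits on any whitespace, including newlines,
    # so stray line breaks never end up inside a token.
    for chunk in sentence.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens

print(tokenise('Cengiz Han ("Cenghis Khan", d. 1162), kurucusudur.'))
```

Splitting on whitespace first and then on punctuation inside each chunk keeps the abbreviation check simple, since an abbreviation always arrives as a whole chunk.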

Transliterator doesn't transliterate

$ cat ../corpus/wiki.10k.txt | python3 segment.py | python3 tokeniser.py | python3 transliterate.py
# sent_id = 1
# text = # sent_id = 1
# text = 
Cengiz Han:
Cengiz Han ("Cenghis Khan", "Çinggis Haan" ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: "Чингис Хаан" ya da "Tengiz" (anlamı: deniz), ; d.
1162 – ö.
18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur.

1	
Cengiz	_	_	_	_	_	_	_	_
2	Han:
Cengiz	_	_	_	_	_	_	_	_
3	Han	_	_	_	_	_	_	_	_
4	("Cenghis	_	_	_	_	_	_	_	_
5	Khan",	_	_	_	_	_	_	_	_
6	"Çinggis	_	_	_	_	_	_	_	_
7	Haan"	_	_	_	_	_	_	_	_
8	ya	_	_	_	_	_	_	_	_
9	da	_	_	_	_	_	_	_	_
10	doğum	_	_	_	_	_	_	_	_
11	adıyla	_	_	_	_	_	_	_	_
12	Temuçin	_	_	_	_	_	_	_	_
13	(anlamı:	_	_	_	_	_	_	_	_
14	demirci),	_	_	_	_	_	_	_	_
15	Moğolca:	_	_	_	_	_	_	_	_
16	"Чингис	_	_	_	_	_	_	_	_
17	Хаан"	_	_	_	_	_	_	_	_

The transliteration should appear in the 10th column (MISC), like:

10	doğum	_	_	_	_	_	_	_	Translit=doʲum
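A minimal sketch of writing the transliteration into the 10th column follows. It assumes the input rows are already well-formed CoNLL-U with ten tab-separated fields; `add_translit`, `TOY_TABLE` and `toy_transliterate` are hypothetical names, and the toy lookup merely stands in for the real transliterator.

```python
def add_translit(line, transliterate):
    """Store Translit=<value> in the 10th (MISC) column of a token row."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 10:
        return line  # comment or blank line: pass through unchanged
    value = "Translit=" + transliterate(cols[1])
    # Keep any existing MISC attributes and append with the '|' separator.
    cols[9] = value if cols[9] == "_" else cols[9] + "|" + value
    return "\t".join(cols)

# Toy lookup standing in for the real transliterator (assumption).
TOY_TABLE = {"doğum": "doʲum"}

def toy_transliterate(form):
    return TOY_TABLE.get(form, form)

print(add_translit("10\tdoğum\t_\t_\t_\t_\t_\t_\t_\t_", toy_transliterate))
print(add_translit("# sent_id = 1", toy_transliterate))
```

Passing comment lines through untouched also avoids re-wrapping the `# sent_id` and `# text` headers that appear in the output above.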

Blank tokens

Tokens 13 and 18 are blank; there should be no blank tokens.

11	adıyla	_	_	NOUN	_	_	_	_	_
12	Temuçin	_	_	PROPN	_	_	_	_	_
13		_	_	X	_	_	_	_	_
14	(	_	_	SYM	_	_	_	_	_
15	anlamı	_	_	NOUN	_	_	_	_	_
16	demirci	_	_	PROPN	_	_	_	_	_
17	)	_	_	SYM	_	_	_	_	_
18		_	_	X	_	_	_	_	_
19	,	_	_	PUNCT	_	_	_	_	_
20	Moğolca	_	_	PROPN	_	_	_	_	_
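One possible cause, though this is only a guess from the output, is splitting on a single space when the text contains two spaces in a row, which yields empty strings. Whatever the cause, blank forms can be filtered out before numbering. A minimal sketch, with the hypothetical helper name `drop_blank_tokens`:

```python
def drop_blank_tokens(forms):
    """Remove empty or whitespace-only forms and number the rest from 1."""
    kept = [form for form in forms if form.strip()]
    return list(enumerate(kept, start=1))

# "Temuçin  (" contains a double space, so split(" ") produces an empty string.
forms = "adıyla Temuçin  ( anlamı".split(" ")
for token_id, form in drop_blank_tokens(forms):
    print(token_id, form, sep="\t")
```

Numbering after filtering keeps the token IDs consecutive, which CoNLL-U requires.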
