
Pragmatic Segmenter


Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.

Install

Ruby: supports Ruby 2.1.5 and above

gem install pragmatic_segmenter

Ruby on Rails: add this line to your application’s Gemfile:

gem 'pragmatic_segmenter'

Usage

  • If no language is specified, the library will default to English.
  • To specify a language, use its two-character ISO 639-1 code.
text = "Hello world. My name is Mr. Smith. I work for the U.S. Government and I live in the U.S. I live in New York."
ps = PragmaticSegmenter::Segmenter.new(text: text)
ps.segment
# => ["Hello world.", "My name is Mr. Smith.", "I work for the U.S. Government and I live in the U.S.", "I live in New York."]

# Specify a language
text = "Այսօր երկուշաբթի է: Ես գնում եմ աշխատանքի:"
ps = PragmaticSegmenter::Segmenter.new(text: text, language: 'hy')
ps.segment
# => ["Այսօր երկուշաբթի է:", "Ես գնում եմ աշխատանքի:"]

# Specify a PDF document type
text = "This is a sentence\ncut off in the middle because pdf."
ps = PragmaticSegmenter::Segmenter.new(text: text, language: 'en', doc_type: 'pdf')
ps.segment
# => ["This is a sentence cut off in the middle because pdf."]

# Turn off text cleaning and preprocessing
text = "This is a sentence\ncut off in the middle because pdf."
ps = PragmaticSegmenter::Segmenter.new(text: text, language: 'en', doc_type: 'pdf', clean: false)
ps.segment
# => ["This is a sentence cut", "off in the middle because pdf."]

# Text cleaning and preprocessing only
text = "This is a sentence\ncut off in the middle because pdf."
ps = PragmaticSegmenter::Cleaner.new(text: text, doc_type: 'pdf')
ps.clean
# => "This is a sentence cut off in the middle because pdf."

Live Demo

Try out a live demo of Pragmatic Segmenter in the browser.

Background

According to Wikipedia, sentence boundary disambiguation (aka sentence boundary detection, sentence segmentation) is defined as:

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.

The goal of Pragmatic Segmenter is to provide a "real-world" segmenter that works out of the box across many languages and does a reasonable job when the format and domain of the input text are unknown. Pragmatic Segmenter does not use any machine-learning techniques and thus does not require training data.

Pragmatic Segmenter aims to improve on other segmentation engines in two main areas:

  1. Language support (most segmentation tools only focus on English)
  2. Text cleaning and preprocessing

Pragmatic Segmenter is opinionated and made for the explicit purpose of segmenting texts to create translation memories. Therefore, a parenthetical within a sentence is kept as one segment, even if it technically contains two or more sentences, in order to maintain coherence. The algorithm is also conservative: if it comes across an ambiguous sentence boundary, it will ignore it rather than splitting.

What do you mean by opinionated?

Pragmatic Segmenter is built specifically for segmenting texts for use in translation (and translation memory) related applications. Therefore, Pragmatic Segmenter takes a stance on some formatting and segmentation gray areas with the goal of improving segmentation for that purpose. Some examples:

  • Removes 'table of contents' style long strings of periods ('............')
  • Keeps parentheticals and quotations (including those nested within a sentence) as one segment for clarity, even though the segment may technically contain multiple grammatical sentences
  • Strips out any xhtml code
  • Stays conservative in cases where the sentence boundary is ambiguous and Pragmatic Segmenter does not have a built-in rule

There is an option to turn off text cleaning and preprocessing if you so choose.

The Golden Rules

The Golden Rules are a set of tests I developed that can be run through a segmenter to check its accuracy on edge cases. Most of the papers cited below in Segmentation Papers and Books use either the WSJ corpus or the Brown corpus from the Penn Treebank to test their segmentation algorithms. In my opinion there are two limitations to using these corpora:

  1. The corpora may be too expensive for some people ($1,700).
  2. The majority of the sentences in the corpora are sentences that end with a regular word followed by a period, thus testing the same thing over and over again.

In the Brown Corpus 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer with abbreviations and only 83% [53% according to Gale and Church, 1991] of sentences end with a regular word followed by a period.

Andrei Mikheev - Periods, Capitalized Words, etc.

Therefore, I created a set of distinct edge cases on which to compare segmentation tools. As most segmentation tools have very high accuracy, what is really important to test, in my opinion, is how a segmenter handles the edge cases - not whether it can segment 20,000 sentences that end with a regular word followed by a period. I have named these example tests the "Golden Rules". This list is by no means complete and will evolve and expand over time. If you would like to contribute to (or complain about) the test set, please open an issue.

The Holy Grail of sentence segmentation appears to be Golden Rule #18, as no segmenter I tested was able to correctly segment that text. The difficulty is that an abbreviation (in this case a.m./A.M./p.m./P.M.) followed by a capitalized abbreviation (such as Mr. or Mrs.) or by a proper noun such as a name can mark either a sentence boundary or a non sentence boundary.
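To see why this rule is hard, consider a naive splitter that breaks after sentence-ending punctuation followed by a capital letter. This is a common baseline heuristic, not Pragmatic Segmenter's algorithm; the sketch below only illustrates how punctuation-only rules shatter the Golden Rule #18 text:

```ruby
# Naive baseline: split after ., ?, or ! whenever the next word is
# capitalized. A minimal sketch, NOT part of this gem.
naive_split = ->(text) { text.split(/(?<=[.?!])\s+(?=[A-Z])/) }

text = "At 5 a.m. Mr. Smith went to the bank. " \
       "He left the bank at 6 P.M. Mr. Smith then went to the store."

segments = naive_split.call(text)
# The correct segmentation has 3 sentences; the naive rule produces 6,
# breaking after "a.m.", "Mr.", and "P.M." indiscriminately.
puts segments.length # => 6
```

The rule cannot tell that "a.m. Mr." is not a boundary while "P.M. Mr." is, which is exactly the ambiguity Golden Rule #18 tests.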

Download the Golden Rules: [txt | Ruby RSpec]
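In spirit, scoring a segmenter against the Golden Rules amounts to exact-match comparison of its output against the expected arrays. Here is a minimal, hypothetical harness; the two-rule data set and the trivial splitter are illustrative only, not the gem's actual test suite:

```ruby
# Each Golden Rule pairs an input text with its expected segmentation.
# A segmenter passes a rule only if it reproduces the array exactly.
GOLDEN_RULES = [
  { text: "Hello World. My name is Jonas.",
    expected: ["Hello World.", "My name is Jonas."] },  # Rule 1
  { text: "My name is Jonas E. Smith.",
    expected: ["My name is Jonas E. Smith."] }          # Rule 4
].freeze

def golden_rule_score(rules)
  passed = rules.count { |rule| yield(rule[:text]) == rule[:expected] }
  (100.0 * passed / rules.length).round(2)
end

# A trivial period splitter passes Rule 1 but fails Rule 4, because it
# breaks after the one-letter abbreviation "E.".
score = golden_rule_score(GOLDEN_RULES) { |t| t.split(/(?<=\.)\s+/) }
puts score # => 50.0
```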

Golden Rules (English)

1.) Simple period to end sentence

Hello World. My name is Jonas.
=> ["Hello World.", "My name is Jonas."]

2.) Question mark to end sentence

What is your name? My name is Jonas.
=> ["What is your name?", "My name is Jonas."]

3.) Exclamation point to end sentence

There it is! I found it.
=> ["There it is!", "I found it."]

4.) One letter upper case abbreviations

My name is Jonas E. Smith.
=> ["My name is Jonas E. Smith."]

5.) One letter lower case abbreviations

Please turn to p. 55.
=> ["Please turn to p. 55."]

6.) Two letter lower case abbreviations in the middle of a sentence

Were Jane and co. at the party?
=> ["Were Jane and co. at the party?"]

7.) Two letter upper case abbreviations in the middle of a sentence

They closed the deal with Pitt, Briggs & Co. at noon.
=> ["They closed the deal with Pitt, Briggs & Co. at noon."]

8.) Two letter lower case abbreviations at the end of a sentence

Let's ask Jane and co. They should know.
=> ["Let's ask Jane and co.", "They should know."]

9.) Two letter upper case abbreviations at the end of a sentence

They closed the deal with Pitt, Briggs & Co. It closed yesterday.
=> ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]

10.) Two letter (prepositive) abbreviations

I can see Mt. Fuji from here.
=> ["I can see Mt. Fuji from here."]

11.) Two letter (prepositive & postpositive) abbreviations

St. Michael's Church is on 5th st. near the light.
=> ["St. Michael's Church is on 5th st. near the light."]

12.) Possessive two letter abbreviations

That is JFK Jr.'s book.
=> ["That is JFK Jr.'s book."]

13.) Multi-period abbreviations in the middle of a sentence

I visited the U.S.A. last year.
=> ["I visited the U.S.A. last year."]

14.) Multi-period abbreviations at the end of a sentence

I live in the E.U. How about you?
=> ["I live in the E.U.", "How about you?"]

15.) U.S. as sentence boundary

I live in the U.S. How about you?
=> ["I live in the U.S.", "How about you?"]

16.) U.S. as non sentence boundary with next word capitalized

I work for the U.S. Government in Virginia.
=> ["I work for the U.S. Government in Virginia."]

17.) U.S. as non sentence boundary

I have lived in the U.S. for 20 years.
=> ["I have lived in the U.S. for 20 years."]

18.) A.M. / P.M. as non sentence boundary and sentence boundary

At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.
=> ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]

19.) Number as non sentence boundary

She has $100.00 in her bag.
=> ["She has $100.00 in her bag."]

20.) Number as sentence boundary

She has $100.00. It is in her bag.
=> ["She has $100.00.", "It is in her bag."]

21.) Parenthetical inside sentence

He teaches science (He previously worked for 5 years as an engineer.) at the local University.
=> ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]

22.) Email addresses

Her email is [email protected]. I sent her an email.
=> ["Her email is [email protected].", "I sent her an email."]

23.) Web addresses

The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.
=> ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]

24.) Single quotations inside sentence

She turned to him, 'This is great.' she said.
=> ["She turned to him, 'This is great.' she said."]

25.) Double quotations inside sentence

She turned to him, "This is great." she said.
=> ["She turned to him, \"This is great.\" she said."]

26.) Double quotations at the end of a sentence

She turned to him, \"This is great.\" She held the book out to show him.
=> ["She turned to him, \"This is great.\"", "She held the book out to show him."]

27.) Double punctuation (exclamation point)

Hello!! Long time no see.
=> ["Hello!!", "Long time no see."]

28.) Double punctuation (question mark)

Hello?? Who is there?
=> ["Hello??", "Who is there?"]

29.) Double punctuation (exclamation point / question mark)

Hello!? Is that you?
=> ["Hello!?", "Is that you?"]

30.) Double punctuation (question mark / exclamation point)

Hello?! Is that you?
=> ["Hello?!", "Is that you?"]

31.) List (period followed by parens and no period to end item)

1.) The first item 2.) The second item
=> ["1.) The first item", "2.) The second item"]

32.) List (period followed by parens and period to end item)

1.) The first item. 2.) The second item.
=> ["1.) The first item.", "2.) The second item."]

33.) List (parens and no period to end item)

1) The first item 2) The second item
=> ["1) The first item", "2) The second item"]

34.) List (parens and period to end item)

1) The first item. 2) The second item.
=> ["1) The first item.", "2) The second item."]

35.) List (period to mark list and no period to end item)

1. The first item 2. The second item
=> ["1. The first item", "2. The second item"]

36.) List (period to mark list and period to end item)

1. The first item. 2. The second item.
=> ["1. The first item.", "2. The second item."]

37.) List with bullet

• 9. The first item • 10. The second item
=> ["• 9. The first item", "• 10. The second item"]

38.) List with hyphen

⁃9. The first item ⁃10. The second item
=> ["⁃9. The first item", "⁃10. The second item"]

39.) Alphabetical list

a. The first item b. The second item c. The third list item
=> ["a. The first item", "b. The second item", "c. The third list item"]

40.) Errant newline in the middle of a sentence (PDF)

This is a sentence\ncut off in the middle because pdf.
=> ["This is a sentence\ncut off in the middle because pdf."]

41.) Errant newline in the middle of a sentence

It was a cold \nnight in the city.
=> ["It was a cold night in the city."]

42.) Lower case list separated by newline

features\ncontact manager\nevents, activities\n
=> ["features", "contact manager", "events, activities"]

43.) Geo Coordinates

You can find it at N°. 1026.253.553. That is where the treasure is.
=> ["You can find it at N°. 1026.253.553.", "That is where the treasure is."]

44.) Named entities with an exclamation point

She works at Yahoo! in the accounting department.
=> ["She works at Yahoo! in the accounting department."]

45.) I as a sentence boundary and I as an abbreviation

We make a good team, you and I. Did you see Albert I. Jones yesterday?
=> ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]

46.) Ellipsis at end of quotation

Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”
=> ["Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”"]

47.) Ellipsis with square brackets

"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).
=> ["\"Bohr [...] used the analogy of parallel stairways [...]\" (Smith 55)."]

48.) Ellipsis as sentence boundary (standard ellipsis rules)

If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.
=> ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]

49.) Ellipsis as sentence boundary (non-standard ellipsis rules)

I never meant that.... She left the store.
=> ["I never meant that....", "She left the store."]

50.) Ellipsis as non sentence boundary

I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.
=> ["I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."]

51.) 4-dot ellipsis

One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .
=> ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."]

52.) No whitespace in between sentences Credit: Don_Patrick

Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.
=> ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]

Golden Rules (German)

1.) Quotation at end of sentence

„Ich habe heute keine Zeit“, sagte die Frau und flüsterte leise: „Und auch keine Lust.“ Wir haben 1.000.000 Euro.
=> ["„Ich habe heute keine Zeit“, sagte die Frau und flüsterte leise: „Und auch keine Lust.“", "Wir haben 1.000.000 Euro."]

2.) Abbreviations

Es gibt jedoch einige Vorsichtsmaßnahmen, die Du ergreifen kannst, z. B. ist es sehr empfehlenswert, dass Du Dein Zuhause von allem Junkfood befreist.
=> ["Es gibt jedoch einige Vorsichtsmaßnahmen, die Du ergreifen kannst, z. B. ist es sehr empfehlenswert, dass Du Dein Zuhause von allem Junkfood befreist."]

3.) Numbers

Was sind die Konsequenzen der Abstimmung vom 12. Juni?
=> ["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"]

4.) Cardinal numbers at end of sentence Credit: Dr. Michael Ustaszewski

Die Information steht auf Seite 12. Dort kannst du nachlesen.
=> ["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."]

Golden Rules (Japanese)

1.) Simple period to end sentence

これはペンです。それはマーカーです。
=> ["これはペンです。", "それはマーカーです。"]

2.) Question mark to end sentence

それは何ですか?ペンですか?
=> ["それは何ですか?", "ペンですか?"]

3.) Exclamation point to end sentence

良かったね!すごい!
=> ["良かったね!", "すごい!"]

4.) Quotation

自民党税制調査会の幹部は、「引き下げ幅は3.29%以上を目指すことになる」と指摘していて、今後、公明党と合意したうえで、30日に決定する与党税制改正大綱に盛り込むことにしています。
=> ["自民党税制調査会の幹部は、「引き下げ幅は3.29%以上を目指すことになる」と指摘していて、今後、公明党と合意したうえで、30日に決定する与党税制改正大綱に盛り込むことにしています。"]

5.) Errant newline in the middle of a sentence

これは父の\n家です。
=> ["これは父の家です。"]

Golden Rules (Arabic)

1.) Regular punctuation

سؤال وجواب: ماذا حدث بعد الانتخابات الايرانية؟ طرح الكثير من التساؤلات غداة ظهور نتائج الانتخابات الرئاسية الايرانية التي أججت مظاهرات واسعة واعمال عنف بين المحتجين على النتائج ورجال الامن. يقول معارضو الرئيس الإيراني إن الطريقة التي اعلنت بها النتائج كانت مثيرة للاستغراب.
=> ["سؤال وجواب:", "ماذا حدث بعد الانتخابات الايرانية؟", "طرح الكثير من التساؤلات غداة ظهور نتائج الانتخابات الرئاسية الايرانية التي أججت مظاهرات واسعة واعمال عنف بين المحتجين على النتائج ورجال الامن.", "يقول معارضو الرئيس الإيراني إن الطريقة التي اعلنت بها النتائج كانت مثيرة للاستغراب."]

2.) Abbreviations

وقال د‪.‬ ديفيد ريدي و الأطباء الذين كانوا يعالجونها في مستشفى برمنجهام إنها كانت تعاني من أمراض أخرى. وليس معروفا ما اذا كانت قد توفيت بسبب اصابتها بأنفلونزا الخنازير.
=> ["وقال د‪.‬ ديفيد ريدي و الأطباء الذين كانوا يعالجونها في مستشفى برمنجهام إنها كانت تعاني من أمراض أخرى.", "وليس معروفا ما اذا كانت قد توفيت بسبب اصابتها بأنفلونزا الخنازير."]

3.) Numbers and Dates

ومن المنتظر أن يكتمل مشروع خط أنابيب نابوكو البالغ طوله 3300 كليومترا في 12‪/‬08‪/‬2014 بتكلفة تُقدر بـ 7.9 مليارات يورو أي نحو 10.9 مليارات دولار. ومن المقرر أن تصل طاقة ضخ الغاز في المشروع 31 مليار متر مكعب انطلاقا من بحر قزوين مرورا بالنمسا وتركيا ودول البلقان دون المرور على الأراضي الروسية.
=> ["ومن المنتظر أن يكتمل مشروع خط أنابيب نابوكو البالغ طوله 3300 كليومترا في 12‪/‬08‪/‬2014 بتكلفة تُقدر بـ 7.9 مليارات يورو أي نحو 10.9 مليارات دولار.", "ومن المقرر أن تصل طاقة ضخ الغاز في المشروع 31 مليار متر مكعب انطلاقا من بحر قزوين مرورا بالنمسا وتركيا ودول البلقان دون المرور على الأراضي الروسية."]

4.) Time

الاحد, 21 فبراير/ شباط, 2010, 05:01 GMT الصنداي تايمز: رئيس الموساد قد يصبح ضحية الحرب السرية التي شتنها بنفسه. العقل المنظم هو مئير داجان رئيس الموساد الإسرائيلي الذي يشتبه بقيامه باغتيال القائد الفلسطيني في حركة حماس محمود المبحوح في دبي.
=> ["الاحد, 21 فبراير/ شباط, 2010, 05:01 GMT الصنداي تايمز:", "رئيس الموساد قد يصبح ضحية الحرب السرية التي شتنها بنفسه.", "العقل المنظم هو مئير داجان رئيس الموساد الإسرائيلي الذي يشتبه بقيامه باغتيال القائد الفلسطيني في حركة حماس محمود المبحوح في دبي."]

5.) Comma

عثر في الغرفة على بعض أدوية علاج ارتفاع ضغط الدم، والقلب، زرعها عملاء الموساد كما تقول مصادر إسرائيلية، وقرر الطبيب أن الفلسطيني قد توفي وفاة طبيعية ربما إثر نوبة قلبية، وبدأت مراسم الحداد عليه
=> ["عثر في الغرفة على بعض أدوية علاج ارتفاع ضغط الدم، والقلب،", "زرعها عملاء الموساد كما تقول مصادر إسرائيلية،", "وقرر الطبيب أن الفلسطيني قد توفي وفاة طبيعية ربما إثر نوبة قلبية،", "وبدأت مراسم الحداد عليه"]

Golden Rules (Italian)

1.) Abbreviations

Salve Sig.ra Mengoni! Come sta oggi?
=> ["Salve Sig.ra Mengoni!", "Come sta oggi?"]

2.) Quotations

Una lettera si può iniziare in questo modo «Il/la sottoscritto/a.».
=> ["Una lettera si può iniziare in questo modo «Il/la sottoscritto/a.»."]

3.) Numbers

La casa costa 170.500.000,00€!
=> ["La casa costa 170.500.000,00€!"]

Golden Rules (Russian)

1.) Abbreviations

Объем составляет 5 куб.м.
=> ["Объем составляет 5 куб.м."]

2.) Quotations

Маленькая девочка бежала и кричала: «Не видали маму?».
=> ["Маленькая девочка бежала и кричала: «Не видали маму?»."]

3.) Numbers

Сегодня 27.10.14
=> ["Сегодня 27.10.14"]

Golden Rules (Spanish)

1.) Question mark to end sentence

¿Cómo está hoy? Espero que muy bien.
=> ["¿Cómo está hoy?", "Espero que muy bien."]

2.) Exclamation point to end sentence

¡Hola señorita! Espero que muy bien.
=> ["¡Hola señorita!", "Espero que muy bien."]

3.) Abbreviations

Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre, el Dr. Naser.
=> ["Hola Srta. Ledesma.", "Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre, el Dr. Naser."]

4.) Numbers

¡La casa cuesta $170.500.000,00! ¡Muy costosa! Se prevé una disminución del 12.5% para el próximo año.
=> ["¡La casa cuesta $170.500.000,00!", "¡Muy costosa!", "Se prevé una disminución del 12.5% para el próximo año."]

5.) Quotations

«Ninguna mente extraordinaria está exenta de un toque de demencia.», dijo Aristóteles.
=> ["«Ninguna mente extraordinaria está exenta de un toque de demencia.», dijo Aristóteles."]

Golden Rules (Greek)

1.) Question mark to end sentence

Με συγχωρείτε· πού είναι οι τουαλέτες; Τις Κυριακές δε δούλευε κανένας. το κόστος του σπιτιού ήταν £260.950,00.
=> ["Με συγχωρείτε· πού είναι οι τουαλέτες;", "Τις Κυριακές δε δούλευε κανένας.", "το κόστος του σπιτιού ήταν £260.950,00."]

Golden Rules (Hindi)

1.) Full stop

सच्चाई यह है कि इसे कोई नहीं जानता। हो सकता है यह फ़्रेन्को के खिलाफ़ कोई विद्रोह रहा हो, या फिर बेकाबू हो गया कोई आनंदोत्सव।
=> ["सच्चाई यह है कि इसे कोई नहीं जानता।", "हो सकता है यह फ़्रेन्को के खिलाफ़ कोई विद्रोह रहा हो, या फिर बेकाबू हो गया कोई आनंदोत्सव।"]

Golden Rules (Armenian)

1.) Sentence ending punctuation

Ի՞նչ ես մտածում: Ոչինչ:
=> ["Ի՞նչ ես մտածում:", "Ոչինչ:"]

2.) Ellipsis

Ապրիլի 24-ին սկսեց անձրևել...Այդպես էի գիտեի:
=> ["Ապրիլի 24-ին սկսեց անձրևել...Այդպես էի գիտեի:"]

3.) Period is not a sentence boundary

Այսպիսով` մոտենում ենք ավարտին: Տրամաբանությյունը հետևյալն է. պարզություն և աշխատանք:
=> ["Այսպիսով` մոտենում ենք ավարտին:", "Տրամաբանությյունը հետևյալն է. պարզություն և աշխատանք:"]

Golden Rules (Burmese)

1.) Sentence ending punctuation

ခင္ဗ်ားနာမည္ဘယ္လိုေခၚလဲ။၇ွင္ေနေကာင္းလား။
=> ["ခင္ဗ်ားနာမည္ဘယ္လိုေခၚလဲ။", "၇ွင္ေနေကာင္းလား။"]

Golden Rules (Amharic)

1.) Sentence ending punctuation

እንደምን አለህ፧መልካም ቀን ይሁንልህ።እባክሽ ያልሽዉን ድገሚልኝ።
=> ["እንደምን አለህ፧", "መልካም ቀን ይሁንልህ።", "እባክሽ ያልሽዉን ድገሚልኝ።"]

Golden Rules (Persian)

1.) Sentence ending punctuation

خوشبختم، آقای رضا. شما کجایی هستید؟ من از تهران هستم.
=> ["خوشبختم، آقای رضا.", "شما کجایی هستید؟", "من از تهران هستم."]

Golden Rules (Urdu)

1.) Sentence ending punctuation

کیا حال ہے؟ ميرا نام ___ ەے۔ میں حالا تاوان دےدوں؟
=> ["کیا حال ہے؟", "ميرا نام ___ ەے۔", "میں حالا تاوان دےدوں؟"]

Golden Rules (Dutch)

1.) Sentence starting with a number

Hij schoot op de JP8-brandstof toen de Surface-to-Air (sam)-missiles op hem af kwamen. 81 procent van de schoten was raak.
=> ["Hij schoot op de JP8-brandstof toen de Surface-to-Air (sam)-missiles op hem af kwamen.", "81 procent van de schoten was raak."]

2.) Sentence starting with an ellipsis

81 procent van de schoten was raak. ...en toen barste de hel los.
=> ["81 procent van de schoten was raak.", "...en toen barste de hel los."]

Comparison of Segmentation Tools, Libraries and Algorithms

| Name                | Programming Language | License   | GRS (English) | GRS (Other Languages)† | Speed‡  |
|---------------------|----------------------|-----------|---------------|------------------------|---------|
| Pragmatic Segmenter | Ruby                 | MIT       | 98.08%        | 100.00%                | 3.84 s  |
| TactfulTokenizer    | Ruby                 | GNU GPLv3 | 65.38%        | 48.57%                 | 46.32 s |
| OpenNLP             | Java                 | APLv2     | 59.62%        | 45.71%                 | 1.27 s  |
| Stanford CoreNLP    | Java                 | GNU GPLv3 | 59.62%        | 31.43%                 | 0.92 s  |
| Splitta             | Python               | APLv2     | 55.77%        | 37.14%                 | N/A     |
| Punkt               | Python               | APLv2     | 46.15%        | 48.57%                 | 1.79 s  |
| SRX English         | Ruby                 | GNU GPLv3 | 30.77%        | 28.57%                 | 6.19 s  |
| Scalpel             | Ruby                 | GNU GPLv3 | 28.85%        | 20.00%                 | 0.13 s  |

†GRS (Other Languages) is the total of the Golden Rules listed above for all languages other than English. This metric by no means includes all languages, only the ones that have Golden Rules listed above. ‡ Speed is based on the performance benchmark results detailed in the section "Speed Performance Benchmarks" below. The number is an average of 10 runs.

Other tools not yet tested:

Speed Performance Benchmarks

To test the relative performance of different segmentation tools and libraries, I created a simple benchmark. The test takes the 50 English Golden Rules combined into one string and runs it 100 times through the segmenter. This is by no means the most scientific benchmark, but it should give some relative performance data. The tests were done on a Mac Pro 3.7 GHz Quad-Core Intel Xeon E5 running OS X 10.9.5. For Punkt the tests were run using this Ruby port, for Stanford CoreNLP the tests were run using this Ruby port, and for OpenNLP the tests were run using this Ruby port.
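The shape of that benchmark can be sketched with Ruby's stdlib Benchmark module. The short corpus and the stand-in lambda below are illustrative assumptions; the real test ran the full 50-rule string through each library under test:

```ruby
require 'benchmark'

# Sketch of the benchmark described above: join Golden Rule inputs into
# one string and time 100 segmentation passes. The lambda is a stand-in
# for whichever segmenter is being measured.
corpus = [
  "Hello World. My name is Jonas.",
  "What is your name? My name is Jonas.",
  "There it is! I found it."
].join(" ")

segmenter = ->(text) { text.split(/(?<=[.?!])\s+/) }

elapsed = Benchmark.realtime do
  100.times { segmenter.call(corpus) }
end

puts format("%.2f s", elapsed)
```

Benchmark.realtime returns wall-clock seconds as a Float, matching the "Speed" column's units in the comparison table.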

Languages with sentence boundary punctuation that differs from English

If you know of any languages that are missing from the list below, please open an issue. Thank you.

Pragmatic Segmenter supports the following languages whose sentence boundary punctuation differs from English:

  • Amharic
  • Arabic
  • Armenian
  • Burmese
  • Chinese
  • Greek
  • Hindi
  • Japanese
  • Persian
  • Urdu

Segmentation Papers and Books

  • Elephant: Sequence Labeling for Word and Sentence Segmentation - Kilian Evang, Valerio Basile, Grzegorz Chrupała and Johan Bos (2013) [pdf | mirror]
  • Sentence Boundary Detection: A Long Solved Problem? (Second Edition) - Jonathon Read, Rebecca Dridan, Stephan Oepen, Lars Jørgen Solberg (2012) [pdf | mirror]
  • Handbook of Natural Language Processing (Second Edition) - Nitin Indurkhya and Fred J. Damerau (2010) [amazon]
  • Sentence Boundary Detection and the Problem with the U.S. - Dan Gillick (2009) [pdf | mirror]
  • Thoughts on Word and Sentence Segmentation in Thai - Wirote Aroonmanakun (2007) [pdf | mirror]
  • Unsupervised Multilingual Sentence Boundary Detection - Tibor Kiss and Jan Strunk (2005) [pdf | mirror]
  • An Analysis of Sentence Boundary Detection Systems for English and Portuguese Documents - Carlos N. Silla Jr. and Celso A. A. Kaestner (2004) [pdf | mirror]
  • Periods, Capitalized Words, etc. - Andrei Mikheev (2002) [pdf]
  • Scaled log likelihood ratios for the detection of abbreviations in text corpora - Tibor Kiss and Jan Strunk (2002) [pdf | mirror]
  • Viewing sentence boundary detection as collocation identification - Tibor Kiss and Jan Strunk (2002) [pdf | mirror]
  • Automatic Sentence Break Disambiguation for Thai - Paisarn Charoenpornsawat and Virach Sornlertlamvanich (2001) [pdf | mirror]
  • Sentence Boundary Detection: A Comparison of Paradigms for Improving MT Quality - Daniel J. Walker, David E. Clements, Maki Darwin and Jan W. Amtrup (2001) [pdf | mirror]
  • A Sentence Boundary Detection System - Wendy Chen (2000) [ppt | mirror]
  • Tagging Sentence Boundaries - Andrei Mikheev (2000) [pdf | mirror]
  • Automatic Extraction of Rules For Sentence Boundary Disambiguation - E. Stamatatos, N. Fakotakis, AND G. Kokkinakis (1999) [pdf]
  • A Maximum Entropy Approach to Identifying Sentence Boundaries - Jeffrey C. Reynar and Adwait Ratnaparkhi (1997) [pdf | mirror]
  • Adaptive Multilingual Sentence Boundary Disambiguation - David D. Palmer and Marti A. Hearst (1997) [pdf | mirror]
  • What is a word, What is a sentence? Problems of Tokenization - Gregory Grefenstette and Pasi Tapanainen (1994) [pdf | mirror]
  • Chapter 2: Tokenisation and Sentence Segmentation - David D. Palmer [pdf | mirror]
  • Using SRX standard for sentence segmentation in LanguageTool - Marcin Miłkowski and Jarosław Lipski [pdf | mirror]

TODO

  • Add additional language support
  • Add abbreviation lists for any languages that do not currently have one (only relevant for languages that have the concept of abbreviations with periods)
  • Get Golden Rule #18 passing - Handling of a.m. or p.m. followed by a capitalized non sentence starter (ex. "At 5 p.m. Mr. Smith went to the bank. He left the bank at 6 p.m. Next he went to the store." --> ["At 5 p.m. Mr. Smith went to the bank.", "He left the bank at 6 p.m.", "Next he went to the store."])
  • Support for Thai. This is a very challenging problem due to the absence of explicit sentence markers (i.e. like a period in English) and the ambiguity in Thai regarding what constitutes a sentence even among native speakers. For more information see the following research papers (#1 | #2).

Change Log

Version 0.0.1

  • Initial Release

Version 0.0.2

  • Major design refactor

Version 0.0.3

  • Add travis.yml
  • Add Code Climate
  • Update README

Version 0.0.4

  • Add ConsecutiveForwardSlashRule to cleaner
  • Refactor segmenter.rb and process.rb

Version 0.0.5

  • Make symbol substitution safer
  • Refactor process.rb
  • Update cleaner with escaped newline rules

Version 0.0.6

  • Add rule for escaped newlines that include a space between the slash and character
  • Add Golden Rule #52 and code to make it pass

Version 0.0.7

  • Add change log to README
  • Add passing spec for new end of sentence abbreviation (EN)
  • Add roman numeral list support

Version 0.0.8

  • Fix error in list.rb

Version 0.0.9

  • Improve handling of alphabetical and roman numeral lists

Version 0.1.0

  • Add Kommanditgesellschaft Rule

Version 0.1.1

  • Fix handling of German dates

Version 0.1.2

  • Fix missing abbreviations
  • Add footnote rule to cleaner.rb

Version 0.1.3

  • Improve punctuation in bracket replacement

Version 0.1.4

  • Fix missing abbreviations

Version 0.1.5

  • Fix comma at end of quotation bug

Version 0.1.6

  • Fix bug in numbered list finder (ignore longer digits)

Version 0.1.7

  • Add Alice in Wonderland specs
  • Fix parenthesis between double quotations bug
  • Fix split after quotation ending in dash bug

Version 0.1.8

  • Fix bug in splitting new sentence after single quotes

Version 0.2.0

  • Add Dutch Golden Rules and abbreviations
  • Update README with additional tools
  • Update segmentation test scores in README with results of new Golden Rule tests
  • Add Polish abbreviations

Version 0.3.0

  • Add support for square brackets
  • Add support for continuous exclamation points or questions marks or combinations of both
  • Fix Roman numeral support
  • Add English abbreviations

Version 0.3.1

  • Fix undefined method 'gsub!' for nil:NilClass issue

Version 0.3.2

  • Add English abbreviations

Version 0.3.3

  • Fix cleaner bug

Version 0.3.4

  • Large refactor

Version 0.3.5

  • Reduce GC by replacing #gsub with #gsub! where possible

Version 0.3.6

  • Refactor SENTENCE_STARTERS to each individual language and add SENTENCE_STARTERS for German

Version 0.3.7

  • Add unicode gem and use it for downcasing to better handle cyrillic languages

Version 0.3.8

  • Fix bug that cleaned away single letter segments

Version 0.3.9

  • Remove guard-rspec development dependency

Version 0.3.10

  • Change load order of dependencies to fix bug

Version 0.3.11

  • Update German abbreviation list
  • Refactor 'remove_newline_in_middle_of_sentence' method

Version 0.3.12

  • Fix issue involving words with leading apostrophes

Version 0.3.13

  • Fix issue involving unexpected sentence break between abbreviation and hyphen

Version 0.3.14

  • Add English abbreviation Rs. to denote the Indian currency

Version 0.3.15

  • Handle em dashes that appear in the middle of a sentence and include a sentence ending punctuation mark

Version 0.3.16

  • Add support and tests for Danish

Version 0.3.17

  • Fix issue involving the HTML regex in the cleaner

Version 0.3.18

  • Performance optimizations

Version 0.3.19

  • Treat a parenthetical following an abbreviation as part of the same segment

Version 0.3.20

  • Handle slanted single quotation as a single quote
  • Handle a single character abbreviation as part of a list
  • Add support for Chinese caret brackets
  • Add viz as abbreviation

Version 0.3.21

  • Add support for file formats
  • Add support for numeric references at the end of a sentence (i.e. Wikipedia references)

Version 0.3.22

  • Add initial support and tests for Kazakh

Version 0.3.23

  • Refactor for Ruby 3.0 compatibility

Contributing

If you find a text that is incorrectly segmented using this gem, please submit an issue.

  1. Fork it ( https://github.com/diasks2/pragmatic_segmenter/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Ports

License

The MIT License (MIT)

Copyright (c) 2015 Kevin S. Dias

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

pragmatic_segmenter's People

Contributors

adymo, airy, alextsui05, aseifert, bryant1410, diasks2, dmandalinic, eliotjones, maia, mollerhoj


pragmatic_segmenter's Issues

Failed test case when there are book title marks and an exclamation mark in a Chinese sentence

text = "我们明天一起去看《摔跤吧!爸爸》好吗?好!"
expected: ["我们明天一起去看《摔跤吧!爸爸》好吗?", "好!"]
actual: ["我们明天一起去看《摔跤吧!", "爸爸》好吗?", "好!"]

It works well with double quotation marks, for example:

text = "我们明天一起去看“摔跤吧!爸爸”好吗?好!"
expected: ["我们明天一起去看“摔跤吧!爸爸”好吗?", "好!"]
actual: ["我们明天一起去看“摔跤吧!爸爸”好吗?", "好!"]

FYI, "摔跤吧!爸爸" is the Chinese title of the movie "Dangal".

Ellipses and design decision

I've been testing the ellipsis rules with . . . replaced with U+2026 (…) and find that Pragmatic Segmenter fails when given the actual ellipsis character. I'm probably missing something, but shouldn't ellipsis.rb contain rules for the actual ellipsis character?

This brings up a bigger question of how all the variants of symbols are covered. I notice that certain end punctuation characters are explicitly defined, e.g., U+FF1F (?) in punctuation_replacer.rb. However, there are many Unicode characters that could stand in for their ASCII equivalents, e.g., U+FE56 (﹖), U+FE16 (︖), etc. for question marks or U+2047 (⁇), U+2048 (⁈), etc. for double end punctuation and so on for all symbols that are used in segmenting decisions, e.g., (), [], -, ., ...
Chasing all these down seems like a nightmare!

Wouldn't it make sense to convert everything to ASCII (i.e., unidecode), segment, and then replace the decoded characters with their original characters? This assumes that all 'equivalent' characters have the same meaning, but I believe they do; e.g., ፧ is the Ethiopic question mark, which carries the same linguistic meaning as in English. If not, those could be the exceptions rather than the rule.
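For reference, the unidecode-style pre-pass suggested here could be sketched as a small normalization table. The mapping below is purely illustrative and is not the gem's actual rule set:

```ruby
# Illustrative pre-processing pass: map a few Unicode sentence-ending
# variants to ASCII equivalents before segmenting. The table is a sketch,
# not an exhaustive or authoritative mapping.
UNICODE_TERMINATORS = {
  "\u2026" => '...', # HORIZONTAL ELLIPSIS
  "\uFF1F" => '?',   # FULLWIDTH QUESTION MARK
  "\uFE56" => '?',   # SMALL QUESTION MARK
  "\u2047" => '??',  # DOUBLE QUESTION MARK
  "\u1367" => '?'    # ETHIOPIC QUESTION MARK
}.freeze

def normalize_terminators(text)
  UNICODE_TERMINATORS.reduce(text) { |t, (from, to)| t.gsub(from, to) }
end

puts normalize_terminators("Wait\u2026 really\uFF1F") # => "Wait... really?"
```

To round-trip (segment, then restore the original symbols), the pass would also have to record offsets, which is where the approach gets harder than it first looks.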

I would love to hear your thoughts.

Thanks for the great library. From my testing it performs better than spaCy, segtok, CoreNLP, and Punkt on English Wikipedia data.

Parse a sentence to words

I need to parse a text into sentences.
Afterwards I need to find the most frequent word.

I saw how to use the parser in the README.
In order to count the words in a sentence I will need to parse it down to words.
Does pragmatic_segmenter include a regex for words (abbreviations etc.)?
Under the hood it must be using some kind of word detector;
could you please share it so that I can also parse a sentence into words using the same logic?
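The gem itself does not expose a word tokenizer (its sibling project pragmatic_tokenizer does), but for the frequency-counting use case described here, a naive stand-in might look like this — a sketch, not the gem's internal logic:

```ruby
# Hypothetical helper (not part of the gem): count word frequencies in a
# segment. Naive tokenization: runs of letters, optionally with an internal
# apostrophe (so "isn't" stays whole), lowercased first.
# Requires Ruby 2.7+ for Enumerable#tally.
def word_counts(sentence)
  sentence.downcase.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)?/).tally
end

word_counts("The cat saw the cat.") # => {"the"=>2, "cat"=>2, "saw"=>1}
```

Note this deliberately ignores abbreviations, hyphenation, and numbers; handling those well is exactly why a dedicated tokenizer exists.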

return String instead of PragmaticSegmenter::Text

Currently pragmatic_segmenter returns instances of PragmaticSegmenter::Text, which is a subclass of String. Since pragmatic_tokenizer checks whether text.class == String, and the segmenter returns objects of a different class than was initially passed in, I suggest returning plain strings instead of instances of this internally used subclass.

I wonder if there is a smarter idea than calling #to_s on the result, as that would unnecessarily duplicate the strings in memory. Maybe instead of subclassing String, extend the String class with a module providing the single method used? (And use a method name that won't clash with anyone else's code if they also decide to extend the String class.)

Breaks the sentence if start is abbreviation

If there is an abbreviation at the start of the text, sometimes it breaks the sentence correctly and sometimes it doesn't.

My sample sentence was:

TAB. ECOSPRIN 75MG ONE TAB 10PM

Output:

TAB.
ECOSPRIN 75MG ONE TAB 10PM

whereas Cal., the traditional abbreviation for California, worked fine.

Is this because lib/pragmatic_segmenter/languages/common.rb includes cal as an abbreviation?

French ellipsis ("trois petits points") is not handled.

Hi,
In French we have a ... at the end of a sentence, but here it doesn't segment correctly. I think it's because etc is also an abbreviation that is written etc.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<wrapper>
<s>J&#39;aime le sport etc..</s>
<s>. Cependant est ce vrai ?</s>
</wrapper>

It should look like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<wrapper>
<s>J&#39;aime le sport etc...</s>
<s>Cependant est ce vrai ?</s>
</wrapper>

Thanks for your time

Run as a Service?

I'd like to use the segmenter with a large Java application and would prefer not to use JRuby, etc. I'd prefer to run the segmenter as a lightweight JSON service. Apologies, I don't know Ruby or I would read the code, but can this project be run as an always-on service?

replace_parens_in_numbered_list() calls scan_lists() twice

replace_parens_in_numbered_list() calls scan_lists() twice with the same parameters. I checked the commit which introduced the duplication, and it looks like a mistake.

def replace_parens_in_numbered_list
  scan_lists(NUMBERED_LIST_PARENS_REGEX, NUMBERED_LIST_PARENS_REGEX, '☝')
  scan_lists(NUMBERED_LIST_PARENS_REGEX, NUMBERED_LIST_PARENS_REGEX, '☝')
end


Punctuation removed even with clean turned off

See the example below: when the clean parameter is false, the asterisk after Cat. is still removed.

pry(main)> s = "I am a dog. Cat.*"
=> "I am a dog. Cat.*"

pry(main)> ps = PragmaticSegmenter::Segmenter.new(text: s, language: 'en', clean: false)
=> #<PragmaticSegmenter::Segmenter:0x00007fdf5d6890e0
 @doc_type=nil,
 @language="en",
 @language_module=PragmaticSegmenter::Languages::English,
 @text="I am a dog. Cat.*">

pry(main)> segments = ps.segment
=> ["I am a dog.", "Cat."]

replace_abbreviation_as_sentence_boundary causing high GC

The method PragmaticSegmenter::AbbreviationReplacer#replace_abbreviation_as_sentence_boundary chains eleven .gsub calls for each of the 23 SENTENCE_STARTERS, each of which creates a copy of the string and leaves the old one for the garbage collector. That's 253 GC'd strings per call.

If replaced by gsub!, this would avoid the redundant copy/assign operations and speed up the method.
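To illustrate the trade-off being proposed: #gsub allocates a fresh string on every call, while #gsub! mutates the receiver in place. The caveat is that #gsub! returns nil when no substitution occurred, so the calls must be written as separate statements rather than a chain:

```ruby
# #gsub returns a new string each time (the old one becomes garbage);
# #gsub! edits the receiver in place. Because #gsub! returns nil on a
# no-op, it cannot be chained the way #gsub can.
text = "A B C"
text.gsub!("A", "X")
text.gsub!("B", "Y")
text.gsub!("Z", "!") # no match: returns nil, but text is left unchanged
puts text # => "X Y C"
```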

Cannot segment text on Ruby 3

Hi, first of all, thanks for a great gem! I get the following error when trying to segment text using this gem in Ruby 3:

> PragmaticSegmenter::Segmenter.new(text: 'This is apple. This is pen. Ah. Apple pen.').segment
NoMethodError: undefined method `apply' for "This is apple. This is pen. Ah. Apple pen.":String
from vendor/bundle/ruby/3.0.0/gems/pragmatic_segmenter-0.3.22/lib/pragmatic_segmenter/processor.rb:37:in `block in split_into_segments'

It seems that this is a feature(?) of Ruby 3 where calling String methods like String#split returns a String rather than the subclass (https://github.com/ruby/ruby/blob/v3_0_0/NEWS.md). So calling split on PragmaticSegmenter::Text turns it into a String and breaks the chaining that happens internally.

I think if we use delegation instead of inheritance for PragmaticSegmenter::Text, and make any necessary changes in the calling code, we can have a version of pragmatic_segmenter that is compatible with Ruby 3. In the end I just refactored the code to use vanilla String instead of a String subclass: #68

Take advantage of non-breaking spaces

I have a corpus of text that often uses explicit non-breaking spaces (NBSP, U+00A0). They are mainly used to keep together words in the same sentence. They often appear after sentence-medial terminal punctuation (.!?) and before short sentence-final words (as in G. F. Handel composed Water Music for George I.). They were used in order to improve both text document layout and parsing.

Consider the following four cases:

1.  'Peter Pan is a J. M. Barrie play.'   # no NBSP
2.  'Peter Pan is a J. M. Barrie play.'   # NBSP after J.
3.  'Peter Pan is a J. M. Barrie play.'   # NBSP after M.
4.  'Peter Pan is a J. M. Barrie play.'   # NBSP after J. and M.

I was surprised that only the first case was segmented correctly out of the box:

1.  ['Peter Pan is a J. M. Barrie play.']
2. *['Peter Pan is a J.', ' M.', 'Barrie play.']
3. *['Peter Pan is a J. M.', ' Barrie play.']
4. *['Peter Pan is a J.', ' M.', ' Barrie play.']

Of course, it would be easy to just translate all of them to normal spaces and call it a day.
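That "call it a day" option is indeed a one-liner — at the cost of discarding exactly the disambiguation signal argued for in the rest of this issue:

```ruby
# Fold non-breaking spaces (U+00A0) into plain spaces before segmenting.
# Simple, but it throws away the NBSP boundary hints discussed below.
text = "Peter Pan is a J.\u00A0M.\u00A0Barrie play." # NBSP after J. and M.
normalized = text.tr("\u00A0", " ")
puts normalized # => "Peter Pan is a J. M. Barrie play."
```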


But non-breaking spaces can serve as useful disambiguation to a pragmatic sentence segmenter.

Not every sentence can be accurately segmented using a limited number of rules, so taking advantage of non-breaking spaces can improve the results in some trickier cases without making the checks much more complicated.

(Disclaimer: I have not looked at the rules and do not claim to understand the internals of the program.)

I will provide a few examples to demonstrate how this may be useful. You are welcome to incorporate them as new test cases, even if you ultimately decide not to bother with non-breaking spaces.


Cases 5–7 are extremely similar, but 7 surprisingly produces a different result. In case 8, adding an NBSP seems to fix the problem, presumably without making This behave like He and They internally.

5.  'Sri Lanka was conquered by the Cholas and Raja Raja I. They moved the capital.'   # no NBSP
6.  'Sri Lanka was conquered by Raja Raja I. He moved the capital to Polonnaruwa.'     # no NBSP
7.  'Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.'   # no NBSP
8.  'Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.'   # NBSP before I.

5.  ['Sri Lanka was conquered by the Cholas and Raja Raja I.', 'They moved the capital.']
6.  ['Sri Lanka was conquered by Raja Raja I.', 'He moved the capital to Polonnaruwa.']
7. *['Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.']
8.  ['Sri Lanka was conquered by Raja Raja I.', 'This moved the capital to Polonnaruwa.']

Cases 9–10 show the same sort of “fix”:

9.   'Lu Xun wrote The True Story of Ah Q. He was a Chinese author.'   # no NBSP
10.  'Lu Xun wrote The True Story of Ah Q. He was a Chinese author.'   # NBSP before Q.

9.  *['Lu Xun wrote The True Story of Ah Q. He was a Chinese author.']
10.  ['Lu Xun wrote The True Story of Ah Q.', 'He was a Chinese author.']

Cases 11–14 show that it's much more difficult than it looks, as Feng S. He is a plausible Chinese name:

11.  'Feng S. He was a Chinese diplomat who secretly saved 3,000 Austrian Jews.'  # no NBSP
12.  'He said the story of Feng S. He was a secret. He was a Chinese diplomat.'   # no NBSP
13.  'He said the story of Feng S. He was a secret. He was a Chinese diplomat.'   # NBSP after S.
14.  'I learned the son of Feng S. He was a Chinese-American microbiologist.'     # no NBSP

11.  ['Feng S. He was a Chinese diplomat who secretly saved 3,000 Austrian Jews.']
12.  ['He said the story of Feng S. He was a secret.', 'He was a Chinese diplomat.']
13. *['He said the story of Feng S.', ' He was a secret.', 'He was a Chinese diplomat.']
14.  ['I learned the son of Feng S. He was a Chinese-American microbiologist.']

The non-breaking space currently has no effect in cases 15 and 16. However, they could be easily segmented correctly without adding more rules or rare words like .NET to an explicit list.

15.  'I want to learn Microsoft’s .NET framework.'   # no NBSP
16.  'I want to learn Microsoft’s .NET framework.'   # NBSP before .NET

15. *['I want to learn Microsoft’s .', 'NET framework.']
16. *['I want to learn Microsoft’s .', 'NET framework.']

Resolving this issue could help with the following:

  1. Improve accuracy for corpora that (partially) use an existing convention for keeping words together using non-breaking spaces.
  2. Improve the existing rules by scrutinizing issues with the cases provided.

I favor this program for my use case because of its stance on embedded quotations.

Text Chunking

pragmatic_segmenter should be able to return chunks of sentences up to a maximum size.
E.g.
https://github.com/akalsey/textchunk
https://github.com/algolia/chunk-text

The following code example is donated by https://auditus.cc courtesy of @havenwood.
It is used to ensure that conversion requests stay within the limits of AWS Polly (1500-character limit).

      optimized_sentences = sentences.each_with_object([]) do |sentence, accumulator| # like reduce, but returns the accumulator
        if accumulator.last && (accumulator.last.size + sentence.size + 2) < 1500
          accumulator.last << sentence
          accumulator.last << ' '
        else
          accumulator << sentence.dup # dup so the source sentences are not mutated
          accumulator.last << ' '
        end
      end

Given an array of sentences, the above code counts the number of characters in each sentence and concatenates a sentence with the following sentence if their combined length is less than an arbitrary number of characters (in this case 1500, the Amazon Polly character limit).

https://gist.github.com/havenwood/e9c286c524f2de5649586e7d28fec7af

The above code, however, does not handle cases where the length of the concatenated string exceeds 1500. Here is one possible method (also donated by https://auditus.cc):

      stripped_sentences = optimized_sentences.flat_map { |sentence|
        if sentence.size > 1500
          # split on whitespace where possible; hard-split runs with no whitespace
          sentence.scan(/(\S{1500,}|.{1,1500})(?:\s|$)/).flatten
        else
          sentence
        end
      }

wrong segmentation

The output for this text is wrong. In total there are three sentences in this text:
"Joe did not calculatingly set out to steal or defraud," his attorneys wrote in their submission to the judge. "Rather, he -- illegally -- shifted risk to unwitting investors by making false statements. This was not the Ponzi scheme it was made out to be in the media reports of his arrest and it was not a ’fictitious business’ where no tickets or ticket deals ever existed and money was simply stolen for selfish reasons."

Incorrect segmentation or intended behavior?

The following sentence is broken into three segments when it should be one sentence:

"Some part of the sentence that says refer to paragraphs (a) and (b) above is usually in correct."

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<wrapper>
<s>Some part of the sentence that says refer to paragraphs</s>
<s>(a) and</s>
<s>(b) above but not always that.</s>
</wrapper>

pragmatic_segmenter installing problem

gem install pragmatic_segmenter
Successfully installed pragmatic_segmenter-0.3.22
Parsing documentation for pragmatic_segmenter-0.3.22
Done installing documentation for pragmatic_segmenter after 0 seconds
1 gem installed

But it does not install:
sevilay@sevilay-linux:~$ apt-cache policy pragmatic_segmenter
N: Unable to locate package pragmatic_segmenter

Please anyone could tell me what is the problem?

Unexpected sentence break between abbreviation and hyphen

Hi again,

I came across this surprising behavior. The first three of these are as expected (for comparison), but the last seems like a bug?

PragmaticSegmenter::Segmenter.new(text: "He has high level training", clean: false).segment 
    # => ["He has high level training"]

PragmaticSegmenter::Segmenter.new(text: "He has high-level training", clean: false).segment 
    # => ["He has high-level training"]

PragmaticSegmenter::Segmenter.new(text: "He has Ph.D. level training", clean: false).segment 
    # => ["He has Ph.D. level training"]

PragmaticSegmenter::Segmenter.new(text: "He has Ph.D.-level training", clean: false).segment 
    # => ["He has Ph.D.", "-level training"]

Thanks!

Infinite Loop

Hi,

When I use this great tool for preprocessing wikipedia dumps, I encountered the infinite loop and failed with NoMemoryError.

Example:

When we input

'' (a '\0 !\0')

with "en" option to pragmatic segmenter,
sub_4 = sub_characters(sub_3, '!', '&ᓴ&') at https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/punctuation_replacer.rb#L55
causes the infinite loop.

I'm wondering if we can solve this problem by escaping '\0' in sub_characters function.

def sub_characters(string, char_a, char_b)
  sub = string.gsub(char_a, char_b).gsub('\\0', '\\\\\0')
  @text.gsub!(/#{Regexp.escape(string)}/, sub)
  sub
end
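To illustrate the hazard this escaping guards against: in String#gsub, \0 in a replacement string is a backreference to the entire match, so replacement text that itself contains a literal \0 is re-expanded instead of being inserted verbatim. The block form of #gsub inserts its return value with no backreference expansion:

```ruby
# "\0" in a gsub replacement string backreferences the whole match.
with_backref = "abc".gsub("b", '\0\0')    # '\0' expands to the match "b"
literal      = "abc".gsub("b") { '\0\0' } # block return is inserted verbatim
puts with_backref # => "abbc"
puts literal      # => a, then the four characters \0\0, then c
```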

Thanks!

Unexpected sentence break when parentheses immediately follow abbreviation with period

Hi Kevin - first of all, thanks for your work on this gem.

I'd like to report the following unexpected behavior:

Example 1: Unexpected Result:

Note the period in Inc.

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc.", "(“Company A”), and PragmaticSegmenterExampleCompanyB Inc.", "(“Company B”)."]

Example 2: Expected Result:

No period in Inc

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc (“Company A”), and PragmaticSegmenterExampleCompanyB Inc (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc (“Company A”), and PragmaticSegmenterExampleCompanyB Inc (“Company B”)."]

Example 3: Expected Result:

Note period in Inc. but now there's text between Inc. and the parens

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc., a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc., a fake corporation (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc., a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc., a fake corporation (“Company B”)."]

Example 4: Expected Result:

Same as Example 3 but without the comma after Inc.

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. a fake corporation (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. a fake corporation (“Company B”)."]

Instructions for using on the command line

Would it be possible to get instructions for how to use this on the command line in a pipe? e.g.

$ cat ~/corpora/languages/tatar/wikipedia/wiki.txt |  ruby pragmatic_segmenter.rb 

This gives no output...

I have trouble running the program

Hello,

I've read this thread and followed what's written in it:

#18

The problem is that I got this error message :

irb(main):004:0> require 'pragmatic_segmenter'
NameError: uninitialized constant PragmaticSegmenter::Languages::Common::Abbreviation::Set
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter/languages/common.rb:12:in `<module:Abbreviation>'
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter/languages/common.rb:11:in `<module:Common>'
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter/languages/common.rb:6:in `<module:Languages>'
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter/languages/common.rb:5:in `<module:PragmaticSegmenter>'
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter/languages/common.rb:4:in `<top (required)>'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter/languages.rb:5:in `<top (required)>'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter/segmenter.rb:2:in `<top (required)>'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /var/lib/gems/2.3.0/gems/pragmatic_segmenter-0.3.9/lib/pragmatic_segmenter.rb:2:in `<top (required)>'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:127:in `require'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:127:in `rescue in require'
from /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:40:in `require'
from (irb):4
from /usr/bin/irb:11:in `<main>'

Any help would be greatly appreciated.

Thanks

`Washington, D.C.` at end of sentence not segmented.

The text in question:

On April 11, our friends at the Financial Times' "Alphachat" podcast invited THE INDICATOR to host a panel at a bar in Washington, D.C. The joint event was called A Night Of Jargon-Free Economics, and there was even a jargon bell that people in the audience could ring each time someone on the panel used jargon.

Based on rule 14 ("Multi-period abbreviations at the end of a sentence"), the text should be segmented at the end of Washington, D.C. and before The joint.

The actual result:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<wrapper>
<s>On April 11, our friends at the Financial Times&#39; &quot;Alphachat&quot; podcast invited THE INDICATOR to host a panel at a bar in Washington, D.C. The joint event was called A Night Of Jargon-Free Economics, and there was even a jargon bell that people in the audience could ring each time someone on the panel used jargon.</s>
</wrapper>

Leading apostrophe interpreted as start of single quote

Hi, thanks for writing this and making it available, very cool!

I came across an issue, however, involving a word with leading apostrophes. As far as I can tell, a word with a leading apostrophe (represented as a single-quote) gets interpreted as the start of a single-quoted quotation, slurping up all subsequent sentences until a sentence containing another apostrophe gets encountered.

Probably easiest to explain by example:

$ \pry -r pragmatic_segmenter
[1] pry(main)> def seg(s) ; PragmaticSegmenter::Segmenter.new(text: s).segment ; end
:seg
[2] pry(main)> # This works as expected:
[3] pry(main)> seg("I wrote this last year.  It has four sentences.  This is the third, isn't it?  And this is the last")
[
    [0] "I wrote this last year.",
    [1] "It has four sentences.",
    [2] "This is the third, isn't it?",
    [3] "And this is the last"
]
[4] pry(main)> # This gets confused by the leading apostrophe:
[5] pry(main)> seg("I wrote this in the 'nineties.  It has four sentences.  This is the third, isn't it?  And this is the last")
[
    [0] "I wrote this in the 'nineties.  It has four sentences.  This is the third, isn't it?",
    [1] "And this is the last"
]
[6] pry(main)>

(Of course if the text was using a proper unicode right-quote/apostrophe symbol that wouldn't be an issue.)

Does not work!!!

Brothers, this text does not work:

Zur Pflanzengeographie Chinas China oder Sina sagen die einen, andere nennen es Eine einzige Ordnung ist fur das kosmische Leben . Diese letztgenannte Bezeichnung massgebend, und zwar ist es jene Ordnung, die ihm ist eine Ubersetzung des offiziellen chinesischen . 1'' a die Kultur aufpriigt. " mcns, zhong kuo>. Der zweite-ebenfalls offizielle Riesig gross ist China, das Blutenland. 'on der . 1''amc des chinesischen Reichs lautet, Ostgrenze bis zur \Vestgrenze mnfasst es 6o Langen was so'iel heisst wie, ssfumen>-oder, ssfutenlamf, ein grade, was einem Sechstel des Erdumfangs ent Name, der fur dieses Land der Nordhemisphare spricht; Yom Norden zum Suden sind es so Breiten durchaus berechtigt ist, denn hier ist die l.iclfiiltig grade. Das Klima ist kuhl-gemassigt im . 1''orden mit stc, artenreichste Flora der Alten \'elt 1.orhandcn. ariden, wustenhaften Gebieten, im Suden hingegen Zudem sind Blumen, Baume und Straucher nicht gibt es tropische Regenwalder mit entsprechend ho nur Schmuck und Zierde, sondern sie sind seit ur her Luftfeuchtigkeit. Der Chomolongma im Eln alter Zeit mit Bedeutungstiefe 1erankert. Jahreszeit estmassi' ist mit seinen 884H m der hochste Berg der liches Denken und Fuhlen ist in China lebendig Welt. Der tiefste Punkt in China liegt unter dem geblieben. Auch heute noch 11 eiss jedermann, wel Meeresni'eau und misst -'54 m; er befindet sich in che Pflanzen unter den ''ier Edlen> zu 1erstehen der Nahe des Aydingkolsees. Riesenhafte Berge, sind oder 11er d)ie Drei Freunde des Winters> sind."

Test Suite

Dear Kevin,

Thank you for your tool and for the comparison with other tools. I was actually looking for test cases for sentence boundary detection and how different approaches perform on them. Though I have found the results of your testing, I have not found the "set of distinct edge cases" that you created.
I would be glad to use your data for testing, since the Penn Treebank corpus is indeed too expensive for me.

Best,
Artem

reduce memory usage by reusing segmenter

I just realised that for an array of 1000 strings, each 50–300 chars long (URL titles and descriptions generated by gottfrois/link_thumbnailer), the following causes a much higher memory load…
array.map {|string| PragmaticSegmenter::Segmenter.new(text: string, language: 'de').segment }
…than this here:
PragmaticSegmenter::Segmenter.new(text: array.join('\r'), language: 'de').segment

In my tests it's a 30-50MB difference, I assume objects inside a #map will not get garbage collected sequentially but all at once, when the entire array has been mapped.

@diasks2 would you consider updating the API to also support:
ps = PragmaticSegmenter::Segmenter.new(language: 'de'); array.map {|string| ps.segment(string) }
…which would allow to reuse the Segmenter object and will most likely reduce memory load? It would be possible to support the old API as well, by additionally allowing initialisation without a passed text and adding an optional argument to #segment.

As a side note, I've noticed lots of #gsub which probably can be replaced with #gsub! to reduce the strain on the garbage collector. I'll submit a PR whenever I ever get to it, unfortunately my current work load only allows me to report the issue and not much more.

Thanks!

Segmenter modifies the segment

You can see below the last segment was changed. Spaces were removed from inside the [ ]

s = "A representative office of the Bank in Paris was first established in 2047. It later became licensed as a banking branch (with unlimited duration) on 29 November 1983, commencing business in 1986. Its legal name is Bank of China Limited (“BoC Ltd”). The registered address of this branch in France is 23-25 avenue de la Grande Armée, Paris 75016, France. The branch is located in the centre of Paris, close to the Arc de Triomphe.  Its registration number is 322 284 696 R.C.S. Paris, and its telephone number is [     ]."
[4] pry(main)> ps = PragmaticSegmenter::Segmenter.new(text: s, language: 'en').segment
=> ["A representative office of the Bank in Paris was first established in 2047.",
 "It later became licensed as a banking branch (with unlimited duration) on 29 November 1983, commencing business in 1986.",
 "Its legal name is Bank of China Limited (“BoC Ltd”).",
 "The registered address of this branch in France is 23-25 avenue de la Grande Armée, Paris 75016, France.",
 "The branch is located in the centre of Paris, close to the Arc de Triomphe.",
 "Its registration number is 322 284 696 R.C.S. Paris, and its telephone number is [ ]."]

Whitespace getting mangled even with clean turned off

I'm trying to test this library against some larger english corpora but I'm running into trouble aligning the results back to the original text. Even with "clean" turned off, the resulting sentences have modified whitespace in a seemingly unpredictable way.

Unfortunately I can't supply the data to demonstrate the problem due to license restrictions. Looking at the code, it doesn't appear that there is any easy way to ensure that the returned sentence text is fully unmodified; is that correct?

Unable to segment text. Ruby 3.

Hi! So basic stuff. Trying to segment text but not able to. I get this error:

/usr/local/lib/ruby/gems/3.0.0/gems/pragmatic_segmenter-0.3.22/lib/pragmatic_segmenter/processor.rb:37:in `block in split_into_segments': undefined method `apply' for "":String (NoMethodError)
from /usr/local/lib/ruby/gems/3.0.0/gems/pragmatic_segmenter-0.3.22/lib/pragmatic_segmenter/processor.rb:37:in `map!'
from /usr/local/lib/ruby/gems/3.0.0/gems/pragmatic_segmenter-0.3.22/lib/pragmatic_segmenter/processor.rb:37:in `split_into_segments'
from /usr/local/lib/ruby/gems/3.0.0/gems/pragmatic_segmenter-0.3.22/lib/pragmatic_segmenter/processor.rb:30:in `process'
from /usr/local/lib/ruby/gems/3.0.0/gems/pragmatic_segmenter-0.3.22/lib/pragmatic_segmenter/segmenter.rb:26:in `segment'

I am on Ruby 3.0.0p0 and a Mac M1 machine.
Can anybody please help with this?

Language support

Can you list all the supported languages? It would be helpful to know if I were to use this in a project.

How to run a ruby command?

Hi, I am new to Ruby. I have installed pragmatic_segmenter with the command "sudo gem install pragmatic_segmenter". Could you give me an example .rb file that runs the tool on a text? I get the following error when I use the command from your usage section in the irb environment.

NameError: uninitialized constant PragmaticSegmenter
from (irb):2
from /usr/bin/irb:12:in `<main>'

Preserving characters between sentences?

I tried looking in the docs and even in the code and this doesn't seem to be a feature. So this is either a question or a potential feature request:

Is it possible to turn on a setting to preserve the spaces between sentences? It would be ideal to be able to split sentences such that I can later re-create the entire document from the split parts. Like if I had the string "What is a test? This is a test!", I would get the array ['What is a test? ', 'This is a test!']
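Until such a feature exists, one workaround is to re-locate each segment in the original text and keep the characters between segments. The helper below is hypothetical (not a gem feature) and assumes clean: false with every segment occurring verbatim and in order in the original, which, per other issues on this tracker, does not always hold:

```ruby
# Pair each segment with the gap text that preceded it, so the original
# document can be reassembled from the parts.
def segments_with_gaps(original, segments)
  pos = 0
  segments.map do |seg|
    start = original.index(seg, pos) # find this segment at or after pos
    gap = original[pos...start]      # whatever the segmenter skipped
    pos = start + seg.length
    [gap, seg]
  end
end

parts = segments_with_gaps("What is a test? This is a test!",
                           ["What is a test?", "This is a test!"])
parts.map(&:join).join # => "What is a test? This is a test!"
```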

SENTENCE_STARTERS ignores defined language

PragmaticSegmenter::AbbreviationReplacer::SENTENCE_STARTERS is a constant that does not care about the defined language:

SENTENCE_STARTERS = %w(A Being Did For He How However I In It Millions More She That The There They We What When Where Who Why)

I suggest defining it in PragmaticSegmenter::Languages::SomeLanguage, with an empty array as the default.

If given an explanation (or the regex) of how to analyze a dataset for appropriate sentence starters, I could provide the array for German.
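To make the suggestion concrete, here is a sketch of what a per-language constant could look like. This is a hypothetical module layout; the `Common` module and the empty default are assumptions, not the gem's current structure:

```ruby
module PragmaticSegmenter
  module Languages
    module Common
      # Default: no sentence-starter heuristic for languages without a list.
      SENTENCE_STARTERS = [].freeze
    end

    module English
      # The currently hard-coded English list, moved under its language.
      SENTENCE_STARTERS = %w(
        A Being Did For He How However I In It Millions More She That The
        There They We What When Where Who Why
      ).freeze
    end
  end
end
```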

Golden rule for telephone numbers with letters?

I stumbled upon the following case where (the otherwise wonderful) PragmaticSegmenter trips up:

It will split a sentence containing a telephone number that includes letter characters: 800.ACME.NOW is split after 800.:

    it "Telephone number with letters" do
      sentence = "If you have questions, call ACME Enterprises at 800.ACME.NOW (800.123.4567) or visit our website at: ACME-Enterprises.com."
      ps = PragmaticSegmenter::Segmenter.new(text: sentence, language: "en")
      expect(ps.segment).to eq([sentence])
    end

Naming and attribution for port of code

Hi, I've ported this library to C# since I've been doing some work which required sentence boundary detection in C#.

Link

I wanted to check what you would like in terms of attribution and naming of the project.

Currently it is called PragmaticSegmenterNet, but if you wanted it to use a name that is less directly linked, I'd be happy to change it.

Also, I have no understanding of how licenses work, so I wanted to check that you are happy with the license text below:

The MIT License (MIT)

Original work Copyright (c) 2015 Kevin S. Dias
Modified work Copyright (c) 2017 UglyToad

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

But I have no idea if this is the correct way to manage this situation, so let me know if you want some other form of license text. I've included attribution in the README, but if it would also be polite to provide attribution in some kind of Notices.txt file or similar, please let me know.

I hope to feed back any changes or bug-fixes to the Ruby code as I make any further modifications. Please let me know if this is useful or you'd rather avoid this.

Feel free to close this issue straight away. Thanks, Eliot

Any Python equivalent of it?

@diasks2: It's working very nicely on the few examples I have tried in the live demo. Could you point me to any resource where I can find a Python equivalent of this repository? Or could you give a general idea of how I should proceed if I wished to make a Python one?

doc_type

What doc_type values are supported? I have tried 'html', but it is not working.

Kazakh Segmenter

hi,
we have used your segmenter to process a very big corpus (a wiki dump, about 320 MB in size) written in Kazakh, but the segmenter produces some very, very long sentences. Because of these long sentences, I have a problem using them as input to my application. Could you please guide us toward a solution?

Thanks

Segmenting multiple sentences in quotes

I noticed the following is treated as one sentence:

"This is a sentence. And this is another one. And this is the third."

Is that by design? When not surrounded by double quotes, it is split into 3 sentences.

I was segmenting sentences in a novel, and entire paragraphs were surrounded in double quotes and got treated as one huge sentence.
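For anyone needing a stopgap, quoted paragraphs can be re-split in a post-processing pass. Below is a simplistic stdlib-only sketch; it splits on sentence-final punctuation followed by whitespace, which is far cruder than the gem's own rules, and `resegment_quoted` is a hypothetical helper, not gem code:

```ruby
# If a returned segment is a double-quoted paragraph containing several
# sentences, split its inner text and re-attach the surrounding quotes.
def resegment_quoted(segment)
  return [segment] unless segment.start_with?('"') && segment.end_with?('"')
  inner = segment[1..-2]
  parts = inner.split(/(?<=[.!?])\s+/)   # naive sentence split
  return [segment] if parts.length < 2
  parts[0]  = '"' + parts[0]
  parts[-1] = parts[-1] + '"'
  parts
end

resegment_quoted('"This is a sentence. And this is another one. And this is the third."')
# => ["\"This is a sentence.", "And this is another one.", "And this is the third.\""]
```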

Exclamation set apart by em dashes

Hi,

I came across this sentence:

"Mix it, put it in the oven, and -- voila! -- you have cake."

And the exclamation mark is being treated as a sentence break:

[
  "Mix it, put it in the oven, and -- voila!",
  "-- you have cake."
]

More generally it seems that a sentence should be able to contain an em dash clause that ends with an exclamation or question mark:

"There are many -- oh so very many! -- similar cases."
"Some can be -- if I may say so? -- a bit questionable."

You're the expert and I'm not, but I couldn't resist pondering a bit how to handle this :) Here's what occurred to me...

A simple-minded rule would be to avoid a break if the subsequent sentence would start with a dash. But I searched a bit and found that there are counterexamples, such as this line from Moby Dick:

What do you see?—Posted like silent sentinels all around the town, stand thousands upon thousands of mortal men fixed in ocean reveries.

So that would imply that a slightly less simple-minded rule would be to suppress a break due to a question or exclamation mark, if that mark is immediately followed by a dash and if there is a dash somewhere earlier in the sentence.
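That proposed rule can be expressed as a small predicate. A sketch (`suppress_break?` is a hypothetical helper, not gem code): suppress the break at a `!` or `?` when the mark is immediately followed by a dash and another dash occurs earlier in the candidate sentence:

```ruby
# Returns true when a sentence break at the '!' or '?' located at `index`
# should be suppressed: the mark is immediately followed by a dash
# ("--" or "—") and another dash appears earlier in the sentence.
def suppress_break?(text, index)
  after  = text[(index + 1)..-1].to_s
  before = text[0...index]
  !!(after =~ /\A\s*(?:--|—)/) && !!(before =~ /--|—/)
end

t = "There are many -- oh so very many! -- similar cases."
suppress_break?(t, t.index("!"))   # => true  (keep as one sentence)

m = "What do you see?—Posted like silent sentinels all around the town."
suppress_break?(m, m.index("?"))   # => false (no earlier dash; break allowed)
```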

In any case, thanks as always for the great gem

Abbreviations

Rs. (or rs.) is a standard abbreviation for the currency Rupees (INR).

pragmatic segmenter not installing

Hello,

I have been using pragmatic segmenter by following the steps below:
sudo apt-get install ruby-full
gem install pragmatic_segmenter

And after installing pragmatic_segmenter I got this:
Successfully installed pragmatic_segmenter-0.3.22
Parsing documentation for pragmatic_segmenter-0.3.22
Done installing documentation for pragmatic_segmenter after 0 seconds
1 gem installed

And I am using it to segment the sentences in a whole file with the code block below:

require 'pragmatic_segmenter'

if ARGV.length < 3
  puts "\nUsage : ruby2.5 sentenceTokenizer.rb 639-1ISOlangCode textFilePath sentencesFilePath"
  exit
end

File.open(ARGV[1]).each do |line1|
  line1.delete!('()[]{}<>|$/\'"')
  ps = PragmaticSegmenter::Segmenter.new(text: line1, language: ARGV[0], doc_type: 'txt')
  sentences = ps.segment
  File.open(ARGV[2], "a") do |line2|
    sentences.each { |sentence| line2.puts sentence }
  end
end

But the problem is that now I get the error below:

Traceback (most recent call last):
	2: from sentenceTokenizer.rb:1:in `<main>'
	1: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
/usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require': cannot load such file -- pragmatic_segmenter (LoadError)

Could you help me, please? The same code and the same steps worked before, so I am curious why I get this error now.
