kkrugler / yalder

Yet another language detector

License: Apache License 2.0

Java 100.00%

yalder's People

tballison kzwang001

yalder's Issues

[WikipediaCrawlTool] Set Accept-Language to zh-Hant and zh-Hans for Chinese

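For Chinese, this means making two requests for each page: one with Accept-Language set to zh-Hant and one with it set to zh-Hans.

Try it with https://zh.wikipedia.org/wiki/疏花木犀榄

The two requests should return different pages, one in traditional script (zh-Hant) and one in simplified script (zh-Hans).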
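A minimal sketch of the two-request idea, using the JDK's java.net.http API. The class and method names (ChineseVariantRequests, buildRequests) are hypothetical and are not taken from WikipediaCrawlTool; whether the server really returns different script variants for these header values is what the issue proposes to check.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.http.HttpRequest;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of yalder.
class ChineseVariantRequests {
    // The two Accept-Language values named in the issue.
    static final List<String> VARIANTS = List.of("zh-Hant", "zh-Hans");

    /** Builds one GET request per variant for the same page title. */
    static List<HttpRequest> buildRequests(String title) {
        URI pageUri;
        try {
            // The multi-argument constructor quotes illegal characters, and
            // toASCIIString() percent-encodes the non-ASCII title as UTF-8.
            URI raw = new URI("https", "zh.wikipedia.org", "/wiki/" + title, null);
            pageUri = URI.create(raw.toASCIIString());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Bad page title: " + title, e);
        }

        List<HttpRequest> requests = new ArrayList<>();
        for (String variant : VARIANTS) {
            requests.add(HttpRequest.newBuilder(pageUri)
                    .header("Accept-Language", variant)
                    .GET()
                    .build());
        }
        return requests;
    }
}
```

Each request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and the two bodies compared.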

Fix tokenizer to handle incremental additions of characters

Keep a buffer, and avoid processing all the way to its end (stay at least maxNGramLength characters away) until a complete() call is made. This way we don't get bogus ngrams when handling repeated addText() calls while extracting text fragments from, say, an HTML document.

As part of this, we will no longer create a new tokenizer for every call to addText(), which is more efficient.
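The buffering described above can be sketched as follows. This is an illustration under stated assumptions, not yalder's tokenizer: the class name IncrementalNGramTokenizer, the ngrams() accessor, and the choice to emit every ngram of length 1 to maxNGramLength are hypothetical, and text normalization is left out. Only addText(), complete(), and maxNGramLength come from the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not yalder's actual tokenizer.
class IncrementalNGramTokenizer {
    private final int maxNGramLength;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> ngrams = new ArrayList<>();

    IncrementalNGramTokenizer(int maxNGramLength) {
        if (maxNGramLength < 1) {
            throw new IllegalArgumentException("maxNGramLength must be >= 1");
        }
        this.maxNGramLength = maxNGramLength;
    }

    /** Appends text and emits ngrams only for start positions with a full window. */
    void addText(CharSequence text) {
        buffer.append(text);
        // A start position is safe when maxNGramLength chars are available from it.
        int safeCount = Math.max(0, buffer.length() - maxNGramLength + 1);
        emit(safeCount);
    }

    /** Signals end of input: the remaining tail is processed with shorter windows. */
    void complete() {
        emit(buffer.length());
    }

    List<String> ngrams() {
        return ngrams;
    }

    // Emits all ngrams starting at positions [0, count) and drops those chars.
    private void emit(int count) {
        for (int start = 0; start < count; start++) {
            for (int len = 1; len <= maxNGramLength && start + len <= buffer.length(); len++) {
                ngrams.add(buffer.substring(start, start + len));
            }
        }
        buffer.delete(0, count);
    }
}
```

With maxNGramLength set to 2, calling addText("ab"), addText("c"), and complete() produces the same ngrams as a single addText("abc") followed by complete(), so fragment boundaries no longer show up in the output. The sketch works on Java char values, so surrogate pairs would need extra handling.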

Save training data in public S3 bucket

Switch to gradle for build

Currently we're using ant (with the ant-maven task), but Gradle would be better.

And then we wouldn't get this warning...

warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

Use a priori probabilities to change per-language dampening

Using the a priori probabilities only as starting probabilities doesn't have much impact, because the initial differences fade quickly as text is processed.

If we also used them when re-calculating the relative probabilities (to change the damping per language, for example), then their impact would persist.
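One way to picture priors that keep acting at every re-calculation is the sketch below. It is not yalder's scoring code and not necessarily what the issue has in mind: the class name PriorAwareUpdate, the priorWeight parameter, and the linear blend toward the prior are all assumptions made for illustration.

```java
// Hypothetical sketch: blend the prior back in at every update step.
class PriorAwareUpdate {
    /**
     * One update step: multiply each language's probability by the ngram
     * likelihood, renormalize, then pull the result toward the prior.
     * priorWeight is assumed to be in [0, 1]; 0 ignores the prior.
     */
    static double[] update(double[] current, double[] likelihood,
                           double[] prior, double priorWeight) {
        int n = current.length;
        double[] next = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            next[i] = current[i] * likelihood[i];
            sum += next[i];
        }
        for (int i = 0; i < n; i++) {
            // Fall back to the prior if every language scored zero.
            double normalized = sum > 0.0 ? next[i] / sum : prior[i];
            next[i] = (1.0 - priorWeight) * normalized + priorWeight * prior[i];
        }
        return next;
    }
}
```

Because the prior is mixed in on every call rather than only at the start, a language with a high prior keeps an edge no matter how many ngrams have been processed.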

[WikipediaCrawlTool] Support crawl for just a few languages

For example, when fetching the (two) Chinese pages (see issue #5).

This way we don't have to recrawl everything.
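A minimal sketch of restricting a crawl to a few languages. The class CrawlLanguageFilter, the selectLanguages method, and the idea of a comma-separated option value are hypothetical; WikipediaCrawlTool's real command-line options are not shown in this issue.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of WikipediaCrawlTool.
class CrawlLanguageFilter {
    /**
     * Returns the languages to crawl. A null or blank option means
     * "crawl everything"; otherwise only the listed codes are kept,
     * in the order they appear in allLanguages.
     */
    static List<String> selectLanguages(List<String> allLanguages, String option) {
        if (option == null || option.isBlank()) {
            return allLanguages;
        }
        Set<String> wanted = Arrays.stream(option.split(","))
                .map(String::trim)
                .filter(code -> !code.isEmpty())
                .collect(Collectors.toCollection(LinkedHashSet::new));
        return allLanguages.stream()
                .filter(wanted::contains)
                .collect(Collectors.toList());
    }
}
```

Passing "zh" would limit the run to the Chinese pages, leaving previously crawled data for other languages untouched.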

Move extended languages into separate jar

Add training data for zho-Hans vs. zho-Hant

Currently I think we just have simplified Chinese (zho-Hans) data, as that's what the Wikipedia pages are defaulting to.
