Wongnai-corpus

This project is a collection of Wongnai's datasets, most of which are in Thai. We hope these datasets will advance research in natural language processing (NLP), especially for the Thai language.

1. Search query dataset

The dataset contains 500,000 unique words extracted from search queries. These words were labeled by algorithms and by judges for a word segmentation task. Our segmentation criterion is to keep the longest possible food word as a single segment, in order to achieve the highest precision in the search system.
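To illustrate the "longest food word" criterion, here is a minimal sketch of greedy longest-match segmentation against a dictionary. It is not the labeling algorithm used for this dataset. The function name `longest_match_segment` and the four-word toy dictionary are assumptions made for the example only.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation that prefers the longest dictionary word.

    Illustrative sketch only; `dictionary` is any set of known words.
    Characters not covered by a dictionary word are grouped into one
    "unknown" segment.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    segments = []
    unknown = ""
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in dictionary:
                match = text[i:end]
                break
        if match is None:
            # No dictionary word starts here; extend the unknown run.
            unknown += text[i]
            i += 1
        else:
            if unknown:
                segments.append(unknown)
                unknown = ""
            segments.append(match)
            i += len(match)
    if unknown:
        segments.append(unknown)
    return segments


# Toy dictionary (hypothetical, not taken from food_dictionary.txt).
toy_dictionary = {"ข้าว", "ผัด", "กุ้ง", "ข้าวผัด"}
print(longest_match_segment("ข้าวผัดกุ้ง", toy_dictionary))
# → ['ข้าวผัด', 'กุ้ง']
```

Because the longest candidate is tried first, "ข้าวผัด" (fried rice) is kept whole instead of being split into "ข้าว" and "ผัด".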

1.1 Files

  • search/labeled_queries_by_algo.txt : List of 500K words labeled by the algorithms described in detail in the blog post.

  • search/labeled_queries_by_judges.txt : List of 10K words labeled by judges following Wongnai's search criteria.

  • search/food_dictionary.txt : List of 400K food words used for labeling search/labeled_queries_by_algo.txt.

Please note that these words were collected from user-generated content (UGC), so they may include some off-topic words.

1.2 Usage

  • You may use labeled_queries_by_algo.txt to train your own word segmentation model by splitting it into training and validation sets, and then evaluate your model with labeled_queries_by_judges.txt.
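The workflow above can be sketched as follows. This is a rough outline under stated assumptions, not an official evaluation script. The helper names (`train_validation_split`, `boundaries`, `boundary_scores`) are hypothetical, and boundary-level precision/recall is one common way to score segmentation, not necessarily the metric Wongnai uses. The sketch works on segmentations already parsed into lists of segments; how each line of the files encodes its segments is not specified here, so parsing is left to you.

```python
import random


def train_validation_split(examples, validation_fraction=0.1, seed=0):
    """Shuffle examples reproducibly and split them into (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def boundaries(segments):
    """Return the character offsets at which a segmentation places a cut."""
    positions = set()
    offset = 0
    for segment in segments[:-1]:  # no cut after the final segment
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_scores(predicted, gold):
    """Micro-averaged boundary precision, recall and F1.

    `predicted` and `gold` are parallel lists; each item is the list of
    segments for the same word. Scores default to 0.0 when a denominator
    is zero (a convention chosen for this sketch).
    """
    true_positive = n_predicted = n_gold = 0
    for predicted_segments, gold_segments in zip(predicted, gold):
        predicted_cuts = boundaries(predicted_segments)
        gold_cuts = boundaries(gold_segments)
        true_positive += len(predicted_cuts & gold_cuts)
        n_predicted += len(predicted_cuts)
        n_gold += len(gold_cuts)
    precision = true_positive / n_predicted if n_predicted else 0.0
    recall = true_positive / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy usage with made-up data: 90/10 split, then score one prediction.
train, validation = train_validation_split(range(100), validation_fraction=0.1)
print(len(train), len(validation))
print(boundary_scores([["ab", "c"]], [["a", "b", "c"]]))
```

In practice, `examples` would be the parsed lines of labeled_queries_by_algo.txt, and `gold` would come from labeled_queries_by_judges.txt, which is kept aside for the final evaluation only.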

2. Review dataset

The review dataset contains restaurant reviews and their ratings, which fall into 5 classes ranging from 1 to 5 stars.

2.1 Files

2.2 Usage

Contributors

diewland, ekkalak-t, tanapoln
