c4-zh's Introduction

C4-zh

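As pre-trained models have advanced, Chinese pre-trained models have become increasingly important to both academia and industry. We collect Chinese natural-language data from C4 and other public datasets in order to build a large-scale, high-quality Chinese pre-training corpus.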

Goal

Build 100 GB of high-quality unsupervised Chinese text, drawn from news, encyclopedias, reviews, and similar sources.

Data sources

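  • Size of data collected so far

Source | Scale | Size | Source link | Download link (self-built or cleaned)
Sohu News | 2008-2019, about 6 million articles in total, not deduplicated | 21 GB | 2012, 2014-2016, 2009-2016 |
Baidu Zhidao | 600,000 entries | 3 GB | |
Baidu Search | 600,000 entries | 3 GB | |
Sina News | Rolling news, 2008-2019, about 100,000 articles in total | 2 GB | |
Baidu Baike | 2012 snapshot, 4 million entries | 22 GB | |
Baidu Baike | 2019 snapshot, 5 million entries | 50 GB | baike.baidu.com | No download provided; see the download tutorial
Tsinghua News | 860,000 articles | 4 GB | |
Chinese Wikipedia | 500,000 entries | 2 GB | |
WeChat official account articles | Unknown | 3 GB | Source |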
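
As a quick sanity check against the 100 GB goal, the sizes listed above can be tallied. The sketch below is only illustrative: the sizes are copied from the table, while the dictionary keys and the `summarize` helper are hypothetical names introduced here and are not part of this repository. The total is a raw figure. Sohu News is not yet deduplicated and the two Baidu Baike snapshots are likely to overlap, so the usable amount after cleaning will probably be smaller.

```python
# Sizes in GB, copied from the table above. The keys are illustrative labels.
SOURCE_SIZES_GB = {
    "Sohu News": 21,
    "Baidu Zhidao": 3,
    "Baidu Search": 3,
    "Sina News": 2,
    "Baidu Baike (2012)": 22,
    "Baidu Baike (2019)": 50,
    "Tsinghua News": 4,
    "Chinese Wikipedia": 2,
    "WeChat official account articles": 3,
}

TARGET_GB = 100  # goal stated in this README


def summarize(sizes_gb, target_gb):
    """Return the raw total, the gap to the target, and each source's share."""
    total = sum(sizes_gb.values())
    shares = {name: size / total for name, size in sizes_gb.items()}
    return {
        "total_gb": total,
        "surplus_gb": total - target_gb,  # negative means still short of target
        "shares": shares,
    }


if __name__ == "__main__":
    summary = summarize(SOURCE_SIZES_GB, TARGET_GB)
    print(f"raw total: {summary['total_gb']} GB (target {TARGET_GB} GB)")
    # List sources from largest to smallest share of the raw total.
    for name, share in sorted(summary["shares"].items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.1%}")
```

Keeping the figures in one dictionary makes it easy to update the tally as sources are added or shrink after deduplication.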

Reference

Large Scale Chinese Corpus for NLP (大规模中文自然语言处理语料): https://github.com/brightmart/nlp_chinese_corpus

Contributors

pluto-junzeng
