
nlp_chinese_corpus's Issues

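Community Q&A data: answers are truncated

Short answers are fine, but nearly all long answers are cut off at around 200 characters. I spot-checked a few dozen examples at random and compared them against the originals online; essentially none of the long answers is complete.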
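One way to put a number on this observation is to measure how many answers end within a few characters of the suspected cutoff. The sketch below is only an illustration: it assumes the data is stored as one JSON object per line and that the answer text lives in a `content` field, as in the webtext2019zh sample record on this page. The helper names, the 200-character center, and the tolerance are all hypothetical choices, and the sample records are synthetic.

```python
import json

def content_lengths(lines, field="content"):
    """Yield the character length of `field` for each JSON line that has it."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value = json.loads(line).get(field)
        if isinstance(value, str):
            yield len(value)

def share_near(lengths, center=200, tolerance=10):
    """Return the fraction of lengths within `tolerance` of `center`."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    hits = sum(1 for n in lengths if abs(n - center) <= tolerance)
    return hits / len(lengths)

# Synthetic records (not real corpus data): two short, two near the cutoff.
sample = [
    json.dumps({"content": "a" * 30}),
    json.dumps({"content": "b" * 200}),
    json.dumps({"content": "c" * 198}),
    json.dumps({"content": "d" * 75}),
]
ratio = share_near(content_lengths(sample))
print(ratio)
```

On a real file, passing the open file object as `lines` would give the share of answers that end close to 200 characters; a sharp spike there would support the truncation report.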

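Copyright of the corpus

Hello, the released corpus is high quality and large in scale, which is very helpful!
Could you tell me whether the released corpus has any copyright issues, and whether it can be used commercially?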

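Please provide a download option other than Baidu Netdisk

Please provide a download option other than Baidu Netdisk. A personal Google Drive account also comes with 15 GB, and Tencent Weipan would work too. Downloads from Baidu really do not complete, and we cannot all be expected to buy a membership; it is painful... Alternatively, a torrent would also work, as another user suggested. I sincerely ask the maintainer to consider this suggestion. Thank you!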

<br> in webtext2019zh

The text in webtext2019zh still contains <br>. Are there plans to remove these HTML tags?

{"qid": 65618973, "title": "AlphaGo只会下围棋吗?阿法狗能写小说吗?", "desc": "那么现在会不会有智能机器人能从事文学创作?<br>如果有,能写出什么水平的作品?", "topic": "机器人", "star": 3, "content": "AlphaGo只会下围棋,因为它的设计目的,架构,技术方案以及训练数据,都是围绕下围棋这个核心进行的。它在围棋领域的突破,证明了深度学习深度强化学习MCTS技术在围棋领域的有效性,并且取得了重大的PR效果。AlphaGo不会写小说,它是专用的,不会做跨出它领域的其它事情,比如语音识别,人脸识别,自动驾驶,写小说或者理解小说。如果要写小说,需要用到自然语言处理(NLP))中的自然语言生成技术,那是人工智能领域一个", "answer_id": 545576062, "answerer_tags": "人工智能@游戏业"}
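Until the tags are removed upstream, they can be stripped on the reader's side. The following is a minimal sketch, not the maintainer's method: it assumes each record is one JSON object per line and that the affected fields are `title`, `desc`, and `content`, as in the sample record above. The helper names `strip_br` and `clean_record` are hypothetical, and replacing the tag with a newline is just one possible choice.

```python
import json
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

def strip_br(text, replacement="\n"):
    """Replace every <br> variant in `text` with `replacement`."""
    return BR_TAG.sub(replacement, text)

def clean_record(line, fields=("title", "desc", "content")):
    """Parse one JSON line and strip <br> from the given string fields."""
    record = json.loads(line)
    for name in fields:
        value = record.get(name)
        if isinstance(value, str):
            record[name] = strip_br(value)
    return record

# Shortened copy of the sample record shown above.
sample = '{"qid": 65618973, "desc": "那么现在会不会有智能机器人能从事文学创作?<br>如果有,能写出什么水平的作品?"}'
cleaned = clean_record(sample)
print(cleaned["desc"])
```

Only `<br>` is handled here because it is the only tag reported in this issue; other tags or HTML entities, if present, would need a proper HTML parser rather than a regular expression.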

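Data source

Hello, could you tell me where this data comes from, and whether there is a source that can be cited? Thank you very much for your contribution.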

JSON conversion

The JSON file stores text as Unicode escape sequences: "text":"\u30a2\u30d5\u30ea\u30ab \u30a2\u30d5\u30ea\u30ab\uff08\u82f1\u00a0:

    lines1 = f1.read()
    lines1 = lines1.encode('utf-8').decode("unicode_escape")

print(path1+':'+line)

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 118-119: surrogates not allowed

How can this error be resolved?
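A likely cause is that `unicode_escape` decodes each `\uXXXX` escape on its own, so a character stored as a surrogate pair ends up as two lone surrogates that UTF-8 cannot encode. Parsing the line with `json.loads` avoids this, because the JSON decoder joins surrogate pairs into a single character. The sketch below assumes each line of the file is a valid JSON object; the `raw` string is a made-up example, not data from the corpus.

```python
import json

# A raw line as it might appear in the file: non-ASCII characters are stored
# as \uXXXX escapes, and characters outside the BMP as surrogate pairs.
raw = '{"text": "\\u30a2\\u30d5\\u30ea\\u30ab \\ud83d\\ude00"}'

# json.loads decodes the escapes and joins the surrogate pair into one character.
record = json.loads(raw)
text = record["text"]
print(text)

# For comparison, the unicode_escape route leaves two lone surrogates behind,
# which cannot be encoded as UTF-8.
broken = raw.encode("utf-8").decode("unicode_escape")
try:
    broken.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```

For a whole file, the same idea applies line by line: open it with `encoding="utf-8"` and call `json.loads` on each non-empty line instead of decoding the raw text with `unicode_escape`.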
