
douban-comments-mapreduce's Introduction

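Chinese Word Segmentation and Cleaning with MapReduce

Many steps here are similar to the data cleaning process above, so for the parts that are the same only screenshots of the process are shown.

Create a new MapReduce project: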

2018010587-__19__0001.jpg

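Configure the Chinese word segmentation library:

Chinese word segmentation differs considerably from English word-frequency counting: Chinese sentences and paragraphs cannot be cut with a simple delimiter (doing so breaks the meaning), so I used an open-source third-party segmentation library, IKAnalyzer.

Download (Google Code Archive): http://code.google.com/archive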
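To make the delimiter problem concrete, here is a minimal sketch in plain Java. The class and method names are made up for illustration and are not part of the project. It splits text on whitespace the way an English word count typically does: the English sentence falls apart into words, while the Chinese sentence comes back as one unsplit token.

```java
import java.util.Arrays;
import java.util.List;

public class DelimiterSplitDemo {

    /** Splits text on runs of whitespace, as a simple English word count would. */
    public static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // English: whitespace separates the words.
        System.out.println(splitOnWhitespace("I like this movie"));
        // Chinese: there are no spaces, so the whole sentence stays in one piece.
        System.out.println(splitOnWhitespace("我喜欢这部电影"));
    }
}
```

This is why a dictionary-based segmenter is needed before any counting can happen.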

2018010587-__19__0002.jpg

After downloading, drag the segmentation package into the project's library directory (I used version 3.2.8 here):

2018010587-__20__0001.jpg

Map stage code:

2018010587-__20__0002.jpg

Word segmentation code:

2018010587-__20__0003.jpg

2018010587-__21__0001.jpg
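Since the code above is only available as screenshots, the following is a rough sketch of what a segmenting mapper might look like. It is not a transcription of the author's code. It assumes the IK Analyzer 3.2.x API (`IKSegmentation` and `Lexeme` in `org.wltea.analyzer`), the Hadoop 2.x `mapreduce` API, and that each input line holds one comment's text. The class name `SegmentMapper` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;

// Hypothetical mapper: segments each input line and emits (token, 1).
public class SegmentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: the second argument enables max-word-length mode,
        // which prefers longer dictionary words over finer-grained splits.
        IKSegmentation segmenter =
                new IKSegmentation(new StringReader(value.toString()), true);

        Lexeme lexeme;
        // next() returns null once the input has been fully consumed.
        while ((lexeme = segmenter.next()) != null) {
            word.set(lexeme.getLexemeText());
            context.write(word, ONE);
        }
    }
}
```

Compiling this requires both the Hadoop client libraries and the IK Analyzer jar on the classpath.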

Reduce stage code:

2018010587-__21__0002.jpg

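A regular expression is used here to filter the keys so that only Chinese segmentation results are kept, and the values for each key are reduced (aggregated) at the same time.

Execution stage code:

This part of the code mainly specifies the Mapper class and the Reducer class, sets the input and output file information, and submits the MapReduce job.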
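The filter-and-sum logic of the reduce step can be sketched in plain Java without Hadoop. The class name, the method names, and the exact pattern are assumptions, since the original regular expression is only visible in the screenshot. The pattern `[\u4e00-\u9fa5]+` is a commonly used range for Chinese characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Pattern;

public class ChineseTokenReduce {

    // Assumed pattern: one or more characters in the common CJK ideograph range.
    private static final Pattern CHINESE_ONLY = Pattern.compile("[\\u4e00-\\u9fa5]+");

    /** True only when the whole token consists of Chinese characters. */
    public static boolean isChineseToken(String token) {
        return CHINESE_ONLY.matcher(token).matches();
    }

    /**
     * Mimics one reduce call: drops non-Chinese keys, otherwise sums the counts.
     * An empty result means the key was filtered out.
     */
    public static OptionalInt reduce(String key, List<Integer> values) {
        if (!isChineseToken(key)) {
            return OptionalInt.empty();
        }
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return OptionalInt.of(sum);
    }

    public static void main(String[] args) {
        System.out.println(reduce("电影", Arrays.asList(1, 1, 1)));
        System.out.println(reduce("2018", Arrays.asList(1)));
    }
}
```

In a real Hadoop reducer the same check would run on the `Text` key, and the sum would be written with `context.write` only when the key passes.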

2018010587-__21__0003.jpg

2018010587-__22__0001.jpg
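As a rough sketch of such a driver, assuming the Hadoop 2.x `mapreduce` API: the class names are hypothetical, and the nested mapper is only a stand-in that emits each whole line as a key so that the example is self-contained. The real job would plug in the segmenting mapper and the filtering reducer instead.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    /** Stand-in mapper (assumption): emits the whole line with a count of 1. */
    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);
        }
    }

    /** Sums the counts collected for each key. */
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCountDriver <input path> <output path>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chinese word count");
        job.setJarByClass(WordCountDriver.class);

        // Specify the Mapper and Reducer classes.
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(SumReducer.class);

        // Declare the output key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations; the output directory must not exist yet.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```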

Package and run the MapReduce job:

2018010587-__22__0002.jpg

View the results:

Once the job finishes, you can browse the HDFS output folder to check the result.

2018010587-__22__0003.jpg 2018010587-__23__0001.jpg

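After downloading the HDFS files to the local machine, you can see the word-frequency statistics based on Chinese word segmentation: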

2018010587-__23__0002.jpg
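Once the output is on the local machine, a small helper can list the most frequent words. The sketch below assumes the job used Hadoop's default text output, where each line is `word<TAB>count`. The class name `TopWords` and the file path passed on the command line are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TopWords {

    /** Parses "word<TAB>count" lines and returns the n entries with the highest counts. */
    public static List<Map.Entry<String, Integer>> topN(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip lines that do not match the assumed format
            }
            entries.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1].trim())));
        }
        // Sort by count, highest first; ties keep their input order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: TopWords <downloaded part file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : topN(lines, 20)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```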

Contributors: emptinessboy
