The text-preprocess from zhang-hongshen

Project Introduction

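This project implements the early text-preprocessing stage of a search engine: it extracts plain text from crawled HTML files and returns the terms found in it. My abilities are limited, so please bear with any bugs.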
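To make the pipeline concrete, here is a minimal sketch of the two steps described above: stripping HTML down to plain text, then splitting that text into terms. It is not the project's actual code. The class `HtmlTextSketch` and its methods are hypothetical names, the tag removal is a simple regex pass that does not decode HTML entities, and the tokenizer only splits on non-letter, non-digit characters, so it suits English text and does not segment Chinese.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original project.
class HtmlTextSketch {
    // Matches <script>...</script> and <style>...</style> including their content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag such as <p> or </div>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // Matches runs of characters that are neither letters nor digits.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Removes script/style blocks and tags, then collapses whitespace.
    static String extractText(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Lowercases the text and splits it into terms on non-word characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : NON_WORD.split(text.toLowerCase(Locale.ROOT))) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Hello, <b>search</b> engines!</p></body></html>";
        System.out.println(tokenize(extractText(html))); // prints [hello, search, engines]
    }
}
```

A regex pass like this is only adequate for simple, well-formed pages. A dedicated HTML parser would be more robust against malformed markup.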

Environment Setup

Name	Description	Version
IntelliJ IDEA	Development IDE	2021.1
JDK	Java development kit	11.0
fastjson	Third-party JAR (JSON library)	1.2.76

Usage

Fill in your personal parameters

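package com.zhanghongshen.textpreprocess;

public class AliyunNlp {
    /**
     * Personal parameters: fill in your own Aliyun account details
     * before running. Do not commit real credentials.
     */
    private String ResourceOwnerAccount = "";
    private String AccessKeyId = "";
    private String AccessKeySecret = "";
}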

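Known issues in this version

Chinese documents: tokenization relies on Aliyun NLP's Chinese word segmentation service, which limits call frequency, so sending too many requests within a very short period is very likely to fail;
English documents: Porter stemming is imprecise, so some words are reduced to an incorrect stem.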
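
One common way to work around a call-frequency limit like the one described above is to retry failed requests with an increasing delay. The sketch below is a generic illustration and not something the project currently does. `RetrySketch` and `callWithBackoff` are hypothetical names, the request is passed in as a plain `Callable`, and the simulated failure in `main` stands in for a throttled segmentation call rather than any real Aliyun API.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration class; not part of the original project.
class RetrySketch {
    // Runs the request, retrying on any exception with exponentially growing delays.
    // Rethrows the last exception once maxAttempts is exhausted.
    static <T> T callWithBackoff(Callable<T> request, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // no point in sleeping after the final attempt
                }
                Thread.sleep(delay);
                delay *= 2; // double the wait before the next attempt
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request: fails twice, then succeeds (assumption for the demo).
        String result = callWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated throttling");
            }
            return "segmented";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Spacing requests out before sending them (for example, a fixed pause between documents) would reduce failures further. Retrying only handles the failures that still occur.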
