big-data-management-analytics-project

CS 6350.001 - Big Data Project

Project 2: De-duplication of Spanish Language Articles using UDPipe.

Team Members:

  1. Ishan Sharma - IXS171130
  2. Mavis Francia - MCF140030
  3. Tanushri Singh - TTS150030
  4. Vyaas Shenoy - VNS170230

Running The Crawler

There are two crawlers: News Please and RSS Crawler.

News Please

News Please can be installed by running pip install news-please and started with news-please -c Config/config.cfg.

By default, it uses the config.cfg file inside the Crawler folder. Some websites tend to crash the crawler; these sites are marked with comments in the hjson site list.

The crawler writes articles to disk, and these files can be indexed by running python misc/mongo_index.py. The script also has some extra configuration options at the top of the file that can be changed.
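
As a rough illustration of the indexing step, the sketch below walks a directory of crawler output and upserts each article into MongoDB. It is not the contents of misc/mongo_index.py. It assumes the crawler output is one JSON file per article with a url field, that the target is the big_data database and spanish_articles collection mentioned later in this README, and that MongoDB runs on localhost. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_articles(root):
    """Collect one dict per readable JSON article file found under `root`."""
    articles = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                article = json.load(handle)
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or partially written files
        # Assumption: every usable article is a JSON object with a "url" key.
        if isinstance(article, dict) and article.get("url"):
            articles.append(article)
    return articles


def index_articles(articles, uri="mongodb://localhost:27017",
                   database="big_data", collection="spanish_articles"):
    """Upsert articles keyed on URL, so re-running does not create duplicates."""
    if not articles:
        return 0
    from pymongo import MongoClient, UpdateOne  # third-party: pip install pymongo

    operations = [
        UpdateOne({"url": article["url"]}, {"$set": article}, upsert=True)
        for article in articles
    ]
    client = MongoClient(uri)
    result = client[database][collection].bulk_write(operations)
    return result.upserted_count + result.modified_count
```

Keying the upsert on the URL is a design choice for this sketch: the crawler may revisit a page, and an upsert keeps one document per URL instead of one per crawl.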

RSS Crawler

RSS Crawler uses the Crawler/config/sitelist_rss.hjson config file. If the sites in sitelist.hjson have changed, this file can be regenerated automatically by running python3 misc/crawler_config_transformer.py.
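
What the transformer script does internally is not shown here. One plausible version, sketched below, copies every entry under base_urls and sets its crawler to RssCrawler. The helper names are hypothetical, and the sketch works on an already-parsed dict rather than reading the hjson file itself (the third-party hjson package could be used for that).

```python
import json


def to_rss_sitelist(sitelist):
    """Return a new site list in which every site uses the RssCrawler."""
    rss_sites = []
    for site in sitelist.get("base_urls", []):
        entry = dict(site)  # copy, so the input site list is left untouched
        entry["crawler"] = "RssCrawler"
        rss_sites.append(entry)
    return {"base_urls": rss_sites}


def render(sitelist):
    """Serialize as JSON, which is also valid Hjson."""
    return json.dumps(sitelist, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    example = {"base_urls": [{"url": "https://example.com", "crawler": "RecursiveCrawler"}]}
    print(render(to_rss_sitelist(example)))
```

Writing plain JSON keeps the sketch dependency-free; the comments that mark crash-prone sites in the original hjson would be lost in such a round trip.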

Doc2Vec

The model can be trained using misc/doc2vec.py. The script reads data from the spanish_articles collection of the MongoDB database big_data and saves the trained model to misc/models/.
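
For orientation, a minimal training loop with gensim might look like the sketch below. It is not a copy of misc/doc2vec.py: the tokenizer, the hyperparameters (vector_size, min_count, epochs) and the function names are assumptions chosen for illustration, and the articles are passed in as (tag, text) pairs instead of being read from MongoDB.

```python
import re

# Runs of Unicode letters, so accented Spanish characters stay inside tokens.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def train_doc2vec(articles, model_path, vector_size=100, epochs=20):
    """Train a Doc2Vec model on (tag, text) pairs and save it to `model_path`."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument  # pip install gensim

    corpus = [
        TaggedDocument(words=tokenize(text), tags=[tag]) for tag, text in articles
    ]
    model = Doc2Vec(vector_size=vector_size, min_count=2, epochs=epochs)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(model_path)
    return model
```

The tag of each article could be its MongoDB _id or URL (as a string), so that a learned vector can be traced back to the stored document.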

After the model is trained, run Pandas/doc2vec.py. It reads from the spanish_articles collection and writes to the d2v_calculated collection.
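
The README does not say what is stored in d2v_calculated. Assuming it holds pairwise similarity between document vectors, which would fit the de-duplication goal, the comparison could be sketched as below. The cosine measure, the 0.9 threshold and the function names are all assumptions made for this example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold=0.9):
    """Return (id_a, id_b, score) for each pair scoring at or above `threshold`.

    `vectors` maps a document id to its vector (a sequence of floats).
    """
    pairs = []
    for id_a, id_b in combinations(sorted(vectors), 2):
        score = cosine_similarity(vectors[id_a], vectors[id_b])
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs
```

Comparing every pair is quadratic in the number of articles, so a real run over a large collection would likely need batching or an approximate nearest-neighbour search.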

Spark Streaming for UDPipe

Download a Universal Dependencies model for Spanish. Here are two different ones:

You can find a full list of models here.

Install the ufal.udpipe library by running: pip install ufal.udpipe. You can read more about this library here.

Once UDPipe is installed, the Spark job can be run using spark-submit Streaming/streamToSpark.py. It writes the results to the MongoDB collection udpipe_parse.
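
To show what a single parse looks like outside Spark, here is a small sketch using the ufal.udpipe bindings. It is not taken from Streaming/streamToSpark.py. The model path is whichever Spanish model file was downloaded, and the helper names and the choice of fields kept per token are assumptions.

```python
def parse_conllu(conllu):
    """Turn CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in conllu.splitlines():
        if not line.strip():
            if current:  # a blank line ends the current sentence
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        fields = line.split("\t")
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skip multiword ranges ("2-3") and empty nodes ("1.1")
        current.append({
            "id": int(fields[0]),
            "form": fields[1],
            "lemma": fields[2],
            "upos": fields[3],
            "head": int(fields[6]) if fields[6].isdigit() else None,
            "deprel": fields[7],
        })
    if current:
        sentences.append(current)
    return sentences


def parse_text(model_path, text):
    """Tokenize, tag and parse `text` with a UDPipe model file."""
    from ufal.udpipe import Model, Pipeline, ProcessingError  # pip install ufal.udpipe

    model = Model.load(model_path)
    if not model:
        raise RuntimeError("cannot load UDPipe model from " + model_path)
    pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    error = ProcessingError()
    output = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return parse_conllu(output)
```

Loading the model is the expensive part, so inside a Spark job it would usually be loaded once per partition or executor rather than once per article.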

UDPipe Similarity Calculation

Similarity from the UDPipe output can be calculated by running python UDPipe/runningJaccSim.py. This writes the results to the jacc_sim_calculated collection.
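
Judging by the script name, the measure is Jaccard similarity: the size of the intersection of two sets divided by the size of their union. The sketch below applies it to sets of lemmas; which features the actual script compares is not stated here, so the use of lemmas is an assumption, as is the function name.

```python
def jaccard_similarity(items_a, items_b):
    """Return |A ∩ B| / |A ∪ B| for two iterables, treated as sets."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention for this sketch: two empty documents do not match
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    first = ["gobierno", "anunciar", "medida"]
    second = ["gobierno", "anunciar", "reforma"]
    print(jaccard_similarity(first, second))
```

In the example, the two lists share two of four distinct lemmas, so the score is 0.5. Near-duplicate articles would score close to 1.0.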

Data Analysis

An IPython notebook for plotting graphs and viewing some statistics is included in Analysis/data_analysis.ipynb.

Data Source

All data can be fetched from the OneDrive link.

All models can be fetched from here
