HabrCrawler

This repository contains programs that build a collection of documents from the Habr website and prepare that collection for further data analysis.

The aim of the project is to build a collection of around 100000 papers from the Habr website for use in further data analysis.

There are two Python scripts.

The first script, create_raw_collection.py, builds the raw collection of documents from the Habr website. For each paper it extracts the paper's name, text, hubs, and tags, the date and time of publication, the paper's rating, the number of times it was bookmarked, the number of comments left under it, and the author's karma and rating. The file contains two functions: habr_url_puller, where you specify the number of the first paper and how many iterations the function should run to build the collection, and get_data_from_articles, where you specify a URL, which habr_url_puller generates automatically. The script produces a file called "data_collection".
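The sketch below illustrates the idea of iterating over consecutive paper numbers and the shape of the record collected for each paper. It is not the repository's code: the URL template, the ArticleRecord class, and the iter_article_urls helper are assumptions made for illustration, and fetching and parsing the pages is left out.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Assumed URL pattern; the real script may build its URLs differently.
ARTICLE_URL_TEMPLATE = "https://habr.com/ru/post/{article_id}/"


@dataclass
class ArticleRecord:
    """Hypothetical container for the fields the README lists."""
    name: str
    text: str
    hubs: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    published_at: str = ""
    rating: int = 0
    bookmarks: int = 0
    comments: int = 0
    author_karma: float = 0.0
    author_rating: float = 0.0


def iter_article_urls(first_id: int, iterations: int) -> Iterator[str]:
    """Yield one URL per consecutive paper number, starting at first_id."""
    for article_id in range(first_id, first_id + iterations):
        yield ARTICLE_URL_TEMPLATE.format(article_id=article_id)


if __name__ == "__main__":
    for url in iter_article_urls(first_id=500000, iterations=3):
        print(url)
```

Not every paper number necessarily corresponds to a published paper, so a crawler built this way would need to tolerate missing pages.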

The second script, prepare_ready_collection.py, opens the raw collection, keeps only the entries that actually exist, and builds a list of papers in which each paper is a list of the fields collected by create_raw_collection.py. The final collection contains only papers whose text is longer than 2000 characters. In addition, every field is cleaned of non-printable characters, extra spaces, line breaks, and other stray symbols. The list of papers is then converted into a pandas DataFrame and written to a CSV file called "cleaned_collection".
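The cleaning and filtering steps described above can be sketched as follows. This is a minimal illustration rather than the repository's implementation: the function names and the position of the text field are assumptions, the length check is applied after cleaning (the README does not say which comes first), and the standard csv module stands in for pandas so that the sketch has no dependencies.

```python
import csv
import io
import re
from typing import Iterable, List, Optional, Sequence

MIN_TEXT_LENGTH = 2000  # threshold stated in the README


def clean_text(value: str) -> str:
    """Replace non-printable characters with spaces and collapse whitespace."""
    printable = "".join(ch if ch.isprintable() else " " for ch in value)
    return re.sub(r"\s+", " ", printable).strip()


def prepare_collection(
    raw: Iterable[Optional[Sequence[object]]],
    text_index: int = 1,  # assumed position of the paper's text in a record
) -> List[List[str]]:
    """Drop missing records, clean every field, keep only long papers."""
    cleaned = []
    for record in raw:
        # Skip entries that are missing, empty, or too short to hold a text.
        if not record or len(record) <= text_index:
            continue
        row = [clean_text(str(item)) for item in record]
        if len(row[text_index]) > MIN_TEXT_LENGTH:
            cleaned.append(row)
    return cleaned


def to_csv(rows: Iterable[Sequence[str]], header: Sequence[str]) -> str:
    """Serialize the cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    raw = [None, ["Title\n", "long  text " * 300], ["Short", "tiny"]]
    rows = prepare_collection(raw)
    print(len(rows), "paper(s) kept")
```

With pandas installed, the same rows could instead be passed to pandas.DataFrame(rows, columns=header) and saved with DataFrame.to_csv.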

All requirements are listed in the Requirements.txt file.

The resulting "data_collection" and "cleaned_collection" files are available at the following link: https://drive.google.com/drive/folders/1do9FHo_-CIsys9UhpWdwaFLuHbSA7RHr?usp=sharing

Contributors

kchpomp
