data-512-a1's Introduction

Assignment 1 - Data Curation

Author: Anushna Prakash
Date: October 7, 2021
Class: DATA 512 - Human-Centered Data Science

Purpose

The goal of this project is to practice properly documenting process and sources from the start to the end of a complete project. For this project, I graph monthly page visits to English Wikipedia from 2008 through August 2021.

Project Organization

This repository is structured as:

.
├── LICENSE
├── README.md
├── data_clean
│   └── en-wikipedia_traffic_200712-202108.csv
├── data_raw
│   ├── README.txt
│   ├── pagecounts_desktop-site_200712-201607.json
│   ├── pagecounts_mobile-site_200712-201607.json
│   ├── pageviews_desktop-site_200712-202108.json
│   ├── pageviews_mobile-app_200712-202108.json
│   └── pageviews_mobile-web_200712-202108.json
├── results
│   └── wikipedia_page_vists_2008-2021.png
└── src
    └── A1 - Data Curation.ipynb

Licenses

The data for this project is obtained from the Wikimedia Foundation REST API.
This repository is licensed under the MIT license. The MIT License is a permissive free software license originating at the Massachusetts Institute of Technology (MIT). As a permissive license, it places only very limited restrictions on reuse and is therefore compatible with many other licenses. For more information, see the MIT license.

Data Sources

There are two sources of data for this project: the legacy Pagecounts API and the Pageviews API. Both are part of the Wikimedia Foundation REST API. In May 2015, the way the data is collected changed so that crawler traffic could be excluded. From the Wikimedia legacy Pagecounts API documentation: "This API makes available the pagecounts aggregated per project from January 2008 to July 2016. The main difference among pagecounts and the current pageview data is lack of filtering of self-reported bots, thus automated and human traffic are reported together." Data collected under the new method is served by the Pageviews API. The two APIs overlap by about one year, so both series report values for those months.
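As a rough sketch of how the two endpoints might be addressed, the snippet below builds monthly-aggregate request URLs for each API. The helper `build_url`, its parameters, and the example timestamps are illustrative assumptions, not code taken from the notebook in src. The URL templates reflect my understanding of the public Wikimedia REST API and should be checked against its current documentation.

```python
# URL templates as I understand them from the public Wikimedia REST API docs
# (assumption: verify against the current documentation before relying on them).
LEGACY_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/"
    "aggregate/{project}/{access}/{granularity}/{start}/{end}"
)
PAGEVIEWS_ENDPOINT = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/"
    "aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}"
)


def build_url(api, access, start, end,
              project="en.wikipedia.org", granularity="monthly", agent="user"):
    """Return a request URL for the given API (hypothetical helper).

    api    -- "pagecounts" (legacy) or "pageviews"
    access -- e.g. "desktop-site" / "mobile-site" for pagecounts,
              "desktop" / "mobile-app" / "mobile-web" for pageviews
    start, end -- timestamps in YYYYMMDDHH form
    agent  -- only used by the Pageviews API; "user" excludes crawlers
    """
    params = dict(project=project, access=access,
                  granularity=granularity, start=start, end=end)
    if api == "pagecounts":
        return LEGACY_ENDPOINT.format(**params)
    if api == "pageviews":
        return PAGEVIEWS_ENDPOINT.format(agent=agent, **params)
    raise ValueError(f"unknown api: {api!r}")


# Example date ranges are assumptions chosen to match the file names in data_raw.
legacy_desktop_url = build_url("pagecounts", "desktop-site", "2007120100", "2016080100")
pageviews_desktop_url = build_url("pageviews", "desktop", "2015070100", "2021090100")
```

Each URL could then be requested with any HTTP client and the JSON response saved to data_raw. Passing `agent="user"` is what excludes crawler traffic from the Pageviews series in this sketch.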

Data Output File

The final data output file is located in the data_clean folder and contains the following columns:
year: The year the views were aggregated as YYYY.
month: The month the views were aggregated as MM.
pagecounts_desktop_views: The views from pages accessed from desktop, including crawlers and bots from the legacy Pagecounts API.
pagecounts_mobile_views: The views from pages accessed from mobile, including crawlers and bots from the legacy Pagecounts API.
pagecounts_all_views: The sum of views from pages accessed from desktop and mobile, including crawlers and bots from the legacy Pagecounts API.
pageviews_desktop_views: The views from pages accessed from desktop, excluding crawlers and bots from the Pageviews API.
pageviews_mobile_views: The views from pages accessed from mobile methods, such as mobile web and the mobile app, excluding crawlers and bots from the Pageviews API.
pageviews_all_views: The sum of views from pages accessed from desktop and mobile methods, excluding crawlers and bots from the Pageviews API.
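The columns above can be derived from the five raw series roughly as follows. This is a minimal sketch, not the notebook's actual code: the function `combine_monthly`, the source keys, and the sample numbers are made up for illustration. It also assumes that each API item carries a `timestamp` in YYYYMMDDHH form plus a `count` field (legacy Pagecounts) or a `views` field (Pageviews), and that months missing from a series are filled with 0.

```python
from collections import defaultdict

# Hypothetical source keys mapped to the output column each one feeds.
SOURCE_TO_COLUMN = {
    "pagecounts_desktop-site": "pagecounts_desktop_views",
    "pagecounts_mobile-site": "pagecounts_mobile_views",
    "pageviews_desktop": "pageviews_desktop_views",
    "pageviews_mobile-app": "pageviews_mobile_views",
    "pageviews_mobile-web": "pageviews_mobile_views",
}
VIEW_COLUMNS = [
    "pagecounts_desktop_views", "pagecounts_mobile_views", "pagecounts_all_views",
    "pageviews_desktop_views", "pageviews_mobile_views", "pageviews_all_views",
]


def combine_monthly(sources):
    """Merge per-source item lists into one row per (year, month)."""
    # Every month starts with all view columns at 0 (assumed fill value).
    rows = defaultdict(lambda: dict.fromkeys(VIEW_COLUMNS, 0))
    for source, items in sources.items():
        column = SOURCE_TO_COLUMN[source]
        for item in items:
            ts = item["timestamp"]
            key = (ts[:4], ts[4:6])  # (YYYY, MM)
            # Legacy items are assumed to use "count", Pageviews items "views".
            rows[key][column] += item.get("count", item.get("views", 0))

    table = []
    for (year, month), row in sorted(rows.items()):
        row["pagecounts_all_views"] = (
            row["pagecounts_desktop_views"] + row["pagecounts_mobile_views"])
        row["pageviews_all_views"] = (
            row["pageviews_desktop_views"] + row["pageviews_mobile_views"])
        table.append({"year": year, "month": month, **row})
    return table


# Made-up sample data, only to show the shape of the input and output.
sample = {
    "pagecounts_desktop-site": [{"timestamp": "2015070100", "count": 100}],
    "pagecounts_mobile-site": [{"timestamp": "2015070100", "count": 40}],
    "pageviews_desktop": [{"timestamp": "2015070100", "views": 80}],
    "pageviews_mobile-app": [{"timestamp": "2015070100", "views": 5}],
    "pageviews_mobile-web": [{"timestamp": "2015070100", "views": 25},
                             {"timestamp": "2015080100", "views": 30}],
}
table = combine_monthly(sample)
```

The resulting list of dictionaries could be written to a CSV file with `csv.DictWriter`, using `["year", "month"] + VIEW_COLUMNS` as the field names.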

Special Considerations

The Pageviews API has the ability to exclude crawlers, which the legacy Pagecounts API did not. The series reported using the Pageviews API filters out this traffic. Please note that total views for the years before this change may be inflated, because the legacy counts include crawler and bot traffic alongside human traffic.
