wikimedia-challenge's Introduction

Diego Santamaria - ProgramingChallenge

Description

The main purpose of this project is to evaluate my skills in object-oriented programming and design.

I used TPL Dataflow in C# (.NET Core) to process the pageviews of the Wikipedia site provided by the Wikimedia Foundation, according to the requirement explained below.

Requirement

The Wikimedia Foundation provides all pageviews for the Wikipedia site since 2015 in machine-readable format. The pageviews can be downloaded in gzip format and are aggregated per hour per page. Each hourly dump is a gzipped text file of approximately 50 MB and is between 100 MB and 250 MB in size when unzipped.

Create a command line application with the following capabilities:

  1. Do not use any relational database in your code.
  2. Get data for the last 5 hours.
  3. Calculate in code the following SQL statement (the ALL_HOURS table represents all files):
SELECT TOP 100 R.DOMAIN_CODE, R.PAGE_TITLE, R.MAX_COUNT_VIEWS
FROM 
(
	SELECT B.DOMAIN_CODE, B.PAGE_TITLE, C.MAX_COUNT_VIEWS
	FROM 
	(
		SELECT		DOMAIN_CODE, PAGE_TITLE, COUNT (*) CNT 
		FROM		ALL_HOURS 
		GROUP BY	DOMAIN_CODE, PAGE_TITLE
	) B 
	JOIN	
	(
	SELECT A.DOMAIN_CODE, MAX (A.CNT) MAX_COUNT_VIEWS
	FROM 
		(
		SELECT		DOMAIN_CODE, PAGE_TITLE, COUNT (*) CNT 
		FROM		ALL_HOURS 
		GROUP BY	DOMAIN_CODE, PAGE_TITLE
		) A 
	GROUP BY A.DOMAIN_CODE
	) C ON B.DOMAIN_CODE = C.DOMAIN_CODE AND B.CNT = C.MAX_COUNT_VIEWS
	ORDER BY C.MAX_COUNT_VIEWS DESC
) R

Output example:

domain_code page_title max_count_views
it.m renault 100000
en apple 50000
fr.m.d relativité 3000
it.m bongur 2000
en microsoft 1000
fr.m.d paris 500
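
The SQL statement in the requirement can be approximated in memory without a database. The following is a minimal sketch in Python rather than the project's C# code; the function name `top_pages` and the input shape (one `(domain_code, page_title)` pair per ALL_HOURS row) are assumptions made for illustration. It follows the SQL literally, so it counts rows with COUNT(*) instead of summing view counts.

```python
from collections import Counter


def top_pages(rows, limit=100):
    """rows: iterable of (domain_code, page_title) tuples, one per ALL_HOURS row.

    Hypothetical helper that mirrors the SQL statement step by step.
    """
    # Subqueries A and B: COUNT(*) grouped by (DOMAIN_CODE, PAGE_TITLE).
    counts = Counter(rows)

    # Subquery C: MAX(CNT) grouped by DOMAIN_CODE.
    max_per_domain = {}
    for (domain, _title), cnt in counts.items():
        if cnt > max_per_domain.get(domain, 0):
            max_per_domain[domain] = cnt

    # JOIN on DOMAIN_CODE and CNT = MAX_COUNT_VIEWS.
    # Pages tied for the maximum within a domain are all kept, as in the SQL.
    joined = [
        (domain, title, cnt)
        for (domain, title), cnt in counts.items()
        if cnt == max_per_domain[domain]
    ]

    # ORDER BY MAX_COUNT_VIEWS DESC, then TOP 100.
    joined.sort(key=lambda row: row[2], reverse=True)
    return joined[:limit]


if __name__ == "__main__":
    sample = [("en", "apple"), ("en", "apple"), ("en", "pear"), ("fr", "paris")]
    for domain, title, cnt in top_pages(sample):
        print(domain, title, cnt)
```

Grouping with a dictionary keyed on the (domain, title) pair plays the role of GROUP BY, and the second pass over the counts plays the role of the join against the per-domain maximum.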

How It Works

This command line application has the following capabilities:

  1. Builds the download URLs from the parameters specified in the appsettings.json file and then downloads each file to the workspace (local disk).
  2. Unzips the file to the workspace (local disk).
  3. Reads the unzipped file and processes it line by line to get a list of objects. These objects are filtered with LINQ statements to reduce the size of the data, and the result is saved in a new file in the workspace (local disk). When all the files have been processed and saved in the workspace, the results are combined into a single file. This file (smaller than the previous ones) is then transformed into a list of objects that is used as the data source for new LINQ statements.
  4. Finally, prints the result of the analysis.
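
The read, filter and combine steps above can be sketched as follows. This is a simplified illustration in Python, not the project's TPL Dataflow pipeline: it reads the gzip stream directly and keeps intermediate results in memory instead of writing them to the workspace. The names `parse_line`, `read_records` and `combine` are hypothetical, the `keep` predicate stands in for the LINQ filter (whose exact condition is not described here), and the line layout `domain_code page_title count_views total_response_size` is assumed from the published pageviews dump format.

```python
import gzip


def parse_line(line):
    """Return (domain_code, page_title, count_views), or None if malformed."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) < 3:
        return None
    try:
        views = int(parts[2])
    except ValueError:
        return None
    return parts[0], parts[1], views


def read_records(gz_path, keep=lambda record: True):
    """Stream one gzipped hourly dump line by line, yielding filtered records."""
    with gzip.open(gz_path, mode="rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            record = parse_line(line)
            if record is not None and keep(record):
                yield record


def combine(gz_paths, keep=lambda record: True):
    """Merge the filtered records of every hourly file into a single list."""
    records = []
    for path in gz_paths:
        records.extend(read_records(path, keep))
    return records
```

Because `read_records` is a generator, only one line of a dump is held in memory while it is parsed; the combined list is the only structure that grows with the amount of data kept by the filter.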

Config

View of appsettings.json

{
  "LastHoursRequest": 5,
  "BaseURLDownload": "https://dumps.wikimedia.org/other/pageviews",
  "FileRuteFormat": "yyyy/yyyy-MM",
  "FileNameFormat": "'pageviews'-yyyyMMdd-HH'0000.gz'",
  "FilesWorkspacePath": "C:\\tmp",
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Information",
        "System": "Warning"
      }
    }
  }
}
  1. The LastHoursRequest node represents the number of hours for the request. According to the requirement, it is 5 hours.
  2. The BaseURLDownload node represents the base URL for the request.
  3. The FileRuteFormat node represents the expected format of the route (URL path) for the request.
  4. The FileNameFormat node represents the expected format of the file name for the request.
  5. The FilesWorkspacePath node represents the workspace directory of the application.
  6. The Serilog section represents the configuration of the Serilog logging library.
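
To illustrate how these settings could combine into download URLs, here is a minimal sketch in Python (not the project's C# code). The function name `build_urls` is hypothetical, and the .NET date patterns are translated by hand into `strftime` patterns. The sketch also assumes that the "last 5 hours" are the five full hours before the current time and that timestamps are in UTC; the application itself may choose the hours differently.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the appsettings.json above. The strftime patterns are
# hand-translated equivalents of the .NET patterns (an assumption of this sketch).
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"
ROUTE_FORMAT = "%Y/%Y-%m"                   # FileRuteFormat: "yyyy/yyyy-MM"
NAME_FORMAT = "pageviews-%Y%m%d-%H0000.gz"  # FileNameFormat


def build_urls(now, last_hours=5):
    """Return one URL per hour for the `last_hours` full hours before `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    urls = []
    for offset in range(1, last_hours + 1):
        hour = top_of_hour - timedelta(hours=offset)
        route = hour.strftime(ROUTE_FORMAT)
        name = hour.strftime(NAME_FORMAT)
        urls.append(f"{BASE_URL}/{route}/{name}")
    return urls


if __name__ == "__main__":
    for url in build_urls(datetime.now(timezone.utc)):
        print(url)
```

Formatting the route and the file name from the same timestamp keeps the year and month folder consistent with the file name when the requested hours cross a month boundary.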

Environment

There are some restrictions, some of them needed to meet the first requirement (do not use any relational database in your code):

  1. At least 4 GB of RAM available.

Results

I tested the application with a stable internet connection, and this was the result:

[image]

Also, the Diagnostic Tools of Visual Studio show the following results:

[image]

Conclusions

According to the previous results:

  1. The application takes about 4 minutes to process 5 downloaded files. This measurement may vary depending on the download speed.
  2. In some parts of the process, it uses 3.8 GB of RAM. This measurement may vary depending on the size of the file being processed.

  • Author : Diego Santamaria Sotelo
  • Date : 24/04/2021
