
EuroWiki Media

The application sets up a producer and a consumer using RabbitMQ to process Wikipedia updates. The global update count and the German update count are aggregated every minute and stored in a PostgreSQL database.

The consumer is based on the asyncio_consumer_example from the pika library, to ensure on-time processing.

## Steps:

Clone the repo and run the following steps in your terminal:

Set the environment variables:

` CSV_FILE_PATH POSTGRES_USER POSTGRES_PASSWORD DB_HOST DB_PORT `
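As an illustration of how these variables might be read, here is a minimal sketch in Python. The `Settings` class and `load_settings` function are hypothetical names introduced for this example and are not taken from the repository; only the five variable names come from the list above, and treating `DB_PORT` as an integer is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# The variable names listed above; all are treated as required here (assumption).
REQUIRED = ("CSV_FILE_PATH", "POSTGRES_USER", "POSTGRES_PASSWORD", "DB_HOST", "DB_PORT")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the configuration values."""
    csv_file_path: str
    postgres_user: str
    postgres_password: str
    db_host: str
    db_port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the settings from a mapping (defaults to os.environ), failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        csv_file_path=env["CSV_FILE_PATH"],
        postgres_user=env["POSTGRES_USER"],
        postgres_password=env["POSTGRES_PASSWORD"],
        db_host=env["DB_HOST"],
        db_port=int(env["DB_PORT"]),  # assumed to be a numeric port
    )
```

Failing early with the full list of missing names tends to be easier to debug than a connection error later in the producer or consumer.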

Start the RabbitMQ and PostgreSQL servers with Docker Compose, or manually create two databases on your own server (dainmedians, daintests):

` docker-compose up `

Create a Python environment and run:

` pip3 install -r requirements.txt `

## Running the script:

Start the producer in your terminal.

` python producer.py `

Open a second terminal. Start the consumer.

` python consumer.py `

Run the application tests:

` Pending... `

## Considerations about the solution:

## Data design

Multi-measure records

In this case, the application emits multiple metrics or events at the same timestamp. All the metrics emitted at the same timestamp can therefore be stored in the same multi-measure record. All the measures stored in the same multi-measure record appear as different columns in the same row of data.

A possible extension is to add a 'measure' column to cover other Wikipedia update types, not just type='edit'.
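The multi-measure idea can be sketched in Python as follows: one row per minute, with the global and German counts as separate columns of that row. The names `MinuteRow` and `aggregate`, the column names, and the use of `wiki == "dewiki"` to identify German Wikipedia updates are assumptions made for this example, not details confirmed by the repository.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


@dataclass
class MinuteRow:
    """One multi-measure record: several measures share a single timestamp."""
    added_date: datetime
    global_edits: int = 0
    german_edits: int = 0


def aggregate(events: Iterable[Tuple[datetime, str]]) -> List[MinuteRow]:
    """Group (timestamp, wiki) pairs into one row per minute.

    The wiki value "dewiki" is assumed to mark German Wikipedia updates.
    """
    rows: Dict[datetime, MinuteRow] = {}
    for ts, wiki in events:
        minute = ts.replace(second=0, microsecond=0)  # truncate to the minute
        row = rows.setdefault(minute, MinuteRow(added_date=minute))
        row.global_edits += 1
        if wiki == "dewiki":
            row.german_edits += 1
    return [rows[minute] for minute in sorted(rows)]


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
    for row in aggregate([(t, "dewiki"), (t.replace(second=40), "enwiki")]):
        print(row)
```

Adding a 'measure' column, as suggested above, would turn this wide layout into one row per measure and timestamp instead of one column per measure.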

## Database considerations:

  • How long must the saved data be kept?
  • Is near-real-time processing needed?
  • Cost considerations

PostgreSQL was used as the database in this case because of its high performance and its support for partitioning. A weekly partition was chosen (sql/setup.sql), but this could be adjusted according to the analysis (daily? monthly?).
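To make the weekly partitioning concrete, the sketch below computes the Monday-to-Monday bounds for a given day and builds a PostgreSQL `CREATE TABLE ... PARTITION OF` statement for that week. It is not the content of sql/setup.sql: the parent table name `wiki_updates`, the partition naming scheme, and the choice of Monday as the week start are all assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Tuple


def week_bounds(day: date) -> Tuple[date, date]:
    """Return [start, end) of the week containing `day`, starting on Monday (assumption)."""
    start = day - timedelta(days=day.weekday())
    return start, start + timedelta(days=7)


def partition_ddl(parent: str, day: date) -> str:
    """Build the DDL for the weekly range partition that covers `day`.

    `parent` is a hypothetical table already declared with PARTITION BY RANGE
    on its timestamp column.
    """
    start, end = week_bounds(day)
    iso_year, iso_week, _ = start.isocalendar()
    name = f"{parent}_{iso_year}w{iso_week:02d}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


if __name__ == "__main__":
    print(partition_ddl("wiki_updates", date(2024, 1, 3)))
```

Switching to daily or monthly partitions would only change `week_bounds`; the upper bound stays exclusive, so adjacent partitions do not overlap.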

Another choice to consider is Amazon Timestream using customer-defined partition keys. This may be a better choice considering the actual throughput, future needs and scalability.

## Exchange type

Selected: Topic

Although a direct exchange would satisfy the current task, a topic exchange leaves room for distributing data relevant to other geographic locations, not only the German Wikipedia.

## Process improvements
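The flexibility of a topic exchange comes from its binding patterns: routing keys are dot-separated words, `*` matches exactly one word, and `#` matches zero or more words. The pure-Python sketch below illustrates those matching rules only; it is not pika code and not how the broker is implemented. The routing key scheme `wiki.<language>.<type>` is a hypothetical example, not the one used by this project.

```python
from typing import List


def routing_key(language: str, change_type: str) -> str:
    """Build a key such as 'wiki.de.edit' (hypothetical naming scheme)."""
    return f"wiki.{language}.{change_type}"


def topic_matches(pattern: str, key: str) -> bool:
    """Return True if a topic binding pattern matches a routing key."""

    def match(p: List[str], k: List[str]) -> bool:
        if not p:
            return not k  # pattern exhausted: match only if the key is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the key
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        # '*' matches exactly one word; anything else must match literally
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), key.split("."))


if __name__ == "__main__":
    key = routing_key("de", "edit")
    print(topic_matches("wiki.de.*", key))  # binding for German updates only
    print(topic_matches("wiki.#", key))     # binding for all updates
```

With a direct exchange, each of these bindings would need an exact key, so adding a consumer for another language or update type would mean new keys rather than a new pattern.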

The added_date field, of type timestamp, is set to the current time. This choice assumes a 'real time' connection to the Wiki Recent API, so the current date is used during the aggregation. The sample data file contains a meta_dt field, which may be the exact moment the change was registered. However, with meta_dt the aggregated data would not all share the same timestamp.
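The trade-off between the two timestamps can be sketched as a choice of bucketing key. The helper names below are hypothetical, and the sketch assumes meta_dt is an ISO 8601 string with a trailing `Z` (for example `2024-01-01T12:34:56Z`); the actual format in the sample file may differ.

```python
from datetime import datetime, timezone
from typing import Optional


def minute_bucket(ts: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to the start of its minute, in UTC."""
    return ts.astimezone(timezone.utc).replace(second=0, microsecond=0)


def parse_meta_dt(value: str) -> datetime:
    """Parse an ISO 8601 string with a trailing 'Z' (assumed format of meta_dt)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def bucket_for(meta_dt: str, use_event_time: bool, now: Optional[datetime] = None) -> datetime:
    """Choose the aggregation bucket for one update.

    use_event_time=False mirrors the choice described above (current time);
    use_event_time=True buckets by the moment the change was registered.
    """
    if use_event_time:
        return minute_bucket(parse_meta_dt(meta_dt))
    return minute_bucket(now or datetime.now(timezone.utc))


if __name__ == "__main__":
    print(bucket_for("2024-01-01T12:34:56Z", use_event_time=True))
    print(bucket_for("2024-01-01T12:34:56Z", use_event_time=False))
```

Bucketing by meta_dt keeps replayed or delayed data in the minute where the change happened, at the cost of one batch possibly spanning several buckets.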

Author: [Laya Rabasa](https://github.com/layadelcarmen)

Sources:

[1] https://github.com/pika/pika/blob/main/examples/asyncio_consumer_example.py

[2] https://docs.aws.amazon.com/timestream/latest/developerguide/data-modeling.html

[3] https://www.rabbitmq.com/tutorials/amqp-concepts.html
