
wiki-science's Introduction

WikiScience | base.camp | University of Hamburg

This project is a student research project for the base.camp software-development internship at the University of Hamburg, Germany. WikiScience is a data science project whose goal is to analyze Wikipedia dumps and collect data from them. It consists of a

  • map-reduce application, which maps & reduces wiki data via Hadoop,
  • Spring Boot backend, which connects to a database holding most of the information extracted from the map-reduce jobs, as well as serving as a REST API for the frontend,
  • Flutter frontend, which is responsible for displaying the analytical results.

You can find a running version of this project @ http://basecamp-demos.informatik.uni-hamburg.de:8080/wiki-science-backend-1.0.2.RELEASE/#/

map-reduce

Setup

Input Data

Make sure you have downloaded one of the recent Wikipedia dumps from https://dumps.wikimedia.org/. Currently, the map-reduce application supports three types of dumps:

  • Wikipedia XML format
  • Wikipedia preprocessed JSON format
  • Wikipedia Title List Plain Text

Make sure that your dump is in one of these formats. To convert the basic Wikipedia XML format to JSON, you can use wikiextractor: https://github.com/attardi/wikiextractor.
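
A minimal conversion might look like this (flags can differ between wikiextractor versions, so treat this as a sketch; --json writes one JSON object per article per line, which should match the preprocessed JSON format mentioned above):

python WikiExtractor.py --json -o extracted enwiki-latest-pages-articles.xml.bz2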

Note: The XML jobs may fail; this is usually due to the format itself. The jobs require one article per line, starting and ending with an XML tag.

Build

To work with the map-reduce application, you'll need to build the sources first. Make sure you have Maven & Java 8 installed.

Run mvn clean install -f map-reduce. This will compile and package the sources.

Note: You can also run mvn clean install at root level; the aggregator pom.xml will trigger builds for both the map-reduce AND the backend application.

Run on Hadoop Cluster

Copy the generated map-reduce-<VERSION>.jar onto your Hadoop cluster. Make sure your dumps are also on the cluster so you can start the procedure.
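
For example, the jar can be copied to a cluster node via scp and the dump staged on HDFS with the standard HDFS commands (host and directory names below are placeholders):

scp map-reduce-<VERSION>.jar user@cluster-node:
hdfs dfs -mkdir -p ENWIKI_PAGES_FILTERED
hdfs dfs -put enwiki-latest-pages-articles.xml ENWIKI_PAGES_FILTERED/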

To start a job, you first need to decide which map-reduce job you want to execute.

To start a job, run hadoop jar map-reduce-<VERSION>.jar <JOB-NAME> <INPUT-DIR> <OUTPUT-DIR>

Example: hadoop jar map-reduce-1.0.2.RELEASE.jar most-edited-articles ENWIKI_PAGES_FILTERED/* ENWIKI_PAGES_MREDUCED/most-edited-articles
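
The results are written to the given <OUTPUT-DIR> on HDFS. Assuming the jobs use Hadoop's default text output format, the reducer output files (typically part-r-00000, part-r-00001, ...) can be listed and inspected directly:

hdfs dfs -ls ENWIKI_PAGES_MREDUCED/most-edited-articles
hdfs dfs -cat ENWIKI_PAGES_MREDUCED/most-edited-articles/part-r-00000 | head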

A full curated list of jobs can be found below.

Job Overview

XML Jobs
  • word-count
  • contributor-count
  • most-redirects
  • article-length
  • article-count
  • category
  • premium-contributor
  • most-edited-articles
JSON Jobs
  • article-length-json
  • redirects-json
  • real-article-count-json
  • topics-json
  • total-article-count-json
  • word-count-json
Plain Text Jobs
  • article-count-plain

backend

Setup

Build

In order to work with the backend application, you'll need to have Maven & Java 8 installed.

Run mvn clean install -f backend or just mvn clean install. This will compile and package the sources (excluding or including the map-reduce application, respectively).

Run

Via mvn spring-boot:run -f backend you'll start the backend Spring Boot application. It will establish a database connection upon startup.

Important note: You'll need access to the base.camp database of the University of Hamburg in order to start in prod mode. You can still establish a connection via the "dev" profile, which requires a MySQL database on localhost:3306.
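
For the "dev" profile, a throwaway MySQL instance can be started locally with Docker; the database name and credentials below are placeholders and must match whatever the application's dev configuration expects:

docker run -d --name wiki-science-db -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=secret -e MYSQL_DATABASE=wikiscience mysql:8

# activate the dev profile for the spring-boot run
SPRING_PROFILES_ACTIVE=dev mvn spring-boot:run -f backend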

Once the application has started, you can visit the REST API @ http://localhost:8080/swagger-ui.html.
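
The swagger-ui.html path suggests the documentation is served by Springfox; if so, the machine-readable API description is usually also exposed and can be fetched with, for example:

curl http://localhost:8080/v2/api-docs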

frontend

Setup

Install Flutter

Make sure you have correctly installed Flutter on your computer.

For more information and a detailed tutorial check out https://flutter.dev/docs/get-started/install.

Enable web support

As of now, web support for Flutter is still in beta, but you can enable it manually for the Flutter version on your computer.

Check out https://flutter.dev/docs/get-started/web for a step-by-step guide on using the latest version of the Flutter SDK from the beta channel and enabling web support.
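
At the time of writing, the steps from that guide boiled down to roughly the following (check the guide for the current procedure, since the channel and flags have changed over time):

flutter channel beta
flutter upgrade
flutter config --enable-web
flutter devices   # Chrome should now be listed as a target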

Run

Once web support is enabled, you can run the project in the Chrome browser with the following command: flutter run -d chrome.

Build

Run the following command to generate a release build: flutter build web.

This will create a 'build' folder in your project folder. You can find index.html and the other necessary files under 'projectFolder'/build/web/.

Deploy

To deploy the frontend, just copy all sources below build/web/ into the static resources of the backend application. A resource handler will expose the frontend application through the servlet container. The running live demo is also deployed this way.
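
Assuming the default Spring Boot location for static resources and a repository layout with frontend/ and backend/ directories (both assumptions), the copy step might look like this:

cp -r frontend/build/web/* backend/src/main/resources/static/
mvn clean install -f backend

After rebuilding, the backend serves the frontend from its static resources.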

wiki-science's People

Contributors

talbx, infinitystacks, tdicknet, t-ugur


wiki-science's Issues

PageExtractor

We need a MapReduce job to extract the pages from the XML.

rust approach w/ whitfin/efflux

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -file wordcount_mapper -file wordcount_reducer -mapper wordcount_mapper -reducer wordcount_reducer -input user/ba-prakt20-4/WIKI_EN_ARTICLES_2020/enwiki-latest-pages-articles.xml -output user/ba-prakt20-4/WIKI_EN_ARTICLES_2020/enwiki-latest-pages-articles.wc.rust

Setup Project

after updating to master →

tools:

  • java 13

  • maven

Then the project should build for everyone with mvn clean install / mvn clean install -DskipTests.

With mvn spring-boot:run the application should be startable locally (even if exceptions are thrown, that is only due to the missing Hadoop cluster).

Java 13 incompatible with Hadoop?

ba-prakt20-4@ltheadnode:~$ hadoop jar wiki-science-0.0.1-SNAPSHOT.jar PageExtractorJob user/ba-prakt20-4/WIKI_EN_ARTICLES_2020/enwiki-latest-pages-articles.xml user/ba-prakt20-4/WIKI_EN_ARTICLES_2020/enwiki-latest-pages-articles.pages
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/unihh/basecamp/g4/wiki/Runner has been compiled by a more recent version of the Java Runtime (class file version 57.0), this version of the Java Runtime only recognizes class file versions up to 52.0
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:92)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:46)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:87)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:51)
	at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:52)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
ba-prakt20-4@ltheadnode:~$ 
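
Class file version 57 corresponds to Java 13, while the cluster's runtime only accepts up to version 52 (Java 8). A quick way to spot the mismatch is to compare the JDK used for the build with the Java runtime on the cluster:

mvn -version    # on the build machine: shows the JDK Maven compiles with by default
java -version   # on the cluster node: reports Java 8 here

Compiling with a Java 8 target (e.g. maven.compiler.source and maven.compiler.target set to 1.8) produces class file version 52, which the cluster's runtime accepts.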
