This project forked from e-ark-software/earkweb

License: GNU General Public License v3.0

earkweb

Introduction

E-ARK Web is an open source archiving and digital preservation system. It is OAIS-oriented, which means that the data ingest, archiving, and dissemination functions operate on information packages that bundle content and metadata in contiguous containers. The information package format uses METS to represent the package structure and PREMIS to record digital provenance information.
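To make the role of METS more concrete, the following is a minimal sketch of a structural METS document built with Python's standard library. It is not the E-ARK package profile, which requires considerably more (header, metadata sections, checksums, PREMIS references). The function name `build_minimal_mets`, the `file-N` identifiers, and the `USE`/`TYPE` values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_minimal_mets(package_id: str, file_paths: List[str]) -> ET.Element:
    """Return a bare METS tree that lists the given files (illustrative only)."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": package_id})

    # fileSec: inventory of the files contained in the package
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "data"})

    # structMap: structure of the package, pointing back to the file entries
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap", {"TYPE": "physical"})
    top_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": package_id})

    for index, path in enumerate(file_paths):
        file_id = f"file-{index}"  # hypothetical ID scheme
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path},
        )
        ET.SubElement(top_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return root


mets = build_minimal_mets("example-package", ["data/report.pdf"])
print(ET.tostring(mets, encoding="unicode"))
```

The file section carries the inventory and the structural map references it by ID, which is how a single METS file can describe the layout of a whole container.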

E-ARK Web offers functionality for the three types of information packages defined in the OAIS reference model: the Submission Information Package (SIP), which is the information sent from the producer to the archive; the Archival Information Package (AIP), which is the information stored by the archive; and the Dissemination Information Package (DIP), which is the information sent to a user on request. The system can execute different types of actions on information packages, such as information extraction, validation, or transformation, to support ingesting a SIP, archiving an AIP, and creating a DIP from a set of AIPs.
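The relationship between the three package types can be sketched as a small data model. This is a simplified illustration, not the earkweb implementation: the class `InformationPackage`, the functions `ingest` and `create_dip`, and the identifier scheme are all hypothetical, and real ingest involves validation and metadata handling that is reduced here to a single check.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class PackageType(Enum):
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


@dataclass(frozen=True)
class InformationPackage:
    identifier: str
    package_type: PackageType
    files: Dict[str, bytes] = field(default_factory=dict)  # relative path -> content


def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a SIP into an AIP after a placeholder validation step."""
    if sip.package_type is not PackageType.SIP:
        raise ValueError("ingest expects a SIP")
    if not sip.files:
        raise ValueError("SIP contains no files")
    return InformationPackage(f"aip-{sip.identifier}", PackageType.AIP, dict(sip.files))


def create_dip(dip_id: str, aips: List[InformationPackage]) -> InformationPackage:
    """Build one DIP from a set of AIPs, keeping each AIP's files under its own prefix."""
    if any(aip.package_type is not PackageType.AIP for aip in aips):
        raise ValueError("create_dip expects AIPs only")
    merged: Dict[str, bytes] = {}
    for aip in aips:
        for path, content in aip.files.items():
            merged[f"{aip.identifier}/{path}"] = content
    return InformationPackage(dip_id, PackageType.DIP, merged)


sip = InformationPackage("2024-001", PackageType.SIP, {"data/letter.txt": b"hello"})
aip = ingest(sip)
dip = create_dip("dip-request-1", [aip])
print(dip.package_type.value, sorted(dip.files))
```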

E-ARK Web consists of a frontend web application and a task execution system based on Celery. The task execution system processes information packages synchronously or asynchronously by means of processing units called “tasks”.

[Figure: earkweb home]

The backend can also be controlled via remote command execution without using the web frontend. The outcomes of operations performed by a task are stored immediately so that the status information in the frontend's database can be updated afterwards.
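The pattern described above, where a task stores its outcome first and the frontend status is updated afterwards, can be illustrated without a running Celery installation. The sketch below uses `concurrent.futures` from the standard library as a stand-in for the task queue; it is not Celery and not earkweb code. The names `validate_package`, `task_results`, and `frontend_status` are hypothetical. In Celery itself, the corresponding choice is typically between `apply()` for local synchronous execution and `delay()` or `apply_async()` for queued execution.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

task_results: Dict[str, dict] = {}    # stands in for the task result store
frontend_status: Dict[str, str] = {}  # stands in for the frontend's database


def validate_package(package_id: str) -> dict:
    """Hypothetical task: pretend to validate a package and record the outcome."""
    outcome = {"package": package_id, "success": bool(package_id)}
    task_results[package_id] = outcome  # the task stores its outcome immediately
    return outcome


def update_frontend_status(package_id: str) -> None:
    """Separate step: copy the stored outcome into the frontend's status table."""
    outcome = task_results[package_id]
    frontend_status[package_id] = "validated" if outcome["success"] else "failed"


def run_demo() -> None:
    # Synchronous execution: the caller blocks until the task has finished.
    validate_package("pkg-sync")
    update_frontend_status("pkg-sync")

    # Asynchronous execution: the task is submitted and the result collected later.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future = pool.submit(validate_package, "pkg-async")
        future.result()  # a real frontend would poll or be notified instead
    update_frontend_status("pkg-async")


run_demo()
print(frontend_status)
```

Keeping the status update separate from the task means the task itself does not need access to the frontend's database, which also fits the remote command execution mode mentioned above.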

Installation

User guide

Architecture overview

The E-ARK Web architecture is designed to process, store, and access very large data collections efficiently with respect to scalability, reliability, and cost. The system uses technologies such as the Apache Hadoop framework, NGDATA's Lily repository, and the Apache Solr search server, which allow the repository infrastructure to scale out horizontally. With Hadoop, the number of nodes in a cluster is virtually unlimited, and clusters may range from single-node installations to clusters comprising thousands of computers. The following diagram gives an overview of this architecture:

[Figure: architecture overview, full-scale version]

The user interface, represented by the box at the top of the diagram, is a Python/Django-based web application for managing the creation and transformation of information packages. It supports the complete package transformation pipeline: creating the Submission Information Package (SIP), converting it to an Archival Information Package (AIP), and creating the Dissemination Information Package (DIP), which is used to disseminate digital objects to the requesting user. Tasks can be assigned to Celery workers (green boxes with a "C"), which share the same storage area. The result of a package transformation is stored as files in the information package’s working directory.
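Because results are kept as files in the package's working directory, any worker that shares the storage area can pick up where another left off. A minimal sketch of that idea follows. The function `run_task_in_working_dir`, the `<task>.result.json` file name, and the JSON layout are assumptions made for this example and do not reflect the files earkweb actually writes.

```python
import json
import tempfile
from pathlib import Path


def run_task_in_working_dir(working_dir: Path, task_name: str) -> Path:
    """Hypothetical task: record its result as a file inside the working directory."""
    working_dir.mkdir(parents=True, exist_ok=True)
    # Inspect the package content that is present before this task writes anything.
    files = sorted(
        p.relative_to(working_dir).as_posix()
        for p in working_dir.rglob("*")
        if p.is_file()
    )
    result_file = working_dir / f"{task_name}.result.json"
    result_file.write_text(
        json.dumps({"task": task_name, "files_seen": files}), encoding="utf-8"
    )
    return result_file


with tempfile.TemporaryDirectory() as shared_storage:
    package_dir = Path(shared_storage) / "example-package"
    package_dir.mkdir()
    (package_dir / "content.txt").write_text("payload", encoding="utf-8")

    result_path = run_task_in_working_dir(package_dir, "validate")
    # A different worker with access to the same storage could read this file.
    print(json.loads(result_path.read_text(encoding="utf-8")))
```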

Once the creation of information packages is finished, they can be deployed to the Lily access repository. Lily is built on top of HBase, a NoSQL database that runs on top of Hadoop. Lily defines a set of data types, most of which are based on existing Java data types. Lily records are defined using these data types rather than plain HBase tables, which gives them a richer data model and makes them better suited for indexing. The Lily Indexer is the component that sends the data to the Solr server and keeps the index synchronized with the Lily repository. Solr neither reads data from HDFS nor writes data to HDFS. The index is stored on the local file system and is optionally distributed over multiple cluster nodes if index sharding or replication is used.

There is also a lightweight version of E-ARK Web in which the large-scale storage backend (HDFS, HBase) is replaced by conventional file system storage and the Solr search server is a single Solr instance instead of a SolrCloud deployment, as illustrated by the following diagram:

[Figure: architecture overview, lightweight version]

Natural Language Processing (NLP)

E-ARK Web now offers NLP tools. If you are interested, see the [NLP documentation](./docs/nlp_documentation.md).

Contributors

bartham, janrn, romankarl, rschmidt13
