
hyphe2solr's Introduction

principles

This script needs:

  • one Hyphe server and more precisely :
    • one Hyphe core instance
    • one Hyphe web-pages Mongo instance
  • one SOLR node ready to index web-pages

What it does is :

  • get web entities info from Hyphe core filtering by status (see configuration)
  • put the web entities not already processed (see logs) into WEB_ENTITY_PILE
  • start nb_process (see configuration) processes to work on the web entities retrieved :
    • get a web entity from WEB_ENTITY_PILE
    • get web pages list from Hyphe core
    • retrieve the mongo document for all URLs (filtering on mimetype, see configuration)
    • prepare documents to be indexed by creating a text version of the HTML code (see html2text.py)
    • index documents
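
The per-process loop described above can be sketched as follows. This is only an illustration, not the code of index_hyphe_web_pages.py: the function name drain_pile, the handle_web_entity callback and the sample web entities are hypothetical, and the demonstration uses an in-memory queue.Queue with a single worker. A multiprocessing.Queue exposes the same get_nowait / queue.Empty interface, so the same loop could be run by each of the nb_process processes.

```python
import queue


def drain_pile(pile, handle_web_entity):
    """Take web entities off the pile until it is empty.

    Returns the number of web entities handled by this worker.
    """
    handled = 0
    while True:
        try:
            web_entity = pile.get_nowait()
        except queue.Empty:
            # Nothing left on the pile: this worker can stop.
            return handled
        # In the real script this step would fetch the pages,
        # build the text version and index the documents.
        handle_web_entity(web_entity)
        handled += 1


# Single-worker demonstration with hypothetical web entities.
WEB_ENTITY_PILE = queue.Queue()
for we in ({"id": "we-1"}, {"id": "we-2"}):
    WEB_ENTITY_PILE.put(we)

indexed_ids = []
handled_count = drain_pile(WEB_ENTITY_PILE, lambda we: indexed_ids.append(we["id"]))
```

Because every worker pulls from the same pile, a slow web entity only delays the process handling it; the other processes keep taking the remaining ones.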

dependencies

HYPHE

This script relies on a running Hyphe server. See https://github.com/medialab/Hypertext-Corpus-Initiative

SOLR

This script relies on a running Solr server. See https://lucene.apache.org/solr/

python requirements

  • sunburnt
  • lxml
  • httplib2
  • pymongo
  • jsonrpclib
  • argparse (for python<2.7)

INSTALL

You need a hyphe and a solr server running.

git clone this repository

Then simply execute (ideally in a virtualenv):

pip install -r requirements.txt

CONFIGURE

hyphe SOLR schema

Use the Solr node example provided in the solr_hyphe_core directory. The script deploy_solr_core.sh might help you. Change the Solr core path and the Tomcat user/service (these depend on your install) in the script before using it. BEWARE: it will erase any hyphe core already present in the Solr core path.

You should review the script before using it.

connection to data sources

Copy config.json.default into config.json and edit the parameters :

  • hyphe2core :
    • nb_process: number of concurrent processes to start
    • web_entity_status_filter: filter selecting which web entities to index, based on their Hyphe status
  • host/port of Hyphe core
  • host/port/db/collection of mongo hyphe db
  • host/port/path of solr node
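
As a rough illustration of how the first two parameters might be read, here is a minimal sketch. It assumes the JSON layout mirrors the list above (a "hyphe2core" object holding nb_process and web_entity_status_filter); the real layout is the one in config.json.default. The function name read_settings and the sample values, including the ["IN"] filter, are placeholders.

```python
import json


def read_settings(config):
    """Extract and sanity-check the indexing parameters from a parsed config."""
    # Assumed section name, taken from the parameter list above.
    section = config["hyphe2core"]
    nb_process = int(section["nb_process"])
    if nb_process < 1:
        raise ValueError("nb_process must be at least 1")
    return nb_process, section["web_entity_status_filter"]


# Placeholder content standing in for a real config.json.
example_config = json.loads(
    '{"hyphe2core": {"nb_process": 4, "web_entity_status_filter": ["IN"]}}'
)
nb_process, status_filter = read_settings(example_config)
```

With a real file, the parsed config would come from json.load on an open config.json instead of the inline string.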

Mime-type filter

Hyphe2solr lets you filter out web pages whose mimetype is not compatible with Solr indexing (our schema doesn't use Tika). The script generate_content_filter.py reads the mongodb (version >2.1 only) and outputs a CSV listing the content-types ordered by the number of pages found in the mongo. From this CSV you have to write the content_type_whitelist.txt file. This file must contain one mimetype (to be indexed) per line. An example is provided: content_type_whitelist.txt.default
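
To make the whitelist format concrete, here is a small sketch of how such a file could be parsed and applied. It is not the script's own code: the function names are hypothetical, and the page field name content_type is an assumption about the mongo documents.

```python
def parse_whitelist(lines):
    """Return the set of mimetypes listed one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def keep_indexable(pages, whitelist, field="content_type"):
    """Keep only the page documents whose mimetype is in the whitelist.

    The field name "content_type" is an assumption for this sketch.
    """
    return [page for page in pages if page.get(field) in whitelist]


# Demonstration with inline lines instead of a real whitelist file.
whitelist = parse_whitelist(["text/html\n", "\n", "text/plain\n"])
pages = [
    {"url": "http://example.org/", "content_type": "text/html"},
    {"url": "http://example.org/logo.png", "content_type": "image/png"},
]
kept = keep_indexable(pages, whitelist)
```

Since a file object iterates line by line, parse_whitelist can be given an open content_type_whitelist.txt directly.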

usage

Once you have prepared the configuration, simply run:

$ python index_hyphe_web_pages.py

There is only one option, which deletes the existing index before (re)indexing:

$ python index_hyphe_web_pages.py -h
usage: index_hyphe_web_pages.py [-h] [-d]

optional arguments:
  -h, --help          show this help message and exit
  -d, --delete_index  delete solr index before (re)indexing. WARNING all
                      previous indexing work will be lost.
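
For readers unfamiliar with argparse, a parser producing an equivalent command line could look like the sketch below. It is reconstructed from the help text only and may differ from the actual script; build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the single -d/--delete_index switch."""
    parser = argparse.ArgumentParser(prog="index_hyphe_web_pages.py")
    parser.add_argument(
        "-d", "--delete_index",
        action="store_true",  # the flag takes no value; it defaults to False
        help="delete solr index before (re)indexing. WARNING all "
             "previous indexing work will be lost.",
    )
    return parser


args_default = build_parser().parse_args([])
args_delete = build_parser().parse_args(["-d"])
```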

If index_hyphe_web_pages.py is called multiple times without the -d|--delete_index option, the indexing process skips the web entities listed by id in logs/we_id_done.log. The default behaviour is thus to resume any previous unfinished indexing.
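
The resume mechanism can be pictured with the following sketch. It assumes the log holds one web entity id per line and that web entities are dictionaries with an "id" key; both are assumptions, as are the function names. An in-memory io.StringIO stands in for logs/we_id_done.log.

```python
import io


def read_done_ids(log_file):
    """Return the ids already indexed, assuming one id per line."""
    return {line.strip() for line in log_file if line.strip()}


def mark_done(log_file, web_entity_id):
    """Record a web entity as indexed by appending its id to the log."""
    log_file.write("%s\n" % web_entity_id)


# First run: one web entity gets indexed and logged.
log = io.StringIO()
mark_done(log, "we-1")

# Second run: read the log back and skip what is already done.
log.seek(0)
done = read_done_ids(log)
web_entities = [{"id": "we-1"}, {"id": "we-2"}]
todo = [we for we in web_entities if we["id"] not in done]
```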

logs

Hyphe2solr logs into 3 log directories :

  • ./logs/by_pid/ : one log file by process
  • ./logs/by_web_entity/ : one log file by web entity indexed
  • ./logs/errors_solr_document/ : logs documents the script couldn't index in Solr

Hyphe2solr outputs the ids of indexed web entities in :

  • ./logs/we_id_done.log : this file is used to resume indexing operations from where it stopped

When using -d or --delete_index option, the script clears all the logs.

hyphe2solr's People

Contributors

boogheta, legaultpierre, paulgirard


hyphe2solr's Issues

Solr info

Hi Medialab,

For a complete newbie in the field of Solr, do you have any suggestions as to:

  • What version of Solr one should use?
  • Should I set up Jetty or Tomcat along with Solr?

(I'm currently using this guide: https://www.digitalocean.com/community/tutorials/how-to-install-solr-on-ubuntu-14-04)

Secondly, I'm unable to understand this warning, but I have a distinct feeling that it is important that I understand it before I run anything. Could you elaborate a bit on the section:

"use the solr node example provided in solr_hyphe_core directory. the script deploy_solr_core.sh might helps you. Change the solr core path and tomcat user/service (depends on your install) in the script before using it. BEWARE : It will erase any hyphe core already present in solr core path. You should review the script before using it."

Thirdly, is it possible for you to share the files that you have adapted on your server for it to work?

Lots of questions. Hope you can help me out.

Best regards
Tobias
