===
MDR
===

.. image:: https://travis-ci.org/scrapinghub/mdr.svg?branch=master
    :target: https://travis-ci.org/scrapinghub/mdr

MDR is a library to detect and extract listing data from HTML pages. It is based on `Finding and Extracting Data Records from Web Pages <http://dl.acm.org/citation.cfm?id=1743635>`_, but
replaces the similarity measure with the tree alignment approach proposed in `Web Data Extraction Based on Partial Tree Alignment <http://doi.acm.org/10.1145/1060745.1060761>`_ and `Automatic Wrapper Adaptation by Tree Edit Distance Matching <http://arxiv.org/pdf/1103.1252.pdf>`_.


Requires
========

- Python 2 is required.
- ``numpy`` and ``scipy`` must be installed to build this package.


Compile and test
================

::

    # optionally use docker
    $ docker run -ti python:2.7.13 bash

    $ apt-get update && apt-get install -y python-numpy cython python-scipy
    $ cd
    $ git clone https://github.com/scrapinghub/mdr.git
    $ cd mdr
    $ pip install -r requirements.txt
    $ python setup.py build
    $ python setup.py install

    # let's move away from this dir, otherwise it would fail with ImportError: No module named _tree
    $ cd
    $ cp -r mdr/tests .
    # -m: run it as a module, so that it can import get_page from tests/__init__.py
    $ python -m tests.test_mdr

    .....
    ----------------------------------------------------------------------
    Ran 5 tests in 2.689s

    OK


Usage
=====

Detect listing data
~~~~~~~~~~~~~~~~~~~

MDR assumes that the data records are located close to the elements that contain the most text nodes::

    [1]: import requests
    [2]: from mdr import MDR
    [3]: mdr = MDR()
    [4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')
    [5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8'))
    ...

    [8]: [doc.getpath(c) for c in candidates[:10]]
     ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',
     '/html/body/div[2]',
     '/html/body/div[2]/div[4]/div/div[1]']
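
As a toy illustration of this heuristic (not MDR's actual candidate selection), the following sketch ranks the elements of a small, well-formed sample document by the number of non-empty text nodes in their subtrees. It uses only the standard library; ``count_text_nodes``, ``rank_by_text_nodes`` and ``SAMPLE`` are hypothetical names introduced for this example::

    import xml.etree.ElementTree as ET


    def count_text_nodes(element):
        """Count the non-empty text nodes in the subtree rooted at element."""
        count = 0
        for node in element.iter():
            if node.text and node.text.strip():
                count += 1
            # A tail belongs to the parent, so skip the tail of the root itself.
            if node is not element and node.tail and node.tail.strip():
                count += 1
        return count


    def rank_by_text_nodes(root):
        """Return (tag, count) pairs for every element, most text nodes first."""
        ranked = [(el.tag, count_text_nodes(el)) for el in root.iter()]
        ranked.sort(key=lambda pair: pair[1], reverse=True)
        return ranked


    # Assumed sample page: a navigation link and a list of two review records.
    SAMPLE = """
    <html><body>
      <div id="nav"><a>Home</a></div>
      <ul id="reviews">
        <li><b>Alice</b><span>Great food</span></li>
        <li><b>Bob</b><span>Nice service</span></li>
      </ul>
    </body></html>
    """

    root = ET.fromstring(SAMPLE.strip())
    ranking = rank_by_text_nodes(root)

In this sample, the ``ul`` element holding the two records has four text nodes, more than any element other than its ancestors ``html`` and ``body``.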

Extract data record
~~~~~~~~~~~~~~~~~~~

MDR finds the repetition patterns under a given candidate DOM tree by using tree matching, then builds a mapping from each HTML element to the other matched elements of the DOM tree.
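
To give a feel for what tree matching means here, the sketch below implements Simple Tree Matching, the classic dynamic-programming algorithm used as a building block in the partial tree alignment paper. It is an illustration only and not MDR's own implementation: trees are plain ``(tag, children)`` tuples, and ``simple_tree_matching``, ``tree_size`` and ``similarity`` are hypothetical names. Normalizing by the average tree size is one common choice, assumed here::

    def simple_tree_matching(a, b):
        """Return the number of node pairs in a maximum matching of a and b.

        Each tree is a (tag, children) tuple, where children is a list of trees.
        """
        tag_a, children_a = a
        tag_b, children_b = b
        if tag_a != tag_b:
            return 0
        m, n = len(children_a), len(children_b)
        # table[i][j]: best score using the first i children of a
        # and the first j children of b, preserving sibling order.
        table = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                paired = table[i - 1][j - 1] + simple_tree_matching(
                    children_a[i - 1], children_b[j - 1])
                table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
        # Add one for the matched roots.
        return table[m][n] + 1


    def tree_size(tree):
        """Return the number of nodes in the tree."""
        return 1 + sum(tree_size(child) for child in tree[1])


    def similarity(a, b):
        """Normalize the matching score by the average size of the two trees."""
        return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2.0)


    # Assumed sample trees: two similar records and an unrelated element.
    record_1 = ('li', [('b', []), ('span', [])])
    record_2 = ('li', [('b', []), ('span', []), ('a', [])])
    nav = ('div', [('a', [])])

    score_similar = simple_tree_matching(record_1, record_2)
    score_unrelated = simple_tree_matching(record_1, nav)

The two ``li`` records, which differ by a single child, match on three nodes, while a record compared with the unrelated ``div`` scores zero. Subtrees whose similarity exceeds a chosen threshold can then be treated as repetitions of the same record.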

Used with annotation (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can annotate the seed elements with any tool you like (e.g. scrapely_); MDR will then be able to find the other matching elements on the page.

For example, see the demo page here_. The colored data in the first row was annotated manually; the rest was extracted by MDR.

Author
======

Terry Peng <[email protected]>

License
=======

MIT

.. _scrapely: https://github.com/scrapy/scrapely
.. _here: http://ibc.scrapinghub.com/tmp/h.html
