hyphe-traph's Introduction


hyphe-traph

The Traph is an on-file index structure designed to store hyphe's network of pages & webentities.

Under the hood, the Traph is the combination of a ternary search tree of URL stems and linked lists of pages' hyperlinks (hence the portmanteau name).
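
A purely illustrative, in-memory sketch of that combination (the real structure lives in fixed-size binary blocks on file, and the field names here are assumptions, not the actual layout):

# Illustrative only: the actual Traph stores both structures as fixed-size
# binary blocks in files, not as Python objects.


class StemNode(object):
    """Ternary search tree node holding one character of a URL stem."""

    def __init__(self, char):
        self.char = char
        self.left = None        # stems sorting before this character
        self.right = None       # stems sorting after this character
        self.child = None       # next character of the current stem
        self.is_page = False    # True if a crawled page ends here
        self.links_head = None  # head of this page's linked list of links


class LinkNode(object):
    """One hyperlink in a page's linked list of links."""

    def __init__(self, target_stem_node, weight=1):
        self.target = target_stem_node
        self.weight = weight
        self.next = None  # next link of the same source page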

Development

hyphe-traph is written to work with Python 2.7.

# Install dev dependencies (preferably in a virtual env)
pip install -r requirements.txt

# Run the tests
make test

# Run the linter
make lint

hyphe-traph's People

Contributors

boogheta, jacomyma, yomguithereal


hyphe-traph's Issues

The raw hashmap

To optimize RAM consumption when computing the webentity network, we could store the page block / webentity association in a raw-file contiguous-array hashmap, whose blocks are the same as in the LRU trie (0.008 size factor, ~50% sparse) and which stores the webentity id.
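
A minimal sketch of such a raw-file hashmap, using fixed-size records and linear probing. The record layout, names and the bytearray standing in for the raw file are all illustrative assumptions:

import struct

# Assumed record layout: 8-byte page block address + 4-byte webentity id.
RECORD = struct.Struct('<QI')
CAPACITY = 1024  # in practice, sized so the table stays roughly 50% sparse

# A bytearray stands in for the memory-mapped raw file here.
buf = bytearray(RECORD.size * CAPACITY)


def put(page_block, weid):
    """Associate a page block address with a webentity id (linear probing)."""
    i = page_block % CAPACITY
    while True:
        block, _ = RECORD.unpack_from(buf, i * RECORD.size)
        # 0 marks an empty slot, which ties into the "Zero based issues" below
        if block == 0 or block == page_block:
            RECORD.pack_into(buf, i * RECORD.size, page_block, weid)
            return
        i = (i + 1) % CAPACITY


def get(page_block):
    """Return the webentity id associated with a page block, or None."""
    i = page_block % CAPACITY
    while True:
        block, weid = RECORD.unpack_from(buf, i * RECORD.size)
        if block == 0:
            return None
        if block == page_block:
            return weid
        i = (i + 1) % CAPACITY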

This issue might be a duplicate. Better safe than sorry.

Also store source somehow in links

It would enable us to read the whole file sequentially when aggregating links rather than hopping all over it. But it would double the writing cost, so there's that...
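
As an illustration of what "storing the source" could mean: each packet of outlinks written by a crawl could be prefixed by its source page block, so one sequential scan of the file is enough to rebuild (source, target, weight) triples. The layout and names below are assumptions, not the current file format:

import struct

# Assumed on-file layout, one packet per crawled source page:
#   [source page block][number of links] then [target block, weight] * n
HEADER = struct.Struct('<QI')
LINK = struct.Struct('<QI')


def read_links_sequentially(f):
    """Yield (source_block, target_block, weight) by scanning the whole file."""
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        source, n = HEADER.unpack(header)
        for _ in range(n):
            target, weight = LINK.unpack(f.read(LINK.size))
            yield source, target, weight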

Add a paginate_webentity_pagelinks function

We also have another route that generates very long queries for some webentities and that would benefit from a paginated call (a client-side sketch follows the checklist below):

  • add function in traph
  • use it in Hyphe's core API
  • do paginated calls in front
  • use an estimated nb_links (nb_pages + 50 * nb_pages_crawled ?) to simulate progress bar :p
  • actually use it
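
A rough sketch of how the paginated calls could chain on the client side. The signature mirrors the call visible in the logs of the "Node has no right sibling" issue below, but the exact result shape (links list + continuation token) is an assumption:

def iterate_webentity_pagelinks(traph, weid, prefixes, page_count=10):
    """Collect all pagelinks of a webentity through successive paginated calls."""
    links = []
    token = None
    while True:
        result = traph.paginate_webentity_pagelinks(
            weid,
            prefixes,
            include_outbound=False,
            pagination_token=token,
            source_page_count=page_count
        )
        links.extend(result['links'])
        token = result.get('token')  # assumed absent/None on the last page
        if token is None:
            break
    return links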

Do we add underlying number of pages on the trie's nodes?

We could increment an underlying number of pages on the trie's nodes as we go down the trie when adding a page. This could yield the following potential benefits:

  • In the arborescence, to know how many pages a prefix contains (without knowing about child webentities, warning!).
  • We could know the number of pages of a webentity without a huge cost (forgetting about child webentities).

But this would mean reducing the stem size by 4 bytes, which would probably incur a performance cost. What's more, we would need to write the traversed nodes on descent, which would also incur a performance cost.
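
A minimal sketch of what the descent could look like, assuming each trie node gains a page_count field (the node layout below is hypothetical and kept as a plain dict-based trie for brevity):

class TrieNode(object):
    def __init__(self, stem):
        self.stem = stem
        self.children = {}    # the real trie is ternary; a dict keeps this short
        self.page_count = 0   # the extra 4 bytes discussed above
        self.is_page = False


def add_page(root, stems):
    """Insert a page and increment page_count on every traversed node."""
    node = root
    for stem in stems:
        node = node.children.setdefault(stem, TrieNode(stem))
        node.page_count += 1  # extra write: each traversed block must be flushed
    node.is_page = True
    return node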

"Node has no right sibling" error

Running a very simple corpus on Hyphe, crawling only www.regardscitoyens.org at depth 1, returns this error when trying to display its internal network of pages.
Previous pagination calls run smoothly until we reach a token that does not:

2021-06-07 18:52:54+0200 [INFO - today] Traph client query: paginate_webentity_pagelinks [1, ["s:http|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|h:www|", "s:http|h:org|h:regardscitoyens|h:www|"]] {"include_outbound": false, "pagination_token": "2#DTZ", "source_page_count": 10}
2021-06-07 18:52:54+0200 [DEBUG - ANSWER] store.paginate_webentity_pagelinks_network: "{\"jsonrpc\": \"2.0\", \"result\": {\"code\": \"success\", \"result\": {\"token\": \"1714|40|3#2__\", \"links\": [[\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:les-senateurs-veulent-marier-open-data-et-secret-des-affaires|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:lopen-data-francais-merite-mieux-quune-transposition-au-rabais-de-la-directive-psi|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:publication|p:|f:logiciels|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:01net-pro-%E2%80%93-open-data-dix-questions-pour-tout-comprendre|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:statu-quo-pour-les-donnees-publiques-le-lobbying-des-not ... [52640 cars truncated]

2021-06-07 18:52:54+0200 [INFO - today] Traph client query: paginate_webentity_pagelinks [1, ["s:http|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|h:www|", "s:http|h:org|h:regardscitoyens|h:www|"]] {"include_outbound": false, "pagination_token": "3#2__", "source_page_count": 10}
2021-06-07 18:52:54+0200 [INFO - today] Traph server answer: {"query": {"args": [1, ["s:http|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|h:www|", "s:http|h:org|h:regardscitoyens|h:www|"]], "method": "paginate_webentity_pagelinks", "kwargs": {"include_outbound": false, "pagination_token": "3#2__", "source_page_count": 10}}, "message": "Node has no right sibling.", "code": "fail"}

What to do when we recrawl a page

Currently:

  • old links are preserved, new ones are added and existing ones get their weights incremented

Possible options:

  • tell scrapy not to recrawl pages already crawled
  • remove old links before adding new ones (=> fragmentation of traph data)
  • remove pagelinks weight
  • flag old links that have disappeared with timestamps

Optimize network generation (ram issues + time issues)

  • Split link_store in two: inlink_store + outlink_store
  • Within outlink_store, add the source before every packet of links (the same source can appear multiple times) (cf. #15)
  • Optimize outlink_store's file structure by having different block sizes per packet of links
  • Optimize inlink_store by reversing the chained list (cf. #13)
  • In network generation, drop the list of pointers and read the outlink_store sequentially instead

Overall this should both speed things up and lighten the computation.
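
Assuming the sequential outlink_store layout sketched under "Also store source somehow in links" above and some page block -> webentity lookup (for instance the raw hashmap), the aggregation could then become a single pass; the function and parameter names below are hypothetical:

from collections import defaultdict


def aggregate_webentity_links(read_links, block_to_webentity):
    """Build the webentity graph in one sequential pass over the outlink_store.

    read_links yields (source_block, target_block, weight) triples;
    block_to_webentity maps a page block address to a webentity id.
    """
    network = defaultdict(int)
    for source_block, target_block, weight in read_links:
        source_weid = block_to_webentity(source_block)
        target_weid = block_to_webentity(target_block)
        if source_weid is None or target_weid is None:
            continue
        network[(source_weid, target_weid)] += weight
    return network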

Algorithmics: inlinks linked list apocalypse?

The fact that inlinks are stored in a linked list might become a problem in the future for very popular pages.

We could fix that by not storing weights and by keeping a reversed list that stores a pointer to the tail rather than to the head, so we can insert links in O(1) rather than the current O(n).
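
A sketch of the reversed-list idea, in memory (on file the "pointers" would be block addresses): since no weight has to be updated, a new inlink never needs to traverse the existing list.

class InlinkBlock(object):
    def __init__(self, source_page, previous=None):
        self.source_page = source_page
        self.previous = previous  # points towards the older inlinks


class PageNode(object):
    def __init__(self):
        self.inlinks_tail = None  # pointer to the most recent inlink

    def add_inlink(self, source_page):
        """O(1): the new block becomes the tail and points to the old tail."""
        self.inlinks_tail = InlinkBlock(source_page, previous=self.inlinks_tail)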

Variation nibble flag

We could drop the https?/www? variations by storing them as a nibble flag in the blocks. This could alleviate a lot of issues related to memory and computation time.
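
For instance, the four https?/www? variations of a page fit in one nibble, one bit per variation (the flag names below are assumptions):

# One bit per variation of the same page:
HTTP      = 0b0001  # http://domain.tld/...
HTTPS     = 0b0010  # https://domain.tld/...
HTTP_WWW  = 0b0100  # http://www.domain.tld/...
HTTPS_WWW = 0b1000  # https://www.domain.tld/...


def add_variation(flags, variation):
    """Mark one variation as existing in the node's nibble flag."""
    return flags | variation


def has_variation(flags, variation):
    """Check whether a given variation was already seen."""
    return bool(flags & variation)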

Might be a duplicate issue.

Unit testing edge case with prefix deletion

Scenario (a rough test sketch follows the steps below):

  1. Creating a webentity
  2. Deleting one of its 4 default prefixes
  3. Adding a page under the deleted prefix that will trigger the creation of a webentity on the deleted prefix
  4. Verify that: 1) the once-deleted prefix now belongs to the new webentity; 2) the three other prefixes still belong to the first one.
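
A rough shape for such a test; every Traph call below (create_webentity, remove_prefix, webentity_of_prefix, add_page) is a hypothetical stand-in to be adapted to the actual API:

def test_webentity_creation_on_deleted_prefix(traph):
    # 1. Create a webentity with its 4 default prefixes
    prefixes = [
        's:http|h:com|h:example|',
        's:https|h:com|h:example|',
        's:http|h:com|h:example|h:www|',
        's:https|h:com|h:example|h:www|'
    ]
    first = create_webentity(traph, prefixes)       # hypothetical helper

    # 2. Delete one of its 4 default prefixes
    deleted = prefixes[0]
    remove_prefix(traph, first, deleted)            # hypothetical helper

    # 3. Add a page under the deleted prefix, which should trigger
    #    the creation of a webentity on that prefix (assumed method)
    traph.add_page('s:http|h:com|h:example|p:index|')

    # 4.1 The once-deleted prefix now belongs to the new webentity
    assert webentity_of_prefix(traph, deleted) != first
    # 4.2 The three other prefixes still belong to the first one
    for prefix in prefixes[1:]:
        assert webentity_of_prefix(traph, prefix) == first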

Adding more debug info to iteration state

We could add statuses and rough progress indicators to the iteration state class. For instance, when computing the graph, we could say at which step we currently are (DFS, link aggregation) and roughly at what percentage we stand.
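
A sketch of what the extra fields could look like on the iteration state class (the field and step names are assumptions):

class IterationState(object):
    """Assumed shape: the existing state object gains a step and a progress."""

    STEP_DFS = 'dfs'
    STEP_LINK_AGGREGATION = 'link-aggregation'

    def __init__(self):
        self.step = self.STEP_DFS
        self.done = 0    # e.g. trie blocks already visited
        self.total = 0   # e.g. total trie blocks, known from the file size

    def progress(self):
        """Rough percentage for the current step."""
        if not self.total:
            return 0.0
        return 100.0 * self.done / self.total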

Adjust Hyphe's front to new traph capabilities

  • [WEBENTITY page] Use the cached total of pages instead of the total of pages received
  • [WEBENTITY page] Display the total of crawled pages, if any, in parentheses next to the total
  • [WEBENTITY page] Remove the search filter on pages
  • [EXPORT page] Add page counts per entity to the exportable fields
  • [NETWORK page] Add page counts per entity to the exportable network
  • [NETWORK page] Allow using page counts to set node sizes
  • [OVERVIEW] Check the total number of pages against the sum of total pages per entity
  • [WEBENTITIES page] Add an orderable column with the total number of pages per entity? (warning: requires an API adjustment)
  • [WEBENTITY page] Load pages dynamically with pagination triggered by scroll (medialab/hyphe#293)
  • [WEBENTITY page/AS FOLDERS] Load pages progressively via bigger pagination and use it for the loader bar
  • [WEBENTITY page/PAGESNETWORK] Load pages progressively via bigger pagination and use it for the loader bar

Cursor hopping

Cursor hopping might be detrimental, but I cannot find a way to avoid it without compromising concurrency.

Zero based issues

For some pointers, 0 should be kept as a NULL pointer. We could solve this with one of the following solutions (a sketch of the 1-based option follows the list):

  • Header blocks so that the actual root starts at an offset > 0 (+ might be useful to keep some more metadata)
  • 1-based indexing (involves a lot of -1 operations to get the actual block address)
  • Using a flag to denote absence
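
For instance, the 1-based option only shifts addresses at the file boundary; a small sketch assuming fixed-size blocks (the block size is hypothetical):

BLOCK_SIZE = 40   # hypothetical block size in bytes

NULL_POINTER = 0  # 0 can now safely mean "no block"


def pointer_to_offset(pointer):
    """Convert a 1-based block pointer to its byte offset in the file."""
    assert pointer != NULL_POINTER
    return (pointer - 1) * BLOCK_SIZE  # the -1 operations mentioned above


def offset_to_pointer(offset):
    """Convert a byte offset back to a 1-based block pointer."""
    return offset // BLOCK_SIZE + 1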
