hyphe-traph's Introduction


hyphe-traph

The Traph is an on-file index structure designed to store hyphe's network of pages & webentities.

Under the hood, the Traph is the combination of a ternary search tree of URL stems and linked lists of pages' hyperlinks (hence the portmanteau name).
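
A purely illustrative, in-memory sketch of that combination (the real structure lives in fixed-size binary blocks on file, and the field names here are assumptions, not the actual layout):

# Illustrative only: the actual Traph stores both structures as fixed-size
# binary blocks in files, not as Python objects.


class StemNode(object):
    """Ternary search tree node holding one character of a URL stem."""

    def __init__(self, char):
        self.char = char
        self.left = None        # stems sorting before this character
        self.right = None       # stems sorting after this character
        self.child = None       # next character of the current stem
        self.is_page = False    # True if a crawled page ends here
        self.links_head = None  # head of this page's linked list of links


class LinkNode(object):
    """One hyperlink in a page's linked list of links."""

    def __init__(self, target_stem_node, weight=1):
        self.target = target_stem_node
        self.weight = weight
        self.next = None  # next link of the same source page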

Development

hyphe-traph is written to work with Python 2.7.

# Install dev dependencies (preferably in a virtual env)
pip install -r requirements.txt

# Run the tests
make test

# Run the linter
make lint

hyphe-traph's People

Contributors

boogheta, jacomyma, yomguithereal


hyphe-traph's Issues

The raw hashmap

To optimize RAM consumption when computing the webentity network, we could store the page block / webentity association in a raw-file contiguous-array hashmap, whose blocks are the same as in the LRU trie (0.008 size factor, ~50% sparse) and which stores the webentity id.
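
A minimal sketch of such a raw-file hashmap, using fixed-size records and linear probing. The record layout, names and the bytearray standing in for the raw file are all illustrative assumptions:

import struct

# Assumed record layout: 8-byte page block address + 4-byte webentity id.
RECORD = struct.Struct('<QI')
CAPACITY = 1024  # in practice, sized so the table stays roughly 50% sparse

# A bytearray stands in for the memory-mapped raw file here.
buf = bytearray(RECORD.size * CAPACITY)


def put(page_block, weid):
    """Associate a page block address with a webentity id (linear probing)."""
    i = page_block % CAPACITY
    while True:
        block, _ = RECORD.unpack_from(buf, i * RECORD.size)
        # 0 marks an empty slot, which ties into the "Zero based issues" below
        if block == 0 or block == page_block:
            RECORD.pack_into(buf, i * RECORD.size, page_block, weid)
            return
        i = (i + 1) % CAPACITY


def get(page_block):
    """Return the webentity id associated with a page block, or None."""
    i = page_block % CAPACITY
    while True:
        block, weid = RECORD.unpack_from(buf, i * RECORD.size)
        if block == 0:
            return None
        if block == page_block:
            return weid
        i = (i + 1) % CAPACITY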

This issue might be a duplicate. Better safe than sorry.

Also store source somehow in links

It would enable us to read the whole file sequentially when aggregating links rather than hopping all over it. But it would double the writing cost, so there's that...
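
As an illustration of what "storing the source" could mean: each packet of outlinks written by a crawl could be prefixed by its source page block, so one sequential scan of the file is enough to rebuild (source, target, weight) triples. The layout and names below are assumptions, not the current file format:

import struct

# Assumed on-file layout, one packet per crawled source page:
#   [source page block][number of links] then [target block, weight] * n
HEADER = struct.Struct('<QI')
LINK = struct.Struct('<QI')


def read_links_sequentially(f):
    """Yield (source_block, target_block, weight) by scanning the whole file."""
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        source, n = HEADER.unpack(header)
        for _ in range(n):
            target, weight = LINK.unpack(f.read(LINK.size))
            yield source, target, weight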

Add a paginate_webentity_pagelinks function

We also have another route that generates very long queries for some webentities and that would benefit from a paginated call (a client-side sketch follows the checklist below):

  • add function in traph
  • use it in Hyphe's core API
  • do paginated calls in front
  • use an estimated nb_links (nb_pages + 50 * nb_pages_crawled ?) to simulate progress bar :p
  • actually use it
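
A rough sketch of how the paginated calls could chain on the client side. The signature mirrors the call visible in the logs of the "Node has no right sibling" issue below, but the exact result shape (links list + continuation token) is an assumption:

def iterate_webentity_pagelinks(traph, weid, prefixes, page_count=10):
    """Collect all pagelinks of a webentity through successive paginated calls."""
    links = []
    token = None
    while True:
        result = traph.paginate_webentity_pagelinks(
            weid,
            prefixes,
            include_outbound=False,
            pagination_token=token,
            source_page_count=page_count
        )
        links.extend(result['links'])
        token = result.get('token')  # assumed absent/None on the last page
        if token is None:
            break
    return links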

Do we add underlying number of pages on the trie's nodes?

We could increment an underlying number of pages on the trie's nodes as we go down the trie when adding a page. This could yield the following potential benefits:

  • In the arborescence, to know how many pages a prefix contains (without knowing about child webentities, warning!).
  • We could know the number of pages of a webentity without a huge cost (forgetting about child webentities).

But this would mean reducing the stem size by 4 bytes, which would probably incur a performance cost. What's more, we would need to write the traversed nodes on descent, which would also incur a performance cost.
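
A minimal sketch of what the descent could look like, assuming each trie node gains a page_count field (the node layout below is hypothetical and kept as a plain dict-based trie for brevity):

class TrieNode(object):
    def __init__(self, stem):
        self.stem = stem
        self.children = {}    # the real trie is ternary; a dict keeps this short
        self.page_count = 0   # the extra 4 bytes discussed above
        self.is_page = False


def add_page(root, stems):
    """Insert a page and increment page_count on every traversed node."""
    node = root
    for stem in stems:
        node = node.children.setdefault(stem, TrieNode(stem))
        node.page_count += 1  # extra write: each traversed block must be flushed
    node.is_page = True
    return node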

"Node has no right sibling" error

Running a very simple corpus on Hyphe, crawling only www.regardscitoyens.org at depth 1, returns this error when trying to display its internal network of pages.
Previous pagination calls run smoothly until we reach a token that does not:

2021-06-07 18:52:54+0200 [INFO - today] Traph client query: paginate_webentity_pagelinks [1, ["s:http|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|h:www|", "s:http|h:org|h:regardscitoyens|h:www|"]] {"include_outbound": false, "pagination_token": "2#DTZ", "source_page_count": 10}
2021-06-07 18:52:54+0200 [DEBUG - ANSWER] store.paginate_webentity_pagelinks_network: "{\"jsonrpc\": \"2.0\", \"result\": {\"code\": \"success\", \"result\": {\"token\": \"1714|40|3#2__\", \"links\": [[\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:les-senateurs-veulent-marier-open-data-et-secret-des-affaires|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:lopen-data-francais-merite-mieux-quune-transposition-au-rabais-de-la-directive-psi|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:publication|p:|f:logiciels|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:01net-pro-%E2%80%93-open-data-dix-questions-pour-tout-comprendre|\", 1], [\"s:https|h:org|h:regardscitoyens|h:www|p:open-data-en-france|p:|\", \"s:https|h:org|h:regardscitoyens|h:www|p:statu-quo-pour-les-donnees-publiques-le-lobbying-des-not ... [52640 cars truncated]

2021-06-07 18:52:54+0200 [INFO - today] Traph client query: paginate_webentity_pagelinks [1, ["s:http|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|h:www|", "s:http|h:org|h:regardscitoyens|h:www|"]] {"include_outbound": false, "pagination_token": "3#2__", "source_page_count": 10}
2021-06-07 18:52:54+0200 [INFO - today] Traph server answer: {"query": {"args": [1, ["s:http|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|", "s:https|h:org|h:regardscitoyens|h:www|", "s:http|h:org|h:regardscitoyens|h:www|"]], "method": "paginate_webentity_pagelinks", "kwargs": {"include_outbound": false, "pagination_token": "3#2__", "source_page_count": 10}}, "message": "Node has no right sibling.", "code": "fail"}

What to do when we recrawl a page

Currently:

  • old links are preserved, new ones are added and existing ones get their weights incremented

Possible options:

  • tell scrapy not to recrawl pages already crawled
  • remove old links before adding new ones (=> fragmentation of traph data)
  • remove pagelinks weight
  • flag old links that have disappeared with timestamps

Optimize network generation (ram issues + time issues)

  • Split link_store in two: inlink_store + outlink_store
  • Within outlink_store, add the source before every packet of links (the same source can appear multiple times) (cf. #15)
  • Optimize outlink_store's file structure by having different block sizes per packet of links
  • Optimize inlink_store by reversing the chained list (cf. #13)
  • In network generation, drop the list of pointers and read the outlink_store sequentially instead

Overall this should both speed things up and lighten the computation.
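
Assuming the sequential outlink_store layout sketched under "Also store source somehow in links" above and some page block -> webentity lookup (for instance the raw hashmap), the aggregation could then become a single pass; the function and parameter names below are hypothetical:

from collections import defaultdict


def aggregate_webentity_links(read_links, block_to_webentity):
    """Build the webentity graph in one sequential pass over the outlink_store.

    read_links yields (source_block, target_block, weight) triples;
    block_to_webentity maps a page block address to a webentity id.
    """
    network = defaultdict(int)
    for source_block, target_block, weight in read_links:
        source_weid = block_to_webentity(source_block)
        target_weid = block_to_webentity(target_block)
        if source_weid is None or target_weid is None:
            continue
        network[(source_weid, target_weid)] += weight
    return network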

Algorithmics: inlinks linked list apocalypse?

The fact that inlinks are stored in a linked list might become a problem in the future for very popular pages.

We could fix that by not storing weights and by keeping a reversed list that stores a pointer to the tail rather than to the head, so we can insert links in O(1) rather than the current O(n).
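
A sketch of the reversed-list idea, in memory (on file the "pointers" would be block addresses): since no weight has to be updated, a new inlink never needs to traverse the existing list.

class InlinkBlock(object):
    def __init__(self, source_page, previous=None):
        self.source_page = source_page
        self.previous = previous  # points towards the older inlinks


class PageNode(object):
    def __init__(self):
        self.inlinks_tail = None  # pointer to the most recent inlink

    def add_inlink(self, source_page):
        """O(1): the new block becomes the tail and points to the old tail."""
        self.inlinks_tail = InlinkBlock(source_page, previous=self.inlinks_tail)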

Variation nibble flag

We could drop the https?/www? variations by storing them as a nibble flag in the blocks. This could alleviate a lot of issues related to memory and computation time.
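
For instance, the four https?/www? variations of a page fit in one nibble, one bit per variation (the flag names below are assumptions):

# One bit per variation of the same page:
HTTP      = 0b0001  # http://domain.tld/...
HTTPS     = 0b0010  # https://domain.tld/...
HTTP_WWW  = 0b0100  # http://www.domain.tld/...
HTTPS_WWW = 0b1000  # https://www.domain.tld/...


def add_variation(flags, variation):
    """Mark one variation as existing in the node's nibble flag."""
    return flags | variation


def has_variation(flags, variation):
    """Check whether a given variation was already seen."""
    return bool(flags & variation)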

Might be a duplicate issue.

Unit testing edge case with prefix deletion

Scenario (a rough test sketch follows the steps below):

  1. Creating a webentity
  2. Deleting one of its 4 default prefixes
  3. Adding a page under the deleted prefix that will trigger the creation of a webentity on the deleted prefix
  4. Verify that: 1) the once-deleted prefix now belongs to the new webentity; 2) the three other prefixes still belong to the first one.
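
A rough shape for such a test; every Traph call below (create_webentity, remove_prefix, webentity_of_prefix, add_page) is a hypothetical stand-in to be adapted to the actual API:

def test_webentity_creation_on_deleted_prefix(traph):
    # 1. Create a webentity with its 4 default prefixes
    prefixes = [
        's:http|h:com|h:example|',
        's:https|h:com|h:example|',
        's:http|h:com|h:example|h:www|',
        's:https|h:com|h:example|h:www|'
    ]
    first = create_webentity(traph, prefixes)       # hypothetical helper

    # 2. Delete one of its 4 default prefixes
    deleted = prefixes[0]
    remove_prefix(traph, first, deleted)            # hypothetical helper

    # 3. Add a page under the deleted prefix, which should trigger
    #    the creation of a webentity on that prefix (assumed method)
    traph.add_page('s:http|h:com|h:example|p:index|')

    # 4.1 The once-deleted prefix now belongs to the new webentity
    assert webentity_of_prefix(traph, deleted) != first
    # 4.2 The three other prefixes still belong to the first one
    for prefix in prefixes[1:]:
        assert webentity_of_prefix(traph, prefix) == first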

Adding more debug info to iteration state

We could add statuses and rough progress indicators to the iteration state class. For instance, when computing the graph, we could say at which step we currently are (DFS, link aggregation) and roughly at what percentage we stand.
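
A sketch of what the extra fields could look like on the iteration state class (the field and step names are assumptions):

class IterationState(object):
    """Assumed shape: the existing state object gains a step and a progress."""

    STEP_DFS = 'dfs'
    STEP_LINK_AGGREGATION = 'link-aggregation'

    def __init__(self):
        self.step = self.STEP_DFS
        self.done = 0    # e.g. trie blocks already visited
        self.total = 0   # e.g. total trie blocks, known from the file size

    def progress(self):
        """Rough percentage for the current step."""
        if not self.total:
            return 0.0
        return 100.0 * self.done / self.total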

Adjust Hyphe's front to new traph capabilities

  • [WEBENTITY page] Use the cached total of pages instead of the total of pages received
  • [WEBENTITY page] Display the total of crawled pages, if any, in parentheses next to the total
  • [WEBENTITY page] Remove the search filter on pages
  • [EXPORT page] Add page counts per entity to the exportable fields
  • [NETWORK page] Add page counts per entity to the exportable network
  • [NETWORK page] Allow using page counts to set node sizes
  • [OVERVIEW] Check the total number of pages against the sum of total pages per entity
  • [WEBENTITIES page] Add an orderable column with the total number of pages per entity? (warning: requires an API adjustment)
  • [WEBENTITY page] Load pages dynamically with pagination triggered by scroll (medialab/hyphe#293)
  • [WEBENTITY page/AS FOLDERS] Load pages progressively via bigger pagination and use it for the loader bar
  • [WEBENTITY page/PAGESNETWORK] Load pages progressively via bigger pagination and use it for the loader bar

Cursor hopping

Cursor hopping might be detrimental, but I cannot find a way to avoid it without compromising concurrency.

Zero based issues

For some pointers, 0 should be kept as a NULL pointer. We could solve this with one of the following solutions (a sketch of the 1-based option follows the list):

  • Header blocks so that the actual root starts at an offset > 0 (+ might be useful to keep some more metadata)
  • 1-based indexing (involves a lot of -1 operations to get the actual block address)
  • Using a flag to denote absence
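
For instance, the 1-based option only shifts addresses at the file boundary; a small sketch assuming fixed-size blocks (the block size is hypothetical):

BLOCK_SIZE = 40   # hypothetical block size in bytes

NULL_POINTER = 0  # 0 can now safely mean "no block"


def pointer_to_offset(pointer):
    """Convert a 1-based block pointer to its byte offset in the file."""
    assert pointer != NULL_POINTER
    return (pointer - 1) * BLOCK_SIZE  # the -1 operations mentioned above


def offset_to_pointer(offset):
    """Convert a byte offset back to a 1-based block pointer."""
    return offset // BLOCK_SIZE + 1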
