Comments (2)

blackforestboi commented on July 18, 2024

hey @dldx @jarmitage

We forked the Falcon tool a while back and integrated the import of the existing history and bookmarks.

We did it by importing them via the chrome.history and chrome.bookmarks APIs.
You can check it out here: https://github.com/WorldBrain/Research-Engine

We are more than happy to collaborate on this in the future!

Best,
Oliver

from falcon.

dldx commented on July 18, 2024

I've come up with a hackish, slightly technical way of doing this. Chrome/Opera only stores the past 3 months' worth of history, which is annoying, but that is what we have to work with. For me, that's still a helluva lot of URLs, so I had to come up with various ways of filtering it down to something more manageable. I don't really want to load every random website I visited in any case. So here's what I did. These instructions are for Linux, but I'm sure they would be similar on Mac too:

  1. Change Chrome's settings to not load any images to save bandwidth and memory. Also close/save any tabs you care about because we're going to load a lot of new tabs at once and you won't be able to rescue old ones.
  2. Close all windows of Chrome/Opera - you can't open the history file if you don't.
  3. Install Sqliteman or a similar SQLite database viewer, plus sqlite3-pcre (a regex plugin for SQLite).
  4. Open the History database which is located at ~/.config/google-chrome/Default/History (or something similar if you have several profiles) or ~/.config/opera/History
  5. Load the regex plugin into Sqliteman with SELECT load_extension('/usr/lib/sqlite3/pcre.so');
  6. Run the following code to create a list of websites you want.
    select urls.url
    from urls inner join visits on urls.id = visits.url
    where urls.url not like '%google.%'
      and urls.url not like '%facebook.com%'
      and urls.url not like '%youtube.com%'
      and urls.url not like '%localhost%'
      and urls.url not like '%127.0%'
      and urls.url not like '%192.168%'
      and urls.url not like '%zero%'
      and urls.url not like '%out.reddit.com%'
      and urls.url not regexp '^https?:\/\/[\w\.]+[a-z\/]?$'
      and (urls.title like '%income%' or urls.title like '%climate%')
    group by urls.url
    order by sum(visits.visit_duration) desc;
    This is just an example, so change it to suit your needs. I filtered out Facebook, YouTube, localhost, etc. because they wouldn't be interesting. Then I filtered out all URLs that go to the homepage of a site, and finally I searched for the words "income" or "climate" in the page titles because I'm interested in basic income and climate change. Without those final filters I would get thousands of URLs, but with them I only get about 200. Play with the filters a bit in Sqliteman to get a list of URLs you want to archive, but make sure it isn't too long. Save the SQL code you used, including the load_extension line, to a file called interesting_sites.sql. Then close Sqliteman.
  7. Open a terminal and run something like this:
    sqlite3 ~/.config/opera-developer/History < interesting_sites.sql | while read -r line; do opera-developer --new-page "$line" & done
    Replace opera-developer with google-chrome or whichever browser you use, and adjust the History path to match.
  8. This command gets the list of URLs from SQLite and opens each one in Chrome/Opera, where Falcon will hopefully index every site automatically. It worked pretty well for me and only took a few seconds to load about 150 sites.
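
For anyone who would rather skip Sqliteman and the pcre plugin, the filtering in steps 4 to 6 can be sketched in Python with only the standard library. This is a rough sketch, not something from the thread: the names `interesting_urls`, `urls_from_history_file`, `EXCLUDED` and `KEYWORDS` are made up for illustration, the REGEXP support comes from Python's `re` module rather than sqlite3-pcre, and reading from a temporary copy of the History file is an assumption of this sketch. It only relies on the `urls` and `visits` columns already used in the query above.

```python
import re
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical defaults: adjust the path to your browser and profile.
HISTORY_PATH = Path.home() / ".config/google-chrome/Default/History"
EXCLUDED = ["google.", "facebook.com", "youtube.com", "localhost",
            "127.0", "192.168", "zero", "out.reddit.com"]
KEYWORDS = ["income", "climate"]
# Same pattern as in step 6: URLs that point at a site's homepage.
HOMEPAGE = r"^https?://[\w.]+[a-z/]?$"


def _regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" as regexp(pattern, value).
    return value is not None and re.search(pattern, value) is not None


def interesting_urls(conn, excluded=EXCLUDED, keywords=KEYWORDS):
    """Return filtered URLs, longest total visit duration first."""
    conn.create_function("regexp", 2, _regexp)
    not_like = " AND ".join("urls.url NOT LIKE ?" for _ in excluded) or "1"
    # The title keywords are OR-ed inside one parenthesised group so the
    # exclusions above apply to every keyword, not just the first one.
    title_like = " OR ".join("urls.title LIKE ?" for _ in keywords) or "1"
    sql = f"""
        SELECT urls.url
        FROM urls INNER JOIN visits ON urls.id = visits.url
        WHERE {not_like}
          AND urls.url NOT REGEXP ?
          AND ({title_like})
        GROUP BY urls.url
        ORDER BY SUM(visits.visit_duration) DESC
    """
    params = ([f"%{s}%" for s in excluded] + [HOMEPAGE]
              + [f"%{k}%" for k in keywords])
    return [row[0] for row in conn.execute(sql, params)]


def urls_from_history_file(path=HISTORY_PATH):
    """Query a temporary copy of the History file (assumption of this sketch)."""
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / "History"
        shutil.copy2(path, copy)
        conn = sqlite3.connect(copy)
        try:
            return interesting_urls(conn)
        finally:
            conn.close()
```

Printing each entry of `urls_from_history_file()` on its own line gives the same kind of list that the `while read` loop in step 7 consumes, so the output could be piped into that loop in place of the sqlite3 command.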

Hope that helps. I'll try to find a way to do better filtering of history but this is what I have so far!

Cheers,
Durand
