
ahmia's Introduction

Ahmia - Tor Hidden Service Search

https://ahmia.fi/

New repository!

https://github.com/ahmia/search

Compatibility

Ahmia requires Python 2.7+ and Django 1.6+

The crawler is called Onionbot and it requires Apache Solr for the data.

Installation

  • Currently, ahmia listens to Solr at http://127.0.0.1:33433/
  • An HTTP server is required
  • Please see /apache2/ to set it up to run with the Apache HTTP server
  • Note the crontabs: the order of the tasks is important
  • See the crontabs file. It is strongly recommended to run the cron tasks on a server other than the web server itself
Install dependencies:
$ apt-get install libxml2-dev libxslt-dev python-dev
$ apt-get install libpq-dev
$ apt-get install python-socksipy python-psycopg2 libapache2-mod-wsgi
$ apt-get install libffi-dev
$ pip install -r requirements.txt
Furthermore, you will need to set the execution rights for the tools:
$ chmod -R ugo+rx /usr/local/lib/ahmia/tools/
And to Apache:
$ chown -R www-data:www-data /usr/local/lib/ahmia/
$ chmod -R u=rwX,g=rX,o=rX /usr/local/lib/ahmia/
Move the Apache settings into place and adjust WSGI processes=X threads=Y.

The upper limit on the memory Apache needs is X*Y*8 MB. For instance, 4*16*8 MB = 512 MB.

$ cp apache2/sites-available/django-ahmia /etc/apache2/sites-available/django-ahmia
$ /etc/init.d/apache2 restart
And after creating the SQLite database:
$ chown www-data:www-data /usr/local/lib/ahmia
$ chown www-data:www-data /usr/local/lib/ahmia/ahmia_db
Not required, but recommended for better system performance:
  • Install haveged - A simple entropy daemon
  • Edit the process and threads parameters of the WSGIDaemonProcess in apache2/sites-available/django-ahmia
  • Use PostgreSQL
  • Install PgBouncer: a lightweight connection pooler for PostgreSQL

Features

  • Search engine for Tor hidden services.
  • Privacy: ahmia saves no IP logs.
  • Filtering of child abuse content.
  • Popularity tracking from Tor2web nodes, public WWW backlinks and the number of clicks in the search results.
  • Hidden service online tracker.

Demo

You can try the demo by cloning this repository and running the test server with provided data:

$ python manage.py syncdb
$ python manage.py loaddata ahmia/fixtures/initial_data.json
$ python manage.py runserver

Then open your browser to http://localhost:8000

Tests

Unittests:

$ python manage.py test ahmia/tests/

For developers

Please, at least, validate your Python code with:

$ pylint --rcfile=pylint.rc ./ahmia/python_code_file.py

and fix the major problems.

ahmia's People

Contributors

bsloan, copiesofcopies, juhanurmi, wtf


ahmia's Issues

Publish statistical data

As a result of indexing the Tor network's content, ahmia.fi can produce authoritative and exact quantitative research data about what is published through the Tor network.

Share information about each site found: Server type, how long it has been online/offline, when it was crawled, popularity and backlinks, keywords, language etc.

RESTful JSON API that provides the data.
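A sketch of what one record of such a JSON API could look like. The field names here are illustrative assumptions, not an existing ahmia.fi API:

```python
import json

def site_record(onion, server_type, first_seen, last_seen,
                online, backlinks, clicks, language):
    """Build one hypothetical API record describing a crawled hidden service."""
    return {
        "address": onion,
        "serverType": server_type,
        "firstSeen": first_seen,
        "lastSeen": last_seen,
        "online": online,
        "popularity": {"backlinks": backlinks, "clicks": clicks},
        "language": language,
    }

record = site_record("3g2upl4pq6kufc4m.onion", "nginx",
                     "2014-01-01", "2014-06-01", True, 12, 340, "en")
print(json.dumps(record, indent=2))
```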

Show cached text versions of the pages

There used to be cached text versions of the pages, but I had to remove them.

The problem is non-trivial: there are a lot of ways to inject pictures and harmful JavaScript to the text cache.

When I found that someone even injected images using only the URL schema (data:[][;charset=][;base64],), I had to take down the text cache.

Solutions?
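One possible direction, sketched with the Python 3 standard library only: strip data: URIs and escape all markup before storing the cached text, so nothing renders as an image or script. This is an illustration of the idea, not a complete defense:

```python
import html
import re

# drop anything that looks like a data: URI payload (up to and including the comma)
DATA_URI = re.compile(r'data:[^,]*,', re.IGNORECASE)

def sanitize_cached_text(raw):
    """Remove data: URIs, then escape all remaining markup characters."""
    text = DATA_URI.sub('', raw)
    return html.escape(text, quote=True)
```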

2 workweeks

Install database

Hi there,

First of all, thanks for this great project. I've been searching for something like this for quite a while now, and it seems Ahmia fits my needs almost perfectly.

I'm trying to set up Ahmia. Unfortunately, I find the instructions pretty hard to follow; it seems some parts are missing here and there.

The readme.md says "And after creating the SQLite database". However, it doesn't say how the database should be created or what the layout needs to be. I would also like to set this up with PostgreSQL instead of SQLite, but how would I go about doing this?

If I could help out in any way with this project, I would love to do so. I have minimal coding skills but tons of system/network and security administration skills.

Cheers, Ronald.

dependency for postgresql

When installing psycopg2 from the requirements, you may get this error:
++++++++++++++
Downloading psycopg2-2.5.3.tar.gz (690kB): 690kB downloaded
Running setup.py (path:/tmp/pip_build_root/psycopg2/setup.py) egg_info for package psycopg2

Error: pg_config executable not found.

Please add the directory containing pg_config to the PATH
or specify the full executable path with the option:

    python setup.py build_ext --pg-config /path/to/pg_config build ...

or with the pg_config option in 'setup.cfg'.
Complete output from command python setup.py egg_info:
running egg_info
creating pip-egg-info/psycopg2.egg-info
writing pip-egg-info/psycopg2.egg-info/PKG-INFO
writing top-level names to pip-egg-info/psycopg2.egg-info/top_level.txt
writing dependency_links to pip-egg-info/psycopg2.egg-info/dependency_links.txt
writing manifest file 'pip-egg-info/psycopg2.egg-info/SOURCES.txt'
warning: manifest_maker: standard file '-c' not found

Error: pg_config executable not found.

Please add the directory containing pg_config to the PATH
or specify the full executable path with the option:

python setup.py build_ext --pg-config /path/to/pg_config build ...

or with the pg_config option in 'setup.cfg'.

Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/psycopg2
Storing debug log for failure in /home/htor/.pip/pip.log
++++++++++++++

To resolve this, install the following on Ubuntu:
$ sudo apt-get install libpq-dev

(dependency for postgresql)

Better edited HS descriptions

Design and development of a more useful and complete UI, including more complete and exhaustive descriptions and details (e.g., show the whole history of descriptions and let users edit it more easily).

Requires security conscious design.

Show site popularity.

Commenting features.

Authenticated hidden service information: what does the service say about itself?

Expose some of the popularity/backlink information to users, in case it lets them pick results more safely.

2 workweeks

tor2web

The tor2web.fi proxy is dead, so it would be nice to remove the redirection link from your results.

Tweaking Apache Solr to get better search results

Apache Solr is a popular, open source enterprise search platform. Its major features include powerful full-text search, hit highlighting, faceted search, and near real-time indexing.

The schema.xml[1] file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.

Is anyone familiar with Solr who would know how to tweak it for full-text search? I would like to see it work even better.

How about using the popularity data in an efficient way?

[1] https://github.com/juhanurmi/ahmia/blob/master/solr/schema.xml

Popularity tracking

Popularity tracking (catch user clicks too). Development of a popularity-tracking feature for ahmia.fi. Show more relevant search results using the popularity data. A TOP 10 sites list?

2 workweeks

Adding gpg-signed onion address / gpg key to onion sites

Description Linked Data Proposal to WWW Servers Inside Tor
This proposal is to all HTTP servers inside Tor network.
The problem: we are wondering how we could get authenticated hidden service descriptions, that is, descriptions of hidden services provided by the hidden service operators themselves. We would like to show these official descriptions in the ahmia.fi search.
The solution proposal: Simple linked description datasheets provided by the hidden services.
How to do this: simply provide a linked description file on your hidden service web page. You can write your information into this form and use the generated JSON file. Please make sure that your description is valid JSON. You could then serve the JSON information about your hidden service at http://something.onion/description.json, and anyone could find this file.
Example JSON (DuckDuckGo):

{
  "title": "DuckDuckGo",
  "description": "DuckDuckGo is a search engine that is based in Valley Forge, Pennsylvania and uses information from crowd-sourced sites (like Wikipedia) with the aim of augmenting traditional results and improving relevance. The search engine philosophy emphasizes privacy and does not record user information.",
  "domains": ["http://duckduckgo.com/", "http://3g2upl4pq6kufc4m.onion/"],
  "keywords": ["search engine", "privacy", "no tracking", "DuckDuckGo"],
  "type": "search engine",
  "language": "en",
  "contactInformation": "http://help.duckduckgo.com/customer/portal/emails/new",
  "keyFP": "879B DA5B F6B2 7B61 2745 0A25 03CF 4A0A B3C7 9A63",
  "gpg_signature": "BEGIN PGP SIGNATURE...",
  "gpg_asc": "http://duckduckgo.com/sig.asc, http://xxxxx.onion/sig.asc"
}

(The gpg_signature field is optional.)

You can download and study the example JSON file.
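A minimal validator for such a file could look like the sketch below. The required field set is an assumption based on the example above, not a fixed specification:

```python
import json

# fields the proposal's example uses (keyFP/gpg_* treated as optional)
REQUIRED = ("title", "description", "domains", "keywords",
            "type", "language", "contactInformation")

def validate_description(text):
    """Parse a description.json document and check the fields the
    proposal asks for. Returns (ok, list_of_problems)."""
    try:
        doc = json.loads(text)
    except ValueError as err:
        return False, ["not valid JSON: %s" % err]
    problems = [field for field in REQUIRED if field not in doc]
    if not any(".onion" in d for d in doc.get("domains", [])):
        problems.append("no .onion domain listed")
    return not problems, problems
```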
Help us:
Please tell us your opinion about this. We consider this a simple method for getting official description information from the hidden services. We will show these descriptions on our page.

Signature example

https://gist.github.com/glamrock/ae79384f6a714f6dcf0a (the pad keeps breaking format)

Public open YaCy back-end for everyone

Let's make our YaCy network open so anyone can join with their own YaCy nodes.

This way we could get real P2P decentralization: ahmia.fi is free software, and the back-end YaCy network should be free for everyone; we will also get volunteer YaCy nodes this way.

Share an installation configuration package that joins a YaCy node to ahmia.fi's nodes.

1 workweek

got the installation error for requirements

When I use pip to install the requirements with this command:
sudo pip install -r requirements.txt
it goes well until the installation of "lxml==3.3.5".

I get the following error when I try to install "lxml==3.3.5" directly with pip.
+++++++++++++++++++++
src/lxml/lxml.etree.c:8:22: fatal error: pyconfig.h: No such file or directory

#include "pyconfig.h"
                     ^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Rolling back uninstall of lxml
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;file='/tmp/pip_build_root/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-njGBar-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lxml
Storing debug log for failure in /home/htor/.pip/pip.log
+++++++++++++++++++++

I think the requirements file may need an update. I tried to install python3-dev, but it did not help; same error.

Child abuse detection and filtering information sharing

Development of a Content Abuse Signaling feature to allow fast handling of abuse reports. I want to implement a Callback API to publish this data to Tor2web nodes in real time. We would also like to get an automated signal from the Tor2web nodes when they ban some site, so that ahmia.fi can also ban that site if necessary. A well-designed and authoritative entity may be useful for providing filtering lists. To this aim, we currently handle a filter list manually; it is already integrated with Tor2web and in use on almost all nodes of the Tor2web network (https://ahmia.fi/policy/, tor2web/Tor2web#25). In collaboration with Tor2web, I want to develop an efficient and automated system to handle and share filtering information in a secure manner. We only share the MD5 sum of the banned domain.
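The MD5-only sharing can be sketched in a few lines. The lowercasing/trimming normalization step here is an assumption; peers would have to hash the same way for the hashes to match:

```python
import hashlib

def banned_entry(onion_domain):
    """Share only the MD5 of the banned onion domain, never the plaintext
    address. Assumes peers normalize (strip + lowercase) identically."""
    normalized = onion_domain.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def is_banned(onion_domain, md5_list):
    """Check a domain against a shared list of MD5 sums."""
    return banned_entry(onion_domain) in md5_list
```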

1 workweek

Gather ALL hidden services

We could gather every hidden service that is serving something and categorize them:

HTTP servers
IRC servers
BitTorrent trackers
etc.

Moreover, we could show the actual connection status of each hidden service. Did the circuit fail? Which ports answered? Does this hidden service even exist?
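Such a port check could go through a local Tor SOCKS proxy using the python-socksipy package already listed in the installation dependencies. The proxy address, timeout, and the v2 address pattern are assumptions in this sketch:

```python
import re

# 16 base32 characters: a v2 hidden service address
ONION_RE = re.compile(r'^[a-z2-7]{16}\.onion$')

def is_onion_address(host):
    """True if host looks like a v2 .onion address."""
    return bool(ONION_RE.match(host))

def check_port(onion, port, proxy_host="127.0.0.1", proxy_port=9050, timeout=30):
    """Try to open a TCP connection to onion:port through the local
    Tor SOCKS proxy. Returns True if the port answered."""
    if not is_onion_address(onion):
        return False
    import socks  # python-socksipy, from the install dependencies
    sock = socks.socksocket()
    sock.setproxy(socks.PROXY_TYPE_SOCKS5, proxy_host, proxy_port)
    sock.settimeout(timeout)
    try:
        sock.connect((onion, port))
        return True
    except Exception:
        return False
    finally:
        sock.close()
```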

Globaleaks integration

Currently, GlobaLeaks informs ahmia.fi to index new hidden services.

GlobaLeaks has a good reputation in the Tor network; ahmia.fi could extend the visibility of GlobaLeaks in the search results.

Together with GlobaLeaks: a RESTful API according to GlobaLeaks' needs and a UI to show information about GlobaLeaks nodes.

1 workweek

Sort search results according to popularity measurements

Sort the search results from the YaCy back-end according to the popularity data gathered by ahmia.fi. Ahmia.fi gathers Tor2web visit statistics, crawls backlinks, and saves the number of clicks in the search results. This information is used to calculate the popularity of an onion address.
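The combination can be sketched as a weighted sum. The weights below are illustrative assumptions, not the values ahmia.fi actually uses:

```python
def popularity_score(tor2web_visits, public_backlinks, result_clicks,
                     w_visits=1.0, w_backlinks=5.0, w_clicks=2.0):
    """Combine the three popularity signals into one sortable score.
    The weights are illustrative only."""
    return (w_visits * tor2web_visits
            + w_backlinks * public_backlinks
            + w_clicks * result_clicks)

def sort_results(results):
    """results: list of (onion, visits, backlinks, clicks) tuples,
    returned most popular first."""
    return sorted(results,
                  key=lambda r: popularity_score(r[1], r[2], r[3]),
                  reverse=True)
```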

Automated visualizations

It is very practical to visualize the statistical data.

Visualizations can be anything!

Ideas:

What are these hidden services? Numbers of web servers, IRC servers, BitTorrent trackers, etc.

Word clouds: we could even cluster hidden services that are close to each other and show their connections.

Backlinking visualization.

Statistics about hidden websites over time

Publish and visualize onion online statistics over time. How long was a certain address working? When was it found, and when did it go offline? What were its h1/h2/keywords/description?

This data is gathered now but not in use.
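With the online tracker's observations in hand, an uptime figure could be computed like this Python 3 sketch (the observation format is an assumption):

```python
from datetime import datetime, timedelta

def uptime_ratio(observations):
    """observations: chronological list of (timestamp, online_bool) samples
    from the online tracker. Returns the fraction of observed time the
    hidden service was up."""
    up = down = timedelta(0)
    for (t0, online), (t1, _) in zip(observations, observations[1:]):
        if online:
            up += t1 - t0
        else:
            down += t1 - t0
    total = up + down
    return up / total if total else 0.0
```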

Hidden service mirror for ahmia.fi

Hidden service mirror for ahmia.fi.

Shared SQL database and YaCy back-end with ahmia.fi.

Physical server in secure and unknown place.

1 workweek

Memory leak in Django WSGI process

Apache WSGI processes are filling the memory

ps aux | grep 'ahmia-w' | awk '{print $6/1024 " MB";}'

0.746094 MB
77.1719 MB
97.75 MB
104.418 MB
89.3242 MB
52.4023 MB

These processes have started to consume memory after the Django-Haystack installation.

It might be a memory leak. It seems very weird that after a day one WSGI process might take a gigabyte of RAM.

The problem is now "solved" by restarting Apache every day.
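A less disruptive stopgap than the daily restart could be mod_wsgi's built-in process recycling. A sketch of the relevant directive (process/thread counts and the request limit are illustrative):

```apache
# In apache2/sites-available/django-ahmia (values are illustrative):
# recycle each daemon process after 10000 requests so a leaking
# process cannot grow without bound
WSGIDaemonProcess ahmia processes=4 threads=16 maximum-requests=10000
```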

Another crawler to search .onion links from the public Internet

Use another crawler to search for .onion pages on the public Internet. Find new .onion domains from different online sources. Ask for help from organizations that crawl the web. This is an excellent case to test open-source crawlers like Heritrix and Apache Nutch, or to use existing search engines.
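Whatever crawler is used, the extraction step reduces to pattern matching. A sketch for v2 addresses (the 16-character base32 pattern is an assumption about the address format):

```python
import re

# 16-character base32 labels are v2 onion addresses
ONION_LINK = re.compile(r'\b([a-z2-7]{16}\.onion)\b')

def extract_onions(text):
    """Return unique .onion domains found in a page,
    in order of first appearance."""
    seen, found = set(), []
    for match in ONION_LINK.findall(text.lower()):
        if match not in seen:
            seen.add(match)
            found.append(match)
    return found
```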

2 workweeks

Backlink checking from the public WWW

Search for backlinks from the public WWW to .onion sites. Test searching for backlinks from certain domains, for instance wikipedia.org. This gives us useful popularity data.

2 workweeks
