
ahmia's Introduction

Ahmia - Tor Hidden Service Search

https://ahmia.fi/

New repository!

https://github.com/ahmia/search

Compatibility

Ahmia requires Python 2.7+ and Django 1.6+

The crawler is called Onionbot and it requires Apache Solr for the data.

Installation

  • Currently, ahmia listens to Solr at http://127.0.0.1:33433/
  • An HTTP server is required
  • Please see /apache2/ to set it up to run with the Apache HTTP server
  • Note the crontabs: the order of the tasks is important
  • See the crontabs file. It is strongly recommended to run the cron tasks on a server other than the web server itself
Install dependencies:
$ apt-get install libxml2-dev libxslt-dev python-dev
$ apt-get install libpq-dev
$ apt-get install python-socksipy python-psycopg2 libapache2-mod-wsgi
$ apt-get install libffi-dev
$ pip install -r requirements.txt
Furthermore, you will need to set the execution rights for the tools:
$ chmod -R ugo+rx /usr/local/lib/ahmia/tools/
And to Apache:
$ chown -R www-data:www-data /usr/local/lib/ahmia/
$ chmod -R u=rwX,g=rX,o=rX /usr/local/lib/ahmia/
Move the Apache settings into place and adjust WSGI processes=X threads=Y.

The upper limit on the memory Apache needs is X*Y*8 MB. For instance, 4*16*8 MB = 512 MB.

$ cp apache2/sites-available/django-ahmia /etc/apache2/sites-available/django-ahmia
$ /etc/init.d/apache2 restart
And after creating the SQLite database:
$ chown www-data:www-data /usr/local/lib/ahmia
$ chown www-data:www-data /usr/local/lib/ahmia/ahmia_db
Not required, but recommended for better system performance:
  • Install haveged - A simple entropy daemon
  • Edit the process and threads parameters of the WSGIDaemonProcess in apache2/sites-available/django-ahmia
  • Use PostgreSQL
  • Install PgBouncer: a lightweight connection pooler for PostgreSQL

Features

  • Search engine for Tor hidden services.
  • Privacy: ahmia saves no IP logs.
  • Filtering of child abuse content.
  • Popularity tracking from Tor2web nodes, public WWW backlinks and the number of clicks in the search results.
  • Hidden service online tracker.

Demo

You can try the demo by cloning this repository and running the test server with provided data:

$ python manage.py syncdb
$ python manage.py loaddata ahmia/fixtures/initial_data.json
$ python manage.py runserver

Then open your browser to http://localhost:8000

Tests

Unittests:

$ python manage.py test ahmia/tests/

For developers

Please, at least, validate your Python code with:

$ pylint --rcfile=pylint.rc ./ahmia/python_code_file.py

and fix the major problems.

ahmia's People

Contributors

bsloan, copiesofcopies, juhanurmi, wtf


ahmia's Issues

Publish statistical data

As a result of indexing the Tor network's content, ahmia.fi can produce authoritative and exact quantitative research data about what is published through the Tor network.

Share information about each site found: Server type, how long it has been online/offline, when it was crawled, popularity and backlinks, keywords, language etc.

RESTful JSON API that provides the data.
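A sketch of what one record of such a JSON API could look like. The field names here are illustrative assumptions, not an existing ahmia.fi API:

```python
import json

def site_record(onion, server_type, first_seen, last_seen,
                online, backlinks, clicks, language):
    """Build one hypothetical API record describing a crawled hidden service."""
    return {
        "address": onion,
        "serverType": server_type,
        "firstSeen": first_seen,
        "lastSeen": last_seen,
        "online": online,
        "popularity": {"backlinks": backlinks, "clicks": clicks},
        "language": language,
    }

record = site_record("3g2upl4pq6kufc4m.onion", "nginx",
                     "2014-01-01", "2014-06-01", True, 12, 340, "en")
print(json.dumps(record, indent=2))
```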

Show cached text versions of the pages

There used to be cached text versions of the pages, but I had to remove them.

The problem is non-trivial: there are a lot of ways to inject pictures and harmful JavaScript to the text cache.

When I found that someone even injected images using only the URL schema (data:[][;charset=][;base64],), I had to take down the text cache.

Solutions?
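One possible direction, sketched with the Python 3 standard library only: strip data: URIs and escape all markup before storing the cached text, so nothing renders as an image or script. This is an illustration of the idea, not a complete defense:

```python
import html
import re

# drop anything that looks like a data: URI payload (up to and including the comma)
DATA_URI = re.compile(r'data:[^,]*,', re.IGNORECASE)

def sanitize_cached_text(raw):
    """Remove data: URIs, then escape all remaining markup characters."""
    text = DATA_URI.sub('', raw)
    return html.escape(text, quote=True)
```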

2 workweeks

Install database

Hi there,

First of all, thanks for this great project. I've been searching for something like this for quite a while now, and it seems Ahmia fits my needs almost perfectly.

I'm trying to set up Ahmia. Unfortunately, I find the instructions pretty hard to follow; it seems some parts are missing here and there.

The readme.md says "And after creating the SQLite database". However, it doesn't say how the database should be created or what the layout needs to be. I would also like to set this up with PostgreSQL instead of SQLite, but how would I go about doing this?

If I could help out in any way with this project, I would love to do so. I have minimal coding skills but tons of system/network and security administration skills.

Cheers, Ronald.

dependency for postgresql

When installing psycopg2 from the requirements, you may get this error:
++++++++++++++
Downloading psycopg2-2.5.3.tar.gz (690kB): 690kB downloaded
Running setup.py (path:/tmp/pip_build_root/psycopg2/setup.py) egg_info for package psycopg2

Error: pg_config executable not found.

Please add the directory containing pg_config to the PATH
or specify the full executable path with the option:

    python setup.py build_ext --pg-config /path/to/pg_config build ...

or with the pg_config option in 'setup.cfg'.
Complete output from command python setup.py egg_info:
running egg_info
creating pip-egg-info/psycopg2.egg-info
writing pip-egg-info/psycopg2.egg-info/PKG-INFO
writing top-level names to pip-egg-info/psycopg2.egg-info/top_level.txt
writing dependency_links to pip-egg-info/psycopg2.egg-info/dependency_links.txt
writing manifest file 'pip-egg-info/psycopg2.egg-info/SOURCES.txt'
warning: manifest_maker: standard file '-c' not found

Error: pg_config executable not found.

Please add the directory containing pg_config to the PATH
or specify the full executable path with the option:

python setup.py build_ext --pg-config /path/to/pg_config build ...

or with the pg_config option in 'setup.cfg'.

Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/psycopg2
Storing debug log for failure in /home/htor/.pip/pip.log
++++++++++++++

To resolve this, install the following on Ubuntu:
$ sudo apt-get install libpq-dev

(dependency for postgresql)

Better edited HS descriptions

Design and development of a more useful and complete UI, including more complete and exhaustive descriptions and details (e.g., show the whole history of descriptions and let users edit it more easily).

Requires security conscious design.

Show site popularity.

Commenting features.

Authenticated hidden service information: what does the service say about itself?

Expose some of the popularity/backlink information to users, in case it lets them pick results more safely.

2 workweeks

tor2web

The tor2web.fi proxy is dead, so it would be nice to remove the redirection link from your results.

Tweaking Apache Solr to get better search results

Apache Solr is a popular, open source enterprise search platform. Its major features include powerful full-text search, hit highlighting, faceted search, and near real-time indexing.

The schema.xml[1] file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.

Is anyone familiar with Solr who would know how to tweak it for full-text search? I would like to see it work even better.

How about using the popularity data in an efficient way?

[1] https://github.com/juhanurmi/ahmia/blob/master/solr/schema.xml

Popularity tracking

Popularity tracking (catch user clicks too). Development of a popularity-tracking feature for ahmia.fi. Show more relevant search results using the popularity data. A TOP 10 sites list?

2 workweeks

Adding gpg-signed onion address / gpg key to onion sites

Description Linked Data Proposal to WWW Servers Inside Tor
This proposal is to all HTTP servers inside Tor network.
The problem: we are wondering how we could get authenticated hidden service descriptions, that is, descriptions of hidden services provided by the hidden service operators themselves. We would like to show these official descriptions in the ahmia.fi search.
The solution proposal: Simple linked description datasheets provided by the hidden services.
How to do this: simply provide a linked description file on your hidden service web page. You can write your information into this form and use the generated JSON file. Please make sure that your description is valid JSON. You could then serve the JSON information about your hidden service at http://something.onion/description.json, and anyone could find this file.
Example JSON (DuckDuckGo):

{
  "title": "DuckDuckGo",
  "description": "DuckDuckGo is a search engine that is based in Valley Forge, Pennsylvania and uses information from crowd-sourced sites (like Wikipedia) with the aim of augmenting traditional results and improving relevance. The search engine philosophy emphasizes privacy and does not record user information.",
  "domains": ["http://duckduckgo.com/", "http://3g2upl4pq6kufc4m.onion/"],
  "keywords": ["search engine", "privacy", "no tracking", "DuckDuckGo"],
  "type": "search engine",
  "language": "en",
  "contactInformation": "http://help.duckduckgo.com/customer/portal/emails/new",
  "keyFP": "879B DA5B F6B2 7B61 2745 0A25 03CF 4A0A B3C7 9A63",
  "gpg_signature": "BEGIN PGP SIGNATURE...",
  "gpg_asc": "http://duckduckgo.com/sig.asc, http://xxxxx.onion/sig.asc"
}

(The gpg_signature field is optional.)

You can download and study the example JSON file.
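A minimal validator for such a file could look like the sketch below. The required field set is an assumption based on the example above, not a fixed specification:

```python
import json

# fields the proposal's example uses (keyFP/gpg_* treated as optional)
REQUIRED = ("title", "description", "domains", "keywords",
            "type", "language", "contactInformation")

def validate_description(text):
    """Parse a description.json document and check the fields the
    proposal asks for. Returns (ok, list_of_problems)."""
    try:
        doc = json.loads(text)
    except ValueError as err:
        return False, ["not valid JSON: %s" % err]
    problems = [field for field in REQUIRED if field not in doc]
    if not any(".onion" in d for d in doc.get("domains", [])):
        problems.append("no .onion domain listed")
    return not problems, problems
```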
Help us:
Please tell us your opinion about this. We consider this a simple method for getting official description information from the hidden services. We will show these descriptions on our page.

Signature example

https://gist.github.com/glamrock/ae79384f6a714f6dcf0a (the pad keeps breaking format)

Public open YaCy back-end for everyone

Let's make our YaCy network open so anyone can join with their own YaCy nodes.

This way we could get real P2P decentralization: ahmia.fi is free software, and the back-end YaCy network should be free for everyone; we will also get volunteer YaCy nodes this way.

Share an installation configuration package that joins a YaCy node to ahmia.fi's nodes.

1 workweek

got the installation error for requirements

When I use pip to install the requirements with this command:
sudo pip install -r requirements.txt
it goes well until the installation of "lxml==3.3.5".

I get the following error when I try to install "lxml==3.3.5" directly with pip.
+++++++++++++++++++++
src/lxml/lxml.etree.c:8:22: fatal error: pyconfig.h: No such file or directory

#include "pyconfig.h"
                     ^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Rolling back uninstall of lxml
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;file='/tmp/pip_build_root/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-njGBar-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lxml
Storing debug log for failure in /home/htor/.pip/pip.log
+++++++++++++++++++++

I think the requirements file may need an update. I tried to install python3-dev, but it did not help; same error.

Child abuse detection and filtering information sharing

Development of a Content Abuse Signaling feature to allow fast handling of abuse reports. I want to implement a Callback API to publish this data to Tor2web nodes in real time. We would also like to get an automated signal from the Tor2web nodes when they ban some site, so that ahmia.fi can also ban that site if necessary. A well-designed and authoritative entity may be useful for providing filtering lists. To this aim, we currently handle a filter list manually; it is already integrated with Tor2web and in use on almost all nodes of the Tor2web network (https://ahmia.fi/policy/, tor2web/Tor2web#25). In collaboration with Tor2web, I want to develop an efficient and automated system to handle and share filtering information in a secure manner. We only share the MD5 sum of the banned domain.
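The MD5-only sharing can be sketched in a few lines. The lowercasing/trimming normalization step here is an assumption; peers would have to hash the same way for the hashes to match:

```python
import hashlib

def banned_entry(onion_domain):
    """Share only the MD5 of the banned onion domain, never the plaintext
    address. Assumes peers normalize (strip + lowercase) identically."""
    normalized = onion_domain.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def is_banned(onion_domain, md5_list):
    """Check a domain against a shared list of MD5 sums."""
    return banned_entry(onion_domain) in md5_list
```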

1 workweek

Gather ALL hidden services

We could gather every hidden service that is serving something and categorize them:

HTTP servers
IRC servers
BitTorrent trackers
etc.

Moreover, we could show the actual connection status of each hidden service. Did the circuit fail? Which ports answered? Does this hidden service even exist?
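Such a port check could go through a local Tor SOCKS proxy using the python-socksipy package already listed in the installation dependencies. The proxy address, timeout, and the v2 address pattern are assumptions in this sketch:

```python
import re

# 16 base32 characters: a v2 hidden service address
ONION_RE = re.compile(r'^[a-z2-7]{16}\.onion$')

def is_onion_address(host):
    """True if host looks like a v2 .onion address."""
    return bool(ONION_RE.match(host))

def check_port(onion, port, proxy_host="127.0.0.1", proxy_port=9050, timeout=30):
    """Try to open a TCP connection to onion:port through the local
    Tor SOCKS proxy. Returns True if the port answered."""
    if not is_onion_address(onion):
        return False
    import socks  # python-socksipy, from the install dependencies
    sock = socks.socksocket()
    sock.setproxy(socks.PROXY_TYPE_SOCKS5, proxy_host, proxy_port)
    sock.settimeout(timeout)
    try:
        sock.connect((onion, port))
        return True
    except Exception:
        return False
    finally:
        sock.close()
```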

Globaleaks integration

Currently, GlobaLeaks informs ahmia.fi to index new hidden services.

GlobaLeaks has a good reputation in the Tor network; ahmia.fi could extend the visibility of GlobaLeaks in the search results.

Together with GlobaLeaks: a RESTful API according to GlobaLeaks' needs and a UI to show information about GlobaLeaks nodes.

1 workweek

Sort search results according to popularity measurements

Sort the search results from the YaCy back-end according to the popularity data gathered by ahmia.fi. Ahmia.fi gathers Tor2web visit statistics, crawls backlinks, and saves the number of clicks in the search results. This information is used to calculate the popularity of an onion address.
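The combination can be sketched as a weighted sum. The weights below are illustrative assumptions, not the values ahmia.fi actually uses:

```python
def popularity_score(tor2web_visits, public_backlinks, result_clicks,
                     w_visits=1.0, w_backlinks=5.0, w_clicks=2.0):
    """Combine the three popularity signals into one sortable score.
    The weights are illustrative only."""
    return (w_visits * tor2web_visits
            + w_backlinks * public_backlinks
            + w_clicks * result_clicks)

def sort_results(results):
    """results: list of (onion, visits, backlinks, clicks) tuples,
    returned most popular first."""
    return sorted(results,
                  key=lambda r: popularity_score(r[1], r[2], r[3]),
                  reverse=True)
```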

Automated visualizations

It is very practical to visualize the statistical data.

Visualizations can be anything!

Ideas:

What are these hidden services? Numbers of web servers, IRC servers, BitTorrent trackers, etc.

Word clouds: we could even cluster hidden services that are close to each other and show their connections.

Backlinking visualization.

Statistics about hidden websites over time

Publish and visualize onion online statistics over time. How long was a certain address working? When was it found, and when did it go offline? What were its h1/h2/keywords/description?

This data is gathered now but not in use.
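With the online tracker's observations in hand, an uptime figure could be computed like this Python 3 sketch (the observation format is an assumption):

```python
from datetime import datetime, timedelta

def uptime_ratio(observations):
    """observations: chronological list of (timestamp, online_bool) samples
    from the online tracker. Returns the fraction of observed time the
    hidden service was up."""
    up = down = timedelta(0)
    for (t0, online), (t1, _) in zip(observations, observations[1:]):
        if online:
            up += t1 - t0
        else:
            down += t1 - t0
    total = up + down
    return up / total if total else 0.0
```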

Hidden service mirror for ahmia.fi

Hidden service mirror for ahmia.fi.

Shared SQL database and YaCy back-end with ahmia.fi.

Physical server in secure and unknown place.

1 workweek

Memory leak in Django WSGI process

Apache WSGI processes are filling the memory

ps aux | grep 'ahmia-w' | awk '{print $6/1024 " MB";}'

0.746094 MB
77.1719 MB
97.75 MB
104.418 MB
89.3242 MB
52.4023 MB

These processes have started to consume memory after the Django-Haystack installation.

It might be a memory leak. It seems very weird that after a day one WSGI process might take a gigabyte of RAM.

The problem is now "solved" by restarting Apache every day.
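A less disruptive stopgap than the daily restart could be mod_wsgi's built-in process recycling. A sketch of the relevant directive (process/thread counts and the request limit are illustrative):

```apache
# In apache2/sites-available/django-ahmia (values are illustrative):
# recycle each daemon process after 10000 requests so a leaking
# process cannot grow without bound
WSGIDaemonProcess ahmia processes=4 threads=16 maximum-requests=10000
```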

Another crawler to search .onion links from the public Internet

Use another crawler to search for .onion pages on the public Internet. Find new .onion domains from different online sources. Ask for help from organizations that crawl the web. This is an excellent case to test open-source crawlers like Heritrix and Apache Nutch, or to use existing search engines.
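Whatever crawler is used, the extraction step reduces to pattern matching. A sketch for v2 addresses (the 16-character base32 pattern is an assumption about the address format):

```python
import re

# 16-character base32 labels are v2 onion addresses
ONION_LINK = re.compile(r'\b([a-z2-7]{16}\.onion)\b')

def extract_onions(text):
    """Return unique .onion domains found in a page,
    in order of first appearance."""
    seen, found = set(), []
    for match in ONION_LINK.findall(text.lower()):
        if match not in seen:
            seen.add(match)
            found.append(match)
    return found
```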

2 workweeks

Backlink checking from the public WWW

Search for backlinks from the public WWW to .onion sites. Test searching for backlinks from certain domains, for instance wikipedia.org. This gives us useful popularity data.

2 workweeks
