
domains's People

Contributors

cesnokov, tb0hdan


domains's Issues

Error when downloading

Hi, I'm getting an error message when cloning the repo:

error: 472 bytes of body are still expected
3.29 MiB | 2.00 KiB/s
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

And if I download it as a ZIP file instead, the archive is corrupt when I try to open it.

Thanks!
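A workaround that often helps with truncated clones of large repositories (a general git tip, not something confirmed by the maintainers here) is a shallow clone, which fetches only the latest revision:

git clone --depth 1 https://github.com/tb0hdan/domains.git

The URL assumes the canonical repository is tb0hdan/domains; substitute the actual clone URL if it differs.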

Crawler causing SYN floods

Yesterday at 18:17 CEST we observed a SYN flood caused by this project's crawler. Please implement rate limiting.

Missing domains

Hi,

Thanks for this nice project. I found that some domains are missing from your dataset, such as "aaaa.com" or "azjj.com". Is there a reason for this?

Thanks.

Error while cloning the project: repository is over its data quota

Cloning into 'domains'...
remote: Enumerating objects: 170161, done.
remote: Counting objects: 100% (17313/17313), done.
remote: Compressing objects: 100% (17245/17245), done.
remote: Total 170161 (delta 75), reused 17304 (delta 67), pack-reused 152848
Receiving objects: 100% (170161/170161), 1.71 GiB | 16.16 MiB/s, done.
Resolving deltas: 100% (1105/1105), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to '/home/system/new/domains/.git/lfs/logs/20240202T235755.763457986.log'.
Use git lfs logs last to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: data/afghanistan/domain2multi-af00.txt.xz: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
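A standard git-lfs workaround for this quota error (a general tip, not a maintainer-endorsed fix) is to skip the LFS smudge step during clone, so that only the small pointer files are checked out:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/tb0hdan/domains.git

The clone URL assumes tb0hdan/domains is the canonical repository. The underlying archives can also be fetched from dataset.domainsproject.org, subject to the access issues reported below.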

ICANN CZDS to seed domains

Would the ICANN CZDS API (https://czds.icann.org/) be a useful mechanism for you to pull a list of all domains across the web more efficiently? It lets you apply for access to the zone data of each TLD; once a TLD approves your request, you can quickly download its zone file via their API. This approach wouldn't be effective for finding subdomains, so further work would be needed there, but perhaps it would be a good way to seed your scan?
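For illustration, once a TLD operator approves access, its zone file can be fetched with a single authenticated request. This is a sketch based on ICANN's public CZDS API documentation (verify the endpoint against the current docs; $CZDS_TOKEN stands in for a real access token, and the response is a gzipped zone file):

curl -H "Authorization: Bearer $CZDS_TOKEN" -o com.zone.gz https://czds-api.icann.org/czds/downloads/com.zone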

I was also wondering if you track A records or similar for each domain. It looks like the dataset is just domains, so I suspect the answer is no, but I thought I would ask in case I'm missing it somewhere, as you mention using DNS checks in the README.

Extra domain

While looking into where dunctebot.link was getting requests from on the internet, I stumbled upon this cool dataset. Since I'm getting rid of that domain soon, here's the domain that will replace it: duncte.bot

Raw data page returns 403

I'm trying to download the data from the website dataset.domainsproject.org, but I'm getting an HTTP 403.

# wget -m https://dataset.domainsproject.org
--2021-05-20 16:14:26--  https://dataset.domainsproject.org/
Resolving dataset.domainsproject.org (dataset.domainsproject.org)... 104.26.14.47, 104.26.15.47, 172.67.72.229, ...
Connecting to dataset.domainsproject.org (dataset.domainsproject.org)|104.26.14.47|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-05-20 16:14:26 ERROR 403: Forbidden.

Issue with your Patreon

Hello, I'm trying to subscribe to your Patreon, and every time I do, the account gets disabled. I wanted to know if you have any contact method like Telegram or XMPP so we can discuss the payment.

Add example of what data looks like

I'm interested in this data, but the download and space commitments look large, so it would be great if the README included some examples of the records I can expect to see.
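Until the README gains examples, a quick way to peek at the records (assuming the .xz archives are plain-text lists with one domain per line, as the .txt.xz names suggest):

xz -dc data/afghanistan/domain2multi-af00.txt.xz | head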

Wget returns 403 forbidden

(same issue as #5)

Hey, I'm including one of the datasets in my script. I'm using Python's wget package (which sends the standard wget user agent), and Cloudflare gives me a 403 error.

urllib.error.HTTPError: HTTP Error 403: Forbidden
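A common workaround (not an official fix, and whether the site permits it is up to its operators) is to send a browser-like User-Agent so Cloudflare's default bot rules don't match:

wget --user-agent="Mozilla/5.0" -m https://dataset.domainsproject.org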

Can't be cloned due to error

git clone is not working on this repo and fails with the following error:

Cloning into 'domains'...
remote: Enumerating objects: 170140, done.
remote: Counting objects: 100% (17292/17292), done.
remote: Compressing objects: 100% (17265/17265), done.
remote: Total 170140 (delta 61), reused 17250 (delta 26), pack-reused 152848
Receiving objects: 100% (170140/170140), 1.71 GiB | 6.87 MiB/s, done.
Resolving deltas: 100% (1091/1091), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Anyone else having the same issue?
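If the clone is redone with GIT_LFS_SKIP_SMUDGE=1 (see the earlier quota issue), individual files can later be fetched selectively with git-lfs's include filter, for example:

git lfs pull --include="data/afghanistan/*"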

Domain dataset source: ICANN Centralized Zone Data Service (CZDS)

ICANN (Internet Corporation for Assigned Names and Numbers) provides a Centralized Zone Data Service (CZDS) where you can request free access to the Zone Files provided by participating generic Top-Level Domains (gTLDs).

Currently there are over 1100 generic TLDs (including the big ones like com, net, org) available for download (terms-of-service limit: maximum of one download per 24 hours per zonefile).

You can then extract domains from the zone files.

You will need to provide a request "reason" to each zone file registry.

In my experience, most (>98%) of them expect the following information (a handful will reject your access request if any of it is omitted):

  • Name
  • Email
  • IP Address
  • Physical Address (Building, Street, Postcode etc.)
  • Phone Number
  • Reason (what you intend to do with the zonefiles)

It is also possible to automatically request zonefile access rights and download zonefiles via the ICANN CZDS API instead of from the website user interface.
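A sketch of that automated flow, using the endpoints from ICANN's public CZDS API documentation (verify against the current docs; the credentials below are placeholders):

# authenticate and obtain an access token
curl -s -X POST https://account-api.icann.org/api/authenticate -H "Content-Type: application/json" -d '{"username": "you@example.com", "password": "secret"}'

# list the zone file download links your account is approved for
curl -s -H "Authorization: Bearer $TOKEN" https://czds-api.icann.org/czds/downloads/links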

Do not download large files such as FLAC and MP3

Downloading these places a significant load on servers, and most are not going to contain URL metadata of use to the project.

This is probably true of image files too.

At the very least, please document a suitable robots.txt User-agent name so site operators can stop the tool from scraping inappropriate sites/subtrees.

Rgds

Damon
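For reference, once the crawler's User-agent token is documented, a site could opt out with a robots.txt entry like the following (domainsproject-bot is a hypothetical name, since no token has been published):

# hypothetical token; replace with the crawler's documented User-agent
User-agent: domainsproject-bot
Disallow: /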

If you are only collecting domains, why is your crawler scraping pages?

  1. It doesn't appear that your crawler checks for a robots.txt before crawling a site. This is BAD practice if you want to be a 'legitimate' bot. Some sites do NOT want to be crawled by random bots, or there are pages you shouldn't be crawling (for various reasons)...

  2. You appear to have ZERO rate limiting for your crawl speed. That is ABUSE... I count 40 PAGES/sec request rate (and that doesn't include other content)...
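Until rate limiting is implemented, affected sites can try the non-standard Crawl-delay directive in robots.txt, which some crawlers honor (it only helps if this crawler parses it):

User-agent: *
Crawl-delay: 10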
