
domains's People

Contributors

cesnokov, tb0hdan


domains's Issues

Error when downloading

Hi, I'm getting an error message when cloning the repo:

error: 472 bytes of body are still expected
3.29 MiB | 2.00 KiB/s
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

And if I download it as a ZIP file instead, the archive is corrupt when I try to open it.

Thanks!
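A workaround that often helps with truncated clones of large repositories (a general git tip, not something confirmed by the maintainers here) is a shallow clone, which fetches only the latest revision:

git clone --depth 1 https://github.com/tb0hdan/domains.git

The URL assumes the canonical repository is tb0hdan/domains; substitute the actual clone URL if it differs.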

Crawler causing SYN floods

Yesterday at 18:17 CEST we observed a SYN flood caused by this project's crawler. Please implement rate limiting.

Missing domains

Hi,

Thanks for this nice project. I found that some domains are missing from your dataset, such as "aaaa.com" or "azjj.com". Is there a reason for this?

Thanks.

Error while cloning the project: repository is over its data quota

Cloning into 'domains'...
remote: Enumerating objects: 170161, done.
remote: Counting objects: 100% (17313/17313), done.
remote: Compressing objects: 100% (17245/17245), done.
remote: Total 170161 (delta 75), reused 17304 (delta 67), pack-reused 152848
Receiving objects: 100% (170161/170161), 1.71 GiB | 16.16 MiB/s, done.
Resolving deltas: 100% (1105/1105), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to '/home/system/new/domains/.git/lfs/logs/20240202T235755.763457986.log'.
Use git lfs logs last to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: data/afghanistan/domain2multi-af00.txt.xz: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
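A standard git-lfs workaround for this quota error (a general tip, not a maintainer-endorsed fix) is to skip the LFS smudge step during clone, so that only the small pointer files are checked out:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/tb0hdan/domains.git

The clone URL assumes tb0hdan/domains is the canonical repository. The underlying archives can also be fetched from dataset.domainsproject.org, subject to the access issues reported below.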

ICANN CZDS to seed domains

Would the ICANN CZDS API (https://czds.icann.org/) be a useful mechanism for you to pull a list of all domains across the web more efficiently? It lets you apply for access to the zone data of each TLD; once a TLD approves your request, you can quickly download its zone file via their API. This approach wouldn't be effective for finding subdomains, so further work would be needed there, but perhaps it would be a good way to seed your scan?
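For illustration, once a TLD operator approves access, its zone file can be fetched with a single authenticated request. This is a sketch based on ICANN's public CZDS API documentation (verify the endpoint against the current docs; $CZDS_TOKEN stands in for a real access token, and the response is a gzipped zone file):

curl -H "Authorization: Bearer $CZDS_TOKEN" -o com.zone.gz https://czds-api.icann.org/czds/downloads/com.zone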

I was also wondering if you track A records or similar for each domain. It looks like the dataset is just domains, so I suspect the answer is no, but I thought I would ask in case I'm missing it somewhere, as you mention using DNS checks in the README.

Extra domain

While looking into where dunctebot.link was getting requests from on the internet, I stumbled upon this cool dataset. Since I'm getting rid of that domain soon, here's the domain that will replace it: duncte.bot

Raw data page returns 403

I'm trying to download the data from the website dataset.domainsproject.org, but I'm getting an HTTP 403.

# wget -m https://dataset.domainsproject.org
--2021-05-20 16:14:26--  https://dataset.domainsproject.org/
Resolving dataset.domainsproject.org (dataset.domainsproject.org)... 104.26.14.47, 104.26.15.47, 172.67.72.229, ...
Connecting to dataset.domainsproject.org (dataset.domainsproject.org)|104.26.14.47|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-05-20 16:14:26 ERROR 403: Forbidden.

Issue with your Patreon

Hello, I'm trying to subscribe to your Patreon, and every time I do, the account gets disabled. I wanted to know if you have any contact method like Telegram or XMPP so we can discuss the payment.

Add example of what data looks like

I'm interested in this data, but the download and space commitments look large, so it would be great if the README included some examples of the records I can expect to see.
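Until the README gains examples, a quick way to peek at the records (assuming the .xz archives are plain-text lists with one domain per line, as the .txt.xz names suggest):

xz -dc data/afghanistan/domain2multi-af00.txt.xz | head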

Wget returns 403 forbidden

(same issue as #5)

Hey, I'm including one of the datasets in my script. I'm using Python's wget package (which sends the standard wget user agent), and Cloudflare gives me a 403 error.

urllib.error.HTTPError: HTTP Error 403: Forbidden
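A common workaround (not an official fix, and whether the site permits it is up to its operators) is to send a browser-like User-Agent so Cloudflare's default bot rules don't match:

wget --user-agent="Mozilla/5.0" -m https://dataset.domainsproject.org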

Can't be cloned due to error

git clone is not working on this repo and fails with the following error:

Cloning into 'domains'...
remote: Enumerating objects: 170140, done.
remote: Counting objects: 100% (17292/17292), done.
remote: Compressing objects: 100% (17265/17265), done.
remote: Total 170140 (delta 61), reused 17250 (delta 26), pack-reused 152848
Receiving objects: 100% (170140/170140), 1.71 GiB | 6.87 MiB/s, done.
Resolving deltas: 100% (1091/1091), done.
Downloading data/afghanistan/domain2multi-af00.txt.xz (79 KB)
Error downloading object: data/afghanistan/domain2multi-af00.txt.xz (4b3903e): Smudge error: Error downloading data/afghanistan/domain2multi-af00.txt.xz (4b3903eef1e6e05f9f526e7eaa667ce0528d149dc12169299de8e3868601e4de): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Anyone else having the same issue?
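If the clone is redone with GIT_LFS_SKIP_SMUDGE=1 (see the earlier quota issue), individual files can later be fetched selectively with git-lfs's include filter, for example:

git lfs pull --include="data/afghanistan/*"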

Domain dataset source: ICANN Centralized Zone Data Service (CZDS)

ICANN (Internet Corporation for Assigned Names and Numbers) provides a Centralized Zone Data Service (CZDS) where you can request free access to the Zone Files provided by participating generic Top-Level Domains (gTLDs).

Currently there are over 1100 generic TLDs (including the big ones like com, net, org) available for download (terms-of-service limit: maximum of one download per 24 hours per zonefile).

You can then extract domains from the zone files.

You will need to provide a request "reason" to each zone file registry.

In my experience, most (>98%) of them expect the following information (a handful will reject your access request if any of it is omitted):

  • Name
  • Email
  • IP Address
  • Physical Address (Building, Street, Postcode etc.)
  • Phone Number
  • Reason (what you intend to do with the zonefiles)

It is also possible to automatically request zonefile access rights and download zonefiles via the ICANN CZDS API instead of from the website user interface.
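A sketch of that automated flow, using the endpoints from ICANN's public CZDS API documentation (verify against the current docs; the credentials below are placeholders):

# authenticate and obtain an access token
curl -s -X POST https://account-api.icann.org/api/authenticate -H "Content-Type: application/json" -d '{"username": "you@example.com", "password": "secret"}'

# list the zone file download links your account is approved for
curl -s -H "Authorization: Bearer $TOKEN" https://czds-api.icann.org/czds/downloads/links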

Do not download large files such as FLAC and MP3

Downloading these places a significant load on servers, and most are not going to contain URL metadata of use to the project.

This is probably true of image files too.

At the very least, please document a suitable robots.txt User-agent name so site operators can stop the tool from scraping inappropriate sites/subtrees.

Rgds

Damon
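For reference, once the crawler's User-agent token is documented, a site could opt out with a robots.txt entry like the following (domainsproject-bot is a hypothetical name, since no token has been published):

# hypothetical token; replace with the crawler's documented User-agent
User-agent: domainsproject-bot
Disallow: /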

If you are only collecting domains, why is your crawler scraping pages?

  1. It doesn't appear that your crawler checks for a robots.txt before crawling a site. This is BAD practice if you want to be a 'legitimate' bot. Some sites do NOT want to be crawled by random bots, or there are pages you shouldn't be crawling (for various reasons)...

  2. You appear to have ZERO rate limiting for your crawl speed. That is ABUSE... I count 40 PAGES/sec request rate (and that doesn't include other content)...
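Until rate limiting is implemented, affected sites can try the non-standard Crawl-delay directive in robots.txt, which some crawlers honor (it only helps if this crawler parses it):

User-agent: *
Crawl-delay: 10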
