
extract-emails's Introduction

Extract Emails


Extract emails and LinkedIn profiles from a given website

Support the project with BTC: bc1q0cxl5j3se0ufhr96h8x0zs8nz4t7h6krrxkd6l

Documentation

Requirements

  • Python >= 3.9

Installation

pip install extract_emails[all]
# or
pip install extract_emails[requests]
# or
pip install extract_emails[selenium]
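
Note: in some shells (zsh in particular) square brackets are glob characters, so the extras form may need quoting, for example:

pip install 'extract_emails[all]'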

Simple Usage

As library

from pathlib import Path

from extract_emails import DefaultFilterAndEmailFactory as Factory
from extract_emails import DefaultWorker
from extract_emails.browsers.requests_browser import RequestsBrowser as Browser
from extract_emails.data_savers import CsvSaver


websites = [
    "website1.com",
    "website2.com",
]

browser = Browser()
data_saver = CsvSaver(save_mode="a", output_path=Path("output.csv"))

for website in websites:
    factory = Factory(
        website_url=website, browser=browser, depth=5, max_links_from_page=1
    )
    worker = DefaultWorker(factory)
    data = worker.get_data()
    data_saver.save(data)
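
To sanity-check a run, the resulting CSV can be read back with the standard library. A minimal sketch, assuming the output.csv written above and the email/page/website columns shown in the CLI example below:

import csv
from pathlib import Path

# Read back the rows written by CsvSaver (columns: email, page, website).
with Path("output.csv").open(newline="") as f:
    for row in csv.DictReader(f):
        print(row["email"], "found on", row["page"])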

As CLI tool

$ extract-emails --help

$ extract-emails --url https://en.wikipedia.org/wiki/Email -of output.csv -d 1
$ cat output.csv
email,page,website
[email protected],https://en.wikipedia.org/wiki/Email,https://en.wikipedia.org/wiki/Email

extract-emails's People

Contributors

chiaminchuang, dmitriiweb, vikramdurai


extract-emails's Issues

Advanced Usage

Need to add descriptions and examples of how to create and use custom components (filters, browsers, factories, etc.)

Long time without response

Hi @dmitriiweb, please can you help identify what's wrong? I tried running the example from the docs, and I pip-installed the latest extract_emails (v5.0.2), but I'm not getting any response or output, so it seems there's an issue somewhere.
Did you run the example on your end as well?

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Hi,

Again, thanks for your work. I am experimenting with it before scaling up. It is great, but I run into an issue, apparently when it doesn't find any emails.

I ran this simple script for testing:

from extract_emails import ExtractEmails


em = ExtractEmails("http://www.formationgrowthhacking.com/", depth=None, print_log=False, ssl_verify=False, user_agent=None, request_delay=0.0)
emails = em.emails

print(emails)

I get these errors:

Traceback (most recent call last):
  File "C:/Users/Nino/PycharmProjects/EmailVerif/github_extract-email/extract_emails/myextrator.py", line 4, in <module>
    em = ExtractEmails("http://www.formationgrowthhacking.com/", depth=None, print_log=False, ssl_verify=False, user_agent=None, request_delay=0.0)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\github_extract-email\extract_emails\extract_emails.py", line 30, in __init__
    self.extract_emails(url)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\github_extract-email\extract_emails\extract_emails.py", line 43, in extract_emails
    self.extract_emails(new_url)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\github_extract-email\extract_emails\extract_emails.py", line 43, in extract_emails
    self.extract_emails(new_url)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\github_extract-email\extract_emails\extract_emails.py", line 43, in extract_emails
    self.extract_emails(new_url)
  [Previous line repeated 30 more times]
  File "C:\Users\Nino\PycharmProjects\EmailVerif\github_extract-email\extract_emails\extract_emails.py", line 36, in extract_emails
    self.get_all_links(r.text)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\github_extract-email\extract_emails\extract_emails.py", line 59, in get_all_links
    tree = html.fromstring(page)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\venv\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\venv\lib\site-packages\lxml\html\__init__.py", line 761, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\etree.pyx", line 3234, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1871, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Could you please help me to fix this issue?
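
For context: the traceback shows the older ExtractEmails code passing r.text (a str) into lxml's html.fromstring, and lxml raises exactly this ValueError when a str still contains an XML encoding declaration. Feeding it the raw bytes instead avoids the problem. A minimal standalone illustration of the lxml behaviour (not a patch to the package):

import requests
from lxml import html

r = requests.get("http://www.formationgrowthhacking.com/")

# Passing r.text (a str) fails if the page starts with an encoding declaration
# such as <?xml version="1.0" encoding="utf-8"?>.
# Passing r.content (bytes) lets lxml resolve the declared encoding itself.
tree = html.fromstring(r.content)
print(tree.findtext(".//title"))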

Setup.py install fail

Hey, I hope you see this, as I'm relying on your tool for an important project of mine.

I've been trying many ways to run "pip install extract_emails", and I even tried running setup.py, but they all give me the same error:

[screenshots of the installation error]

So far I tried with Python 3.7, and then with 3.6 once I realized that's what the requirements specified, but got the same results. I've tried many different solutions but none have worked so far; do you think you could help me out with this?

FileNotFoundError: [Errno 2] No such file or directory:

Hi,

Thanks for your work.

I installed the package and followed your instructions, but I get these errors:

Traceback (most recent call last):
  File "C:/Users/Nino/PycharmProjects/EmailVerif/emailverif.py", line 8, in <module>
    em = ExtractEmails(url, depth=None, print_log=False, ssl_verify=True, user_agent=None, request_delay=0.0)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\venv\lib\site-packages\extract_emails\extract_emails.py", line 31, in __init__
    self.extract_emails(url)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\venv\lib\site-packages\extract_emails\extract_emails.py", line 38, in extract_emails
    self.get_emails(r.text)
  File "C:\Users\Nino\PycharmProjects\EmailVerif\venv\lib\site-packages\extract_emails\extract_emails.py", line 53, in get_emails
    domains = self.get_domains()
  File "C:\Users\Nino\PycharmProjects\EmailVerif\venv\lib\site-packages\extract_emails\extract_emails.py", line 61, in get_domains
    with open(DOMAINS_FAIL, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Nino\\PycharmProjects\\EmailVerif\\venv\\lib\\site-packages\\extract_emails\\top_level_domains.pkl'

Process finished with exit code 1

Could you provide the missing file please?

Kind regards

Extract social media accounts

Hi!

Thanks for this awesome project. I've already used it for the extraction of emails, but can this also be used for the extraction of social media (LinkedIn) accounts?

Thanks!

Error when running code

Hello,
after running pip install extract_emails and trying your sample code I got the error below. What could be the problem? Thanks in advance. Also, pip install extract_emails[all] is not working; why?

Traceback (most recent call last):
  File "/Users/user/My Drive/emailextractfromURL/olu.py", line 3, in <module>
    from extract_emails import DefaultFilterAndEmailFactory as Factory
  File "/Users/user/My Drive/emailextractfromURL/venv/lib/python3.9/site-packages/extract_emails/__init__.py", line 2, in <module>
    from .factories import (
  File "/Users/user/My Drive/emailextractfromURL/venv/lib/python3.9/site-packages/extract_emails/factories/__init__.py", line 1, in <module>
    from .base_factory import BaseFactory
  File "/Users/user/My Drive/emailextractfromURL/venv/lib/python3.9/site-packages/extract_emails/factories/base_factory.py", line 6, in <module>
    from extract_emails.link_filters import LinkFilterBase
  File "/Users/user/My Drive/emailextractfromURL/venv/lib/python3.9/site-packages/extract_emails/link_filters/__init__.py", line 1, in <module>
    from .contact_link_filter import ContactInfoLinkFilter
  File "/Users/user/My Drive/emailextractfromURL/venv/lib/python3.9/site-packages/extract_emails/link_filters/contact_link_filter.py", line 7, in <module>
    class ContactInfoLinkFilter(LinkFilterBase):
  File "/Users/user/My Drive/emailextractfromURL/venv/lib/python3.9/site-packages/extract_emails/link_filters/contact_link_filter.py", line 53, in ContactInfoLinkFilter
    contruct_candidates: list[str] | None = None,
TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'NoneType'
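
The venv path in this traceback points at Python 3.9, and the failing line uses the PEP 604 union syntax list[str] | None, which is evaluated at definition time and is only supported at runtime from Python 3.10 onwards. A minimal, self-contained reproduction of the failure mode (not the package's code):

# On Python 3.9 this raises:
#   TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'NoneType'
# On Python 3.10+ it works. Putting `from __future__ import annotations` at the
# top of the module defers annotation evaluation and also avoids the error on 3.9.
def collect(candidates: list[str] | None = None) -> list[str]:
    return candidates or []

print(collect(["contact", "about"]))

So running the example on a Python 3.10+ interpreter would likely avoid this error.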

Depth scan

How does the depth scan parameter work? I assumed that a value of 1 would mean searching www.example.com and a value of 2 would also cover www.example.com/contactus, but it doesn't seem to work; the log even says URLs = 1. Can you please help? Many thanks.
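
For what it is worth, a common reading of the parameter (not confirmed by the maintainer here) is that depth limits how many link levels the crawler follows from the start page, while max_links_from_page caps how many links are taken from each visited page. A sketch reusing only the factory call shown in the README example above, with the assumed meanings in comments:

from extract_emails import DefaultFilterAndEmailFactory as Factory
from extract_emails import DefaultWorker
from extract_emails.browsers.requests_browser import RequestsBrowser as Browser

browser = Browser()
factory = Factory(
    website_url="https://www.example.com",
    browser=browser,
    depth=2,                 # assumed: follow links up to two levels below the start page
    max_links_from_page=10,  # assumed: take at most 10 links from each visited page
)
data = DefaultWorker(factory).get_data()
print(data)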

Quick Start

Need to add more usage examples to the Quick Start part.
