spamscope / mail-parser

Tokenizer for raw mails

Home Page: https://pypi.python.org/pypi/mail-parser

License: Apache License 2.0

Python 96.96% Dockerfile 0.51% Makefile 2.53%

mail-analyzer mail mail-parser python python3 outlook docker-image docker mailparser security

mail-parser's Introduction

mail-parser

mail-parser is more than a wrapper around the email module of the Python Standard Library. It gives you an easy way to go from a raw mail to a Python object that you can use in your code. It is the key module of SpamScope.

mail-parser can also parse the Outlook email format (.msg). To use this feature, you need to install the libemail-outlook-message-perl package. For Debian-based systems:

$ apt-get install libemail-outlook-message-perl

For more details:

$ apt-cache show libemail-outlook-message-perl

mail-parser supports Python 3.

Apache 2 Open Source License

mail-parser can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.

Support the project

Dogecoin: DAUbDUttkf8WN1kwP9YYQQKyEJYY2WWtEG

mail-parser on Web

Description

mail-parser takes a raw email as input and generates a parsed object. The properties of this object have the same names as the RFC headers:

bcc
cc
date
delivered_to
from_ (not from, because from is a Python keyword)
message_id
received
reply_to
subject
to

There are other properties to get:

body
body html
body plain
headers
attachments
sender IP address
to domains
timezone

The attachments property is a list of objects. Every object has the following keys:

binary: true if the attachment is binary
charset
content_transfer_encoding
content-disposition
content-id
filename
mail_content_type
payload: attachment payload in base64

To get a custom header, replace "-" with "_" in its name. Example for the header X-MSMail-Priority:

$ mail.X_MSMail_Priority

The received header is parsed and split into hops. The supported fields are:

by
date
date_utc
delay (between two hops)
envelope_from
envelope_sender
for
from
hop
with

mail-parser can detect defects in a mail:

defects: parts of the mail that do not comply with the RFC

Every property also has a JSON and a raw variant, which you can get with:

name_json
name_raw

Example:

$ mail.to (Python object)
$ mail.to_json (JSON)
$ mail.to_raw (raw header)

The command-line tool uses the JSON format.

Defects

These defects can be used to evade antispam filters. An example is a mail with a malformed boundary that hides an illegitimate epilogue (often malware). This library can extract these epilogues.
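
To give a feel for what a defect looks like, here is a minimal sketch that uses only the standard library email package, not mail-parser itself. It assumes a message that declares a multipart type without any boundary; the addresses are made up. How mail-parser reports and categorizes the same defect may differ.

```python
import email

# A message that claims to be multipart but never declares a boundary.
raw = (
    "From: sender@example.com\n"
    "Subject: no boundary\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "This text claims to be multipart but no boundary was declared.\n"
)

msg = email.message_from_string(raw)

# The standard parser records each problem as an object in msg.defects.
defect_names = [type(defect).__name__ for defect in msg.defects]
print(defect_names)
```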

Authors

Main Author

Fedele Mantuano: LinkedIn

Installation

Clone the repository:

git clone https://github.com/SpamScope/mail-parser.git

and install mail-parser with setup.py:

$ cd mail-parser

$ python setup.py install

or use pip:

$ pip install mail-parser

Usage in a project

Import the mailparser module:

import mailparser

mail = mailparser.parse_from_bytes(byte_mail)
mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_msg(outlook_mail)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)

Then you can get all the parts:

mail.attachments: list of all attachments
mail.body
mail.date: datetime object in UTC
mail.defects: parts that do not comply with the RFC
mail.defects_categories: only the defect categories
mail.delivered_to
mail.from_
mail.get_server_ipaddress(trust="my_server_mail_trust")
mail.headers
mail.mail: tokenized mail in an object
mail.message: email.message.Message object
mail.message_as_string: message as string
mail.message_id
mail.received
mail.subject
mail.text_plain: only text plain mail parts in a list
mail.text_html: only text html mail parts in a list
mail.text_not_managed: all not managed text (check the warning logs to find content subtype)
mail.to
mail.to_domains
mail.timezone: returns the timezone, offset from UTC
mail.mail_partial: returns only the main parts of the email

It's possible to write the attachments to disk with the method:

mail.write_attachments(base_path)

Usage from command-line

If you installed mailparser with pip or setup.py, you can use it from the command line.

These are all the switches:

usage: mailparser [-h] (-f FILE | -s STRING | -k)
                   [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}] [-j] [-b]
                   [-a] [-r] [-t] [-dt] [-m] [-u] [-c] [-d] [-o]
                   [-i Trust mail server string] [-p] [-z] [-v]

Wrapper for email Python Standard Library

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Raw email file (default: None)
  -s STRING, --string STRING
                        Raw email string (default: None)
  -k, --stdin           Enable parsing from stdin (default: False)
  -l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}, --log-level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        Set log level (default: WARNING)
  -j, --json            Show the JSON of parsed mail (default: False)
  -b, --body            Print the body of mail (default: False)
  -a, --attachments     Print the attachments of mail (default: False)
  -r, --headers         Print the headers of mail (default: False)
  -t, --to              Print the to of mail (default: False)
  -dt, --delivered-to   Print the delivered-to of mail (default: False)
  -m, --from            Print the from of mail (default: False)
  -u, --subject         Print the subject of mail (default: False)
  -c, --receiveds       Print all receiveds of mail (default: False)
  -d, --defects         Print the defects of mail (default: False)
  -o, --outlook         Analyze Outlook msg (default: False)
  -i Trust mail server string, --senderip Trust mail server string
                        Extract a reliable sender IP address heuristically
                        (default: None)
  -p, --mail-hash       Print mail fingerprints without headers (default:
                        False)
  -z, --attachments-hash
                        Print attachments with fingerprints (default: False)
  -sa, --store-attachments
                        Store attachments on disk (default: False)
  -ap ATTACHMENTS_PATH, --attachments-path ATTACHMENTS_PATH
                        Path where store attachments (default: /tmp)
  -v, --version         show program's version number and exit

It takes as input a raw mail and generates a parsed object.

Example:

$ mailparser -f example_mail -j

This example shows the tokenized mail as pretty-printed JSON.

From raw mail to parsed mail.

Exceptions

Exceptions hierarchy of mail-parser:

MailParserError: Base MailParser Exception
|
\── MailParserOutlookError: Raised with Outlook integration errors
|
\── MailParserEnvironmentError: Raised when the environment is not correct
|
\── MailParserOSError: Raised when there is an OS error
|
\── MailParserReceivedParsingError: Raised when a received header cannot be parsed
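
Because every exception derives from MailParserError, a single except clause on the base class should cover all of them. The sketch below defines local stand-in classes that only mirror the tree above, so that it runs without the library installed; in real code they would be imported from mail-parser (the import path is not stated here). parse_or_none and _failing_parser are hypothetical helpers.

```python
# Local stand-ins mirroring the hierarchy above; NOT the library's own classes.
class MailParserError(Exception):
    pass


class MailParserOutlookError(MailParserError):
    pass


class MailParserEnvironmentError(MailParserError):
    pass


class MailParserOSError(MailParserError):
    pass


class MailParserReceivedParsingError(MailParserError):
    pass


def parse_or_none(parse, raw):
    """Call a parse function and return None on any mail-parser error."""
    try:
        return parse(raw)
    except MailParserError as exc:
        print("parsing failed:", type(exc).__name__, exc)
        return None


def _failing_parser(raw):
    """Hypothetical parser that always fails, for demonstration only."""
    raise MailParserReceivedParsingError("cannot parse received header")


result = parse_or_none(_failing_parser, "raw mail")
```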

mail-parser's Issues

If the attachment name is in Cyrillic then TypeError: decoding to str: need a bytes-like object, Header found

Raw mail
RAW email https://gist.github.com/yatakoi/77523914f80776a8d3323de73417e767

Environment:

OS: CentOS 7
Docker: no
mail-parser version 3.12.0

Additional context
If the attachment name is in Cyrillic then TypeError: decoding to str: need a bytes-like object, Header found

Traceback (most recent call last):
File "main.py", line 139, in <module>
last_uid = get_emails(host, login, password, last_uid=last_uid)
File "main.py", line 54, in get_emails
mail = mailparser.parse_from_bytes(message_data[b"RFC822"])
File "/home/m.kostromin/send_tickets/send_tickets/lib64/python3.6/site-packages/mailparser/mailparser.py", line 116, in parse_from_bytes
return MailParser.from_bytes(bt)
File "/home/m.kostromin/send_tickets/send_tickets/lib64/python3.6/site-packages/mailparser/mailparser.py", line 239, in from_bytes
return cls(message)
File "/home/m.kostromin/send_tickets/send_tickets/lib64/python3.6/site-packages/mailparser/mailparser.py", line 136, in __init__
self.parse()
File "/home/m.kostromin/send_tickets/send_tickets/lib64/python3.6/site-packages/mailparser/mailparser.py", line 374, in parse
p.get('content-disposition'))
File "/home/m.kostromin/send_tickets/send_tickets/lib64/python3.6/site-packages/mailparser/utils.py", line 80, in wrapper
return normalize('NFC', func(*args, **kwargs))
File "/home/m.kostromin/send_tickets/send_tickets/lib64/python3.6/site-packages/mailparser/utils.py", line 114, in ported_string
return six.text_type(raw_data, encoding).strip()
TypeError: decoding to str: need a bytes-like object, Header found

Please, help me.

mail.date does not return timezone information

The routine should return dates with timezone offset information. ISO 8601 would be preferred, e.g. 2013-10-29T09:38:41.341Z
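
As a point of comparison, the standard library can already produce timezone-aware ISO 8601 strings from a Date header. This is a sketch of the requested output format, not of how mail-parser computes mail.date; the header value is an example.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Example Date header value with an explicit UTC offset.
header_value = "Fri, 8 Feb 2019 10:21:36 -0500"

# parsedate_to_datetime keeps the offset as tzinfo on the datetime.
local = parsedate_to_datetime(header_value)
as_utc = local.astimezone(timezone.utc)

print(local.isoformat())   # 2019-02-08T10:21:36-05:00
print(as_utc.isoformat())  # 2019-02-08T15:21:36+00:00
```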

Missing Content-ID in msg.attachments

Missing Content-ID in msg.attachments, so contents and attachments cannot be connected.

https://tools.ietf.org/html/rfc2392

Can not save attachments for emails with Content-Type: text/calendar

Describe the bug

Using mailparser from the cli fails fatally on Content-Type: text/calendar

To Reproduce
Steps to reproduce the behavior:

mailparser -f mail.txt -ap /var/tmp --store-attachments
'....'
See error

Expected behavior
The meeting attachment is written to disk as other (base64) attachments are.

Raw mail
Any raw mail containing a meeting request attachment, i.e.

Received: from mail.me.com (LHLO mail.me.com) (10.0.1.80) by
 mail.me.com with LMTP; Fri, 8 Feb 2019 10:21:37 -0500 (EST)
Received: from localhost (localhost [127.0.0.1])
        by mail.me.com (Postfix) with ESMTP id 82E49C900985;
        Fri,  8 Feb 2019 10:21:37 -0500 (EST)
Received: from mail.me.com ([127.0.0.1])
        by localhost (mail.me.com [127.0.0.1]) (amavisd-new, port 10032)
        with ESMTP id Or1feO1BbrkB; Fri,  8 Feb 2019 10:21:36 -0500 (EST)
Received: from localhost (localhost [127.0.0.1])
        by mail.me.com (Postfix) with ESMTP id 875A1C90070B;
        Fri,  8 Feb 2019 10:21:36 -0500 (EST)
X-Quarantine-ID: <OUFU1ymuNiJU>
X-Virus-Scanned: amavisd-new at mail.me.com
Received: from mail.me.com ([127.0.0.1])
        by localhost (mail.me.com [127.0.0.1]) (amavisd-new, port 10026)
        with ESMTP id OUFU1ymuNiJU; Fri,  8 Feb 2019 10:21:36 -0500 (EST)
Received: from mail.me.com (mail.me.com [10.0.1.80])
        by mail.me.com (Postfix) with ESMTP id 5C626C9000C9;
        Fri,  8 Feb 2019 10:21:36 -0500 (EST)
Date: Fri, 8 Feb 2019 10:21:36 -0500 (EST)
From: [email protected]
To: Me Too <[email protected]>
Message-ID: <[email protected]>
Subject: Project Meeting
MIME-Version: 1.0
Content-Type: multipart/alternative; 
        boundary="----=_Part_6960233_303396869.1549639296326"
X-Originating-IP: [10.0.2.1]
X-Mailer: Zimbra 8.6.0_GA_1194 (ZimbraWebClient - GC72 (Win)/8.6.0_GA_1194)

------=_Part_6960233_303396869.1549639296326
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

A single instance of the following meeting has been modified:

Subject: Project Meeting 
Organizer: "Me Me" <[email protected]> 

Location: "Conference Room" <[email protected]> 
Resources: "Conference Room" <[email protected]> (Conference Room) 
Time: Friday, February 8, 2019, 11:00:00 AM - 12:00:00 PM GMT -05:00 US/Canada Eastern
 
Invitees: [email protected] ... 


*~*~*~*~*~*~*~*~*~*

The following meeting has been modified:

Subject: Project Meeting 
Organizer: "Me Me" <[email protected]> 

Location: "Conference Room" <[email protected]> 
Resources: "Conference Room" <[email protected]> (Conference Room) 
Time: 11:00:00 AM - 12:00:00 PM GMT -05:00 US/Canada Eastern [MODIFIED]
 Recurrence : Every Friday.   End by Jun 28, 2019.   Effective Oct 26, 2018

Invitees: [email protected] ... 


*~*~*~*~*~*~*~*~*~*

Creating new series to fix conference room issue 

------=_Part_6960233_303396869.1549639296326
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html><body id='htmlmode'><h3>A single instance of the following meeting has been modified:</h3>

<p>
<table border='0'>
<tr><th align=left>Subject:</th><td>Project Meeting </td></tr>
<tr><th align=left>Organizer:</th><td>"Me Me" &lt;[email protected]&gt; </td></tr>
</table>
<p>
<table border='0'>
<tr><th align=left>Location:</th><td>"Conference Room" &lt;[email protected]&gt; </td></tr>
<tr><th align=left>Resources:</th><td>"Conference Room" &lt;[email protected]&gt; (Conference Room) </td></tr>
<tr><th align=left>Time:</th><td>Friday, February 8, 2019, 11:00:00 AM - 12:00:00 PM GMT -05:00 US/Canada Eastern
 </td></tr></table>
<p>
<table border='0'>
<tr><th align=left>Invitees:</th><td>[email protected] ... </td></tr>
</table>
<div>*~*~*~*~*~*~*~*~*~*</div><br><div style="font-family: tahoma,new york,times,serif; font-size: 12pt; color: #000000"><div style="font-family: tahoma,new york,times,serif; font-size: 12pt; color: #000000"><div>Creating new series to fix conference room issue</div></div></div></body></html>
------=_Part_6960233_303396869.1549639296326
Content-Type: text/calendar; charset=utf-8; method=CANCEL; name=meeting.ics
Content-Transfer-Encoding: 7bit

BEGIN:VCALENDAR
PRODID:Zimbra-Calendar-Provider
VERSION:2.0
METHOD:CANCEL
BEGIN:VTIMEZONE
TZID:America/New_York
BEGIN:STANDARD
DTSTART:16010101T020000
TZOFFSETTO:-0500
TZOFFSETFROM:-0400
RRULE:FREQ=YEARLY;WKST=MO;INTERVAL=1;BYMONTH=11;BYDAY=1SU
TZNAME:EST
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:16010101T020000
TZOFFSETTO:-0400
TZOFFSETFROM:-0500
RRULE:FREQ=YEARLY;WKST=MO;INTERVAL=1;BYMONTH=3;BYDAY=2SU
TZNAME:EDT
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:77b6c34a-ec9c-4d34-8552-2896f3ab5996
SUMMARY:Project Meeting
COMMENT:A single instance of a recurring meeting has been modified.
LOCATION:"Conference Room" <conferenceroom@me
 .com>
ATTENDEE;CN=Me Too;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED:mailto:m
 [email protected]
ATTENDEE;CN=Conference Room;CUTYPE=RESOURCE;ROLE=NON-PARTI
 CIPANT;PARTSTAT=ACCEPTED:mailto:[email protected]
ORGANIZER;CN=Me Me:mailto:[email protected]
DTSTART;TZID="America/New_York":20190208T110000
DTEND;TZID="America/New_York":20190208T120000
STATUS:CANCELLED
CLASS:PUBLIC
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
TRANSP:OPAQUE
RECURRENCE-ID;TZID="America/New_York":20190208T110000
LAST-MODIFIED:20190208T152136Z
DTSTAMP:20190208T152136Z
SEQUENCE:2
DESCRIPTION:A single instance of the following meeting has been modified:\n
 \nSubject: Project Meeting \nOrganizer: "Me Me" <me@me
 .com> \n\nLocation: "Conference Room" <confer
 [email protected]> \nResources: "Conference Room" <
 [email protected]> (Conference Room) \nT
 ime: Friday\, February 8\, 2019\, 11:00:00 AM - 12:00:00 PM GMT -05:00 US/Ca
 nada Eastern\n \nInvitees: [email protected] ... \n\n\n*~*~*~*~*~*~*~*
 ~*~*\n\nThe following meeting has been modified:\n\nSubject: Projec
 t Meeting \nOrganizer: "Me Me" <[email protected]> \n\nLocation: "IT - 
 Earl's Court Conference Room" <[email protected]> \nRe
 sources: "Conference Room" <conferenceroom@me
 me.com> (Conference Room) \nTime: 11:00:00 AM - 12:00:0
 0 PM GMT -05:00 US/Canada Eastern [MODIFIED]\n Recurrence : Every Friday.   
 End by Jun 28\, 2019.   Effective Oct 26\, 2018\n\nInvitees: metoo@me
 .com ... \n\n\n*~*~*~*~*~*~*~*~*~*\n\nCreating new series to fix confere
 nce room issue \n
X-ALT-DESC;FMTTYPE=text/html:<html><body id='htmlmode'><h3>A single instance
  of the following meeting has been modified:</h3>\n\n<p>\n<table border='0'
 >\n<tr><th align=left>Subject:</th><td>Project Meeting </td></tr>\n
 <tr><th align=left>Organizer:</th><td>"Me Me" &lt\;[email protected]&gt
 \; </td></tr>\n</table>\n<p>\n<table border='0'>\n<tr><th align=left>Locatio
 n:</th><td>"Conference Room" &lt\;conferencer
 [email protected]&gt\; </td></tr>\n<tr><th align=left>Resources:</th><td>"IT -
  Earl's Court Conference Room" &lt\;[email protected]&
 gt\; (Conference Room) </td></tr>\n<tr><th align=left>Time
 :</th><td>Friday\, February 8\, 2019\, 11:00:00 AM - 12:00:00 PM GMT -05:00 
 US/Canada Eastern\n </td></tr></table>\n<p>\n<table border='0'>\n<tr><th ali
 gn=left>Invitees:</th><td>[email protected] ... </td></tr>\n</table>\n
 <div>*~*~*~*~*~*~*~*~*~*</div><br><div style="font-family: tahoma\,new york\
 ,times\,serif\; font-size: 12pt\; color: #000000"><div style="font-family: t
 ahoma\,new york\,times\,serif\; font-size: 12pt\; color: #000000"><div>Creat
 ing new series to fix conference room issue</div></div></div></body></html>
END:VEVENT
END:VCALENDAR
------=_Part_6960233_303396869.1549639296326--

Environment:

OS: Linux
Docker: no
mail-parser version 3.9.2

Additional context

  File "/usr/bin/mailparser", line 9, in <module>
    load_entry_point('mail-parser==3.9.2', 'console_scripts', 'mailparser')()
  File "/usr/lib/python3.4/site-packages/mail_parser-3.9.2-py3.4.egg/mailparser/__main__.py", line 260, in main
    write_attachments(parser.attachments, args.attachments_path)
  File "/usr/lib/python3.4/site-packages/mail_parser-3.9.2-py3.4.egg/mailparser/utils.py", line 527, in write_attachments
    filename=a["filename"],
  File "/usr/lib/python3.4/site-packages/mail_parser-3.9.2-py3.4.egg/mailparser/utils.py", line 552, in write_sample
    f.write(payload.encode("utf-8"))
TypeError: must be str, not bytes

Charset iso-8859-8-i is not supported (hebrew)

Describe the bug
Email body (text_plain, text_html) contains dots instead of text when charset="iso-8859-8-i" and Content-Transfer-Encoding: quoted-printable.

Expected behavior
Text should be parsed

Raw mail
https://gist.github.com/Rashe/2d6887c1d442d51f2bea3c5b0f8cfb6a

Environment:

OS: Linux
Docker: Yes
mail-parser version 3.12.0
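
One possible workaround, sketched under the assumption that iso-8859-8-i uses the same byte-to-character mapping as iso-8859-8 (the suffix only describes how text direction is handled): fall back to the base codec when Python does not know the charset name. CHARSET_FALLBACKS, resolve_charset and decode_quoted_printable are hypothetical names, and this is not how mail-parser decodes bodies internally.

```python
import codecs
import quopri

# Assumed aliases: the -i / -e suffixes describe bidi handling, not the bytes.
CHARSET_FALLBACKS = {
    "iso-8859-8-i": "iso-8859-8",
    "iso-8859-8-e": "iso-8859-8",
}


def resolve_charset(name, default="utf-8"):
    """Return a codec name Python can use, falling back for unknown names."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return CHARSET_FALLBACKS.get(name.lower(), default)


def decode_quoted_printable(data, charset):
    """Decode a quoted-printable body part into text."""
    raw_bytes = quopri.decodestring(data)
    return raw_bytes.decode(resolve_charset(charset), errors="replace")


encoded = quopri.encodestring("שלום".encode("iso-8859-8"))
print(decode_quoted_printable(encoded, "iso-8859-8-i"))
```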

base64 decode error on new save-attachments code

Describe the bug
If I use the new "save attachments" functionality of the command-line tool, I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/mailparser", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/mailparser/__main__.py", line 260, in main
    write_attachments(parser.attachments, args.attachments_path)
  File "/usr/local/lib/python3.7/site-packages/mailparser/utils.py", line 518, in write_attachments
    filename=a["filename"],
  File "/usr/local/lib/python3.7/site-packages/mailparser/utils.py", line 540, in write_sample
    f.write(payload.decode("base64"))
AttributeError: 'str' object has no attribute 'decode'

To Reproduce
Steps to reproduce the behavior:

$ mkdir test
$ mailparser -sa -ap test -f test_email.MSG -o

Expected behavior
Tool puts attachments into "test" directory

Raw mail
Test email that reproduces the bug can be found at https://github.com/mattgwwalker/msg-extractor/blob/master/example-msg-files/unicode.msg

Environment:

OS: OSX
Docker: no
mail-parser version 3.8.0

Additional context
Because it's new code I figured it's probably easier for you to make a quick fix than for me to go through the trouble of making a pull request.

The problem can be fixed by changing
https://github.com/SpamScope/mail-parser/blob/develop/mailparser/utils.py#L540

into:

            f.write(base64.b64decode(payload))
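
A slightly fuller sketch of the same idea, which also covers the text/calendar case reported above: open the file in binary mode and turn the payload into bytes first. It assumes the attachment dictionaries use the binary, filename and payload keys listed in the README, with base64 text for binary attachments and plain text otherwise. save_attachment is a hypothetical helper, not the library's write_sample.

```python
import base64
import os
import tempfile


def save_attachment(attachment, base_path):
    """Write one attachment dictionary to base_path and return the file path."""
    # basename() drops any directory components from an untrusted filename.
    target = os.path.join(base_path, os.path.basename(attachment["filename"]))
    payload = attachment["payload"]
    if attachment.get("binary"):
        data = base64.b64decode(payload)   # base64 text -> original bytes
    else:
        data = payload.encode("utf-8")     # plain text -> bytes
    with open(target, "wb") as handle:     # binary mode accepts bytes only
        handle.write(data)
    return target


with tempfile.TemporaryDirectory() as workdir:
    pdf = {"filename": "a.bin", "binary": True,
           "payload": base64.b64encode(b"\x00\x01data").decode("ascii")}
    ics = {"filename": "meeting.ics", "binary": False,
           "payload": "BEGIN:VCALENDAR\nEND:VCALENDAR\n"}
    with open(save_attachment(pdf, workdir), "rb") as handle:
        binary_back = handle.read()
    with open(save_attachment(ics, workdir), "rb") as handle:
        text_back = handle.read()
```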

Add: Option to output 'to' and 'headers' as JSON objects in addition to arrays

I need it for another use case and would like to get feedback before working on it.
Not sure how to implement it. Things coming to mind:

A separate set of _obj() terminated properties?
How to choose from command line?
Replace the array thingy? I don't think so/not sure, unless...
...it's useful for adding additional fields, for diagnostic or other infos on each individual element.
Other thoughts?

HTML body regression in mail-parser 3.12.0

Describe the bug
HTML bodies (raw or base64-encoded) without an email boundary are treated as a binary attachment instead of HTML body content. This bug was not present in 3.11.0.

To Reproduce
Steps to reproduce the behavior:

import mailparser
mail = mailparser.parse_from_file(f)
See error

Expected behavior
The HTML body should be parsed as the message body

Raw mail
Samples Warning - includes phishing emails: samples.zip Password: infected.

Environment:

OS: Linux
Docker: No
mail-parser version 3.12.0

Custom header only takes first occurrence

Use of custom headers
When accessing a custom header on a mailparser object, only one value is returned, not a list, even when the header occurs more than once in the email, e.g. "Received-SPF".

To Reproduce
Take a .MSG file whose headers contain the same field more than once, such as "Received-SPF". Calling mail.Received_SPF gives you only the first occurrence instead of a list with all of them.

Environment
Python 3.8.0, using the latest mail-parser release from PyPI.
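
For reference, the standard library email package makes the same distinction: get() returns the first occurrence and get_all() returns every occurrence. The sketch below shows that behavior on a made-up message; whether mail-parser exposes an equivalent is not stated here, although its mail.message property (an email.message.Message object, per the README) may allow the same call.

```python
import email

raw = (
    "Received-SPF: pass (first hop)\n"
    "Received-SPF: softfail (second hop)\n"
    "Subject: duplicate headers\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(raw)

first_only = msg.get("Received-SPF")          # first occurrence only
all_values = msg.get_all("Received-SPF", [])  # every occurrence, in order

print(first_only)
print(all_values)
```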

text_plain returns all parts that are not text/html

Describe the bug
MailParser.text_plain returns all parts that are not text/html.

To Reproduce

>>> import mailparser

>>> mail = mailparser.parse_from_bytes(b'''From: [email protected]
Subject: Test
Date: Wed, 24 Apr 2019 10:05:02 +0200 (CEST)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8544575414772382491=="
To: [email protected]

--===============8544575414772382491==
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<!doctype html>
<title>Foo</title>
<meta charset="utf-8">

HTML here

--===============8544575414772382491==
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Disposition: inline

UE5HIGhlcmU=
--===============8544575414772382491==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Plaintext here.
--===============8544575414772382491==--
''')

>>> mail.text_html
['<!doctype html>\n<title>Foo</title>\n<meta charset="utf-8">\n\nHTML here']

>>> mail.text_plain
['PNG here', 'Plaintext here.']

Expected behavior
text_plain should only return parts with Content-Type text/plain.

Raw mail

The raw mail is the same message passed to parse_from_bytes under "To Reproduce" above.

Environment:

OS: Linux
Docker: no
mail-parser version 3.9.3

Additional context
It is impossible to sort out non-text parts (without heuristics), because everything is parsed into a list of strings and Content-Type information is thrown away.
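
The expected behavior can be sketched with the standard library by filtering on each part's Content-Type while walking the message. This is an illustration of the idea only, not a patch for MailParser.text_plain; parts_of_type is a hypothetical helper and the addresses are placeholders.

```python
import email

raw = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "\n"
    "<p>HTML here</p>\n"
    "--BOUNDARY\n"
    "Content-Type: image/png\n"
    "Content-Transfer-Encoding: base64\n"
    "Content-Disposition: inline\n"
    "\n"
    "UE5HIGhlcmU=\n"
    "--BOUNDARY\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "Plaintext here.\n"
    "--BOUNDARY--\n"
)


def parts_of_type(message, content_type):
    """Return the decoded text of every part with exactly this Content-Type."""
    texts = []
    for part in message.walk():
        if part.get_content_type() != content_type:
            continue
        charset = part.get_content_charset() or "us-ascii"
        payload = part.get_payload(decode=True)  # bytes after transfer decoding
        texts.append(payload.decode(charset, errors="replace"))
    return texts


msg = email.message_from_string(raw)
plain_parts = parts_of_type(msg, "text/plain")
html_parts = parts_of_type(msg, "text/html")
print(plain_parts, html_parts)
```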

No matches found

To Reproduce
i'm not actually sure what causes this. i'm new to email standards etc; could you give me a brief rundown of whats going on here?

Expected behavior
i assume this is supposed to happen?

Raw mail
using only the information above, is there a way to work backwards to the email? it takes hours to parse the amount of emails to get here and i don't believe it can be caught in a try/except

Environment:

OS: linux
Docker: no
mail-parser version [e.g. 3.6.0]

Suggestion better parsing of the received field

Would be great to have a dict which contains From / By / With / Time for each received element.

Is there a method for downloading attachments if using mail-parser as module in project?

Maybe I am missing something, or is there nothing in the README about retrieving the actual email attachments?
There are methods for getting the attachment names (mail.attachments), but not the actual files?

Thanks in advance.

can't make the difference between --- mail_boundary --- from the library and those from the actual message

Describe the bug
In the mail.body attribute, the library often includes --- mail_boundary --- in the string. However, if your email's body actually contains a --- mail_boundary --- written by the sender, you have no way to differentiate the one included by the library from the one written by the actual user.

To Reproduce
Steps to reproduce the behavior:

write an email from your favorite mailer with the text "hello i've written myself --- mail_boundary --- what will happen blablabla"
parse that email
print the email.body attributes
=> you have no way to differentiate your --- mail_boundary --- from the one added by mail-parser

Expected behavior
I'm not an expert in email, so I'm wondering why the library makes it appear to the "end user" that there were some mail boundaries in the body. What is the use case for it?

Raw mail
https://gist.github.com/allan-simon/55326f4b63f8d9d74d9b887f96753de6.

Environment:

OS: Linux
Docker: no
mail-parser version 3.8.1

should self.mail be self._mail at line 279

While reviewing the code, I was wondering if I stumbled on a typo in the dev branch:

self.mail["has_defects"] = self.has_defects

should it be?

self._mail["has_defects"] = self.has_defects

Filenames of attachments are not decoded

Hello,

I have an e-mail with the following content:

...
/span></a></span></p></div></div></div></div></div></div></div></div></div>
</div></div>
</div><br></div>

--f4030435bb14695e4d05669b618a--
--f4030435bb14695e5105669b618c
Content-Type: application/pdf; 
        name="=?UTF-8?Q?Prokuratura_Rejonowa_Warszawa=2D=C5=9Ar=C3=B3dmie=C5=9Bcie_p=C3=B3=C5=82noc_sygn?=
        =?UTF-8?Q?=2E_2Ds=2E_137414_=2D_RSK_pracownik=C3=B3w_Skarbowych_NSZZ_Solidarno=C5=9B?=
        =?UTF-8?Q?=C4=87_=2D_Zarz=C4=85dzenie_o_odmowie_dopuszczenia_SOWP_do_udzia=C5=82u_w_?=
        =?UTF-8?Q?postepowaniu=2Epdf?="
Content-Disposition: attachment; 
        filename="=?UTF-8?Q?Prokuratura_Rejonowa_Warszawa=2D=C5=9Ar=C3=B3dmie=C5=9Bcie_p=C3=B3=C5=82noc_sygn?=
        =?UTF-8?Q?=2E_2Ds=2E_137414_=2D_RSK_pracownik=C3=B3w_Skarbowych_NSZZ_Solidarno=C5=9B?=
        =?UTF-8?Q?=C4=87_=2D_Zarz=C4=85dzenie_o_odmowie_dopuszczenia_SOWP_do_udzia=C5=82u_w_?=
        =?UTF-8?Q?postepowaniu=2Epdf?="
Content-Transfer-Encoding: base64
X-Attachment-Id: f_i53uo58b0

JVBERi0xLjIKJcjH0MRGCjQgMCBvYmoKPDwKL1R5cGUgL091dGxpbmVzCi9Db3VudCAwCj4+CmVu
ZG9iago1IDAgb2JqCjw8Ci9UeXBlIC9Gb250Ci9TdWJ0eXBlIC9UeXBlMQovTmFtZSAvRjAKL0Jh
c2VGb250IC9IZWx2ZXRpY2EKL0VuY29kaW5nIC9NYWNSb21hbkVuY29kaW5nCj4+CmVuZG9iago2
IDAgb2JqCjw8Ci9UeXBlIC9QYWdlCi9QYXJlbnQgMyAwIFIKL1Jlc291cmNlcyA4IDAgUgovTWVk
aWFCb3ggWyAwIDAgNTc2IDgyOS40NCBdCi9Db250ZW50cyA3IDAgUgo+PgplbmRvYmoKOSAwIG9i
ago8PAovVHlwZSAvWE9iamVjdAovU3VidHlwZSAvSW1hZ2UKL05hbWUgL0ltMAovV2lkdGggMTYw
MAovSGVpZ2h0IDIzMDQKL0JpdHNQZXJDb21wb25lbnQgMQovQ29sb3JTcGFjZSAvRGV2aWNlR3Jh
eQovRmlsdGVyIC9DQ0lUVEZheERlY29kZQovRGVjb2RlUGFybXMgPDwgL0sgLTEgL0NvbHVtbnMg
...

The decoded filename is useless:

=?UTF-8?Q?Prokuratura_Rejonowa_Warszawa=2D=C5=9Ar=C3=B3dmie=C5=9Bcie_p=C3=B3=C5=82noc_sygn?=\n\t=?UTF-8?Q?=2E_2Ds=2E_137414_=2D_RSK_pracownik=C3=B3w_Skarbowych_NSZZ_Solidarno=C5=9B?=\n\t=?UTF-8?Q?=C4=87_=2D_Zarz=C4=85dzenie_o_odmowie_dopuszczenia_SOWP_do_udzia=C5=82u_w_?=\n\t=?UTF-8?Q?postepowaniu=2Epdf?=

I suggest adding something like:

import sys
from email.header import decode_header, make_header

import mailparser

filename = sys.argv[1]

with open(filename, 'r') as fp:
    mail = mailparser.parse_from_file_obj(fp)

for attachment in mail.attachments:
    # decode_header() returns one (text, charset) pair per chunk;
    # make_header() joins all of them, not only the first one
    print(str(make_header(decode_header(attachment['filename']))))

I think that the text of the file name should be returned, not the text of the raw header containing the file name.

Add headers as string to message object

I need to be able to read message headers in their original state as a string. Is there already a property for this? I can't find one.

Parsing of receiveds doesn't support via or id

for example:

Received: from XXX.namprd06.prod.outlook.com (2603:10b6:207:3d::31) by XXX.namprd06.prod.outlook.com with HTTPS id 12345 via XXX.NAMPRD02.PROD.OUTLOOK.COM; Mon, 1 Oct 2018 09:49:22 +0000
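
A rough sketch of a clause-based parser that also picks up id and via. It is a simplified illustration built on a regular expression and the standard library date parser, not the library's actual Received parsing; parse_received, CLAUSE and COMMENT are hypothetical names, and real headers have many more variations than this handles.

```python
import re
from email.utils import parsedate_to_datetime

# Keyword followed by a single token; comments in parentheses are dropped first.
CLAUSE = re.compile(r"\b(from|by|with|id|via|for)\s+(\S+)", re.IGNORECASE)
COMMENT = re.compile(r"\([^()]*\)")


def parse_received(value):
    """Split a Received header value into a dict of clauses plus a date."""
    clauses, separator, date_text = value.rpartition(";")
    if not separator:                       # no ";" means no date section
        clauses, date_text = value, ""
    hop = {}
    for match in CLAUSE.finditer(COMMENT.sub(" ", clauses)):
        hop.setdefault(match.group(1).lower(), match.group(2))
    try:
        hop["date"] = parsedate_to_datetime(date_text.strip())
    except (TypeError, ValueError):         # raised for unparsable dates
        hop["date"] = None
    return hop


hop = parse_received(
    "from XXX.namprd06.prod.outlook.com (2603:10b6:207:3d::31) "
    "by XXX.namprd06.prod.outlook.com with HTTPS id 12345 "
    "via XXX.NAMPRD02.PROD.OUTLOOK.COM; Mon, 1 Oct 2018 09:49:22 +0000"
)
print(hop)
```

A header without a ";" section, or with an unparsable date, yields date None here instead of raising.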

Unparsable Received: header

Describe the bug
Received: header that doesn't parse:

Received: by filter0948p1iad2.sendgrid.net with SMTP id
 filter0948p1iad2-19514-5EEA473C-34 2020-06-17 16:39:24.757045452 +0000 UTC
 m=+349867.495769315

date was not found/parsed.

TypeError: expected string or bytes-like object

File "/root/PythonProjects/venv/lib/python3.5/site-packages/mailparser/mailparser.py", line 59, in parse_from_string
self._parse()
File "/root/PythonProjects/venv/lib/python3.5/site-packages/mailparser/mailparser.py", line 169, in _parse
self.make_mail()
File "/root/PythonProjects/venv/lib/python3.5/site-packages/mailparser/mailparser.py", line 95, in make_mail
"to": self.to,
File "/root/PythonProjects/venv/lib/python3.5/site-packages/mailparser/mailparser.py", line 237, in to
self._message.get('to', self._message.get('delivered-to')))
File "/root/PythonProjects/venv/lib/python3.5/site-packages/mailparser/utils.py", line 80, in decode_header_part
for d, c in decode_header(header):
File "/usr/lib/python3.5/email/header.py", line 80, in decode_header
if not ecre.search(header):
TypeError: expected string or bytes-like object
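
The traceback suggests that neither a To nor a Delivered-To header was present, so None reached decode_header. A guard along these lines would avoid the TypeError; this is a sketch of the idea with a hypothetical helper name, not the library's decode_header_part.

```python
from email.header import decode_header, make_header


def decode_header_or_empty(value):
    """Decode an RFC 2047 header value, treating a missing header as empty."""
    if value is None:  # message.get() returns None when the header is absent
        return ""
    return str(make_header(decode_header(value)))


print(repr(decode_header_or_empty(None)))
print(decode_header_or_empty("=?UTF-8?Q?caf=C3=A9?="))
```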

mail-parser fails to install depending on locale

mail-parser currently fails to install depending on system locale (LC_CTYPE), like so:

$ LC_CTYPE=C pip install mail-parser --user
Collecting mail-parser
  Using cached https://files.pythonhosted.org/packages/10/fa/988a3fe204d7cca689d75e623adf45a2f57c43e55603b73b7ffaf092a7e0/mail-parser-3.3.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-k3_u6ooo/mail-parser/setup.py", line 28, in <module>
        long_description = f.read()
      File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6050: ordinal not in range(128)
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-k3_u6ooo/mail-parser/
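
The traceback points at setup.py reading a file with the locale's default encoding. A locale-independent read would pass the encoding explicitly, as in the sketch below. Only the long_description = f.read() line is known from the traceback; the file name, its content and the read_text_utf8 helper are assumptions for illustration.

```python
import io
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of LC_CTYPE."""
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as workdir:
    readme = os.path.join(workdir, "README.rst")  # assumed file name
    with io.open(readme, "w", encoding="utf-8") as handle:
        # U+2019 is encoded as bytes starting with 0xe2, like the failing byte.
        handle.write("mail-parser\u2019s README")
    long_description = read_text_utf8(readme)

print(long_description)
```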

If a UTF-8 locale is provided, it installs fine:

$ LC_CTYPE=en_US.utf8 pip install mail-parser --user
Collecting mail-parser
  Using cached https://files.pythonhosted.org/packages/10/fa/988a3fe204d7cca689d75e623adf45a2f57c43e55603b73b7ffaf092a7e0/mail-parser-3.3.2.tar.gz
Requirement already satisfied: ipaddress in ./.local/lib/python3.6/site-packages (from mail-parser) (1.0.22)
Requirement already satisfied: simplejson in ./.local/lib/python3.6/site-packages (from mail-parser) (3.15.0)
Requirement already satisfied: six in /usr/lib/python3.6/site-packages (from mail-parser) (1.11.0)
Installing collected packages: mail-parser
  Running setup.py install for mail-parser ... done
Successfully installed mail-parser-3.3.2

This means that mail-parser will fail to install on a fresh docker ubuntu 18.04 image as no locale is set by default. (This was how I discovered this bug.)

$ docker run --rm -ti ubuntu
...
$ apt-get update
$ apt-get install -y python3 python3-pip
$ pip3 install mail-parser --user
Collecting mail-parser
  Downloading https://files.pythonhosted.org/packages/10/fa/988a3fe204d7cca689d75e623adf45a2f57c43e55603b73b7ffaf092a7e0/mail-parser-3.3.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-o3petv5z/mail-parser/setup.py", line 28, in <module>
        long_description = f.read()
      File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6050: ordinal not in range(128)
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-o3petv5z/mail-parser/

Again, setting a locale (e.g.: C.UTF-8) will fix the issue:

$ LC_CTYPE=C.UTF-8 pip3 install mail-parser --user
Collecting mail-parser
  Using cached https://files.pythonhosted.org/packages/10/fa/988a3fe204d7cca689d75e623adf45a2f57c43e55603b73b7ffaf092a7e0/mail-parser-3.3.2.tar.gz
Collecting ipaddress (from mail-parser)
  Downloading https://files.pythonhosted.org/packages/fc/d0/7fc3a811e011d4b388be48a0e381db8d990042df54aa4ef4599a31d39853/ipaddress-1.0.22-py2.py3-none-any.whl
Collecting simplejson (from mail-parser)
  Downloading https://files.pythonhosted.org/packages/8b/6c/c512c32124d1d2d67a32ff867bb3cdd5bfa6432660975f7ee753ed7ad886/simplejson-3.15.0.tar.gz (80kB)
    100% |████████████████████████████████| 81kB 1.1MB/s 
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from mail-parser)
Building wheels for collected packages: mail-parser, simplejson
  Running setup.py bdist_wheel for mail-parser ... done
  Stored in directory: /root/.cache/pip/wheels/b2/bf/36/983bad64964927fa0377c2a0d2b8265bb8e38514c49b025dbb
  Running setup.py bdist_wheel for simplejson ... done
  Stored in directory: /root/.cache/pip/wheels/2c/96/fb/b63af7400da79753dcd2a3f9bf5e7a3010e8d0233844445c2c
Successfully built mail-parser simplejson
Installing collected packages: ipaddress, simplejson, mail-parser
Successfully installed ipaddress-1.0.22 mail-parser-3.3.2 simplejson-3.15.0
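
The tracebacks above point at f.read() in setup.py decoding the long description with the locale's default codec. A minimal sketch of a locale-independent read, assuming the long description is stored as UTF-8 (the helper name read_long_description is hypothetical and not taken from the project):

```python
import io


def read_long_description(path):
    """Read a text file as UTF-8 regardless of LC_CTYPE."""
    # io.open accepts an explicit encoding on both Python 2 and Python 3,
    # so the result no longer depends on the system locale.
    with io.open(path, encoding="utf-8") as f:
        return f.read()
```

With an explicit encoding, the same call should behave identically under LC_CTYPE=C and under a UTF-8 locale.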

Feature request: Add Avro as an output [file] format

Thanks for this great piece of software.

Having Avro as an output file format to the command line tool would be awesome to use it with all sorts of big data tools. Looks like there is an official Avro lib for Python:
https://avro.apache.org/docs/current/gettingstartedpython.html

An --avro command line switch, analogous to the existing --json one, would be the way to go, IMHO.

I'm not sure how much coupling there is between the actual parsing code and the JSON file writer. I will try to figure it out from the code, but I'm not very experienced with Python. I'm currently just using the command line tool and consuming its output as JSON files.

TIA.

Support for Python 3?

Edit: Just saw the developer branch you're working on.

Great library! Byte, string, unicode issues introduced with Python 3

# mailparser -f test3.email -j                                                                                                   
Traceback (most recent call last):
  File "/home/test/.venv/bin/mailparser", line 11, in <module>
    load_entry_point('mail-parser==0.5.0', 'console_scripts', 'mailparser')()
  File "/home/test/.venv/lib/python3.5/site-packages/mailparser/__main__.py", line 140, in main
    parser.parse_from_file(args.file)
  File "/home/test/.venv/lib/python3.5/site-packages/mailparser/__init__.py", line 52, in parse_from_file
    self._parse()
  File "/home/test/.venv/lib/python3.5/site-packages/mailparser/__init__.py", line 192, in _parse
    encoding=charset)
  File "/home/test/.venv/lib/python3.5/site-packages/mailparser/__init__.py", line 86, in _force_unicode
    if not isinstance(u, unicode):
NameError: name 'unicode' is not defined
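
The NameError comes from the Python 2 built-in unicode, which does not exist on Python 3. A minimal sketch of a helper that works on both major versions, assuming the goal is simply "return text" (force_unicode and text_type are illustrative names, not the library's own code):

```python
import sys

# On Python 3 the built-in ``unicode`` is gone; ``str`` is already Unicode text.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821  (only evaluated on Python 2)


def force_unicode(value, encoding="utf-8"):
    """Return ``value`` as text, decoding bytes with ``encoding``."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Undecodable bytes are dropped rather than raising.
        return value.decode(encoding, "ignore")
    return text_type(value)
```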

write_sample does not create directory structure included in $filename

Describe the bug
When an attachment has a filename containing /, the write_sample() function call fails with Error: No such file or directory: '/tmp/foo/bar/attachment.png'

To Reproduce
Steps to reproduce the behavior:

Run mailparser -f foo -sa -ad /tmp on an email with an attachment that has a slash in the file name
Observe the error

Expected behavior
write_sample() should create subdirectories not only for the attachment output directory specified with -ap, but also for the directory structure included in the attachment filename

Raw mail

Environment:

OS: linux
Docker: no
mail-parser version 3.9.3
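
The expected behavior could be sketched as follows. This is not write_sample() itself: write_attachment is a hypothetical helper, and the check against paths escaping the output directory is an extra precaution that the report does not ask for.

```python
import os


def write_attachment(output_dir, filename, payload):
    """Write ``payload`` (bytes) under ``output_dir``, creating subdirectories."""
    base = os.path.realpath(output_dir)
    target = os.path.realpath(os.path.join(base, filename))
    # Refuse names such as "../../etc/passwd" that would escape output_dir.
    if not target.startswith(base + os.sep):
        raise ValueError("attachment path escapes output directory")
    parent = os.path.dirname(target)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    with open(target, "wb") as f:
        f.write(payload)
    return target
```

Whether a slash in an attachment name should create directories or be replaced by a safe character is a design choice for the maintainers; the sketch only shows the first option.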

Deal with 'cc' and 'bcc' as multi-recipient address email fields as done on 'to'

This issue tracks followup to #16 and serves as discussion.
The 'bcc' field might appear in some cases, such as with email proxies like emailrelay.

How should I make this work for .MSG files?

I have used mailparser for extracting attachments from .EML files, but I am not able to use it for .MSG files. I read that 'libemail-outlook-message-perl' must be installed for parsing .MSG files, but I am not sure how to install it on Windows.

Duplicate anomalies.mail_without_date?

Warning: I don't understand fully if this is expected.
When running mp on test_mail_1 I get this:

mailparser_1  |     "has_defects": true,
mailparser_1  |     "has_anomalies": true,
mailparser_1  |     "defects": [
mailparser_1  |         {
mailparser_1  |             "multipart/mixed": [
mailparser_1  |                 "CloseBoundaryNotFoundDefect: A start boundary was found, but not the corresponding close boundary."
mailparser_1  |             ]
mailparser_1  |         }
mailparser_1  |     ],
mailparser_1  |     "defects_category": [
mailparser_1  |         "CloseBoundaryNotFoundDefect"
mailparser_1  |     ],
mailparser_1  |     "anomalies": [
mailparser_1  |         "mail_without_date",
mailparser_1  |         "mail_without_message-id",
mailparser_1  |         "mail_without_date"
mailparser_1  |     ]
mailparser_1  | }

The anomaly mail_without_date appears twice; I'm not sure whether it should be that way.
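
If the duplicate is unintended, one way to avoid it is to deduplicate the list while keeping first-seen order. A small sketch using the values from the output above (unique_in_order is an illustrative name, not a mail-parser function):

```python
def unique_in_order(items):
    """Drop repeated entries while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


anomalies = ["mail_without_date", "mail_without_message-id", "mail_without_date"]
print(unique_in_order(anomalies))
```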

Too many files open error

I have encountered the error "Too many files open" when working on datasets larger than 1000. I believe the error occurs in the following line

I cannot see the need for creating multiple different files. Why not create a single file instead, which you keep reading from and writing to (and delete afterwards)?

`Content-Disposition` info for attachments missed

Without this we can't determine whether an attachment is really an attachment or should be inlined. Of course we have Content-Id, but it may also be present on non-inline attachments.

Anyway, thank you guys for an awesome parser!

Decoding of payload of attachment

Hello,

I have the following code in my project:

import base64
import binascii
import quopri

import mailparser

# Map each Content-Transfer-Encoding to a function that returns the payload as bytes.
decoder_map = {'base64': base64.b64decode,
               '': lambda payload: payload.encode('utf-8'),
               '7bit': lambda payload: payload.encode('utf-8'),
               'quoted-printable': quopri.decodestring}
for msg_id in data[0].decode('utf-8').split():
    result_fetch, data = client.fetch(msg_id, "(RFC822)")

    if result_fetch != 'OK':
        raise Exception("Fetch failed!")

    raw_mail = data[0][1]
    mail = mailparser.parse_from_bytes(raw_mail)
    for attachment in mail.attachments:
        if attachment['content_transfer_encoding'] not in decoder_map:
            msg = "Unsupported Content-Transfer Encoding ({}) in msg {}.".format(attachment['content_transfer_encoding'], msg_id)
            raise RuntimeError(msg)

        decoder = decoder_map[attachment['content_transfer_encoding']]
        # Open the output file before the try block so that fp is always bound in finally.
        fp, fname = find_filename(attachment['filename'], args.attachment_dir)
        try:
            fp.write(decoder(attachment['payload']))
        except binascii.Error:
            print("Unable to parse attachment '{}'".format(attachment['filename']))
        finally:
            fp.close()

I think this use case is not fully handled. I had to introduce quite a lot of logic to save attachments that can be encoded in a variety of ways.
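
The logic above could be condensed into one helper that takes an attachment dictionary and returns bytes. This is only a sketch of what such a convenience function might look like: decode_payload is a hypothetical name, and the only keys assumed are payload and content_transfer_encoding, as used in the snippet above.

```python
import base64
import quopri


def decode_payload(attachment):
    """Return the attachment payload as bytes."""
    payload = attachment["payload"]
    encoding = (attachment.get("content_transfer_encoding") or "").lower()
    if encoding == "base64":
        return base64.b64decode(payload)
    if encoding == "quoted-printable":
        return quopri.decodestring(payload.encode("utf-8"))
    if encoding in ("", "7bit", "8bit", "binary"):
        # Text payloads are stored as-is; encode them to get bytes.
        return payload.encode("utf-8")
    raise ValueError("unsupported Content-Transfer-Encoding: %r" % encoding)
```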

Error in timezone parsing

Describe the bug

The timezone parsing only works for offsets with full hours. For example, parsing a date with +05:30 returns "+6", which is incorrect. This bug potentially impacts a number of significant regions (e.g., India with offset +05:30, or Australia (ACST) with offset +09:30).

To Reproduce

from email.utils import parsedate_tz
import mailparser
d = "Mon, 11 Dec 2017 15:27:44 +0530"
print(parsedate_tz(d)[9]/3600)  # 5.5
print(mailparser.utils.convert_mail_date(d)[1])  # +6 -- should be 5.5

Raw mail
Raw mail is not necessary to produce the error

Environment:

OS: [macOS Mojave (10.14)]
Docker: no
mail-parser version 3.9.3

Additional context
The bug results from using 0 decimal precision while formatting the timezone string. I recommend using 2 decimal precision to address this bug (please see the patch). Or, we can keep the offset as returned by email.utils.parsedate_tz (in seconds).
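
The recommendation above can be sketched with the standard library alone. The helper name timezone_offset is hypothetical, and the "{:+g}" format is one possible choice that keeps whole hours short ("+6") while preserving fractions ("+5.5"); it is not the patch referred to in the report.

```python
from email.utils import parsedate_tz


def timezone_offset(date_string):
    """Return the UTC offset of an RFC 2822 date as a signed string of hours."""
    parsed = parsedate_tz(date_string)
    if parsed is None or parsed[9] is None:
        return None
    hours = parsed[9] / 3600.0  # parsedate_tz reports the offset in seconds
    return "{:+g}".format(hours)


print(timezone_offset("Mon, 11 Dec 2017 15:27:44 +0530"))  # +5.5
```

For comparison, "{:.0f}".format(5.5) yields "6", which matches the wrong value in the report.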

Attachment is truncated!

I have a problem with big attachment and the truncate email using

TypeError: expected string or bytes-like object

Hello.

I have found an error with the parser when it tries to find who it was delivered to and the field is missing.

  File "C:\project\dashboard\management\commands\mail_service.py", line 90, in check_mail_by_id
    headers = mailparser.parse_from_bytes(response_part[1])
  File "C:\project\venv\lib\site-packages\mailparser\mailparser.py", line 73, in parse_from_bytes
    return MailParser.from_bytes(bt).parse()
  File "C:\project\venv\lib\site-packages\mailparser\mailparser.py", line 297, in parse
    self._make_mail()
  File "C:\project\venv\lib\site-packages\mailparser\mailparser.py", line 217, in _make_mail
    "to": self.to_,
  File "C:\project\venv\lib\site-packages\mailparser\mailparser.py", line 383, in to_
    self.message.get('to', self.message.get('delivered-to')))
  File "C:\project\venv\lib\site-packages\mailparser\utils.py", line 80, in decode_header_part
    for d, c in decode_header(header):
  File "C:\Python36\Lib\email\header.py", line 80, in decode_header
    if not ecre.search(header):
TypeError: expected string or bytes-like object
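
The traceback shows decode_header() receiving None when both the To and Delivered-To headers are missing. A defensive wrapper might look like the sketch below; safe_decode_header is a hypothetical name, not the library's decode_header_part, and returning an empty string for a missing header is an assumption about the desired behavior.

```python
from email.header import decode_header


def safe_decode_header(header):
    """Decode an RFC 2047 header value, tolerating a missing header."""
    if header is None:
        return ""
    parts = []
    for value, charset in decode_header(header):
        if isinstance(value, bytes):
            # Encoded words come back as bytes plus their declared charset.
            value = value.decode(charset or "ascii", errors="replace")
        parts.append(value)
    return "".join(parts)
```

A charset name that Python does not know would still raise LookupError here; handling that is left out to keep the sketch short.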

to_json returns []

import mailparser
mail = mailparser.parse_from_file('samples/eml_file')
print(mail.to_json)

returns []

(other functions like mail.body work as expected)

Attachments with identical filename overwrite each other when using .write_attachments

In my opinion, a number should be appended to the filename when several attachments share the same name, so that all of them are saved.
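
The suggested increment could be sketched like this. unique_path is a hypothetical helper and the "_1", "_2" suffix scheme is just one possible convention:

```python
import os


def unique_path(directory, filename):
    """Return a path in ``directory`` that does not exist yet."""
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(directory, filename)
    counter = 1
    while os.path.exists(candidate):
        # report.pdf -> report_1.pdf -> report_2.pdf ...
        candidate = os.path.join(directory, "{}_{}{}".format(stem, counter, ext))
        counter += 1
    return candidate
```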

ipaddress requirements in Python 3.3+

I am building my own Debian package, so mailparser is not installed via pip, and I am using it with Python 3.7.

requirements.txt lists the ipaddress module, which has been part of the Python 3 standard library since version 3.3. I didn't inspect the source in depth, but it seems to work when I use the mailparser module from a script. When I try to run it as a command, however, it complains about a missing distribution under Python 3.7:

pkg_resources.DistributionNotFound: The 'ipaddress' distribution was not found and is required by the application

As far as I can see from the ipaddress package description, it is a port of the built-in module to older Python versions that lack it, so it should be unnecessary on Python 3.3+. Please consider making this requirement conditional on the Python version in use.
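
A conditional requirement could be sketched as below. This is a hypothetical setup.py excerpt, not the project's actual file, and the base list is illustrative.

```python
import sys

# Hypothetical excerpt of a setup.py; the base list is illustrative.
install_requires = ["simplejson", "six"]

# ``ipaddress`` joined the standard library in Python 3.3, so the PyPI
# backport is only needed on older interpreters.
if sys.version_info < (3, 3):
    install_requires.append("ipaddress")
```

The same condition can be written declaratively with an environment marker, for example the requirement string ipaddress; python_version < "3.3", which also works in requirements.txt.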

headers with the same name get clobbered

some headers, such as Authentication-Results, can occur multiple times in a message. the current code clobbers previous values.

it seems like the following places need to be updated to support headers having lists of values found in a message. all headers could have values that are lists, or only the ones that have more than one value.

the current code uses email.message.get() but needs to use email.message.get_all(). https://docs.python.org/3/library/email.message.html#email.message.EmailMessage.get_all
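
A sketch of the get_all() approach using only the standard library. collect_headers is a hypothetical helper that takes the "every header is a list" option mentioned above:

```python
from email import message_from_string


def collect_headers(message):
    """Map each lower-cased header name to the list of all its values."""
    headers = {}
    for name in message.keys():
        key = name.lower()
        if key not in headers:
            # get_all() is case-insensitive and keeps the original order.
            headers[key] = message.get_all(name)
    return headers


raw = (
    "Authentication-Results: a; spf=pass\n"
    "Authentication-Results: b; dkim=pass\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
print(collect_headers(message_from_string(raw)))
```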

Issue parsing MSG file

Having issues parsing a .msg file.
I installed the dependencies and am using Python 3.5.

Ran simple test:

import mailparser
import json

fp = open("messagefile.msg", "r")

# tried to read in file directly with file in same directory. 
#mail = mailparser.parse_from_file('messagefile.msg')

# tried to read in file as a file object
mail = mailparser.parse_from_file_obj(fp)

print(mail.body)

Traceback (most recent call last):
  File "mail.py", line 12, in <module>
    print(mail.body)
  File "/opt/testenv/lib/python3.5/site-packages/mail_parser-3.3.1-py3.5.egg/mailparser/mailparser.py", line 491, in body
    return "\n--- mail_boundary ---\n".join(self.text_plain)
TypeError: can only join an iterable

Error: remove_email_envelope() function

Describe the bug
The method mailparser.parse_from_bytes(bt) cannot create a Mailparser object, because the remove_email_envelope(message) function called inside fails. The function tries to use a string pattern on a bytes-like object.

To Reproduce
Steps to reproduce the behavior:

import mailparser
mail_obj = open('TEST_Mail.eml', 'rb')
test_mail = mail_obj.read()
mail = mailparser.parse_from_bytes(test_mail)
See error

Expected behavior
Getting a Mailparser object.

Raw mail
Delivered-To: [email protected]
Received: by 10.140.178.13 with SMTP id a13cs354079rvf;
Fri, 21 Nov 2008 20:05:05 -0800 (PST)
Received: by 10.151.44.15 with SMTP id w15mr2254748ybj.98.1227326704711;
Fri, 21 Nov 2008 20:05:04 -0800 (PST)
Return-Path: [email protected]
Received: from mail11.tpgi.com.au (mail11.tpgi.com.au [203.12.160.161])
by mx.google.com with ESMTP id 10si5117885gxk.81.2008.11.21.20.05.03;
Fri, 21 Nov 2008 20:05:04 -0800 (PST)
Received-SPF: neutral (google.com: 203.12.160.161 is neither permitted nor denied by domain of [email protected]) client-ip=203.12.160.161;
Authentication-Results: mx.google.com; spf=neutral (google.com: 203.12.160.161 is neither permitted nor denied by domain of [email protected]) smtp.mail=[email protected]
X-TPG-Junk-Status: Message not scanned
X-TPG-Antivirus: Passed
Received: from [192.0.0.253] (60-241-138-146.static.tpgi.com.au [60.0.0.146])
by mail11.tpgi.com.au (envelope-from [email protected]) (8.14.3/8.14.3) with ESMTP id mAM44xew022221
for [email protected]; Sat, 22 Nov 2008 15:05:01 +1100
Message-Id: [email protected]
From: Mikel Lindsaar [email protected]
To: Mikel Lindsaar [email protected]
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v929.2)
Subject: Testing 123
Date: Sat, 22 Nov 2008 15:04:59 +1100
X-Mailer: Apple Mail (2.929.2)

Plain email.

Hope it works well!

Mikel

Environment:

OS: Linux
Docker: yes
mail-parser version 3.7.0

Additional context

Error:
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python3.6/dist-packages/mailparser/mailparser.py", line 116, in parse_from_bytes
    return MailParser.from_bytes(bt)
  File "/usr/local/lib/python3.6/dist-packages/mailparser/mailparser.py", line 243, in from_bytes
    envelope_present, bt = remove_email_envelope(bt)
  File "/usr/local/lib/python3.6/dist-packages/mailparser/utils.py", line 526, in remove_email_envelope
    True if EMAIL_ENVELOPE_PATTERN.search(message) else False
TypeError: cannot use a string pattern on a bytes-like object
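
The TypeError arises because a pattern compiled from a str cannot search bytes. One way around it is to decode the input before matching, as in this sketch. ENVELOPE_PATTERN is a hypothetical stand-in (the real EMAIL_ENVELOPE_PATTERN is not shown in this report), and strip_envelope is an illustrative name.

```python
import re

# Hypothetical stand-in for the library's pattern: a leading mbox "From " line.
ENVELOPE_PATTERN = re.compile(r"\A>?From .*\n")


def strip_envelope(message):
    """Return (envelope_present, message_text) for str or bytes input."""
    if isinstance(message, bytes):
        # Decode first so the str pattern can be applied.
        message = message.decode("utf-8", errors="ignore")
    match = ENVELOPE_PATTERN.match(message)
    if match:
        return True, message[match.end():]
    return False, message
```

Compiling a second pattern from a bytes literal would be the alternative if the function has to return bytes.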

New release on pypi?

Is your feature request related to a problem? Please describe.
Can you update the release on pypi? Would help me a lot thx!

Describe the solution you'd like
A new release i.e. with the content-disposition header for attachments.

KeyError: 'content-transfer-encoding'

I have an issue with an email from PayPal, as it doesn't have a 'content-transfer-encoding' header.
Because of this I get an exception raised by the replace_header function in message.py.

To Reproduce
Steps to reproduce the behavior:

import mailparser
(stuff for connection and retrieve the last unread email)
mail = mailparser.parse_from_bytes(data[0][1])

This works well for the rest of the emails, but when I read the one that lacks this header, I get this exception:

exception=KeyError('content-transfer-encoding',)>
Traceback (most recent call last):
  File "C:/Users/BOHEM/imapbot/beta/imapbot.py", line 47, in connectMail
    body = await searchMail(imap4ssl)
  File "C:/Users/BOHEM/imapbot/beta/imapbot.py", line 26, in searchMail
    mail = mailparser.parse_from_bytes(data[0][1])
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\site-packages\mailparser\mailparser.py", line 115, in parse_from_bytes
    return MailParser.from_bytes(bt)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\site-packages\mailparser\mailparser.py", line 238, in from_bytes
    return cls(message)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\site-packages\mailparser\mailparser.py", line 135, in __init__
    self.parse()
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\site-packages\mailparser\mailparser.py", line 338, in parse
    p_string = ported_string(p.as_string())[:100] + "..."
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\message.py", line 158, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\generator.py", line 116, in flatten
    self._write(msg)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\generator.py", line 181, in _write
    self._dispatch(msg)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\generator.py", line 214, in _dispatch
    meth(msg)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\generator.py", line 272, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\generator.py", line 116, in flatten
    self._write(msg)
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\generator.py", line 189, in _write
    msg.replace_header('content-transfer-encoding', munge_cte[0])
  File "C:\Users\BOHEM\AppData\Local\Programs\Python\Python36\lib\email\message.py", line 559, in replace_header
    raise KeyError(_name)
KeyError: 'content-transfer-encoding'

Raw mail
The raw mail to reproduce the behavior.
edited: (https://gist.github.com/Marenostrum81/dae5ce430687060832a9aa4b47d4d328).

Environment:

OS: [Windows]
Docker: [no]
mail-parser version [e.g. 3.8.1]

Additional context
For testing purposes, I modified replace_header function in message.py, from this:

def replace_header(self, _name, _value):
        _name = _name.lower()
        for i, (k, v) in zip(range(len(self._headers)), self._headers):
            if k.lower() == _name:
                self._headers[i] = self.policy.header_store_parse(k, _value)
                break
        else:
            raise KeyError(_name)

To this:

def replace_header(self, _name, _value):
        _name = _name.lower()
        for i, (k, v) in zip(range(len(self._headers)), self._headers):
            if k.lower() == _name:
                self._headers[i] = self.policy.header_store_parse(k, _value)
                break
        else:
            pass

With this modification, mailparser parsed the email correctly and I was able to read data such as mail.subject, mail.from_, mail.body, etc.
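
Patching the standard library is hard to ship. Since the traceback shows the failure inside p.as_string(), which is only used to build a 100-character preview, a caller-side fallback might be enough. The sketch below is an assumption about how that could look; safe_preview is a hypothetical helper and not part of mail-parser.

```python
def safe_preview(part, limit=100):
    """Return the first ``limit`` characters of a message part for logging."""
    try:
        text = part.as_string()
    except KeyError:
        # Flattening failed on a missing header; fall back to the raw payload.
        text = str(part.get_payload())
    return text[:limit] + "..."
```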

Create an official, Automated Build image on Docker Hub

Docker Hub allows you to create Automated Builds from source: https://docs.docker.com/docker-hub/builds/
It would add another packaging/distribution/installation method, with builds triggered automatically on each commit. It also allows creating different image tags from git tags and branches (making it possible to test :develop right away, as shown).
Also, documentation could easily include a canonical docker run statement to quickly try the tool with just a single command.

By making the image build via an AB, you give the resulting image verifiability and auditability. Also, the build is fully automatic. You can have the latest image tag build from HEAD and individual image tags from git's release tags.
Some people avoid non-verifiable (manually uploaded) images due to security & traceability reasons. Docker search command clearly displays AB when listing images.

Just a free Docker Hub account and a quick setup would do. Ping me if you need help.

Email content 'rtf' not handled

With the latest version of Spamscope mailparser v3.12.0, there is a warning log that I see popping up when parsing MSG files.

For each MSG file I parse, I keep getting this Email content 'rtf' not handled log. I know it appears because I am not handling the rtf part of the mail data, but how can I prevent this warning from appearing?
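
One possible workaround is to raise the level of the library's logger. The sketch assumes that the message is emitted at WARNING level through a logger named "mailparser" or one of its children (for example "mailparser.mailparser"); the report does not confirm the logger name, so verify it against the record's name field before relying on this.

```python
import logging

# Assumption: mail-parser logs under the "mailparser" logger namespace.
# Child loggers without an explicit level inherit this threshold.
logging.getLogger("mailparser").setLevel(logging.ERROR)
```

This hides every warning from that namespace, not only the 'rtf' one.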

six is fixed at v1.13.0

Describe the bug

six has released version 1.14.0.
The requirements.txt file pins six at v1.13.0: https://github.com/SpamScope/mail-parser/blob/develop/requirements.txt#L3
This causes dependency issues, as some libraries download the latest version while this one is pinned to 1.13.0.
Could we bump it, or change the operator to >=?

To Reproduce
Steps to reproduce the behavior:

Have 2 python modules that use six
One depends on six >= 1.13.0
One depends on six == 1.13.0

Expected behavior
Both Python Modules install fine

Actual behavior
We get a dependency conflict :(

Environment:

Suggestion print only mail headers

Currently you can print all mail headers, but some of them can already be retrieved individually: for example, mail.received returns only the Received header, and likewise for From, To, etc.

Suggestion: add a command that prints only the remaining mail headers (those not available through the other commands).
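
The idea can be sketched with the standard library. KNOWN_HEADERS mirrors the header properties listed in the project description, and remaining_headers is a hypothetical helper, not an existing mail-parser command.

```python
from email import message_from_string

# Headers that already have a dedicated property according to the description.
KNOWN_HEADERS = {
    "bcc", "cc", "date", "delivered-to", "from",
    "message-id", "received", "reply-to", "subject", "to",
}


def remaining_headers(message):
    """Return (name, value) pairs for headers without a dedicated property."""
    return [(name, value) for name, value in message.items()
            if name.lower() not in KNOWN_HEADERS]
```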

Read the charset of the attachment if given in the raw MIME

The problem
The charset property isn't read upon parsing MIME like the following snippet:

Content-Type: text/plain; charset="US-ASCII"; name="attachment1.txt"
Content-Disposition: attachment; filename="attachment1.txt"
Content-Transfer-Encoding: base64
Content-ID: <f_jnt7un642>
X-Attachment-Id: f_jnt7un642

MQ==
--000000000000d551e4057998ef2b
Content-Type: text/plain; charset="US-ASCII"; name="attachement2.txt"
Content-Disposition: attachment; filename="attachement2.txt"
Content-Transfer-Encoding: base64
Content-ID: <f_jnt7un5p1>
X-Attachment-Id: f_jnt7un5p1

Mg==

Note that the charset="US-ASCII" is set for both attachments above!

However, mail-parser doesn't parse the charset. For example, the following is the output of parsing the above attachments, without any charset field:

[
    {
        "filename": "attachment1.txt",
        "payload": "MQ==",
        "binary": true,
        "mail_content_type": "text/plain",
        "content-id": "<f_jnt7un642>",
        "content_transfer_encoding": "base64"
    },
    {
        "filename": "attachement2.txt",
        "payload": "Mg==",
        "binary": true,
        "mail_content_type": "text/plain",
        "content-id": "<f_jnt7un5p1>",
        "content_transfer_encoding": "base64"
    }
]

The proposed solution
The parsed attachment objects should contain the 'charset' field if given in the MIME data.
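
For reference, the standard library already exposes the declared charset of each part through get_content_charset(). A sketch of how the field could be collected (attachment_charsets is a hypothetical helper, and the output shape is only illustrative):

```python
from email import message_from_string


def attachment_charsets(raw_mail):
    """List filename and declared charset for every part that has a filename."""
    message = message_from_string(raw_mail)
    result = []
    for part in message.walk():
        filename = part.get_filename()
        if filename:
            result.append({
                "filename": filename,
                # Lower-cased charset from Content-Type, or None if absent.
                "charset": part.get_content_charset(),
            })
    return result
```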

Question: Can mail-parser recover nested messages?

I hope it's okay if I'm asking this here.

A lot of data sets (e.g. Enron) contain only plain text messages or no explicit information whether an email has been forwarded or replied.

In order to reconstruct this information I was thinking about trying to recover nested messages e.g. (just an example)

Message-ID: <10792109.1075856435465.JavaMail.evans@thyme>
Date: Fri, 4 May 2001 10:39:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Lunch

Hey, when do you want to have lunch today?

and

Message-ID: <12168444.1075856436014.JavaMail.evans@thyme>
From: [email protected]
To: [email protected]
Subject: Re: Lunch

How about 12:30 pm ?

 -----Original Message-----
[email protected]
05/04/2001 10:39 PM
To: [email protected]
Subject: Lunch

Hey, when do you want to have lunch today?

Where the first message "Hey, when do you.." would be labelled for example as HAS_ANSWERED.

Can this be done with mail-parser or are there any libraries which could do something like that?
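
As a starting point, a body can be split on the "-----Original Message-----" marker shown in the example. This is a plain-text heuristic sketch, not a mail-parser feature. The marker pattern and the name split_quoted_messages are assumptions, and real corpora use many other quoting styles.

```python
import re

# Matches a line such as " -----Original Message-----".
ORIGINAL_MESSAGE_MARKER = re.compile(
    r"^[ \t]*-{2,}[ \t]*Original Message[ \t]*-{2,}[ \t]*$",
    re.MULTILINE,
)


def split_quoted_messages(body):
    """Split a body into the newest text followed by each quoted message."""
    chunks = ORIGINAL_MESSAGE_MARKER.split(body)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

Matching each quoted chunk back to an earlier message (for example by subject and body text) would then be a separate step.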

mail.attachment

I think there is an issue with mail.attachments.
It doesn't return any attachment.

This is my test case in ipython3

import mailparser

a="""Received: from EX3.local (172.16.2.3) by EX3.local (172.16.2.3) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.1415.2 via Mailbox
 Transport; Fri, 12 Jan 2018 12:48:13 +0100
Received: from [10.42.106.119] (10.10.254.33) by EX3.local (172.16.2.3)
 with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.1415.2; Fri, 12
 Jan 2018 12:48:13 +0100
To: Vincent  <[email protected]>
From: Simon  <[email protected]>
Subject: test eml attachment
Message-ID: <[email protected]>
Date: Fri, 12 Jan 2018 12:46:42 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.4.0
Content-Type: multipart/mixed;
    boundary="------------55D5C868D71CE57DC948563F"
Content-Language: en-US
Return-Path: [email protected]
X-MS-Exchange-Organization-Network-Message-Id: 6a810f66-4b65-4a5f-da82-08d559b25c95
X-MS-Exchange-Organization-AuthSource: EX3.local
X-MS-Exchange-Organization-AuthAs: Internal
X-MS-Exchange-Organization-AuthMechanism: 07
X-Originating-IP: [10.10.254.33]
X-ClientProxiedBy: ex2.local (172.16.2.2) To EX3.local (172.16.2.3)
X-MS-Exchange-Transport-EndToEndLatency: 00:00:00.2112587
X-MS-Exchange-Processed-By-BccFoldering: 15.01.1415.002
MIME-Version: 1.0

--------------55D5C868D71CE57DC948563F
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

test


--------------55D5C868D71CE57DC948563F
Content-Type: message/rfc822; name="Attached Message"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="Attached Message"

Received: from ex2.local (172.16.2.2) by EX3.local (172.16.2.3) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.1415.2 via Mailbox
 Transport; Fri, 12 Jan 2018 12:47:07 +0100
Received: from EX3.local (172.16.2.3) by EX2.local (172.16.2.2) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.1415.2; Fri, 12
 Jan 2018 12:47:07 +0100
Received: from EX3.local ([fe80::7806:9f98:c0b4:db47]) by EX3.local
 ([fe80::7806:9f98:c0b4:db47%13]) with mapi id 15.01.1415.002; Fri, 12 Jan
 2018 12:47:07 +0100
From: Vincent  <[email protected]>
To: Simon  <[email protected]>
Subject: testing attachement
Thread-Topic: testing attachement
Thread-Index: AQHTi5sSajMVfq8xMUuQxsxmTSCh5A==
Date: Fri, 12 Jan 2018 12:47:07 +0100
Message-ID: <[email protected]>
Accept-Language: fr-FR, en-US
Content-Language: en-US
X-MS-Exchange-Organization-AuthAs: Internal
X-MS-Exchange-Organization-AuthMechanism: 04
X-MS-Exchange-Organization-AuthSource: EX3.local
X-MS-Has-Attach:
X-MS-Exchange-Organization-Network-Message-Id: 77b61742-2589-475b-1765-08d559b2355c
X-MS-Exchange-Organization-SCL: -1
X-MS-TNEF-Correlator:
X-MS-Exchange-Organization-RecordReviewCfmType: 0
x-mailer: Apple Mail (2.3445.5.20)
Content-Type: text/plain; charset="us-ascii"
Content-ID: <[email protected]>
MIME-Version: 1.0

testing attachement

--------------55D5C868D71CE57DC948563F--
"""
mail = mailparser.parse_from_string(a)
mail.message_as_string
mail.attachments
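
For comparison, the standard library represents a message/rfc822 part as a nested message object. The sketch below shows how such attached messages can be located with email alone. find_attached_messages is a hypothetical helper and says nothing about how mail-parser handles these parts internally.

```python
from email import message_from_string


def find_attached_messages(raw_mail):
    """Return the nested messages carried as message/rfc822 parts."""
    message = message_from_string(raw_mail)
    nested = []
    for part in message.walk():
        if part.get_content_type() == "message/rfc822" and part.is_multipart():
            # The payload of a message/rfc822 part is a one-element list.
            nested.append(part.get_payload(0))
    return nested
```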

PST files

I wonder if it would be possible to add functionality to read the PST (Outlook) format? This could be relevant: https://github.com/libyal/libpff

Attachment Payload Shouldn't be Trimmed

Describe the bug

I'm trying to generate MD5 hashes of attachments found in email files. I'm using that MD5 in a tool like VirusTotal to determine if the attachment is malicious or not. However, the attachment payloads appear to be getting stripped, which is changing the MD5 hash of some attachments. This prevents me from accurately looking up these attachments in tools like the one mentioned above.

To Reproduce

Steps to reproduce the behavior:

download the .eml file attached to this issue and unzip it
Using Python 3 (I'm using 3.6.9):

import mailparser
f = open('/path/to/example.eml', 'rb')
mail = mailparser.parse_from_bytes(f.read())
f.close()
attachment = mail.attachments[0]

import hashlib
hashlib.md5(attachment.get("payload").encode(attachment.get("charset"))).hexdigest()
# outputs '442972dbdaba2b9c8c742b4e35a61e70'

I used the email library to parse the same eml file and found that the payload of the attachment with the email library ended with </html>\r\n\t=\r\n, whereas the payload from the mailparser library ended with </html>.

Expected behavior

The correct MD5 hash for the attachment is 'a4546bf059509d6f13a81436214972d7'. I determined this by saving the attachment from the .eml file and using the following powershell:

Get-FileHash -Path "C:\path\to\Voicemail Audio.html" -Algorithm MD5

Algorithm       Hash
---------       ----
MD5             A4546BF059509D6F13A81436214972D7

Raw mail

Attached below is the zipped directory 'example.zip'. Within this directory there is one email file, 'example.eml'. This is an email with an attachment in it. I don't recommend opening the attachment in the email (Voicemail Audio.html) on your machine, I'm unsure of what it does (if anything).

example.zip

Environment:

OS: Ubuntu 18.04.4 LTS
Docker: no
mail-parser version: 3.12.0

Additional context

I forked the repo and noticed that the function ported_string in mailparser>utils.py used the .strip() method. I took out all of the .strip() method calls in that function and ran the eml file through that 'updated' version and the MD5 came out correct. However, I'm assuming this modification could mess up other things, as I see ported_string gets used elsewhere.
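
Until the stripping is resolved, hashes can be computed from the raw bytes with the standard library, which does not trim the payload. A sketch under that assumption (attachment_md5s is a hypothetical helper, not part of mail-parser):

```python
import hashlib
from email import message_from_bytes


def attachment_md5s(raw_mail):
    """Map each attachment filename to the MD5 of its decoded bytes."""
    message = message_from_bytes(raw_mail)
    digests = {}
    for part in message.walk():
        filename = part.get_filename()
        if filename is None:
            continue
        # decode=True undoes base64 or quoted-printable without stripping.
        data = part.get_payload(decode=True) or b""
        digests[filename] = hashlib.md5(data).hexdigest()
    return digests
```

Attachments that share a filename would overwrite each other in this dictionary; a list of pairs avoids that if it matters.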
