
eml_analyzer's People

Contributors

murphylo, rinkattendant6, wahlflo

eml_analyzer's Issues

Reading from standard input

It would be nice if the script could read the EML from standard input, similar to how many other CLI tools work in Unix (and, to some extent, PowerShell).

$ cat my_cat_photos.eml | emlAnalyzer

# a more practical example:
$ curl -X GET 'https://graph.microsoft.com/v1.0/me/messages/.../attachments/.../$value' | emlAnalyzer

This would eliminate the need to have the EML saved on disk.
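
For what it's worth, a minimal sketch of what the fallback could look like, using only the standard library (the argument handling below is illustrative, not the tool's actual CLI code):

import sys
import argparse
from email import message_from_bytes
from typing import Optional


def load_eml_bytes(path: Optional[str]) -> bytes:
    """Read the raw EML from the given file, or from stdin when no path is given."""
    if path is None or path == '-':
        # nothing passed via --input: consume whatever was piped in
        return sys.stdin.buffer.read()
    with open(path, 'rb') as handle:
        return handle.read()


if __name__ == '__main__':
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('-i', '--input', default=None,
                            help='path to the EML file; omit or pass "-" to read from stdin')
    args = arg_parser.parse_args()
    message = message_from_bytes(load_eml_bytes(args.input))
    print(message.get('Subject'))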

Structured output

This tool is excellent for interactive usage. It would be really nice to be able to produce structured output (e.g. JSON, YAML, XML).

Example:

$ emlAnalyzer --input my_cat_photos.eml  --header --tracking --attachments --text --html --url --format=json

Output:

{
  "headers": {
    "Received": ["...", "...", "..."],
    "From": ["Bob <[email protected]>"],
    "To": ["Alice <[email protected]>"]
  },
  "attachments": [
    { "name": "meow.jpg", "disposition": "inline", "type": "image/jpg" },
    { "name": "cat.png", "disposition": "inline", "type": "image/png" }
  ],
  "text": "hi Alice, attached are some cat photos. some more can be found at https://http.cat/200",
  "html": "<html><body><p>hi Alice,</p><p>attached are some cat photos</p><p>some more can be found at <a href='https://http.cat/200'>https://http.cat/200</a></p></body></html>",
  "urls": [
    "https://http.cat/200"
  ],
  "reloaded_content": []
}

I haven't really thought about the --extract-all flag, but perhaps it could be supported by adding a content property to each attachment object containing its base64-encoded representation. Alternatively, attachment extraction could simply not be supported when returning structured output.
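
For illustration, a rough sketch of how attachment metadata plus optional base64 content could be serialized with only the standard library (the field names are assumptions modeled on the example above, not an actual schema of the tool):

import base64
import json
from email.message import Message


def attachments_as_json(message: Message, include_content: bool = False) -> str:
    """Collect attachment metadata (and optionally base64 payloads) into a JSON string."""
    attachments = []
    for part in message.walk():
        if part.get_content_maintype() == 'multipart':
            continue  # containers, not payloads
        if part.get_filename() is None:
            continue  # skip body parts that are not attachments
        entry = {
            'name': part.get_filename(),
            'disposition': part.get_content_disposition(),
            'type': part.get_content_type(),
        }
        if include_content:
            # decode=True undoes the transfer encoding (base64 / quoted-printable)
            entry['content'] = base64.b64encode(part.get_payload(decode=True)).decode('ascii')
        attachments.append(entry)
    return json.dumps({'attachments': attachments}, indent=2)

Calling attachments_as_json(parsed_message, include_content=True) would then cover --extract-all, at the cost of a much larger JSON document; leaving include_content off matches the metadata-only output above.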

Problem installing on Remnux

Hello,
I am running the REMnux distribution (an Ubuntu-based distribution for malware analysis) on a virtual machine.
The problem is that during installation I get this error: https://pastebin.com/raw/RigS9EuA

I have tried the following installation methods:

  • Installation with pip install
  • Compiling from source using git clone and make

Please help me install it.

eml_analyzer --text outputs Cyrillic characters instead of German umlauts

Hi,

I am facing an issue when parsing outgoing mails:

Extract from the EML file (saved via Thunderbird's "Save As"):
"...
This is a cryptographically signed message in MIME format.

--------------ms090908070501060903060609
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

ä

..."

If the plain text contains, for example, an "ä", the output character becomes the Cyrillic "д", regardless of whether the chosen output format is "--text" or "--text --format json".

For comparison, an extract from an EML file that gets parsed correctly:

"...
--eTqZtiOboXMORarM2jeks2PNUJpOw=_O7X
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
...
"

Any idea?

I have attached a test file.

Regards,
MrChang.
test.zip
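
A quick way to sanity-check a part outside the tool is to decode it manually with the charset it declares; a minimal diagnostic snippet (this is not the tool's own decoding logic, and the file name is just a placeholder):

from email import message_from_binary_file

# Decode each text/plain part with the charset it declares, falling back to UTF-8.
with open('test.eml', 'rb') as handle:            # placeholder path
    message = message_from_binary_file(handle)

for part in message.walk():
    if part.get_content_type() != 'text/plain':
        continue
    charset = part.get_content_charset() or 'utf-8'
    raw_bytes = part.get_payload(decode=True)     # undoes 8bit / quoted-printable / base64
    print(raw_bytes.decode(charset, errors='replace'))

If the literal "ä" comes out of this snippet but not out of emlAnalyzer, the declared UTF-8 charset is probably being ignored or overridden somewhere along the tool's decoding path.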

UnicodeEncodeError

This is a very convenient and powerful tool; but when I use it to parse emails (Chinese or English) in batches, the following error occasionally occurs:
File "D:\anaconda3\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\anaconda3\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\anaconda3\Scripts\emlAnalyzer.exe_main
.py", line 7, in
File "D:\anaconda3\lib\site-packages\eml_analyzer\cli_script.py", line 66, in main
output_format.process_option_show_html(parsed_email=parsed_email)
File "D:\anaconda3\lib\site-packages\eml_analyzer\library\outputs\standard_output.py", line 62, in process_option_show_html
print(html)
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 45613: illegal multibyte sequence

I don't have a solution for this encoding problem; my execution environment is Windows.
Due to the confidentiality of the email content, I am sorry that I cannot provide a sample.
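
The traceback points at the final print() call rather than the parsing itself: on Windows the console encoding defaults to GBK, which cannot represent '\xa0'. Until the tool handles this itself, a common generic workaround is to set PYTHONIOENCODING=utf-8 before running it, or, when driving the library from your own batch script, to reconfigure stdout (this is a general Python workaround, not something specific to eml_analyzer):

import sys

# Emit UTF-8 instead of the Windows console code page, replacing anything the
# terminal still cannot display; sys.stdout.reconfigure() exists since Python 3.7.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

print('\xa0 now prints without raising UnicodeEncodeError')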

TypeError: 'type' object is not subscriptable

v2.0.0 no longer works on Python 3.8 due to this issue: https://stackoverflow.com/q/63460126/404623

I can confirm the issue by running it inside Docker:

$ docker run -it -v "$PWD"/emails:/srv -w /srv python:3.8 /bin/bash -c "python3 -m pip install eml-analyzer && cat foo2.eml | emlAnalyzer"
...<snip>...
Traceback (most recent call last):
  File "/usr/local/bin/emlAnalyzer", line 5, in <module>
    from eml_analyzer.cli_script import main
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/cli_script.py", line 7, in <module>
    from eml_analyzer.library.outputs import AbstractOutput, StandardOutput, JsonOutput
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/outputs/__init__.py", line 1, in <module>
    from .abstract_output import AbstractOutput
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/outputs/abstract_output.py", line 3, in <module>
    from eml_analyzer.library.parser.parsed_email import ParsedEmail
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/parser/__init__.py", line 1, in <module>
    from .parsed_email import ParsedEmail, EmlParsingException, PayloadDecodingException
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/parser/parsed_email.py", line 17, in <module>
    class ParsedEmail:
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/parser/parsed_email.py", line 34, in ParsedEmail
    def get_error_messages(self) -> list[str]:
TypeError: 'type' object is not subscriptable

It works fine on Python 3.11:

$ docker run -it -v "$PWD"/emails:/srv -w /srv python:3.11 /bin/bash -c "python3 -m pip install eml-analyzer && cat foo2.eml | emlAnalyzer"
...<snip>...
 =================
 ||  Structure  ||
 =================
|- multipart/mixed                       
|  |- multipart/related                  
|  |  |- multipart/alternative           
|  |  |  |- text/plain                   
|  |  |  |- text/html                    
|  |  |- image/png                         [1f438.png]
|  |- image/png                            [cover.png]

As of right now, the supported versions of Python are 3.7+. I discovered the issue as my WSL runs Ubuntu 20.04 LTS which comes with Python 3.8.

It might be worth documenting and enforcing the version requirement of >=3.9 if that's what's desired, or implementing the workaround in the SO question if supporting Python <3.9 is necessary.
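
For reference, the workaround from the Stack Overflow question amounts to either postponing annotation evaluation or falling back to typing.List; a minimal illustration of the failing pattern and both fixes (the class name below is a made-up stand-in, not the package's actual source):

# parsed_email.py fails at class-definition time on Python 3.8 because built-in
# generics such as list[str] only became usable in annotations with Python 3.9:
#
#     def get_error_messages(self) -> list[str]: ...
#
# Two ways to keep an equivalent annotation working on 3.7/3.8:

from __future__ import annotations   # option 1: annotations are no longer evaluated
from typing import List              # option 2: use the typing generics instead


class ParsedEmailSketch:             # stand-in class, not the real ParsedEmail
    def get_error_messages(self) -> list[str]:      # fine with the __future__ import
        return []

    def get_error_messages_alt(self) -> List[str]:  # fine on 3.7+ even without it
        return []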

URL extraction a bit greedy

The changes to the URL extraction in v2.0.0 cause the results to have quotes or chunks of HTML at the end, and sometimes encoded HTML in the middle rather than the actual characters (& is pretty common in URLs, so this usually shows up as &amp;). In some cases the output no longer includes the actual URLs at all (see the examples below).

Actual output

Text and HTML included for reference purposes. Notice the &amp; in some of the URLs:

 ==================================
 ||  URLs in HTML and text part  ||
 ==================================
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&amp;reserved=0"
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&amp;reserved=0
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&amp;reserved=0"
 - https://carleton.ca/</a></span></div>
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&amp;reserved=0
 - https://twitter.com/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&reserved=0>
 - https://twitter.com/</a></span></span>
 - https://carleton.ca/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&reserved=0>

 =================
 ||  Plaintext  ||
 =================
https://twitter.com/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&reserved=0>

https://carleton.ca/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&reserved=0>

[cid:d2eb1b04-21d5-4a18-bfdb-12c4c3e65b8a]


 ============
 ||  HTML  ||
 ============
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
<span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);"><span class="x_elementToProof FluidPluginCopy" style="font-size:15px;font-family:&quot;Segoe UI&quot;, &quot;Segoe UI Web (West European)&quot;, &quot;Segoe UI&quot;, -apple-system, BlinkMacSystemFont, Roboto, &quot;Helvetica Neue&quot;, sans-serif;margin:0px;color:rgb(36, 36, 36);background-color:rgb(255, 255, 255)"><span style="font-size:12pt;font-family:Calibri, Arial, Helvetica, sans-serif;margin:0px;color:rgb(0, 0, 0);background-color:rgb(255, 255, 255)"><a href="https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&amp;reserved=0" originalsrc="https://twitter.com/" shash="qzkaBC2sLxgP1TgKDzZQ22fgeb1b7BsI5lFHyR43eYo7m7MI1zAsITkXPZdZ6n1KIL8l5zsW9ZYym8Zh696/mItL4iYvJ0Ubwz1de7W+ONeLI4b2ew0tg4HzBWjz70b4QTUpxwi7deoO2c/HOahf1M884A1dPLtJtc8s+ZBrhcA=" target="_blank" rel="noopener noreferrer" data-auth="NotApplicable" data-safelink="true" data-linkindex="0" style="margin:0px" class="ContentPasted0">https://twitter.com/</a></span></span>
<div class="x_elementToProof FluidPluginCopy" style="font-size:15px;font-family:&quot;Segoe UI&quot;, &quot;Segoe UI Web (West European)&quot;, &quot;Segoe UI&quot;, -apple-system, BlinkMacSystemFont, Roboto, &quot;Helvetica Neue&quot;, sans-serif;margin:0px;color:rgb(36, 36, 36);background-color:rgb(255, 255, 255)">
<span style="font-size:12pt;font-family:Calibri, Arial, Helvetica, sans-serif;margin:0px;color:rgb(0, 0, 0);background-color:rgb(255, 255, 255)"><br class="ContentPasted0">
</span></div>
<div class="x_elementToProof FluidPluginCopy" style="font-size:15px;font-family:&quot;Segoe UI&quot;, &quot;Segoe UI Web (West European)&quot;, &quot;Segoe UI&quot;, -apple-system, BlinkMacSystemFont, Roboto, &quot;Helvetica Neue&quot;, sans-serif;margin:0px;color:rgb(36, 36, 36);background-color:rgb(255, 255, 255)">
<span style="font-size:12pt;font-family:Calibri, Arial, Helvetica, sans-serif;margin:0px;color:rgb(0, 0, 0);background-color:rgb(255, 255, 255)"><a href="https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&amp;reserved=0" originalsrc="https://carleton.ca/" shash="Dxc62cEP1Yg+/wKXMo/VoujSwhva4+Frdv2Yr8iQuG/kuzsq8b6WfRRSuA3H0L4B+GRsbWHMjTjX4Mg2/0vuwo9UW9HOglt0hd7TcsPjzwi8IgUT3bVowbeQfPFAoMMdOWkKIzbKe4Ax/2E4rJf8j/m4b+N+/72C5VvaPLBgd+8=" target="_blank" rel="noopener noreferrer" data-auth="NotApplicable" data-safelink="true" data-linkindex="2" style="margin:0px" class="ContentPasted0">https://carleton.ca/</a></span></div>
<br>
</span></div>
<div class="elementToProof" style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
<span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);"><img style="max-width: 100%;" class="w-240 h-240" size="6492" contenttype="image/png" data-outlook-trace="F:1|T:1" src="cid:d2eb1b04-21d5-4a18-bfdb-12c4c3e65b8a"><br>
</span></div>
</body>
</html>

Expected output for the extracted URLs:

 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&reserved=0
 - https://carleton.ca/
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&reserved=0
 - https://twitter.com/

URLExtract

There is a library called URLExtract which does exactly what the name says. I haven't used it, though it appears to have some dependencies and requires an Internet connection to download a list of TLDs. It carries a risk of false positives (see the known issues in its README), and adding it as a dependency to this library would increase the overall complexity by quite a bit, IMHO.

Alternative options... perhaps the existing regex can be improved with some of the ideas here?
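
One lighter-weight direction, sketched below with only the standard library: unescape entities before matching, take href values straight from anchor tags instead of regexing the raw HTML, and trim trailing punctuation from plain-text matches. This is only an illustration of the idea, not a drop-in patch:

import html
import re
from html.parser import HTMLParser

URL_PATTERN = re.compile(r"""https?://[^\s<>"']+""")
_TRAILING_JUNK = '>).,;"\''   # characters that often cling to the end of a plain-text match


class _HrefCollector(HTMLParser):
    """Pull href values out of anchor tags so surrounding markup never leaks into the URL."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.urls.append(value)


def extract_urls_from_html(html_text: str) -> list:
    collector = _HrefCollector()
    collector.feed(html_text)
    return collector.urls


def extract_urls_from_text(text: str) -> list:
    # Unescape first so &amp; comes back as &, then trim trailing punctuation.
    candidates = URL_PATTERN.findall(html.unescape(text))
    return [url.rstrip(_TRAILING_JUNK) for url in candidates]

HTMLParser should also hand back attribute values with entities already unescaped, which would take care of the &amp; problem for the HTML part, and because the regex excludes angle brackets the url<wrapped-url> convention in the plaintext part falls apart into two separate matches instead of one glued-together string.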
