
eml_analyzer's People

Contributors

murphylo, rinkattendant6, wahlflo

eml_analyzer's Issues

Reading from standard input

It would be nice if the script could read the EML from standard input, similar to how many other CLI tools work in Unix (and, to some extent, PowerShell).

$ cat my_cat_photos.eml | emlAnalyzer

# a more practical example:
$ curl -X GET 'https://graph.microsoft.com/v1.0/me/messages/.../attachments/.../$value' | emlAnalyzer

This would eliminate the need to have the EML saved on disk.
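
For what it's worth, a minimal sketch of what the fallback could look like, using only the standard library (the argument handling below is illustrative, not the tool's actual CLI code):

import sys
import argparse
from email import message_from_bytes
from typing import Optional


def load_eml_bytes(path: Optional[str]) -> bytes:
    """Read the raw EML from the given file, or from stdin when no path is given."""
    if path is None or path == '-':
        # nothing passed via --input: consume whatever was piped in
        return sys.stdin.buffer.read()
    with open(path, 'rb') as handle:
        return handle.read()


if __name__ == '__main__':
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('-i', '--input', default=None,
                            help='path to the EML file; omit or pass "-" to read from stdin')
    args = arg_parser.parse_args()
    message = message_from_bytes(load_eml_bytes(args.input))
    print(message.get('Subject'))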

Structured output

This tool is excellent for interactive usage. It would be really nice to be able to produce structured output (e.g. JSON, YAML, XML).

Example:

$ emlAnalyzer --input my_cat_photos.eml  --header --tracking --attachments --text --html --url --format=json

Output:

{
  "headers": {
    "Received": ["...", "...", "..."],
    "From": ["Bob <[email protected]>"],
    "To": ["Alice <[email protected]>"]
  },
  "attachments": [
    { "name": "meow.jpg", "disposition": "inline", "type": "image/jpg" },
    { "name": "cat.png", "disposition": "inline", "type": "image/png" }
  ],
  "text": "hi Alice, attached are some cat photos. some more can be found at https://http.cat/200",
  "html": "<html><body><p>hi Alice,</p><p>attached are some cat photos</p><p>some more can be found at <a href='https://http.cat/200'>https://http.cat/200</a></p></body></html>",
  "urls": [
    "https://http.cat/200"
  ],
  "reloaded_content": []
}

I haven't really thought about the --extract-all flag, but perhaps it could be supported by adding a content property to each attachment object containing its base64-encoded representation. Alternatively, attachment extraction could simply not be supported when returning structured output.
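
For illustration, a rough sketch of how attachment metadata plus optional base64 content could be serialized with only the standard library (the field names are assumptions modeled on the example above, not an actual schema of the tool):

import base64
import json
from email.message import Message


def attachments_as_json(message: Message, include_content: bool = False) -> str:
    """Collect attachment metadata (and optionally base64 payloads) into a JSON string."""
    attachments = []
    for part in message.walk():
        if part.get_content_maintype() == 'multipart':
            continue  # containers, not payloads
        if part.get_filename() is None:
            continue  # skip body parts that are not attachments
        entry = {
            'name': part.get_filename(),
            'disposition': part.get_content_disposition(),
            'type': part.get_content_type(),
        }
        if include_content:
            # decode=True undoes the transfer encoding (base64 / quoted-printable)
            entry['content'] = base64.b64encode(part.get_payload(decode=True)).decode('ascii')
        attachments.append(entry)
    return json.dumps({'attachments': attachments}, indent=2)

Calling attachments_as_json(parsed_message, include_content=True) would then cover --extract-all, at the cost of a much larger JSON document; leaving include_content off matches the metadata-only output above.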

Problem installing on Remnux

Hello,
I am running the REMnux distribution (an Ubuntu-based distribution for malware analysis) on a virtual machine.
The problem is that during installation I get this error: https://pastebin.com/raw/RigS9EuA

I have tried the following installation methods:

  • Installation with pip install
  • Compiling from source using git clone and make

Please help me install it.

eml_analyzer --text outputs Cyrillic characters instead of German umlauts

Hi,

I am facing an issue when parsing outgoing mails:

Extract from the EML file (saved via Thunderbird's "Save As"):
"...
This is a cryptographically signed message in MIME format.

--------------ms090908070501060903060609
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

ä

..."

If the plain text contains, for example, an "ä", the output character becomes the Cyrillic "д", regardless of whether the chosen output format is "--text" or "--text --format json".

For comparison, an extract from an EML file that gets parsed correctly:

"...
--eTqZtiOboXMORarM2jeks2PNUJpOw=_O7X
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
...
"

Any idea?

I have attached a test file.

Regards,
MrChang.
test.zip
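
A quick way to sanity-check a part outside the tool is to decode it manually with the charset it declares; a minimal diagnostic snippet (this is not the tool's own decoding logic, and the file name is just a placeholder):

from email import message_from_binary_file

# Decode each text/plain part with the charset it declares, falling back to UTF-8.
with open('test.eml', 'rb') as handle:            # placeholder path
    message = message_from_binary_file(handle)

for part in message.walk():
    if part.get_content_type() != 'text/plain':
        continue
    charset = part.get_content_charset() or 'utf-8'
    raw_bytes = part.get_payload(decode=True)     # undoes 8bit / quoted-printable / base64
    print(raw_bytes.decode(charset, errors='replace'))

If the literal "ä" comes out of this snippet but not out of emlAnalyzer, the declared UTF-8 charset is probably being ignored or overridden somewhere along the tool's decoding path.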

UnicodeEncodeError

This is a very convenient and powerful tool; but when I use it to parse emails (Chinese or English) in batches, the following error occasionally occurs:
File "D:\anaconda3\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\anaconda3\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\anaconda3\Scripts\emlAnalyzer.exe_main
.py", line 7, in
File "D:\anaconda3\lib\site-packages\eml_analyzer\cli_script.py", line 66, in main
output_format.process_option_show_html(parsed_email=parsed_email)
File "D:\anaconda3\lib\site-packages\eml_analyzer\library\outputs\standard_output.py", line 62, in process_option_show_html
print(html)
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 45613: illegal multibyte sequence

I don't have a solution for this encoding problem; my execution environment is Windows.
Due to the confidentiality of the email content, I am sorry that I cannot provide a sample.
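
The traceback points at the final print() call rather than the parsing itself: on Windows the console encoding defaults to GBK, which cannot represent '\xa0'. Until the tool handles this itself, a common generic workaround is to set PYTHONIOENCODING=utf-8 before running it, or, when driving the library from your own batch script, to reconfigure stdout (this is a general Python workaround, not something specific to eml_analyzer):

import sys

# Emit UTF-8 instead of the Windows console code page, replacing anything the
# terminal still cannot display; sys.stdout.reconfigure() exists since Python 3.7.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

print('\xa0 now prints without raising UnicodeEncodeError')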

TypeError: 'type' object is not subscriptable

v2.0.0 no longer works on Python 3.8 due to this issue: https://stackoverflow.com/q/63460126/404623

I can confirm the issue by running it inside Docker:

$ docker run -it -v "$PWD"/emails:/srv -w /srv python:3.8 /bin/bash -c "python3 -m pip install eml-analyzer && cat foo2.eml | emlAnalyzer"
...<snip>...
Traceback (most recent call last):
  File "/usr/local/bin/emlAnalyzer", line 5, in <module>
    from eml_analyzer.cli_script import main
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/cli_script.py", line 7, in <module>
    from eml_analyzer.library.outputs import AbstractOutput, StandardOutput, JsonOutput
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/outputs/__init__.py", line 1, in <module>
    from .abstract_output import AbstractOutput
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/outputs/abstract_output.py", line 3, in <module>
    from eml_analyzer.library.parser.parsed_email import ParsedEmail
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/parser/__init__.py", line 1, in <module>
    from .parsed_email import ParsedEmail, EmlParsingException, PayloadDecodingException
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/parser/parsed_email.py", line 17, in <module>
    class ParsedEmail:
  File "/usr/local/lib/python3.8/site-packages/eml_analyzer/library/parser/parsed_email.py", line 34, in ParsedEmail
    def get_error_messages(self) -> list[str]:
TypeError: 'type' object is not subscriptable

It works fine on Python 3.11:

$ docker run -it -v "$PWD"/emails:/srv -w /srv python:3.11 /bin/bash -c "python3 -m pip install eml-analyzer && cat foo2.eml | emlAnalyzer"
...<snip>...
 =================
 ||  Structure  ||
 =================
|- multipart/mixed                       
|  |- multipart/related                  
|  |  |- multipart/alternative           
|  |  |  |- text/plain                   
|  |  |  |- text/html                    
|  |  |- image/png                         [1f438.png]
|  |- image/png                            [cover.png]

As of right now, the supported versions of Python are 3.7+. I discovered the issue as my WSL runs Ubuntu 20.04 LTS which comes with Python 3.8.

It might be worth documenting and enforcing the version requirement of >=3.9 if that's what's desired, or implementing the workaround in the SO question if supporting Python <3.9 is necessary.
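
For reference, the workaround from the Stack Overflow question amounts to either postponing annotation evaluation or falling back to typing.List; a minimal illustration of the failing pattern and both fixes (the class name below is a made-up stand-in, not the package's actual source):

# parsed_email.py fails at class-definition time on Python 3.8 because built-in
# generics such as list[str] only became usable in annotations with Python 3.9:
#
#     def get_error_messages(self) -> list[str]: ...
#
# Two ways to keep an equivalent annotation working on 3.7/3.8:

from __future__ import annotations   # option 1: annotations are no longer evaluated
from typing import List              # option 2: use the typing generics instead


class ParsedEmailSketch:             # stand-in class, not the real ParsedEmail
    def get_error_messages(self) -> list[str]:      # fine with the __future__ import
        return []

    def get_error_messages_alt(self) -> List[str]:  # fine on 3.7+ even without it
        return []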

URL extraction a bit greedy

The changes to the URL extraction in v2.0.0 cause the results to have quotes or chunks of HTML at the end, and sometimes encoded HTML in the middle rather than the actual characters (& is pretty common in URLs, so this usually shows up as &amp;). In some cases the output no longer includes the actual URLs at all (see the examples below).

Actual output

Text and HTML included for reference purposes. Notice the &amp; in some of the URLs:

 ==================================
 ||  URLs in HTML and text part  ||
 ==================================
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&amp;reserved=0"
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&amp;reserved=0
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&amp;reserved=0"
 - https://carleton.ca/</a></span></div>
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&amp;reserved=0
 - https://twitter.com/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&reserved=0>
 - https://twitter.com/</a></span></span>
 - https://carleton.ca/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&reserved=0>

 =================
 ||  Plaintext  ||
 =================
https://twitter.com/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&reserved=0>

https://carleton.ca/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&reserved=0>

[cid:d2eb1b04-21d5-4a18-bfdb-12c4c3e65b8a]


 ============
 ||  HTML  ||
 ============
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
<span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);"><span class="x_elementToProof FluidPluginCopy" style="font-size:15px;font-family:&quot;Segoe UI&quot;, &quot;Segoe UI Web (West European)&quot;, &quot;Segoe UI&quot;, -apple-system, BlinkMacSystemFont, Roboto, &quot;Helvetica Neue&quot;, sans-serif;margin:0px;color:rgb(36, 36, 36);background-color:rgb(255, 255, 255)"><span style="font-size:12pt;font-family:Calibri, Arial, Helvetica, sans-serif;margin:0px;color:rgb(0, 0, 0);background-color:rgb(255, 255, 255)"><a href="https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&amp;reserved=0" originalsrc="https://twitter.com/" shash="qzkaBC2sLxgP1TgKDzZQ22fgeb1b7BsI5lFHyR43eYo7m7MI1zAsITkXPZdZ6n1KIL8l5zsW9ZYym8Zh696/mItL4iYvJ0Ubwz1de7W+ONeLI4b2ew0tg4HzBWjz70b4QTUpxwi7deoO2c/HOahf1M884A1dPLtJtc8s+ZBrhcA=" target="_blank" rel="noopener noreferrer" data-auth="NotApplicable" data-safelink="true" data-linkindex="0" style="margin:0px" class="ContentPasted0">https://twitter.com/</a></span></span>
<div class="x_elementToProof FluidPluginCopy" style="font-size:15px;font-family:&quot;Segoe UI&quot;, &quot;Segoe UI Web (West European)&quot;, &quot;Segoe UI&quot;, -apple-system, BlinkMacSystemFont, Roboto, &quot;Helvetica Neue&quot;, sans-serif;margin:0px;color:rgb(36, 36, 36);background-color:rgb(255, 255, 255)">
<span style="font-size:12pt;font-family:Calibri, Arial, Helvetica, sans-serif;margin:0px;color:rgb(0, 0, 0);background-color:rgb(255, 255, 255)"><br class="ContentPasted0">
</span></div>
<div class="x_elementToProof FluidPluginCopy" style="font-size:15px;font-family:&quot;Segoe UI&quot;, &quot;Segoe UI Web (West European)&quot;, &quot;Segoe UI&quot;, -apple-system, BlinkMacSystemFont, Roboto, &quot;Helvetica Neue&quot;, sans-serif;margin:0px;color:rgb(36, 36, 36);background-color:rgb(255, 255, 255)">
<span style="font-size:12pt;font-family:Calibri, Arial, Helvetica, sans-serif;margin:0px;color:rgb(0, 0, 0);background-color:rgb(255, 255, 255)"><a href="https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&amp;data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&amp;reserved=0" originalsrc="https://carleton.ca/" shash="Dxc62cEP1Yg+/wKXMo/VoujSwhva4+Frdv2Yr8iQuG/kuzsq8b6WfRRSuA3H0L4B+GRsbWHMjTjX4Mg2/0vuwo9UW9HOglt0hd7TcsPjzwi8IgUT3bVowbeQfPFAoMMdOWkKIzbKe4Ax/2E4rJf8j/m4b+N+/72C5VvaPLBgd+8=" target="_blank" rel="noopener noreferrer" data-auth="NotApplicable" data-safelink="true" data-linkindex="2" style="margin:0px" class="ContentPasted0">https://carleton.ca/</a></span></div>
<br>
</span></div>
<div class="elementToProof" style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
<span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);"><img style="max-width: 100%;" class="w-240 h-240" size="6492" contenttype="image/png" data-outlook-trace="F:1|T:1" src="cid:d2eb1b04-21d5-4a18-bfdb-12c4c3e65b8a"><br>
</span></div>
</body>
</html>

Expected output for the extracted URLs:

 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcarleton.ca%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=49SuRO9jfbmE5QmGqq85RvUZgZnyJ6XgFjlD3V7duYw%3D&reserved=0
 - https://carleton.ca/
 - https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2F&data=05%7C01%7Cadmin%40tq2zr.onmicrosoft.com%7C49360182de73427cd8e908dae851e772%7C71759330a027406e9082f1f64f1007b3%7C0%7C0%7C638077735753161743%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YF3Fdvo0FyHQkXoHuhv3WzGzfhmqGjCRMdxUPIZsYvA%3D&reserved=0
 - https://twitter.com/

URLExtract

There is a library called URLExtract which does exactly what the name says. I haven't used it, though it appears to have some dependencies and requires an Internet connection to download a list of TLDs. It carries a risk of false positives (see the known issues in its README), and adding it as a dependency to this library would increase the overall complexity by quite a bit, IMHO.

Alternative options... perhaps the existing regex can be improved with some of the ideas here?
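
One lighter-weight direction, sketched below with only the standard library: unescape entities before matching, take href values straight from anchor tags instead of regexing the raw HTML, and trim trailing punctuation from plain-text matches. This is only an illustration of the idea, not a drop-in patch:

import html
import re
from html.parser import HTMLParser

URL_PATTERN = re.compile(r"""https?://[^\s<>"']+""")
_TRAILING_JUNK = '>).,;"\''   # characters that often cling to the end of a plain-text match


class _HrefCollector(HTMLParser):
    """Pull href values out of anchor tags so surrounding markup never leaks into the URL."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.urls.append(value)


def extract_urls_from_html(html_text: str) -> list:
    collector = _HrefCollector()
    collector.feed(html_text)
    return collector.urls


def extract_urls_from_text(text: str) -> list:
    # Unescape first so &amp; comes back as &, then trim trailing punctuation.
    candidates = URL_PATTERN.findall(html.unescape(text))
    return [url.rstrip(_TRAILING_JUNK) for url in candidates]

HTMLParser should also hand back attribute values with entities already unescaped, which would take care of the &amp; problem for the HTML part, and because the regex excludes angle brackets the url<wrapped-url> convention in the plaintext part falls apart into two separate matches instead of one glued-together string.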
