
Comments (7)

rcarmo avatar rcarmo commented on July 29, 2024

The 3.8 port may need some UTF-8 fixes. I can't reproduce that without the actual data, but I suspect that the From address for that message has accented or non-ASCII characters.

from imapbackup.

mhow2 avatar mhow2 commented on July 29, 2024

Thanks @rcarmo.
BTW, the error I get doesn't say which message (ID?) is causing it, does it?


rcarmo avatar rcarmo commented on July 29, 2024

Nope. But that can be added; it should just be a matter of printing key. I suspect changing that ascii to utf-8 might suffice, but I cannot test it.


samsonjs avatar samsonjs commented on July 29, 2024

I ran into this as well and have started to debug it. As far as I can tell this is a bug in mailbox.py causing it to interpret lines in the message body as the From line. The line that throws that UnicodeDecodeError is msg.set_from(from_line[5:].decode('ascii')). From lines start with "From " so it slices off the first 5 bytes and then tries to decode the remaining bytes. I modified that method to look like the following to dig in further:

[Screenshot of the modified method, 2021-12-24 9:54 AM]

I have 3 emails that are affected and their From lines look like this:

Decoding From line failed! 💥
from_line = From the start, Flickr has been an act of co-creation. Today marks a new beginning. Together, let’s create the future of photography. Learn More: https://www.flickr.com/lookingahead
...
Decoding From line failed! 💥
from_line = From the beginning, when we introduced our automated investing portfolios five years ago, we've aimed to build financial products that will help our clients meet their longterm goals with the lowest possible cost and the smartest possible outcome. As students of financial history, we know that means being wise during market rallies and market collapses. Surprises and uncertainty are expected — we just don't know when they'll show up. What's happening now is devastating on a public health level, and will likely have negative economic impacts in the short term, but our approach to long-term investing remains the same.
...
Decoding From line failed! 💥
from_line = From celebrity feuds to baby bump rumors, if it’s happening in the glitterati sphere, it’s happening on Twitter. Be there when the gossip breaks (and then be the first to share it with your friends). 
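
The failure on lines like these can be reproduced in isolation. The following is only a minimal sketch, not the code from mailbox.py: decode_from_line is a hypothetical helper that mirrors the slice-and-decode step described above, and the separator line is made-up sample data.

```python
def decode_from_line(from_line: bytes) -> str:
    """Hypothetical helper mirroring the slice-and-decode step in mailbox.py."""
    # Drop the 5-byte "From " prefix, then decode the remainder as ASCII.
    return from_line[5:].decode("ascii")


# A genuine separator line is plain ASCII (made-up sample data).
separator = b"From alice@example.com Fri Dec 24 09:54:32 2021"

# A body line that merely begins with "From " and contains a curly apostrophe.
body_line = "From the start, let\u2019s create the future of photography.".encode("utf-8")

print(decode_from_line(separator))

try:
    decode_from_line(body_line)
    body_line_failed = False
except UnicodeDecodeError as exc:
    body_line_failed = True
    print("Decoding From line failed:", exc.reason)
```

The separator decodes cleanly, while the body line raises UnicodeDecodeError because the UTF-8 bytes of the apostrophe fall outside the ASCII range.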

If you only want to fix your own problem then you can modify mailbox.py accordingly and run imapbackup38.py again to find the emails and delete them if you don't care about them, or maybe move them into a folder that you don't back up.
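
Locating the affected entries may also be possible without editing mailbox.py. The sketch below assumes the local backup is a plain mbox file readable with mailbox.mbox; find_undecodable_keys is a hypothetical helper, the sample data is invented, and the keys it reports are positions in the local mbox file, not IMAP message IDs.

```python
import mailbox
import os
import tempfile


def find_undecodable_keys(path):
    """Return keys of entries whose From line is not ASCII (hypothetical helper)."""
    box = mailbox.mbox(path, create=False)
    bad_keys = []
    try:
        for key in box.keys():
            try:
                # get_message() decodes the From line and raises on non-ASCII bytes.
                box.get_message(key)
            except UnicodeDecodeError:
                bad_keys.append(key)
    finally:
        box.close()
    return bad_keys


# Demo with made-up data: one message whose body contains an unescaped "From " line.
sample = (
    b"From alice@example.com Fri Dec 24 09:54:32 2021\n"
    b"Subject: hello\n"
    b"\n"
    b"First paragraph.\n"
    b"\n"
    + "From the start, let\u2019s create the future.\n".encode("utf-8")
    + b"\n"
)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.mbox")
    with open(demo_path, "wb") as f:
        f.write(sample)
    bad = find_undecodable_keys(demo_path)

print(bad)
```

In the demo, the body line is mistaken for the start of a second message, so the key reported is that of the bogus entry rather than of the real email that contains it.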

The next step is to check out mailbox.py more and see why it thinks this line in the body is the From line. Not sure how far I'll go on this one but I might continue this later on.


samsonjs avatar samsonjs commented on July 29, 2024

Well this is a fun rabbit hole. While reading up on the mbox format I found this in RFC-4155:

Many implementations are also known to escape message body lines that
begin with the character sequence of "From ", so as to prevent
confusion with overly-liberal parsers that do not search for full
separator lines. In the common case, a leading Greater-Than symbol
(0x3E) is used for this purpose (with "From " becoming ">From ").
However, other implementations are known not to escape such lines
unless they are immediately preceded by a blank line or if they also
appear to contain an email address and a timestamp. Other
implementations are also known to perform secondary escapes against
these lines if they are already escaped or quoted, while others
ignore these mechanisms altogether.

A comprehensive description of mbox database files on UNIX-like
systems can be found at http://qmail.org./man/man5/mbox.html, which
should be treated as mostly authoritative for those variations that
are otherwise only documented in anecdotal form. However, readers
are advised that many other platforms and tools make use of mbox
databases, and that there are many more potential variations that can
be encountered in the wild.

The RFC continues to state that by default implementations should not perform >From quoting:

Also note that this specification does not prescribe any escape
syntax for message body lines that begin with the character sequence
of "From ". Recipient systems are expected to parse full separator
lines as they are documented above.

More interesting details about this family of incompatible formats collectively called mbox: http://jdebp.info/FGA/mail-mbox-formats.html

One part of the problem lies in the mailbox.mbox class which starts on line 839 of mailbox.py in Python 3.9.9 from Homebrew on macOS 12. It tries to parse the file by looking for lines that start with "From " following an empty line but it doesn't handle multipart encoding or >From quoting as described in the de-facto standard, so it can't properly read back the mbox files that it writes. Here's the parsing code, which is simple and relatively fast:

class mbox(_mboxMMDF):
    """A classic mbox mailbox."""

    _mangle_from_ = True

    # All messages must end in a newline character, and
    # _post_message_hooks outputs an empty line between messages.
    _append_newline = True

    def __init__(self, path, factory=None, create=True):
        """Initialize an mbox mailbox."""
        self._message_factory = mboxMessage
        _mboxMMDF.__init__(self, path, factory, create)

    def _post_message_hook(self, f):
        """Called after writing each message to file f."""
        f.write(linesep)

    def _generate_toc(self):
        """Generate key-to-(start, stop) table of contents."""
        starts, stops = [], []
        last_was_empty = False
        self._file.seek(0)
        while True:
            line_pos = self._file.tell()
            line = self._file.readline()
            if line.startswith(b'From '):
                if len(stops) < len(starts):
                    if last_was_empty:
                        stops.append(line_pos - len(linesep))
                    else:
                        # The last line before the "From " line wasn't
                        # blank, but we consider it a start of a
                        # message anyway.
                        stops.append(line_pos)
                starts.append(line_pos)
                last_was_empty = False
            elif not line:
                if last_was_empty:
                    stops.append(line_pos - len(linesep))
                else:
                    stops.append(line_pos)
                break
            elif line == linesep:
                last_was_empty = True
            else:
                last_was_empty = False
        self._toc = dict(enumerate(zip(starts, stops)))
        self._next_key = len(self._toc)
        self._file_length = self._file.tell()
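
The effect of this parser can be seen with a small, self-contained sketch. The message content and file name are invented for illustration; the only library calls used are mailbox.mbox, len() and close().

```python
import mailbox
import os
import tempfile

# One message, written by hand WITHOUT >From quoting (made-up sample data).
raw = (
    b"From alice@example.com Fri Dec 24 09:54:32 2021\n"
    b"Subject: hello\n"
    b"\n"
    b"First paragraph.\n"
    b"\n"
    b"From the start, this is just body text.\n"
    b"\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.mbox")
    with open(path, "wb") as f:
        f.write(raw)
    box = mailbox.mbox(path, create=False)
    try:
        # len() triggers _generate_toc(), which scans for lines starting with "From ".
        message_count = len(box)
    finally:
        box.close()

print(message_count)
```

With the parser quoted above, this should report 2 messages rather than 1, because the body line beginning with "From " is treated as a separator.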

Here's a hacked up version of _generate_toc() that skips over "From " lines in multipart bodies but it's going to make things a lot slower because it does a regex match on every line to look for boundaries and then a comparison on each line within each boundary to check for the end of the boundary. Anyway it works for me and might work for you too: https://gist.github.com/samsonjs/455e59fd75b2783071cc2215c3b3e3e1

One possible fix is to make the mailbox library support >From quoting and that doesn't seem like a lot of work, but I'm honestly not sure whether it's correct to do that inside of multipart bodies or not. Considering that this is all very vaguely specified maybe it doesn't matter that much in the grand scheme of things. But it would be nice to write the most portable mbox file that's reasonably possible.


samsonjs avatar samsonjs commented on July 29, 2024

Huh, actually according to Python's mailbox docs it is supposed to perform >From quoting but only when writing and not when reading:

Several variations of the mbox format exist to address perceived shortcomings in the original. In the interest of compatibility, mbox implements the original format, which is sometimes referred to as mboxo. This means that the Content-Length header, if present, is ignored and that any occurrences of “From ” at the beginning of a line in a message body are transformed to “>From ” when storing the message, although occurrences of “>From ” are not transformed to “From ” when reading the message.
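
The documented behaviour can be checked with a short sketch. It assumes the message is stored through mailbox.mbox.add(); the message text and file name are invented for illustration.

```python
import mailbox
import os
import tempfile

# Made-up message whose body has a line starting with "From ".
message_text = (
    "Subject: hello\n"
    "\n"
    "First paragraph.\n"
    "\n"
    "From the start, this is just body text.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quoted.mbox")
    box = mailbox.mbox(path)
    try:
        key = box.add(message_text)   # quoting is applied while writing
        box.flush()
        count_after_write = len(box)
        payload_read_back = box[key].get_payload()  # not unquoted on read
    finally:
        box.close()
    with open(path, "rb") as f:
        raw_bytes = f.read()

print(count_after_write)
print(b">From the start" in raw_bytes)
print(">From the start" in payload_read_back)
```

The file on disk contains ">From the start", the mailbox still counts a single message, and the payload read back keeps the leading ">", which matches the quoted documentation.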


samsonjs avatar samsonjs commented on July 29, 2024

We could change the download_messages function here in imapbackup.py to make it perform >From quoting when it writes emails, but since Python's mailbox.mbox still doesn't unquote when reading we'd still run into problems.

edit: oh, yeah actually that would be enough since we're not actually doing anything with the email content when we parse the local mbox file. Maybe that's all it'd take. I'll submit a patch for review.
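
A rough sketch of what such quoting could look like is shown below. This is not the submitted patch: quote_from_lines is a hypothetical helper, and it assumes the raw message is available as bytes before being written to the mbox file. It applies mboxo-style quoting only, so lines that already start with ">From " are left alone.

```python
import re

# Matches "From " at the start of any line (MULTILINE makes ^ match after each "\n").
_FROM_AT_LINE_START = re.compile(rb"^From ", re.MULTILINE)


def quote_from_lines(message: bytes) -> bytes:
    """Prefix lines starting with 'From ' with '>' (hypothetical helper)."""
    return _FROM_AT_LINE_START.sub(b">From ", message)


# Made-up raw message as it might arrive from the server, with CRLF line endings.
raw_message = (
    b"From: alice@example.com\r\n"
    b"Subject: hello\r\n"
    b"\r\n"
    b"From the start, this is just body text.\r\n"
)

quoted_message = quote_from_lines(raw_message)
print(quoted_message.decode("ascii"))
```

The "From:" header is untouched because the pattern requires a space after "From", while the body line gains a leading ">" and can no longer be mistaken for a separator when the local mbox file is parsed.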

