Yahoo Groups Archive Tools

Many of us are using yahoo-group-archiver to back up Yahoo Groups API results. This script takes the output of that tool and converts it into individual email files, mbox mail folders, and, optionally, PDF files.

Mail folders stored as mbox can be imported by a wide range of desktop and server-side email clients, including Thunderbird (Linux, Mac, Windows), Apple Mail.app (Mac), and Microsoft Outlook (Windows and Mac).

Many non-technical users won't know what to do with an mbox file, but will really appreciate getting a PDF file containing all the emails in the list. You can enable experimental PDF support by installing Andrew Ferrier's email2pdf script. This process is known to be buggy, and your bug reports would be appreciated.

1. Installation and usage

Requirements

  • Perl 5.14 or higher
  • several Perl modules installed via CPAN:
    • CAM::PDF
    • Email::MIME
    • Email::Sender
    • HTML::Entities
    • HTML::FormatText::WithLinks::AndTables
    • IO::All
    • IPC::Cmd
    • JSON
    • List::AllUtils
    • Log::Dispatch
    • MCE
    • Pandoc
    • Sort::Naturally
    • Text::Levenshtein::XS
    • autodie
  • Optional: if you're creating PDF files for lists with more than, say, 10,000 emails, you'll probably need to install qpdf to avoid running out of memory (packages are available for Yum/RPM, Debian/Ubuntu, and Homebrew on macOS)

Basic usage:

mkdir output-dir
yahoo-group-archive-tools.pl --source <archived-input-dir> --destination <output-dir>

Experimental PDF support

Start by installing Andrew Ferrier's email2pdf script. It can be a little complicated to install, but giving someone a PDF file of their list can elicit delight. This is experimental, so bug reports are appreciated.

mkdir output-dir
yahoo-group-archive-tools.pl --source <archived-input-dir> --destination <output-dir> --pdf --email2pdf <path to email2pdf Python script>

2. Output

The output directory will contain:

  • An email folder containing standalone email files for every email in the archive, e.g. email/1.eml, email/2.eml. The emails won't be pristine, because Yahoo redacts email addresses (see that and other caveats below). The email IDs reflect those downloaded by yahoo-group-archiver, and it's normal to see some gaps in keeping with the original numbering.
  • A consolidated mailbox file, mbox/list.mbox, for the entire history of the list
  • With PDF support enabled, a pdf-individual directory containing individual PDFs for every email
  • With PDF support enabled, a pdf-combined directory with a single PDF file containing every email

3. Learn more

4. Yahoo Groups API issues, and how we work around them

4.1. Censored email addresses (major problem)

The Yahoo Groups API redacts emails found in message headers. For example, they'll rewrite [email protected] as ceo@....

Why is this bad?

  • Deleting hostnames from headers could cause the emails to be unparseable by client software expecting valid hostnames.
  • It's hard for people to tell the difference between users. For example, [email protected] and [email protected] look the same if both are truncated to ceo@....

How we're trying to fix it

Because the API tells us the submitting Yahoo user's username, we can make a fake email domain that preserves the part before the @ in redacted emails, while being unique per user.

  • Imagine the CEO of Ford, [email protected] (Yahoo ID fordfan), emails the list:
    • Yahoo Groups redacts the hostname, and saves that as ceo@...
    • We turn that into [email protected]
  • Then the CEO of Toyota, [email protected] (Yahoo ID toyotalover123), emails the list:
    • Yahoo Groups also saves that as ceo@... even though this is a totally different person
    • But we turn that email into [email protected], which is different from [email protected]

We make this change in several headers that include the original sender's email, including From and Message-Id. We save the original redacted version as an X- header. For example, if Yahoo says an email is From: ceo@..., we modify that to From: [email protected], and save the original as X-Original-Yahoo-Groups-Redacted-From: ceo@.... If we don't have a Yahoo profile name (e.g. "ceo123"), we use the numeric Yahoo user ID (e.g. "123456789") instead.
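The rewriting rule above can be sketched in a few lines of Python. This is an illustration only, not the tool's Perl code: the function names are made up for this example, the `yahoo-groups.invalid` suffix is a hypothetical placeholder for whatever fake domain the tool really uses, and the sketch only handles a bare address in the From header (no display names, no Message-Id).

```python
import re

# Hypothetical placeholder suffix; the fake domain the real tool uses may differ.
FAKE_DOMAIN_SUFFIX = "yahoo-groups.invalid"

# Matches an address redacted by Yahoo, e.g. "ceo@..."
REDACTED = re.compile(r"^(?P<local>[^@\s]+)@\.\.\.$")


def unredact(address, profile_name=None, user_id=None):
    """Return (rewritten_address, original) or (address, None) if unchanged."""
    match = REDACTED.match(address)
    if not match:
        return address, None
    # Prefer the Yahoo profile name; fall back to the numeric user ID.
    owner = profile_name or (str(user_id) if user_id is not None else None)
    if owner is None:
        return address, None
    return f"{match['local']}@{owner}.{FAKE_DOMAIN_SUFFIX}", address


def rewrite_from(headers, profile_name=None, user_id=None):
    """Return a copy of headers with From rewritten and the original preserved."""
    new_from, original = unredact(headers.get("From", ""), profile_name, user_id)
    result = dict(headers)
    if original is not None:
        result["From"] = new_from
        result["X-Original-Yahoo-Groups-Redacted-From"] = original
    return result
```

With this sketch, two senders who were both redacted to `ceo@...` end up with different From values as long as their Yahoo IDs differ, and the redacted form is still available in the X- header.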

4.2. Attachments

The Yahoo Groups API detaches all attachments, and saves them in a separate place.

Our solution

We try to stitch the emails back together, navigating through the MIME structure to attach the right attachment at the right place. In some cases, we're not able to identify where in the email MIME structure an attachment goes, so we reattach orphaned attachments to the whole email. In some cases, Yahoo doesn't give us the attachment, so we replace the attachment with a text part containing an error message, with original attachment-related headers added (X-Yahoo-Groups-Attachment-Not-Found, X-Original-Content-Type, X-Original-Content-Disposition, X-Original-Content-Id).
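As a rough illustration of the last case (an attachment Yahoo never delivered), the placeholder part could be built like this with Python's standard `email` package. This is a sketch under assumptions, not the tool's Perl implementation: the function name, the wording of the error text, and the value stored in `X-Yahoo-Groups-Attachment-Not-Found` (here, the file name) are all guesses made for the example.

```python
from email.mime.text import MIMEText


def placeholder_for_missing(filename, content_type, disposition=None, content_id=None):
    """Build a text part that stands in for an attachment that could not be found."""
    part = MIMEText(
        f"Attachment '{filename}' could not be retrieved from the archive.\n",
        "plain",
        "utf-8",
    )
    # Assumed header value: the source names the header but not what it contains.
    part["X-Yahoo-Groups-Attachment-Not-Found"] = filename
    # Preserve the original attachment metadata under X-Original-* headers.
    part["X-Original-Content-Type"] = content_type
    if disposition:
        part["X-Original-Content-Disposition"] = disposition
    if content_id:
        part["X-Original-Content-Id"] = content_id
    return part
```

The returned part can then take the place of the missing attachment in the message's MIME tree, so a reader sees an explanation instead of a silently dropped file.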

4.3. Long emails being truncated

The Yahoo Groups API forcibly truncates email messages with more than 64 KB of text, and places a truncation message right in the middle of encoded content, e.g. Base64.

Our solution

Whenever we see an email body that ends with (Message over 64 KB, truncated), we remove that string from the broken message part, and pray that downstream parsers will be able to deal with truncated HTML, Base64, etc. We mark these message parts with an X-Yahoo-Groups-Content-Truncated header.
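A minimal Python sketch of that cleanup, for illustration only (the tool itself is written in Perl). The function names are invented here, and the header value `yes` is an assumption, since the source only names the header.

```python
from email.message import Message

TRUNCATION_MARKER = "(Message over 64 KB, truncated)"


def strip_truncation_marker(body):
    """Return (cleaned_body, was_truncated) for a raw, still-encoded body string."""
    stripped = body.rstrip()
    if stripped.endswith(TRUNCATION_MARKER):
        return stripped[: -len(TRUNCATION_MARKER)].rstrip(), True
    return body, False


def clean_part(part: Message) -> bool:
    """Remove the marker from a non-multipart part and flag it; return True if changed."""
    payload = part.get_payload()
    if not isinstance(payload, str):
        return False  # multipart containers hold a list of sub-parts, not text
    cleaned, was_truncated = strip_truncation_marker(payload)
    if was_truncated:
        part.set_payload(cleaned)
        part["X-Yahoo-Groups-Content-Truncated"] = "yes"  # assumed header value
    return was_truncated
```

Note that the cleaned payload is still incomplete: a Base64 body cut off at 64 KB may no longer decode cleanly, which is why the part is flagged rather than treated as repaired.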

4.4. Character encoding issues

The Yahoo Groups API appears to be decoding and re-encoding textual message bodies, because we see Unicode replacement characters (U+FFFD) in raw RFC822 text that should be 7-bit clean. We're also seeing ^M carriage returns at the end of every header line and MIME body part.

Our solution

We remove the stray ^M characters and any 8-bit characters from 7-bit RFC822 text.
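In Python terms, the cleanup amounts to something like the following sketch. It is not the tool's Perl code, `clean_7bit` is a name invented for this example, and it assumes the raw message has already been read into a string.

```python
def clean_7bit(raw: str) -> str:
    """Drop ^M characters (carriage returns) and anything outside 7-bit ASCII."""
    without_cr = raw.replace("\r", "")
    return "".join(ch for ch in without_cr if ord(ch) < 128)
```

This is lossy by design: a U+FFFD replacement character marks a spot where the original byte is already gone, so dropping it loses nothing that could still be recovered.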

5. Fixing common errors

Perl modules

Installing this script requires installing many CPAN dependencies. If you're confused, feel free to search for things like "installing CPAN ". In some cases, an environment's package manager may provide all of these modules. At least one Ubuntu user was able to install all but one of the dependencies using the package manager: only Text::LevenshteinXS was available, not Text::Levenshtein::XS, so they changed the dependency in the Perl script to match.

email2pdf

This tool directly executes the email2pdf script specified by the --email2pdf option. Make sure the #! shebang line is set to the Python interpreter of your choice. You can test email2pdf execution by manually running something like email2pdf --headers -i <a .eml file> --output-file <the-filename-to-write.pdf> on a single email/[number].eml file generated by this script.

6. Changelog

Significant changes:

  • 2020-12-09: Saves list description metadata in several formats
  • 2020-04-07: Can process only specific emails, helpful for spam filtering
  • 2020-02-11: One more strategy to turn ornery emails into PDF
  • 2020-01-23: Fixes bug in some emails with X-eGroups-Approved-By: header
  • 2020-01-21: Fixes bug where PDF Date/Subject sometimes listed as 'ARRAY'
  • 2020-01-14: Uses qpdf to create combined PDFs for lists with many emails
  • 2020-01-05: Faster PDF generation, with fewer conversion errors
  • 2019-12-17: Checks /topics/ for attachments and email descriptions
  • 2019-12-11: Checks /attachments/ for attachments
  • 2019-12-10: Support for PDF generation
  • 2019-12-07: First solid release

7. Bugs and todo

  • Catch and solve some of the most common email2pdf errors
  • Maybe fix redacted headers in sub-parts so the message is valid
  • Need to verify that attached files round trip correctly

8. Feedback and authorship

This software is copyright Anirvan Chatterjee, and licensed under the MIT License.

Have questions, bug reports, or suggestions? Feel free to use GitHub's issue tracker. If you need to contact me privately, DM me @anirvan on Twitter.

yahoo-group-archive-tools's Issues

mbox opened with UTF-8 encoding but contains invalid characters

I can't make a final judgement because the import of the .mbox file into kmail (v5.10.3) choked on the 4793rd email (out of 7615). When I try to open the .mbox file in a text editor (kate), I get the error "The file .mbox was opened with UTF-8 encoding but contained invalid characters." Examining the .mbox file, it seems as if all the messages are contained in it; but I'm guessing there are some invalid characters preventing the import from completing?

Originally posted by @jnew-gh in #2 (comment)

Attachments not being found

First, many thanks for this fantastic tool. Lovely work.

The issue is that no attachments are being found, despite them being present in both the attachments and topics directories.

In the case of the attachments directory, everything seems to be correctly named
/attachments/attachmentId/fileId-filename
and the info seems to be correct in the corresponding attachmentinfo.json file.

(I'm using your tools on output produced by IgnoredAmbience's version of the archiver.)

Could you look into this?

Can't locate object method "dir" via package "IO::All::File"

I am very new to this, so sorry if this is a very stupid question...
I'm getting this error Can't locate object method "dir" via package "IO::All::File" at ./yahoo-group-archive-tools.pl line 147. despite having installed all necessary dependencies on CPAN. I tried Strawberry Perl and Perl on WSL1 and both reported the same issue.

perl version is 5.30.0

No attachments in .eml or .mbox files

I used this tool on the output of IgnoredAmbience's yahoo-group-archiver and while all of the .eml and .mbox files are generated, there are no attachments. To be more specific, the emails show that there are attachments but none of the attachments exist.

Line 279 of yahoo-group-archive-tools.pl:
$attachments_dir_path =~ s/_raw\.json$/_attachments/;
For me this resolves to <source_directory>/email/xx_attachments but there are no xx_attachments directories in the .../email source directory.

The output of the tool shows a lot of the following:
[<datetime>] [<groupname>] message xxxx: attachment named '<filename>' could not be found, skipping

Did the output file structure of IgnoredAmbience's yahoo-group-archiver change? Between the time I started using it (October 25, 2019) and today, all of the .../xxxx_attachments/ directories and xxxx.json files have been moved from the .../email directory to the .../topics directory (which didn't exist before). The contents of the .../attachments directory is also very different. The two datasets are downloaded from the same group.

Note: If I run this tool on the October 25th version of the data download, most of the attachments are found.
