timhutton / twitter-archive-parser Goto Github PK

Python code to parse a Twitter archive and output in various ways

License: GNU General Public License v3.0

Python 100.00%

twitter-archive-parser's Introduction

How do I use it?

Download your Twitter archive (Settings > Your account > Download an archive of your data).
Unzip to a folder.
Right-click this link --> parser.py <-- and select "Save Link as", and save into the folder where you extracted the archive. (Or use wget or curl on that link. Or clone the git repo.)
Open a command prompt and change directory into the unzipped folder where you just saved parser.py.
(Here's how to do that on Windows: Hold shift while right-clicking in the folder. Click on Open PowerShell.)
Run parser.py with Python 3. e.g. python parser.py.
(On Windows: When the command window opens, paste or enter python parser.py at the command prompt.)

If you are having problems please check the issues list to see if it has happened before, and open a new issue otherwise.

What does it do?

The Twitter archive gives you a bunch of data and an HTML file (Your archive.html). Open that file to take a look! It lets you view your tweets in a nice interface. It has some flaws but maybe that's all you need. If so then stop here, you don't need our script.

Flaws of the Twitter archive:

It shows you tweets you posted with images, but if you click on one of the images to expand it then it takes you to the Twitter website. If you are offline or have deleted your account or twitter.com is down then that won't work.
The tweets are stored in a complex JSON structure so you can't just copy them into your blog for example.
The images they give you are smaller than the ones you uploaded. I don't know why they would do this to us.
DMs are included but don't show you who they are from - many of the user handles aren't included in the archive.
The links are all obfuscated in a short form using t.co, which hides their origin and redirects traffic to Twitter, giving them analytics. Also they will stop working if t.co goes down.

Our script does the following:

Converts the tweets to markdown and also HTML, with embedded images, videos and links.
Replaces t.co URLs with their original versions (the ones that can be found in the archive).
Copies used images to an output folder, to allow them to be moved to a new home.
Will query Twitter for the missing user handles (checks with you first).
Converts DMs (including group DMs) to markdown with embedded media and links, including the handles that we retrieved.
Outputs lists of followers and following.
Downloads the original size images (checks with you first).

For advanced users:

Some of the functionality requires the requests and imagesize modules. parser.py will offer to install these for you using pip. To avoid that you can install them before running the script.

Articles about handling your Twitter archive:

Related tools:

If our script doesn't do what you want then maybe a different tool will help:

https://github.com/Webklex/tbm download Twitter bookmarks incl. download of all media, GUI/search interface via local server
https://github.com/selfawaresoup/twitter-tools
https://github.com/roobottom/twitter-archive-to-markdown-files
https://gist.github.com/divyajyotiuk/9fb29c046e1dfcc8d5683684d7068efe#file-get_twitter_bookmarks_v3-py
https://archive.alt-text.org/
https://observablehq.com/@enjalot/twitter-archive-tweets
https://github.com/woluxwolu/twint
https://github.com/jarulsamy/Twitter-Archive
https://sk22.github.io/twitter-archive-browser/
https://pypi.org/project/pleroma-bot/
https://github.com/mshea/Parse-Twitter-Archive
https://github.com/dangoldin/twitter-archive-analysis
https://fedi.doom.solutions/tumelune/
https://github.com/mhucka/taupe

twitter-archive-parser's People

Contributors

Stargazers

Watchers

Forkers

clayote pauljacobson coldclimate lesault wesley83 jthodge jpluimers aevaonline doncruse n1ckfg enginkarahan duk3luk3 xanathon masukomi klevenstein abaumg curiouskiwi davidmichaelroberts ralfbarkow sasikala-99 animesh156 rossgrady twoscomplement fupete yls520lx statikman mutwirim achisto wshao12 kamirhoseain cyberflamego blackstream-x laibacode mikeckennedy svisser nhudson anthial xiexieni000 nikithakothakapu product-think2049 nefinett landonjam andrewbaker-uk adacable oyelowo olithissen jankatins grayside apfeiffer1 miniupnp granthenninger theotherside bennettscience marquisthecoder rhialto efischer19 gmh5225 lenaschimmel amcap1712 wyan ewenmcneill lmcintyre press-rouch ixs wkaluza joeycastillo professeurfalken l1kw1d hiimsonkul tabatkins hangeramber sergeykadiyevskiy asdlei99 panzer1119 april prereflect mhucka hartl3y94 mialondon lancemccarthy pitmonticone lholmquist richardsonjf gayuna ukaserge pchestek gheja haikusw hendriyeager icodein moodykeke andybalaam wardmundy sid-nik davonbl hubertshum superwillyfoc emmakat rmdes pablojimeno

twitter-archive-parser's Issues

Fetch twitpic

Another blast from the past, used until about 2013.

Example:
https://twitter.com/provoost/status/319779227333427200

Feature Request: Handling for Quote Retweets

Presently QRTs are only provided as unshortened links back to Twitter, where RTs are reproduced (in 140-character part?) in the markdown output.

Here there are two issues.

For self-referential QRTs, presumably we'll be able to reconstruct those relationships by tweet ID found in the twitter URLs themselves at any time. That means any solution to the below can skip server queries on those references, as they're already identifiable in the archive somewhere.

For QRTs of others, however, that data can only be recovered while Twitter remains operational. So, yeah.

sigh

Thank you for all the work you're doing on this, BTW.

Support for Python 3.7 and 3.8

For anyone running Python 3.8 and below, the script now fails with:

Parsing ./data/tweets.js...
Traceback (most recent call last):
  File "parser.py", line 176, in <module>
    main()
  File "parser.py", line 148, in main
    tweets_markdown += [tweet_json_to_markdown(tweet, username, archive_media_folder, output_media_folder_name) for tweet in json]
  File "parser.py", line 148, in <listcomp>
    tweets_markdown += [tweet_json_to_markdown(tweet, username, archive_media_folder, output_media_folder_name) for tweet in json]
  File "parser.py", line 62, in tweet_json_to_markdown
    body = body.removeprefix(replying_to)
AttributeError: 'str' object has no attribute 'removeprefix'

The .removeprefix(..) string method was introduced in Python 3.9 (changelog).

It might be nice to also support Python 3.7 and 3.8 to reduce any hassle with not running a particular version of Python 3. Alternatively, it could be mentioned in the README that Python 3.9+ is expected but personally I feel it's better to make it work for those Python releases.

Convert this to a Python package so it becomes a CLI installable by pipx

I recommend that you convert this to a Python package so it becomes a CLI installable by pipx. This will dramatically simplify usage for end users.

Let's assume you want to name it twitter_archive on pypi.org, then:

# Once pipx is install, can be done via homebrew 
# or via 'python -m pip install --user --upgrade pipx' if Python is installed

$ pipx install twitter_archive
$ twitter_archive_markdown # <-- run in the archive directory
$ twitter_archive_images   # <-- run in the archive directory

I created a proof of concept with these CLI names but they are just a name in a file, we can call them whatever you'd like.

https://github.com/mikeckennedy/twitter-archive-parser/tree/installable

Here it is in action:

Are you interested in this? Just let me know what you want the package name, and CLI command names to be and I'll send you a PR.

Missing a suitable #! line

Also, python2 or python3? (/usr/bin/python doesn't actually exist here right now, but that's my problem.)

Render HTML client-side

Currently we have a script that converts md files to html.

It is also possible to render the md on the client-side, using e.g.:

<!doctype html>
<html>
<head>
  <meta charset="utf-8"/>
  <title>Marked in the browser</title>
</head>
<body>
  <div id="content"></div>
  <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
  <script>
    var md = "# Important #\nThis is some *markdown* text.";
    document.getElementById('content').innerHTML = marked.parse(md);
  </script>
</body>
</html>

We could eliminate convert_to_html.py by simply wrapping the md like that.

A more sophisticated solution could retain the separate md files and render them on demand:

<!doctype html>
<html>
<head>
  <meta charset="utf-8"/>
  <title>Marked in the browser</title>
</head>
<body>
  <div id="content"></div>
  <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
  <script>
    var client = new XMLHttpRequest();
    client.open('GET', '/2013-12-01-Tweet-Archive-2013-12.md');
    client.onreadystatechange = function() {
        document.getElementById('content').innerHTML = marked.parse(client.responseText);
    }
    client.send();
  </script>
</body>
</html>

This could even be used to replace Jekyll completely which would be great from my point of view because a) maintaining the installation locally is a pain, b) running github actions just to publish your blog is crazy.

@sweh - thoughts?

SyntaxError: invalid character '·' (U+00B7) in download_better_images.py

I'm seeing this error when trying download_better_images.py after I ran parser.py.

PS E:\twitter-2022-11-07-923ca14fb076eed5a4b0a75ee733bda92c36bfbfac68abfde9937c206c657bdc> python .\download_better_images.py
  File "E:\twitter-2022-11-07-923ca14fb076eed5a4b0a75ee733bda92c36bfbfac68abfde9937c206c657bdc\download_better_images.py", line 70
    <title>twitter-archive-parser/download_better_images.py at main · timhutton/twitter-archive-parser · GitHub</title>
                                                                    ^
SyntaxError: invalid character '·' (U+00B7)

Some t.co URLs don't have expanded_url

Simple patch will handle them not being present (and emit a warning):

diff --git a/parser.py b/parser.py
index c442590..5eef6c3 100644
--- a/parser.py
+++ b/parser.py
@@ -26,7 +26,11 @@ def tweet_json_to_markdown(tweet, username):
     tweet_id_str = tweet['id_str']
     # replace t.co URLs with their original versions
     for url in tweet['entities']['urls']:
-        body = body.replace(url['url'], url['expanded_url'])
+        try:
+            body = body.replace(url['url'], url['expanded_url'])
+        except KeyError:
+            print (f"Warning: no expanded URL found for {url['url']}")
+
     # replace image URLs with markdown image links to local files
     if 'media' in tweet['entities']:
         for media in tweet['entities']['media']:

heads-up about DOS line endings

Thanks for the great tool! Love the friendly command-line explanations and logging.

I'm on Linux. I went to https://raw.githubusercontent.com/timhutton/twitter-archive-parser/main/parser.py and used File: Save in my browser (Firefox) to save the file locally.

When I tried to run it, I got this set of error messages:

$ python3 -m parser.py
/home/[redacted]/bin/python3: Error while finding module specification for 'parser.py' (ModuleNotFoundError: __path__ attribute not found on 'parser' while trying to find 'parser.py')
$ ./parser.py 
/usr/bin/env: ‘python3\r’: No such file or directory

and my spouse said -- aha, that file has DOS line-endings. I used my editor (emacs) to change that to Unix-style line endings, saved parser.py, and then was able to run it with no problem.

This is not necessarily something you need to fix - just wanted to share this here in case anyone ran into the same error messages and was trying to figure out why. Please feel free to close the issue.

Idea: improve image display

Images can be very large and intrusive in the output - perhaps, with judicious use of <img> tags, images could have a smaller, consistent size. And could be linked to full-size images.

Output should be sorted by timestamp

The current order in the output is the same as in the input but this seems to be somewhat random.

Feature request: Parse DMs, add user names and handles

The current twitter archive downloaded omits all the user names and handles. It only contains the ids of the accounts that someone interacted with. With that the archive looses context, especially for the DMs and reply's.

Expand bit.ly

This may be a bit of scope creep, because afaik bit.ly is not used by Twitter itself (see #36), but rather by some users of it. Probably automatic social media cross-posting tools.

However, I'm assuming it's easy to just add it to the list of shorteners that are expanded.

"Media couldn't be retrieved" on download_better_images

When running download_better_images.py, about half of the results (randomly) are outputting "Fail. Media couldn't be retrieved". The given URLs with :orig appended are working fine in my browser. There doesn't appear to be any pattern.

Feature request: Parse likes

It would be cool to parse your likes archive in the same way as your own tweets (sorry if you can already do this, I’m a Python noob). I realize it might be more challenging because of multiple tweet authors though!

Thanks for making this parser and for your clear instructions, I really appreciate it! 💚

I don't understand the output order

The output from my data appears to be mixed up. (The html file from twitter is ordered by date, as expected.) My output starts with some tweets from 2015, then 2014, 2013, then 2016, 2018, 2022, 2018 again, 2016 again, back to 2018, then 2019, 2018, 2019, 2016, 2017, 2020, etc. It appears to be a mix of tweets, retweets, and replies. Maybe the json import just jumbles them up? (my python version is 3.10.6).

I can't discern any pattern, nor did I see any error in the script output. I spot-checked for some specific tweets, and didn't see any missing. Skimming through, I don't see any obvious parsing errors. Also, I'm not worried enough about it to investigate further. I'm just very happy to have the content there, and I can find what I need easily enough!!! Thanks! :)

p.s. Maybe one other detail in case it's useful: I downloaded my twitter archive April 29, 2022, so maybe there was some minor change in format between then and now? I wouldn't worry about it, and especially not if there are no other reports of a problem like this.

p.p.s. Oh, checking my tweet.js file, the output from your script appears to be in the same order as the entries there. (At least, the first few and last few are the same.) So the twitter html file must include some feature to sort by date. I guess I'll post this issue anyway, just in case it comes up for others.

data/tweet_media is sometimes data/tweets_media

As reported by @calmeilles:

"...need to change
new_filename = 'data/tweets_media/'
at line 32 otherwise references in output.md are incorrect, as below. Easily fixed after the event of course but would be nicer not to have to.

matthew: $ grep 1552588046175961090-FYvmfP_XgAEh6K0.jpg output.md 
@nnnnnn This morning's effort, out of the oven 3 minutes ago. ![](data/tweet_media/1552588046175961090-FYvmfP_XgAEh6K0.jpg)

matthew: $ display data/tweet_media/1552588046175961090-FYvmfP_XgAEh6K0.jpg
display-im6.q16: unable to open image `data/tweet_media/1552588046175961090-FYvmfP_XgAEh6K0.jpg': No such file or directory @ error/blob.c/OpenBlob/2874.

matthew: $ find . -name 1552588046175961090-FYvmfP_XgAEh6K0.jpg
./data/tweets_media/1552588046175961090-FYvmfP_XgAEh6K0.jpg

bug: hardcoded media/ as the output folder

    Minor oversight: you have now accidentally hardcoded media/ as the output folder in line 105

body = header + body + f'\n\n<img src="media/tweet.ico" width="12" />

and 139, 140

 if not os.path.isfile('media/tweet.ico'):
         shutil.copy('assets/images/favicon.ico', 'media/tweet.ico');

where it should use the variable output_media_folder_name ...

Originally posted by @jwildeboer in #43 (comment)

Bug: Hang in Requesting headers

A user reports:

A few minutes in, the script hangs on

" 218/12457 media/1418046275941974017-E63pbh9WUAA1Rt1.jpg: Requesting headers for pbs.twimg.com/media/E63pbh9WUA......"

and doesn't proceed. Cursor is stuck at the beginning of the line.

Have been waiting 20 minutes now. No progress. I will keep waiting.

Bug: tweet text in html output is not escaped

Related #40

Feature request: Retrieve ALT-text for images

This data appears to be omitted from archive entirely.

Side note: Another Mastodoner has offered up an online tool to parse the .js file and grap down ALTs... but there are currently UI issues, however, that have thus far foiled attempts to use it.

Unable to run - <!DOCTYPE html> SyntaxError: invalid syntax, or <title>Syntax error: invalid character '·' (U+00B7)

Using python parser.py with Python 3.10.6 the command fails with the error

> abaker@Andrew-PC:~/Twitter$ python parser.py
>   File "/home/abaker/Twitter/parser.py", line 70
>     <title>twitter-archive-parser/parser.py at main · timhutton/twitter-archive-parser · GitHub</title>
>                                                     ^
> SyntaxError: invalid character '·' (U+00B7)
> abaker@Andrew-PC:~/Twitter$ ll
> total 21288
> drwxr-x---+  4 abaker abaker     4096 Nov 11 15:52  ./
> drwxr-x---+  3 abaker abaker     4096 Nov 11 15:49  ../
> -rw-rw-r--   1 abaker abaker     1432 Nov 10 23:30 'Your archive.html'
> drw-rw-r--+  5 abaker abaker     4096 Nov 10 23:30  assets/
> drw-rw-r--+ 11 abaker abaker     4096 Nov 10 23:30  data/
> -rw-rw-r--   1 abaker abaker   208979 Nov 11 15:48  parser.py
> -rw-rw-r--   1 abaker abaker 21561802 Nov 10 23:33  twitter-2022-11-10-fe53538580a484bb0e4a4af7c4b36c493b6815a9a9bc28e02b33a82f31eda3b3.zip
> abaker@Andrew-PC:~/Twitter$ python --version
> Python 3.10.6

Feature Request: option to always download

In the case everything really goes sour I'd like to have an option to leave nothing online. Could you add a flag to download everything, not only if it's larger online?

Needed: Improve DMs output: images, links, html?, split files?

In tweets we embed the images and videos. We should do the same for DMs.
~~We currently output a single md file with all the DMs in it. For users with very many DMs this might present a problem for rendering, so we could split them out as we do for tweets.~~
We could output html for DMs, as we do for tweets.

Any one of these issues is worth tackling on its own.

Invalid character error

Using the current script and the image script returns the following error in cmd for line 70:

<title>twitter-archive-parser/parser.py at main · timhutton/twitter-archive-parser · GitHub</title> ^ SyntaxError: invalid character '·' (U+00B7)

The middle dot (U+00B7) needs to be replaced.

Feature Request: Chronological Folders

It will be great if the tweets are output in nested chronological folders The folders can be in the format YYYY and MM, with a DD.md containing all the tweets on the day.
Eg
2022
|__01
|__01.md
|__02.md
...
|__31.md
...
2021
|__01
|__01.md
|__02.md
...
|__31.md

This will make it easier to publish the entire archive on a website.

Bug: tweets with multiple images output with only a single image

Hi again!

It looks like tweets that have more than one image include only the first image. Is it possible to include them all?

If not no worries! Thank you! 🤩

"Warning: missing local file"

Hello,

Gettting lots of warnings:

C:\Temp\Twitter.archive>python parser.py
Parsing .\data\tweets.js...
Warning: missing local file: .\data\tweets_media\1473653109218070531-BetnAQOurIrZad0X.jpg. Using original link instead: https://t.
co/CIul6pLg9n (expands to http://pbs.twimg.com/ext_tw_video_thumb/1473653023587155970/pu/img/BetnAQOurIrZad0X.jpg)
Warning: missing local file: .\data\tweets_media\1473424645546299393-TIgR4R9XfpY-X8cj.jpg. Using original link instead: https://t.
co/5rc56JT2VF (expands to http://pbs.twimg.com/ext_tw_video_thumb/1473424394638839811/pu/img/TIgR4R9XfpY-X8cj.jpg)
Warning: missing local file: .\data\tweets_media\1472915850550206469-Bth9mkPR2YATJGUV.jpg. Using original link instead: https://t.
co/BalRXveha8 (expands to http://pbs.twimg.com/ext_tw_video_thumb/1472915762293657606/pu/img/Bth9mkPR2YATJGUV.jpg)
Warning: missing local file: .\data\tweets_media\1472627142693474309-K5HYi0f5DG7CpXbG.jpg. Using original link instead: https://t.
co/wfyBxZ1CVB (expands to http://pbs.twimg.com/ext_tw_video_thumb/1472627087865491474/pu/img/K5HYi0f5DG7CpXbG.jpg)
Warning: missing local file: .\data\tweets_media\1472574958605848585-GPC6rpDjtvu0yErk.jpg. Using original link instead: https://t.
co/ONnGaZkppJ (expands to http://pbs.twimg.com/ext_tw_video_thumb/1472574772114608132/pu/img/GPC6rpDjtvu0yErk.jpg)
etc.

Cheers,

don't know how to use

Needed: Map missing user_id -> user_handle with remote services

This will improve followers.txt, following.txt, DMs.md where currently many handles are missing.

Suggestions and initial work for how to retrieve these handles has happened in several places recently. Thank you!

Ping: @flauschzelle, @lenaschimmel, @press-rouch, @n1ckfg (but of course anyone can contribute)

[Edit: the PRs that were mentioned are now merged]

name 'media_sources' is not defined

The archive doesn't contain the original-size images. We can attempt to download them from twimg.com.
Please be aware that this script may download a lot of data, which will cost you money if you are
paying for bandwidth. Please be aware that the servers might block these requests if they are too
frequent. This script may not work if your account is protected. You may want to set it to public
before starting the download.

OK to start downloading? [y/n]y
Traceback (most recent call last):
  File "~/Downloads/twitter-2022-11-01-b470d4ee7c7ba96e09a7782d4451d9c6852c3d7322cdcf3a91dbe40943586323/parser.py", line 580, in <module>
    main()
  File "~/Downloads/twitter-2022-11-01-b470d4ee7c7ba96e09a7782d4451d9c6852c3d7322cdcf3a91dbe40943586323/parser.py", line 575, in main
    download_larger_media(media_sources, log_path)
NameError: name 'media_sources' is not defined

Better filename for sorting (and Jekyll ;)

I changed line 135

filename = f'tweets_{dt.year}-{dt.month:02}.md'

filename = f'{dt.year}-{dt.month:02}-01-Tweet-Archive-{dt.year}-{dt.month:02}.md'

To get nicer filenames that I can just throw in a Jekyll _posts directory to start creating nice looking pages.

Not sure if this is worthy of a PR, maybe just a FYI or an idea :)

Feature request: Export unshortened URLs to CSV (e.g. for archiving them in the Internet Archive)

Michele Weigle has a nice thread on archiving t.co links into the Internet Archive running an Internet Archive service that archives URLs in a Google Sheets spreadsheet: https://twitter.com/weiglemc/status/1593698822257102851

Her script prepares the list of t.co URLs using awk: https://gist.github.com/weiglemc/312a11356420b3bc4c8e196e8454002a

The idea from that script might be a thing you want to include in your Python script.

Expand t.co URLs for all links

bug: crashes on 404 error ("Failed to get user handle: <Response [404]>")

Parsing ./data/direct-messages.js...
1 users are unknown.
Download user data from Twitter (approx 2KB)? [y/n]y
Traceback (most recent call last):
  File "/Users/sanja/Downloads/twitter-2022-11-20-52795906fc4533f7d93a763b1a6193fc624fdc507a4c5ca6f932f329488c105f/parser.py", line 580, in <module>
    main()
  File "/Users/sanja/Downloads/twitter-2022-11-20-52795906fc4533f7d93a763b1a6193fc624fdc507a4c5ca6f932f329488c105f/parser.py", line 565, in main
    parse_direct_messages(data_folder, users, user_id_URL_template, output_dms_filename)
  File "/Users/sanja/Downloads/twitter-2022-11-20-52795906fc4533f7d93a763b1a6193fc624fdc507a4c5ca6f932f329488c105f/parser.py", line 476, in parse_direct_messages
    lookup_users(list(dm_user_ids), users)
  File "/Users/sanja/Downloads/twitter-2022-11-20-52795906fc4533f7d93a763b1a6193fc624fdc507a4c5ca6f932f329488c105f/parser.py", line 107, in lookup_users
    retrieved_users = get_twitter_users(session, bearer_token, guest_token, filtered_user_ids)
  File "/Users/sanja/Downloads/twitter-2022-11-20-52795906fc4533f7d93a763b1a6193fc624fdc507a4c5ca6f932f329488c105f/parser.py", line 83, in get_twitter_users
    raise Exception(f'Failed to get user handle: {response}')
Exception: Failed to get user handle: <Response [404]>

Feature request: Download full-size images from Twitter

@masukomi points out in #13 that the images in the Twitter archive are not the original size.

Example: an image in my archive is 600x809. If I download from Twitter appending ?format=jpg&name=large then it gives me an image that is 780x1052.

Conceivably we could automate the downloading of full size images.

tweets.js is split at the 100Mb mark

As someone who was prolific on twitter my total download was about 9Gb. For tweets.js it's split into 3 files:

-rw------- 1 peter peter 104874538 Nov  5 20:55 tweets-part1.js
-rw------- 1 peter peter  77737185 Nov  5 20:55 tweets-part2.js
-rw------- 1 peter peter 104874243 Nov  5 20:55 tweets.js

So when tweets.js hits 101M it gets split into -part1 for the next 101M then -part2 and so on.

Feature request: Retrieve full text of retweets

Only the first 140 characters of each retweet are preserved in the official Twitter archive. I'm afraid it is a convoluted process to retrieve the full text via the API, probably for historical (pre-280 character) reasons. It should be done though (if at all possible, I guess it's a rather inconvenient number of API calls).

Bug: some videos use different URLs

This affects the download_better_images.py script, causing them to say e.g.

176/350: Fail. Media couldn't be retrieved: https://video.twimg.com/tweet_video/Q1P1WLseqOz6yoDf.mp4 Filename: media\1332621192075890689-Q1P1WLseqOz6yoDf.mp4

Twitter seems to handle videos differently depending on their size. For different bitrates the correct URLs are available in the JSON.

Proposal:

parser.py pulls out the highest-quality video URL (and image) and puts it in media/sources.txt, where each row contains <filename> <URL>
If the user runs download_better_images.py then it reads that file and tries to upgrade each file.

Feature Request: Followers & Followings

Any chance of parsing the Followers & Followings lists ? They are currently just unfriendly url's:

https://twitter.com/intent/user?user_id=XXXXXXXXXXXXX

but it would be good to resolve those URL's into at least usernames, and any other public info available on that user that would help us to reconstruct our lists if those twitter URL's stop working.

THanks!

Should it be "tweets.js"

The code currently looks for a file called tweet.js - but that doesn't appear in my archive.

Instead, I have tweets.js and tweets-part1.js - possibly because I post too much!

Are other archives different?

Feature request: Support for reply-to-self threads

It would be nice to be able to extract or format tweet threads in some fashion that makes them easy to identify - right now, all the tweets appear in the output, but not with any explicit indication of connection between the related replies-to-self.

At least one complication is that threads are not necessarily a single linear sequence.

Enhancement: download-better-images does not always work on private accounts

I had some problems with downloading the full size images on one of my accounts and suspect that it was because my account was set to private. This is the log I received on my first run:

1/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/AzUHnUvCQAAQTGa.jpg:orig Filename: media/231099328037535744-AzUHnUvCQAAQTGa.jpg
2/10: Skipped. Available version is same size or smaller than media/371782408221118464-BSivgd1CMAAERHA.jpg
3/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/AwvZUcGCEAIXkWY.png:orig Filename: media/219507951960985600-AwvZUcGCEAIXkWY.png
4/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/AzUHocNCIAAn2fl.jpg:orig Filename: media/231099347218079744-AzUHocNCIAAn2fl.jpg
5/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/AtFxeJ1CIAAb7lo.jpg:orig Filename: media/203068221052559360-AtFxeJ1CIAAb7lo.jpg
6/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/Axb1MdECIAAE4AZ.png:orig Filename: media/222634825906003970-Axb1MdECIAAE4AZ.png
7/10: Skipped. Available version is same size or smaller than media/124894202516615168-AbugnsJCQAA0IZX.jpg
8/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/BnYLPFrCMAAAedz.jpg:orig Filename: media/465571969623392256-BnYLPFrCMAAAedz.jpg
9/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/Axbz2SBCMAECfiR.png:orig Filename: media/222633345471885312-Axbz2SBCMAECfiR.png
10/10: Fail. Media couldn't be retrieved: https://pbs.twimg.com/media/A14nDfYCcAAmUmu.jpg:orig Filename: media/242674370735140864-A14nDfYCcAAmUmu.jpg

Replaced 0 of 10 media files with larger versions.
Total downloaded: 0.0MB = 0.00GB
Time taken: 10s

And this is the result after I switched the account's visibility to "non-protected".

1/10: Success. Overwrote media/231099328037535744-AzUHnUvCQAAQTGa.jpg with downloaded version that is 244% larger, 0.1MB downloaded.
2/10: Skipped. Available version is same size or smaller than media/371782408221118464-BSivgd1CMAAERHA.jpg
3/10: Skipped. Available version is same size or smaller than media/219507951960985600-AwvZUcGCEAIXkWY.png
4/10: Skipped. Available version is same size or smaller than media/231099347218079744-AzUHocNCIAAn2fl.jpg
5/10: Success. Overwrote media/203068221052559360-AtFxeJ1CIAAb7lo.jpg with downloaded version that is 60% larger, 0.1MB downloaded.
6/10: Skipped. Available version is same size or smaller than media/222634825906003970-Axb1MdECIAAE4AZ.png
7/10: Skipped. Available version is same size or smaller than media/124894202516615168-AbugnsJCQAA0IZX.jpg
8/10: Success. Overwrote media/465571969623392256-BnYLPFrCMAAAedz.jpg with downloaded version that is 143% larger, 0.0MB downloaded.
9/10: Skipped. Available version is same size or smaller than media/222633345471885312-Axbz2SBCMAECfiR.png
10/10: Success. Overwrote media/242674370735140864-A14nDfYCcAAmUmu.jpg with downloaded version that is 3% larger, 0.0MB downloaded.

Replaced 4 of 10 media files with larger versions.
Total downloaded: 0.2MB = 0.00GB
Time taken: 11s

I am not entirely sure whether the problem stems from the protected state but I had no problems on any of my other accounts, none of them being protected.

My suggestion: add a line to the scripts intro text explaining that it may be a good idea to go public for the download's time frame.

If this sounds good to you let me know and I will happily create a pull request.

Fix videos

Currently we don't embed videos. This is also the cause of some of the warnings reported in #18

Bug: markdown inside tweets is not escaped

Consider https://twitter.com/twoscomplement/status/1580380783767756801 - note the text __VA_ARGS__

The tweet body text is copied verbatim by parser.py, resulting in VA_ARGS when the markdown is rendered

Split output into separate files

Motivations:

Enormous markdown files cause problems for rendering.
Import into Jekyll etc needs separate files.

Proposal:

Construct a filename using the timestamp
Output each tweet as a separate markdown file

Feature Request: gather list members

lists-created.js is just a list of links to your lists. Would have liked it if it saved who were the list members, so this request is #70 but applies to your personal lists.

Thanks!

download_better_images.py should always download :orig versions of PNG, currently keeps lower-quality reencoded versions from archive zip

The current check used by download_better_images.py to detect whether the file downloaded is "better quality" fails for lossless codecs like PNG. In a lossless compression format, a lower filesize doesn't imply a loss of quality, it's re-encoding the bit patterns of the raw original image into a more compact encoding scheme that represents the same output.

I recommend editing this script to unconditionally download :orig versions of PNG, as it makes a big difference. In particular, there's anti-aliasing introduced around screens/pixel art after Twitter re-encodes the original PNG present in the archive versions of media. As a side-effect the archive version degrade the quality a lot compared to the original upload for PNGs. By fixing the download check for PNGs, the quality issue is resolved.

Thanks for making this tool!

(For a quick workaround, I did this locally -- In all cases where file sizes smaller, the quality of the PNG is noticeably better in the :orig than the archive zip version.

copy this script,
a line like media_filenames = [filename for filename in media_filenames if os.path.splitext(filename)[1] == '.png'] between the glob + number_of_files lines to only fetch the PNGs
replace if size_after > size_before: with if True: to unconditionally take the downloaded :orig over the existing png.

of course, a real fix will be more robust than that! Just sharing if those script modifications help anyone else in the meantime who encounters a similar PNG quality thing)

Feature request: fetch other tweets in threads

While discussing support for threads in #23 - it would be nice to have an option to fetch other people's tweets that you replied to, so that entire threads are preserved, not just your own replies.

Expand is.gd

Twitter used this until about 2011. It seems their engineers forgot about it, because they're not expanded, but they still work! Example.