The tidy_tweet from qut-digital-observatory

tidy_tweet's Issues

Json parsing checklist

This checklist is for the first pass at parsing all the objects in the JSON. It's okay if it misses some cases, but it should get us to the point where most of the data goes into the database and we can start testing to find things we've missed.

Twarc + twitter json:

Proposal: split tweet table into mutable and immutable components

I think the tweet table can be split into two groups of columns:

Immutable components that are inherent to the tweet, such as the text of the tweet (and content deriving from the text like mentions and urls)
Mutable components, such as the engagement metrics and whether there are any restrictions on who can reply

This allows us to track the mutable components with (tweet_id, collected_at) as the primary key, and the immutable component with whatever version is seen first (subject to the limitations you've noticed with tweets that are only in the includes).

In the context of streaming or longitudinal data collections, this gives us a neat way to track engagement with a tweet over time, as retweets of a popular tweet will give updates to the engagement metrics of the original tweet.
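A minimal sketch of the split, using Python's built-in sqlite3 module. The table name tweet_metrics, its columns, and the helper record_tweet are hypothetical choices for illustration, not part of the current schema; only two of the engagement metrics are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Immutable part: the first version seen is kept.
create table tweet (
    id integer primary key,
    text text not null
);
-- Mutable part: one row per tweet per collection time.
create table tweet_metrics (
    tweet_id integer references tweet(id),
    collected_at text,
    like_count integer,
    retweet_count integer,
    primary key (tweet_id, collected_at)
);
""")


def record_tweet(conn, tweet, collected_at):
    """Store the immutable part once and the metrics once per collection time."""
    conn.execute(
        "insert or ignore into tweet (id, text) values (?, ?)",
        (int(tweet["id"]), tweet["text"]),
    )
    metrics = tweet["public_metrics"]
    conn.execute(
        "insert or ignore into tweet_metrics values (?, ?, ?, ?)",
        (int(tweet["id"]), collected_at,
         metrics["like_count"], metrics["retweet_count"]),
    )
```

Calling record_tweet for the same tweet at two collection times leaves one row in tweet and two rows in tweet_metrics, which is what makes the engagement history queryable.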

Proposal: keep user profile properties like URLs and hashtags distinct from tweet hashtags and URLs

Apart from the structural differences between tweets and user profiles (a single text blob + attachments for tweets, multiple fields with different meanings/purposes for user profiles), URLs, hashtags and mentions in profiles have different interpretations and meanings than the same things in tweets. The current schema conflates these, so every query that touches these elements has to be more complicated to avoid mistakes.

Additionally, this schema means that we can't use foreign key constraints to maintain referential integrity.

As an example of this for hashtags, I propose we change the current schema:

create table hashtag (
    source_id text, -- the id of the object (user or tweet) this hashtag is included in
    source_type text, -- "user" or "tweet"
    field text not null, -- e.g. "description", "text" - which field of the source
                         -- object the hashtag is in
    tag text not null
)

To the following two tables:

create table if not exists tweet_hashtag (
    id integer references tweet(id), -- the tweet this hashtag appears in
    hashtag text,
    -- Normalisation, to allow indexing on common transformations of hashtags to match Twitter platform affordances
    -- Twitter makes no distinction between #SuperBowl, #SuperbOwl and #SUPERBOWL
    hashtag_lower text,
    primary key (id, hashtag)
);

create table if not exists user_profile_hashtag (
    id integer references user(id), -- the user whose profile contains the hashtag
    field text, -- e.g. "description" - which profile field the hashtag is in
    hashtag text,
    hashtag_lower text,
    primary key (id, field, hashtag)
);
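As a rough illustration of how the hashtag_lower column could be used, the sketch below indexes it and looks hashtags up case-insensitively. The index name and the helpers add_hashtag and tweets_with_hashtag are hypothetical, and Python's str.lower() is only an approximation of whatever folding Twitter applies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table tweet (id integer primary key);
create table tweet_hashtag (
    id integer references tweet(id),
    hashtag text,
    hashtag_lower text,
    primary key (id, hashtag)
);
-- Index on the normalised form so lookups by hashtag avoid a full scan.
create index tweet_hashtag_lower on tweet_hashtag (hashtag_lower, id);
""")


def add_hashtag(conn, tweet_id, hashtag):
    """Store the hashtag as written, plus its lowercased form."""
    conn.execute(
        "insert or ignore into tweet_hashtag values (?, ?, ?)",
        (tweet_id, hashtag, hashtag.lower()),
    )


def tweets_with_hashtag(conn, hashtag):
    """Return ids of tweets using the hashtag, ignoring case."""
    rows = conn.execute(
        "select distinct id from tweet_hashtag where hashtag_lower = ? order by id",
        (hashtag.lower(),),
    )
    return [row[0] for row in rows]
```

Because the tweet and profile hashtags live in separate tables, a query like this never has to filter on a source_type column.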

CLI

Going to need a nice CLI, like what Gab Tidy Data has.

Support SQLite full text search

Proposal: schema evolution

This gist maps out a possible schema that addresses #8 #9 #10 #11 #12.

https://gist.github.com/SamHames/795fb3fed71818d871c2873329396ee6

Add indexes to support common analytical queries

Add primary keys to relevant tables

Needed to support:

point lookups by ID
avoid duplicate entries in other tables

Versioning for database schema

We should version the database schema and store both the schema version and the library version in a metadata table in the database.
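One possible shape for this, sketched with sqlite3. The key/value layout of the metadata table, the constants SCHEMA_VERSION and LIBRARY_VERSION, and both helper functions are assumptions made for illustration only.

```python
import sqlite3

SCHEMA_VERSION = 1         # hypothetical: bump whenever the schema changes
LIBRARY_VERSION = "0.0.0"  # hypothetical placeholder for the package version


def init_metadata(conn):
    """Create the metadata table and record versions if not already present."""
    conn.execute(
        "create table if not exists metadata ("
        "key text primary key, value text not null)"
    )
    conn.executemany(
        "insert or ignore into metadata values (?, ?)",
        [
            ("schema_version", str(SCHEMA_VERSION)),
            ("library_version", LIBRARY_VERSION),
        ],
    )


def check_schema_version(conn):
    """Raise if the database was written with a different schema version."""
    row = conn.execute(
        "select value from metadata where key = 'schema_version'"
    ).fetchone()
    if row is None or int(row[0]) != SCHEMA_VERSION:
        raise RuntimeError("database schema version does not match this library")
    return int(row[0])
```

Checking the version when a database is opened lets the library refuse (or migrate) files written by an incompatible release instead of failing partway through an insert.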

Proposal: handle directly_collected as a separate table.

Currently, directly_collected is a column on the tweet table. This has a couple of limitations:

Because this table is updated as insert or ignore, the first version of a tweet seen 'wins' - this column is only correct if the tweets are inserted in chronological order, which isn't guaranteed (especially if there is more than one file to insert).
Additionally this means that filtering on directly_collected in any other table requires joining against and processing the largest table in the collection.

I think we can avoid the order-of-operation problems and get more functionality out of structures like the following. This would also let us consider more complex tagging of collection properties later, and possibly allow us to do some nicer things to make #5 more legible for analysis.

create table if not exists tweet_source (
     id integer references tweet(id),
     directly_collected integer not null,
     -- SQLite allows only one primary key per table, so only the composite key is declared
     primary key(directly_collected, id)
);

create table if not exists tweet_source_label (
     label text,
     directly_collected integer,
     id integer references tweet(id),
     -- as above, only the composite key is declared as the primary key
     primary key(label, directly_collected, id)
);

-- Or just a table that only has the tweet_ids of the directly collected tweets.
create table if not exists tweet_directly_collected (
     id integer primary key
);

All of these allow any of the other tables to filter against a much more compact (and correct) view of the directly_collected attribute.
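To show why a separate table sidesteps the ordering problem, here is a small sketch based on the last variant (tweet_directly_collected). The cut-down tweet table and the insert_tweet helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table tweet (id integer primary key, text text);
create table tweet_directly_collected (
    id integer primary key references tweet(id)
);
""")


def insert_tweet(conn, tweet, directly_collected):
    """Insert a tweet; separately note whether it was directly collected."""
    conn.execute(
        "insert or ignore into tweet values (?, ?)",
        (int(tweet["id"]), tweet["text"]),
    )
    if directly_collected:
        # Recorded even when the tweet row itself already existed,
        # so the order in which files are loaded no longer matters.
        conn.execute(
            "insert or ignore into tweet_directly_collected values (?)",
            (int(tweet["id"]),),
        )
```

If a tweet first appears in the includes and is later collected directly, the second call still adds it to tweet_directly_collected, whereas a column filled by insert or ignore would keep the first value.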

Opportunity: handling likes nicely

Having worked with the new Twitter like endpoints, I've found the data model to be a real pain, which I think is a real opportunity for tidy_tweet to do something nice. The Twitter endpoints give the following information:

liking-users gives a reverse chronological list of user profiles liking the tweet
liked-tweets gives a reverse chronological list of tweets liked by a user

In both cases, there is no indicator in the Twitter JSON of which tweet (or user) was liked by which user. However, twarc injects the requested URL with that information into the __twarc field, allowing us to recover the proper relation from the data by itself.

A suggested approach is to insert the associated tweet/user profiles into the relevant tables, and create a liked_tweet table as below (with an index on (user_id, tweet_id) too):

create table if not exists liked_tweet (
    tweet_id integer references tweet(tweet_id),
    user_id integer references user(user_id),
    primary key (tweet_id, user_id)
);
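A rough sketch of recovering the (tweet_id, user_id) pairs from one page of twarc output. It assumes the requested URL is stored under page["__twarc"]["url"] and that the URL paths follow the v2 patterns /2/tweets/:id/liking_users and /2/users/:id/liked_tweets; the like_pairs helper is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed v2 endpoint paths; the numeric id in the path is the "missing" side
# of the relation.
LIKING_USERS = re.compile(r"^/2/tweets/(\d+)/liking_users$")
LIKED_TWEETS = re.compile(r"^/2/users/(\d+)/liked_tweets$")


def like_pairs(page):
    """Return (tweet_id, user_id) pairs for one page of like data."""
    path = urlparse(page["__twarc"]["url"]).path
    items = page.get("data", [])

    match = LIKING_USERS.match(path)
    if match:
        # The page lists user profiles that liked one tweet.
        tweet_id = int(match.group(1))
        return [(tweet_id, int(user["id"])) for user in items]

    match = LIKED_TWEETS.match(path)
    if match:
        # The page lists tweets liked by one user.
        user_id = int(match.group(1))
        return [(int(tweet["id"]), user_id) for tweet in items]

    return []  # not a likes endpoint
```

The resulting pairs can be written to liked_tweet with insert or ignore, so the same like seen from both endpoints is stored once.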

Restructure into a nice package

Time to organise the things! Time to open source the things!!!

Support data collected from sample and filter endpoints.

Annoyingly, they have a slightly different format.

Instead of {"data": [array of tweets]} you get {"data": <single tweet>}.
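One way to handle both shapes is to normalise "data" to a list before any further parsing. A minimal sketch, where tweets_in_page is a hypothetical helper name:

```python
def tweets_in_page(page):
    """Return the tweets in a page as a list, whichever endpoint produced it."""
    data = page.get("data", [])
    if isinstance(data, dict):
        # Sample and filter endpoints: "data" is a single tweet object.
        return [data]
    # Search and timeline endpoints: "data" is already an array of tweets.
    return list(data)
```

Everything downstream can then iterate over the result without caring which endpoint the page came from.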
