tidy_tweet's People

Contributors

betsybookwyrm, samhames

tidy_tweet's Issues

JSON parsing checklist

This checklist is for the first pass at parsing all the objects in the JSON. It's okay if the first pass misses some cases, but it should get us to the point where most of the data goes into the database and we can start testing to find things we've missed.

twarc + Twitter JSON:

  • __twarc
  • data
    • tweet fields
    • mentions
    • annotations
    • context annotations
    • references
  • includes
    • media
    • places
    • tweets (as for data above)
    • users
      • user fields
      • urls
      • hashtags
      • mentions
      • other entities???
  • meta
  • errors

Proposal: split tweet table into mutable and immutable components

I think the tweet table can be split into two groups of columns:

  • Immutable components that are inherent to the tweet, such as the text of the tweet (and content deriving from the text like mentions and urls)
  • Mutable components, such as the engagement metrics and whether there are any restrictions on who can reply

This allows us to track the mutable components with (tweet_id, collected_at) as the primary key, and the immutable component with whatever version is seen first (subject to the limitations you've noticed with tweets that are only in the includes).

In the context of streaming or longitudinal data collections, this makes it easy to track engagement with a tweet over time, because each retweet of a popular tweet carries updated engagement metrics for the original tweet.
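
A minimal sketch of the split, using Python's built-in sqlite3 module. The table name tweet_metrics, the metric columns (retweet_count, like_count) and the helper insert_observation are assumptions for illustration only, not the current tidy_tweet schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Immutable part: the first version of a tweet that is seen is kept.
    create table tweet (
        id integer primary key,
        text text not null
    );
    -- Mutable part: one row per observation of the tweet.
    create table tweet_metrics (
        tweet_id integer references tweet(id),
        collected_at text,
        retweet_count integer,
        like_count integer,
        primary key (tweet_id, collected_at)
    );
    """
)


def insert_observation(db, tweet_id, text, collected_at, retweet_count, like_count):
    """Record one sighting of a tweet (hypothetical helper)."""
    # The immutable row is only written the first time the tweet is seen.
    db.execute("insert or ignore into tweet values (?, ?)", (tweet_id, text))
    # Every sighting with a new collected_at adds a metrics row.
    db.execute(
        "insert or ignore into tweet_metrics values (?, ?, ?, ?)",
        (tweet_id, collected_at, retweet_count, like_count),
    )


# The same tweet is seen twice, for example via two later retweets.
insert_observation(db, 1, "hello world", "2022-01-01T00:00:00Z", 3, 10)
insert_observation(db, 1, "hello world", "2022-01-02T00:00:00Z", 8, 25)

metrics_over_time = db.execute(
    "select collected_at, retweet_count from tweet_metrics "
    "where tweet_id = 1 order by collected_at"
).fetchall()
```

The tweet table ends up with a single row, while tweet_metrics holds one row per collection time, so engagement over time is a plain query on the smaller table.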

Proposal: keep user profile properties like URLs and hashtags distinct from tweet hashtags and URLs

Apart from the structural differences between tweets and user profiles (a single text blob + attachments for tweets, multiple fields with different meanings/purposes for user profiles), URLs, hashtags and mentions in profiles are interpreted differently from the same elements in tweets. The current schema conflates these, so every query that touches these elements has to be more complicated to avoid mistakes.

Additionally, this schema means that we can't use foreign key constraints to maintain referential integrity.

As an example of this for hashtags, I propose we change the current schema:

create table hashtag (
    source_id text, -- the id of the object (user or tweet) this hashtag is included in
    source_type text, -- "user" or "tweet"
    field text not null, -- e.g. "description", "text" - which field of the source
                         -- object the hashtag is in
    tag text not null
)

To the following two tables:

create table if not exists tweet_hashtag (
    id integer references tweet(id),
    hashtag text,
    -- Normalisation, to allow indexing on common transformations of hashtags to match Twitter platform affordances
    -- Twitter makes no distinction between #SuperBowl, #SuperbOwl and #SUPERBOWL
    hashtag_lower text,
    primary key (id, hashtag)
);

create table if not exists user_profile_hashtag (
    -- References the user table rather than the tweet table
    id integer references user(id),
    field text,
    hashtag text,
    hashtag_lower text,
    primary key (id, field, hashtag)
);
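
A rough sketch of how the tweet-side table could be used from Python's sqlite3 module. The index name, the helper add_tweet_hashtag and the use of str.lower() as the normalisation are assumptions; Twitter's own case folding may differ for some scripts. Note that SQLite only enforces foreign keys when the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("pragma foreign_keys = on")  # off by default in SQLite
db.executescript(
    """
    create table tweet (id integer primary key);
    create table tweet_hashtag (
        id integer references tweet(id),
        hashtag text,
        hashtag_lower text,
        primary key (id, hashtag)
    );
    -- Hypothetical index supporting case-insensitive lookups.
    create index tweet_hashtag_lower on tweet_hashtag (hashtag_lower);
    """
)


def add_tweet_hashtag(db, tweet_id, hashtag):
    """Store a hashtag with its lowercased form (hypothetical helper)."""
    db.execute(
        "insert or ignore into tweet_hashtag values (?, ?, ?)",
        (tweet_id, hashtag, hashtag.lower()),
    )


db.executemany("insert into tweet values (?)", [(1,), (2,)])
add_tweet_hashtag(db, 1, "SuperBowl")
add_tweet_hashtag(db, 2, "SUPERBOWL")

# Both spellings are found through the normalised column.
matching = db.execute(
    "select id from tweet_hashtag where hashtag_lower = ? order by id",
    ("superbowl",),
).fetchall()

# With a real foreign key, a hashtag for an unknown tweet is rejected.
try:
    db.execute("insert into tweet_hashtag values (?, ?, ?)", (999, "orphan", "orphan"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```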

Versioning for database schema

We should version the database schema and store both the schema version and the library version in a metadata table in the database.
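
One possible shape for this, as a sketch only: a key/value metadata table written when the database is created and checked when it is opened. The table layout, the key names, the version strings and both helper functions are hypothetical.

```python
import sqlite3

SCHEMA_VERSION = "1"       # hypothetical schema version string
LIBRARY_VERSION = "0.0.0"  # placeholder, not a real tidy_tweet release


def initialise_metadata(db):
    """Create the metadata table and record versions if not already present."""
    db.execute(
        "create table if not exists metadata "
        "(key text primary key, value text not null)"
    )
    db.executemany(
        "insert or ignore into metadata values (?, ?)",
        [("schema_version", SCHEMA_VERSION), ("library_version", LIBRARY_VERSION)],
    )


def check_schema_version(db):
    """Return the stored schema version, or raise if it does not match."""
    row = db.execute(
        "select value from metadata where key = 'schema_version'"
    ).fetchone()
    if row is None or row[0] != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {row}")
    return row[0]


db = sqlite3.connect(":memory:")
initialise_metadata(db)
stored_version = check_schema_version(db)
```

Because the insert uses insert or ignore, reopening an existing database keeps the versions it was created with, which is what a later migration step would need to inspect.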

Proposal: handle directly_collected as a separate table.

Currently, directly_collected is a column on the tweet table. This has a couple of limitations:

  1. Because this table is updated as insert or ignore, the first version of a tweet seen 'wins' - this column is only correct if the tweets are inserted in chronological order, which isn't guaranteed (especially if there is more than one file to insert).
  2. Additionally this means that filtering on directly_collected in any other table requires joining against and processing the largest table in the collection.

I think we can avoid the order-of-operation problems and get more functionality out of structures like the following. This would also let us consider more complex tagging of collection properties later, and possibly allow us to do some nicer things to make #5 more legible for analysis.

create table if not exists tweet_source (
     id integer references tweet(id),
     directly_collected integer not null,
     primary key(directly_collected, id)
);

create table if not exists tweet_source_label (
     label text,
     directly_collected integer,
     id integer references tweet(id),
     primary key(label, directly_collected, id)
);

-- Or just a table that only has the tweet_ids of the directly collected tweets.
create table if not exists tweet_directly_collected (
     id integer primary key
);

All of these allow any of the other tables to filter against a much more compact (and correct) view of the directly_collected attribute.
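
A small sketch of the simplest variant (the table holding only the ids of directly collected tweets), showing that the result no longer depends on insertion order. The helper insert_tweet and the reduced tweet table are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    create table tweet (id integer primary key, text text);
    create table tweet_directly_collected (id integer primary key);
    """
)


def insert_tweet(db, tweet_id, text, directly_collected):
    """Insert a tweet and, separately, note whether it was directly collected."""
    # First version seen wins for the tweet row itself.
    db.execute("insert or ignore into tweet values (?, ?)", (tweet_id, text))
    # The directly-collected flag is recorded whenever it is seen,
    # even if the tweet row already exists.
    if directly_collected:
        db.execute(
            "insert or ignore into tweet_directly_collected values (?)",
            (tweet_id,),
        )


# Tweet 1 is first seen only in the includes, then later collected directly.
insert_tweet(db, 1, "original", False)
insert_tweet(db, 2, "a retweet of 1", True)
insert_tweet(db, 1, "original", True)

direct_ids = [
    row[0]
    for row in db.execute("select id from tweet_directly_collected order by id")
]
```

With a directly_collected column on tweet and insert or ignore, tweet 1 would have stayed flagged as not directly collected here.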

Opportunity: handling likes nicely

Having worked with the new Twitter like endpoints, I've found the data model to be a real pain, which I think is a real opportunity for tidy_tweet to do something nice. The Twitter endpoints give the following information:

  • liking-users gives a reverse chronological list of user profiles liking the tweet
  • liked-tweets gives a reverse chronological list of the tweets liked by a user

In both cases, there is no indicator in the Twitter JSON of which tweet (or user) was liked by which user. However, twarc injects the requested URL, which contains that information, into the __twarc field, allowing us to recover the proper relation from the data by itself.

A suggested approach is to insert the associated tweet/user profiles into the relevant tables, and create a liked_tweet table as below (with an index on (user_id, tweet_id) too):

create table if not exists liked_tweet (
    tweet_id integer references tweet(tweet_id),
    user_id integer references user(user_id),
    primary key (tweet_id, user_id)
);
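
A sketch of recovering the (tweet_id, user_id) pairs for liked_tweet from one page of twarc output. It assumes the requested URL is stored under page["__twarc"]["url"] and that it follows the Twitter API v2 paths /2/tweets/:id/liking_users and /2/users/:id/liked_tweets; the function like_pairs and the example pages are made up for illustration.

```python
import re

# Assumed URL shapes for the two like endpoints.
LIKING_USERS = re.compile(r"/2/tweets/(\d+)/liking_users")
LIKED_TWEETS = re.compile(r"/2/users/(\d+)/liked_tweets")


def like_pairs(page):
    """Yield (tweet_id, user_id) pairs from one page of results."""
    url = page["__twarc"]["url"]
    items = page.get("data", [])

    match = LIKING_USERS.search(url)
    if match:
        # The URL names the tweet; the data lists the users who liked it.
        tweet_id = int(match.group(1))
        for user in items:
            yield tweet_id, int(user["id"])
        return

    match = LIKED_TWEETS.search(url)
    if match:
        # The URL names the user; the data lists the tweets they liked.
        user_id = int(match.group(1))
        for tweet in items:
            yield int(tweet["id"]), user_id


# Made-up example pages, trimmed to the fields used above.
liking_page = {
    "__twarc": {"url": "https://api.twitter.com/2/tweets/20/liking_users"},
    "data": [{"id": "12"}, {"id": "13"}],
}
liked_page = {
    "__twarc": {"url": "https://api.twitter.com/2/users/12/liked_tweets"},
    "data": [{"id": "30"}],
}

pairs = list(like_pairs(liking_page)) + list(like_pairs(liked_page))
```

Each pair can then be written to liked_tweet with insert or ignore, after the tweet and user rows themselves have been inserted.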
