qut-digital-observatory / tidy_tweet
Tidies Twitter json collected with Twarc into relational tables.
Home Page: https://pypi.org/project/tidy-tweet/
License: MIT License
This checklist is for the first pass at parsing all the objects in the json. It's okay if it misses some cases, but it should get it to the point where most of the data goes in the database and we can start doing testing to find things we've missed.
Twarc + twitter json:
I think the tweet table can be split into two groups of columns:
This allows us to track the mutable components with (tweet_id, collected_at) as the primary key, and the immutable component with whatever version is seen first (subject to the limitations you've noticed with tweets that are only in the includes).
In streaming or longitudinal data collections, this makes it easy to track engagement with a tweet over time, because each retweet of a popular tweet carries updated engagement metrics for the original tweet.
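As a rough sketch of the split described above, the following uses Python's built-in sqlite3 module. The table name tweet_metrics and the columns author_id, text, retweet_count and like_count are assumptions made for illustration only; the actual column groups are not listed here.

```python
import sqlite3

# Hypothetical column names: the real column groups are not spelled out above.
SCHEMA = """
create table if not exists tweet (
    id integer primary key,
    author_id integer,
    text text
);
create table if not exists tweet_metrics (
    tweet_id integer references tweet(id),
    collected_at text,
    retweet_count integer,
    like_count integer,
    primary key (tweet_id, collected_at)
);
"""


def record_tweet(db, tweet, collected_at):
    # Immutable part: keep whichever version was seen first.
    db.execute(
        "insert or ignore into tweet (id, author_id, text) values (?, ?, ?)",
        (tweet["id"], tweet["author_id"], tweet["text"]),
    )
    # Mutable part: one row per (tweet_id, collected_at) observation.
    db.execute(
        "insert or ignore into tweet_metrics values (?, ?, ?, ?)",
        (tweet["id"], collected_at, tweet["retweet_count"], tweet["like_count"]),
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
first = {"id": 1, "author_id": 10, "text": "hello", "retweet_count": 2, "like_count": 5}
later = {"id": 1, "author_id": 10, "text": "hello", "retweet_count": 9, "like_count": 20}
record_tweet(db, first, "2022-01-01T00:00:00Z")
record_tweet(db, later, "2022-01-02T00:00:00Z")

# Engagement history for one tweet, oldest observation first.
history = db.execute(
    "select collected_at, retweet_count from tweet_metrics "
    "where tweet_id = 1 order by collected_at"
).fetchall()
tweet_count = db.execute("select count(*) from tweet").fetchone()[0]
```

The immutable row is written once, while every later sighting of the same tweet only adds a row to the metrics table.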
Apart from the structural differences between tweets and user profiles (a single text blob plus attachments for tweets, multiple fields with different meanings and purposes for user profiles), URLs, hashtags and mentions in profiles have different interpretations than the same elements in tweets. The current schema conflates these, so every query that touches these elements has to be more complicated to avoid mistakes.
Additionally, this schema means that we can't use foreign key constraints to maintain referential integrity.
As an example of this for hashtags, I propose we change the current schema:
create table hashtag (
source_id text, -- the id of the object (user or tweet) this hashtag is included in
source_type text, -- "user" or "tweet"
field text not null, -- e.g. "description", "text" - which field of the source
-- object the hashtag is in
tag text not null
)
To the following two tables:
create table if not exists tweet_hashtag (
id integer references tweet(id), -- the tweet this hashtag appears in
hashtag text,
-- Normalisation, to allow indexing on common transformations of hashtags to match Twitter platform affordances
-- Twitter makes no distinction between #SuperBowl, #SuperbOwl and #SUPERBOWL
hashtag_lower text,
primary key (id, hashtag)
);
create table if not exists user_profile_hashtag (
id integer references user(id), -- the user whose profile contains this hashtag
field text, -- which profile field the hashtag is in, e.g. "description"
hashtag text,
hashtag_lower text,
primary key (id, field, hashtag)
);
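To illustrate what the hashtag_lower column buys, here is a small sketch using Python's sqlite3 module. The index name, the helper add_tweet_hashtags, and the use of str.lower() as the normalisation are assumptions for illustration; Twitter's own matching rules may differ for non-ASCII text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table tweet (id integer primary key);
create table tweet_hashtag (
    id integer references tweet(id),
    hashtag text,
    hashtag_lower text,
    primary key (id, hashtag)
);
-- Hypothetical index name; lets case-insensitive lookups avoid a full scan.
create index tweet_hashtag_lower_idx on tweet_hashtag (hashtag_lower);
""")


def add_tweet_hashtags(db, tweet_id, hashtags):
    """Store each hashtag as written, plus a lowercased copy for matching."""
    db.execute("insert or ignore into tweet (id) values (?)", (tweet_id,))
    db.executemany(
        "insert or ignore into tweet_hashtag values (?, ?, ?)",
        [(tweet_id, tag, tag.lower()) for tag in hashtags],
    )


add_tweet_hashtags(db, 1, ["SuperBowl"])
add_tweet_hashtags(db, 2, ["SUPERBOWL", "nfl"])
add_tweet_hashtags(db, 3, ["SuperbOwl"])

# All three spellings are found through the normalised column.
matching = [
    row[0]
    for row in db.execute(
        "select distinct id from tweet_hashtag where hashtag_lower = ? order by id",
        ("superbowl",),
    )
]
```

The original spelling stays available in the hashtag column for analyses that care about capitalisation.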
Going to need a nice CLI, like what Gab Tidy Data has.
Needed to support:
We should version the database schema and store both the schema version and the library version in a metadata table in the database.
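One possible shape for such a metadata table, sketched with Python's sqlite3 module. The table name metadata, the key names, and both version strings are placeholders chosen for illustration, not values from the project.

```python
import sqlite3

# Placeholder values, for illustration only.
SCHEMA_VERSION = "1"
LIBRARY_VERSION = "0.0.0"


def write_metadata(db):
    """Record which schema and library version produced this database."""
    db.execute(
        "create table if not exists metadata (key text primary key, value text)"
    )
    db.executemany(
        "insert or replace into metadata values (?, ?)",
        [
            ("schema_version", SCHEMA_VERSION),
            ("library_version", LIBRARY_VERSION),
        ],
    )


def read_metadata(db):
    """Return the metadata table as a plain dict."""
    return dict(db.execute("select key, value from metadata"))


db = sqlite3.connect(":memory:")
write_metadata(db)
metadata = read_metadata(db)
```

A key/value layout keeps the table stable as more properties are added later; a single-row table with one column per property would also work.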
Currently, directly_collected is a column on the tweet table. This has a couple of limitations:
I think we can avoid the order-of-operation problems and get more functionality out of structures like the following. This would also let us consider more complex tagging of collection properties later, and possibly allow us to do some nicer things to make #5 more legible for analysis.
create table if not exists tweet_source (
id integer references tweet(id),
directly_collected integer not null,
-- a table can only declare one primary key, so use the composite key alone
primary key(directly_collected, id)
);
create table if not exists tweet_source_label (
label text,
directly_collected integer,
id integer references tweet(id),
primary key(label, directly_collected, id)
);
-- Or just a table that only has the tweet_ids of the directly collected tweets.
create table if not exists tweet_directly_collected (
id integer primary key
);
All of these allow any of the other tables to filter against a much more compact (and correct) view of the directly_collected attribute.
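As a sketch of the simplest option (the ids-only table), the following uses Python's sqlite3 module to show how another table can be filtered against it. The text column on tweet and the sample rows are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table tweet (id integer primary key, text text);
create table tweet_directly_collected (id integer primary key);
""")

# Sample rows, invented for this example.
db.executemany(
    "insert into tweet values (?, ?)",
    [(1, "original"), (2, "retweet of 1"), (3, "only seen in includes")],
)

# Marking a tweet as directly collected is idempotent, so it does not matter
# whether the tweet was first seen in the includes or in the data.
db.executemany(
    "insert or ignore into tweet_directly_collected values (?)",
    [(2,), (2,)],
)

direct = db.execute(
    "select tweet.id, tweet.text from tweet "
    "join tweet_directly_collected on tweet.id = tweet_directly_collected.id "
    "order by tweet.id"
).fetchall()
```

Because the flag lives in its own table, inserting a tweet and marking it as directly collected can happen in either order.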
Having worked with the new Twitter like endpoints, I find the data model a real pain to work with, which I think is a real opportunity for tidy_tweet to do something nice. The Twitter endpoints give the following information:
In both cases, there is no indicator in the Twitter JSON of which tweet (or user) was liked by which user; however, twarc injects the requested URL with that information into the __twarc field, allowing us to recover the proper relation from the data by itself.
A suggested approach is to insert the associated tweet/user profiles into the relevant tables, and create a liked_tweet table as below (with an index on (user_id, tweet_id) too):
create table if not exists liked_tweet (
tweet_id integer references tweet(tweet_id),
user_id integer references user(user_id),
primary key (tweet_id, user_id)
);
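A minimal sketch of recovering the relation from the injected URL, in Python. It assumes the requested URL is available as page["__twarc"]["url"] and that the liked-tweets request path looks like /2/users/<user id>/liked_tweets; both should be checked against real twarc output. The helper name and the sample page are invented for illustration.

```python
import re
import sqlite3

# Assumed URL shape for the "tweets liked by a user" endpoint.
LIKED_TWEETS_URL = re.compile(r"/2/users/(\d+)/liked_tweets")


def liked_tweet_rows(page):
    """Return (tweet_id, user_id) pairs for one page of liked-tweets results."""
    match = LIKED_TWEETS_URL.search(page["__twarc"]["url"])
    if match is None:
        return []  # not a liked-tweets page
    user_id = int(match.group(1))
    return [(int(tweet["id"]), user_id) for tweet in page.get("data", [])]


# Invented sample page, trimmed to the fields used here.
page = {
    "data": [{"id": "100"}, {"id": "101"}],
    "__twarc": {"url": "https://api.twitter.com/2/users/12/liked_tweets?max_results=100"},
}

db = sqlite3.connect(":memory:")
# Foreign key references are left out so the sketch runs on its own.
db.execute(
    "create table liked_tweet (tweet_id integer, user_id integer, "
    "primary key (tweet_id, user_id))"
)
db.executemany("insert or ignore into liked_tweet values (?, ?)", liked_tweet_rows(page))
rows = db.execute("select tweet_id, user_id from liked_tweet order by tweet_id").fetchall()
```

The liking-users endpoint could be handled the same way with a second pattern that extracts the tweet id from the URL instead.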
Time to organise the things! Time to open source the things!!!
Annoyingly, they have a slightly different format.
Instead of {"data": [array of tweets]} you get {"data": <single tweet>}.
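One way to absorb this difference is to normalise the data field to a list before any further parsing. A minimal Python sketch, with a hypothetical helper name:

```python
def tweets_in_page(page):
    """Return the tweets in a page as a list, whatever shape "data" has."""
    data = page.get("data", [])
    if isinstance(data, dict):
        # Single-tweet responses: {"data": <single tweet>}
        return [data]
    # Usual responses: {"data": [array of tweets]}
    return list(data)


many = tweets_in_page({"data": [{"id": "1"}, {"id": "2"}]})
single = tweets_in_page({"data": {"id": "3"}})
empty = tweets_in_page({})
```

Everything downstream can then iterate over the result without caring which response shape it came from.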