thiago-paim / twitter-scraping

Python 99.21% HTML 0.57% Dockerfile 0.22%

twitter-scraping's Issues

Celery logs always go to the same file

The Celery worker always logs to the same file, so the file grows too large over time.
We need to find a way to make it a rotating log file.
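
One possible approach is the size-based rotating handler from the Python standard library. The sketch below is only an illustration, not the project's configuration: the function name `build_rotating_logger`, the logger name, and the size limits are hypothetical, and attaching the handler to the Celery worker's logger (for example from Celery's `after_setup_logger` signal) is an assumption left out of the example.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=10 * 1024 * 1024, backup_count=5):
    """Return a logger whose file rolls over before it exceeds max_bytes.

    The function name and the default limits are hypothetical.
    """
    logger = logging.getLogger(f"rotating:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the records out of the root logger
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backup_count
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With `backup_count=5`, at most six files are kept (`worker.log` plus `worker.log.1` to `worker.log.5`); older content is discarded.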

Refactor scraping for new limitations

The scraping task must be refactored because TwitterSearchScraper doesn't work anymore. We should also take the opportunity to create a better separation of concerns:

record_tweet: Saves a tweet to the database, along with its related tweets (replies_to, conversation, retweets, quoted)
scrape_user_tweets (currently scrape_tweets_from_user): Scrapes all tweets from a user
scrape_tweet_replies: Scrapes all replies to a specific tweet
scrape_user_tweet_replies: Iterates over a user's saved tweets (for a scraping_request?) and calls scrape_tweet_answers for each of them

Reorganize README to make it easier for newcomers

Topics:

Add an explanation of the project
Split Docker setup and manual setup
Add instructions for scraping from the admin

Add task for creating and running the next scraping requests

Exports folder must be manually created before trying to export tweets
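
A small guard run before writing any export file could remove the manual step. This is a minimal sketch under stated assumptions: the helper name `ensure_export_dir` and the default folder name `exports` are hypothetical and may not match the project's actual paths.

```python
from pathlib import Path


def ensure_export_dir(base_dir, folder_name="exports"):
    """Create the export folder if it is missing and return its path.

    The helper name and default folder name are hypothetical.
    """
    export_dir = Path(base_dir) / folder_name
    # parents=True creates intermediate folders; exist_ok=True makes the
    # call safe to repeat when the folder already exists.
    export_dir.mkdir(parents=True, exist_ok=True)
    return export_dir
```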

Add Tweet field for unaccented content

Having a field for unaccented content will make it easier to search for keywords.

Steps:

Add Tweet.unaccented_content field
Add TweetManager.contains_unaccented_hate_words
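
The unaccented value could be derived with the standard library's `unicodedata` module. The sketch below is illustrative only: the helper names `unaccent` and `contains_unaccented_word` are hypothetical, and how the result is stored in `Tweet.unaccented_content` or queried from `TweetManager` is left to the actual implementation.

```python
import unicodedata


def unaccent(text):
    """Return text with diacritics removed (hypothetical helper)."""
    # NFKD splits each accented character into a base character followed
    # by combining marks, which are then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_unaccented_word(content, words):
    """Case-insensitive, accent-insensitive substring check (hypothetical)."""
    haystack = unaccent(content).lower()
    return any(unaccent(word).lower() in haystack for word in words)
```

Note that this is a plain substring match, so a short keyword also matches inside longer words.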

Consider adding django-auditlog

https://django-auditlog.readthedocs.io/en/latest/usage.html

Refactor create_next_scrapping_request to split ScrappingRequest creation and start

As is, the task create_next_scrapping_request both creates new ScrappingRequests (based on lists in values.py) and starts the next one.
A better architecture would decouple the two: one periodic task that only checks for and starts the next request, and another non-periodic task that creates ScrappingRequests and leaves them waiting in the queue.
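
The split can be sketched in plain Python, without Celery or Django. Everything here is hypothetical: the `Request` class, its status values, the two function names, and the rule that only one request runs at a time are assumptions made for illustration, not the project's models or tasks.

```python
from dataclasses import dataclass


@dataclass
class Request:
    username: str
    # Hypothetical states: created -> started -> finished
    status: str = "created"


def create_requests(queue, usernames):
    """Non-periodic step: only enqueue new requests, never start them."""
    new_requests = [Request(username=name) for name in usernames]
    queue.extend(new_requests)
    return new_requests


def start_next_request(queue):
    """Periodic step: start the oldest waiting request, if none is running."""
    if any(request.status == "started" for request in queue):
        return None  # assumed rule: one running request at a time
    for request in queue:
        if request.status == "created":
            request.status = "started"
            return request
    return None  # nothing is waiting
```

Keeping creation free of side effects means requests can be queued in bulk at any time, while the periodic step stays small enough to run often.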

Scraping doesn't work without since and until parameters

Fix scraping spelling

It was written with two p's in most cases (including the DB)

Scraped tweets don't automatically link in_reply_to_tweet and conversation_tweet

Update Django version

https://github.com/thiago-paim/twitter-scraping/security/dependabot/1

Increase visibility over ScrappingRequest results

Currently we only keep a record of the latest ScrappingRequest that created or updated a Tweet, so if you run similar scraping requests you might lose track of their results. Also, this is not visible in the admin (apart from a simple count of tweets).

Ideally it should be easy to:

Check which tweets were created by a ScrappingRequest
Check which tweets were updated by a ScrappingRequest
Filter tweets created/updated by a ScrappingRequest
Check all the errors that happened during a scraping run

Function in ScrappingRequest admin list page
Function in Tweet admin list page
