twitter-scraping's Issues
Celery logs always go to the same file
The Celery worker always logs to the same file, which grows too large over time.
We need to find a way to make it a rotating log file.
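A minimal sketch of one possible approach, using the standard library's `RotatingFileHandler`. The helper name, file name, and size limits are illustrative assumptions, not the project's actual configuration. In a Celery app, a handler like this could be attached from a function connected to the `celery.signals.after_setup_logger` signal.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_rotating_handler(path, max_bytes=10 * 1024 * 1024, backup_count=5):
    """Return a file handler that rolls over once the file would exceed max_bytes."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    return handler


def demo():
    """Write enough records to force rollovers and return the resulting file names."""
    log_dir = tempfile.mkdtemp()
    log_path = os.path.join(log_dir, "worker.log")  # hypothetical file name

    logger = logging.getLogger("rotation_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False

    # Tiny limits so the rollover is visible in a short demo.
    handler = build_rotating_handler(log_path, max_bytes=200, backup_count=2)
    logger.addHandler(handler)
    for i in range(50):
        logger.info("scraped batch %d", i)

    handler.close()
    logger.removeHandler(handler)
    return sorted(os.listdir(log_dir))
```

With `backup_count=2`, at most two old files (`worker.log.1`, `worker.log.2`) are kept next to the active `worker.log`; older ones are deleted on rollover.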
Switch DB to Postgres
Refactor scraping for new limitations
The scraping task must be refactored because TwitterSearchScraper doesn't work anymore. We should also take the opportunity to create a better separation of concerns:
- record_tweet: Saves a tweet to the database, along with related tweets (replies_to, conversation, retweets, quoted)
- scrape_user_tweets (currently scrape_tweets_from_user): Scrapes all tweets from a user
- scrape_tweet_replies: Scrapes all replies to a specific tweet
- scrape_user_tweet_replies: Iterates over a user's saved tweets (by scraping_request?) and calls scrape_tweet_answers for them
Reorganize README to make it easier for newcomers
Topics:
- Add an explanation of the project
- Split Docker setup and manual setup
- Add instructions for scraping from the admin
Add task for creating and running next scraping requests
Add Sentry to track errors
Exports folder must be manually created before trying to export tweets
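One possible fix, sketched under assumptions: create the folder right before exporting. The helper name `ensure_export_dir` is hypothetical, and the real export path is not stated in this issue.

```python
from pathlib import Path


def ensure_export_dir(path):
    """Create the export directory (and any missing parents) if it does not exist."""
    export_dir = Path(path)
    # exist_ok=True makes the call safe to repeat before every export.
    export_dir.mkdir(parents=True, exist_ok=True)
    return export_dir
```

Calling this at the start of the export routine would remove the manual step without failing when the folder is already there.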
Add Tweet field for unaccented content
Having a field for unaccented content will make it easier to search for keywords.
Steps:
- Add Tweet.unaccented_content field
- Add TweetManager.contains_unaccented_hate_words
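As a rough illustration of how the unaccented value might be computed, here is a standard-library sketch. The function names `unaccent` and `contains_unaccented_words` are hypothetical helpers, not the project's code, and the case-insensitive substring matching is an assumption.

```python
import unicodedata


def unaccent(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_unaccented_words(content, words):
    """Return True if any word appears in content, ignoring accents and case."""
    haystack = unaccent(content).lower()
    return any(unaccent(word).lower() in haystack for word in words)
```

For example, `unaccent("ação")` yields `"acao"`, so a keyword stored without accents still matches accented tweet text. The result could be stored in `Tweet.unaccented_content` when a tweet is saved.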
Consider adding django-auditlog
Refactor create_next_scrapping_request to split ScrappingRequest creation and start
As is, the task create_next_scrapping_request both creates new ScrappingRequests (based on lists in values.py) and also starts the next one.
A better architecture would decouple these two responsibilities: one periodic task that just checks for and starts the next request, and another non-periodic task that creates ScrappingRequests (and leaves them waiting in the queue).
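The split described above could look roughly like the following in-memory sketch. Everything here is hypothetical: `RequestStub` stands in for the real ScrappingRequest model, and the rule that only one request runs at a time is an assumption made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RequestStub:
    """Stand-in for a ScrappingRequest row (hypothetical fields)."""
    username: str
    status: str = "created"  # created -> started -> finished


def create_requests(usernames, queue):
    """Non-periodic step: only creates requests and leaves them waiting."""
    for username in usernames:
        queue.append(RequestStub(username))
    return len(queue)


def start_next_request(queue, running):
    """Periodic step: start the oldest waiting request if none is in progress."""
    if any(request.status == "started" for request in running):
        return None  # assumption: one request at a time
    if not queue:
        return None
    request = queue.popleft()
    request.status = "started"
    running.append(request)
    return request
```

Keeping creation separate means the periodic step never needs to know where requests come from; it only looks at what is waiting.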
Scraping doesn't work without since and until parameters
Fix scraping spelling
It was written with two p's in most cases (including the DB)
Scraped tweets don't automatically link in_reply_to_tweet and conversation_tweet
Update Django version
Increase visibility over ScrappingRequest results
Currently we only keep a record of the latest ScrappingRequest that created or updated a Tweet, so if you run similar scraping requests you might lose track of their results. Also, this is not visible in the admin (apart from a simple count of tweets).
Ideally it should be easy to:
- Check which tweets were created by a ScrappingRequest
- Check which tweets were updated by a ScrappingRequest
- Filter tweets created/updated by a ScrappingRequest
- Check all the errors that happened during a scraping run
Add the rest of Twitter fields to the models
There are many other fields returned by the Twitter scraper that aren't being saved. They might be useful later (like media, viewCounts, retweetedTweet, quotedTweet).
Create task for scraping only the user's tweets
Admin doesn't work with DEBUG disabled
The admin doesn't load static files and is therefore rendered incorrectly.
Error: Refused to apply style from '<URL>' because its MIME type ('text/html') is not a suppor
Create an easier way of exporting tweet CSVs
Currently you need to open the Django shell to export a CSV, which is not optimal for less technical users. The best option would probably be functions that can be called from the Django admin.
Some options:
- Function in ScrappingRequest admin list page
- Function in Tweet admin list page
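Either option needs a way to turn the selected tweets into CSV text. The sketch below shows that step with the standard library only. The function name `tweets_to_csv` and the example field names are assumptions, not the project's existing export code.

```python
import csv
import io


def tweets_to_csv(rows, fieldnames):
    """Serialize an iterable of dicts to CSV text, keeping only the given fields."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

In a Django admin action, the returned text could be sent back as an `HttpResponse` with `content_type="text/csv"` so the file downloads straight from the list page.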