whatsapp-scraper's Issues
Enable Deployment Workflows
- Set up a Dockerfile for Strapi
- Enable a GitHub Actions workflow to deploy to the dev and prod servers
- Set up a cron job to run the scraper task daily
- Test the CI/CD pipeline
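The daily scraper run could be scheduled with a crontab entry along these lines (the deploy path and entry point are illustrative assumptions, not taken from the repo):

```cron
# crontab -e, then add a line like this to run the scraper daily at 02:00,
# appending output to a log file (paths are assumptions):
0 2 * * * cd /srv/whatsapp-scraper/src/scraper && node . >> /var/log/whatsapp-scraper.log 2>&1
```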
getZipFileNames returns undefined
In the following code, `getZipFileNames` returns `undefined` even though the preceding line logs a value:

```javascript
function getZipFileNames(drive) {
  drive.files.list(
    {
      pageSize: 10,
      fields: "nextPageToken, files(id, name)",
    },
    (err, res) => {
      if (err) return console.log("The API returned an error: " + err);
      const files = res.data.files;
      if (files.length) {
        console.log("Files:");
        let zipFileNames = [];
        files.map((file) => {
          if (file.name.includes(".zip")) {
            console.log(file.name);
            zipFileNames = zipFileNames.concat(file.name);
          }
        });
        console.log(">>>", zipFileNames);
        return zipFileNames;
      }
    }
  );
}

function listFiles(auth) {
  const drive = google.drive({ version: "v3", auth });
  const fileId = "1_tCKjPYcjIfnloGiF318bbwUl7yI7u6U";
  const dest = fs.createWriteStream("/tmp/whatsapp_dump.zip");
  let zipFileNames = getZipFileNames(drive);
  drive.files
    .get({
      fileId: fileId,
      alt: "media",
    })
    .then((res) => {
      console.log(">", zipFileNames);
    });
}
```
function listFiles(auth) {
const drive = google.drive({ version: "v3", auth });
const fileId = "1_tCKjPYcjIfnloGiF318bbwUl7yI7u6U";
var dest = fs.createWriteStream("/tmp/whatsapp_dump.zip");
let zipFileNames = getZipFileNames(drive);
drive.files
.get({
fileId: fileId,
alt: "media",
})
.then((res) => {
console.log(">", zipFileNames);
});
}
To reproduce: clone my fork of the repo, run `npm i` in `src/scraper`, and then run `node .`
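The `undefined` happens because `drive.files.list()` is invoked with a callback, so the `return zipFileNames` inside that callback returns from the callback, not from `getZipFileNames`, which itself returns nothing. A minimal sketch of a fix, using the Promise form of `files.list` (which googleapis supports) and demonstrated against a fake Drive client:

```javascript
// Sketch of a fix: make the function async and await the Promise form of
// files.list, so the zip names are actually returned to the caller.
async function getZipFileNames(drive) {
  const res = await drive.files.list({
    pageSize: 10,
    fields: "nextPageToken, files(id, name)",
  });
  return res.data.files
    .filter((file) => file.name.includes(".zip"))
    .map((file) => file.name);
}

// A fake Drive client (not the real googleapis object) just to show the shape:
const fakeDrive = {
  files: {
    list: async () => ({
      data: { files: [{ id: "1", name: "a.zip" }, { id: "2", name: "b.txt" }] },
    }),
  },
};

getZipFileNames(fakeDrive).then((names) => console.log(names)); // [ 'a.zip' ]
```

The caller in `listFiles` would then need `await getZipFileNames(drive)` (inside an async function) instead of a plain assignment.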
Add Instructions to run the project locally
In the root README.md, create a section called 'Developing Locally' that lists the steps to run the project on your local machine. If the steps for a particular component are complex or multi-step, add them to that component's README (for instance src/api/README.md) and link to it from the root README.md.
The goal is to enable anyone to spin this up locally and start developing new features.
Store Scraped Data in a database
Using the server described in #11, write a script that periodically stores the messages from each WhatsApp group's JSON file into the database via the API.
This script's primary aim is to make debugging the scraper easy. If anything goes wrong in our usual operation, we should be able to run the script and have it sync any data that's not in the database yet.
Ingest whatsapp data from google drive
WhatsApp lets you export a group's chat via its 'export chat' feature.
This feature lets you back up 40,000 messages in total (10,000 if you choose to include media such as images and videos).
One of the options it provides is exporting this data dump to your Google Drive.
The scope of this task is to create a scraper that can fetch this data from a Google Drive and parse it into messages. @tarunima built a proof of concept of this in Python, which you can find here: https://github.com/tattle-made/whatsapp-scraper/blob/master/examples/googleDrive_load.py
Ensure that whatsapp messages from a group are stored in the database without duplication
Description
We intend to join WhatsApp groups, take their backups periodically, and upload them to Google Drive. Any time we run the scraper, we want only unique messages from a group to be stored in Mongo.
Assumptions
We can assume that a group's name is unique; this can be ensured by the contributors who submit their WhatsApp group backups and by Tattle team members. So you could use the name of the group as the identifier when storing messages in the database.
Proposed Solution - Timestamp based approach
For the WhatsApp group you intend to add scraped messages to, fetch the last stored message's timestamp (say X) and only store those newly scraped messages whose timestamp is later than X.
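The timestamp filter could be sketched like this (function and field names are illustrative, not from the repo):

```javascript
// Keep only scraped messages strictly newer than the last stored timestamp X.
function filterNewMessages(scrapedMessages, lastStoredTimestamp) {
  return scrapedMessages.filter(
    (msg) => new Date(msg.timestamp) > new Date(lastStoredTimestamp)
  );
}

const scraped = [
  { timestamp: "2020-01-01T10:00:00Z", text: "old" },
  { timestamp: "2020-01-02T10:00:00Z", text: "new" },
];
console.log(filterNewMessages(scraped, "2020-01-01T12:00:00Z"));
// → [ { timestamp: '2020-01-02T10:00:00Z', text: 'new' } ]
```

Note this only deduplicates if timestamps strictly increase; two messages sharing the boundary timestamp would need a tie-breaker.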
Build UI to moderate stored Whatsapp Messages
- Create a Gatsby site at src/ui-web-community
- Create a login page
- Show a list of all whatsapp groups
- Incorporate the UI you created during your trial tasks to moderate messages inside a WhatsApp group
- Support Deletion, adding tags and linking messages
- These changes need to be persisted via the Strapi backend
Discover Public Whatsapp Group
Discover public whatsapp groups on Twitter
Tattle only collects content from public WhatsApp groups that have been shared on a public platform such as Twitter or open Facebook groups.
- Scrape Twitter to find tweets that contain chat.whatsapp.com.
- Archive the tweet using an archiving service such as Webcitation.org, archive.org, or archive.is.
- Extract the WhatsApp group's public link
- Save the original tweet link, the archived tweet link, and the WhatsApp group link in a database
- Push the WhatsApp group link
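The link-extraction step could be a simple regex over the tweet text (a sketch; the exact invite-code alphabet is an assumption):

```javascript
// Pull chat.whatsapp.com invite links out of a tweet's text.
function extractWhatsAppLinks(tweetText) {
  const pattern = /https?:\/\/chat\.whatsapp\.com\/[A-Za-z0-9]+/g;
  return tweetText.match(pattern) || [];
}

console.log(
  extractWhatsAppLinks("Join us! https://chat.whatsapp.com/AbC123xyz #group")
);
// → [ 'https://chat.whatsapp.com/AbC123xyz' ]
```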
UI Bug: CSS Z-Index mismatch
anonymize numbers of chat conversation
Content extracted from a chat conversation must be anonymized before being added to Tattle's database.
This would involve replacing each phone number with a randomly generated ID. Within any exported file, a given number should always map to the same ID; this helps in identifying connected messages in a WhatsApp UI. The ID does not need to be unique across exported files.
Some anonymization techniques are described here: https://piwik.pro/blog/the-ultimate-guide-to-data-anonymization-in-analytics/
This step should be carried out as soon as the file is exported, into a database that allows field substitution, and prior to any other processing.
Make API Server for whatsapp-scraper
Task Overview :
- Create a Strapi app in the src/api directory
- Create content types for WhatsApp Group and WhatsApp Message
- Make API calls from the app you wrote at src/scraper to the Strapi server to store the scraped groups and messages
The server needs to support the following APIs
- return a list of all WhatsApp groups
- return a list of all WhatsApp messages within a group
- add a group
- add a message to a group
- delete a message from a group
- add tags to a message (we might also need a Tag content type)
- link two messages within the same group
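The scraper-to-Strapi call could look roughly like this. This is a sketch assuming Strapi's default REST routes for a hypothetical `whatsapp-messages` content type and a Node runtime with global `fetch` (Node 18+); the actual route names depend on how the content types are defined.

```javascript
// POST one scraped message to the Strapi server (route name is an assumption).
async function storeMessage(baseUrl, message) {
  const res = await fetch(`${baseUrl}/whatsapp-messages`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(message),
  });
  if (!res.ok) throw new Error(`Strapi returned ${res.status}`);
  return res.json();
}
```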
GDrive API ghost files in files.list call
The Google Drive API is weird. For some reason, after you delete files, they can still show up in the results returned by calls such as files.list.
If you're getting an inaccurate list of files when querying the API, make sure your Google Drive recycle bin is empty.
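Another mitigation at the API level: Drive v3's `files.list` accepts a `q` search query, and `"trashed = false"` excludes files sitting in the recycle bin. A sketch (demonstrated against a fake Drive client, not the real googleapis object):

```javascript
// List files while excluding anything in the Drive trash.
async function listNonTrashedFiles(drive) {
  const res = await drive.files.list({
    q: "trashed = false",
    fields: "files(id, name)",
  });
  return res.data.files;
}
```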
UI Bug: CSS breaks on >5 tags in a message bubble
Scraper Task Definition
User Story:
Different Tattle contributors periodically upload their chat backups to a designated folder on a Google Drive owned by Tattle. Tattle admins should then be able to run a script that downloads the contents of this drive and transforms them into the desired structure (explained below).
Background:
A WhatsApp group chat's content can be backed up to your Google Drive. This backup is stored in a folder that has the same name as the WhatsApp group (enforced by a Tattle team member). This folder contains:
- a .txt file containing a timestamped stream of WhatsApp messages, AND/OR
- the image and video files that were part of the group chat
Objective:
Obtain the data for every WhatsApp group in a structured form (JSON preferred) so that it can be stored in MongoDB. This structured file should contain:
- the timestamp of the message
- the content of the message
  - if the message is a text message, this should be a string containing that text
  - if the message is an image or video, it should contain the path to the file on your local machine
- an anonymized sender ID (to obfuscate the sender's phone number)
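One possible shape for a single parsed message (field names are a suggestion, not fixed by the task):

```javascript
// Illustrative message record matching the fields listed above.
const exampleMessage = {
  timestamp: "2020-02-01T18:42:00Z",
  sender_id: "a3f91c20d4b7e815", // anonymized ID, not the phone number
  content: "Good morning!", // or a local path, e.g. "/tmp/IMG-0001.jpg"
  content_type: "text", // "text" | "image" | "video"
};
console.log(JSON.stringify(exampleMessage, null, 2));
```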
Current Progress:
I encourage you to read about the various authentication methods that Google offers for programmatically accessing their services (Drive in our case). In my research I tried out a few and moved ahead with what they call Service Accounts.
Check out the functions getFilesInThisFolder(), getFoldersInThisFolder(), and getFolderFromDriveByName() here.
They contain some examples of how to GET directory and file information from Google Drive. Hopefully the parameters passed to the drive.files.list() function in my code will serve as documentation of the Google Drive API and save you some time.
You will also find authentication-related code in that file that might be helpful. In my understanding, the challenge with Google Drive has been figuring out the right authentication mechanism for your task; once that's done, the process of actually fetching data from Google Drive is always the same.
You'll also see a reference to a file named '/whatsapp-scraper-668a815fc26f.json'. This was generated for the service account for Tattle's Gmail account. We can send it to you in case you just want to try it out.
The phone-number obfuscation code is here.
Project Structure Discussion
I am proposing the following structure for the current sprint.
Three main components are planned as of now:
- Express app - this server is responsible for scraping the WhatsApp data dump from Google Drive (defined here) and from a zip file upload
- Strapi app
- Web UI for the Tattle team to moderate the scraped content
Do this important task
fs.readdirSync does not always return all the contents of a folder
So I have a function that is supposed to recursively return all the files in a folder; here it is:

```javascript
async function getFiles(dir) {
  const subdirs = await fs.readdirSync(dir);
  const files = await Promise.all(
    subdirs.map(async (subdir) => {
      const res = resolve(dir, subdir);
      return (await stat(res)).isDirectory() && !subdir.startsWith("__")
        ? getFiles(res)
        : res;
    })
  );
  return files.reduce((a, f) => a.concat(f), files);
}
```
The trouble is, it only returns all the contents recursively some of the time. Say the given directory has 5 subdirectories: it will sometimes only return the contents of 4. This happens infrequently, and if there is some underlying pattern, I am not able to detect it. Please help!
The expected output is that it should always return all the files in the folder, recursively, every time - not just some of the time, as it currently does.
To reproduce, please pull my latest commit and run node .