whatsapp-scraper's Introduction

whatsapp-scraper's People

Contributors

surajsharma, dennyabrain, scottrogowski, tarunima

Stargazers

Artem Shashkin, Kiran Jonnalagadda

Watchers

swairshah, James Cloos, Patricio Silva

whatsapp-scraper's Issues

Enable Deployment Workflows

  1. Setup Dockerfile for strapi
  2. Enable Github action workflow to deploy to dev and prod server
  3. Setup Cron job to run the scraper task daily
  4. Test CI/CD pipeline

getZipFileNames returns undefined

In the following code, getZipFileNames returns undefined even though the preceding line logs a value:


function getZipFileNames(drive) {
  drive.files.list(
    {
      pageSize: 10,
      fields: "nextPageToken, files(id, name)",
    },
    (err, res) => {
      if (err) return console.log("The API returned an error: " + err);

      const files = res.data.files;
      const pageToken = null,
        folderId = null;

      if (files.length) {
        console.log("Files:");
        let zipFileNames = [];
        files.map((file) => {
          if (file.name.includes(".zip")) {
            console.log(file.name);
            zipFileNames = zipFileNames.concat(file.name);
          }
        });

        console.log(">>>", zipFileNames);
        return zipFileNames;
      }
    }
  );
}

function listFiles(auth) {
  const drive = google.drive({ version: "v3", auth });
  const fileId = "1_tCKjPYcjIfnloGiF318bbwUl7yI7u6U";
  var dest = fs.createWriteStream("/tmp/whatsapp_dump.zip");

  let zipFileNames = getZipFileNames(drive);

  drive.files
    .get({
      fileId: fileId,
      alt: "media",
    })
    .then((res) => {
      console.log(">", zipFileNames);
    });
}

To reproduce, clone my fork of the repo, run npm i in src/scraper, and then node .
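The root cause is that the return zipFileNames statement executes inside the files.list callback, so the outer getZipFileNames function itself returns nothing. A minimal sketch of a Promise-based fix, assuming drive is a googleapis Drive v3 client (whose files.list returns a Promise when called without a callback):

```javascript
// Sketch of a Promise-based getZipFileNames. `drive` is assumed to be
// a googleapis Drive v3 client; files.list returns a Promise when it
// is called without a callback.
async function getZipFileNames(drive) {
  const res = await drive.files.list({
    pageSize: 10,
    fields: "nextPageToken, files(id, name)",
  });
  // Keep only the names of .zip files.
  return res.data.files
    .filter((file) => file.name.includes(".zip"))
    .map((file) => file.name);
}
```

The caller in listFiles would then await getZipFileNames(drive) (or chain .then) instead of reading the return value synchronously.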

Add Instructions to run the project locally

In the root README.md, create a section called 'Developing Locally' that lists the steps to run the project on your local machine. If the steps for a particular component are complex or multi-step, add them to that component's README (for instance, src/api/README.md) and link to it from the root README.md.

The goal is to enable anyone to spin this up locally and start developing new features.

Store Scraped Data in a database

Using the server described in #11, write a script that periodically stores the messages saved in the JSON file for each WhatsApp group into the database via the API.

This script's primary aim is to make debugging the scraper easy. If anything goes wrong in our usual operation, we should be able to run the script and it should sync any data that's not in the db yet.
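The dedup step of such a sync script could be sketched as a pure function. Note that the msg.id field here is a hypothetical unique message identifier, not something the current parser necessarily produces:

```javascript
// Sketch: given messages parsed from a group's JSON dump and the IDs
// already stored in the database, return only the messages that still
// need to be written via the API. `msg.id` is an assumed field.
function missingMessages(parsedMessages, storedIds) {
  const stored = new Set(storedIds);
  return parsedMessages.filter((msg) => !stored.has(msg.id));
}
```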

Ingest whatsapp data from google drive

Whatsapp lets you export chat from a group via an 'export chat' feature.

This feature lets you back up 40,000 messages in total (10,000 if you choose to include media - images, videos).
One of the options it provides is exporting this data dump to your Google Drive.

The scope of this task is to create a scraper that can fetch this data from a Google Drive and parse it into messages. @tarunima built a proof of concept of this in Python, which you can find here: https://github.com/tattle-made/whatsapp-scraper/blob/master/examples/googleDrive_load.py

Ensure that whatsapp messages from a group are stored in the database without duplication

Description

We intend to join WhatsApp groups, take their backups periodically, and upload them to Google Drive. Any time we run the scraper, we want only unique messages from a group to be stored in mongo.

Assumptions

We can assume that a group's name is unique. This can be ensured by the contributors who submit their WhatsApp group backups and by Tattle team members. So you could use the name of the group as the identifier when storing messages in the database.

Proposed Solution - Timestamp based approach

For the WhatsApp group that you intend to add scraped messages to, fetch the last stored message's timestamp (say X) and only store those newly scraped messages whose timestamp is later than X.
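A minimal sketch of that filter, assuming each scraped message carries a parseable timestamp field:

```javascript
// Sketch of the timestamp-based dedup: keep only scraped messages
// strictly newer than the last message already stored for the group
// (lastStoredTimestamp, i.e. X above).
function newMessages(scraped, lastStoredTimestamp) {
  const cutoff = new Date(lastStoredTimestamp).getTime();
  return scraped.filter((msg) => new Date(msg.timestamp).getTime() > cutoff);
}
```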

Build UI to moderate stored Whatsapp Messages

  1. Create a Gatsby site at src/ui-web-community
  2. Create a login page
  3. Show a list of all whatsapp groups
  4. Incorporate the UI you created during your trial tasks to moderate messages inside a WhatsApp group
    1. Support Deletion, adding tags and linking messages
    2. These changes need to be persisted via the Strapi backend

Discover public whatsapp groups on Twitter

Tattle is only collecting content from public WhatsApp groups whose links have been shared on a public platform such as Twitter or open Facebook groups.

  • Scrape Twitter to find tweets that contain chat.whatsapp.com.
  • Archive the tweet using one of the archive services, such as Webcitation.org, archive.org, or archive.is.
  • Extract the WhatsApp group public link
  • Save the original tweet link, the archived tweet link and the WhatsApp group link on a database
  • Push the WhatsApp group link
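The link-extraction step above could be sketched with a regex. The invite-code pattern here (alphanumerics after chat.whatsapp.com/) is an assumption, not a documented format:

```javascript
// Sketch: pull WhatsApp group invite links out of tweet text.
// The invite-code character class is an assumption.
function extractGroupLinks(tweetText) {
  const pattern = /https?:\/\/chat\.whatsapp\.com\/[A-Za-z0-9]+/g;
  return tweetText.match(pattern) || [];
}
```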

anonymize numbers of chat conversation

Content extracted from a chat conversation must be anonymized before being added to Tattle's database.

This would involve replacing the phone number with a randomly generated ID. Within any one exported file, a number should always map to the same ID; this helps in identifying connected messages in the WhatsApp UI. The ID does not need to be unique across exported files.

Some anonymization techniques are here: https://piwik.pro/blog/the-ultimate-guide-to-data-anonymization-in-analytics/

This step should be carried out as soon as the file is exported into a database which allows for field substitution and prior to any other processing.

Make API Server for whatsapp-scraper

Task Overview :

  1. Create a strapi app at the src/api directory
  2. Create Content types for WhatsApp Group and WhatsApp Message
  3. Make api call from the app you wrote at src/scraper to the strapi server to store groups and messages scraped

The server needs to support the following APIs

  • returns a list of all whatsapp groups
  • returns a list of all whatsapp messages within a group
  • add a group
  • add a message to a group
  • delete a message from a group
  • add tags to a message (we might also need a Tag content type)
  • link two messages within the same group
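As a rough sketch, the WhatsApp Message content type might look like this in Strapi v3's model settings.json format; all field and relation names here are assumptions, not a final schema:

```json
{
  "info": { "name": "whatsapp-message" },
  "attributes": {
    "timestamp": { "type": "datetime" },
    "content": { "type": "text" },
    "senderId": { "type": "string" },
    "group": { "model": "whatsapp-group" },
    "tags": { "collection": "tag" },
    "linkedMessages": { "collection": "whatsapp-message" }
  }
}
```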

GDrive API ghost files in files.list call

The Google Drive API is weird. For some reason, when you delete files, they still show up in the results returned by calls such as files.list.

If you're getting an inaccurate list of files when querying the API, make sure your Drive's recycle bin is empty.
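A sketch of the workaround in code, assuming drive is a googleapis Drive v3 client: v3's files.list includes trashed files unless they are filtered out with a query.

```javascript
// Sketch: exclude files sitting in the recycle bin from the listing
// by filtering on the `trashed` property in the query string.
function listActiveFiles(drive) {
  return drive.files.list({
    pageSize: 100,
    fields: "nextPageToken, files(id, name)",
    q: "trashed = false", // skip deleted-but-not-purged files
  });
}
```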

Scraper Task Definition

User Story:

Different Tattle contributors periodically upload their chat backups to a designated folder on a Google Drive owned by Tattle. Tattle admins should then be able to run a script to download the content of this drive and transform it into a desired structure (explained later).

Background:

A WhatsApp Group Chat’s content can be backed up on your google drive. This backup is stored in a folder that has the same name as the WhatsApp Group (enforced by a tattle team member). This folder contains :

  1. a .txt file containing a timestamped stream of WhatsApp messages AND/OR
  2. image and video files that were part of this group chat

Objective:

Obtain data for every WhatsApp group in a structured form (JSON preferred) so that it can be stored in MongoDB. This structured file should contain:

  1. the timestamp of the message,
  2. the content of the message
    1. If the message is a text message, this should be a string containing that text
    2. If the message is an image or a video, it should contain the path to the file on your local machine
  3. an Anonymized sender id (to obfuscate sender’s phone number)
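For illustration, one record in such a JSON file might look like this; the field names and the media path are hypothetical, not a fixed schema:

```javascript
// Hypothetical shape of one structured message record.
const exampleMessage = {
  timestamp: "2020-06-18T11:49:35+05:30", // when the message was sent
  content: "/tmp/whatsapp_dump/IMG-0001.jpg", // text string, or local path for media
  senderId: "a3f9c2b1d4e5", // anonymized stand-in for the sender's phone number
};
```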

Current Progress:

I encourage you to read about the various authentication methods that Google offers to programmatically access their services (Drive in our case). In my research, I tried out a few and moved ahead with something that they call Service Accounts.

Check out the functions getFilesInThisFolder(), getFoldersInThisFolder(), and getFolderFromDriveByName() here.
They contain some examples of how to GET directory and file information from Google Drive. Hopefully the parameters passed to the drive.files.list() function in my code will serve as documentation of the Google Drive API and save you some time.
You will also find authentication-related code in that file that might be helpful. In my understanding, the challenge with Google Drive has been figuring out the right authentication mechanism for your task; once that's done, the process of actually fetching data from Google Drive is always the same.

You'll also see a reference to a file named '/whatsapp-scraper-668a815fc26f.json'. This was generated for the service account for Tattle's Gmail account. We can send it to you in case you just want to try it out.

Phone number obfuscation code is here

Project Structure Discussion

I am proposing the following structure for the current sprint.

Three main components are planned as of now

  1. Express App - this server is responsible for scraping whatsapp data dump from google drive (defined here) and from a zip file upload
  2. Strapi App
  3. Web UI for the tattle team to moderate the scraped content

fs.readdirSync does not always return all the contents of a folder

So I have a function that is supposed to recursively return all the files in a folder, here it is:

async function getFiles(dir) {
  const subdirs = await fs.readdirSync(dir);
  const files = await Promise.all(
    subdirs.map(async (subdir) => {
      const res = resolve(dir, subdir);
      return (await stat(res)).isDirectory() && !subdir.startsWith("__")
        ? getFiles(res)
        : res;
    })
  );
  return files.reduce((a, f) => a.concat(f), files);
}

The trouble is, it only returns all the contents recursively some of the time. Say the given directory has 5 subdirectories; sometimes it will only return the contents of 4. This happens infrequently, and if there is an underlying pattern, I am not able to detect it. Please help!

The expected output is that it should always return all the files in the folder, recursively, every time, not just some of the time as it currently does.

To reproduce, please pull my latest commit and run node .
