whatsapp-scraper's Introduction

whatsapp-scraper's People

Contributors

surajsharma, dennyabrain, scottrogowski, tarunima

Stargazers

Artem Shashkin, Kiran Jonnalagadda

Watchers

swairshah, James Cloos, Patricio Silva

whatsapp-scraper's Issues

Enable Deployment Workflows

  1. Setup Dockerfile for strapi
  2. Enable Github action workflow to deploy to dev and prod server
  3. Setup Cron job to run the scraper task daily
  4. Test CI/CD pipeline

getZipFileNames returns undefined

In the following code, getZipFileNames returns undefined even though the preceding line logs a value:


function getZipFileNames(drive) {
  drive.files.list(
    {
      pageSize: 10,
      fields: "nextPageToken, files(id, name)",
    },
    (err, res) => {
      if (err) return console.log("The API returned an error: " + err);

      const files = res.data.files;
      const pageToken = null,
        folderId = null;

      if (files.length) {
        console.log("Files:");
        let zipFileNames = [];
        files.map((file) => {
          if (file.name.includes(".zip")) {
            console.log(file.name);
            zipFileNames = zipFileNames.concat(file.name);
          }
        });

        console.log(">>>", zipFileNames);
        return zipFileNames;
      }
    }
  );
}

function listFiles(auth) {
  const drive = google.drive({ version: "v3", auth });
  const fileId = "1_tCKjPYcjIfnloGiF318bbwUl7yI7u6U";
  var dest = fs.createWriteStream("/tmp/whatsapp_dump.zip");

  let zipFileNames = getZipFileNames(drive);

  drive.files
    .get({
      fileId: fileId,
      alt: "media",
    })
    .then((res) => {
      console.log(">", zipFileNames);
    });
}

To reproduce, clone my fork of the repo, run npm i in src/scraper, and then node .
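The root cause is that the return zipFileNames statement executes inside the files.list callback, so the outer getZipFileNames function itself returns nothing. A minimal sketch of a Promise-based fix, assuming drive is a googleapis Drive v3 client (whose files.list returns a Promise when called without a callback):

```javascript
// Sketch of a Promise-based getZipFileNames. `drive` is assumed to be
// a googleapis Drive v3 client; files.list returns a Promise when it
// is called without a callback.
async function getZipFileNames(drive) {
  const res = await drive.files.list({
    pageSize: 10,
    fields: "nextPageToken, files(id, name)",
  });
  // Keep only the names of .zip files.
  return res.data.files
    .filter((file) => file.name.includes(".zip"))
    .map((file) => file.name);
}
```

The caller in listFiles would then await getZipFileNames(drive) (or chain .then) instead of reading the return value synchronously.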

Add Instructions to run the project locally

In the root README.md, create a section called 'Developing Locally' that lists the steps to run the project on your local machine. If the steps for a particular component are complex or multi-step, add them to that component's README (for instance, src/api/README.md) and link to it from the root README.md.

The goal is to enable anyone to spin this up locally and start developing new features.

Store Scraped Data in a database

Using the server described in #11, write a script that periodically stores the messages saved in the JSON file for each WhatsApp group into the database via the API.

This script's primary aim is to make debugging the scraper easy. If anything goes wrong in our usual operation, we should be able to run the script and it should sync any data that's not in the db yet.
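The dedup step of such a sync script could be sketched as a pure function. Note that the msg.id field here is a hypothetical unique message identifier, not something the current parser necessarily produces:

```javascript
// Sketch: given messages parsed from a group's JSON dump and the IDs
// already stored in the database, return only the messages that still
// need to be written via the API. `msg.id` is an assumed field.
function missingMessages(parsedMessages, storedIds) {
  const stored = new Set(storedIds);
  return parsedMessages.filter((msg) => !stored.has(msg.id));
}
```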

Ingest whatsapp data from google drive

Whatsapp lets you export chat from a group via an 'export chat' feature.

This feature lets you back up 40,000 messages in total (10,000 if you choose to include media - images, videos).
One of the options it provides is exporting this data dump to your Google Drive.

The scope of this task is to create a scraper that can fetch this data from a Google Drive and parse it into messages. @tarunima built a proof of concept of this in Python, which you can find here: https://github.com/tattle-made/whatsapp-scraper/blob/master/examples/googleDrive_load.py

Ensure that whatsapp messages from a group are stored in the database without duplication

Description

We intend to join WhatsApp groups, take their backups periodically, and upload them to Google Drive. Any time we run the scraper, we want only unique messages from a group to be stored in mongo.

Assumptions

We can assume that a group's name is unique. This can be ensured by the contributors who submit their WhatsApp group backups and by Tattle team members. So you could use the name of the group as the identifier when storing messages in the database.

Proposed Solution - Timestamp based approach

For the WhatsApp group that you intend to add scraped messages to, fetch the last stored message's timestamp (say X) and only store those newly scraped messages whose timestamp is later than X.
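A minimal sketch of that filter, assuming each scraped message carries a parseable timestamp field:

```javascript
// Sketch of the timestamp-based dedup: keep only scraped messages
// strictly newer than the last message already stored for the group
// (lastStoredTimestamp, i.e. X above).
function newMessages(scraped, lastStoredTimestamp) {
  const cutoff = new Date(lastStoredTimestamp).getTime();
  return scraped.filter((msg) => new Date(msg.timestamp).getTime() > cutoff);
}
```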

Build UI to moderate stored Whatsapp Messages

  1. Create a Gatsby site at src/ui-web-community
  2. Create a login page
  3. Show a list of all whatsapp groups
  4. Incorporate the UI you created during your trial tasks to moderate messages inside a WhatsApp group
    1. Support Deletion, adding tags and linking messages
    2. These changes need to be persisted via the Strapi backend

Discover public whatsapp groups on Twitter

Tattle is only collecting content from public WhatsApp groups whose links have been shared on a public platform such as Twitter or open Facebook groups.

  • Scrape Twitter to find tweets that contain chat.whatsapp.com.
  • Archive the tweet using one of the archive services, such as Webcitation.org, archive.org, or archive.is.
  • Extract the WhatsApp group public link
  • Save the original tweet link, the archived tweet link and the WhatsApp group link on a database
  • Push the WhatsApp group link
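The link-extraction step above could be sketched with a regex. The invite-code pattern here (alphanumerics after chat.whatsapp.com/) is an assumption, not a documented format:

```javascript
// Sketch: pull WhatsApp group invite links out of tweet text.
// The invite-code character class is an assumption.
function extractGroupLinks(tweetText) {
  const pattern = /https?:\/\/chat\.whatsapp\.com\/[A-Za-z0-9]+/g;
  return tweetText.match(pattern) || [];
}
```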

anonymize numbers of chat conversation

Content extracted from a chat conversation must be anonymized before being added to Tattle's database.

This would involve replacing the phone number with a randomly generated ID. Within any one exported file, a number should always map to the same ID; this helps in identifying connected messages in the WhatsApp UI. The ID does not need to be unique across exported files.

Some anonymization techniques are here: https://piwik.pro/blog/the-ultimate-guide-to-data-anonymization-in-analytics/

This step should be carried out as soon as the file is exported into a database which allows for field substitution and prior to any other processing.

Make API Server for whatsapp-scraper

Task Overview :

  1. Create a strapi app at the src/api directory
  2. Create Content types for WhatsApp Group and WhatsApp Message
  3. Make api call from the app you wrote at src/scraper to the strapi server to store groups and messages scraped

The server needs to support the following APIs

  • returns a list of all whatsapp groups
  • returns a list of all whatsapp messages within a group
  • add a group
  • add a message to a group
  • delete a message from a group
  • add tags to a message (we might also need a Tag content type)
  • link two messages within the same group
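As a rough sketch, the WhatsApp Message content type might look like this in Strapi v3's model settings.json format; all field and relation names here are assumptions, not a final schema:

```json
{
  "info": { "name": "whatsapp-message" },
  "attributes": {
    "timestamp": { "type": "datetime" },
    "content": { "type": "text" },
    "senderId": { "type": "string" },
    "group": { "model": "whatsapp-group" },
    "tags": { "collection": "tag" },
    "linkedMessages": { "collection": "whatsapp-message" }
  }
}
```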

GDrive API ghost files in files.list call

The Google Drive API is weird. For some reason, when you delete files, they still show up in the results returned by calls such as files.list.

If you're getting an inaccurate list of files when querying the API, make sure your Drive's recycle bin is empty.
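A sketch of the workaround in code, assuming drive is a googleapis Drive v3 client: v3's files.list includes trashed files unless they are filtered out with a query.

```javascript
// Sketch: exclude files sitting in the recycle bin from the listing
// by filtering on the `trashed` property in the query string.
function listActiveFiles(drive) {
  return drive.files.list({
    pageSize: 100,
    fields: "nextPageToken, files(id, name)",
    q: "trashed = false", // skip deleted-but-not-purged files
  });
}
```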

Scraper Task Definition

User Story:

Different Tattle contributors periodically upload their chat backups to a designated folder on a Google Drive owned by Tattle. Tattle admins should then be able to run a script to download the content of this drive and transform it into a desired structure (explained later).

Background:

A WhatsApp Group Chat’s content can be backed up on your google drive. This backup is stored in a folder that has the same name as the WhatsApp Group (enforced by a tattle team member). This folder contains :

  1. a .txt file containing a timestamped stream of WhatsApp messages AND/OR
  2. image and video files that were part of this group chat

Objective:

Obtain data for every WhatsApp group in a structured form (JSON preferred) so that it can be stored in MongoDB. This structured file should contain:

  1. the timestamp of the message,
  2. the content of the message
    1. If the message is a text message, this should be a string containing that text
    2. If the message is an image or a video, it should contain the path to the file on your local machine
  3. an Anonymized sender id (to obfuscate sender’s phone number)
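For illustration, one record in such a JSON file might look like this; the field names and the media path are hypothetical, not a fixed schema:

```javascript
// Hypothetical shape of one structured message record.
const exampleMessage = {
  timestamp: "2020-06-18T11:49:35+05:30", // when the message was sent
  content: "/tmp/whatsapp_dump/IMG-0001.jpg", // text string, or local path for media
  senderId: "a3f9c2b1d4e5", // anonymized stand-in for the sender's phone number
};
```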

Current Progress:

I encourage you to read about the various authentication methods that Google offers to programmatically access their services (Drive in our case). In my research, I tried out a few and moved ahead with something that they call Service Accounts.

Check out the functions getFilesInThisFolder(), getFoldersInThisFolder(), and getFolderFromDriveByName() here.
They contain some examples of how to GET directory and file information from Google Drive. Hopefully the parameters passed to the drive.files.list() function in my code will serve as documentation of the Google Drive API and save you some time.
You will also find authentication-related code in that file that might be helpful. In my understanding, the challenge with Google Drive has been figuring out the right authentication mechanism for your task; once that's done, the process of actually fetching data from Google Drive is always the same.

You'll also see a reference to a file named '/whatsapp-scraper-668a815fc26f.json'. This was generated for the service account for Tattle's Gmail account. We can send it to you in case you just want to try it out.

Phone number obfuscation code is here

Project Structure Discussion

I am proposing the following structure for the current sprint.

Three main components are planned as of now

  1. Express App - this server is responsible for scraping whatsapp data dump from google drive (defined here) and from a zip file upload
  2. Strapi App
  3. Web UI for the tattle team to moderate the scraped content

fs.readdirSync does not always return all the contents of a folder

So I have a function that is supposed to recursively return all the files in a folder, here it is:

async function getFiles(dir) {
  const subdirs = await fs.readdirSync(dir);
  const files = await Promise.all(
    subdirs.map(async (subdir) => {
      const res = resolve(dir, subdir);
      return (await stat(res)).isDirectory() && !subdir.startsWith("__")
        ? getFiles(res)
        : res;
    })
  );
  return files.reduce((a, f) => a.concat(f), files);
}

The trouble is, it only returns all the contents recursively some of the time. Say the given directory has 5 subdirectories; sometimes it will only return the contents of 4. This happens infrequently, and if there is an underlying pattern, I am not able to detect it. Please help!

The expected output is that it should always return all the files in the folder, recursively, every time, not just some of the time as it currently does.

To reproduce, please pull my latest commit and run node .
