whatstk's Introduction

whatstk: analyze WhatsApp chats with python

Join the chat at https://gitter.im/sociepy/whatstk


Try the live demo parser to convert your chats to CSV


whatstk is a Python package, developed under the sociepy project, that provides tools to parse, analyze and visualise WhatsApp chats. Convert your chats to CSV or visualise some stats using the provided command-line tools or the Python API. The package uses pandas to process the data and plotly to visualise it.

It is distributed under the GPL-3.0 license.

⭐ Please star our project if you found it interesting to give us some dopamine 😄!

Installation

pip install whatstk

Install develop version (not stable):

pip install git+https://github.com/lucasrodes/whatstk.git@develop

More details here

Getting Started

For a rapid introduction, check this tutorial on Medium.

Export your chat using your phone:

See instructions.

Load chat as a DataFrame

from whatstk import df_from_txt_whatsapp
df = df_from_txt_whatsapp("path/to/chat.txt")

Convert chat to csv

$ whatstk-to-csv [input_filename] [output_filename]

More examples

See more in sections getting started and examples.

Documentation

See official documentation.

Contribute

See contribute section.

License

GPL-3.0

Citation

Lucas Rodés-Guirao. "whatstk, WhatsApp analysis and parsing toolkit", https://github.com/lucasrodes/whatstk

whatstk's Issues

Add basic text stats

Provide some basic tools to explore what users say. For instance, top words used per user etc.
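
A minimal sketch of what "top words used per user" could look like, using only the standard library. The input shape (an iterable of `(username, text)` pairs) and the function name `top_words_per_user` are assumptions made for illustration, not part of whatstk.

```python
import re
from collections import Counter, defaultdict


def top_words_per_user(messages, n=3):
    """Return the n most common words for each user.

    `messages` is an iterable of (username, text) pairs; this shape is an
    assumption made for the sketch, not whatstk's own data model.
    """
    counts = defaultdict(Counter)
    for username, text in messages:
        # Lowercase the text and count runs of word characters.
        counts[username].update(re.findall(r"\w+", text.lower()))
    return {user: counter.most_common(n) for user, counter in counts.items()}


sample = [
    ("Ann", "hello hello world"),
    ("Bob", "good morning"),
    ("Ann", "hello again"),
]
top = top_words_per_user(sample, n=1)
```

A real implementation would probably also want stop-word filtering, which is left out here.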

Package dependencies

I can't use plotly, numpy and pandas due to a dependency of this library. Could you try to fix it?

Thanks, regards

Generate random chat.

Generate random chat in multiple formats, so the library can be tested against multiple examples.

Use Lorem ipsum for instance
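
One way this could be sketched, assuming a simple "date - user: message" layout and a fixed Lorem ipsum vocabulary. The function name `generate_chat`, its parameters and the start date are hypothetical; passing a different `date_format` yields the same chat structure with another header style.

```python
import random
from datetime import datetime, timedelta

LOREM = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua"
).split()


def generate_chat(n_messages, users, date_format="%d/%m/%Y, %H:%M", seed=0):
    """Return a fake chat as text, one 'date - user: message' line per message."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    current = datetime(2020, 1, 1, 9, 0)
    lines = []
    for _ in range(n_messages):
        # Advance the clock by a random gap so timestamps keep increasing.
        current += timedelta(minutes=rng.randint(1, 120))
        text = " ".join(rng.choices(LOREM, k=rng.randint(1, 8)))
        header = current.strftime(date_format)
        lines.append(f"{header} - {rng.choice(users)}: {text}")
    return "\n".join(lines)


chat_24h = generate_chat(5, ["Ann", "Bob"])
chat_12h = generate_chat(5, ["Ann", "Bob"], date_format="%m/%d/%y, %I:%M %p")
```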

Ease the installation process

Prepare the module for easy installation, for instance by publishing it to the PyPI repository so that it can be installed simply using pip.

Header automatic extraction failed: '[%m/%d/%y, %H:%M:%S] %name:' (iOS)

Exported from the WhatsApp iOS app (business and personal). I am an American living internationally (not sure if that matters).

Manually matched pattern here: '[%m/%d/%y, %H:%M:%S] %name:'

[12/27/20, 18:51:14] My Group 2023: ‎Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them.
[7/18/21, 18:34:07] ~ Nev: That's awesome. Well said.
[7/18/21, 20:36:20] Sandrine - French: What is this?
[7/20/21, 11:21:27] ‪+1 (777) 777‑7777‬: ‎‪+1 (777) 777‑7777‬ joined using this group's invite link

Not parsing correctly when time is in am/pm

I've extracted chats from my phone which is set to use 12H time, and hence the whatsapp extractions do as well. E.g.
09/10/2023, 11:11 am - +44 7123 456789:
09/10/2023, 12:57 pm - +44 7123 456789:

df_from_txt_whatsapp was not importing the message correctly, often including the following line of the text as the message of the previous line.

I have traced this back to the following problem -

import whatstk as wa
from whatstk.whatsapp.auto_header import extract_header_from_text

text = wa.whatsapp.parser._str_from_txt(filepath)
hformat = extract_header_from_text(text)

This gives hformat as '%d/%m/%y, %H:%M am - %name'
i.e. the 'am' is hard coded, and it does not recognise the timestamps of messages sent in the afternoon as timestamps!

This can be solved by instead using
hformat = '%d/%m/%y, %I:%M %p - %name:'

But I have not yet worked out how to fully integrate this into the program
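
To illustrate the idea outside of whatstk, here is a standalone sketch that matches both "am" and "pm" headers of the form shown above and converts them to 24-hour datetimes. The regular expression and the `parse_header` helper are my own assumptions for this one header layout; they are not how whatstk's auto-header works internally.

```python
import re
from datetime import datetime

# Accept either meridiem marker instead of a hard-coded "am".
HEADER_12H = re.compile(
    r"^(\d{1,2})/(\d{1,2})/(\d{4}), (\d{1,2}):(\d{2}) ([ap]m) - ([^:]+):",
    re.IGNORECASE,
)


def parse_header(line):
    """Return (timestamp, username) for a header line, or None otherwise."""
    match = HEADER_12H.match(line)
    if match is None:
        return None
    day, month, year, hour, minute, meridiem, name = match.groups()
    # 12-hour to 24-hour conversion: 12 am -> 0, 12 pm -> 12.
    hour = int(hour) % 12
    if meridiem.lower() == "pm":
        hour += 12
    return datetime(int(year), int(month), int(day), hour, int(minute)), name
```

Lines that do not start with a header return None, so they can be treated as continuations of the previous message.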

pip3 missing

The program 'pip3' is currently not installed.

run command: sudo apt install python3-pip

chat with 12h clock not working properly

Given a chat with the following format:

2015-08-22, 12:15 PM - X: bla bla
2015-08-22, 12:15 PM - Y: bla
2015-08-22, 12:16 PM - Y: blabla
2015-08-22, 12:17 PM - X: bla

Both with auto_header option

>>> from whatstk import WhatsAppChat
>>> chat = WhatsAppChat.from_txt('example.txt')

and manually setting hformat

>>> from whatstk import WhatsAppChat
>>> chat = WhatsAppChat.from_txt('example.txt', hformat='%y-%m-%d, %P:%M - %name:')

a KeyError exception is raised:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-5e2de28f6c26> in <module>
----> 1 chat = WhatsAppChat.from_txt(filename)

~/whatstk/whatstk/objects.py in from_txt(cls, filename, auto_header, hformat, encoding)
     44         hformat.replace('[', '\[').replace(']', '\]')
     45         # Prepare DataFrame
---> 46         df = cls._prepare_df(text, hformat)
     47 
     48         return cls(df)

~/whatstk/whatstk/objects.py in _prepare_df(text, hformat)
     79 
     80         # Parse chat to DataFrame
---> 81         df = parse_chat(text, r)
     82 
     83         # get rid of wp warning messages

~/whatstk/whatstk/utils/parser.py in parse_chat(text, regex)
     51         line_dict = _parse_line(text, headers, i)
     52         result.append(line_dict)
---> 53     df_chat = pd.DataFrame.from_records(result, index='date')
     54     return df_chat[['username', 'message']]
     55 

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1527             if (isinstance(index, compat.string_types) or
   1528                     not hasattr(index, "__iter__")):
-> 1529                 i = columns.get_loc(index)
   1530                 exclude.add(index)
   1531                 if len(arrays) > 0:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date'

Reduce library dependencies so that whatstk becomes lighter

First idea:

  • Libraries such as scipy, lorem-python, ... are only used if the user wants to generate chats. They could be removed from the core dependencies and offered as an extra, something like pip install whatstk[generation]. Also add an error message when calling generation methods, alerting the user that scipy should be installed.
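
A rough sketch of the error-message part, assuming the extra ends up being called "generation" as suggested above. The helper name `require_optional` is hypothetical; generation code would call it at the point where the optional module is first needed.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Point the user at the (proposed) extra instead of a bare ImportError.
        raise ImportError(
            f"'{module_name}' is needed for this feature. "
            f"Install it with: pip install whatstk[{extra}]"
        ) from err
```

For example, a generation function could start with `scipy = require_optional("scipy", "generation")` so that users who never generate chats do not need scipy at all.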

Parser

I receive this AttributeError when using the sample chats, both dash and dot:

~/coding/whatstk/whatstk/parser.py in _get_date_component(header, pattern, offset)
    155     py = re.compile(pattern)
    156     match_0 = py.match(header[offset:])
--> 157     component = int(match_0.group()[:-1])
    158     component_end = match_0.end() + offset
    159     return component, component_end

AttributeError: 'NoneType' object has no attribute 'group'

Also, I'm not able to open the whatstkenv kernel in the Jupyter notebook. Are those two related?
I have tried to modify the _get_date_component function but couldn't get rid of the error.
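
The AttributeError means `py.match(...)` returned None, i.e. the header did not match the pattern at that offset. A possible guard, written as a standalone function modelled on the traceback above (only the None check and the error message are added, and they are my own suggestion):

```python
import re


def get_date_component(header, pattern, offset):
    """Parse one numeric date component that is followed by a separator."""
    match = re.compile(pattern).match(header[offset:])
    if match is None:
        # Fail with a readable message instead of "'NoneType' has no attribute 'group'".
        raise ValueError(
            f"Header {header!r} does not match {pattern!r} at offset {offset}"
        )
    component = int(match.group()[:-1])  # drop the trailing separator
    return component, match.end() + offset
```

This does not fix the underlying mismatch, but it makes it obvious which header and pattern disagree.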

Migrate CI/CD: Travis to GitHub Actions

Travis CI/CD is not working properly. We should migrate our CI/CD to GitHub actions.

Things to consider:

  • Build and test on various OS and python versions should be triggered for each PR.
  • Build and deploy of the package to PyPI should be done for each release (i.e. tag in main branch).
  • Documentation should be shipped to Read the Docs.

IndexError: list index out of range

I can't seem to parse my whatsapp txt file
Format is [MM/DD/YYYY hh:mm:ss PM] - username: message

Do you have solution for this?
Your help is much appreciated

Error when using `pip install`

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for numpy
Successfully built whatstk
Failed to build numpy
ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects

Provide sample chats

Provide some sample chats so that the user can test the project on a fake chat.

In addition, this chat should be used as the standard file when comparing results between users or simply for debugging purposes.

Rethink README.md

Currently the README.md file is a bit crowded. We might want to simplify and better structure it.

1. Titles

2. Getting Started section

  • Reduce verbosity and simplify examples. Move more specific examples to another file (e.g. docs/examples.md)
  • Add chat exporting mini-tutorial details. Example in http://github.com/lucasrodes/whatstk-gui/.

3. Other sections

  • Add "Documentation" link in TOC
  • Remove "Known Issues"
  • List contributors

Improve readme file

Add sections

  • Contribution (explain how users can contribute to the project)
  • Help and Support (explain where users can get support)

Please comment if you find there should be other sections :)

Chat Parser Not Working

I am trying to use the WhatsAppChat class of whatstk, but it no longer works. It worked before June; apparently an update to WhatsApp broke it.

I use it like this: "chat = WhatsAppChat.from_source(filepath="/content/_chat.txt")"

I get this error: Header automatic extraction failed. Please specify the format manually by setting input argument hformat. Report this issue so that automatic header detection support for your header format is added: https://github.com/lucasrodes/whatstk/issues.

Merge two chats

Sometimes we may have several chats from the same group, exported at different moments. Some of them might have messages that others do not have (maybe because of phone change). Therefore, it could be interesting to be able to merge different chats.

Example usage could be:

Option 1

from whatstk import WhatsAppChat
chat_1 = WhatsAppChat.from_txt('chat-part-1.txt')
chat_2 = WhatsAppChat.from_txt('chat-part-2.txt')

chat = WhatsAppChat.merge([chat_1, chat_2])

Option 2

from whatstk import WhatsAppChat
chat = WhatsAppChat.from_multiple_txt(['chat-part-1.txt', 'chat-part-2.txt'])

Option 3

from whatstk import WhatsAppChat
chat_1 = WhatsAppChat.from_txt('chat-part-1.txt')
chat_2 = WhatsAppChat.from_txt('chat-part-2.txt')

chat = chat_1.merge(chat_2)
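
Whichever API option is chosen, the core of the merge could be sketched as below. It assumes each chat is a list of `(date, username, message)` tuples, which is a simplification for illustration and not whatstk's internal representation; `merge_chats` is a hypothetical name.

```python
from datetime import datetime


def merge_chats(*chats):
    """Merge several exports of the same chat, dropping repeated records."""
    seen = set()
    merged = []
    for chat in chats:
        for record in chat:
            # A record present in more than one export is kept only once.
            if record not in seen:
                seen.add(record)
                merged.append(record)
    # sorted() is stable, so records with equal timestamps keep their order.
    return sorted(merged, key=lambda record: record[0])


part_1 = [
    (datetime(2020, 1, 1, 10, 0), "Ann", "hi"),
    (datetime(2020, 1, 1, 10, 5), "Bob", "hello"),
]
part_2 = [
    (datetime(2020, 1, 1, 10, 5), "Bob", "hello"),
    (datetime(2020, 1, 1, 10, 7), "Ann", "how are you?"),
]
merged = merge_chats(part_1, part_2)
```

One caveat of this approach: a user who sends the same text twice within the same timestamp resolution would also be collapsed into a single record.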

Emojis and sentiment analysis

For a far, far away release, I think it could be a good idea to give special treatment to emojis. It could enrich the sentiment analysis, no? =)

bug in cleaning data

For some chat formats, the script fails to clean the data.

Traceback (most recent call last):
  File "main.py", line 20, in <module>
    data = cf.clean_data(lines)
  File "whatsapp-stats/chatformat.py", line 65, in clean_data
    data[-1][2] = data[-1][2] + "\n" + line
IndexError: list index out of range

Fix links

Some links in README are not properly set.

Automate chat text files to CSV conversion

Hi,
I'm looking to import txt whatsapp chats from a folder into Python without specifying what the txt file names will be.
In essence, I want to be able to run it in a way that is irrespective of the files that are in the folder, it just picks up whatever txt files are in that folder (for example they might be changing in number and changing daily). I was wondering if there is a way to do this and still leverage the whatstk library to clean the data from all files and then export it as one to csv?
Thanks,
Matt
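
A possible starting point, using pathlib to pick up whatever .txt files are in the folder at run time. The `parse_file` argument is a hypothetical stand-in for the actual chat loader: here it is assumed to be any callable that takes a path and yields `(date, username, message)` rows. The output column names are likewise my own choice.

```python
import csv
from pathlib import Path


def folder_to_csv(folder, output_csv, parse_file):
    """Parse every .txt file in `folder` and write all rows to one CSV file.

    Returns the number of chat files that were processed.
    """
    # Discover the files at call time, so the folder contents can change daily.
    paths = sorted(Path(folder).glob("*.txt"))
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "date", "username", "message"])
        for path in paths:
            for date, username, message in parse_file(path):
                # Keep the file name so rows can be traced back to their chat.
                writer.writerow([path.name, date, username, message])
    return len(paths)
```

With whatstk, the same loop could instead call df_from_txt_whatsapp on each path and concatenate the resulting DataFrames before exporting.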

Insights on message length per user

This could take the form of distributions or box plots. Different box plots or distribution plots could be generated per date element.
That is, the distribution for user X in 2018 may differ from the one in 2020; maybe they've become quieter.
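
Before any plotting, the numbers behind such box plots could be computed roughly as follows. The input is assumed to be `(date, username, message)` tuples and the grouping is per user and year; both choices, and the name `length_summary`, are illustrative only.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def length_summary(messages):
    """Summarise message lengths (in characters) per (username, year)."""
    groups = defaultdict(list)
    for date, username, text in messages:
        groups[(username, date.year)].append(len(text))
    return {
        key: {
            "count": len(lengths),
            "min": min(lengths),
            "median": median(lengths),
            "max": max(lengths),
        }
        for key, lengths in groups.items()
    }


sample = [
    (datetime(2018, 1, 1), "X", "hello"),
    (datetime(2018, 5, 1), "X", "hi"),
    (datetime(2020, 1, 1), "X", "ok"),
]
summary = length_summary(sample)
```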

Installation error with Python 3.8 and Visual Studio

Hi, does anybody know what is not working here? Cheers

PS C:\Users\Notebook\Desktop\test_python3> pip install whatstk

...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2117: character maps to <undefined>
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Parse messages which contain \n characters

Description

Currently, chat message like this one

[Screenshot: a chat message spanning several lines]

get processed into this

2023-03-07 15:38:29,Person1,hey hey
2023-03-07 15:38:36,Person1,good productive day or bad one?
2023-03-07 16:11:41,Person2,"Hey dear
Returned home after being out
Productive 
Certainly 
How about yours?"

which isn't ideal because it breaks the format.

They should be converted to

2023-03-07 15:38:29,Person1,hey hey
2023-03-07 15:38:36,Person1,good productive day or bad one?
2023-03-07 16:11:41,Person2,"Hey dear\nReturned home after being out\nProductive\nCertainly \nHow about yours?"

I have written this function to handle this case, might be a good starting point.

import re


def is_complete_message(s):
    """
    Given a line from an exported WhatsApp chat, return whether it starts a
    complete message (i.e. begins with a "[date, time]" header)

    :param s: string
    :return: bool

    >>> is_complete_message("[2021-05-05, 11:44:26] A: message content")
    True

    >>> is_complete_message("Message content")
    False
    """
    # Header such as "[2021-05-05, 11:44:26]" at the start of the line
    pattern = r"^\[\d{4}-\d{2}-\d{2},\W\d{2}:\d{2}:\d{2}]"
    return bool(re.search(pattern, s))

and this

all_messages = []
# `blob` and `whatsapp` are my own objects: a file-like blob and a helper
# module providing is_complete_message and parse_message.
with blob.open(mode="r") as file:
    for line in file:
        if whatsapp.is_complete_message(line):
            # A header line starts a new message.
            all_messages.append(whatsapp.parse_message(line))
        elif all_messages:
            # Continuation line: append it to the previous message.
            # (Skipped if no message has been seen yet, to avoid an IndexError.)
            all_messages[-1]["content"] += "\n" + line

They might be a good starting point :)
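
For reference, a self-contained variant of the same idea that needs no external objects. It groups raw lines into messages and then escapes embedded newlines so each message stays on one CSV row. The header pattern only covers the "[YYYY-MM-DD, hh:mm:ss]" layout used in this issue, and both function names are hypothetical.

```python
import re

# Header such as "[2023-03-07, 16:11:41]" at the start of a line.
HEADER = re.compile(r"^\[\d{4}-\d{2}-\d{2},\s\d{2}:\d{2}:\d{2}\]")


def group_messages(lines):
    """Join continuation lines onto the message they belong to."""
    messages = []
    for raw in lines:
        line = raw.rstrip("\n")
        if HEADER.match(line) or not messages:
            # A header (or the very first line) starts a new message.
            messages.append(line)
        else:
            messages[-1] += "\n" + line
    return messages


def escape_newlines(message):
    """Replace real newlines with the two characters backslash and n."""
    return message.replace("\n", "\\n")
```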

Consider system messages

I think we shouldn't discard system messages. They contain useful information such as "who added whom", which could be used in a network analysis. They also record when users join and leave, which could answer questions such as: how long do users stay in the group? It would also allow computing a turnover rate.
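
As a rough illustration of the network idea, the sketch below extracts "who added whom" edges from system message texts. The wording "X added Y" is an assumption about how these messages read in an English-language export; other languages and WhatsApp versions will differ, and `added_edges` is a hypothetical name.

```python
import re
from collections import Counter

# Assumed English wording of the system message: "<actor> added <target>".
ADDED = re.compile(r"^(?P<actor>.+?) added (?P<target>.+)$")


def added_edges(system_messages):
    """Count directed (actor, target) edges from 'added' system messages."""
    edges = Counter()
    for text in system_messages:
        match = ADDED.match(text)
        if match:
            edges[(match["actor"], match["target"])] += 1
    return edges


edges = added_edges(["Ann added Bob", "Bob left", "Ann added Cid"])
```

The resulting edge list could then be fed into any graph library for the network analysis.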

WhatsApp is used as a work tool, and genuine communities of practice exist in the WhatsApp ecosystem. Understanding human behavior in those groups could help us understand human behavior related to the 'domain' of these communities.

The literature says a community of practice consists of a joint enterprise, mutual engagement, and a shared repertoire (WENGER, 1998).

WENGER, E. Communities of practice: Learning, meaning, and identity. Cambridge University Press, 1998. (Learning in Doing: Social, Cognitive and Computational Perspectives). ISBN 9780521430173. DOI: 10.1017/CBO9780511803932.

Add option to filter chat by date

Concept idea

When loading chat

from whatstk import WhatsAppChat
chat = WhatsAppChat.from_txt('file.txt', date_min='2020-01-01', date_max='2020-01-21')

When analysing chat

from whatstk import WhatsAppChat
chat = WhatsAppChat.from_txt('file.txt')
chat = chat.filter_dates(date_min='2020-01-01', date_max='2020-01-21')
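
The filtering itself could be as small as the sketch below. It works on a plain list of `(date, username, message)` tuples rather than on a WhatsAppChat object, treats both bounds as inclusive and accepts ISO date strings; all of these are assumptions for illustration.

```python
from datetime import date, datetime


def filter_dates(messages, date_min=None, date_max=None):
    """Keep messages whose date lies within [date_min, date_max], inclusive."""
    lower = date.fromisoformat(date_min) if date_min else date.min
    upper = date.fromisoformat(date_max) if date_max else date.max
    # Compare on the calendar date so the whole of date_max is included.
    return [m for m in messages if lower <= m[0].date() <= upper]


sample = [
    (datetime(2019, 12, 31, 23, 59), "Ann", "almost"),
    (datetime(2020, 1, 1, 0, 0), "Bob", "happy new year"),
    (datetime(2020, 1, 21, 23, 0), "Ann", "late"),
    (datetime(2020, 1, 22, 0, 0), "Bob", "too late"),
]
kept = filter_dates(sample, date_min="2020-01-01", date_max="2020-01-21")
```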

Text Sentiment Analysis support

Implement first version of some sort of sentiment analysis using the text from the chat (e.g. at user level, who is the happiest member based on what s/he writes?)

IndexError: list index out of range

List of detected possible header formats (those ticked are currently supported):

  • DD.MM.YYYY, hh:mm - username:
  • DD.MM.YYYY, hh:mm: username:
  • DD.MM.YYYY, hh:mm PM - username:
  • DD.MM.YYYY, hh:mm PM: username:
  • DD.MM.YYYY, hh:mm:ss - username:
  • DD.MM.YYYY, hh:mm:ss: username:
  • DD.MM.YYYY, hh:mm:ss PM - username:
  • DD.MM.YYYY, hh:mm:ss PM: username:
  • DD/MM/YYYY, hh:mm - username:
  • DD/MM/YYYY, hh:mm: username:
  • DD/MM/YYYY, hh:mm PM - username:
  • DD/MM/YYYY, hh:mm PM: username:
  • DD/MM/YYYY, hh:mm:ss - username:
  • DD/MM/YYYY, hh:mm:ss: username:
  • DD/MM/YYYY, hh:mm:ss PM - username:
  • DD/MM/YYYY, hh:mm:ss PM: username:
  • DD-MM-YYYY, hh:mm - username:
  • DD-MM-YYYY, hh:mm: username:
  • DD-MM-YYYY, hh:mm PM - username:
  • DD-MM-YYYY, hh:mm PM: username:
  • DD-MM-YYYY, hh:mm:ss - username:
  • DD-MM-YYYY, hh:mm:ss: username:
  • DD-MM-YYYY, hh:mm:ss PM - username:
  • DD-MM-YYYY, hh:mm:ss PM: username:
  • MM.DD.YYYY, hh:mm - username:
  • MM.DD.YYYY, hh:mm: username:
  • MM.DD.YYYY, hh:mm PM - username:
  • MM.DD.YYYY, hh:mm PM: username:
  • MM.DD.YYYY, hh:mm:ss - username:
  • MM.DD.YYYY, hh:mm:ss: username:
  • MM.DD.YYYY, hh:mm:ss PM - username:
  • MM.DD.YYYY, hh:mm:ss PM: username:
  • MM/DD/YYYY, hh:mm - username:
  • MM/DD/YYYY, hh:mm: username:
  • MM/DD/YYYY, hh:mm PM - username:
  • MM/DD/YYYY, hh:mm PM: username:
  • MM/DD/YYYY, hh:mm:ss - username:
  • MM/DD/YYYY, hh:mm:ss: username:
  • MM/DD/YYYY, hh:mm:ss PM - username:
  • MM/DD/YYYY, hh:mm:ss PM: username:
  • MM-DD-YYYY, hh:mm - username:
  • MM-DD-YYYY, hh:mm: username:
  • MM-DD-YYYY, hh:mm PM - username:
  • MM-DD-YYYY, hh:mm PM: username:
  • MM-DD-YYYY, hh:mm:ss - username:
  • MM-DD-YYYY, hh:mm:ss: username:
  • MM-DD-YYYY, hh:mm:ss PM - username:
  • MM-DD-YYYY, hh:mm:ss PM: username:
  • [MM-DD-YYYY, hh:mm:ss PM] username:

If you are having problems, please comment with your chat 'header'

PS: Possibly implement an algorithm that can detect the header pattern by itself in an unsupervised manner.
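
One small building block for such detection is telling DD-MM from MM-DD order: any value above 12 in the first or second position settles it. The sketch below is only that heuristic, not a full header detector; the regular expression and the function name are my own.

```python
import re

# Two leading numeric fields separated by ".", "/" or "-", optionally after "[".
DATE = re.compile(r"^\[?(\d{1,2})[./-](\d{1,2})[./-](\d{2,4})")


def guess_day_month_order(lines):
    """Return 'DD-MM', 'MM-DD', or None when the lines are ambiguous."""
    for line in lines:
        match = DATE.match(line)
        if match is None:
            continue  # not a header line
        first, second = int(match.group(1)), int(match.group(2))
        if first > 12:
            return "DD-MM"  # the first field cannot be a month
        if second > 12:
            return "MM-DD"  # the second field cannot be a month
    return None
```

A chat whose messages all fall on days 1 to 12 stays ambiguous, so a fallback (for instance asking the user) would still be needed.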

Project documentation

Prepare documentation of the project. This involves:

  1. Organise the formatting of the code documentation. There are also several options for the docstring formatting. You may find here some examples.
  2. Set the website (possible options are github pages or readthedocs)
