lucasrodes / whatstk

WhatsApp chats as dataframes. Python toolkit to analyse and parse WhatsApp chats.

License: GNU General Public License v3.0

Python 98.94% Shell 1.06%

whatsapp whatsapp-statistics whatsapp-group whatsapp-analysis whatsapp-parser pandas parser pandas-dataframe machinelearning

whatstk's Introduction

whatstk: analyze WhatsApp chats with python

Try the live demo parser to convert your chats to CSV

whatstk is a Python package, developed under the sociepy project, that provides tools to parse, analyze and visualise WhatsApp chats. Easily convert your chats to CSV or simply visualise some stats using the provided command-line tools or Python. The package uses pandas to process the data and plotly to visualise it.

It is distributed under the GPL-3.0 license.

⭐ Please star our project if you found it interesting to give us some dopamine 😄!

Content

Installation
Getting Started
Documentation
Contribute
Covered in
Citation

Installation

pip install whatstk

Install develop version (not stable):

pip install git+https://github.com/lucasrodes/whatstk.git@develop

More details here

Getting Started

For a rapid introduction, check this tutorial on Medium.

Export your chat using your phone:

See instructions.

Load chat as a DataFrame

from whatstk import df_from_txt_whatsapp
df = df_from_txt_whatsapp("path/to/chat.txt")

Convert chat to csv

$ whatstk-to-csv [input_filename] [output_filename]

More examples

See more in sections getting started and examples.

Documentation

See official documentation.

Contribute

See contribute section.

License

GPL-3.0

Citation

Lucas Rodés-Guirao. "whatstk, WhatsApp analysis and parsing toolkit", https://github.com/lucasrodes/whatstk

Covered in

whatstk's Issues

Re-adapt library so it might incorporate other sources in the future (e.g. facebook, instagram...).

Add basic text stats

Provide some basic tools to explore what users say. For instance, top words used per user etc.

Read chats from URLs

Package dependencies

I can't use plotly, numpy and pandas because of a dependency of this library. Could you try to fix it?

Thanks, regards

change index in chat dataframe: Use ID instead of timestamp (since timestamp might be repeated)

Generate random chat.

Generate random chat in multiple formats, so the library can be tested against multiple examples.

Use Lorem ipsum for instance
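
A minimal sketch of this idea, using only the standard library. The function name `generate_random_chat`, the user names and the default header layout are assumptions made for illustration; they are not part of whatstk.

```python
import random
from datetime import datetime, timedelta

LOREM = ("lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
         "eiusmod tempor incididunt ut labore et dolore magna aliqua").split()


def generate_random_chat(n_messages, users, date_format="%d/%m/%Y, %H:%M", seed=0):
    """Return a list of fake chat lines of the form '<date> - <user>: <text>'."""
    rng = random.Random(seed)  # seeded, so the same chat can be regenerated
    timestamp = datetime(2020, 1, 1)
    lines = []
    for _ in range(n_messages):
        # Move forward by a random number of minutes; timestamps never decrease.
        timestamp += timedelta(minutes=rng.randint(0, 120))
        user = rng.choice(users)
        message = " ".join(rng.choices(LOREM, k=rng.randint(1, 10)))
        lines.append(f"{timestamp.strftime(date_format)} - {user}: {message}")
    return lines


chat_lines = generate_random_chat(5, ["John", "Mary"])
```

Passing a different `date_format` (for example `"%m/%d/%y, %I:%M %p"`) would produce the same chat with another header style, which is what testing against multiple formats needs.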

Ease the installation process

Prepare the module for easy installation, for instance using pypi repository so it can be easily installed by simply using pip.

Prefer working directly with pd.DataFrame objects rather than WhatsAppChat. Redesign the whatstk APIs to work with DataFrames instead of the in-house class.

Header automatic extraction failed: '[%m/%d/%y, %H:%M:%S] %name:' IOS

Exported from WhatsApp IOS App (business and personal), American living internationally (not sure if that matters).

Manually matched pattern here: '[%m/%d/%y, %H:%M:%S] %name:'

[12/27/20, 18:51:14] My Group 2023: ‎Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them.
[7/18/21, 18:34:07] ~ Nev: That's awesome. Well said.
[7/18/21, 20:36:20] Sandrine - French: What is this?
[7/20/21, 11:21:27] ‪+1 (777) 777‑7777‬: ‎‪+1 (777) 777‑7777‬ joined using this group's invite link

Correct badge link for Travis

Badge for Travis CI points to https://travis-ci.org/github/lucasrodes/whatstk instead of https://travis-ci.org/github/lucasrodes/whatstk (note the .org)

virtual environment package missing

The current install_dependencies does not check if virtual environment is installed.

command to use: apt-get install python3-venv

Option to count the number of interventions jointly (all users combined).

Not parsing correctly when time is in am/pm

I've extracted chats from my phone which is set to use 12H time, and hence the whatsapp extractions do as well. E.g.
09/10/2023, 11:11 am - +44 7123 456789:
09/10/2023, 12:57 pm - +44 7123 456789:

df_from_txt_whatsapp was not importing the message correctly, often including the following line of the text as the message of the previous line.

I have traced this back to the following problem -

import whatstk as wa
from whatstk.whatsapp.auto_header import extract_header_from_text

text = wa.whatsapp.parser._str_from_txt(filepath)
hformat = extract_header_from_text(text)

This gives hformat as '%d/%m/%y, %H:%M am - %name'
i.e. the 'am' is hard coded, and it does not recognise the timestamps of messages sent in the afternoon as timestamps!

This can be solved by instead using
hformat = '%d/%m/%y, %I:%M %p - %name:'

But I have not yet worked out how to fully integrate this into the program
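
The effect can be reproduced outside whatstk with the standard library. This is only an illustration of why a literal "am" cannot match afternoon timestamps; it uses `datetime.strptime` directives (which need `%Y` for a four-digit year), not whatstk's own `hformat` syntax, and the helper `parses` is a name made up for this sketch.

```python
from datetime import datetime

HARD_CODED_AM = "%d/%m/%Y, %H:%M am"  # literal "am", like the detected header
FORMAT_12H = "%d/%m/%Y, %I:%M %p"     # %I = 12-hour clock, %p = am/pm marker


def parses(text, fmt):
    """Return True if `text` can be parsed with the strptime format `fmt`."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


morning = "09/10/2023, 11:11 am"
afternoon = "09/10/2023, 12:57 pm"

# The hard-coded format accepts the morning line but rejects the afternoon one.
print(parses(morning, HARD_CODED_AM), parses(afternoon, HARD_CODED_AM))

# With %I and %p both lines parse, and the hour is converted to 24-hour time.
evening = datetime.strptime("09/10/2023, 9:05 pm", FORMAT_12H)
print(evening.hour)
```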

pip3 missing

The program 'pip3' is currently not installed.

run command: sudo apt install python3-pip

Methods to rename users, merge users etc.

chat with 12h clock not working properly

Given a chat with the following format:

2015-08-22, 12:15 PM - X: bla bla
2015-08-22, 12:15 PM - Y: bla
2015-08-22, 12:16 PM - Y: blabla
2015-08-22, 12:17 PM - X: bla

Both with auto_header option

>>> from whatstk import WhatsAppChat
>>> chat = WhatsAppChat.from_txt('example.txt')

and manually setting hformat

>>> from whatstk import WhatsAppChat
>>> chat = WhatsAppChat.from_txt('example.txt', hformat='%y-%m-%d, %P:%M - %name:')

a KeyError exception is raised:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-5e2de28f6c26> in <module>
----> 1 chat = WhatsAppChat.from_txt(filename)

~/whatstk/whatstk/objects.py in from_txt(cls, filename, auto_header, hformat, encoding)
     44         hformat.replace('[', '\[').replace(']', '\]')
     45         # Prepare DataFrame
---> 46         df = cls._prepare_df(text, hformat)
     47 
     48         return cls(df)

~/whatstk/whatstk/objects.py in _prepare_df(text, hformat)
     79 
     80         # Parse chat to DataFrame
---> 81         df = parse_chat(text, r)
     82 
     83         # get rid of wp warning messages

~/whatstk/whatstk/utils/parser.py in parse_chat(text, regex)
     51         line_dict = _parse_line(text, headers, i)
     52         result.append(line_dict)
---> 53     df_chat = pd.DataFrame.from_records(result, index='date')
     54     return df_chat[['username', 'message']]
     55 

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1527             if (isinstance(index, compat.string_types) or
   1528                     not hasattr(index, "__iter__")):
-> 1529                 i = columns.get_loc(index)
   1530                 exclude.add(index)
   1531                 if len(arrays) > 0:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date'

Reduce library dependencies so that whatstk becomes lighter

First idea:

Libraries such as scipy, lorem-python, ... are only used when the user wants to generate chats. They could be removed from the core dependencies and offered as an extra, something like pip install whatstk[generation]. Calling the generation methods without them should then raise an error explaining that scipy needs to be installed.
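
One possible shape for the error message is a small lazy-import helper. This is a sketch under assumptions: `require_optional` is a hypothetical name, and the `generation` extra is the one proposed in this issue, not something the package is known to define.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or fail with an installation hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install whatstk[{extra}]"
        ) from err


# Inside a generation function the import would happen lazily, for example:
#   scipy = require_optional("scipy", "generation")
```

Because the import only runs when a generation function is called, users who never generate chats would not need scipy installed.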

Parser

I receive this AttributeError when using the sample chats, both dash and dot:

~/coding/whatstk/whatstk/parser.py in _get_date_component(header, pattern, offset)
    155     py = re.compile(pattern)
    156     match_0 = py.match(header[offset:])
--> 157     component = int(match_0.group()[:-1])
    158     component_end = match_0.end() + offset
    159     return component, component_end

AttributeError: 'NoneType' object has no attribute 'group'

Also, I'm not able to open the whatstkenv kernel in the Jupyter notebook. Are those two related?
I have tried to modify the _get_date_component function but couldn't get rid of the error.
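
The traceback shows that `py.match(...)` returned `None`, i.e. the pattern did not match the header at that offset. A guarded variant of the function could at least report which pattern and header were involved. This is a sketch for diagnosis, not the library's fix; the name `get_date_component_checked` is hypothetical.

```python
import re


def get_date_component_checked(header, pattern, offset):
    """Parse one numeric date component (plus its trailing separator)."""
    match = re.compile(pattern).match(header[offset:])
    if match is None:
        # Without this check, match.group() raises the AttributeError above.
        raise ValueError(
            f"Pattern {pattern!r} does not match header {header!r} "
            f"at offset {offset}"
        )
    component = int(match.group()[:-1])  # drop the trailing separator
    return component, match.end() + offset
```

Calling it with a header whose separator differs from the one in the pattern (for example dots instead of slashes) raises a `ValueError` naming both, which makes a format mismatch easier to spot.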

Migrate CI/CD: Travis to GitHub Actions

Travis CI/CD is not working properly. We should migrate our CI/CD to GitHub actions.

Things to consider:

Build and test on various OS and python versions should be triggered for each PR.
Build and deploy of the package to PyPi should be done for each release (i.e. tag in main branch).
Documentation shipped to read the docs.

IndexError: list index out of range

I can't seem to parse my whatsapp txt file
Format is [MM/DD/YYYY hh:mm:ss PM] - username: message

Do you have solution for this?
Your help is much appreciated

Chat df schema (strings)

Create correct schema, with adequate dtypes.

ENH: Add support for Google Drive

Reading and loading chat files from google drive folders.

May want to integrate a third-party library like PyDrive for this.

Create list mapping usernames to colours, so same color is used for a user in all plots
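
A minimal sketch of such a mapping, assuming colours are given as hex strings. `PALETTE` and `build_color_map` are illustrative names, not existing whatstk API.

```python
from itertools import cycle

PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def build_color_map(usernames, palette=PALETTE):
    """Map each username to a colour.

    Usernames are de-duplicated and sorted, so the same set of users always
    gets the same colours; the palette repeats if there are more users
    than colours.
    """
    return dict(zip(sorted(set(usernames)), cycle(palette)))


color_map = build_color_map(["Mary", "John", "Mary", "Giuseppe"])
```

Every plotting function could then look up `color_map[username]` instead of relying on the order in which users happen to appear in a given plot.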

Error when using `pip install`

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for numpy
Successfully built whatstk
Failed to build numpy
ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects

Work on whatstk logo, center-align text in README.md

We want to have the title and badges centered, like it has been done in https://github.com/semantic-release/semantic-release/blob/master/README.md

Provide sample chats

Provide some sample chats so that users can test the project on a fake chat.

In addition, this chat should be used as the standard file when comparing results between users or simply for debugging purposes.

Rethink README.md

Currently the README.md file is a bit crowded. We might want to simplify and better structure it.

1. Titles

Add title, subtitle.
Center titles and badges (example https://github.com/semantic-release/semantic-release/blob/master/README.md)
Add logo

2. Getting Started section

Reduce verbosity and simplify examples. Move more specific examples to another file (e.g. docs/examples.md)
Add chat exporting mini-tutorial details. Example in http://github.com/lucasrodes/whatstk-gui/.

3. Other sections

Add "Documentation" link in TOC
Remove "Known Issues"
List contributors

Python 3.9 compatibility

Python 3.9 is on its way. whatstk should be tested so as to make it compatible with it.

Argument 'cummulative' is misspelled, should be 'cumulative'.

Mark argument 'cummulative' as deprecated; 'cumulative' should be used instead.

review usage in library
review documentation and generated HTML plots
review command line tools
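
A common way to deprecate a misspelled keyword is to keep accepting it for a while and emit a `DeprecationWarning`. The sketch below assumes a hypothetical function `count_interventions`; it is not the signature of any actual whatstk function.

```python
import warnings
from itertools import accumulate


def count_interventions(messages, cumulative=False, cummulative=None):
    """Return one count per message, optionally as a running total.

    `cummulative` is the old, misspelled argument, kept only for
    backwards compatibility.
    """
    if cummulative is not None:
        warnings.warn(
            "Argument 'cummulative' is deprecated, use 'cumulative' instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        cumulative = cummulative
    counts = [1] * len(messages)
    return list(accumulate(counts)) if cumulative else counts
```

Old calls such as `count_interventions(msgs, cummulative=True)` keep working but warn, while new code uses `cumulative=True`.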

Improve readme file

Add sections

Contribution (explain how users can contribute to the project)
Help and Support (explain where users can get support)

Please comment if you find there should be other sections :)

Chat Parser Not Working

I am trying to use the WhatsAppChat class of whatstk, but it no longer works. It worked before June, but apparently WhatsApp was updated and parsing now fails.

I use it like this: "chat = WhatsAppChat.from_source(filepath="/content/_chat.txt")"

I get this error: Header automatic extraction failed. Please specify the format manually by setting input argument hformat. Report this issue so that automatic header detection support for your header format is added: https://github.com/lucasrodes/whatstk/issues.

Merge two chats

Sometimes we may have several chats from the same group, exported at different moments. Some of them might have messages that others do not have (maybe because of phone change). Therefore, it could be interesting to be able to merge different chats.

Example usage could be:

Option 1

from whatstk import WhatsAppChat
chat_1 = WhatsAppChat.from_txt('chat-part-1.txt')
chat_2 = WhatsAppChat.from_txt('chat-part-2.txt')

chat = WhatsAppChat.merge([chat_1, chat_2])

Option 2

from whatstk import WhatsAppChat
chat = WhatsAppChat.from_multiple_txt(['chat-part-1.txt', 'chat-part-2.txt'])

Option 3

from whatstk import WhatsAppChat
chat_1 = WhatsAppChat.from_txt('chat-part-1.txt')
chat_2 = WhatsAppChat.from_txt('chat-part-2.txt')

chat = chat_1.merge(chat_2)
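
Whichever interface is chosen, the core of the merge is a union of messages followed by a sort by date. A standard-library sketch, assuming each chat is a list of `(date, username, message)` tuples with ISO-formatted date strings; `merge_chats` is a hypothetical helper.

```python
def merge_chats(chats):
    """Merge several chats, dropping duplicated messages and sorting by date.

    Each chat is a list of (date, username, message) tuples. ISO date
    strings sort chronologically, so a plain sort is enough.
    """
    unique = {record for chat in chats for record in chat}
    return sorted(unique)


chat_1 = [("2020-01-01 10:00", "John", "hi"), ("2020-01-01 10:05", "Mary", "hello")]
chat_2 = [("2020-01-01 10:05", "Mary", "hello"), ("2020-01-01 10:07", "John", "bye")]
merged = merge_chats([chat_1, chat_2])
```

One caveat of this approach: a user who really sent the same text twice within the same timestamp would be collapsed into a single message, so a real implementation may need a finer duplicate rule.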

response_matrix

Why are some functions like response_matrix gone?

Emojis and sentiment analysis

For a release far in the future, I think it could be a good idea to give emojis special treatment. It could enrich the sentiment analysis, no? =)

bug in cleaning data

For some chat formats, the script fails to clean the data.

Traceback (most recent call last):
  File "main.py", line 20, in <module>
    data = cf.clean_data(lines)
  File "whatsapp-stats/chatformat.py", line 65, in clean_data
    data[-1][2] = data[-1][2] + "\n" + line
IndexError: list index out of range

Fix links

Some links in README are not properly set.

Generate documentation of project using sphinx.

Basic emoji support

Implement some basic functions to deal with emoji codes and, for instance, be able to list the top used emojis within the group (or at user level).

May want to check emoji library: https://pypi.org/project/emoji/

Automate chat text files to CSV conversion

Hi,
I'm looking to import txt whatsapp chats from a folder into Python without specifying what the txt file names will be.
In essence, I want to be able to run it in a way that is irrespective of the files that are in the folder, it just picks up whatever txt files are in that folder (for example they might be changing in number and changing daily). I was wondering if there is a way to do this and still leverage the whatstk library to clean the data from all files and then export it as one to csv?
Thanks,
Matt
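
One way to approach this is to discover the files with `pathlib` and load each one in a loop. The helper `find_chat_files`, the folder name `chats` and the output name `all_chats.csv` are assumptions for illustration; the commented usage relies on `df_from_txt_whatsapp` from the README and on every file having a header format that whatstk can detect.

```python
from pathlib import Path


def find_chat_files(folder, pattern="*.txt"):
    """Return every file in `folder` matching `pattern`, in a stable order."""
    return sorted(Path(folder).glob(pattern))


# Possible usage with whatstk and pandas (both need to be installed):
#
#   import pandas as pd
#   from whatstk import df_from_txt_whatsapp
#
#   frames = [df_from_txt_whatsapp(str(path)) for path in find_chat_files("chats")]
#   pd.concat(frames).to_csv("all_chats.csv")
```

Since the file list is rebuilt on every run, the script picks up whatever txt files are in the folder that day.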

Add script to parse chat from txt to csv

Command line script to parse chat to csv

Insights on message length per user

This could take the form of distributions or box plots. Different box/distribution plots could be generated per date element.
That is, the distribution for user X in 2018 may differ from the one in 2020; maybe they have become quieter.
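
Before plotting, the underlying numbers can be computed with the standard library. A sketch assuming messages are available as `(username, message)` pairs; `message_length_stats` is a hypothetical name and length is measured in characters.

```python
from collections import defaultdict
from statistics import mean, median


def message_length_stats(messages):
    """Summarise message length (in characters) per user.

    `messages` is an iterable of (username, message) pairs.
    """
    lengths = defaultdict(list)
    for username, message in messages:
        lengths[username].append(len(message))
    return {
        user: {"count": len(vals), "mean": mean(vals), "median": median(vals)}
        for user, vals in lengths.items()
    }


stats = message_length_stats([("John", "hi"), ("John", "hello"), ("Mary", "ok")])
```

Grouping by `(username, year)` instead of `username` would give the per-date breakdown described above.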

Create method to load chat directly as DataFrame, e.g. df_from_txt(filename, encoding)

Installation error with Python 3.8 and Visual Studio

Hi, does anybody know what is not working here? Cheers

PS C:\Users\Notebook\Desktop\test_python3> pip install whatstk

...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2117: character maps to <undefined>
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Parse messages which contain \n characters

Description

Currently, a chat message that contains line breaks

gets processed into this

2023-03-07 15:38:29,Person1,hey hey
2023-03-07 15:38:36,Person1,good productive day or bad one?
2023-03-07 16:11:41,Person2,"Hey dear
Returned home after being out
Productive 
Certainly 
How about yours?"

which isn't ideal because it breaks the format.

They should be converted to

2023-03-07 15:38:29,Person1,hey hey
2023-03-07 15:38:36,Person1,good productive day or bad one?
2023-03-07 16:11:41,Person2,"Hey dear\nReturned home after being out\nProductive\nCertainly \nHow about yours?"

I have written this function to handle this case, might be a good starting point.

import re


def is_complete_message(s):
    """
    Given a line of an exported WhatsApp chat, return whether it starts a new message

    :param s: string
    :return: bool

    >>> is_complete_message("[2021-05-05, 11:44:26] A: message content")
    True

    >>> is_complete_message("Message content")
    False
    """
    # A complete message starts with a "[YYYY-MM-DD, HH:MM:SS]" header.
    pattern = r"^\[\d{4}-\d{2}-\d{2},\W\d{2}:\d{2}:\d{2}]"
    return bool(re.search(pattern, s))

and this

# `blob` and `whatsapp` come from the reporter's own code and are not defined here.
all_messages = []
with blob.open(mode="r") as file:
    for line in file:
        if whatsapp.is_complete_message(line):
            # A header line starts a new message.
            all_messages.append(whatsapp.parse_message(line))
        elif all_messages:
            # Continuation line: append it to the previous message.
            all_messages[-1]["content"] += "\n" + line

They might be a good starting point :)
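
For the CSV side of the request, the line breaks can be replaced with a literal `\n` before the row is written, so each message stays on one line. A sketch with the standard `csv` module, assuming rows are `(date, username, message)` tuples; `rows_to_csv` is a hypothetical helper.

```python
import csv
import io


def rows_to_csv(rows):
    """Write (date, username, message) rows as CSV, one message per line.

    Real line breaks inside a message are replaced by the two characters
    backslash + n, so a multi-line message no longer spans several lines.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for date, username, message in rows:
        writer.writerow([date, username, message.replace("\n", "\\n")])
    return buffer.getvalue()


csv_text = rows_to_csv([
    ("2023-03-07 15:38:29", "Person1", "hey hey"),
    ("2023-03-07 16:11:41", "Person2", "Hey dear\nReturned home after being out"),
])
```

The trade-off is that a reader of the CSV has to undo the replacement to recover the original line breaks, and a message that already contained a literal backslash followed by `n` becomes ambiguous.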

Consider system messages

I think we shouldn't discard system messages. They carry useful information, such as who added whom, which could be used in a network analysis. They also record when users join and leave, which could answer the question of how long users stay in the group, or be used to calculate a turnover rate.

WhatsApp is used as a work tool, and genuine communities of practice exist in the WhatsApp ecosystem. Understanding human behavior in those groups could help us understand the behavior related to the 'domain' of these communities.

The literature says a community of practice consists of a joint enterprise, mutual engagement, and a shared repertoire (WENGER, 1998).

WENGER, E. Communities of practice: Learning, meaning, and identity. Cambridge University Press, 1998. (Learning in Doing: Social, Cognitive and Computational Perspectives). ISBN 9780521430173. DOI: 10.1017/CBO9780511803932.

Add option to filter chat by date

Concept idea

When loading chat

from whatstk import WhatsAppChat
chat = WhatsAppChat.from_txt('file.txt', date_min='2020-01-01', date_max='2020-01-21')

When analysing chat

from whatstk import WhatsAppChat
chat = WhatsAppChat.from_txt('file.txt')
chat = chat.filter_dates(date_min='2020-01-01', date_max='2020-01-21')
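
Independently of which interface is chosen, the filtering step itself is small. A standard-library sketch, assuming messages are `(datetime, username, message)` tuples; this `filter_dates` is a stand-alone illustration, not the proposed method.

```python
from datetime import datetime


def filter_dates(messages, date_min=None, date_max=None):
    """Keep the messages whose timestamp lies in [date_min, date_max].

    `messages` is a list of (datetime, username, message) tuples. The bounds
    are ISO strings such as '2020-01-01'; either one can be omitted.
    """
    lower = datetime.fromisoformat(date_min) if date_min else datetime.min
    upper = datetime.fromisoformat(date_max) if date_max else datetime.max
    return [m for m in messages if lower <= m[0] <= upper]


messages = [
    (datetime(2019, 12, 31, 23, 59), "John", "almost"),
    (datetime(2020, 1, 10, 9, 30), "Mary", "inside"),
    (datetime(2020, 1, 21, 8, 0), "John", "late"),
]
january = filter_dates(messages, date_min="2020-01-01", date_max="2020-01-21")
```

Note that a bound given as a bare date means midnight, so in this sketch a message sent at 08:00 on `date_max` is excluded; an implementation would need to decide whether the upper bound is inclusive of the whole day.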

PS: Possibly implement an algorithm that can detect the header pattern by itself in an unsupervised manner.

Project documentation

Prepare documentation of the project. This involves:

Organise the formatting of the code documentation. There are also several options for the docstring formatting. You may find here some examples.
Set the website (possible options are github pages or readthedocs)