liwc-python's Introduction

liwc


This repository provides a Python package implementing two basic functions:

  1. Loading (parsing) a Linguistic Inquiry and Word Count (LIWC) dictionary from the .dic file format.
  2. Using that dictionary to count category matches on provided texts.

This is not an official LIWC product nor is it in any way affiliated with the LIWC development team or Receptiviti.

Obtaining LIWC

The LIWC lexicon is proprietary, so it is not included in this repository.

The lexicon data can be acquired (purchased) from liwc.net.

  • If you are a researcher at an academic institution, please contact Dr. James W. Pennebaker directly.
  • For commercial use, contact Receptiviti, the company that holds the exclusive commercial license.

Finally, please do not open an issue in this repository with the intent of subverting encryption implemented by the LIWC developers. If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable *.dic file, please contact the distributor directly.

Setup

Install from PyPI:

pip install liwc

Example

This example reads the LIWC dictionary from a file named LIWC2007_English100131.dic, which looks like this:

%
1   funct
2   pronoun
[...]
%
a   1   10
abdomen*    146 147
about   1   16  17
[...]
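The two-section layout above can be sketched with a minimal, self-contained parser. This is an illustration of the file format only, not the package's actual implementation:

```python
# Illustration only: '%' lines delimit the category-definition section
# from the lexicon section of a .dic file.
sample = (
    "%\n"
    "1\tfunct\n"
    "2\tpronoun\n"
    "%\n"
    "a\t1\t10\n"
    "about\t1\t16\t17\n"
)

categories = {}  # category id -> category name
lexicon = {}     # pattern (possibly ending in '*') -> list of category ids
section = 0
for line in sample.splitlines():
    line = line.strip()
    if not line:
        continue
    if line == "%":
        section += 1          # each '%' advances to the next section
    elif section == 1:        # category definitions
        cid, name = line.split("\t")
        categories[cid] = name
    else:                     # lexicon entries
        parts = line.split("\t")
        lexicon[parts[0]] = parts[1:]

print(categories)        # {'1': 'funct', '2': 'pronoun'}
print(lexicon["about"])  # ['1', '16', '17']
```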

Loading the lexicon

import liwc
parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
  • parse is a function from a token of text (a string) to a list of matching LIWC categories (a list of strings)
  • category_names is the list of all LIWC category names in the lexicon (a list of strings)

Analyzing text

import re

def tokenize(text):
    # you may want to use a smarter tokenizer
    for match in re.finditer(r'\w+', text, re.UNICODE):
        yield match.group(0)

gettysburg = '''Four score and seven years ago our fathers brought forth on
  this continent a new nation, conceived in liberty, and dedicated to the
  proposition that all men are created equal. Now we are engaged in a great
  civil war, testing whether that nation, or any nation so conceived and so
  dedicated, can long endure. We are met on a great battlefield of that war.
  We have come to dedicate a portion of that field, as a final resting place
  for those who here gave their lives that that nation might live. It is
  altogether fitting and proper that we should do this.'''.lower()
gettysburg_tokens = tokenize(gettysburg)

Now, count all the categories in all of the tokens, and print the results:

from collections import Counter
gettysburg_counts = Counter(category for token in gettysburg_tokens for category in parse(token))
print(gettysburg_counts)
#=> Counter({'funct': 58, 'pronoun': 18, 'cogmech': 17, ...})

N.B.:

  • The LIWC lexicon only matches lowercase strings, so you will most likely want to lowercase your input text before passing it to parse(...). In the example above, I call .lower() on the entire string, but you could alternatively incorporate that into your tokenization process (e.g., by using spaCy's token.lower_).
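LIWC results are conventionally reported as percentages of all tokens rather than raw counts. A small sketch of that conversion (`relative_frequencies` is a hypothetical helper, not part of this package; note that tokenize(...) returns a generator, so materialize the tokens into a list first if you also need the total count):

```python
from collections import Counter

def relative_frequencies(counts, total_tokens):
    # convert raw category counts into percentages of all tokens
    return {category: 100.0 * n / total_tokens
            for category, n in counts.items()}

# stand-in counts; in practice use Counter(...) over list(tokenize(text))
counts = Counter({'funct': 58, 'pronoun': 18})
freqs = relative_frequencies(counts, 116)
print(freqs['funct'])  # 50.0
```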

License

Copyright (c) 2012-2020 Christopher Brown. MIT Licensed.

liwc-python's People

Contributors

chbrown


liwc-python's Issues

KeyError: 8 German 2001 .dic

Hi,
has anybody used the German 2001 .dic with this library? I am getting "KeyError: 8" when I try to load the lexicon.

I've managed to solve this issue myself by editing the liwc/liwc/dic.py file and changing the read_dic(filepath) function to read my file using utf-8 encoding.

Thank you for sharing your solution. I would appreciate it if you provided the details.
I just added encoding='utf-8' to liwc/liwc/dic.py, but apparently that is not right, because I keep getting "KeyError: '8'"
with open(filepath, encoding='utf-8') as lines:

By the way, I am trying to use this library not with a Spanish version but with a German one (from 2001). That is why I think your suggested solution might make this library work with the German 2001 .dic.

Originally posted by @vicru in #15 (comment)
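One workaround that avoids patching the installed package: re-save the dictionary in whatever encoding a bare open() defaults to on your platform, since that is what the library's file read uses. A hedged sketch (reencode_dic is a hypothetical helper, not part of the package):

```python
import locale

def reencode_dic(src_path, dst_path, src_encoding="utf-8"):
    # read the UTF-8 .dic file with an explicit encoding ...
    with open(src_path, encoding=src_encoding) as f:
        text = f.read()
    # ... and rewrite it in the platform default that a bare open() uses
    with open(dst_path, "w", encoding=locale.getpreferredencoding(False)) as f:
        f.write(text)
```

After re-encoding, point liwc.load_token_parser at dst_path instead of the original file.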

.dic file can't be found on ubuntu server when running liwc example

Hi all,

I used the sample code:

def tokenize(text):
    # you may want to use a smarter tokenizer
    for match in re.finditer(r'\w+', text, re.UNICODE):
        yield match.group(0)

parse, category_names = liwc.load_token_parser('LIWC2015_English_flat_decoded.dic')
tokens = tokenize(text.lower())
counts = Counter(category for token in tokens for category in parse(token))

to get the counts. I have a valid .dic file saved in the same directory as the script. It ran fine on my Windows machine. However, when I moved the file and script to Ubuntu and installed the package using pip3 install -U liwc, I get the error below:

File "predict.py", line 71, in liwc_convert
    parse, category_names = liwc.load_token_parser('LIWC2015_English_flat_decoded.dic')
File "/home/ubuntu/.local/lib/python3.6/site-packages/liwc/__init__.py", line 21, in load_token_parser
    lexicon, category_names = read_dic(filepath)
File "/home/ubuntu/.local/lib/python3.6/site-packages/liwc/dic.py", line 36, in read_dic
    with open(filepath) as lines:
FileNotFoundError: [Errno 2] No such file or directory: 'LIWC2015_English_flat_decoded.dic'

But the truth is that I put the file right in the same directory as the script (here, predict.py). I don't understand why this happens. Can anybody help? I'm hitting a deadline for a competition, so timely help is highly appreciated. Thank you!

---UPDATE----
This is not an issue any more! It turns out I have two versions of the file: the one I copied to the server is LIWC2015_English_Flat_decoded.dic, but in my script I referenced it as LIWC2015_English_flat_decoded.dic. I'm closing this issue. Sorry for the confusion!

Dealing with 'utf-8' encoding

Hello there!
I'm dealing with a .dic file containing special characters such as "ñ" or "í"; the problem arises whenever I read that dictionary.

As an example: I have the word "abadía" in my corpus, and whenever I read it with liwc it appears as "abadÃ\xada".
How could I deal with this?
Thank you very much in advance.
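The string "abadÃ\xada" is exactly what the UTF-8 bytes of "abadía" look like when decoded as Latin-1, so the underlying fix is to read the .dic file with encoding='utf-8' in the first place. If you already have such mojibake strings in memory, one hedged repair (assuming Latin-1 was indeed the wrong decoder applied):

```python
def fix_mojibake(s):
    # undo an accidental Latin-1 decode of UTF-8 bytes
    return s.encode("latin-1").decode("utf-8")

print(fix_mojibake("abad\xc3\xada"))  # abadía
```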

KeyError

It raises a KeyError when reading a line like (02 134)125/464 (02 134)126 253 466:

KeyError: '(02 134)125/464'

KeyError for LIWC2007_English150202_STRESS.dic

I have a LIWC2007_English150202_STRESS.dic, which looks like the following:

%
1	stress
%
abandon*	1
abuse*		1
ache*		1
aching		1
afraid		1
...
worst		1
worthless* 	1
wrong*		1

When I do

parse, category_names = liwc.load_token_parser('./LIWC2007_English150202_STRESS.dic')

I get

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
c:\Users\liqia\Desktop\random-code.py in 
      8         yield match.group(0)
      9 
---> 10 parse, category_names = liwc.load_token_parser('./LIWC2007_English150202_STRESS.dic')

~\Anaconda3\lib\site-packages\liwc\__init__.py in load_token_parser(filepath)
     19       the lexicon
     20     """
---> 21     lexicon, category_names = read_dic(filepath)
     22     trie = build_trie(lexicon)
     23 

~\Anaconda3\lib\site-packages\liwc\dic.py in read_dic(filepath)
     42         category_mapping = dict(_parse_categories(lines))
     43         # read lexicon (a mapping from matching string to a list of category names)
---> 44         lexicon = dict(_parse_lexicon(lines, category_mapping))
     45     return lexicon, list(category_mapping.values())

~\Anaconda3\lib\site-packages\liwc\dic.py in _parse_lexicon(lines, category_mapping)
     24         line = line.strip()
     25         parts = line.split("\t")
---> 26         yield parts[0], [category_mapping[category_id] for category_id in parts[1:]]
     27 
     28 

~\Anaconda3\lib\site-packages\liwc\dic.py in (.0)
     24         line = line.strip()
     25         parts = line.split("\t")
---> 26         yield parts[0], [category_mapping[category_id] for category_id in parts[1:]]
     27 
     28 

KeyError: ''
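The KeyError: '' suggests that lines in this dictionary separate columns with runs of tabs and trailing spaces (e.g. worthless* followed by a space and a tab), so line.split("\t") produces empty fields that are then looked up as category ids. A hedged workaround is to split on arbitrary whitespace instead; parse_lexicon_line below is a hypothetical stand-in for the package's internal line parser:

```python
def parse_lexicon_line(line, category_mapping):
    # str.split() with no argument splits on any whitespace run and
    # discards empty fields, unlike line.split("\t")
    parts = line.split()
    return parts[0], [category_mapping[cid] for cid in parts[1:]]

mapping = {"1": "stress"}
print(parse_lexicon_line("worthless* \t1", mapping))
# ('worthless*', ['stress'])
```

Alternatively, normalizing the .dic file itself (collapsing each run of tabs/spaces between columns to a single tab) avoids touching the installed package.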

how to calculate the summary variables

Hi. Thanks for your amazing work.
I want to ask if there is a way to get the summary variable results. They are:

(1) analytical thinking; (2) clout; (3) authenticity; and (4) emotional tone.

AttributeError: 'module' object has no attribute 'load_token_parser'

Error as above,

Do you need to make any changes to the liwc.py file?

Traceback:

runfile('C:/Users/Fionn Delahunty/Documents/Insight/LIWC2/test.py', wdir='C:/Users/Fionn Delahunty/Documents/Insight/LIWC2')
Reloaded modules: liwc
C:\Users\Fionn Delahunty\Documents\Insight\LIWC2
Traceback (most recent call last):

  File "<ipython-input-25-b6aa52208070>", line 1, in <module>
    runfile('C:/Users/Fionn Delahunty/Documents/Insight/LIWC2/test.py', wdir='C:/Users/Fionn Delahunty/Documents/Insight/LIWC2')

  File "C:\Users\Fionn Delahunty\Anaconda3\envs\python2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\Users\Fionn Delahunty\Anaconda3\envs\python2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Users/Fionn Delahunty/Documents/Insight/LIWC2/test.py", line 29, in <module>
    parse, category_names = liwc.load_token_parser('LIWC2007_English080730_Edit.dic')

convert the text to lower case

hi,
thank you for your awesome work.

This is not a bug, just a usage precaution.
When using this package, you need to convert the text to lower case manually.

for example:
the result of parse('Our') is empty, but parse('our') works.

thx
Mengying

How to add some changes?

Thanks for sharing. It works as it should, but I want to add some new words to the dictionary containing spaces and parentheses, like "Barak Obama" or "(will) like". I tried some changes, but it doesn't respond well.

AttributeError: module 'liwc' has no attribute 'load_token_parser'

Hi Brown,
Thanks so much for your contribution. I got the 2015 .dic from the developer. When I try to use liwc, it shows the attribute error. I am a newbie in Python and just don't know how to solve this, because I can't find similar solutions to this error online. The code I used is as follows:

import liwc
liwcPath = ('D:/Dropbox/01_CEO activism/Data/01_raw/LIWC2015Dictionary.dic')
parse, category_names = liwc.load_token_parser('liwcPath')

and the error is like:
C:\Users\miaoyun2\AppData\Local\Programs\Python\Python310\python.exe "D:/CEO activism_Data/02_dofiles/textanalysis.py"
Traceback (most recent call last):
File "D:\CEO activism_Data\02_dofiles\textanalysis.py", line 11, in
parse, category_names = liwc.load_token_parser('liwcPath')
AttributeError: module 'liwc' has no attribute 'load_token_parser'

Thank you so much for your time!
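Two things are worth checking here. First, load_token_parser('liwcPath') passes the literal string 'liwcPath' rather than the variable; it should be load_token_parser(liwcPath). Second, this particular AttributeError often means a local file named liwc.py is shadowing the installed package. A hedged diagnostic sketch:

```python
import importlib.util

spec = importlib.util.find_spec("liwc")
if spec is None:
    print("the liwc package is not installed in this environment")
elif "site-packages" not in (spec.origin or ""):
    # a module named liwc was found, but not the one installed by pip
    print("a local module is shadowing the installed package:", spec.origin)
else:
    print("liwc resolves to:", spec.origin)
```

If a local liwc.py is the culprit, rename it (and delete any stale liwc.pyc / __pycache__ entries).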

Where do you locate the LIWC English dictionary once you purchase the LIWC license?

I see that you need to purchase an LIWC license to get access to the dictionary. I have purchased the software, but I can't figure out how to access the English dictionary file that your Python package needs in order to work. Any pointers would be really helpful on how to connect your Python package to the LIWC dictionary, e.g. where do I access/download the LIWC English file?

Key Error (revisited!)

Hello,
Thanks for providing these scripts. I know you've closed this issue before, but I am also getting a KeyError. I understand why it is erroring: it doesn't like lines with unusual structure and symbols, I guess. But does it normally deal with them fine? Here is the traceback:

at9362$ python3 example.py 
Traceback (most recent call last):
  File "example.py", line 10, in <module>
    parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
  File "python/liwc/__init__.py", line 76, in load_token_parser
    lexicon, category_names = read_dic(filepath)
  File "python/liwc/__init__.py", line 27, in read_dic
    lexicon[parts[0]] = [category_mapping[category_id] for category_id in parts[1:]]
  File "python/liwc/__init__.py", line 27, in <listcomp>
    lexicon[parts[0]] = [category_mapping[category_id] for category_id in parts[1:]]
KeyError: '<of>131/125'

Grateful for any help :)

use Traditional_Chinese_LIWC2015 UnicodeDecodeError

import liwc
parse, category_names = liwc.load_token_parser('/Downloads/Traditional_Chinese_LIWC2015_Dictionary.dic')

Traceback (most recent call last):
File "C:/Users/Desktop/test.py", line 4, in
parse, category_names = liwc.load_token_parser('/Downloads/Traditional_Chinese_LIWC2015_Dictionary.dic')
File "C:\Users\AppData\Local\Programs\Python\Python38\lib\site-packages\liwc\__init__.py", line 21, in load_token_parser
lexicon, category_names = read_dic(filepath)
File "C:\Users\AppData\Local\Programs\Python\Python38\lib\site-packages\liwc\dic.py", line 38, in read_dic
for line in lines:

UnicodeDecodeError: 'cp950' codec can't decode byte 0xbf in position 2: illegal multibyte sequence

What can I do to solve this problem?

No update code in PyPI

Hi

Thank you for your continued commitment to this project! But the package on PyPI does not seem to actually include the update.

I downloaded the source code from this URL: https://files.pythonhosted.org/packages/4a/b3/d46aec19508d29e8f2c71c8d87d3878a2249abf01cb6d727e442e67b2d74/liwc-0.4.2.tar.gz

And found that the read_dic function does not match the GitHub repo code.

The code in the downloaded file is the following:

def read_dic(filepath):
    """
    Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
    (lexicon, category_names), where:
    * `lexicon` is a dict mapping string patterns to lists of category names
    * `categories` is a list of category names (as strings)

    """
    # category_mapping is a mapping from integer string to category name
    category_mapping = {}
    # category_names is equivalent to category_mapping.values() but retains original ordering
    category_names = []
    lexicon = {}
    # the mode is incremented by each '%' line in the file
    mode = 0
    for line in open(filepath):
        tsv = line.strip()
        if tsv:
            parts = tsv.split()
            if parts[0] == "%":
                mode += 1
            elif mode == 1:
                # definining categories
                category_names.append(parts[1])
                category_mapping[parts[0]] = parts[1]
            elif mode == 2:
                lexicon[parts[0]] = [
                    category_mapping[category_id] for category_id in parts[1:]
                ]
    return lexicon, category_names

thank you! Best Regards!

mengying

How to show the specific category only?

c_counts = Counter(category for token in Corpus['text'][1] for category in parse((token)))

I got :

Counter({'social (Social)': 74,
         'verb (Verbs)': 97,
         'drives (Drives)': 49,
         'reward (Reward)': 9,
         'focuspresent (Present Focus)': 66,
         'function (Function Words)': 334,
         'conj (Conjunctions)': 43,
         'adj (Adjectives)': 41,
         'affect (Affect)': 47,
         'posemo (Positive Emotions)': 42,
         'work (Work)': 44,
         'article (Articles)': 70,
         'pronoun (Pronouns)': 45,
         'ipron (Impersonal Pronouns)': 28,
         'relativ (Relativity)': 81,
         'motion (Motion)': 4,
         'time (Time)': 26,
         'prep (Prepositions)': 123,
         'percept (Perceptual Processes)': 28,
         'hear (Hear)': 22,
         'auxverb (Auxiliary Verbs)': 43,
         'compare (Comparisons)': 31,
         'quant (Quantifiers)': 69,
         'cogproc (Cognitive Processes)': 63,
         'tentat (Tentative)': 21,
         'adverb (Adverbs)': 23,
         'space (Space)': 54,
         'power (Power)': 30,
         'interrog (Interrogatives)': 14,
         'certain (Certainty)': 5,
         'ppron (Personal Pronouns)': 17,
         'they (They)': 13,
         'focuspast (Past Focus)': 24,
         'number (Numbers)': 6,
         'money (Money)': 16,
         'achieve (Achievement)': 17,
         'differ (Differentiation)': 21,
         'see (See)': 4,
         'insight (Insight)': 11,
         'discrep (Discrepancies)': 9,
         'focusfuture (Future Focus)': 3,
         'bio (Biological Processes)': 4,
         'health (Health)': 4,
         'i (I)': 1,
         'affiliation (Affiliation)': 3,
         'leisure (Leisure)': 2,
         'cause (Causal)': 3,
         'home (Home)': 3,
         'negemo (Negative Emotions)': 5,
         'sad (Sad)': 2,
         'risk (Risk)': 3,
         'shehe (SheHe)': 3,
         'male (Male)': 2,
         'negate (Negations)': 6,
         'female (Female)': 1,
         'anx (Anx)': 1})

What if I just want to show, say,

'focuspast (Past Focus)': 24, 
'focusfuture (Future Focus)': 3,
'cogproc (Cognitive Processes)': 63,

I have a stupid method using

name_list=[
    'focuspast (Past Focus)', 'focusfuture (Future Focus)', 'cogproc (Cognitive Processes)',
]
for k, v in c_counts.items():
    while k in name_list:
        print(k, v)
        break

Is there a smarter approach?
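A more direct approach than the loop above is to index the Counter with just the names you want; a Counter returns 0 for missing keys, so no membership test is needed:

```python
from collections import Counter

# stand-in for the full c_counts shown above
c_counts = Counter({
    "focuspast (Past Focus)": 24,
    "focusfuture (Future Focus)": 3,
    "cogproc (Cognitive Processes)": 63,
    "social (Social)": 74,
})
wanted = ["focuspast (Past Focus)",
          "focusfuture (Future Focus)",
          "cogproc (Cognitive Processes)"]
# a dict comprehension keeps only the categories of interest
selected = {name: c_counts[name] for name in wanted}
print(selected)
# {'focuspast (Past Focus)': 24, 'focusfuture (Future Focus)': 3,
#  'cogproc (Cognitive Processes)': 63}
```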
