
neosca's Introduction

NeoSCA


NeoSCA is a fork of Xiaofei Lu's L2 Syntactic Complexity Analyzer (L2SCA) and Lexical Complexity Analyzer (LCA). Starting from version 0.1.0, NeoSCA provides a graphical interface and no longer requires a Java installation: it translates a portion of the Tregex code into Python and uses Stanza instead of the Stanford Parser.

Basic Usage

GUI

Download and run the packaged application

Latest Release for Windows
  1. Extract all files
  2. Double-click NeoSCA/NeoSCA.exe to run

Latest Release for macOS
  1. Extract all files
  2. In Launchpad, search for and open Terminal, type "xattr -rc " (note the trailing space), drag the whole NeoSCA directory into the Terminal window, and press Enter
  3. Double-click NeoSCA.app to run

Latest Release for Arch Linux
  1. Extract all files
  2. Double-click NeoSCA/NeoSCA to run

Past Releases: not recommended
Baidu Netdisk: for users with unstable connections to GitHub

Command Line

Install NeoSCA from source:

pip install git+https://github.com/tanloong/neosca

Run SCA or LCA with:

python -m neosca sca filepath.txt
python -m neosca sca --text 'This is a test.'
python -m neosca lca filepath.txt
python -m neosca lca --text 'This is a test.'

To see other command-line options, use:

python -m neosca --help
python -m neosca sca --help
python -m neosca lca --help

Citing

If you use NeoSCA in your research, please cite as follows.

BibTeX
@misc{long2024neosca,
title        = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.1.4},
author       = {Long Tan},
howpublished = {\url{https://github.com/tanloong/neosca}},
year         = {2024}
}
APA (7th edition)
Tan, L. (2024). NeoSCA (version 0.1.4) [Computer software]. GitHub. https://github.com/tanloong/neosca
MLA (9th edition)
Tan, Long. NeoSCA. version 0.1.4, GitHub, 2024, https://github.com/tanloong/neosca.

If you use the Syntactic Complexity Analyzer module of NeoSCA, please also cite Xiaofei Lu's article describing L2SCA.

BibTeX
@article{xiaofei2010automatic,
title     = {Automatic analysis of syntactic complexity in second language writing},
author    = {Xiaofei Lu},
journal   = {International Journal of Corpus Linguistics},
volume    = {15},
number    = {4},
pages     = {474--496},
year      = {2010},
publisher = {John Benjamins Publishing Company},
doi       = {10.1075/ijcl.15.4.02lu},
}
APA (7th edition)
Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474-496. https://doi.org/10.1075/ijcl.15.4.02lu
MLA (9th edition)
Lu, Xiaofei. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics, vol. 15, no. 4, John Benjamins Publishing Company, 2010, pp. 474-96, https://doi.org/10.1075/ijcl.15.4.02lu.

If you use the Lexical Complexity Analyzer module of NeoSCA, please also cite Xiaofei Lu's article about LCA.

BibTeX
@article{xiaofei2012relationship,
author  = {Xiaofei Lu},
title   = {The Relationship of Lexical Richness to the Quality of ESL Learners' Oral Narratives},
journal = {The Modern Language Journal},
volume  = {96},
number  = {2},
pages   = {190-208},
doi     = {10.1111/j.1540-4781.2011.01232\_1.x},
year    = {2012}
}
APA (7th edition)
Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners' oral narratives. The Modern Language Journal, 96(2), 190-208. https://doi.org/10.1111/j.1540-4781.2011.01232_1.x
MLA (9th edition)
Lu, Xiaofei. "The Relationship of Lexical Richness to the Quality of ESL Learners' Oral Narratives." The Modern Language Journal, vol. 96, no. 2, Wiley-Blackwell, 2012, pp. 190-208.

Contact

You can send bug reports, feature requests, or questions via the project's GitHub issue tracker at https://github.com/tanloong/neosca/issues.


neosca's Issues

Subsequent files cannot be processed if parsed results for preceding files are present

  1. Description
If earlier input files are accompanied by their parsed results (.parsed files), subsequent files that lack parsed results cannot be processed.
java.io.IOException: Unable to open "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" as class path, filename or URL
Traceback (most recent call last):
  File "/home/tan/.local/bin/nsca", line 33, in <module>
    sys.exit(load_entry_point('neosca==0.0.38', 'console_scripts', 'nsca')())
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 587, in main
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 552, in run
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 518, in wrapper
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 536, in run_on_ifiles
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 146, in run_on_ifiles
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 127, in parse_ifile_and_query
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 118, in parse_ifile
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 82, in parse_text
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/parser.py", line 44, in __init__
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/parser.py", line 63, in init_parser
AttributeError: 'NoneType' object has no attribute 'setOptionFlags'
  2. How to reproduce
echo 'This is a test.' | tee 1.txt 2.txt
nsca 1.txt -p

Now the current directory contains 3 files: 1.txt, 1.parsed, and 2.txt

nsca 1.txt 2.txt # fail
nsca 2.txt 1.txt # succeed

TypeError: Class edu.stanford.nlp.parser.lexparser.LexicalizedParser is not found

The program raises this error if the path in $STANFORD_PARSER_HOME contains Chinese (non-ASCII) characters.

PS C:\Users\Administrator> $env:STANFORD_PARSER_HOME
...
C:\Users\Administrator\AppData\Roaming\中文目录\stanford-parser-full-2020-11-17
PS C:\Users\Administrator> nsca --text "This is a test." --stdout
...
Command-line text: This is a test.
Java has already been installed. ok
Stanford Parser has already been installed. ok
Stanford Tregex has already been installed. ok
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Scripts\nsca.exe\__main__.py", line 7, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 597, in main
    success, err_msg = ui.run()
                       ^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 560, in run
    return self.run_on_text()  # type: ignore
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 527, in wrapper
    func(self, *args, **kwargs)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 538, in run_on_text
    analyzer.run_on_text(self.options.text)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\neosca.py", line 99, in run_on_text
    trees = self.parse_text(text)
            ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\neosca.py", line 82, in parse_text
    self.parser = StanfordParser(
                  ^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\parser.py", line 44, in __init__
    self.init_parser()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\parser.py", line 58, in init_parser
    LexicalizedParser = JClass(self.PARSER_GRAMMAR)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\jpype\_jclass.py", line 99, in __new__
    return _jpype._getClass(jc)
           ^^^^^^^^^^^^^^^^^^^^
TypeError: Class edu.stanford.nlp.parser.lexparser.LexicalizedParser is not found
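
A quick way to confirm this failure mode is to check the configured path for non-ASCII characters before handing it to the JVM. The helper below is a hypothetical diagnostic, not part of NeoSCA:

```python
# Hypothetical diagnostic, not part of NeoSCA: list the non-ASCII
# characters in a path, which the JVM classpath lookup may mishandle.
def find_non_ascii(path: str) -> list:
    return [ch for ch in path if not ch.isascii()]

ascii_home = r"C:\tools\stanford-parser-full-2020-11-17"
cjk_home = r"C:\Users\Administrator\AppData\Roaming\中文目录\stanford-parser-full-2020-11-17"

print(find_non_ascii(ascii_home))  # []
print(find_non_ascii(cjk_home))    # ['中', '文', '目', '录']
```

If the second call returns anything, moving the parser to an ASCII-only directory is the simplest workaround.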

IndexError when running only with subfiles

How to reproduce

  1. pip install neosca==0.0.38
  2. download sample1.txt and sample2.txt
  3. nsca -c sample1.txt sample2.txt

Error message

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/tan/projects/neosca/neosca/__main__.py", line 4, in <module>
    main()
  File "/home/tan/projects/neosca/neosca/main.py", line 586, in main
    success, err_msg = ui.run()
  File "/home/tan/projects/neosca/neosca/main.py", line 551, in run
    return self.run_on_ifiles()  # type: ignore
  File "/home/tan/projects/neosca/neosca/main.py", line 514, in wrapper
    func(self, *args, **kwargs)  # type: ignore
  File "/home/tan/projects/neosca/neosca/main.py", line 532, in run_on_ifiles
    analyzer.run_on_ifiles(self.verified_ifile_list)
  File "/home/tan/projects/neosca/neosca/neosca.py", line 150, in run_on_ifiles
    self.write_freq_output()
  File "/home/tan/projects/neosca/neosca/neosca.py", line 174, in write_freq_output
    freq_output = self.get_freq_output(self.oformat_freq)
  File "/home/tan/projects/neosca/neosca/neosca.py", line 159, in get_freq_output
    freq_output = self.counter_lists[0].fields
IndexError: list index out of range

However, nsca -c sample1.txt sample2.txt -- another-file.txt works.
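
The traceback suggests that when every input is a subfile, no per-file counters are produced, so indexing the first element fails. The sketch below is a hypothetical reduction of the bug, not NeoSCA's actual code, showing how a guard avoids the crash:

```python
# Hypothetical reduction of the bug: an empty counter list is indexed
# unconditionally, raising IndexError. Guarding first avoids the crash.
class Counter:
    def __init__(self, fields):
        self.fields = fields

def get_freq_output(counter_lists):
    if not counter_lists:  # nothing to report
        return ""
    return counter_lists[0].fields

print(get_freq_output([]))                   # "" instead of IndexError
print(get_freq_output([Counter("W,S,VP")]))  # W,S,VP
```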

Thanks a lot for updating this!

Thanks a lot for updating this to use Stanza.

Would it be possible to update the README on how to use the command line tool?
Also, how can I replicate the following script using neosca?

This is taken from the original L2SCA code:

import os
import re
import subprocess


def division(x, y):
    # Return 0 when either operand is 0, avoiding ZeroDivisionError
    if float(x) == 0 or float(y) == 0:
        return 0
    return float(x) / float(y)

# List of tregex patterns for various structures
patternlist = [
    "'ROOT !> __'",
    "'VP > S|SINV|SQ'",
    "'S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])]'",
    "'S|SBARQ|SINV|SQ > ROOT | [$-- S|SBARQ|SINV|SQ !>> SBAR|VP]'",
    "'SBAR < (S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])])'",
    "'S|SBARQ|SINV|SQ [> ROOT | [$-- S|SBARQ|SINV|SQ !>> SBAR|VP]] << (SBAR < (S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])]))'",
    "'ADJP|ADVP|NP|VP < CC'",
    "'NP !> NP [<< JJ|POS|PP|S|VBG | << (NP $++ NP !$+ CC)]'",
    "'SBAR [<# WHNP | <# (IN < That|that|For|for) | <, S] & [$+ VP | > VP]'",
    "'S < (VP <# VBG|TO) $+ VP'",
    "'FRAG > ROOT !<< (S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])])'",
    "'FRAG > ROOT !<< (S|SBARQ|SINV|SQ > ROOT | [$-- S|SBARQ|SINV|SQ !>> SBAR|VP])'",
    "'MD|VBZ|VBP|VBD > (SQ !< VP)'"
]

# Path to the Stanford parser

def analyze_text(raw_text, long_names=False):
    current_dir = os.path.dirname(os.path.abspath(__file__))
    inputFile = os.path.join(current_dir, "input_file_temp.txt")

    with open(inputFile, "w") as temp_file:
        temp_file.write(raw_text)

    # Name a temporary file to hold the parse trees of the input file
    parsedFile = inputFile + ".parsed"

    parserPath = os.path.join(current_dir, "stanford-parser-full-2020-11-17/lexparser.sh")

    # Parse the input file
    command = f"{parserPath} {inputFile} > {parsedFile}"
    subprocess.getoutput(command)

    # List of counts of the patterns
    patterncount = []

    tregex_path = os.path.join(current_dir, "tregex.sh")

    # Query the parse trees using the tregex patterns
    for pattern in patternlist:
        command = f"{tregex_path} {pattern} {parsedFile} -C -o"
        count = subprocess.getoutput(command).split('\n')[-1]
        patterncount.append(int(count))

    # Update frequencies of complex nominals, clauses, and T-units
    patterncount[7] = patterncount[-4] + patterncount[-5] + patterncount[-6]
    patterncount[2] = patterncount[2] + patterncount[-3]
    patterncount[3] = patterncount[3] + patterncount[-2]
    patterncount[1] = patterncount[1] + patterncount[-1]

    # Word count
    with open(parsedFile, "r") as infile:
        content = infile.read()
    w = len(re.findall(r"\([A-Z]+\$? [^\)\(-]+\)", content))


    # Frequencies of structures other than words
    s, vp, c, t, dc, ct, cp, cn = patterncount[:8]

    # Compute the 14 syntactic complexity indices
    mls = division(w, s)
    mlt = division(w, t)
    mlc = division(w, c)
    c_s = division(c, s)
    vp_t = division(vp, t)
    c_t = division(c, t)
    dc_c = division(dc, c)
    dc_t = division(dc, t)
    t_s = division(t, s)
    ct_t = division(ct, t)
    cp_t = division(cp, t)
    cp_c = division(cp, c)
    cn_t = division(cn, t)
    cn_c = division(cn, c)


    if long_names:
        measures = {
            "W": w,
            "S": s,
            "VP": vp,
            "C": c,
            "T": t,
            "DC": dc,
            "CT": ct,
            "CP": cp,
            "CN": cn,
            "MLS": mls,
            "MLT": mlt,
            "MLC": mlc,
            "C/S": c_s,
            "VP/T": vp_t,
            "C/T": c_t,
            "DC/C": dc_c,
            "DC/T": dc_t,
            "T/S": t_s,
            "CT/T": ct_t,
            "CP/T": cp_t,
            "CP/C": cp_c,
            "CN/T": cn_t,
            "CN/C": cn_c
        }
    else:
        measures = {
            "words": w,
            "sentences": s,
            "verb phrases": vp,
            "clauses": c,
            "T-units": t,
            "dependent clauses": dc,
            "complex T-units": ct,
            "coordinate phrases": cp,
            "complex nominals": cn,
            "mean length of sentence (MLS)": mls,
            "mean length of T-unit (MLT)": mlt,
            "mean length of clause (MLC)": mlc,
            "clauses per sentence (C/S)": c_s,
            "verb phrases per T-unit (VP/T)": vp_t,
            "clauses per T-unit (C/T)": c_t,
            "dependent clauses per clause (DC/C)": dc_c,
            "dependent clauses per T-unit (DC/T)": dc_t,
            "T-units per sentence (T/S)": t_s,
            "complex T-unit ratio (CT/T)": ct_t,
            "coordinate phrases per T-unit (CP/T)": cp_t,
            "coordinate phrases per clause (CP/C)": cp_c,
            "complex nominals per T-unit (CN/T)": cn_t,
            "complex nominals per clause (CN/C)": cn_c
        }
    for key in measures:
        measures[key] = round(measures[key], 4)

    # Delete the temporary input file and the file holding the parse trees
    os.remove(parsedFile)
    os.remove(inputFile)

    return measures
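
To make the arithmetic above concrete, here is a self-contained worked example of a few of the 14 indices, using made-up counts (all numbers are illustrative, not real parser output):

```python
# Worked example of the index arithmetic above, with made-up counts.
def division(x, y):
    # Same safe division as the script: 0 when either operand is 0.
    if float(x) == 0 or float(y) == 0:
        return 0
    return float(x) / float(y)

w, s, c, t, dc = 120, 8, 14, 10, 4   # illustrative counts only

mls = division(w, s)    # mean length of sentence
mlt = division(w, t)    # mean length of T-unit
c_t = division(c, t)    # clauses per T-unit
dc_c = division(dc, c)  # dependent clauses per clause

print(mls, mlt, c_t, round(dc_c, 4))  # 15.0 12.0 1.4 0.2857
```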

i.e., how do I get the same measurements dictionary using neosca?

[Problem] macOS version becomes unresponsive when uploading a large batch of txt files at once

Hello developer, thank you very much for developing and continuously updating NeoSCA, which makes English lexical and syntactic complexity analysis convenient. However, I have run into a small problem while using the tool: I downloaded the macOS version of the NeoSCA app, but when I upload many txt files at once, the app stays unresponsive for a long time. I suspect this is caused by a memory issue, so I would like to ask whether the earlier Python package version could be made available, so that I can write my own code to process the files. Thanks again for your hard work!
