
neosca's Introduction

NeoSCA


NeoSCA is a fork of Xiaofei Lu's L2 Syntactic Complexity Analyzer (L2SCA) and Lexical Complexity Analyzer (LCA). Starting from version 0.1.0, NeoSCA provides a graphical interface and no longer requires a Java installation: it translates a portion of the Tregex code into Python and uses Stanza instead of the Stanford Parser.

Basic Usage

GUI

Download and run the packaged application

Latest Release for Windows
  1. Extract all files
  2. Double-click NeoSCA/NeoSCA.exe to run

Latest Release for macOS
  1. Extract all files
  2. In Launchpad, search for and open Terminal, type "xattr -rc " (note the trailing space), drag the whole NeoSCA directory into the Terminal window, and press Enter
  3. Double-click NeoSCA.app to run

Latest Release for Arch Linux
  1. Extract all files
  2. Double-click NeoSCA/NeoSCA to run

Past Releases: not recommended
Baidu Netdisk: for users with unstable connections to GitHub

Command Line

Install NeoSCA from source:

pip install git+https://github.com/tanloong/neosca

Run SCA or LCA with:

python -m neosca sca filepath.txt
python -m neosca sca --text 'This is a test.'
python -m neosca lca filepath.txt
python -m neosca lca --text 'This is a test.'

To see other command-line options, use:

python -m neosca --help
python -m neosca sca --help
python -m neosca lca --help

Citing

If you use NeoSCA in your research, please cite as follows.

BibTeX
@misc{long2024neosca,
title        = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.1.4},
author       = {Long Tan},
howpublished = {\url{https://github.com/tanloong/neosca}},
year         = {2024}
}
APA (7th edition)
Tan, L. (2024). NeoSCA (version 0.1.4) [Computer software]. GitHub. https://github.com/tanloong/neosca
MLA (9th edition)
Tan, Long. NeoSCA. version 0.1.4, GitHub, 2024, https://github.com/tanloong/neosca.

If you use the Syntactic Complexity Analyzer module of NeoSCA, please also cite Xiaofei Lu's article describing L2SCA.

BibTeX
@article{xiaofei2010automatic,
title     = {Automatic analysis of syntactic complexity in second language writing},
author    = {Xiaofei Lu},
journal   = {International Journal of Corpus Linguistics},
volume    = {15},
number    = {4},
pages     = {474--496},
year      = {2010},
publisher = {John Benjamins Publishing Company},
doi       = {10.1075/ijcl.15.4.02lu},
}
APA (7th edition)
Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474-496. https://doi.org/10.1075/ijcl.15.4.02lu
MLA (9th edition)
Lu, Xiaofei. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics, vol. 15, no. 4, John Benjamins Publishing Company, 2010, pp. 474-96, https://doi.org/10.1075/ijcl.15.4.02lu.

If you use the Lexical Complexity Analyzer module of NeoSCA, please also cite Xiaofei Lu's article about LCA.

BibTeX
@article{xiaofei2012relationship,
author  = {Xiaofei Lu},
title   = {The Relationship of Lexical Richness to the Quality of ESL Learners' Oral Narratives},
journal = {The Modern Language Journal},
volume  = {96},
number  = {2},
pages   = {190-208},
doi     = {10.1111/j.1540-4781.2011.01232\_1.x},
year    = {2012}
}
APA (7th edition)
Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners' oral narratives. The Modern Language Journal, 96(2), 190-208. https://doi.org/10.1111/j.1540-4781.2011.01232_1.x
MLA (9th edition)
Lu, Xiaofei. "The Relationship of Lexical Richness to the Quality of ESL Learners' Oral Narratives." The Modern Language Journal, vol. 96, no. 2, Wiley-Blackwell, 2012, pp. 190-208.

Contact

You can send bug reports, feature requests, or questions via the project's GitHub issue tracker at https://github.com/tanloong/neosca/issues.


neosca's Issues

Subsequent files cannot be processed if parsed results for preceding files are present

  1. Description
If earlier input files are accompanied by their parsed results (.parsed files), subsequent files that lack parsed results cannot be processed.
java.io.IOException: Unable to open "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" as class path, filename or URL
Traceback (most recent call last):
  File "/home/tan/.local/bin/nsca", line 33, in <module>
    sys.exit(load_entry_point('neosca==0.0.38', 'console_scripts', 'nsca')())
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 587, in main
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 552, in run
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 518, in wrapper
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/main.py", line 536, in run_on_ifiles
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 146, in run_on_ifiles
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 127, in parse_ifile_and_query
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 118, in parse_ifile
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/neosca.py", line 82, in parse_text
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/parser.py", line 44, in __init__
  File "/home/tan/.local/lib/python3.10/site-packages/neosca-0.0.38-py3.10.egg/neosca/parser.py", line 63, in init_parser
AttributeError: 'NoneType' object has no attribute 'setOptionFlags'
  2. How to reproduce
echo 'This is a test.' | tee 1.txt 2.txt
nsca 1.txt -p

Now the current directory contains 3 files: 1.txt, 1.parsed, and 2.txt

nsca 1.txt 2.txt # fail
nsca 2.txt 1.txt # succeed

TypeError: Class edu.stanford.nlp.parser.lexparser.LexicalizedParser is not found

The program raises this error if the path in $STANFORD_PARSER_HOME contains Chinese (non-ASCII) characters.

PS C:\Users\Administrator> $env:STANFORD_PARSER_HOME
...
C:\Users\Administrator\AppData\Roaming\中文目录\stanford-parser-full-2020-11-17
PS C:\Users\Administrator> nsca --text "This is a test." --stdout
...
Command-line text: This is a test.
Java has already been installed. ok
Stanford Parser has already been installed. ok
Stanford Tregex has already been installed. ok
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Scripts\nsca.exe\__main__.py", line 7, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 597, in main
    success, err_msg = ui.run()
                       ^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 560, in run
    return self.run_on_text()  # type: ignore
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 527, in wrapper
    func(self, *args, **kwargs)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\main.py", line 538, in run_on_text
    analyzer.run_on_text(self.options.text)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\neosca.py", line 99, in run_on_text
    trees = self.parse_text(text)
            ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\neosca.py", line 82, in parse_text
    self.parser = StanfordParser(
                  ^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\parser.py", line 44, in __init__
    self.init_parser()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\neosca\parser.py", line 58, in init_parser
    LexicalizedParser = JClass(self.PARSER_GRAMMAR)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\jpype\_jclass.py", line 99, in __new__
    return _jpype._getClass(jc)
           ^^^^^^^^^^^^^^^^^^^^
TypeError: Class edu.stanford.nlp.parser.lexparser.LexicalizedParser is not found
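
A quick way to confirm this failure mode is to check the configured path for non-ASCII characters before handing it to the JVM. The helper below is a hypothetical diagnostic, not part of NeoSCA:

```python
# Hypothetical diagnostic, not part of NeoSCA: list the non-ASCII
# characters in a path, which the JVM classpath lookup may mishandle.
def find_non_ascii(path: str) -> list:
    return [ch for ch in path if not ch.isascii()]

ascii_home = r"C:\tools\stanford-parser-full-2020-11-17"
cjk_home = r"C:\Users\Administrator\AppData\Roaming\中文目录\stanford-parser-full-2020-11-17"

print(find_non_ascii(ascii_home))  # []
print(find_non_ascii(cjk_home))    # ['中', '文', '目', '录']
```

If the second call returns anything, moving the parser to an ASCII-only directory is the simplest workaround.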

IndexError when running only with subfiles

How to reproduce

  1. pip install neosca==0.0.38
  2. download sample1.txt and sample2.txt
  3. nsca -c sample1.txt sample2.txt

Error message

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/tan/projects/neosca/neosca/__main__.py", line 4, in <module>
    main()
  File "/home/tan/projects/neosca/neosca/main.py", line 586, in main
    success, err_msg = ui.run()
  File "/home/tan/projects/neosca/neosca/main.py", line 551, in run
    return self.run_on_ifiles()  # type: ignore
  File "/home/tan/projects/neosca/neosca/main.py", line 514, in wrapper
    func(self, *args, **kwargs)  # type: ignore
  File "/home/tan/projects/neosca/neosca/main.py", line 532, in run_on_ifiles
    analyzer.run_on_ifiles(self.verified_ifile_list)
  File "/home/tan/projects/neosca/neosca/neosca.py", line 150, in run_on_ifiles
    self.write_freq_output()
  File "/home/tan/projects/neosca/neosca/neosca.py", line 174, in write_freq_output
    freq_output = self.get_freq_output(self.oformat_freq)
  File "/home/tan/projects/neosca/neosca/neosca.py", line 159, in get_freq_output
    freq_output = self.counter_lists[0].fields
IndexError: list index out of range

However, nsca -c sample1.txt sample2.txt -- another-file.txt works.
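
The traceback suggests that when every input is a subfile, no per-file counters are produced, so indexing the first element fails. The sketch below is a hypothetical reduction of the bug, not NeoSCA's actual code, showing how a guard avoids the crash:

```python
# Hypothetical reduction of the bug: an empty counter list is indexed
# unconditionally, raising IndexError. Guarding first avoids the crash.
class Counter:
    def __init__(self, fields):
        self.fields = fields

def get_freq_output(counter_lists):
    if not counter_lists:  # nothing to report
        return ""
    return counter_lists[0].fields

print(get_freq_output([]))                   # "" instead of IndexError
print(get_freq_output([Counter("W,S,VP")]))  # W,S,VP
```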

Thanks a lot for updating this!

Thanks a lot for updating this to use Stanza.

Would it be possible to update the README on how to use the command line tool?
Also, how can I replicate the following script using neosca?

This is taken from the original L2SCA code:

import os
import re
import subprocess


def division(x, y):
    # Return 0 when either operand is 0, avoiding ZeroDivisionError
    if float(x) == 0 or float(y) == 0:
        return 0
    return float(x) / float(y)

# List of tregex patterns for various structures
patternlist = [
    "'ROOT !> __'",
    "'VP > S|SINV|SQ'",
    "'S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])]'",
    "'S|SBARQ|SINV|SQ > ROOT | [$-- S|SBARQ|SINV|SQ !>> SBAR|VP]'",
    "'SBAR < (S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])])'",
    "'S|SBARQ|SINV|SQ [> ROOT | [$-- S|SBARQ|SINV|SQ !>> SBAR|VP]] << (SBAR < (S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])]))'",
    "'ADJP|ADVP|NP|VP < CC'",
    "'NP !> NP [<< JJ|POS|PP|S|VBG | << (NP $++ NP !$+ CC)]'",
    "'SBAR [<# WHNP | <# (IN < That|that|For|for) | <, S] & [$+ VP | > VP]'",
    "'S < (VP <# VBG|TO) $+ VP'",
    "'FRAG > ROOT !<< (S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])])'",
    "'FRAG > ROOT !<< (S|SBARQ|SINV|SQ > ROOT | [$-- S|SBARQ|SINV|SQ !>> SBAR|VP])'",
    "'MD|VBZ|VBP|VBD > (SQ !< VP)'"
]

# Path to the Stanford parser

def analyze_text(raw_text, long_names=False):
    current_dir = os.path.dirname(os.path.abspath(__file__))
    inputFile = os.path.join(current_dir, "input_file_temp.txt")

    with open(inputFile, "w") as temp_file:
        temp_file.write(raw_text)

    # Name a temporary file to hold the parse trees of the input file
    parsedFile = inputFile + ".parsed"

    parserPath = os.path.join(current_dir, "stanford-parser-full-2020-11-17/lexparser.sh")

    # Parse the input file
    command = f"{parserPath} {inputFile} > {parsedFile}"
    subprocess.getoutput(command)

    # List of counts of the patterns
    patterncount = []

    tregex_path = os.path.join(current_dir, "tregex.sh")

    # Query the parse trees using the tregex patterns
    for pattern in patternlist:
        command = f"{tregex_path} {pattern} {parsedFile} -C -o"
        count = subprocess.getoutput(command).split('\n')[-1]
        patterncount.append(int(count))

    # Update frequencies of complex nominals, clauses, and T-units
    patterncount[7] = patterncount[-4] + patterncount[-5] + patterncount[-6]
    patterncount[2] = patterncount[2] + patterncount[-3]
    patterncount[3] = patterncount[3] + patterncount[-2]
    patterncount[1] = patterncount[1] + patterncount[-1]

    # Word count
    with open(parsedFile, "r") as infile:
        content = infile.read()
    w = len(re.findall(r"\([A-Z]+\$? [^\)\(-]+\)", content))


    # Frequencies of structures other than words
    s, vp, c, t, dc, ct, cp, cn = patterncount[:8]

    # Compute the 14 syntactic complexity indices
    mls = division(w, s)
    mlt = division(w, t)
    mlc = division(w, c)
    c_s = division(c, s)
    vp_t = division(vp, t)
    c_t = division(c, t)
    dc_c = division(dc, c)
    dc_t = division(dc, t)
    t_s = division(t, s)
    ct_t = division(ct, t)
    cp_t = division(cp, t)
    cp_c = division(cp, c)
    cn_t = division(cn, t)
    cn_c = division(cn, c)


    if long_names:
        measures = {
            "W": w,
            "S": s,
            "VP": vp,
            "C": c,
            "T": t,
            "DC": dc,
            "CT": ct,
            "CP": cp,
            "CN": cn,
            "MLS": mls,
            "MLT": mlt,
            "MLC": mlc,
            "C/S": c_s,
            "VP/T": vp_t,
            "C/T": c_t,
            "DC/C": dc_c,
            "DC/T": dc_t,
            "T/S": t_s,
            "CT/T": ct_t,
            "CP/T": cp_t,
            "CP/C": cp_c,
            "CN/T": cn_t,
            "CN/C": cn_c
        }
    else:
        measures = {
            "words": w,
            "sentences": s,
            "verb phrases": vp,
            "clauses": c,
            "T-units": t,
            "dependent clauses": dc,
            "complex T-units": ct,
            "coordinate phrases": cp,
            "complex nominals": cn,
            "mean length of sentence (MLS)": mls,
            "mean length of T-unit (MLT)": mlt,
            "mean length of clause (MLC)": mlc,
            "clauses per sentence (C/S)": c_s,
            "verb phrases per T-unit (VP/T)": vp_t,
            "clauses per T-unit (C/T)": c_t,
            "dependent clauses per clause (DC/C)": dc_c,
            "dependent clauses per T-unit (DC/T)": dc_t,
            "T-units per sentence (T/S)": t_s,
            "complex T-unit ratio (CT/T)": ct_t,
            "coordinate phrases per T-unit (CP/T)": cp_t,
            "coordinate phrases per clause (CP/C)": cp_c,
            "complex nominals per T-unit (CN/T)": cn_t,
            "complex nominals per clause (CN/C)": cn_c
        }
    for key in measures:
        measures[key] = round(measures[key], 4)

    # Delete the temporary input file and the file holding the parse trees
    os.remove(parsedFile)
    os.remove(inputFile)

    return measures
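
To make the arithmetic above concrete, here is a self-contained worked example of a few of the 14 indices, using made-up counts (all numbers are illustrative, not real parser output):

```python
# Worked example of the index arithmetic above, with made-up counts.
def division(x, y):
    # Same safe division as the script: 0 when either operand is 0.
    if float(x) == 0 or float(y) == 0:
        return 0
    return float(x) / float(y)

w, s, c, t, dc = 120, 8, 14, 10, 4   # illustrative counts only

mls = division(w, s)    # mean length of sentence
mlt = division(w, t)    # mean length of T-unit
c_t = division(c, t)    # clauses per T-unit
dc_c = division(dc, c)  # dependent clauses per clause

print(mls, mlt, c_t, round(dc_c, 4))  # 15.0 12.0 1.4 0.2857
```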

i.e., how do I get the same measurements dictionary using neosca?

[Problem] macOS version becomes unresponsive when uploading a large batch of txt files at once

Hello developer, thank you very much for developing and continuously updating NeoSCA, which makes English lexical and syntactic complexity analysis convenient. However, I have run into a small problem while using the tool: I downloaded the macOS version of the NeoSCA app, but when I upload many txt files at once, the app stays unresponsive for a long time. I suspect this is caused by a memory issue, so I would like to ask whether the earlier Python package version could be made available, so that I can write my own code to process the files. Thanks again for your hard work!
