depccg's Introduction

depccg v2

Codebase for A* CCG Parsing with a Supertag and Dependency Factored Model

2021/07/12 Updates (v2)

  • Increased stability and efficiency
    • (Replaced OpenMP with multiprocessing)
  • More integration with AllenNLP
    • The parser is now callable from within a predictor (see here)
  • A friendlier way to define your own grammar (with respect to languages or treebanks)
    • See depccg/grammar/{en,ja}.py for example grammars.

Requirements

  • Python >= 3.6.0
  • A C++ compiler supporting the C++11 standard (for gcc, version >= 4.8)

Installation

Using pip:

➜ pip install cython numpy depccg

Usage

Using a pretrained English parser

Currently, the following models are available for English:

  • basic - trained on the combination of CCGbank and the tri-training dataset (Yoshikawa et al., 2017); unlabeled/labeled F1 on CCGbank: 94.0%/88.8%; download: link (189M)
  • elmo - the basic model with its embeddings replaced with ELMo (Peters et al., 2018); unlabeled/labeled F1 on CCGbank: 94.98%/90.51%; download: link (649M)
  • rebank - the basic model trained on the rebanked CCGbank (Honnibal et al., 2010); download: link (337M)
  • elmo_rebank - the ELMo model trained on the rebanked CCGbank; download: link (1G)

The basic model can be downloaded with:

➜ depccg_en download

To use:

echo "this is a test sentence ." | depccg_en
ID=1, Prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP XX XX this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP XX XX is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N XX XX a NP[nb]/N>) (<T N 0 2> (<L N/N XX XX test N/N>) (<L N XX XX sentence N>) ) ) ) ) (<L . XX XX . .>) )
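
If you post-process the AUTO output in your own scripts, each leaf node has the shape (<L category tag tag word category>), as in the tree above. Below is a minimal, unofficial Python sketch (the regex and field order are inferred from the example output, not from a formal specification) that extracts the (category, word) pairs:

import re

# Leaf nodes in the AUTO output look like: (<L CATEGORY XX XX word CATEGORY>)
# The field order is inferred from the example above; the two middle slots
# hold the tags, which stay "XX" unless an annotator is used.
LEAF = re.compile(r"<L\s+(\S+)\s+\S+\s+\S+\s+(\S+)\s+\S+>")

def leaves(auto_tree):
    """Return (category, word) pairs for every leaf of one AUTO-format tree."""
    return LEAF.findall(auto_tree)

tree = "(<T NP 0 2> (<L NP[nb]/N XX XX a NP[nb]/N>) (<T N 0 2> (<L N/N XX XX test N/N>) (<L N XX XX sentence N>) ) )"
print(leaves(tree))
# [('NP[nb]/N', 'a'), ('N/N', 'test'), ('N', 'sentence')]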

You can download other models by specifying their names:

➜ depccg_en download elmo

To use it, make sure allennlp is installed:

echo "this is a test sentence ." | depccg_en --model elmo

You can also pass the --model option the path to a model file (a tar.gz downloaded from the links above).
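
For example, with a downloaded archive (the path is a placeholder):

echo "this is a test sentence ." | depccg_en --model /path/to/elmo.tar.gz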

Using a GPU (via the --gpu option) is recommended if one is available.

There are several output formats (see below).

echo "this is a test sentence ." | depccg_en --format deriv
ID=1, Prob=-0.0006299018859863281
 this        is           a      test  sentence  .
  NP   (S[dcl]\NP)/NP  NP[nb]/N  N/N      N      .
                                ---------------->
                                       N
                      -------------------------->
                                  NP
      ------------------------------------------>
                      S[dcl]\NP
------------------------------------------------<
                     S[dcl]
---------------------------------------------------<rp>
                      S[dcl]

By default, the input is expected to be pre-tokenized. To process untokenized sentences, pass the --tokenize option.
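
For example:

echo "This is a test sentence." | depccg_en --tokenize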

The POS and NER tags in the output are filled with XX by default. You can replace them with ones predicted using SpaCy:

echo "this is a test sentence ." | depccg_en --annotator spacy
ID=1, Prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP DT DT this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP VBZ VBZ is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N DT DT a NP[nb]/N>) (<T N 0 2> (<L N/N NN NN test N/N>) (<L N NN NN sentence N>) ) ) ) ) (<L . . . . .>) )

The parser uses SpaCy's en_core_web_sm model.

Alternatively, you can use the POS/NER taggers implemented in C&C, which may be useful in some parsing experiments:

export CANDC=/path/to/candc
➜ echo "this is a test sentence ." | depccg_en --annotator candc
ID=1, log prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP DT DT this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP VBZ VBZ is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N DT DT a NP[nb]/N>) (<T N 0 2> (<L N/N NN NN test N/N>) (<L N NN NN sentence N>) ) ) ) ) (<L . . . . .>) )

By default, depccg expects the POS and NER models to be placed in $CANDC/models/pos and $CANDC/models/ner, but you can specify them explicitly by setting the CANDC_MODEL_POS and CANDC_MODEL_NER environment variables.

It is also possible to obtain logical formulas using ccg2lambda's semantic parsing algorithm.

echo "This is a test sentence ." | depccg_en --format ccg2lambda --annotator spacy
ID=0 log probability=-0.0006299018859863281
exists x.(_this(x) & exists z1.(_sentence(z1) & _test(z1) & (x = z1)))

Using a pretrained Japanese parser

The best-performing model can be downloaded with:

➜ depccg_ja download

It can be downloaded directly here (56M).

The parser provides almost the same interface as the English one, with slight differences, including the default output format, which is compatible with the Japanese CCGbank:

echo "これはテストの文です。" | depccg_ja
ID=1, Prob=-53.98793411254883
{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] これ/これ/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] は/は/**}} {< S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] テスト/テスト/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] の/の/**}} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] 文/文/**}} {(S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f])\NP[case=nc,mod=nm,fin=f] です/です/**}}} {S[mod=nm,form=base,fin=t]\S[mod=nm,form=base,fin=f] 。/。/**}}

You can pass pre-tokenized sentences as well:

echo "これ は テスト の 文 です 。" | depccg_ja --pre-tokenized
ID=1, Prob=-53.98793411254883
{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] これ/これ/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] は/は/**}} {< S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] テスト/テスト/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] の/の/**}} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] 文/文/**}} {(S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f])\NP[case=nc,mod=nm,fin=f] です/です/**}}} {S[mod=nm,form=base,fin=t]\S[mod=nm,form=base,fin=f] 。/。/**}}

Available output formats

  • auto - the standard format, following the AUTO format of the English CCGbank
  • auto_extended - an extension of the auto format with combinator info and POS/NER tags
  • deriv - visualized derivations in ASCII art
  • xml - XML format compatible with C&C's XML format (only for English parsing)
  • conll - CoNLL format
  • html - visualized trees in MathML
  • prolog - Prolog-like format
  • jigg_xml - XML format compatible with Jigg
  • ptb - Penn Treebank-style format
  • ccg2lambda - logical formula converted from a derivation using ccg2lambda
  • jigg_xml_ccg2lambda - jigg_xml format with ccg2lambda logical formula inserted
  • json - JSON format
  • ja - a format adopted in Japanese CCGbank (only for Japanese)

Programmatic Usage

Please look into depccg/__main__.py to see how the command-line interface drives the parser.
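
If you just need parses from a Python program and don't want to depend on internal APIs (which changed between v1 and v2), one option is to shell out to the command-line interface. The sketch below is an unofficial workaround, not a documented API; it assumes depccg_en is on your PATH and that the input is pre-tokenized:

import subprocess

def parse_with_cli(sentences, output_format="auto"):
    """Run the depccg_en CLI on pre-tokenized sentences and return its raw output.

    This wraps the command-line interface shown above instead of calling
    depccg's internal classes, so it should keep working across versions.
    """
    result = subprocess.run(
        ["depccg_en", "--format", output_format],
        input="\n".join(sentences),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(parse_with_cli(["this is a test sentence ."]))

Each invocation re-loads the model, so for batch work pass many sentences per call rather than calling the function once per sentence.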

Train your own parsing model

You can use my allennlp-based supertagger and extend it.

To train a supertagger, prepare the English CCGbank and download vocab:

➜ cat ccgbank/data/AUTO/{0[2-9],1[0-9],20,21}/* > wsj_02-21.auto
➜ cat ccgbank/data/AUTO/00/* > wsj_00.auto
➜ wget http://cl.naist.jp/~masashi-y/resources/depccg/vocabulary.tar.gz
➜ tar xvf vocabulary.tar.gz

then,

➜ vocab=vocabulary train_data=wsj_02-21.auto test_data=wsj_00.auto gpu=0 \
  encoder_type=lstm token_embedding_type=char \
  allennlp train --include-package depccg --serialization-dir results depccg/allennlp/configs/supertagger.jsonnet

The training configs are passed either through environment variables or by editing the jsonnet config files directly; these are available as supertagger.jsonnet and supertagger_tritrain.jsonnet. The latter is the config for training on the tri-training silver data (309M) constructed in (Yoshikawa et al., 2017), on top of the English CCGbank.

To use the trained supertagger,

echo '{"sentence": "this is a test sentence ."}' > input.jsonl
➜ allennlp predict results/model.tar.gz --include-package depccg --output-file weights.json input.jsonl

or alternatively, you can perform CCG parsing:

➜ allennlp predict --include-package depccg --predictor parser-predictor --predictor-args '{"grammar_json_path": "depccg/models/config_en.jsonnet"}' model.tar.gz input.jsonl
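
allennlp predict writes one JSON object per input line to the output file (weights.json above). The exact keys depend on the model, so the sketch below (unofficial, for inspection only) simply lists what each prediction contains rather than assuming a particular schema:

import json

# Inspect the predictions written by `allennlp predict` above.
# One JSON object is written per input line; the available keys depend on
# the model, so we just list them instead of assuming a schema.
with open("weights.json") as f:
    for i, line in enumerate(f):
        prediction = json.loads(line)
        print(f"sentence {i}: keys = {sorted(prediction)}")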

Evaluation in terms of predicate-argument dependencies

The standard CCG parsing evaluation can be performed with the following script:

➜ cat ccgbank/data/PARG/00/* > wsj_00.parg
➜ export CANDC=/path/to/candc
➜ python -m depccg.tools.evaluate wsj_00.parg wsj_00.predicted.auto

The script depends on C&C's generate program, which is only available by compiling C&C from source.

(Currently, the above page is down. You can find the C&C parser here or here)

Miscellaneous

Diff tool

In error analysis, you often want to see diffs between trees in an intuitive way. depccg.tools.diff does exactly this:

➜ python -m depccg.tools.diff file1.auto file2.auto > diff.html

which outputs:

(screenshot: an HTML page showing diffs between trees)

where trees in the same lines of the files are compared and the diffs are marked in color.

Citation

If you make use of this software, please cite the following:

    @inproceedings{yoshikawa:2017acl,
      author={Yoshikawa, Masashi and Noji, Hiroshi and Matsumoto, Yuji},
      title={A* CCG Parsing with a Supertag and Dependency Factored Model},
      booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
      publisher={Association for Computational Linguistics},
      year={2017},
      pages={277--287},
      location={Vancouver, Canada},
      doi={10.18653/v1/P17-1026},
      url={http://aclweb.org/anthology/P17-1026}
    }

Licence

MIT Licence

Contact

For questions and usage issues, please contact [email protected].

Acknowledgement

In creating the parser, I owe a great deal to:

  • EasyCCG: from which I learned everything
  • NLTK: for its nice pretty-printing of parse derivations

depccg's People

Contributors

aslemen, hiroshinoji, ianyfan, kohei-kaji, masashi-y

depccg's Issues

tarfile.ReadError: file could not be opened successfully

I could not get depccg to work. I first tried installing via pip, then I cloned the repo and tried:

pip install -U cython
python setup.py clean
python setup.py build_ext --inplace
python setup.py install  

However, in both cases, running depccg_en download then gives:

  warnings.warn('''\
2022-06-01 10:34:05,848 - INFO - root - start downloading from 1mxl1HU99iEQcUYhWhvkowbE4WOH0UKxv
Downloading 1mxl1HU99iEQcUYhWhvkowbE4WOH0UKxv into /Users/ruben/github/depccg/depccg/models/tri_headfirst.tar.gz... Done.
2022-06-01 10:34:06,273 - INFO - root - extracting files
Traceback (most recent call last):
  File "/Users/ruben/.pyenv/versions/3.8.13/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/ruben/.pyenv/versions/3.8.13/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/ruben/github/depccg/depccg/__main__.py", line 166, in <module>
    parse_args(main)
  File "/Users/ruben/github/depccg/depccg/argparse.py", line 203, in parse_args
    args.func(args)
  File "/Users/ruben/github/depccg/depccg/argparse.py", line 112, in <lambda>
    func=lambda args: download(args.lang, args.VARIANT)
  File "/Users/ruben/github/depccg/depccg/instance_models.py", line 108, in download
    tf = tarfile.open(filename)
  File "/Users/ruben/.pyenv/versions/3.8.13/lib/python3.8/tarfile.py", line 1608, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully

MBP Pro 2021 (M1 Pro), macOS 12.4
python3.8.13
numpy==1.22.4
Cython==0.29.30
Clang:
Apple clang version 13.1.6 (clang-1316.0.21.2.5)
Target: arm64-apple-darwin21.5.0

Customising tokenization

Hello author,

Greetings. I found there is a config_en.jsonnet, which contains several en.jsonnet files specifying lots of tokens and ccg rules.
May I know that,

  1. If I want to customise the tokenizer, after modifying these files, do I need to retrain the model?
  2. Does the number of tokens in tokens.en.jsonnet have any relationship with the number of targets in the targets.en.json?

Thanks and Best Regards,
Chriss IT. Leong

N-best output format slightly different to that of EasyCCG

In EasyCCG, when I request N-best parsing for one sentence, I get something like this:

ID=1
(<T ... 1-best parse of the first sentence ... )
ID=1
(<T ... 2-best parse of the first sentence ... )

However, in depccg we obtain something like this:

ID=0
score= ...
(<T ... 1-best parse of the first sentence ... )
(<T ... 2-best parse of the first sentence ... )

It would be great to have the same format as EasyCCG (ID starting from 1, repeat ID for every new parse, and produce no scores). Do you have plans to adjust this format to be compatible with pipelines that already use EasyCCG N-best parsing?

tri_headfirst.tar.gz downloads as an HTML file, not a .tar.gz file

Following the install instructions, I executed depccg_en download. This is what happened:

username@localhost:~$ depccg_en download
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.16.0-unknown is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
2022-11-02 10:39:13,164 - INFO - root - start downloading from 1mxl1HU99iEQcUYhWhvkowbE4WOH0UKxv
Downloading 1mxl1HU99iEQcUYhWhvkowbE4WOH0UKxv into /home/username/.local/lib/python3.10/site-packages/depccg/models/tri_headfirst.tar.gz... Done.
2022-11-02 10:39:13,535 - INFO - root - extracting files
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/username/.local/lib/python3.10/site-packages/depccg/__main__.py", line 166, in <module>
    parse_args(main)
  File "/home/username/.local/lib/python3.10/site-packages/depccg/argparse.py", line 203, in parse_args
    args.func(args)
  File "/home/username/.local/lib/python3.10/site-packages/depccg/argparse.py", line 112, in <lambda>
    func=lambda args: download(args.lang, args.VARIANT)
  File "/home/username/.local/lib/python3.10/site-packages/depccg/instance_models.py", line 108, in download
    tf = tarfile.open(filename)
  File "/usr/lib/python3.10/tarfile.py", line 1639, in open
    raise ReadError(f"file could not be opened successfully:\n{error_msgs_summary}")
tarfile.ReadError: file could not be opened successfully:
- method gz: ReadError('not a gzip file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('invalid header')

The error seems to be telling me that tri_headfirst.tar.gz is not the expected file type. Checking it with file:

username@localhost:~$ file /home/username/.local/lib/python3.10/site-packages/depccg/models/tri_headfirst.tar.gz
/home/username/.local/lib/python3.10/site-packages/depccg/models/tri_headfirst.tar.gz: HTML document, ASCII text, with very long lines (2035)

The contents of the file, if it helps:

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="C7JrQ_bxZ2npIv_xsTlBUg">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=1mxl1HU99iEQcUYhWhvkowbE4WOH0UKxv">tri_headfirst.tar.gz</a> (189M)</span> is too large for Google to scan for viruses. Would you still like to download this file?</p><form id="downloadForm" action="https://docs.google.com/uc?export=download&amp;id=1mxl1HU99iEQcUYhWhvkowbE4WOH0UKxv&amp;confirm=t&amp;uuid=02449c1a-bdbb-40bb-b114-d77c9c61b6ea" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>

The net result is that the installation procedure doesn't work in my environment as written, and it would seem it wouldn't work for anyone else either?

If I'm reading the HTML contents right, it appears this code is trying to download tri_headfirst.tar.gz from Google Drive, which is not great if true.

Inconsistent output format for 1-best vs. n-best parsing

I am trying to obtain N-best parses of a sentence, but I realized that the output of depccg when obtaining more than one parse differs in format from that of using only one parse:

Using one parse (1-best):

echo "this|this|DT|O is|be|VBZ|O a|a|DT|O test|test|NN|O sentence|sentence|NN|O .|.|.|O" | python ../depccg/src/run.py ${depccg_dir}/models/tri_headfirst en --input-format POSandNERtagged --nbest 1 2>/dev/null
ID=0
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP POS POS this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP POS POS is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N POS POS a NP[nb]/N>) (<T N 0 2> (<L N/N POS POS test N/N>) (<L N POS POS sentence N>) ) ) ) ) (<L . POS POS . .>) )

When requesting 2-best:

echo "this|this|DT|O is|be|VBZ|O a|a|DT|O test|test|NN|O sentence|sentence|NN|O .|.|.|O" | python ../depccg/src/run.py ${depccg_dir}/models/tri_headfirst en --input-format POSandNERtagged --nbest 2 2>/dev/null                                                                                                                                   
ID=0
score= -0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP POS POS this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP POS POS is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N POS POS a NP[nb]/N>) (<T N 0 2> (<L N/N POS POS test N/N>) (<L N POS POS sentence N>) ) ) ) ) (<L . POS POS . .>) )
score= -17.412315368652344
(<T S/S 0 2> (<T S/S 0 2> (<L NP POS POS this NP>) (<T (S/S)\NP 0 2> (<L ((S/S)\NP)/NP POS POS is ((S/S)\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N POS POS a NP[nb]/N>) (<T N 0 2> (<L N/N POS POS test N/N>) (<L N POS POS sentence N>) ) ) ) ) (<L . POS POS . .>) )

In this latter case, depccg prints a "score" to standard output, which was not present when producing the 1-best output; this is a bit unexpected.

Parenthesized CG categories being confused with Penn Treebank node brackets

Issue description

DepCCG appears to bracket complex CG categories with parentheses (e.g. (S[smc]\PP[s])\PP[o1]) in all the provided output formats.
These parentheses, however, conflict with parentheses that serve other roles, such as the node-boundary brackets of the Penn Treebank format, and this confuses other programs.

Steps to reproduce the issue

echo "太郎は花子に怒られた" | depccg_ja --format ptb

What's the actual result?

(ROOT (S[mod=nm,form=base,fin=f] (NP[case=ga,mod=nm,fin=f] (NP[case=nc,mod=nm,fin=f] 太郎) (NP[case=ga,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] が)) (S[mod=nm,form=base,fin=f]\NP[case=ga,mod=nm,fin=f] (NP[case=ni,mod=nm,fin=f] (NP[case=nc,mod=nm,fin=f] 花子) (NP[case=ni,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] に)) ((S[mod=nm,form=base,fin=f]\NP[case=ga,mod=nm,fin=f])\NP[case=ni,mod=nm,fin=f] ((S[mod=nm,form=cont,fin=f]\NP[case=ga,mod=nm,fin=f])\NP[case=ni,mod=nm,fin=f] ((S[mod=nm,form=r,fin=f]\NP[case=ga,mod=nm,fin=f])\NP[case=ni,mod=nm,fin=f] 怒ら) (S[mod=nm,form=cont,fin=f]\S[mod=nm,form=r,fin=f] れ)) (S[mod=nm,form=base,fin=f]\S[mod=nm,form=cont,fin=f] た)))))

What's the expected result?

The label of the following tree fragment should use any brackets other than ( and ).

((S[mod=nm,form=r,fin=f]\NP[case=ga,mod=nm,fin=f])\NP[case=ni,mod=nm,fin=f] 怒ら) 

For example, it could be something like

(<S[mod=nm,form=r,fin=f]\NP[case=ga,mod=nm,fin=f]>\NP[case=ni,mod=nm,fin=f] 怒ら) 

error while installing

Is there a particular version of Cython we should use? I get the following error (interestingly, on both an OSX and a Linux machine). Wait, won't this work in a conda environment?

Collecting depccg
  Downloading depccg-2.0.3.2.tar.gz (3.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.5/3.5 MB 54.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [2 lines of output]
      Could not import Cython, which is required to build depccg extension modules.
      Please install cython and numpy prior to installing depccg.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
(qnlp) which cython
~/miniconda3/envs/qnlp/bin/cython
(qnlp) cython --version
Cython version 0.29.35

"pip install depccg" fails with an error

Hi.
I get the following error message when I try to install depccg. How should I deal with this?
"Releases" is empty, does this have anything to do with it?

pip install depccg

Collecting depccg
Using cached depccg-2.0.2.tar.gz (3.5 MB)
Preparing metadata (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\XXXXX\anaconda3\envs\qiskit\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\XXXXXXXXX\AppData\Local\Temp\pip-install-l1bjgvy2\depccg_1e7b5a90f22840b7aaf4447c64be1c3e\setup.py'"'"'; file='"'"'C:\Users\XXXXXX\AppData\Local\Temp\pip-install-l1bjgvy2\depccg_1e7b5a90f22840b7aaf4447c64be1c3e\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\XXXXXXXX\AppData\Local\Temp\pip-pip-egg-info-2c1_az10'
cwd: C:\Users\XXXXXXX\AppData\Local\Temp\pip-install-l1bjgvy2\depccg_1e7b5a90f22840b7aaf4447c64be1c3e
Complete output (13 lines):
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\XXXXXXXX\AppData\Local\Temp\pip-install-l1bjgvy2\depccg_1e7b5a90f22840b7aaf4447c64be1c3e\setup.py", line 105, in
generate_cpp([])
File "C:\Users\XXXXXXXXX\AppData\Local\Temp\pip-install-l1bjgvy2\depccg_1e7b5a90f22840b7aaf4447c64be1c3e\setup.py", line 62, in generate_cpp
p = subprocess.call(["make", options], env=os.environ)
File "c:\users\XXXXXXXXXX\anaconda3\envs\qiskit\lib\subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "c:\users\XXXXXXXXXX\anaconda3\envs\qiskit\lib\subprocess.py", line 858, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "c:\users\XXXXXXXXXX\anaconda3\envs\qiskit\lib\subprocess.py", line 1311, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] 指定されたファイルが見つかりません。

WARNING: Discarding https://files.pythonhosted.org/packages/a9/92/8f3b372662f63e0c4af3cbad41daffeeb6d0df496dc64e7f843a66d55016/depccg-2.0.2.tar.gz#sha256=de1a7ce6a9d707a2a8dc5c0730ed2d435694d4c297e747a8c9a6c329cc4438a4 (from https://pypi.org/simple/depccg/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Throw IndexError when using --input-format partial

I installed the latest version of depccg.
The prolog format prints the derivation (FYI, I am using the build directory rather than installing depccg):

$ echo "この T シャツ" | ext/depccg_portable/build/scripts-3.6/depccg_ja --silent -f prolog --pre-tokenized

:- op(601, xfx, (/)).
:- op(601, xfx, (\)).
:- multifile ccg/2, id/2.
:- discontiguous ccg/2, id/2.

ccg(1,
 fa(np:nc,
  t((np:X1/np:X1), 'XX', 'XX', 'XX/XX/XX/XX', 'XX', 'XX'),
  fa(np:nc,
   t((np:X1/np:X1), 'XX', 'XX', 'XX/XX/XX/XX', 'XX', 'XX'),
   t(np:nc, 'XX', 'XX', 'XX/XX/XX/XX', 'XX', 'XX')))).

But --input-format partial throws the error:

echo "この| T| シャツ|" | ext/depccg_portable/build/scripts-3.6/depccg_ja  --silent -f prolog --input-format partial --pre-tokenizedTraceback (most recent call last):
  File "/net/gsb/lib/python3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/net/gsb/lib/python3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/net/gsb/depccg_portable/build/lib.linux-x86_64-3.6/depccg/__main__.py", line 262, in <module>
    args.func(args)
  File "/net/gsb/depccg_portable/build/lib.linux-x86_64-3.6/depccg/__main__.py", line 115, in main
    constraints=constraints)
  File "parser.pyx", line 253, in depccg.parser.EnglishCCGParser.parse_doc
  File "/net/gsb/depccg_portable/build/lib.linux-x86_64-3.6/depccg/ja_lstm_parser_bi.py", line 114, in predict_doc
    res.extend(self._predict(doc[i:i + batchsize]))
  File "/net/gsb/depccg_portable/build/lib.linux-x86_64-3.6/depccg/ja_lstm_parser_bi.py", line 99, in _predict
    xs = [self.extractor.process(x, self.xp) for x in xs]
  File "/net/gsb/depccg_portable/build/lib.linux-x86_64-3.6/depccg/ja_lstm_parser_bi.py", line 99, in <listcomp>
    xs = [self.extractor.process(x, self.xp) for x in xs]
  File "/net/gsb/depccg_portable/build/lib.linux-x86_64-3.6/depccg/ja_lstm_parser_bi.py", line 39, in process
    c[0, 0] = self.start_char
IndexError: index 0 is out of bounds for axis 1 with size 0

Query of notation meanings

Dear [Author],

I am trying to use the tool as a component in my research. May I know the meanings of the notations in the result?
Specifically, I have noticed notations such as Subj(), Acc(), AccI, etc. May I have an explanation of their meanings?

Thanks and Best Regards,
Sincerely,
Chriss IT. Leong

Error with Japanese CCG parsing

I am running Japanese CCG parsing with depccg on the text 「世界最高 の テノール 歌手 に なっ た イタリア人 が い た 。」

When I run the ccg2lambda command semparse.py (CCG output XML) ja/semantic_templates_ja_emnlp2016.yaml (output), the following error occurs:

ERROR:root:An error occurred: 'x' is an illegal predicate name. Individual variables may not be used as predicates.
\F.exists x.(exists z1.(_XX(\y.(_XX(y) & _XX(y)),z1) & x(z1)) & F(x))

What is the cause of this? Is there a way to avoid it?

Never-ending installation

Last night I tried to install the package using pipenv with the command pipenv install depccg --verbose, but it's taking forever to install, I tried the --verbose argument to see some logs, but the only log I got was:

Installing package: depccg

Writing supplied requirement line to temporary file: 'depccg'

It's been more than 15 hours and the console keeps logging the message: ⠧ Installing depccg...

My hardware is:

  • MacBook Pro, Intel i5, macOS Monterey

Incompatible with AllenNLP >= 1.0

The current my_allennlp implementation tries to access variables that were removed after version 0.9, e.g. DEFAULT_PREDICTORS.
The current version of AllenNLP is 2.1.0.
I have tried installing 0.9.0 instead, but this seems to fail on my system with Python 3.8.
I can install AllenNLP 1.5.0 and 2.1.0 just fine.

undefined symbol: _ZSt24__throw_out_of_range_fmtPKcz

Hello!

I have just installed depccg but I am getting an Import error:

src]$ pwd
/data/pascual/software/depccg/src
src]$ python run.py -h
Traceback (most recent call last):
  File "run.py", line 10, in <module>
    from depccg import PyAStarParser, PyJaAStarParser
ImportError: /data/pascual/software/depccg/src/depccg.so: undefined symbol: _ZSt24__throw_out_of_range_fmtPKcz

I am working on RHEL 7 (Red Hat) and I installed g++ locally, since it was not available in my distribution and I don't have root permissions.

src]$ which g++
/data/pascual/local/bin/g++
src]$ g++ --version
g++ (GCC) 4.9.2

Do you have any advice on how to get past this error?

Thank you very much for making this software available!
Best,
Pascual

Error with depccg_ja download

I am using depccg on Google Colab.
Until last year (around December 2021) it worked without problems, but recently (February 2022), when I installed it with the following command:
pip install cython numpy depccg
the installation itself succeeded, but depccg_ja download raised an error.
At first it was an import error, No module named 'overrides', so I installed overrides;
then it was an import error, no module named 'google.cloud.storage.retry', so I upgraded google-cloud-storage, after which I got:

TypeError: JaSupertaggingDatasetReader._read: return type None is not a typing.Iterable[allennlp.data.instance.Instance].

If there is a way to work around this error, I would appreciate your advice.

Prolog printer

For me the prolog printer is not working while other printers are fine.
When I run the parser programmatically I find that Tree.prolog() inserts numbers instead of tokens in the f-string.

ccg({0},
 rp(s:dcl, 
  ba(s:dcl, 
   t(np, 'This', '{32766.lemma}', '{32766.pos}', '{32766.chunk}', '{32766.entity}'),
   fa((s:dcl\np), 
    t(((s:dcl\np)/(s:adj\np)), 'is', '{32767.lemma}', '{32767.pos}', '{32767.chunk}', '{32767.entity}'),
    t((s:adj\np), 'second', '{32768.lemma}', '{32768.pos}', '{32768.chunk}', '{32768.entity}'))),
  t(period, '.', '{32769.lemma}', '{32769.pos}', '{32769.chunk}', '{32769.entity}'))).

The following shows that other formats work while prolog doesn't.

$ echo "Prolog printer inserts numbers" | python3 -m depccg en "$@" -f deriv --silent 

ID=1, log probability=-0.32995718717575073
  N/N       N     (S[dcl]\NP)/NP     N
 Prolog  printer     inserts      numbers
----------------->
        N
-----------------<un>
       NP
                                 ---------<un>
                                    NP
                 ------------------------->
                         S[dcl]\NP
------------------------------------------<
                  S[dcl]

$ echo "Prolog printer inserts numbers" | python3 -m depccg en "$@" -f prolog --silent 

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/depccg-1.0.7-py3.6-linux-x86_64.egg/depccg/__main__.py", line 259, in <module>
    args.func(args)
  File "/usr/local/lib/python3.6/dist-packages/depccg-1.0.7-py3.6-linux-x86_64.egg/depccg/__main__.py", line 125, in main
    semantic_templates=semantic_templates)
  File "/usr/local/lib/python3.6/dist-packages/depccg-1.0.7-py3.6-linux-x86_64.egg/depccg/printer.py", line 299, in print_
    print(to_prolog_en(nbest_trees, tagged_doc), end='', file=file)
  File "/usr/local/lib/python3.6/dist-packages/depccg-1.0.7-py3.6-linux-x86_64.egg/depccg/printer.py", line 46, in to_prolog_en
    print(t.prolog().format(i, *tokens), file=output)
AttributeError: 'int' object has no attribute 'lemma'

AssertionError

Hi,
I am accessing depccg from Python. When I call the function res = parser.parse("this is a test sentence"), it gives an error:

File "", line 1, in
File "depccg.pyx", line 602, in depccg.PyAStarParser.parse
assert isinstance(sent, unicode)
AssertionError

What may be causing this error? (Just as an experiment, I tried changing the command to res = parser.parse(unicode("this is a test sentence")), but it gives an error in another place.)

How to specify GPU usage in A* parsing

Hi,

I'm using a project that uses EnglishCCGParser (v1) (depccg 1.0.8):

def __init__(self):
        kwargs = dict(
            # A list of binary rules 
            # By default: depccg.combinator.en_default_binary_rules
            binary_rules=None,
            # Penalize an application of a unary rule by adding this value (negative log probability)
            unary_penalty=0.1,
            # Prune supertags with low probabilities using this value
            beta=0.00001,
            # Set False if not prune
            use_beta=True,
            # Use category dictionary
            use_category_dict=True,
            # Use seen rules
            use_seen_rules=True,
            # This also used to prune supertags
            pruning_size=50,
            # Nbest outputs
            nbest=1,
            # Limit categories that can appear at the root of a CCG tree
            # By default: S[dcl], S[wq], S[q], S[qem], NP.
            possible_root_cats=None,
            # Give up parsing long sentences
            max_length=250,
            # Give up parsing if it runs too many steps
            max_steps=100000,
            # You can specify a GPU
            gpu=0
        )
        self.model = EnglishCCGParser.from_dir(DEPCCG_MODEL_FILENAME, load_tagger=True, **kwargs)

And the console output is:

2021-09-15 16:16:22,209 - INFO - depccg.parser - start tagging sentences
2021-09-15 17:00:07,350 - INFO - depccg.parser - done tagging sentences
2021-09-15 17:00:07,375 - INFO - depccg.parser - unary penalty = 0.1
2021-09-15 17:00:07,375 - INFO - depccg.parser - beta value = 1e-05 (use beta = True)
2021-09-15 17:00:07,375 - INFO - depccg.parser - pruning size = 50
2021-09-15 17:00:07,376 - INFO - depccg.parser - N best = 1
2021-09-15 17:00:07,376 - INFO - depccg.parser - use category dictionary = True
2021-09-15 17:00:07,376 - INFO - depccg.parser - use seen rules = True
2021-09-15 17:00:07,376 - INFO - depccg.parser - allow at the root of a tree only categories in [S[dcl], S[wq], S[q], S[qem], NP]
2021-09-15 17:00:07,376 - INFO - depccg.parser - give up sentences that contain > 250 words
2021-09-15 17:00:07,376 - INFO - depccg.parser - combinators: [>, <, >B1, <B1, >B2, <B2, <Φ>, <Φ>, <rp>, <rp>, <rp>, <*>, <*>]
2021-09-15 17:00:09,437 - INFO - depccg.parser - start A* parsing

I'm trying to parse 600K sentences; looking at the system monitor, it's using the CPU instead of the GPU, and the code has been running for 4 days. I can't see any progress indicators.

I run the code via IntelliJ IDEA, and in the environment variables I have specified:

PYTHONUNBUFFERED=1;CUDA_VISIBLE_DEVICES=1.

When tagging, I can see that the GPU is used, but when it starts the A* parsing it uses the CPU instead.

How long does it take to parse that many sentences, or how can I force GPU usage?

Regards.

AssertionError for N-best parsing and XML output

When I request N-best parsing and XML output, I get an assertion error:

For 1-best parsing:

echo "this|this|DT|O is|be|VBZ|O a|a|DT|O test|test|NN|O sentence|sentence|NN|O .|.|.|O" | python ../depccg/src/run.py ${depccg_dir}/models/tri_headfirst en --input-format POSandNERtagged --format xml --nbest 1
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="candc.xml"?>
<candc>
<ccg>
<rule type="rp" cat="S[dcl]">
<rule type="ba" cat="S[dcl]">
<lf start="0" span="1" word="this" lemma="this" pos="DT" chunk="XX" entity="O" cat="NP" />
<rule type="fa" cat="S[dcl]\NP">
<lf start="1" span="1" word="is" lemma="be" pos="VBZ" chunk="XX" entity="O" cat="(S[dcl]\NP)/NP" />
<rule type="fa" cat="NP">
<lf start="2" span="1" word="a" lemma="a" pos="DT" chunk="XX" entity="O" cat="NP[nb]/N" />
<rule type="fa" cat="N">
<lf start="3" span="1" word="test" lemma="test" pos="NN" chunk="XX" entity="O" cat="N/N" />
<lf start="4" span="1" word="sentence" lemma="sentence" pos="NN" chunk="XX" entity="O" cat="N" />
</rule>
</rule>
</rule>
</rule>
<lf start="5" span="1" word="." lemma="." pos="." chunk="XX" entity="O" cat="." />
</rule>

</ccg>
</candc>

For 2-best parsing:

<?xml-stylesheet type="text/xsl" href="candc.xml"?>
<candc>
Traceback (most recent call last):
  File "../depccg/src/run.py", line 81, in <module>
    to_xml(res, tagged_doc)
  File "../depccg/src/run.py", line 38, in to_xml
    assert len(tree) == len(tagged)
AssertionError

Weird conjunction rule

I have not seen a conjunction rule like the one used in the following example:
conj NP\NP => NP\NP
Is this a bug?

echo "A boy is hanging a rod for fishing and running in front of a fish" | depccg_en -m parsers/models_depccg/tri_headfirst/ -f deriv

 NP[nb]/N   N   (S[dcl]\NP)/(S[ng]\NP)  (S[ng]\NP)/NP  NP[nb]/N   N   (NP\NP)/NP     N     conj  S[ng]\NP  ((S\NP)\(S\NP))/NP    N    (NP\NP)/NP  NP[nb]/N   N
    A      boy            is               hanging        a      rod     for      fishing  and   running           in          front      of         a      fish
--------------->
      NP
                                                      --------------->
                                                            NP
                                                                                 ---------<un>
                                                                                    NP
                                                                                                                              -------<un>
                                                                                                                                NP
                                                                                                                                                 ---------------->
                                                                                                                                                        NP
                                                                                                                                     ---------------------------->
                                                                                                                                                NP\NP
                                                                                                                              -----------------------------------<
                                                                                                                                              NP
                                                                                                          ------------------------------------------------------->
                                                                                                                               (S\NP)\(S\NP)
                                                                                                -----------------------------------------------------------------<
                                                                                                                            S[ng]\NP
                                                                                                -----------------------------------------------------------------<un>
                                                                                                                              NP\NP
                                                                                          -----------------------------------------------------------------------<Φ>
                                                                                                                           NP\NP
                                                                                 --------------------------------------------------------------------------------<
                                                                                                                        NP
                                                                     -------------------------------------------------------------------------------------------->
                                                                                                                NP\NP
                                                      -----------------------------------------------------------------------------------------------------------<
                                                                                                          NP
                                       -------------------------------------------------------------------------------------------------------------------------->
                                                                                                S[ng]\NP
               -------------------------------------------------------------------------------------------------------------------------------------------------->
                                                                                   S[dcl]\NP
-----------------------------------------------------------------------------------------------------------------------------------------------------------------<
                                                                             S[dcl]
