grammaticalframework / gf-ud

Functions to analyse and manipulate dependency trees, as well as conversions between GF and dependency trees. The main use case is UD (Universal Dependencies), but the code is designed to be completely generic as for annotation scheme. This repository replaces the old gf-contrib/ud2gf code. It is also meant to be used in the 'vd' command of GF and replace the supporting code in gf-core in the future.

License: Other

Haskell 5.82% Grammatical Framework 93.97% Makefile 0.01% Shell 0.20%

gf-ud's Issues

Feature request: command line option to opt for string literals for OOV words

(previously the other half of #22, split into its own issue)

String literals for OOV words

If the sentence contains words that are not in the lexicon, I would like to create those words as string literals. So "mimsy were the borogroves" would result in an otherwise normal GF tree, but with the subtrees StrA "mimsy" and StrN "borogrove".

This feature should be optional: either command line arg, or check if the grammar contains StrA : String -> A.
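
As a rough illustration of the "check the grammar" option, here is a minimal Python sketch, not gf-ud code. The mapping from UD POS tags to string-literal funs, the `enabled` switch standing in for a command line option, and the set `grammar_funs` are all assumptions made for the example; only StrA and StrN are named in this issue.

```python
# Hypothetical mapping from UD POS tags to string-literal funs.
# Only StrA and StrN are mentioned in the issue.
STR_FUNS = {"ADJ": "StrA", "NOUN": "StrN"}


def oov_subtree(lemma, upos, grammar_funs, enabled=True):
    """Return a GF subtree such as 'StrN "borogrove"' for an OOV word.

    Returns None when the feature is switched off, when no string-literal
    fun is known for this POS, or when the grammar lacks that fun.
    """
    fun = STR_FUNS.get(upos)
    if not enabled or fun is None or fun not in grammar_funs:
        return None
    return f'{fun} "{lemma}"'


print(oov_subtree("mimsy", "ADJ", {"StrA", "StrN"}))
print(oov_subtree("borogrove", "NOUN", {"StrA", "StrN"}))
```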

Future work: modify the PGF grammar?

The new majestic runtime will allow modifying PGFs on the fly. So when that is possible, I'd prefer to create proper lexicon entries mimsy_A and borogrove_N, using real GF smart paradigms, and insert them into the PGF.
(Similarly, the Backup* funs from #22 would also be possible to insert into the PGF.)

So once the new runtime is in place, I think that command line argument would be a better option. And if these features are added into gf-ud already before majestic runtime, it makes sense to just use command line arguments from the beginning.

Feature request: different backup options

Two things: Backup cat + funs, and string literal funs.

Backup* funs

In the current master branch, Backup* funs are added by default, if some parts of the sentence can't be included in the GF tree. However, this requires that the grammar contains such funs, which not all grammars do. So I would like to have this feature optional.

It can either be a command line argument, or the Haskell code can check whether the PGF contains a cat called Backup and funs called BackupNP etc., and use them only if they are found in the PGF.
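
The two options could also be combined. The following Python sketch is only an illustration under assumptions: the grammar is represented by plain sets of cat and fun names, `cli_flag` is a hypothetical command line setting, and BackupNP is the only Backup* fun name taken from this issue.

```python
def backups_available(cats, funs, required_funs=("BackupNP",)):
    """True if the grammar has a cat called Backup and the required Backup* funs."""
    return "Backup" in cats and all(f in funs for f in required_funs)


def use_backups(cats, funs, cli_flag=None):
    """Decide whether to add Backup* funs.

    An explicit (hypothetical) command line flag wins; without one,
    fall back to auto-detection from the grammar.
    """
    if cli_flag is not None:
        return cli_flag
    return backups_available(cats, funs)


print(use_backups({"NP", "Backup"}, {"BackupNP", "DetCN"}))
print(use_backups({"NP"}, {"DetCN"}))
```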

I don't care about which way to use, just to have the feature optional in some way.

String literals for OOV words

Again, this feature should be optional: either command line arg, or check if the grammar contains StrA : String -> A.

Future work: modify the PGF grammar?

The new majestic runtime will allow modifying PGFs on the fly. So when that is possible, I'd prefer to create proper lexicon entries mimsy_A and borogrove_N, using real GF smart paradigms, and insert them into the PGF. Similarly, the Backup* funs would also be possible to insert into the PGF.

CoNLL-U Plus

Do you think it would make sense to make gf-ud support CoNLL-U Plus?

Things to consider:

I guess it would require several changes, e.g. in UDConcepts?
You can always convert CoNLL-U Plus to plain CoNLL-U
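
For the second point, a conversion could look roughly like the Python sketch below. It assumes the input declares its columns in a `# global.columns = ...` comment line, as CoNLL-U Plus files do, keeps only the ten standard CoNLL-U columns, and fills any missing ones with `_`. The function name is made up for the example.

```python
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def conllup_to_conllu(text):
    """Project CoNLL-U Plus text onto the ten standard CoNLL-U columns."""
    columns = CONLLU_COLUMNS  # default if no column declaration is seen
    out = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
            continue  # plain CoNLL-U has no column declaration
        if not line or line.startswith("#"):
            out.append(line)  # keep blank lines and other comments
            continue
        row = dict(zip(columns, line.split("\t")))
        out.append("\t".join(row.get(c, "_") for c in CONLLU_COLUMNS))
    return "\n".join(out)
```

Extra columns are simply dropped here, so any information they carry is lost in the conversion.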

Visualization too small for long sentences

I'm parsing some pretty long sentences, and I'd like to see the visualizations for them. However, the sentences get cut off, as shown in the picture. Is this just an issue with pdflatex, or can something be done about it in gf2ud code?

Infinite applications of ProgrVP by ud2gf

I'm running ud2gf with ShallowParse, using "the cat sleeps" as my sentence. Here's the input, produced by parsing "the cat sleeps" with UDPipe and using this code to output the CoNLL-U format.

$ cat /tmp/cat.conllu
1       the     the     DET     _       _       2       det     _       _
2       cat     cat     NOUN    _       _       3       nsubj   _       _
3       sleeps  sleep   VERB    _       _       0       root    _       _

I run ud2gf as follows.

$ cat /tmp/cat.conllu | stack run gf-ud ud2gf grammars/ShallowParse Eng Text at

Infinite loop

First, ud2gf ran for 30 minutes until I stopped it.

Uncomment "beam size" of 123 trees

Next, I uncommented this line, to put back the limitation of max 123 candidate trees. This works, in the sense that ud2gf doesn't get stuck in an infinite loop anymore, but the best tree still contains multiple applications of ProgrVP—despite the original sentence having none. Here's the output:

# bt0, the best (most complete) tree, without backups:
[3] sleeps 3 (2) VERB root (ImpVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (UseV sleep_V))))))))))))))))))))) : Imp[3]) 1
    *[1,2] cat 2 (1) NOUN nsubj (UseN cat_N : CN[2]) 1
        *[1] the 1 (2) DET det (the_Det : Det[1]) 1

# at, final GF tree, macros expanded:
AddBackupImp (ConsBackup (CNBackup (AddBackupCN (ConsBackup (DetBackup the_Det) BaseBackup) (UseN cat_N))) BaseBackup) (ImpVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (UseV sleep_V))))))))))))))))))))))

Adding annotations to the conllu file

I have noticed before that I get weird trees if the file is missing morphological annotations. So I added them manually to the CoNLLU file:

$ cat /tmp/cat-annotated.conllu
1	the	the	DET	Det	FORM=0	2	det	_	_
2	cat	cat	NOUN	N	Number=Sing	3	nsubj	_	_
3	sleeps	sleep	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_

With this file, we now get a correct tree with MiniLang:

# MiniLang with cat.conllu (which is missing annotations)
AddBackupImp (ConsBackup (CNBackup (AddBackupCN (ConsBackup (TheBackup the_The) BaseBackup) (UseN cat_N))) BaseBackup) (ImpVP (UseV sleep_V))

# MiniLang with cat-annotated.conllu
PredVP (DetCN the_Det (UseN cat_N)) (UseV sleep_V)

But with ShallowParse, the tree is as wrong as ever, with multiple ProgrVPs.

# ShallowParse with cat-annotated.conllu
AddBackupImp (ConsBackup (CNBackup (AddBackupCN (ConsBackup (DetBackup thePl_Det) BaseBackup) (UseN cat_N))) BaseBackup) (ImpVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (ProgrVP (UseV sleep_V))))))))))))))))))))))

So it seems unlikely that the ProgrVP loop is due to user error/insufficiently annotated CoNLLU files.

Workaround

ProgrVP is the only function in ShallowParse of type a -> a, so I can just comment it out in the GF grammar. But of course, sometimes such functions are actually needed, so this is not a real solution.
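
To find such functions in other grammars, or to spot runaway nesting in candidate trees, something like the following Python sketch could help. It is not gf-ud code: the signature table is a hand-written excerpt given as strings, and trees are modelled as nested tuples purely for illustration.

```python
def endofunctions(signatures):
    """Names of funs whose type is exactly `a -> a` for some category a."""
    result = []
    for name, typ in signatures.items():
        parts = [p.strip() for p in typ.split("->")]
        if len(parts) == 2 and parts[0] == parts[1]:
            result.append(name)
    return sorted(result)


def nesting_depth(tree, fun):
    """Count consecutive applications of `fun` at the top of a tree.

    Trees are nested tuples (fun, arg1, ...); leaves are strings.
    """
    depth = 0
    while isinstance(tree, tuple) and len(tree) == 2 and tree[0] == fun:
        depth += 1
        tree = tree[1]
    return depth


# Hand-written excerpt of signatures, for illustration only.
SIGNATURES = {"ProgrVP": "VP -> VP", "UseV": "V -> VP", "ImpVP": "VP -> Imp"}
print(endofunctions(SIGNATURES))
```

A candidate filter could then reject trees in which `nesting_depth` exceeds some small bound, which would keep functions of type a -> a usable without letting them stack indefinitely.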

conll2latex not working? Or am I misunderstanding something?

Students in the ongoing Computational Syntax course at GU are expected (or at least strongly encouraged) to use conll2pdf and parse2pdf to visualize the various trees they are working on.

However, installing gf-ud has proven problematic, especially for Windows users, and most people are working with a version of gf-ud installed on one of the university's servers, which also has LaTeX but not the command used to show the PDFs. I suggested that they use conll2latex and parse2latex, but only the former works. The latter produces an (almost) empty .tex file:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\begin{document}
\includegraphics[width=0.6\textwidth]{_1parsetree.tex.eps}\end{document}

The command I run to generate this file is

echo "the black cat sees us" | gf-ud dbnf English.dbnf Utt | gf-ud parse2latex parsetree

I am looking into this right now to see if it's a bug, but please tell me if I am misunderstanding something.

Feature request: Modular labels files

Many GF grammars are structured as follows:

abstract A = { fun a1, a2 : … } ;

abstract B = A ** { fun b1, b2 : … } ;

If we want to use those grammars with gf-ud, we need the following labels files:

-- A.labels
#fun a1 …
#fun a2 …

and

-- B.labels
#fun a1 …
#fun a2 …
#fun b1 …
#fun b2 …

In order to avoid duplicating labels, it would be better to allow the labels for a given grammar to be in multiple files. In this example, the file B.labels should only contain labels for funs b1 and b2.

When giving the labels files as arguments to gf-ud, we could give a complete list of labels files, e.g.

 gf-ud ud2gf B <startcat> <lang> A.labels B.labels

Or we could keep the existing arguments to gf-ud, and make the module structure a part of the labels files:

-- B.labels
#include A.labels
#fun b1 …
#fun b2 …

FORM and LEMMA should accept comma

I want to do this

#auxfun CommaNP_ np comma : NP -> Comma -> NPComma = np ; head punct[FORM=,]

But it doesn't accept a comma as a valid word form (the same goes for LEMMA; both behave the same). I've also tried the variants FORM="," and FORM=\, but neither works.

-- FORM=,
Starting debug for CommaNP_:
CommaNP_ : NP -> Comma -> NPComma ; head punct[FORM=]
Attempting to build: CommaNP_ modification ,

-- FORM=","
Starting debug for CommaNP_:
CommaNP_ : NP -> Comma -> NPComma ; head punct[FORM=","]
Attempting to build: CommaNP_ modification ,

-- FORM=\,
Starting debug for CommaNP_:
CommaNP_ : NP -> Comma -> NPComma ; head punct[FORM=\]
Attempting to build: CommaNP_ modification ,
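
I don't know where in the labels parser the comma gets lost. Purely as an illustration of the desired behaviour, here is a Python sketch of a constraint parser in which the value is everything after the first `=`, with optional surrounding double quotes removed. It assumes that multiple constraints are separated by `|`; a quoted value containing `|` is not handled.

```python
import re

# A label, optionally followed by a bracketed constraint list.
LABEL_RE = re.compile(r'^(?P<label>[^\[\]]+)(?:\[(?P<body>.*)\])?$')


def parse_label(spec):
    """Parse e.g. 'punct[FORM=","]' into ('punct', {'FORM': ','})."""
    m = LABEL_RE.match(spec)
    if m is None:
        raise ValueError(f"cannot parse label: {spec!r}")
    constraints = {}
    body = m.group("body")
    if body:
        for item in body.split("|"):
            key, value = item.split("=", 1)  # split on the first '=' only
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]  # strip surrounding quotes
            constraints[key] = value
    return m.group("label"), constraints


print(parse_label('punct[FORM=","]'))
print(parse_label("punct[FORM=,]"))
```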

Feature request: match lexicon in auxfuns

We have this standard way of distinguishing between singular and plural the:

#auxcat The DET
#auxfun DetCN_theSg det cn : The -> CN -> NP = DetCN the_Det cn ; det head[Number=Sing]
#auxfun DetCN_thePl det cn : The -> CN -> NP = DetCN thePl_Det cn ; det head[Number=Plur]
#disable the_Det thePl_Det

Now I would like to do the same for other determiners that are ambiguous for number, like some and any. My file is this:

1	any	any	DET	DT	_	2	det	_	_
2	word	word	NOUN	NN	Number=Sing	0	root	_	_

1	any	any	DET	DT	_	2	det	_	_
2	words	word	NOUN	NN	Number=Plur	0	root	_	_

1	some	some	DET	DT	_	2	det	_	_
2	word	word	NOUN	NN	Number=Sing	0	root	_	_

1	some	some	DET	DT	_	2	det	_	_
2	words	word	NOUN	NN	Number=Plur	0	root	_	_

The naive way would be to do something like this. (The is still the auxcat for DET from the previous example.)

#auxfun DetCN_anySg det cn : The -> CN -> NP = DetCN anySg_Det cn ; det head[Number=Sing]
#auxfun DetCN_anyPl det cn : The -> CN -> NP = DetCN anyPl_Det cn ; det head[Number=Plur]
#disable anyPl_Det anySg_Det

#auxfun DetCN_someSg det cn : The -> CN -> NP = DetCN someSg_Det cn ; det head[Number=Sing]
#auxfun DetCN_somePl det cn : The -> CN -> NP = DetCN somePl_Det cn ; det head[Number=Plur]
#disable somePl_Det someSg_Det

With this, I get the following results—the auxfuns didn't seem to do anything. Looking at dt and bt, I see no auxfuns being used.

DetCN anyPl_Det (UseN word_N)
LIN: any words

DetCN anyPl_Det (UseN word_N)
LIN: any words

DetCN somePl_Det (UseN word_N)
LIN: some words

DetCN somePl_Det (UseN word_N)
LIN: some words

So I changed the auxfuns to take an actual RGL cat Det as an argument, instead of the auxcat The (which corresponds to a DET).

#auxfun DetCN_anySg det cn : Det -> CN -> NP = DetCN anySg_Det cn ; det head[Number=Sing]
#auxfun DetCN_anyPl det cn : Det -> CN -> NP = DetCN anyPl_Det cn ; det head[Number=Plur]
#disable anyPl_Det anySg_Det

#auxfun DetCN_someSg det cn : Det -> CN -> NP = DetCN someSg_Det cn ; det head[Number=Sing]
#auxfun DetCN_somePl det cn : Det -> CN -> NP = DetCN somePl_Det cn ; det head[Number=Plur]
#disable somePl_Det someSg_Det

Now I see, from looking at bt0, that the auxfuns take action:

bt0: DetCN_anySg anyPl_Det (UseN word_N) 
at: DetCN anySg_Det (UseN word_N)
LIN: any word

bt0: DetCN_anyPl anyPl_Det (UseN word_N)
at: DetCN anyPl_Det (UseN word_N)
LIN: any words

But unfortunately, there is no matching with strings, so the DetCN_any* auxfuns take action even when the actual Det is some*_Det.

bt0: DetCN_anySg somePl_Det (UseN word_N)
at: DetCN anySg_Det (UseN word_N)
LIN: any word

bt0: DetCN_anyPl somePl_Det (UseN word_N)
at: DetCN anyPl_Det (UseN word_N)
LIN: any words

So I would like to enhance the macro DSL such that we can add a wordform or lemma constraint among the tag constraints. For example (feel free to suggest a better syntax):

#auxfun DetCN_anySg det cn : Det -> CN -> NP = DetCN anySg_Det cn ; det head[Number=Sing|wf="any"]
#auxfun DetCN_anyPl det cn : Det -> CN -> NP = DetCN anyPl_Det cn ; det head[Number=Plur|wf="any"]
#disable anyPl_Det anySg_Det

#auxfun DetCN_someSg det cn : Det -> CN -> NP = DetCN someSg_Det cn ; det head[Number=Sing|wf="some"]
#auxfun DetCN_somePl det cn : Det -> CN -> NP = DetCN somePl_Det cn ; det head[Number=Plur|wf="some"]
#disable somePl_Det someSg_Det
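
To make the intended semantics concrete, here is a minimal Python sketch of how a token could be checked against such constraints. The key `wf` follows the syntax proposed above, `lemma` is an assumed counterpart for lemma constraints, and the dictionary representation of tokens is made up for the example. Matching is exact and case-sensitive.

```python
def matches(token, constraints):
    """Check a token against feature, wordform and lemma constraints.

    token: dict with 'FORM', 'LEMMA' and 'FEATS' (a dict of UD features).
    constraints: dict; 'wf' is compared with FORM, 'lemma' with LEMMA,
    and any other key with the corresponding morphological feature.
    """
    for key, wanted in constraints.items():
        if key == "wf":
            actual = token["FORM"]
        elif key == "lemma":
            actual = token["LEMMA"]
        else:
            actual = token["FEATS"].get(key)
        if actual != wanted:
            return False
    return True


some = {"FORM": "some", "LEMMA": "some", "FEATS": {}}
word = {"FORM": "word", "LEMMA": "word", "FEATS": {"Number": "Sing"}}
print(matches(some, {"wf": "any"}))       # the any-rule no longer fires for "some"
print(matches(word, {"Number": "Sing"}))
```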

Feature request: #auxfun macros (and other #funs too if feasible?) to distinguish word order

Currently, phrases like "Section 10" (apposition) and "10 sections" are treated identically.
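
The information needed for this is already in the CoNLL-U input: a dependent precedes its head exactly when its ID is smaller than its HEAD. The Python sketch below shows the comparison. The two rows are hand-written for illustration and are not taken from a treebank, and the result is only meaningful for non-root tokens.

```python
def dependent_position(row):
    """Say whether a (non-root) token comes before or after its head.

    row: one CoNLL-U token line with its ten tab-separated fields.
    """
    fields = row.split("\t")
    token_id, head_id = int(fields[0]), int(fields[6])
    return "before-head" if token_id < head_id else "after-head"


# Hand-written rows: "10" in "10 sections" and "10" in "Section 10".
print(dependent_position("1\t10\t10\tNUM\t_\t_\t2\tnummod\t_\t_"))
print(dependent_position("2\t10\t10\tNUM\t_\t_\t1\tnummod\t_\t_"))
```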

Feature request: output something that is like bt0 but macros expanded on subtrees

This is a common occurrence: I convert some large tree with ud2gf, and get output like this

[2,3] critical 3 (2) ADJ root (root_cop (rootA_ (PositA critical_A)) be_cop : UDS[2,3]) 1
    *[1] it 1 (2) PRON expl (root_only (rootN_ (UsePron it_Pron)) : UDS[1]) 1
    [2] is 2 (1) AUX cop (be_cop : cop[2]) 1
    *[4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21] do 5 (4) VERB csubj (root_advcl (rootV_ (OblVP_ (ComplV do_V (DetCN_a_ (DetQuant IndefArt NumSg) (AdjCN (PositA preliminary_A) (UseN assessment_N)))) (PrepNP upon_Prep (MassNP_sg (AdvCN (UseN discovery_N) (PrepNP of_Prep (DetCN_a_ (DetQuant IndefArt NumSg) (UseN (CompoundN data_N breach_N))))))))) (advclMarkUDS_ to_mark (root_advcl (rootV_ (UseV see_V)) (advclMarkUDS_ (mark_ if_Subj) (root_nsubj (rootV_ (ComplV warrant_V (DetCN_a_ (DetQuant IndefArt NumSg) (UseN notification_N)))) (nsubj_ (UsePron it_Pron)))))) : UDS[5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]) 1
        *[4] to 4 (3) PART mark (to_mark : mark[4]) 1
        [6,7,8] assessment 8 (7) NOUN obj (DetCN_a_ (DetQuant IndefArt NumSg) (AdjCN (PositA preliminary_A) (UseN assessment_N)) : NP[6,7,8]) 1
            [6] a 6 (5) DET det (DetQuant IndefArt NumSg : Det[6]) 1
            [7] preliminary 7 (6) ADJ amod (PositA preliminary_A : AP[7]) 1
        [9,10,11,12,13,14] discovery 10 (9) NOUN obl (PrepNP upon_Prep (MassNP_sg (AdvCN (UseN discovery_N) (PrepNP of_Prep (DetCN_a_ (DetQuant IndefArt NumSg) (UseN (CompoundN data_N breach_N)))))) : Adv[9,10,11,12,13,14]) 1
            [9] upon 9 (8) ADP case (upon_Prep : Prep[9]) 1
            [11,12,13,14] breach 14 (13) NOUN nmod (PrepNP of_Prep (DetCN_a_ (DetQuant IndefArt NumSg) (UseN (CompoundN data_N breach_N))) : Adv[11,12,13,14]) 1
                [11] of 11 (10) ADP case (of_Prep : Prep[11]) 1
                [12] a 12 (11) DET det (DetQuant IndefArt NumSg : Det[12]) 1
                [13] data 13 (12) NOUN compound (data_N : N[13]) 1
        [15,16,17,18,19,20,21] see 16 (15) VERB advcl (advclMarkUDS_ to_mark (root_advcl (rootV_ (UseV see_V)) (advclMarkUDS_ (mark_ if_Subj) (root_nsubj (rootV_ (ComplV warrant_V (DetCN_a_ (DetQuant IndefArt NumSg) (UseN notification_N)))) (nsubj_ (UsePron it_Pron))))) : advcl[15,16,17,18,19,20,21]) 1
            [15] to 15 (14) PART mark (to_mark : mark[15]) 1
            [17,18,19,20,21] warrants 19 (18) VERB advcl (advclMarkUDS_ (mark_ if_Subj) (root_nsubj (rootV_ (ComplV warrant_V (DetCN_a_ (DetQuant IndefArt NumSg) (UseN notification_N)))) (nsubj_ (UsePron it_Pron))) : advcl[17,18,19,20,21]) 1
                [17] if 17 (16) SCONJ mark (mark_ if_Subj : mark[17]) 1
                [18] it 18 (17) PRON nsubj (nsubj_ (UsePron it_Pron) : nsubj[18]) 1
                [20,21] notification 21 (20) NOUN obj (DetCN_a_ (DetQuant IndefArt NumSg) (UseN notification_N) : NP[20,21]) 1
                    [20] a 20 (19) DET det (DetQuant IndefArt NumSg : Det[20]) 1

I would like to linearise the subtree [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21], but it is full of auxfuns, and I need to manually replace lots of auxfuns before I get it to linearise. Is there any way to output partial subtrees that have auxfuns expanded, to make debugging easier?
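
What I have in mind is roughly the following, sketched in Python rather than in gf-ud's Haskell. Trees are modelled as nested tuples `(fun, arg1, ...)` with strings as leaves, and each macro is a pair of parameter names and a body, mirroring definitions such as `DetCN_theSg det cn = DetCN the_Det cn`. Both representations are assumptions made for the example, and macros without arguments are not handled.

```python
def substitute(body, env):
    """Replace parameter names in a macro body by the given arguments."""
    if isinstance(body, str):
        return env.get(body, body)
    head, *rest = body
    return (head, *[substitute(part, env) for part in rest])


def expand(tree, macros):
    """Expand auxfun macros bottom-up in a (sub)tree."""
    if isinstance(tree, str):
        return tree
    fun, *args = tree
    args = [expand(a, macros) for a in args]
    if fun in macros:
        params, body = macros[fun]
        return substitute(body, dict(zip(params, args)))
    return (fun, *args)


MACROS = {"DetCN_theSg": (("det", "cn"), ("DetCN", "the_Det", "cn"))}
subtree = ("DetCN_theSg", "the_The", ("UseN", "cat_N"))
print(expand(subtree, MACROS))
```

Since `expand` works on any subtree, it could be applied to each node of bt0 before printing.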

Feature request: handle compounds in lemma

An example input:

1       Sapiteed        sapi_tee        PROPN   S       Case=Par|Number=Sing    0       root    _       _
2       tavalaiusega    tavalaius       NOUN    S       Case=Com|Number=Sing    1       nmod    _       _

I would like ud2gf to try to parse sapi_tee in the following order:

a. Merge the lemma into sapitee and try to parse it. If it is found in the lexicon, return sapitee_N.
b. If sapitee is not in the lexicon, then try parsing both sapi and tee. If they are both nouns, return CompoundN sapi_N tee_N.
c. If only tee is found in the lexicon, return StrCompoundN "sapi" tee_N.
d. If neither sapi nor tee is in the lexicon, then proceed to morpho_analyse the wordform, i.e. "sapiteed". That's because the lemma may have been wrongly analysed.
e. If ma "sapiteed" didn't return anything either, as a last resort we return StrN <something>. That something can be either:

lemma without the underscore, so StrN "sapitee"
wordform as is, so StrN "sapiteed".

The same applies to compound adjectives, verbs etc. This assumes that the grammar has the backup functions StrC and StrCompoundC (which may become a command line option, see #24). For now, while it is not a command line option, we can just introduce those functions in ud2gf and leave it to the grammarian to add them to the grammar.
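
The proposed order can be sketched as follows. This is a Python illustration for nouns only, not ud2gf code: the lexicon is a plain set of fun names, `morpho_analyse` is a caller-supplied stand-in for GF's ma, and the last resort uses the lemma without the underscore, which is one of the two options listed above. The case where only the first part is in the lexicon is not specified above, so the sketch lets it fall through to morphological analysis.

```python
def analyse_compound(lemma, wordform, lexicon, morpho_analyse):
    """Return a GF tree (as a string) for a noun lemma such as 'sapi_tee'.

    lexicon: set of GF fun names, e.g. {"tee_N"}.
    morpho_analyse: callable from a wordform to a fun name or None.
    """
    parts = lemma.split("_")
    merged = "".join(parts)
    # Merged lemma found in the lexicon.
    if f"{merged}_N" in lexicon:
        return f"{merged}_N"
    if len(parts) == 2:
        first, last = parts
        # Both parts are nouns in the lexicon.
        if f"{first}_N" in lexicon and f"{last}_N" in lexicon:
            return f"CompoundN {first}_N {last}_N"
        # Only the last part is in the lexicon.
        if f"{last}_N" in lexicon:
            return f'StrCompoundN "{first}" {last}_N'
    # Fall back to morphological analysis of the wordform.
    analysed = morpho_analyse(wordform)
    if analysed is not None:
        return analysed
    # Last resort: a string literal built from the lemma.
    return f'StrN "{merged}"'


print(analyse_compound("sapi_tee", "sapiteed", {"tee_N"}, lambda w: None))
```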

Interaction with morpho_analyse

As of April 2022, ud2gf first tries to parse the lemma, and only secondarily does ma on the word form. If the default behaviour changes, this proposed algorithm should be reconsidered too.
