TASI

Data files naming conventions

Name-text.txt = raw text file
Name-GS.csv = hand-annotated file that serves as the Gold Standard our model is compared against, i.e. the "correct answers" for Anglicism identification
Name-TASI.csv = machine-annotated file that is the output of our model
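
As a small illustration of the naming convention, the sketch below derives the three file names from a dataset name. The helper `data_files`, the `SUFFIXES` mapping, and the `data_dir` parameter are hypothetical names introduced here; they are not part of the repository.

```python
from pathlib import Path

# Suffixes taken from the naming conventions above.
# The dictionary keys are illustrative labels, not names used by the project.
SUFFIXES = {"text": "-text.txt", "gold": "-GS.csv", "tasi": "-TASI.csv"}


def data_files(name, data_dir="."):
    """Return the three conventional file paths for a dataset called `name`."""
    base = Path(data_dir)
    return {kind: base / f"{name}{suffix}" for kind, suffix in SUFFIXES.items()}


files = data_files("Name")
for kind, path in files.items():
    print(kind, path.name)
```

Keeping the suffixes in one place means a script that reads the raw text and writes the model output only needs the dataset name.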

tasi's Issues

Add IOB tag to anglicisms to keep multiword units together
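
One possible way to approach this, sketched under the assumption that each token already carries a True/False anglicism flag: the first token of a run of flagged tokens gets B, the following flagged tokens get I, and everything else gets O. The function name `iob_tags` and the sample sentence are illustrative only, not the project's code.

```python
def iob_tags(is_anglicism):
    """Convert per-token True/False anglicism flags into IOB tags."""
    tags = []
    previous = False
    for flag in is_anglicism:
        if not flag:
            tags.append("O")   # outside any anglicism
        elif previous:
            tags.append("I")   # continues the anglicism started earlier
        else:
            tags.append("B")   # begins a new anglicism
        previous = flag
    return tags


# Hypothetical example sentence with one multiword and one single-word anglicism
tokens = ["me", "gusta", "el", "ice", "cream", "y", "el", "show"]
flags = [False, False, False, True, True, False, False, True]
for token, tag in zip(tokens, iob_tags(flags)):
    print(token, tag)
```

A limitation of this sketch is that two distinct anglicisms standing directly next to each other would be merged into one unit, so a hand-annotated gold standard would still need explicit B tags in such cases.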

Recreate goldstandard

- tokenization must be consistent with our model
- remove problematic sentences
- ensure all Anglicism labels are True or False
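
The last item in the list above could be checked with a short script along these lines. The column names `Token` and `Anglicism` and the helper `invalid_label_rows` are assumptions made for this sketch; the actual layout of the Gold Standard CSV is not shown here.

```python
import csv
import io

VALID_LABELS = {"True", "False"}


def invalid_label_rows(csv_file, label_column="Anglicism"):
    """Return (row_number, value) pairs whose label is not exactly True or False."""
    reader = csv.DictReader(csv_file)
    problems = []
    # Row 1 is the header, so data rows start at 2
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(label_column) or "").strip()
        if value not in VALID_LABELS:
            problems.append((row_number, value))
    return problems


# In-memory stand-in for a Name-GS.csv file (assumed column names)
sample = io.StringIO("Token,Anglicism\nel,False\nshow,True\nselfie,yes\nblog,\n")
print(invalid_label_rows(sample))
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`.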

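Tokenization

import re

from spacy.tokenizer import Tokenizer
from spacy.tokens import Token
from spacy.util import compile_prefix_regex, compile_suffix_regex


def custom_tokenizer(nlp):
    # Punctuation that splits a token when it occurs inside it
    infix_re = re.compile(r'''[,?;\‘\’`\“\”"'~]''')
    # Drop '#' from the default prefixes so a leading '#' is not split off
    modified_prefixes = tuple(x for x in nlp.Defaults.prefixes if x != '#')
    prefix_re = compile_prefix_regex(modified_prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    # Candidates: non-stopword verbs, nouns and adjectives without '@' or '#'
    is_candidate_filter = lambda token: (
        token.pos_ in ["VERB", "NOUN", "ADJ"]
        and not token.is_stop
        and not ({"@", "#"} & set(token.text))
    )
    Token.set_extension("is_candidate", getter=is_candidate_filter, force=True)
    Token.set_extension("is_anglicism", default=False, force=True)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)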

import re

from spacy.lang.tokenizer_exceptions import URL_PATTERN
from spacy.tokenizer import Tokenizer
from spacy.tokens import Token
from spacy.util import compile_prefix_regex, compile_suffix_regex


def custom_tokenizer_modified(nlp):
    # spaCy defaults: when the standard behaviour is required, they
    # need to be included when subclassing the tokenizer
    infix_re = re.compile(r'''[.,?!:;...\‘\’`\“\”"'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    # extending the default URL regex with a regex for hashtags
    hashtag_pattern = r'''|^(#[\w_-]+)$'''
    url_and_hashtag = URL_PATTERN + hashtag_pattern
    url_and_hashtag_re = re.compile(url_and_hashtag)

    # set a custom extension that is True when the token is a hashtag;
    # force=True avoids an error if the extension is already registered
    hashtag_getter = lambda token: token.text.startswith('#')
    Token.set_extension('is_hashtag', getter=hashtag_getter, force=True)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=url_and_hashtag_re.match
                     )
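
To see what the hashtag branch of that pattern accepts on its own, the sketch below compiles it without the leading "|" that joins it to spaCy's URL pattern and tries it on a few strings. It uses only the standard library; the sample strings are made up for illustration.

```python
import re

# Same hashtag pattern as in custom_tokenizer_modified, minus the leading "|"
# that attaches it to spaCy's URL pattern.
hashtag_re = re.compile(r"^(#[\w_-]+)$")

samples = ["#spanglish", "#code-switching", "#", "show", "#dos palabras"]
for text in samples:
    # bool(...) turns the match object (or None) into True/False
    print(repr(text), bool(hashtag_re.match(text)))
```

Because the pattern is anchored with `^` and `$`, it only matches when the whole string is a hashtag, which is what `token_match` needs in order to keep such strings as single tokens.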

jserigos / tasi