giganticode / codeprep
A toolkit for pre-processing large source code corpora
SplitContainer to Identifier
SingleWordIdentifier, TwoWordIdentifier, ThreeWordIdentifier, FourOrMoreWordIdentifier
PreprocessingMetadata -> PreppedTokenMetadata
word_boundaries field as a list of the number of subtokens in each token, e.g.
non-processible tokens field. Return non-processible tokens as a separate object
>>> metadata.for_last_tokens(n: int)
The tasks for the new PreppedTokenSequence class are to encapsulate getting full tokens from subtokens (which is currently done by the FullTokenIterator class) and, at the same time, to provide transparent access to the subtokens.
Motivation:
FullTokenIterator. This functionality can be provided by PreppedTokenSequence directly.
Provisional API:
>>> prepped_tokens = api.bpe("getName(", "5k")
>>> prepped_tokens
['get', 'Name', '</t>', '(']
>>> prepped_tokens.metadata.token_types
[SplitContainer, OpeningBracket]
>>> prepped_tokens.metadata.n_subtokens_per_token
[3, 1]
>>> prepped_tokens.full_tokens()
[['get', 'Name', '</t>'], ['(']]
>>> prepped_tokens.full_tokens(formatter=lambda s: ''.join(s))
['getName</t>', '(']
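The grouping behind the proposed full_tokens() call can be sketched in plain Python. This is only an illustration, not codeprep's implementation: the standalone function below and its parameter names are assumptions, and it takes the subtoken list and the per-token counts as separate arguments instead of reading them from a PreppedTokenSequence object.

```python
from typing import Callable, List, Optional, Sequence


def full_tokens(
    subtokens: Sequence[str],
    n_subtokens_per_token: Sequence[int],
    formatter: Optional[Callable[[List[str]], object]] = None,
) -> list:
    """Group a flat subtoken list into full tokens (hypothetical helper)."""
    if sum(n_subtokens_per_token) != len(subtokens):
        raise ValueError("subtoken counts do not add up to the sequence length")

    grouped = []
    start = 0
    for n in n_subtokens_per_token:
        # Take the next n subtokens as one full token.
        chunk = list(subtokens[start:start + n])
        grouped.append(formatter(chunk) if formatter else chunk)
        start += n
    return grouped


subtokens = ['get', 'Name', '</t>', '(']
counts = [3, 1]
print(full_tokens(subtokens, counts))
# [['get', 'Name', '</t>'], ['(']]
print(full_tokens(subtokens, counts, formatter=''.join))
# ['getName</t>', '(']
```

Keeping only the per-token counts in the metadata is enough to rebuild the full tokens, which is why a separate iterator class might not be needed.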
Hello!
Thank you for your work, it is amazing!
My name is Maksim Zubkov, and I am an intern at JetBrains Research. I tried to use your tool for BPE tokenization of C++ code and got the following error:
OSError: Calculation of vocabulary is not supported on OSX.
Could you please explain why it is not currently supported, and what I can do to use your library on macOS?
Thank you!
codeprep/codeprep/pipeline/to_repr.py
Line 60 in f5a35b6
I am working with this repository on Windows.
I find that when a path contains Unicode characters such as Chinese, for example "./文档/", to_repr.py appears to encode the string to bytes, and this causes an exception.
Encoded as bytes, "文档.py" becomes b'\xe6\x96\x87\xe6\xa1\xa3.py', which on Windows is interpreted as a nested folder path, and Python's built-in os.path.basename does not handle it correctly. When the metadata is written to a file, this raises a FileOrDirNotExist exception.
As a workaround, I changed the path to str to avoid the exception, but I don't know whether this has any other side effects.
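A minimal sketch of the workaround described above, assuming the path bytes are UTF-8 encoded. The helper name to_str_path is hypothetical and is not part of codeprep; it only illustrates decoding a bytes path to str before handing it to os.path functions.

```python
import os
from typing import Union


def to_str_path(path: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Return the path as str, decoding it first if it was given as bytes."""
    if isinstance(path, bytes):
        return path.decode(encoding)
    return path


# UTF-8 bytes for "./文档/Main.py" (the file name is made up for this example).
raw = b'./\xe6\x96\x87\xe6\xa1\xa3/Main.py'
print(os.path.basename(to_str_path(raw)))
# Main.py
```

The standard library also provides os.fsdecode, which decodes with the filesystem encoding instead of a fixed one. Whether either approach is safe everywhere in the pipeline would still need to be checked against the rest of to_repr.py.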
It seems that it does not work when dealing with JavaScript. Also, is there any way to remove the end-of-token marker '\t' from the token sequence?
Currently:
>>> api.basic("getName")
['<w>', 'get', 'Name', '</w>']
To be done:
>>> api.basic("getName")
['get', 'Name', '</t>']
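The difference between the two formats can be shown with a small conversion sketch. The function below is hypothetical and is not how codeprep produces its output. It only assumes that the start marker is dropped and the end marker is renamed, as in the example above.

```python
from typing import List, Sequence


def to_end_marker_format(
    tokens: Sequence[str],
    old_start: str = '<w>',
    old_end: str = '</w>',
    new_end: str = '</t>',
) -> List[str]:
    """Convert '<w> ... </w>' word boundaries to a single trailing '</t>'."""
    converted = []
    for tok in tokens:
        if tok == old_start:
            # The new format has no start-of-word marker.
            continue
        converted.append(new_end if tok == old_end else tok)
    return converted


print(to_end_marker_format(['<w>', 'get', 'Name', '</w>']))
# ['get', 'Name', '</t>']
```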