giganticode / codeprep



A toolkit for pre-processing large source code corpora

Python 91.45% Java 8.55%

mining-software-repositories source-code-analysis language-modeling word-segmentation natural-language-processing


codeprep's Issues

Enhance `ParsedToken` hierarchy

Rename SplitContainer to Identifier
Make Identifier abstract and extend it with SingleWordIdentifier, TwoWordIdentifier, ThreeWordIdentifier, FourOrMoreWordIdentifier
Make other classes that have subclasses abstract

PreprocessingMetadata enhancement

Rename PreprocessingMetadata -> PreppedTokenMetadata
Represent the word_boundaries field as a list of the number of subtokens in each token, e.g.
[1, 3, 1, 2] instead of [0, 1, 4, 5, 7]
Remove the non-processible tokens field. Return non-processible tokens as a separate object
Provide a method for returning the metadata for the last n tokens:

>>> metadata.for_last_tokens(n: int)
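
The proposed change of representation for word_boundaries can be illustrated with a small sketch. The helper names `boundaries_to_counts` and `counts_to_boundaries` are hypothetical and not part of codeprep; the sketch only assumes that the current field holds cumulative subtoken offsets starting at 0, as in the example above.

```python
from itertools import accumulate
from typing import List


def boundaries_to_counts(word_boundaries: List[int]) -> List[int]:
    """Convert cumulative offsets, e.g. [0, 1, 4, 5, 7], to per-token subtoken counts."""
    # Each count is the distance between two neighbouring boundaries.
    return [end - start for start, end in zip(word_boundaries, word_boundaries[1:])]


def counts_to_boundaries(n_subtokens_per_token: List[int]) -> List[int]:
    """Inverse conversion: per-token subtoken counts back to cumulative offsets."""
    return [0, *accumulate(n_subtokens_per_token)]


counts = boundaries_to_counts([0, 1, 4, 5, 7])
print(counts)                        # [1, 3, 1, 2]
print(counts_to_boundaries(counts))  # [0, 1, 4, 5, 7]
```

The count-based form makes slicing off the last n tokens straightforward, since the metadata for those tokens is just the last n entries of the list.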

Create PreppedTokenSequence class to encapsulate getting full tokens from subtokens

The tasks for the new PreppedTokenSequence class are to encapsulate getting full tokens from subtokens (which is currently done by the FullTokenIterator class) and at the same time provide transparent access to the subtokens.

Motivation:

To get the full tokens, the user won't have to know about FullTokenIterator. This functionality can be provided by PreppedTokenSequence directly.
The ModelContext class is not really needed anymore.

Provisional API:

>>> prepped_tokens = api.bpe("getName(", "5k")
>>> prepped_tokens
['get', 'Name', '</t>', '(']
>>> prepped_tokens.metadata.token_types
[SplitContainer, OpeningBracket]
>>> prepped_tokens.metadata.n_subtokens_per_token
[3, 1]
>>> prepped_tokens.full_tokens()
[['get', 'Name', '</t>'], ['(']]
>>> prepped_tokens.full_tokens(formatter=lambda s: ''.join(s))
['getName</t>', '(']
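
The grouping that full_tokens() is meant to perform in the provisional API above could be sketched roughly as follows. This is a standalone illustration, not the codeprep implementation: the free function `full_tokens` and its parameters are assumptions that mirror the provisional method, taking the flat subtoken list and the n_subtokens_per_token metadata as plain lists.

```python
from typing import Any, Callable, List, Optional, Sequence


def full_tokens(
    subtokens: Sequence[str],
    n_subtokens_per_token: Sequence[int],
    formatter: Optional[Callable[[List[str]], Any]] = None,
) -> List[Any]:
    """Group a flat subtoken sequence into full tokens using per-token counts."""
    if sum(n_subtokens_per_token) != len(subtokens):
        raise ValueError("subtoken counts do not match the number of subtokens")

    result = []
    start = 0
    for n in n_subtokens_per_token:
        group = list(subtokens[start:start + n])
        # Apply the optional formatter to each group, otherwise keep the list.
        result.append(formatter(group) if formatter is not None else group)
        start += n
    return result


subtokens = ['get', 'Name', '</t>', '(']
print(full_tokens(subtokens, [3, 1]))
# [['get', 'Name', '</t>'], ['(']]
print(full_tokens(subtokens, [3, 1], formatter=lambda s: ''.join(s)))
# ['getName</t>', '(']
```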

OSX support

Hello!

Thank you for your work, it is amazing!
My name is Maksim Zubkov, I am an intern at JetBrains Research. I tried to use your tool for BPE tokenization of C++ code and got the following error:

OSError: Calculation of vocabulary is not supported on OSX.

Could you please explain why it is not currently supported, and what I can do to use your library on macOS?

Thank you!

Why use bytes instead of str for paths (Windows)?

codeprep/codeprep/pipeline/to_repr.py

Line 60 in f5a35b6

 def preprocess_and_write(params: Tuple[bytes, bytes, PrepConfig, str], bpe_data: Optional[BpeData] = None): 

I am working with this repository on Windows.

I find that when I use Unicode characters such as Chinese in a path like "./文档/", to_repr.py encodes this string to bytes, which causes an exception.

UTF-8 bytes like b'\xe6\x96\x87\xe6\xa1\xa3.py', which stand for "文档.py", are treated on Windows as a recursive folder, and the Python built-in function os.path.basename does not recognize this. When writing the metadata to a file, this raises a FileOrDirNotExist Exception.

As a workaround, I changed the path to str to avoid this exception, but I don't know whether there are any other side effects.
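
The workaround described above, converting the path to str before using it, might look like the following sketch. The helper `to_str_path` is hypothetical and not part of codeprep, and it assumes the bytes were produced by UTF-8 encoding, as in the example; the standard-library function os.fsdecode is an alternative that uses the filesystem encoding instead.

```python
import os
from typing import Union


def to_str_path(path: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Return the path as str, decoding it first if it was passed as bytes."""
    if isinstance(path, bytes):
        # Assumption: the bytes are UTF-8 encoded, as in the issue's example.
        return path.decode(encoding)
    return path


raw = b"./\xe6\x96\x87\xe6\xa1\xa3/\xe6\x96\x87\xe6\xa1\xa3.py"
name = os.path.basename(to_str_path(raw))
print(name)  # 文档.py
```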

Does codeprep work on JavaScript source code preprocessing?

It seems that it does not work when dealing with the JavaScript language. Also, is there any way to remove the '\t' at the end of a token in the token sequence?

By default use end-of-full-token character (</t>) instead of token boundaries (<w>, </w>) for all kinds of pre-processing for consistency

Currently:

>>> api.basic("getName")
['<w>', 'get', 'Name', '</w>']

To be done:

>>> api.basic("getName")
['get', 'Name', '</t>']
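
For output already produced in the current format, the difference between the two conventions can be shown with a small conversion sketch. The function `boundaries_to_end_marker` is hypothetical and not part of codeprep. It assumes that tokens not wrapped in <w> ... </w> are left unchanged, which matches the provisional output shown for '(' in the PreppedTokenSequence issue.

```python
from typing import List

WORD_START = '<w>'
WORD_END = '</w>'
END_OF_TOKEN = '</t>'


def boundaries_to_end_marker(tokens: List[str]) -> List[str]:
    """Rewrite <w> ... </w> token boundaries as a single trailing </t> marker."""
    converted = []
    for token in tokens:
        if token == WORD_START:
            continue  # the opening boundary has no counterpart in the new format
        converted.append(END_OF_TOKEN if token == WORD_END else token)
    return converted


print(boundaries_to_end_marker(['<w>', 'get', 'Name', '</w>']))
# ['get', 'Name', '</t>']
```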
