
tokenizer's Introduction


tokenizer

Tokenize source code into integer vectors, symbols, or discrete tokens.

The following languages are currently supported.

  • C
  • C#
  • C++
  • Go
  • Java
  • JavaScript
  • PHP
  • Python
  • Rust
  • TypeScript

Build

cd src
make

Test

Ensure CppUnit is installed. Depending on your environment, you may also need to pass its installation directory prefixes to make on the command line. For example, under macOS pass ADDCXXFLAGS='-I /opt/homebrew/include' ADDLDFLAGS='-L /opt/homebrew/lib' as arguments to make.

cd src
make test

Install

cd src
sudo make install

Run

tokenizer file.c
tokenizer -l Java -o statement <file.java

Examples of tokenizing "hello world" programs in diverse languages

C into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C
35      320     60      2000    46      2001    62      322     2002    40     41       123     2003    40      625     41      59      327     1500    59     125

C into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C -s
# include < ID:2000 . ID:2001 > int ID:2002 ( ) { ID:2003 ( STRING_LITERAL
) ; return 0 ; }

C# into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#"
312     2000    123     360     376     2001    40      41      123     2002   46       2003    46      2004    40      627     41      59      125     125

C# into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -s
class ID:2000 { static void ID:2001 ( ) { ID:2002 . ID:2003 . ID:2004
( STRING_LITERAL ) ; } }

C# method-only into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -o method
123     2002    46      2003    46      2004    40      627     41      59     125

C++ into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -s
# include < ID:2000 > LINE_COMMENT using namespace ID:2001 ; int ID:2002
( ) LINE_COMMENT { ID:2003 LSHIFT STRING_LITERAL LSHIFT ID:2004 ;
LINE_COMMENT return 0 ; LINE_COMMENT }

Java into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/j/Java.java | tokenizer -l Java -s
public class ID:2000 { public static void ID:2001 ( ID:2002 [ ] ID:2003 )
{ ID:2004 . ID:2005 . ID:2006 ( STRING_LITERAL ) ; } }

C++ into code tokens

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -c
#
include
<
iostream
>
// ...
using
namespace
std
;
int
main
(
)
// ...
{
cout
<<
"..."
<<
endl
;
// ...
return
0
;
// ...
}

Examples of tokenizer code preprocessing

Token-by-token difference

Produce a token-by-token difference between the current version of the file tokenizer.cpp and the one in version v1.1.

diff <(git show v1.1:./tokenizer.cpp | tokenizer -l C++ -b) \
  <(tokenizer -l C++ -b tokenizer.cpp)

Clone detection

List Type 2 (near) clones in the tokenizer source code.

tokenizer -l C++ -c -f -o line *.cpp *.h | mpcd

Reference manual

You can read the command's Unix manual page through this link.

In 2023 version 2.0 of the tokenizer was released, with a simpler and more orthogonal command-line interface. To convert old code, you can read the Unix manual page of the original v1.1 version through this link.

Contributing

To support a new language proceed as follows.

  • Open an issue with the language name and a pointer to its lexical structure definition.
  • Add a comment indicating that you're working on it.
  • List the language's keywords in a file named language-keyword.txt, kept in alphabetical order. If the language supports a C-like preprocessor, add those keywords as well.
  • Copy the source code files of an existing language that most resembles the new language to create the new language files: languageTokenizer.cpp, languageTokenizer.h, languageTokenizerTest.h.
  • In the copied files rename all instances (uppercase, lowercase, CamelCase) of the existing language name to the new language name.
  • Create a list of the new language's operators and punctuators, and methodically go through the languageTokenizer.cpp switch statements to ensure that these are correctly handled; see the sketch after this list for the typical pattern. When code is missing or different, base the new code on an existing pattern. Keep token names used for the same semantic purpose the same between languages. If you need a new token name, just write Token:MY_NAME and it will be defined automatically.
  • Add code to handle the language's comments.
  • Adjust, if needed, the handling of constants and literals. Note that for the sake of simplicity and efficiency, the tokenizer can assume that its input is correct.
  • To implement features that aren't handled in the language whose tokenizer implementation you copied, look at the implementation of other language tokenizers that have these features.
  • If you need to reuse a method from another language, move it to TokenizerBase.
  • Add the object file languageTokenizer.o to the OBJ list of file names in the Makefile.
  • Add unit tests for any new or modified features you implemented.
  • Update the file UnitTests.cpp to include the unit test header file, and call addTest with the unit test suite.
  • Update the method process_file in tokenizer.cpp to call the tokenizer you implemented, and add the language's name to the list of supported languages.
  • Ensure the language is correctly tokenized, both by running the tokenizer and by running the unit tests with make test.
  • Update the manual page tokenizer.1 and this README.md file.
  • Bump the middle (minor) number of the semantic version string in tokenizer.cpp.
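
Most of the operator and punctuator handling mentioned in the list above boils down to switch statements that look one character ahead. The following self-contained sketch illustrates that pattern for the '<' family of operators; the function, token names, and overall structure are illustrative assumptions, not code taken from an actual languageTokenizer.cpp.

// Illustrative sketch of the one-character-lookahead pattern used to
// scan multi-character operators; token names here are hypothetical.
#include <iostream>
#include <sstream>
#include <string>

enum class Token { LESS, LESS_EQUAL, LSHIFT, LSHIFT_EQUAL, END, OTHER };

// Scan a token starting at the '<' family of operators.
Token scan_less(std::istream &in) {
    int c = in.get();
    if (c == std::char_traits<char>::eof())
        return Token::END;
    if (c != '<')
        return Token::OTHER;        // e.g. the separating whitespace
    switch (in.peek()) {
    case '=':                       // "<="
        in.get();
        return Token::LESS_EQUAL;
    case '<':                       // "<<" or "<<="
        in.get();
        if (in.peek() == '=') {
            in.get();
            return Token::LSHIFT_EQUAL;
        }
        return Token::LSHIFT;
    default:                        // plain "<"
        return Token::LESS;
    }
}

int main() {
    std::istringstream in("<<= << <= <");
    for (;;) {
        Token t = scan_less(in);
        if (t == Token::END)
            break;
        if (t != Token::OTHER)      // print the numeric token values
            std::cout << static_cast<int>(t) << '\n';
    }
}

The real tokenizers map such cases onto the project's shared token values; the point here is only the peek-then-consume structure that each switch arm follows.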

tokenizer's People

Contributors

dspinellis, tushartushar

tokenizer's Issues

NULL output when tokenizing PHP

When tokenizing a certain PHP file, a nasty NULL character is output. Interestingly, this causes fgrep(1) to stop matching lines.

C#: Negative tokens

There are cases where the tokenizer emits negative tokens. It could be related to characters outside the ASCII range. Here is a sample method definition and the corresponding tokenizer output.

negative-tokens-cs.txt

code de-tokenization

I was wondering whether there is any tool here that can de-tokenize the output produced by this tool.

Example: I perform tokenization with the following call:

curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -t c

Is there any tool here that can de-tokenize the output of the above call to the original tokens?

Tokens for semantically same keywords

Programming languages share similar keywords. In order to apply transfer learning, it would be necessary to encode the same keywords from different programming languages into the same tokens.

Mistake in EOF detection

I'm running the tokenizer on this JavaScript function and get escapeHTML.js(9): EOF encountered while processing a string literal

function escapeHTML(s) {
  var n = s;
  n = n.replace(/&/g, '&amp;');
  n = n.replace(/</g, '&lt;');
  n = n.replace(/>/g, '&gt;');
  n = n.replace(/"/g, '&quot;');

  return n;
}

I think there's an error in detecting the closing quote; it happens every time the tokenizer encounters a single quote or double quote. Can you please check this? Thanks!
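
For context, JavaScript lexers commonly decide whether a / starts a regular-expression literal or a division operator by looking at the token that precedes it; if a regex literal such as /"/g is instead scanned as division, the quote inside it is taken as the start of a string literal, which could produce exactly this kind of error. The following is a minimal, purely illustrative sketch of that heuristic, not the tokenizer's actual code.

// Hypothetical heuristic: a '/' begins a regular-expression literal only
// when the previous token cannot end an expression; otherwise it is the
// division operator.
#include <iostream>
#include <set>
#include <string>

bool slash_starts_regex(const std::string &prev_token) {
    // Tokens after which '/' must mean division.
    static const std::set<std::string> expression_enders = {
        ")", "]", "identifier", "number", "string", "regex"
    };
    return expression_enders.count(prev_token) == 0;
}

int main() {
    // n.replace(/&/g, ...)  -- '/' follows '(' so it starts a regex literal
    std::cout << slash_starts_regex("(") << '\n';            // prints 1
    // a = b / c;            -- '/' follows an identifier, so it is division
    std::cout << slash_starts_regex("identifier") << '\n';   // prints 0
}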

floating-point constant too large

Thank you for the project.

I cloned it and tried to run make, but I got this error. I am on a Mac M1.

[master][~/tokenizer]$ cd src                                                                              rbenv:2.7.6
[master][~/tokenizer/src]$ make                                                                            rbenv:2.7.6
./mkkeyword.pl C-keyword.txt CSharp-keyword.txt Cpp-keyword.txt Go-keyword.txt Java-keyword.txt JavaScript-keyword.txt PHP-keyword.txt Python-keyword.txt Rust-keyword.txt TypeScript-keyword.txt
./mktoken.pl CTokenizer.cpp CSharpTokenizer.cpp CppTokenizer.cpp GoTokenizer.cpp JavaTokenizer.cpp JavaScriptTokenizer.cpp PHPTokenizer.cpp PythonTokenizer.cpp RustTokenizer.cpp TypeScriptTokenizer.cpp TokenizerBase.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o CTokenizer.o CTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o CppTokenizer.o CppTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o JavaTokenizer.o JavaTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o CSharpTokenizer.o CSharpTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o PythonTokenizer.o PythonTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o TokenizerBase.o TokenizerBase.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o SymbolTable.o SymbolTable.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o NestedClassState.o NestedClassState.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o PHPTokenizer.o PHPTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o JavaScriptTokenizer.o JavaScriptTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o GoTokenizer.o GoTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o RustTokenizer.o RustTokenizer.cpp
c++ -Wall -Werror -MD -std=c++17  -O2   -c -o tokenizer.o tokenizer.cpp
tokenizer.cpp:154:30: error: magnitude of floating-point constant too large for type 'long double'; maximum is 1.7976931348623157E+308 [-Werror,-Wliteral-range]
        for (long double d = 1; d < 1e309L; d *= 10)
                                    ^
1 error generated.
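
The literal 1e309L does not fit in a 64-bit long double, which is what Apple silicon provides, hence the -Wliteral-range error. Assuming the loop's intent is to keep multiplying until the value overflows, one portable way to express the bound without such a literal is sketched below; this is an illustration of the idea, not the project's actual fix.

// Stop when the running value overflows to infinity instead of comparing
// against a literal that a 64-bit long double cannot represent.
#include <cmath>
#include <iostream>

int main() {
    int steps = 0;
    for (long double d = 1; !std::isinf(d); d *= 10)
        ++steps;
    // Prints 309 where long double is 64 bits (Apple silicon) and
    // 4933 where it is the x86-64 80-bit extended format.
    std::cout << steps << '\n';
}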

Compilation failing on Windows machine

Compilation of the project is failing on Windows. Is it because of a specific version of g++? I have MinGW 6.3.0 installed.

The errors are coming from the auto-generated files (such as CToken.h). I tried to modify the Perl scripts, but with no success yet. It seems that the compiler is complaining about the following syntax:

{"enum
", ENUM

The compiler wants the closing double quote on the same line:

{"enum", ENUM 

Here is one of the error messages:

CKeyword.h:104:5: error: missing terminating " character [-Werror]

Attached: compile log
compile.log

Support JavaScript

The lexical structure is defined here. Note the following.

  • $ is allowed in identifiers
  • Regular expression literals
  • Template substitution literals
  • Full Unicode support in identifiers is probably overkill

cleaning out whitespace removes python scope context

Hello!

Not sure if this is out of scope of the project and/or covered by "The Python tokenizer does not support processing options and identifier scoping."

Basically, the tokenizer strips whitespace and thereby removes the scope context information, which can change the meaning of the code.

Example:

def foo(baz,bar):
    if baz:
        bar=1
    bar+=1
    return bar

foo(True,1)


def
foo
(
baz
,
bar
)
:
if
baz
:
bar
=
1
bar
+=
1
return
bar
foo
(
True
,
1
)

Because whitespace is stripped, it's unclear in which scope bar += 1 happens.
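
For context, lexers that need to preserve Python's block structure typically keep a stack of indentation widths and emit synthetic INDENT and DEDENT tokens, the way CPython's own tokenizer does. The sketch below illustrates that idea on the example above; it is not something tokenizer currently implements.

// Turn leading whitespace into explicit INDENT/DEDENT tokens so that
// stripping spaces no longer loses scope information.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::istringstream src(
        "def foo(baz,bar):\n"
        "    if baz:\n"
        "        bar=1\n"
        "    bar+=1\n"
        "    return bar\n");

    std::vector<int> levels{0};          // stack of indentation widths
    std::string line;
    while (std::getline(src, line)) {
        int width = 0;
        while (width < (int)line.size() && line[width] == ' ')
            ++width;
        if (width > levels.back()) {     // a deeper block starts
            levels.push_back(width);
            std::cout << "INDENT\n";
        }
        while (width < levels.back()) {  // one or more blocks end
            levels.pop_back();
            std::cout << "DEDENT\n";
        }
        std::cout << line.substr(width) << '\n';
    }
}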
