tokenizer

Tokenize source code into integer vectors, symbols, or discrete tokens.

The following languages are currently supported.

  • C
  • C#
  • C++
  • Java
  • PHP
  • Python

Build

cd src
make

Install

cd src
sudo make install

Run

tokenizer file.c
tokenizer -l Java -o statement <file.java
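The two invocations above can also be driven from a script. The following is a minimal sketch, assuming `tokenizer` is installed and on the `PATH`. The helper names `build_command` and `tokenize` are hypothetical, and only the `-l`, `-o`, and `-t` options that appear in this README are used.

```python
import subprocess


def build_command(language=None, o_arg=None, t_arg=None, path=None):
    """Assemble an argv list for the tokenizer command (hypothetical helper).

    o_arg is the value passed to -o (this README shows "statement" and "method").
    t_arg is the value passed to -t (this README shows "s" and "c").
    """
    argv = ["tokenizer"]
    if language is not None:
        argv += ["-l", language]
    if o_arg is not None:
        argv += ["-o", o_arg]
    if t_arg is not None:
        argv += ["-t", t_arg]
    if path is not None:
        argv.append(path)
    return argv


def tokenize(source, **options):
    """Feed source text to tokenizer on standard input and return its output."""
    result = subprocess.run(
        build_command(**options),
        input=source,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tokenizer exits with an error
    )
    return result.stdout
```

For example, `tokenize(java_source, language="Java", o_arg="statement")` would correspond to the second command line above.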

Examples of tokenizing "hello world" programs in diverse languages

C into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C
35      320     60      2000    46      2001    62      322     2002    40     41       123     2003    40      625     41      59      327     1500    59     125

C into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C -t s
# include < ID:2000 . ID:2001 > int ID:2002 ( ) { ID:2003 ( STRING_LITERAL
) ; return 0 ; }
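To see how the two outputs relate, the integer vector and the symbol stream for the same C file can be paired position by position. The sketch below only uses the sample output shown above. The observation that codes below 256 match ASCII character codes is drawn from this one sample and is an assumption, not a documented guarantee.

```python
# Sample outputs copied from the two C examples above.
ints = "35 320 60 2000 46 2001 62 322 2002 40 41 123 2003 40 625 41 59 327 1500 59 125"
syms = "# include < ID:2000 . ID:2001 > int ID:2002 ( ) { ID:2003 ( STRING_LITERAL ) ; return 0 ; }"

codes = [int(x) for x in ints.split()]
symbols = syms.split()

# Both outputs describe the same token sequence, so they pair up one to one.
pairs = list(zip(codes, symbols))

# In this sample, every code below 256 is the ASCII code of its symbol.
ascii_like = [(code, sym) for code, sym in pairs if code < 256]

for code, sym in pairs:
    print(f"{code:>5}  {sym}")
```

In this sample, keywords such as `include` and `int` receive codes in the 300s, and identifiers are numbered from 2000 in order of first appearance, matching the `ID:2000`, `ID:2001`, ... symbols.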

C# into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#"
312     2000    123     360     376     2001    40      41      123     2002   46       2003    46      2004    40      627     41      59      125     125

C# into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -t s
class ID:2000 { static void ID:2001 ( ) { ID:2002 . ID:2003 . ID:2004
( STRING_LITERAL ) ; } }

C# method-only into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -o method
123     2002    46      2003    46      2004    40      627     41      59     125
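Comparing this with the full C# integer output shown earlier suggests that `-o method` emits only the tokens of the method body. A small sketch, using only the two sample vectors from this README, locates the method-only vector inside the full one. The helper `find_sublist` is hypothetical.

```python
def find_sublist(haystack, needle):
    """Return the first index where needle occurs contiguously in haystack, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


# Full-file output of the C# example (tokenizer -l "C#").
full = [312, 2000, 123, 360, 376, 2001, 40, 41, 123, 2002,
        46, 2003, 46, 2004, 40, 627, 41, 59, 125, 125]

# Method-only output of the same file (tokenizer -l "C#" -o method).
method = [123, 2002, 46, 2003, 46, 2004, 40, 627, 41, 59, 125]

start = find_sublist(full, method)
print(start)                        # position of the method body in the full vector
print(full[:start])                 # tokens before the method body
print(full[start + len(method):])   # tokens after the method body
```

In this sample the identifier numbers (2002 to 2004) are the same in both outputs, so the method-only vector is an exact contiguous slice of the full one.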

C++ into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -t s
# include < ID:2000 > LINE_COMMENT using namespace ID:2001 ; int ID:2002
( ) LINE_COMMENT { ID:2003 LSHIFT STRING_LITERAL LSHIFT ID:2004 ;
LINE_COMMENT return 0 ; LINE_COMMENT }

Java into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/j/Java.java | tokenizer -l Java -t s
public class ID:2000 { public static void ID:2001 ( ID:2002 [ ] ID:2003 )
{ ID:2004 . ID:2005 . ID:2006 ( STRING_LITERAL ) ; } }

C++ into code tokens

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -t c
#
include
<
iostream
>
// ...
using
namespace
std
;
int
main
(
)
// ...
{
cout
<<
"..."
<<
endl
;
// ...
return
0
;
// ...
}
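Because the code-token output lists one token per line in the same order as the symbol output, the two can be aligned to recover which name each `ID:` symbol stands for. The sketch below uses only the C++ sample outputs shown in this README and assumes the two streams always have the same length, which holds for this sample.

```python
# Symbol output of the C++ example (tokenizer -l C++ -t s).
symbols = """# include < ID:2000 > LINE_COMMENT using namespace ID:2001 ; int ID:2002
( ) LINE_COMMENT { ID:2003 LSHIFT STRING_LITERAL LSHIFT ID:2004 ;
LINE_COMMENT return 0 ; LINE_COMMENT }""".split()

# Code-token output of the same file (tokenizer -l C++ -t c), one token per line.
code_tokens = [
    "#", "include", "<", "iostream", ">", "// ...",
    "using", "namespace", "std", ";",
    "int", "main", "(", ")", "// ...",
    "{", "cout", "<<", '"..."', "<<", "endl", ";", "// ...",
    "return", "0", ";", "// ...", "}",
]

# Pair the streams and keep only the identifier symbols.
names = {sym: tok for sym, tok in zip(symbols, code_tokens) if sym.startswith("ID:")}

for sym, tok in names.items():
    print(sym, "->", tok)
```

The same pairing shows that `LINE_COMMENT` corresponds to `// ...` and `STRING_LITERAL` to `"..."` in this sample.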

Reference manual

The command's Unix manual page, tokenizer.1, is the reference manual.

Contributing

To support a new language, proceed as follows.

  • Open an issue with the language name and a pointer to its lexical structure definition.
  • Add a comment indicating that you're working on it.
  • List the language's keywords in a file named language-keyword.txt. Keep them in alphabetical order. If the language supports a C-like preprocessor, add those keywords as well.
  • Copy the source code files of an existing language that most resembles the new language to create the new language files: languageTokenizer.cpp, languageTokenizer.h, languageTokenizerTest.h.
  • In the copied files rename all instances (uppercase, lowercase, CamelCase) of the existing language name to the new language name.
  • Create a list of the new language's operators and punctuators, and methodically go through the languageTokenizer.cpp switch statements to ensure that these are correctly handled. When code is missing or different, base the new code on an existing pattern.
  • Add code to handle the language's comments.
  • Adjust, if needed, the handling of constants and literals. Note that for the sake of simplicity and efficiency, the tokenizer can assume that its input is correct.
  • To implement features that aren't handled in the language whose tokenizer implementation you copied, look at the implementation of other language tokenizers that have these features.
  • If you need to reuse a method from another language, move it to TokenizerBase.
  • Add the object file languageTokenizer.o to the OBJ list of file names in the Makefile.
  • Add unit tests for any new or modified features you implemented.
  • Update fileUnitTests.cpp to include the unit test header file, and call addTest with the unit test suite.
  • Update the method process_file in tokenizer.cpp to call the tokenizer you implemented, and add the language's name to the list of supported languages.
  • Ensure the language is correctly tokenized, both by running the tokenizer and by running the unit tests with make test.
  • Update the manual page tokenizer.1 and this README.md file.
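The renaming step in the list above (uppercase, lowercase, and CamelCase instances) can be scripted. The following is a minimal sketch rather than part of the project: the helper `rename_language`, the target language "Kotlin", and the sample input line (including the header-guard name) are all hypothetical.

```python
import re


def rename_language(text, old, new):
    """Replace every case variant of old with the matching variant of new."""
    pattern = re.compile(re.escape(old), re.IGNORECASE)

    def swap(match):
        found = match.group(0)
        if found.isupper():   # e.g. JAVA -> KOTLIN
            return new.upper()
        if found.islower():   # e.g. java -> kotlin
            return new.lower()
        return new            # e.g. Java -> Kotlin

    return pattern.sub(swap, text)


# Hypothetical line mixing the three case variants.
sample = "class JavaTokenizer {}; #define JAVATOKENIZER_H; java-keyword.txt"
print(rename_language(sample, "Java", "Kotlin"))
```

This matches substrings, so it would also rewrite unrelated words that contain the old name (for example "JavaScript"). Review the resulting diff before committing.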

Contributors

dspinellis, tushartushar
