Giter Club home page Giter Club logo

codeornot's Introduction

CASICS CodeOrNot

CASICS CodeOrNot is a Python 3 package implementing heuristic methods for determining whether a file or directory contains software source (or not). The analysis is oriented towards detecting code: a repository containing a mix of documents and even one source code file will be considered to contain code, and conversely, if there is no sign of source code, it will be labeled as "not" being source code.

Authors: Michael Hucka
Repository: https://github.com/casics/codeornot
License: Unless otherwise noted, this content is licensed under the GPLv3 license.

Introduction

In performing source code repository analysis for classification tasks, a basic first step is to decide whether a source code repository actually contains code. Some repositories contain documents or other files and are not actually repositories for software; those are cases that a system for analyzing source code could skip. CodeOrNot is a Python 3 package that uses heuristics to answer the question "does it contain code, or not?"

Some cases are quite easy to decide: if a collection of files contains even one .c file, it can be reasonably assumed to contain C code, and thus the answer returned by CodeOrNot will be "code". Some other cases are more difficult. For example, files may contain code but not have file name extensions, and so determining whether they contain code or not requires examining the content. Other examples are gray zones: should a repository containining LaTeX files and a single Makefile be considered to contain code? After all, a Makefile can contain code—does that count? (The position taken by CodeOrNot is no, a single Makefile is not enough to consider the repository to be a code repository.)

CodeOrNot also provides some simple utilities modules that may be useful in other contexts:

  • The textcheck module provides functions such as majority_language(), which takes a list of text strings and reports the most likely human language in which the text strings are written. (It does this by using a combination of ftfy, cld2, and a majority vote.)

  • The codecheck module provides functions such as code_filename() and noncode_filename, which can be used to infer whether a file is likely to be code or noncode based on its name. These work by using built-in lists of file name rules.

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing — info for developers

A lot remains to be done on CASICS in many areas. We would be happy to receive your help and participation if you are interested. Please feel free to contact the developers either via GitHub or the mailing list [email protected].

Everyone is asked to read and respect the code of conduct when participating in this project.

❤️ Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


             

codeornot's People

Contributors

mhucka avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.