structa's Introduction

structa

structa is a small, semi-magical utility for discerning the "overall structure" of large data files. Typically this is something like a document-oriented database in JSON format, a CSV file of a database dump, or a YAML document.

Usage

Use from the command line:

structa <filename>

The usual --help and --version switches are available for more information. The full documentation may also help in understanding the myriad switches!

Examples

The People in Space API shows the number of people currently in space, and their names and craft name:

curl -s http://api.open-notify.org/astros.json | structa

Output:

{
    'message': str range="success" pattern="success",
    'number': int range=10,
    'people': [
        {
            'craft': str range="ISS".."Tiangong",
            'name': str range="Akihiko Hoshide".."Thomas Pesquet"
        }
    ]
}

The Python Package Index (PyPI) provides a JSON API for packages. You can feed the JSON of several packages to structa to get an idea of the overall structure of these records (when structa is given multiple inputs on the same invocation, it assumes all have a common source):

for pkg in numpy scipy pandas matplotlib structa; do
    curl -s https://pypi.org/pypi/$pkg/json > $pkg.json
done
structa numpy.json scipy.json pandas.json matplotlib.json structa.json

Output:

{
    'info': { str: value },
    'last_serial': int range=11.9M..13.1M,
    'releases': {
        str range="0.1".."3.5.1": [
            {
                'comment_text': str,
                'digests': {
                    'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                    'sha256': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
                },
                'downloads': int range=-1,
                'filename': str,
                'has_sig': bool,
                'md5_digest': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                'packagetype': str range="bdist_wheel".."sdist",
                'python_version': str range="2.4".."source",
                'requires_python': value,
                'size': int range=39.3K..118.4M,
                'upload_time': str of timestamp range=2006-01-09 14:02:01..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S",
                'upload_time_iso_8601': str of timestamp range=2009-04-06 06:19:25..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S.%f%z",
                'url': URL,
                'yanked': bool,
                'yanked_reason': value
            }
        ]
    },
    'urls': [
        {
            'comment_text': str range="",
            'digests': {
                'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                'sha256': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            },
            'downloads': int range=-1,
            'filename': str,
            'has_sig': bool,
            'md5_digest': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
            'packagetype': str range="bdist_wheel".."sdist",
            'python_version': str range="cp310".."source",
            'requires_python': value,
            'size': int range=47.2K..55.6M,
            'upload_time': str of timestamp range=2021-10-27 23:57:01..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S",
            'upload_time_iso_8601': str of timestamp range=2021-10-27 23:57:01..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S.%f%z",
            'url': URL,
            'yanked': bool,
            'yanked_reason': value
        }
    ],
    'vulnerabilities': [ empty ]
}

The Ubuntu Security Notices database contains the list of all security issues in releases of Ubuntu (warning, this one takes some time to analyze and eats about a gigabyte of RAM while doing so):

curl -s https://usn.ubuntu.com/usn-db/database.json | structa

Output:

{
    str range="1430-1".."4630-1" pattern="dddd-d": {
        'action'?: str,
        'cves': [ str ],
        'description': str,
        'id': str range="1430-1".."4630-1" pattern="dddd-d",
        'isummary'?: str,
        'releases': {
            str range="artful".."zesty": {
                'allbinaries'?: {
                    str: { 'version': str }
                },
                'archs'?: {
                    str range="all".."source": {
                        'urls': {
                            URL: {
                                'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                                'size': int range=20..1.2G
                            }
                        }
                    }
                },
                'binaries': {
                    str: { 'version': str }
                },
                'sources': {
                    str: {
                        'description': str,
                        'version': str
                    }
                }
            }
        },
        'summary': str,
        'timestamp': float of timestamp range=2012-04-27 12:57:41..2020-11-11 18:01:48,
        'title': str
    }
}

structa's Issues

Progress output

Add some indication that structa is doing something; it takes bloody ages to analyze 100+ MB of JSON on a Pi.

Common sub-tree elimination

Some complex cases (USN example being one) result in common sub-structures which could be eliminated with some slightly more intelligent comparison methods

Add URL support

Might be handy to be able to stuff a URL straight on the CLI; minor enhancement
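
One way to tell a URL from a filename on the command line is to look at the parsed scheme. This is a minimal sketch, assuming only http and https are of interest; is_url is a hypothetical helper and not part of structa:

```python
from urllib.parse import urlsplit

def is_url(arg):
    """Return True if a command-line argument looks like an HTTP(S) URL."""
    parts = urlsplit(arg)
    # Require both a web scheme and a host, so plain filenames
    # (and Windows drive letters such as "C:") are not mistaken for URLs.
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

print(is_url('https://pypi.org/pypi/structa/json'))
print(is_url('numpy.json'))
```

Anything that fails the check would then be opened as a local file as before.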

Merge type-instances correctly

Currently, when a Dict with Field keys is merged with a Dict with, say, Scalar keys, the result duplicates the typed entries and the subsequent sort (intended for fields) fails (because Scalar entries are unsortable):

Traceback (most recent call last):
  File "/home/dave/envs/structa/bin/structa", line 33, in <module>
    sys.exit(load_entry_point('structa', 'console_scripts', 'structa')())
  File "/home/dave/projects/structa/structa/ui/cli.py", line 30, in main
    structure = get_structure(config)
  File "/home/dave/projects/structa/structa/ui/cli.py", line 230, in get_structure
    result = analyzer.merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 272, in merge
    return self._merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 315, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 316, in <listcomp>
    self._merge(item)
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 305, in _merge
    DictField(self._merge(keys), self._merge(sum(
  File "/home/dave/projects/structa/structa/types.py", line 301, in __add__
    result = super().__add__(other)
  File "/home/dave/projects/structa/structa/types.py", line 248, in __add__
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 248, in <listcomp>
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 337, in __add__
    self.value + other.value)
  File "/home/dave/projects/structa/structa/types.py", line 305, in __add__
    result.content = sorted(result.content, key=attrgetter('key'))
TypeError: '<' not supported between instances of 'Str' and 'Str'

Improve merging of str patterns

When the merge phase combines Str patterns, if either is undefined or unequal lengths are found, we just ditch the entire pattern. This could be more refined, e.g. zip_longest and mark the suffix regex-optional, although that raises the question of what to do about combining optional suffixes, or whether such a thing is even valid. Still, there's almost certainly room for some improvement here.
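
The zip_longest idea might look roughly like the sketch below. It assumes a pattern is simply a list of character sets, one per position, which is an illustration only and not structa's actual pattern representation; merge_patterns and pattern_of are hypothetical names:

```python
from itertools import zip_longest

def pattern_of(s):
    """Build a trivial per-position pattern from one sample string."""
    return [{c} for c in s]

def merge_patterns(a, b):
    """Merge two patterns position by position.

    Returns a list of (chars, optional) pairs. Positions present in only
    one of the inputs are kept, but flagged as optional.
    """
    merged = []
    for x, y in zip_longest(a, b):
        if x is None or y is None:
            # Only the longer pattern has this position: mark it optional
            merged.append((x if x is not None else y, True))
        else:
            merged.append((x | y, False))
    return merged

result = merge_patterns(pattern_of('ab'), pattern_of('acd'))
print(result)
```

Here the third position survives as an optional suffix instead of the whole pattern being discarded. Merging two patterns that already carry optional suffixes is left open, as noted above.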

Table structure

Tried going down the namedtuple route, which turned out to be the wrong idea. We need a specific list-of-tuples type to represent table structures, with options to specify header (& footer?) rows, and its own output representation.

Improve StrRepr for numeric patterns

The pattern handling in StrRepr is a bit rubbish, dealing with trivial bases in int, and with no clue about floating-point representations. We should have the ability to derive a %-format from a cohort of strings (just as we do for DateTime) for the Int and Float classes.

The Int detection you already know how to do; Float can be figured out. The tricky bit will be handling __eq__ and __str__ for those patterns. The current base merging is easy enough, but handling 0-prefix and length padding may turn out to be non-trivial, e.g. it needs to know that '%08x' == '%10x' and that '%08x' + '%10x' == '%08x', that '%#x' != '%x', etc.
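
As a rough illustration of deriving a %-format from a cohort of strings, the sketch below tries a few candidate decimal formats and keeps the first one that round-trips every sample. It covers base 10 only and ignores the __eq__ and merging questions raised above; guess_int_format is a hypothetical name, not an existing structa function:

```python
def guess_int_format(strings):
    """Guess a %-style format that reproduces every decimal string given.

    Returns None if the strings are not all integers, or if no candidate
    format round-trips the whole cohort.
    """
    candidates = ['%d']
    widths = {len(s) for s in strings}
    if len(widths) == 1:
        (width,) = widths
        # Fixed width suggests zero padding or space padding
        candidates += ['%0{}d'.format(width), '%{}d'.format(width)]
    for fmt in candidates:
        try:
            if all(fmt % int(s) == s for s in strings):
                return fmt
        except ValueError:
            return None  # at least one string is not an integer
    return None

print(guess_int_format(['0001', '0042', '1234']))
print(guess_int_format(['1', '42']))
```

The same round-trip check could in principle be extended with hex and float candidates, which is where the padding and prefix comparisons start to matter.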

HTML output

ANSI codes for the tty at least; maybe HTML?

Multi-file input

Handle multiple input files as if an array of their content (stripping the outer array from the output); mostly already implemented but needs output refinement on the ui-work branch.
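
Conceptually, treating several inputs as one array amounts to something like the following sketch. It assumes every input is JSON and uses in-memory streams in place of real files; load_as_array is a hypothetical helper:

```python
import io
import json

def load_as_array(streams):
    """Parse each JSON stream and return the documents as one list."""
    return [json.load(stream) for stream in streams]

# Two stand-in "files" with a common structure
streams = [io.StringIO('{"number": 10}'), io.StringIO('{"number": 7}')]
combined = load_as_array(streams)
print(combined)
```

The analysis would then run over the combined list, and the output stage would drop the outer list so the result describes a single record.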

CSV handling

Add parsing of CSV files including detection of quoting and field separators (and column headers? should be easy enough). Depends on #4
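
Python's standard library already offers heuristic detection of separators and quoting via csv.Sniffer, which may be a reasonable starting point. A minimal sketch with made-up sample data:

```python
import csv
import io

# Made-up sample: semicolon-separated, with quoted fields that
# themselves contain the separator
sample = 'name;craft\n"Hoshide; Akihiko";ISS\n"Pesquet; Thomas";ISS\n'

# Restrict the guess to a few plausible separators
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t')
rows = list(csv.reader(io.StringIO(sample), dialect))

print(repr(dialect.delimiter), repr(dialect.quotechar))
print(rows)
```

csv.Sniffer also has a has_header method for guessing whether the first row holds column names; both methods are heuristics and can guess wrong on small or irregular samples.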

XLS support

After a little scouting around it appears it's still a sadly common format for several data sources (ONS!). Add it as an optional dependency like YAML.

Sub-tree re-work

Add a method to permit forcing the type of a given node (and thus re-analysis of its sub-nodes). To be used in an eventual interactive implementation permitting users to force, e.g. a str of int to be treated as a straight str.

YAML handling

It's a horrid format, but popular enough, and easy enough to add, that adding it is probably warranted.

XML handling

Add the ability to parse and analyze XML elements; something along the lines of a compound structure with a dict for attrs, a list for child elems, and a couple of str attributes for text and tail (a la etree)
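
The compound structure described here could be sketched with the standard library's ElementTree as follows. The dict layout and the element_to_struct name are assumptions for illustration, not a committed design:

```python
import xml.etree.ElementTree as ET

def element_to_struct(elem):
    """Convert an element into plain dicts, lists and strings."""
    return {
        'tag': elem.tag,
        'attrs': dict(elem.attrib),  # dict of attributes
        'children': [element_to_struct(child) for child in elem],
        'text': elem.text,  # text before the first child, or None
        'tail': elem.tail,  # text after the closing tag, or None
    }

root = ET.fromstring(
    '<people count="2"><name>A</name>and<name>B</name></people>')
struct = element_to_struct(root)
print(struct['attrs'], [child['text'] for child in struct['children']])
```

Once in this form, the existing dict, list and str analysis could be applied to the result without XML-specific handling.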

Better handling of null/None

At the moment, the --empty-threshold handling applies to blank strings only. This should be extended to include None (or null in JavaScript et al.) so that something like a bunch of strings with a few None entries is treated as "optional str" rather than as "value".
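
The intended outcome can be illustrated with the sketch below. The max_null_ratio parameter is hypothetical and is not meant to mirror how --empty-threshold is actually computed; classify is likewise an illustrative name:

```python
def classify(values, max_null_ratio=0.5):
    """Describe a list of values, tolerating a limited share of None."""
    present = [v for v in values if v is not None]
    nulls = len(values) - len(present)
    if not present:
        return 'empty'
    types = {type(v).__name__ for v in present}
    if len(types) != 1:
        return 'value'  # genuinely mixed types
    (name,) = types
    if nulls == 0:
        return name
    if nulls / len(values) <= max_null_ratio:
        return 'optional ' + name
    return 'value'

print(classify(['ISS', 'Tiangong', None]))
print(classify(['ISS', 1]))
```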

Path re-factor

The analysis path is currently a tuple of structures, which is okay, but it would be preferable to have something pathlib-esque to make the code a bit more obvious.
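
Something pathlib-esque might look like this sketch, which borrows the / operator and the parent property from pathlib. StructPath is a hypothetical class; the real path would hold structure objects instead of the plain keys used here:

```python
class StructPath:
    """An immutable sequence of steps from the root to a node."""

    def __init__(self, *parts):
        self.parts = tuple(parts)

    def __truediv__(self, part):
        # path / 'key' returns a new, longer path
        return StructPath(*self.parts, part)

    @property
    def parent(self):
        return StructPath(*self.parts[:-1])

    def __eq__(self, other):
        return isinstance(other, StructPath) and self.parts == other.parts

    def __hash__(self):
        return hash(self.parts)

    def __repr__(self):
        return 'StructPath({})'.format(
            ', '.join(repr(part) for part in self.parts))

path = StructPath() / 'releases' / '3.5.1' / 0 / 'size'
print(path, path.parent)
```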

Tuple analysis

Handle tuples separately from lists. Lists are expected to be homogeneous structures, whereas tuples should be treated as heterogeneous, almost like an integer keyed dict (or str for namedtuples?). Open questions surrounding a heterogeneous list of differently structured tuples (reject? simplify?)
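
Treating tuples like an integer-keyed dict could start from a per-position summary such as the sketch below. tuple_shape is a hypothetical helper, and it simply gives up on tuples of differing lengths, which sidesteps the open questions above without answering them:

```python
def tuple_shape(rows):
    """Map each tuple position to the sorted type names seen there."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        return None  # no rows, or differently sized tuples
    (length,) = lengths
    return {
        index: sorted({type(row[index]).__name__ for row in rows})
        for index in range(length)
    }

rows = [('ISS', 7, 408.5), ('Tiangong', 3, 389.0)]
shape = tuple_shape(rows)
print(shape)
```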
