waveform80 / structa

A small utility for analyzing data structures (e.g. JSON files)

Home Page: https://structa.readthedocs.io/

License: Other

Makefile 1.25% Python 94.95% XSLT 3.80%

data-analysis json csv yaml data-visualization datawrangling datajournalism

structa's Introduction

structa

structa is a small, semi-magical utility for discerning the "overall structure" of large data files. Typically this is something like a document oriented database in JSON format, or a CSV file of a database dump, or a YAML document.

Usage

Use from the command line:

structa <filename>

The usual --help and --version switches are available for more information. The full documentation may also help in understanding the myriad switches!

Examples

The People in Space API shows the number of people currently in space, and their names and craft name:

curl -s http://api.open-notify.org/astros.json | structa

Output:

{
    'message': str range="success" pattern="success",
    'number': int range=10,
    'people': [
        {
            'craft': str range="ISS".."Tiangong",
            'name': str range="Akihiko Hoshide".."Thomas Pesquet"
        }
    ]
}

The Python Package Index (PyPI) provides a JSON API for packages. You can feed the JSON of several packages to structa to get an idea of the overall structure of these records (when structa is given multiple inputs on the same invocation, it assumes all have a common source):

for pkg in numpy scipy pandas matplotlib structa; do
    curl -s https://pypi.org/pypi/$pkg/json > $pkg.json
done
structa numpy.json scipy.json pandas.json matplotlib.json structa.json

Output:

{
    'info': { str: value },
    'last_serial': int range=11.9M..13.1M,
    'releases': {
        str range="0.1".."3.5.1": [
            {
                'comment_text': str,
                'digests': {
                    'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                    'sha256': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
                },
                'downloads': int range=-1,
                'filename': str,
                'has_sig': bool,
                'md5_digest': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                'packagetype': str range="bdist_wheel".."sdist",
                'python_version': str range="2.4".."source",
                'requires_python': value,
                'size': int range=39.3K..118.4M,
                'upload_time': str of timestamp range=2006-01-09 14:02:01..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S",
                'upload_time_iso_8601': str of timestamp range=2009-04-06 06:19:25..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S.%f%z",
                'url': URL,
                'yanked': bool,
                'yanked_reason': value
            }
        ]
    },
    'urls': [
        {
            'comment_text': str range="",
            'digests': {
                'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                'sha256': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            },
            'downloads': int range=-1,
            'filename': str,
            'has_sig': bool,
            'md5_digest': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
            'packagetype': str range="bdist_wheel".."sdist",
            'python_version': str range="cp310".."source",
            'requires_python': value,
            'size': int range=47.2K..55.6M,
            'upload_time': str of timestamp range=2021-10-27 23:57:01..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S",
            'upload_time_iso_8601': str of timestamp range=2021-10-27 23:57:01..2022-03-10 16:45:20 pattern="%Y-%m-%dT%H:%M:%S.%f%z",
            'url': URL,
            'yanked': bool,
            'yanked_reason': value
        }
    ],
    'vulnerabilities': [ empty ]
}

The Ubuntu Security Notices database contains the list of all security issues in releases of Ubuntu (warning: this one takes some time to analyze and eats about a gigabyte of RAM while doing so):

curl -s https://usn.ubuntu.com/usn-db/database.json | structa

Output:

{
    str range="1430-1".."4630-1" pattern="dddd-d": {
        'action'?: str,
        'cves': [ str ],
        'description': str,
        'id': str range="1430-1".."4630-1" pattern="dddd-d",
        'isummary'?: str,
        'releases': {
            str range="artful".."zesty": {
                'allbinaries'?: {
                    str: { 'version': str }
                },
                'archs'?: {
                    str range="all".."source": {
                        'urls': {
                            URL: {
                                'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                                'size': int range=20..1.2G
                            }
                        }
                    }
                },
                'binaries': {
                    str: { 'version': str }
                },
                'sources': {
                    str: {
                        'description': str,
                        'version': str
                    }
                }
            }
        },
        'summary': str,
        'timestamp': float of timestamp range=2012-04-27 12:57:41..2020-11-11 18:01:48,
        'title': str
    }
}

structa's Issues

Progress output

Add some indication that structa is doing something; it takes bloody ages to analyze 100+ MB of JSON on a Pi

Common sub-tree elimination

Some complex cases (USN example being one) result in common sub-structures which could be eliminated with some slightly more intelligent comparison methods

Add URL support

Might be handy to be able to stuff a URL straight on the CLI; minor enhancement

Merge type-instances correctly

Currently, when a Dict with Field keys is merged with a Dict with, say, Scalar keys, the result duplicates the typed entries and the subsequent sort (intended for fields) fails (because Scalar entries are unsortable):

Traceback (most recent call last):
  File "/home/dave/envs/structa/bin/structa", line 33, in <module>
    sys.exit(load_entry_point('structa', 'console_scripts', 'structa')())
  File "/home/dave/projects/structa/structa/ui/cli.py", line 30, in main
    structure = get_structure(config)
  File "/home/dave/projects/structa/structa/ui/cli.py", line 230, in get_structure
    result = analyzer.merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 272, in merge
    return self._merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 315, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 316, in <listcomp>
    self._merge(item)
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 305, in _merge
    DictField(self._merge(keys), self._merge(sum(
  File "/home/dave/projects/structa/structa/types.py", line 301, in __add__
    result = super().__add__(other)
  File "/home/dave/projects/structa/structa/types.py", line 248, in __add__
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 248, in <listcomp>
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 337, in __add__
    self.value + other.value)
  File "/home/dave/projects/structa/structa/types.py", line 305, in __add__
    result.content = sorted(result.content, key=attrgetter('key'))
TypeError: '<' not supported between instances of 'Str' and 'Str'
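
The final TypeError can be reproduced in isolation. The sketch below is not structa's code: `Str` and `Field` are hypothetical stand-ins, assuming only that the type-instance keys define no ordering. It shows why sorting by key fails, and one possible way to avoid it by sorting only the plain-valued keys.

```python
from operator import attrgetter


class Str:
    """Hypothetical stand-in for a type-instance key; defines no ordering."""


class Field:
    """Hypothetical stand-in for a dict field with a sortable or typed key."""

    def __init__(self, key):
        self.key = key


fields = [Field("b"), Field(Str()), Field("a"), Field(Str())]

# Sorting entries whose keys have no __lt__ raises the same TypeError
typed = [f for f in fields if isinstance(f.key, Str)]
try:
    sorted(typed, key=attrgetter("key"))
except TypeError as exc:
    error = str(exc)

# One possible workaround: sort only the plain keys, append typed keys as-is
plain = sorted(
    (f for f in fields if not isinstance(f.key, Str)),
    key=attrgetter("key"),
)
ordered = plain + typed
```

This only addresses the failing sort; the duplicated typed entries described above would still need to be merged separately.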

Improve merging of str patterns

When the merge phase combines Str patterns, if either is undefined or the lengths are unequal, we just ditch the entire pattern. This could be more refined, e.g. zip_longest and mark the suffix regex-optional, although that raises the question of what to do about combining optional suffixes, or whether such a thing is even valid. Still, there's almost certainly room for some improvement here.
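
A rough sketch of how the zip_longest idea might look. It is not structa's implementation: it assumes a pattern is a plain list holding one set of permitted characters per position, and `merge_patterns` and the `OPTIONAL` marker are hypothetical names.

```python
from itertools import zip_longest

# Hypothetical representation: a pattern is a list with one set of
# permitted characters per position. An empty string in a set marks
# the position as optional (it may be absent from some strings).
OPTIONAL = ""


def merge_patterns(a, b):
    """Merge two per-position patterns of possibly unequal length."""
    merged = []
    for left, right in zip_longest(a, b):
        if left is None or right is None:
            # Only one pattern reaches this position: keep its
            # characters and mark the position optional.
            present = left if left is not None else right
            merged.append(present | {OPTIONAL})
        else:
            merged.append(left | right)
    return merged


short = [{"1", "2"}, {"-"}, {"1"}]
longer = [{"3"}, {"-"}, {"2"}, {"0"}]
result = merge_patterns(short, longer)
```

In this sketch a position that is already optional stays optional after a further merge, because the union keeps the marker. Whether that is the right answer for combining optional suffixes is exactly the open question.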

Table structure

Tried going down the namedtuple route, which turned out to be the wrong idea. We need a specific list-of-tuples type to represent table structures, with options to specify header (& footer?) rows, and its own output representation.

Improve StrRepr for numeric patterns

The pattern handling in StrRepr is a bit rubbish: it deals only with trivial bases for ints, and has no clue about floating-point representations. We should have the ability to derive a %-format from a cohort of strings (just as we do for DateTime) for the Int and Float classes.

The Int detection you already know how to do; Float can be figured out. The tricky bit will be handling __eq__ and __str__ for those patterns. The current base merging is easy enough, but handling 0-prefix and length padding may turn out to be non-trivial, e.g. it needs to know that '%08x' == '%10x' and that '%08x' + '%10x' == '%08x', that '%#x' != '%x', etc.
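
As a minimal illustration of deriving a %-format from a cohort of strings, the sketch below handles lowercase hexadecimal only. `derive_hex_format` is a hypothetical helper, not part of structa, and it deliberately gives up (returns None) on the harder cases such as mixed padding.

```python
HEX_DIGITS = set("0123456789abcdef")


def derive_hex_format(samples):
    """Guess a %-format for a cohort of lowercase hex strings, or None."""
    if not samples or any(not s or set(s) - HEX_DIGITS for s in samples):
        return None  # not a cohort of lowercase hex strings
    widths = {len(s) for s in samples}
    padded = any(len(s) > 1 and s.startswith("0") for s in samples)
    if len(widths) == 1 and padded:
        # Fixed width with leading zeros suggests zero padding
        return "%0{}x".format(widths.pop())
    if not padded:
        return "%x"
    return None  # varying widths with leading zeros: no single format fits


fixed = ["00ff", "0a1b", "ffff"]
fmt = derive_hex_format(fixed)

# A derived format should reproduce every sample it was derived from
round_trips = all(fmt % int(s, 16) == s for s in fixed)
```

The round-trip check at the end is one way to validate a candidate format before accepting it.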

HTML output

ANSI codes for the tty at least; maybe HTML?

Variable date/time epoch/scale

For analysis of Excel-exported datetimes (1900-01-01 epoch, 1/86400 scale)
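
A sketch of what a variable epoch and scale could mean in practice. `to_datetime` is a hypothetical helper, and its `scale` parameter is assumed to be the number of seconds per unit (so 86400 for day-based values, the inverse of the 1/86400 figure above). The Excel epoch used here is 1899-12-30 rather than 1900-01-01: Excel numbers 1900-01-01 as day 1 and treats 1900 as a leap year, so the shifted epoch gives correct results for dates from 1900-03-01 onwards.

```python
from datetime import datetime, timedelta

UNIX_EPOCH = datetime(1970, 1, 1)
# Assumption: shifted epoch compensating for Excel's day numbering
EXCEL_EPOCH = datetime(1899, 12, 30)


def to_datetime(value, epoch=UNIX_EPOCH, scale=1.0):
    """Convert a number to a naive datetime.

    epoch is the datetime represented by 0; scale is seconds per unit.
    """
    return epoch + timedelta(seconds=value * scale)


# The same instant expressed as an Excel serial date and a Unix timestamp
from_excel = to_datetime(44197.5, epoch=EXCEL_EPOCH, scale=86400)
from_unix = to_datetime(1609502400)
```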

Multi-file input

Handle multiple input files as if an array of their content (stripping the outer array from the output); mostly already implemented but needs output refinement on the ui-work branch.

CSV handling

Add parsing of CSV files including detection of quoting and field separators (and column headers? should be easy enough). Depends on #4
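
Python's standard csv module already offers heuristics for this, which might serve as a starting point. The sketch below is not structa's implementation; the sample data and the candidate delimiter list are made up for illustration.

```python
import csv
import io

# Made-up sample: semicolon-separated with a header row
sample = "name;age\nalice;30\nbob;25\n"

sniffer = csv.Sniffer()
# Restrict the guess to a few plausible field separators
dialect = sniffer.sniff(sample, delimiters=",;\t|")
# has_header is a heuristic and can guess wrong on small samples
has_header = sniffer.has_header(sample)

rows = list(csv.reader(io.StringIO(sample), dialect))
```

Because both calls are heuristics, real input would probably need command-line switches to override the guesses.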

Documentation

'nuff said

XLS support

After a little scouting around it appears it's still a sadly common format for several data sources (ONS!). Add it as an optional dependency like YAML.

Sub-tree re-work

Add a method to permit forcing the type of a given node (and thus re-analysis of its sub-nodes). To be used in an eventual interactive implementation permitting users to force, e.g. a str of int to be treated as a straight str.

YAML handling

It's a horrid format, but popular enough, and easy enough to add, that supporting it is probably warranted.

XML handling

Add the ability to parse and analyze XML elements; something along the lines of a compound structure with a dict for attrs, a list for child elems, and a couple of str attributes for text and tail (a la etree)
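
The compound structure described above could be sketched as follows, using the standard library's ElementTree. `element_to_struct` is a hypothetical helper, and the extra "tag" entry is an assumption not mentioned in the issue.

```python
import xml.etree.ElementTree as ET


def element_to_struct(elem):
    """Convert an Element into plain dicts, lists and strings."""
    return {
        "tag": elem.tag,               # assumption: keep the tag name too
        "attrs": dict(elem.attrib),    # dict for attributes
        "children": [element_to_struct(child) for child in elem],
        "text": elem.text,             # text before the first child
        "tail": elem.tail,             # text after the closing tag
    }


root = ET.fromstring('<doc id="1">hello<item n="a">x</item> world</doc>')
struct = element_to_struct(root)
```

The result contains only dicts, lists, strings and None, so it could be fed to the same analysis used for JSON input.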

Better handling of null/None

At the moment, the --empty-threshold handling applies to blank strings only. This should be extended to include None (or null in JavaScript et al.) so that things like a bunch of strings with a few None entries are treated as "optional str" rather than as "value".

Path re-factor

The analysis path is currently a tuple of structures, which is okay, but something pathlib-esque would be preferable to make the code a bit more obvious.

Tuple analysis

Handle tuples separately from lists. Lists are expected to be homogeneous structures, whereas tuples should be treated as heterogeneous, almost like an integer keyed dict (or str for namedtuples?). Open questions surrounding a heterogeneous list of differently structured tuples (reject? simplify?)
