waveform80 / structa

A small utility for analyzing data structures (e.g. JSON files)

Home Page: https://structa.readthedocs.io/

License: Other

Languages: Python 94.95%, XSLT 3.80%, Makefile 1.25%

Topics: data-analysis, json, csv, yaml, data-visualization, datawrangling, datajournalism

structa's Issues

XLS support

After a little scouting around it appears it's still a sadly common format for several data sources (ONS!). Add it as an optional dependency like YAML.

Multi-file input

Handle multiple input files as if an array of their content (stripping the outer array from the output); mostly already implemented but needs output refinement on the ui-work branch.
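
As a rough illustration of the "array of their content" idea, the sketch below loads several JSON files into one list. The function name `load_as_array` and the temporary file names are made up for this example and are not part of structa; stripping the outer array from the output is left out because it depends on the output code.

```python
import json
import tempfile
from pathlib import Path


def load_as_array(paths):
    """Load each JSON file and return the documents as one list.

    The outer list only exists because the files were combined, which is
    why a later output stage would want to strip it again.
    """
    combined = []
    for path in paths:
        with open(path, encoding="utf-8") as source:
            combined.append(json.load(source))
    return combined


# Demonstration with two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.json"
    second = Path(tmp) / "b.json"
    first.write_text('{"id": 1}', encoding="utf-8")
    second.write_text('{"id": 2}', encoding="utf-8")
    merged = load_as_array([first, second])
```

Analyzing `merged` as a list should then yield the structure shared by the files.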

XML support

Add the ability to parse and analyze XML elements; something along the lines of a compound structure with a dict for attrs, a list for child elems, and a couple of str attributes for text and tail (à la etree).
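
One possible shape for that compound structure, sketched with the standard library's `xml.etree.ElementTree`. The function name `element_to_struct` and the dict key names (including the extra `tag` key) are assumptions made for illustration, not an existing structa API.

```python
import xml.etree.ElementTree as ET


def element_to_struct(elem):
    """Convert an Element into plain dicts, lists and strings."""
    return {
        "tag": elem.tag,                 # assumption: keep the tag name too
        "attrs": dict(elem.attrib),      # dict of attributes
        "children": [element_to_struct(child) for child in elem],
        "text": elem.text,               # text before the first child, or None
        "tail": elem.tail,               # text after the closing tag, or None
    }


root = ET.fromstring('<doc id="1">hello<item n="2"/>world</doc>')
struct = element_to_struct(root)
```

Because the result contains only dicts, lists, strings and None, it could in principle be fed to the same analysis used for JSON.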

Table structure

Tried going down the namedtuple route, which turned out to be the wrong idea. We need a specific list-of-tuples type to represent table structures, with options to specify header (& footer?) rows, and its own output representation.

Variable date/time epoch/scale

For analysis of Excel-exported datetimes (1900-01-01 epoch, 1/86400 scale)
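
A minimal sketch of what a configurable epoch and scale could look like. The function name `to_datetime` and its parameters are hypothetical. The Excel example uses 1899-12-30 as the effective epoch, a common convention for Excel's 1900 date system that absorbs its treatment of 1900 as a leap year; that choice is an assumption here rather than something this issue specifies.

```python
from datetime import datetime, timedelta


def to_datetime(value, epoch, unit):
    """Convert a number counted in `unit` steps from `epoch` to a datetime."""
    return epoch + value * unit


# Unix timestamps: seconds since 1970-01-01
unix = to_datetime(86400, datetime(1970, 1, 1), timedelta(seconds=1))

# Excel serial dates: days (1 day = 86400 seconds) since the effective epoch
excel = to_datetime(44197.5, datetime(1899, 12, 30), timedelta(days=1))
```

Here `unix` is 1970-01-02 00:00 and `excel` is 2021-01-01 12:00.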

Common sub-tree elimination

Some complex cases (USN example being one) result in common sub-structures which could be eliminated with some slightly more intelligent comparison methods

Improve StrRepr for numeric patterns

The pattern handling in StrRepr is a bit rubbish: it deals only with trivial bases for int, and has no clue about floating-point representations. We should have the ability to derive a %-format from a cohort of strings (just as we do for DateTime) for the Int and Float classes.

The Int detection you already know how to do; Float can be figured out. The tricky bit will be handling __eq__ and __str__ for those patterns. The current base merging is easy enough, but handling 0-prefix and length padding may turn out to be non-trivial, e.g. it needs to know that '%08x' == '%10x' and that '%08x' + '%10x' == '%08x', that '%#x' != '%x', etc.
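
To make the idea of deriving a %-format from a cohort concrete, here is a small sketch restricted to decimal integers. It tries candidate formats and keeps the first one that reproduces every input string exactly. The function name `derive_int_format` is hypothetical, and this is not how StrRepr currently works; hex prefixes, bases and floats are deliberately left out.

```python
import re


def derive_int_format(strings):
    """Guess a %-format that round-trips a cohort of decimal int strings.

    Returns '%d', a zero-padded format such as '%04d', or None when no
    candidate reproduces every string.
    """
    if not strings or not all(re.fullmatch(r"[0-9]+", s) for s in strings):
        return None
    # The shortest string bounds the minimum field width
    width = min(len(s) for s in strings)
    for fmt in ("%d", "%0{}d".format(width)):
        if all(fmt % int(s) == s for s in strings):
            return fmt
    return None


padded = derive_int_format(["0001", "0042", "1234"])
plain = derive_int_format(["1", "42", "1234"])
mixed = derive_int_format(["007", "12"])
```

The round-trip check sidesteps some of the equality questions above, at the cost of only ever considering the candidates listed in the loop.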

Merge type-instances correctly

Currently, when a Dict with Field keys is merged with a Dict with, say, Scalar keys, the result duplicates the typed entries and the subsequent sort (intended for fields) fails (because Scalar entries are unsortable):

Traceback (most recent call last):
  File "/home/dave/envs/structa/bin/structa", line 33, in <module>
    sys.exit(load_entry_point('structa', 'console_scripts', 'structa')())
  File "/home/dave/projects/structa/structa/ui/cli.py", line 30, in main
    structure = get_structure(config)
  File "/home/dave/projects/structa/structa/ui/cli.py", line 230, in get_structure
    result = analyzer.merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 272, in merge
    return self._merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 315, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 316, in <listcomp>
    self._merge(item)
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 305, in _merge
    DictField(self._merge(keys), self._merge(sum(
  File "/home/dave/projects/structa/structa/types.py", line 301, in __add__
    result = super().__add__(other)
  File "/home/dave/projects/structa/structa/types.py", line 248, in __add__
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 248, in <listcomp>
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 337, in __add__
    self.value + other.value)
  File "/home/dave/projects/structa/structa/types.py", line 305, in __add__
    result.content = sorted(result.content, key=attrgetter('key'))
TypeError: '<' not supported between instances of 'Str' and 'Str'
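
The failure mode can be reproduced in isolation. The classes below are hypothetical stand-ins, not structa's real types: `Field` defines an ordering, `Str` does not, so sorting a mix of the two raises the same kind of TypeError. The partitioning at the end is only one conceivable way to avoid the crash and does nothing about the duplicated entries themselves.

```python
from operator import attrgetter


class Field:
    """Hypothetical key type with a defined ordering."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        if not isinstance(other, Field):
            return NotImplemented
        return self.name < other.name


class Str:
    """Hypothetical key type with no ordering at all."""


class Entry:
    def __init__(self, key):
        self.key = key


entries = [Entry(Field("b")), Entry(Str()), Entry(Str()), Entry(Field("a"))]

try:
    sorted(entries, key=attrgetter("key"))
    failed = False
except TypeError:
    failed = True  # mixed keys cannot be compared with '<'

# Possible workaround: sort only the orderable keys, append the rest
fields = sorted(
    (e for e in entries if isinstance(e.key, Field)), key=attrgetter("key")
)
others = [e for e in entries if not isinstance(e.key, Field)]
ordered = fields + others
```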

Tuple analysis

Handle tuples separately from lists. Lists are expected to be homogeneous structures, whereas tuples should be treated as heterogeneous, almost like an integer keyed dict (or str for namedtuples?). Open questions surrounding a heterogeneous list of differently structured tuples (reject? simplify?)
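
The "integer keyed dict (or str for namedtuples)" view could look roughly like this. The helper name `tuple_as_mapping` is made up for illustration; checking for `_fields` is the usual way to recognize a namedtuple instance.

```python
from collections import namedtuple


def tuple_as_mapping(value):
    """View a tuple as a dict keyed by position, or by field name
    when the tuple is a namedtuple."""
    fields = getattr(value, "_fields", None)
    if fields is not None:
        return dict(zip(fields, value))
    return dict(enumerate(value))


Point = namedtuple("Point", ["x", "y"])
plain = tuple_as_mapping(("alice", 30, True))
named = tuple_as_mapping(Point(1.5, 2.5))
```

Each position then gets its own analysis, much as each key of a dict does, which is what lets the elements be heterogeneous.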

Sub-tree re-work

Add a method to permit forcing the type of a given node (and thus re-analysis of its sub-nodes). To be used in an eventual interactive implementation permitting users to force, e.g. a str of int to be treated as a straight str.

Documentation

'nuff said

YAML handling

It's a horrid format, but it's popular enough, and easy enough to support, that adding it is probably warranted.
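
A sketch of how an optional YAML dependency is commonly handled, assuming PyYAML as the parser. The `load_yaml` wrapper and its error message are hypothetical; `yaml.safe_load` is PyYAML's loader for untrusted input.

```python
try:
    import yaml  # PyYAML, treated here as an optional dependency
except ImportError:
    yaml = None


def load_yaml(text):
    """Parse YAML text, failing clearly when PyYAML is not installed."""
    if yaml is None:
        raise RuntimeError("YAML support requires the optional PyYAML package")
    return yaml.safe_load(text)
```

With this pattern the tool still starts when the parser is missing, and only YAML input fails.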

Improve merging of str patterns

When the merge phase combines Str patterns, if either is undefined or their lengths differ, we just ditch the entire pattern. This could be more refined, e.g. use zip_longest and mark the suffix regex-optional, although that raises the question of what to do about combining optional suffixes, or whether such a thing is even valid. Still, there's almost certainly room for some improvement here.
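
A sketch of the zip_longest idea, under the assumption that a pattern is a list of `(chars, optional)` pairs, one per character position. That representation and the name `merge_patterns` are invented for this example and are not structa's internal format. Optional flags are combined with a plain `or`, which is only one possible answer to the open question about optional suffixes.

```python
from itertools import zip_longest


def merge_patterns(a, b):
    """Merge two per-position patterns of (chars, optional) pairs.

    Positions present in only one pattern are kept but marked optional.
    """
    merged = []
    for left, right in zip_longest(a, b):
        if left is None or right is None:
            chars, _ = left or right  # whichever side still has positions
            merged.append((chars, True))
        else:
            merged.append((left[0] | right[0], left[1] or right[1]))
    return merged


short = [({"A"}, False), ({"1"}, False)]
longer = [({"B"}, False), ({"2"}, False), ({"x"}, False)]
result = merge_patterns(short, longer)
```

In `result` the first two positions accept characters from either input, and the third is carried over from the longer pattern as optional.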

Better handling of null/None

At the moment, the --empty-threshold handling applies to blank strings only. This should be extended to include None (null in JSON, JavaScript, et al.) so that a bunch of strings with a few None entries isn't treated as "value" but as "optional str".
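
A sketch of the proposed behavior, counting None together with blank strings against a threshold. The function name `describe`, the returned labels and the default threshold of 0.5 are all assumptions chosen for illustration, not structa's actual logic or defaults.

```python
def describe(values, empty_threshold=0.5):
    """Label a sample as 'str', 'optional str' or 'value'.

    None and '' both count as empty; the sample stays a string type only
    when the empty proportion is within the threshold.
    """
    empties = [v for v in values if v is None or v == ""]
    present = [v for v in values if v is not None and v != ""]
    if not present or not all(isinstance(v, str) for v in present):
        return "value"
    if not empties:
        return "str"
    if len(empties) / len(values) <= empty_threshold:
        return "optional str"
    return "value"


few_missing = describe(["a", "b", None, "c"])
mostly_missing = describe([None, None, None, "a"])
```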
