waveform80 / structa
A small utility for analyzing data structures (e.g. JSON files)
Home Page: https://structa.readthedocs.io/
License: Other
After a little scouting around it appears it's still a sadly common format for several data sources (ONS!). Add it as an optional dependency like YAML.
Handle multiple input files as if an array of their content (stripping the outer array from the output); mostly already implemented but needs output refinement on the ui-work branch.
Add the ability to parse and analyze XML elements; something along the lines of a compound structure with a dict for attrs, a list for child elems, and a couple of str attributes for text and tail (a la etree)
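A rough sketch of that compound structure, using the standard library's xml.etree.ElementTree. The function name `element_to_struct` and the key names are assumptions made for illustration, not anything structa defines.

```python
import xml.etree.ElementTree as ET


def element_to_struct(elem):
    """Convert an Element into plain dicts, lists and strings."""
    return {
        'tag': elem.tag,
        'attrs': dict(elem.attrib),                         # dict for attributes
        'children': [element_to_struct(c) for c in elem],   # list of child elems
        'text': elem.text,   # text before the first child (may be None)
        'tail': elem.tail,   # text after the element's end tag (may be None)
    }


root = ET.fromstring('<doc id="1">hello<item/>world</doc>')
struct = element_to_struct(root)
```

The result is made of ordinary containers, so in principle it could be fed to the same analysis as parsed JSON.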
Tried going down the namedtuple route, which turned out to be the wrong idea. We need a specific list-of-tuples type to represent table structures, with options to specify header (& footer?) rows, and its own output representation.
For analysis of Excel-exported datetimes (1900-01-01 epoch, 1/86400 scale)
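A minimal sketch of the conversion involved; `excel_serial_to_datetime` is a hypothetical helper, not part of structa. Serial 1 is 1900-01-01 in Excel, but Excel also counts a non-existent 1900-02-29, so for serials from 61 (1900-03-01) onward the effective base is 1899-12-30. The sketch only handles that range.

```python
from datetime import datetime, timedelta

# Effective base for serials >= 61 (assumption: 1900 date system, not 1904)
EXCEL_BASE = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial (days, with fractional days) to a datetime."""
    if serial < 61:
        raise ValueError('serials before 1900-03-01 are not handled here')
    # The fractional part is a fraction of a day, so 1/86400 is one second
    return EXCEL_BASE + timedelta(days=serial)
```

For example, serial 25569 corresponds to 1970-01-01, and 25569.5 to noon on the same day.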
Some complex cases (USN example being one) result in common sub-structures which could be eliminated with some slightly more intelligent comparison methods
The pattern handling in StrRepr is a bit rubbish: it handles only trivial bases for int, and has no clue about floating-point representations. We should have the ability to derive a %-format from a cohort of strings (just as we do for DateTime) for the Int and Float classes.
The Int detection you already know how to do; Float can be figured out. The tricky bit will be handling __eq__ and __str__ for those patterns. The current base merging is easy enough, but handling 0-prefix and length padding may turn out to be non-trivial, e.g. it needs to know that '%08x' == '%10x' and that '%08x' + '%10x' == '%08x', that '%#x' != '%x', etc.
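One way the Int half might start, as a sketch only: `derive_int_format` is a hypothetical function, it covers just unsigned decimal and lowercase hex, and it ignores the __eq__ and merging questions entirely.

```python
def derive_int_format(samples):
    """Guess a %-format that reproduces every string in samples, else None."""
    samples = list(samples)
    if not samples:
        return None
    # Try decimal first, then lowercase hex (assumption: no sign, no '0x')
    for conv, digits in (('d', '0123456789'), ('x', '0123456789abcdef')):
        allowed = set(digits)
        if all(s.lstrip(' ') and set(s.lstrip(' ')) <= allowed
               for s in samples):
            break
    else:
        return None
    widths = {len(s) for s in samples}
    if len(widths) == 1:
        width = widths.pop()
        # Fixed width with leading spaces suggests space padding
        if any(s.startswith(' ') for s in samples):
            return f'%{width}{conv}'
        # Fixed width with leading zeros suggests zero padding
        if any(len(s) > 1 and s.startswith('0') for s in samples):
            return f'%0{width}{conv}'
    return f'%{conv}'
```

A cohort like '0001', '0042', '1234' would come out as '%04d', while strings of varying length fall back to the bare conversion.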
Currently, when a Dict with Field keys is merged with a Dict with, say, Scalar keys, the result duplicates the typed entries and the subsequent sort (intended for fields) fails (because Scalar entries are unsortable):
Traceback (most recent call last):
  File "/home/dave/envs/structa/bin/structa", line 33, in <module>
    sys.exit(load_entry_point('structa', 'console_scripts', 'structa')())
  File "/home/dave/projects/structa/structa/ui/cli.py", line 30, in main
    structure = get_structure(config)
  File "/home/dave/projects/structa/structa/ui/cli.py", line 230, in get_structure
    result = analyzer.merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 272, in merge
    return self._merge(struct)
  File "/home/dave/projects/structa/structa/analyzer.py", line 315, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 316, in <listcomp>
    self._merge(item)
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 310, in _merge
    return path.with_content([
  File "/home/dave/projects/structa/structa/analyzer.py", line 311, in <listcomp>
    DictField(self._merge(field.key), self._merge(field.value))
  File "/home/dave/projects/structa/structa/analyzer.py", line 305, in _merge
    DictField(self._merge(keys), self._merge(sum(
  File "/home/dave/projects/structa/structa/types.py", line 301, in __add__
    result = super().__add__(other)
  File "/home/dave/projects/structa/structa/types.py", line 248, in __add__
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 248, in <listcomp>
    result.content = [a + b for a, b in self._zip(other)]
  File "/home/dave/projects/structa/structa/types.py", line 337, in __add__
    self.value + other.value)
  File "/home/dave/projects/structa/structa/types.py", line 305, in __add__
    result.content = sorted(result.content, key=attrgetter('key'))
TypeError: '<' not supported between instances of 'Str' and 'Str'
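The failure mode can be reproduced in isolation. In the sketch below, `Str` and `DictField` are simplified stand-ins rather than the real classes from structa/types.py, and `fallback_key` is a hypothetical workaround that sorts on type name and repr instead of relying on `<`. It only sidesteps the crash and does nothing about the duplicated entries.

```python
from operator import attrgetter


class Str:
    """Stand-in for a pattern type that defines no ordering."""
    def __init__(self, pattern):
        self.pattern = pattern

    def __repr__(self):
        return f'Str({self.pattern!r})'


class DictField:
    """Stand-in for a key/value pair inside a dict structure."""
    def __init__(self, key, value=None):
        self.key = key
        self.value = value


def fallback_key(field):
    # Sort on (type name, repr) so keys without __lt__ can still be ordered
    return (type(field.key).__name__, repr(field.key))


fields = [DictField(Str('b')), DictField(Str('a'))]
try:
    sorted(fields, key=attrgetter('key'))    # same failure as the traceback
    sort_failed = False
except TypeError:
    sort_failed = True
ordered = sorted(fields, key=fallback_key)   # deterministic, no __lt__ needed
```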
Handle tuples separately from lists. Lists are expected to be homogeneous structures, whereas tuples should be treated as heterogeneous, almost like an integer keyed dict (or str for namedtuples?). Open questions surrounding a heterogeneous list of differently structured tuples (reject? simplify?)
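The "integer keyed dict" view of a tuple can be sketched in a few lines. `tuple_as_mapping` is a hypothetical helper, and using field names as str keys for namedtuples is just one answer to the question above.

```python
from collections import namedtuple


def tuple_as_mapping(value):
    """View a tuple as a dict keyed by position (or by field name)."""
    if hasattr(value, '_fields'):
        # namedtuple: use the field names as str keys
        return dict(zip(value._fields, value))
    # plain tuple: use the integer positions as keys
    return dict(enumerate(value))


Point = namedtuple('Point', 'x y')
plain = tuple_as_mapping(('abc', 1.5))
named = tuple_as_mapping(Point(1, 2))
```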
Add a method to permit forcing the type of a given node (and thus re-analysis of its sub-nodes). To be used in an eventual interactive implementation permitting users to force, e.g. a str of int to be treated as a straight str.
'nuff said
It's a horrid format, but popular enough, and easy enough to add that it's probably warranted adding it.
When the merge phase combines Str patterns, if either is undefined or unequal lengths are found, we just ditch the entire pattern. This could be more refined, e.g. zip_longest and mark the suffix regex-optional, although that raises the question of what to do about combining optional suffixes, or if such a thing is even valid. Still, there's almost certainly room for some improvement here.
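A sketch of the zip_longest idea, under a made-up representation: each pattern is a list of (characters, optional) pairs, one per position. None of these names come from structa. Because a missing position counts as optional, merging two patterns that already have optional suffixes just yields the longer optional suffix.

```python
import re
from itertools import zip_longest


def pattern_of(s):
    """Build a per-position pattern from a single sample string."""
    return [(frozenset(c), False) for c in s]


def merge_patterns(a, b):
    """Merge position by position; positions one side lacks become optional."""
    missing = (frozenset(), True)
    return [
        (chars_a | chars_b, opt_a or opt_b)
        for (chars_a, opt_a), (chars_b, opt_b)
        in zip_longest(a, b, fillvalue=missing)
    ]


def to_regex(pattern):
    """Render as a regex, nesting optional positions so only a suffix can go."""
    regex = ''
    for chars, optional in reversed(pattern):
        cls = '[' + ''.join(re.escape(c) for c in sorted(chars)) + ']'
        regex = f'(?:{cls}{regex})?' if optional else cls + regex
    return regex


merged = merge_patterns(pattern_of('ab'), pattern_of('abcd'))
regex = to_regex(merged)
```

The nesting means 'ab', 'abc' and 'abcd' all match, but 'abd' does not, since the fourth position can only appear if the third does.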
At the moment, the --empty-threshold handling applies to blank strings only. This should be extended to include None (or null in JavaScript et al) so that things like a bunch of strings with a few None entries don't get treated as "value" but as "optional str".
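Roughly what the extended behaviour might look like. This sketch assumes the threshold is the largest proportion of empty entries tolerated. `classify_strings`, its default threshold and its return labels are all illustrative, not structa's actual code.

```python
def classify_strings(values, empty_threshold=0.99):
    """Label a cohort as 'str', 'optional str' or 'value'."""
    values = list(values)

    def is_empty(v):
        # Treat None the same as a blank string
        return v is None or v == ''

    empties = sum(1 for v in values if is_empty(v))
    others = [v for v in values if not is_empty(v)]
    if not all(isinstance(v, str) for v in others):
        return 'value'            # genuinely mixed types
    if empties == 0:
        return 'str'
    if empties / len(values) <= empty_threshold:
        return 'optional str'     # strings with a tolerable share of empties
    return 'value'
```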
ANSI codes for the tty at least; maybe HTML?
Add some indication that structa is doing something; it takes bloody ages to analyze 100+MB of JSON on a Pi
Analysis path is currently a tuple of structures, which is okay, but it would be preferable to have something pathlib-esque to make the code a bit more obvious.
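Something pathlib-esque could stay a tuple underneath. In this sketch `StructPath` and its members are hypothetical names, borrowing the `/` operator, `parent` and `name` from pathlib.

```python
class StructPath(tuple):
    """A tuple of path components with a few pathlib-like conveniences."""

    def __truediv__(self, child):
        # path / child returns a new, longer path
        return StructPath(self + (child,))

    @property
    def parent(self):
        # Everything except the last component
        return StructPath(self[:-1])

    @property
    def name(self):
        # The last component (raises IndexError on an empty path)
        return self[-1]


path = StructPath() / 'root' / 'items'
```

Being a tuple subclass, it still compares equal to the plain tuples used today, which might ease a gradual switch.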
Might be handy to be able to stuff a URL straight on the CLI; minor enhancement
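A minimal sketch of how a CLI argument might be dispatched, using only the standard library. `is_url` and `open_source` are hypothetical names, and restricting to http and https is an assumption.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def is_url(source):
    """True if the argument looks like an http(s) URL, not a filename."""
    return urlsplit(source).scheme in ('http', 'https')


def open_source(source):
    """Return a binary file-like object for a URL or a local path."""
    if is_url(source):
        return urlopen(source)
    return open(source, 'rb')
```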
Add parsing of CSV files including detection of quoting and field separators (and column headers? should be easy enough). Depends on #4
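The standard library's csv.Sniffer already covers a fair bit of this. A sketch, where `sniff_csv` and the candidate delimiter set are assumptions; note the Sniffer's heuristics can misfire on small or irregular samples.

```python
import csv


def sniff_csv(sample, delimiters=',;\t|'):
    """Guess (delimiter, quotechar, has_header) from a text sample."""
    sniffer = csv.Sniffer()
    # Raises csv.Error if no delimiter can be determined
    dialect = sniffer.sniff(sample, delimiters=delimiters)
    return dialect.delimiter, dialect.quotechar, sniffer.has_header(sample)


sample = (
    'name;age;city\n'
    'alice;30;london\n'
    'bob;25;paris\n'
    'carol;41;rome\n'
)
delimiter, quotechar, has_header = sniff_csv(sample)
```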