clamsproject / mmif Goto Github PK

View Code? Open in Web Editor NEW

4.0 6.0 1.0 22.81 MB

MultiMedia Interchange Format

License: Apache License 2.0

Python 83.41% CSS 16.59%

mmif's Introduction

MMIF

Repository for MMIF specifications, MMIF schema and the CLAMS vocabulary.

To create a new version use the build.py script:

$ python build.py

This creates a new version in docs/VERSION where the version is taken from the VERSION file.

Local build and preview

HTML files generated from build.py will be deployed to a github.io page. The base webpage where all the versioned specifications reside is deployed via the jekyll engine. That is, to test and preview a local build, one needs to install jekyll for local serving, which in turn, requires ruby. Install ruby following this documentation. jekyll wants ruby>=2.5, but ruby is shipped with bundle/bundler (THE dependency management utility for ruby) only since 2.6, hence installing 2.6 or newer is preferred. For 2.5, one needs to manually install bundler after installing ruby.

Once ruby and bundler are ready,

bundle env | grep Bin # this is where jekyll is installed
cd docs
rm Gemfile.lock
bundle install
jekyll serve # if the bundle Bin dir is not in your $PATH, use absolute path to jekyll binary

should give you a running instance of the MMIF specification website at localhost:4000.

Note that the starting jekyll will download website theme and L&F from the internet (Specifically from https://github.com/clamsproject/website-theme ). So you need an internet connection to get the full preview locally. To start jekyll without a connection (and lose styles) you need to comment remote_theme config line from docs/_config.yaml file before running the jekyll command.

Checklist for deployment

List of things to do when creating a new version:

Update the VERSION file.
Run the build.py script. This will automatically do the following:
- Collect all changes (schema, vocabulary and specifications)
- Update specifications/index.md to replace version numbers.
- Update all the sample files so they all have the right version number.
- Update VERSIONS list in docs/_config.yml.
- Update docs/index.md (date at the bottom).
Test all examples to see whether they match the schema.
Check all pages
Final updates to CHANGELOG.md.
Submit all changes (including ones that were made automatically, like the changes to the config file in the documentation directory) and merge to the main branch.

mmif's People

Contributors

Stargazers

Watchers

Forkers

keighrim

mmif's Issues

Context indignities

This is a gathering space for all issues related to the context. See this gist for an early draft of the context.

Issues:

The context expands id to http://mmif.clams.ai/vocab/Annotation#id, but that means that the identifier for media and views gets the same treatment, maybe expand to something like http://mmif.clams.ai/vocab/mmif#id.

python-sdk bulider

as a subtask of #19 , I'd like to add a builder script to the sdk.
Here's what the parent issue outlined for sdk bulider

sdk builder (currently mmif-python)
- make publish - when there's an update in sdk codebase, version it, package it (setup.py), publish to pypi
- make dev - package sdk and install to local development environment (maybe upload to the lab pypi too?)

re-implement views, media, annotations as dict keyed by ID

For faster access to individual objects, implement those as dicts in python, while keeping the actual serialization still to be lists as described in the specification.

update serialize package based on spec 0.1.x

This thread to track changes to python code to be made to incorporate changes in spec 0.1.x, including #8, and others.

Annotation vocabulary shortnames and ID prefixes

Define a formal scheme for deriving unique shortnames and ID prefixes for MMIF vocabulary that is extensible to future LAPPS integration and user-defined ontologies

update python test suite for py-0.2.0

Python code for handling MMIF objects are updated in cabec75, but the test suite we have was written for 0.1.0, and is failing with the updated code. Update test code accordingly.

find a way to share pypi credentials in the team

I have created clamsproject account on official pypi and test pypi. I also set up a nexus based hosted private pypi for uploading development snapshots if needed. I want to have a way to share secrets for pypi and configurations for up-/downloading to/from the private pypi.

Use of "contains" at the top level

This was proposed to contain a list (or probably a set) of annotation types in the MMIF views. You do not know what view those types are in, but it could be used for quick sniffing.

Some disadvantages:

it makes MMIF a little bit more complex
it introduces redundancies

And how much quicker is the sniffing? Without the "contains" on the top level all you would need to do something is loop over the views and collect the keys in the "contains" dictionary.

If we keep it at the top level I would like it better if it were placed inside the metadata dictionary.

spec-0.3.0

Issue to store discussions on a new version and to track changes that we want to make for the next minor version of the specifications. Patch versions will only have changes to the documentation and the informal specifications.

On the list:

Adding boxType to BoundingBox. The property is used in some of the examples. May want to put it on Polygon, but polygonType seems to be a weird name.
Think through the anchoring approach.
Decide on whether to split annotations into anchorings and labelings with the first in a separate segments list with each document. See closed issue #86 for discussion.
The value of Polygon>timePoint is an ID, but we are using it as an integer that reflects a frame number or time measure.
Units are potentially relevant for determining position on a line (time line or character stream) or coordinates in an image. There was some worry that the unit meta data property is overloaded because you cannot use it for more that one of those dimensions. This appears to be not a problem (issue #113), but some notes in the documentation would be good.
The pixels in coordinates are measured from the top left and start at (0,0). Add this to the documentation.
What to do about compression? MMIF files are huge. Tesseract may introduce 3600 identical text documents for a static piece of text on the screen, all associated with an alignment to a bounding box. If the text documents are all the same and the bounding boxes have the same coordinates (but different time points) then we could do with one text document, aligned to one bounding box which instead of one time point has a list of time points. This is the beginning of a move towards defining vide objects. But we do not yet know well enough how to do that. We will experiment with doing parts of this in visualization and PBCore export modules, and worry about updating MMIF later.

Annotation medium metadata property

Maybe remove this now that the medium property has been moved out of contains, together with other properties that are not there anymore, like tool.

`AnnotationProperties.properties` dict doesn't get created when using `View.new_annotation`

When creating a new AnnotationProperties object, the id field gets set, but the "id" key in properties does not get set and in fact the properties dict is never initialized, leading to issues with serialization.

Fixed by adding @property getters and setters to AnnotationProperties and initializing the properties dictionary.

git tagging with version

As now we are just about to release first alpha version (0.1.0), I wonder which tagging scheme we should use for tagging the release in git. Note that we have two different versions (spec version and python sdk version) that are only partially synchronized. So, we can use base, unspecified x.y.z scheme for spec version (as if the spec is the main dish) and more specified x.y.ẑ-python scheme for python sdk. Or we can treat them equally and use specified schemes for both (x.y.z-spec, x.y.ẑ-python) (Of course we can use shorter suffixes as in x.y.zs or x.y.ẑp). Any other idea is also welcomed.

experiment with pydantic integration

pydantic is a popular library that supports mappings between python objects and json schema fields, as well as validation and auto generation of schema from python classes. I'd like to experiment with the library to see if it can be utilized for testing and CI pipelines.

add type hints to various MMIF subclasses

To make the code easier to read and debug, I'd like to have all MMIF subclasses (defined in serialize package) to be statically typed using python type hints.

Context file links to non-existing vocabulary

Context files all use URIs like http://mmif.clams.ai/0.1.0/vocab/.

Change this into http://mmif.clams.ai/0.1.0/vocabulary/.

MMIF sample draft

A sample is at https://github.com/brandeis-llc/mmif-prototype/blob/master/draft/sample.md.

Version 0.2.0

Will have the changes in MMIF as discussed the last few weeks.

Using text documents in views
Using alignments for relating text documents to images or speech segments
Getting rid of the contexts
More example files

I am not sure what major changes in there are in the SDK, but it would be nice to have a checklist here.

Updated to work with new specifications
Added ability to freeze read-only parts of the MMIF file

We decided to use versions like spec-0.2.0 and py-0.2.0, but we never added any tags to the repository for the previous version.

I started a branch spec-0.2.0 for the changes on the specifications end, this will include changes to the publish script.

More unit tests

Issue for ongoing work on unit testing of the SDK.

abstract `annotation` type has duplicate properties

The type defined in the vocabular 0.2.0 https://mmif.clams.ai/0.2.0/vocabulary/Annotation/ has two document properties, one in metadata and one in properties. I think annotation object has no metadata field, but only its equivalent properties field, so I guess metadat::document should be gone?

Version 0.1.0

I would like to publish and release a first version as soon as possible so that we actually have a full specification out there with vocabulary, schema and context files. Part of the motivation here is so that GBH people can actually see the documents that are part of the specifications.

This requires us to merge into the master branch.

Any reasons not to do this?

An alternative would be some process for pushing files to some development website, but that sounds like extra work.

copying resource files by checking out versioned files using git

As we discussed yesterday, we'd like to have resource files (json schema, vocab definitions, ...) copied to python package not by using hard-coded file paths, but by using git ops to checkout those files that have matching versions.

usage of files in `specifications/samples`

I don't see any reference to any of those files from specification document, or any other component. But I see some of those files are used as test resources for in the mmif-python. I'd like to suggest, instead of having duplicates in spec directory and mmif-python/tests directory, we makeMakefile to copy these files at test time so that we reduce maintenance cost as well as keep the MMIF showcase up to date with the python code.

MMIF design discussion

An epic to relate all discussions about design of MMIF

view metadata uses `tool` instead of `app`

Though we decided to consistently use app to indicate individual CL application in the CLAMS platform, there are some places in the current spec where the term tool are used. This issue to track those down and fix them.

Annotation subclassing prototypes

I've made two prototypes for Annotation subclassing: one that pregenerates the subclasses in setup.py, and one that generates them on-the-fly at runtime.

So far the prototypes do not have support for custom annotation types or LIF types. That is next on the agenda for this issue.

`discriminator` in MMIF

In lapps, tools can handle different type of data formats (LIF, gate, UIMA, ...), so we needed to have a discriminator at the very beginning of the I/O to indicate in which format is current input.
I wonder if that's still the case for the MMIF and CLAMS.
If the CLAMS needs to be 100% compatible with all the lapps tools (including those handle non-LIF inputs, for instance UIMA-XML), we can still encode a LIF (with XML encoded) as a whole inside the MMIF - of course it would be ugly though.
Any idea is welcome.

binary distribution on pypi archive is missing non-python files

The binary tar archive generated from setup.py bdist lacks non-python files including schema and vocab file and thus code does not run properly. Need a fix (see https://packaging.python.org/guides/using-manifest-in/#using-manifest-in).

(NB: this is different issue from #48)

meanings of digits in semantic version

As a subtask of clamsproject/mmif-python#12, I'd like to specify what three digits mean in the version space of MMIF.

Versioning principles

It is preferred that versions of sub-components are tied to each other. In an ideal world, all versions are exactly synchronized, but as we discuss here in the below, that is not very practically likely.

Type of changes and using digits

version number format x.y.z: version numbers consist of three digits, each digit is positive integer
major: major changes are breaking changes that are not backward compatible - any major change will increase x digit in the version number by 1.
minor: minor changes add features - any minor change will increase y digit
patch: patches are not adding any feature - any patch will increase z digit

MMIF components

As described in clamsproject/mmif-python#12, there are largely 5 sub-components;

MMIF specification (spec hereinafter)
- either as a set of abstract API, or simply as documentation
MMIF JSON schema (schema hereinafter)
MMIF JSON-LD context (context hereinafter)
MMIF reserved vocabulary (vocab hereinafter)
MMIF serialization API implementation (sdk hereinafter)

td; dr

updates in schema are most likely major
updates in vocab are most likely minor (additions)
updates in sdk are always patches, but not reflected to main spec version
no standalone updates in context
all five sub-components share x and y digtis
spec, schema, context, and vocab share z as well
sdk never make major or minor changes

Areas of change

Changes in `spec`

Any changes in a sub-component would result in change in specification as well, and thus will increase the version number. Changes in specification that are not accompanied by changes in component(s) are not very imaginable, unless changes to fix typographical errors. Fixing typos should be a patch level version bump.

Changes in `schema`

Possible changes include (not limited to)

addition of fields (e.g. introducing new metadata)
re-organization of existing fields (e.g. changes in value types)
deletion of fields (deprecation)

In terms of automatic JSON validation, an addition of a field/term can be non-breaking change unless the new field is required. All other changes are not backward compatible.

Changes in `vocab`

As there is not a forcible way of validating semantics of MMIF string using vocab as validation source, backward compatibility must mean something different here. However, I'd like to suggest all additions (items, props of an item) to the ontology must be non-breaking and minor level version bump-ups.

Changes in `context`

Any addition of items in the vocab or schema should update context as well. But changes originating from context is not so likely.

Changes in `sdk`

In contrast to other components, sdk is actual code, and is very prone to have software bugs. However, as sdk is a mere implementation of what's defined and documented in the spec, unless it's originated from updates of other components, all standalone changes in sdk would be only fixing bugs, optimizing code, or adding supplementary (possibly undocumented) APIs. So those types of updates won't add (or change) any feature to the MMIF spec. But, by its virtue, sdk will receive a lot more patches compared to other components, thus I propose we untie patch number (z) of sdk from MMIF version.

Version binding

In conclusion, we want to stick to the principle of shared version numbers between all relevant components, however, because sdk is more likely to receive frequent patches, patch number of sdk needs to be independent from MMIF version.

Handling MMIF versions in consumers

We don't want to pose any restriction on consumer (e.g. MMIF visualizer) implementation. By publishing sdk to version-controlled package repositories (e.g. pypi), we can expect a specific implementation of a consumer is tied to a specific version of MMIF (at least on x and y level). If a developer of such a consumer wants to support multiple versions of MMIF using our sdk, it might not be trivial as the sdk does not inherently support backward compatibility (by design), so the recommended way of support many versions would be maintaining different versions of the consumer software - which is not ideal (in fact a really bad practice, I believe). This leads us to a new question that might need its own issue card; namely, if it turned out that sdk of older version (say 1.2.18) contains some critical bug after the main MMIF version went up with some major changes (say latest MMIF version is 2.1.0), do we need to fix the old sdk and publish a new version (say 1.2.19)?

factor jekyll theme out as a remote theme

Currently jekyll site configs and layouts are manually copied over from the main project website (https://github.com/clamsproject/clamsproject.github.io). We want to refactor common thematic components to a remote jekyll theme to ease maintenance and addition of other sub-project websites in the future.

implement named and unnamed attributes for MMIF objects

When an object is specified to have `additionalProperties* by the json schema (which is true by default), the user should be able to throw in any key and value pairs into the json object. Among MMIF objects, for example, all metadata objects are allowing additional properties. #61 already touched on the issue with allowing any string for the key name, but current SDK doesn't have a common abstract implementation for those classes that should reserve some key names and named attributes and allow free string names as unnamed attributes. As mentioned in comments in #61, ba0d546 attempts an experimental implementation of the behaviors, and we want 1. to see if that approach is valid for other MMIF objects, and 2. implement such behaviors one way or another.

Interpretation of contains in view metadata

According to the specification, the contains dictionary's keys consist of all the annotation types in that view, with the values being the metadata associated with those annotation types, such as units. By this definition, it would seem that all of the annotations of the same type within a view are assumed to have the same metadata--for instance, you couldn't have two Intervals, one with units in seconds and another with units in frames, in the same view.

This seems like a valid restriction, since it seems like semantically one view would encompass at most one way of looking at intervals in an audio clip or what have you, but I wanted to put it in writing here to get confirmation that this is something we want to enforce.

mmif-python: Documentation

For later: start writing documentation for all exposed objects and methods in the MMIF SDK

Restrictions on keys in medium metadata and annotation properties

#60 (at 5a53b53) implements Annotation.add_property and Medium.add_metadata using setattr (the previous code, which appeared to hint at __setitem__, was not yet implemented). This necessarily requires that all properties in the annotationProperties and mediumMetadata JSON objects be valid Python identifiers, and not just valid strings as before.

If this is a requirement that we are okay with, we can proceed with the change I've implemented in that commit. If we'd prefer not to have that restriction, we can implement it differently (i.e. with a field in the AnnotationProperties and MediumMetadata classes containing a dictionary and some magic methods to wrap around that).

MMIF builder

As a developer of MMIF specifications and SDK, I'd like to have a "builder" script (preferably a single Makefile) to perform several actions upon certain changes in this repository.

add alignment model in the specification

What we released as 0.1.0 does not include the core implementation of alignment between different annotation types that are done on different media types. This thread is for discussing implementations.

@context in a view?

This example file fails validation against the current schema due to the presence of an @context attribute in a view. Should our schema account for @context in a view?

constraining number of media based on its type

In the prototype, we had Audio, Image, Text, and Video types. So the media field in the MMIF can hold multiple medium objects of different types. However, it doesn't sound very right that we have two or more Video or Audio inputs to a single MMIF file. However, as we discussed, if a tool can generate a new medium, that it's easy to imagine that a single MMIF file have multiple Text sources generated from different X-to-text tools (ASR, OCR) in the media list. Code-wise this must not be a big issue to pose some constrains based on the type, but we might want to add some justification and/or theory to the specification on this issue.

Can only deserialize MMIF from str, not dict

Because of the current logic regarding Mmif.validate() and MmifObject.deserialize(), validate() expects @context (matching the JSON-LD schema) but deserialize() expects @context for strings and _context for dicts. This means that when you try to initialize a Mmif object from a dict, if it has @-tags you get a KeyError in Mmif._deserialize() when extracting self._context, but if it has _-tags you get a ValidationError in validate().

One fix for this is to have MmifObject.deserialize() replace @ with _ for both dicts and strings, which requires a minor change to MmifObject._load_str() (which should probably be renamed at that point).

@keighrim do you see any problems with making such a change?

specification versions

In essence, MMIF specification has three versioned components,

JSON schema; that defines syntactic elements in MMIF
LinkedData context; that defines shortcuts for URIs
Type hierarchy (vocabulary); that defines concepts and their ontological relations

Currently the MMIF draft (written as 1.0) describes overall relations between these components as well as giving some details on syntactic structure of the MMIF. In doing so, the document refers not-yet-existing vocabulary and schema as 1.0, but before we concretely define them, I think we need to first decide how to version these different elements, as well as version the MMIF as an overarching entity. In lappsgrid and LIF, we used semantic versioning, although there has been only a small number of versioned changes lif specification. The synchronization of versions of sub-components were never vigorously confined or discussed.

I'm proposing we use sementic versioning, starting from 0.1.0 (as draft) synchronized over all three MMIF elements plus clams-python-sdk as that serialization module uses the same version number as underlying data model.

Changing the structure of annotation types

In LAPPS we had "attype", "id", "start" and "end" (and earlier even "label" and "type") defined directly on the annotation, and then a bunch of properties in the "features" dictionary.

{
  "@type": "Token",
  "id": "t0",
  "start": 0,
  "end": 5,
  "features": { "word": "}
}

This proved confusing to people and it sure was not very consistent. I for one found there was little rhyme or reason to why things were where they were and did not like something being called 'properties' in the vocabulary and 'features' in LIF.

How about we only have "attype" and "properties" and put all properties defined in the vocabulary as well as others in the map.

{
  "@type": "Token",
  "properties": {
    "id": "t0",
    "start": 0,
    "end": 5,
    "word": "Dingo"
}

We may want to make an exception for "id" and put it on the top level since it is required for every annotation.

Possible disadvantage is lower compatibility with LIF.

vocabulary builder

The goal of developing vocab builder is to automate publication of vocabulary as a website. The raw definition of the vocabulary itself (annotation types, object features, ...) must be written in commonly used data format (such as json, yaml, etc) so that anyone can send pull requests for updating the vocab for addition of new types of annotations (and we can use comments in the PR for discussion). The bulider should take a single definition file or a single directory of those files to build HTML files and copy them to docs directory so that github will automatically publish them via http://clamsproject.github.io/mmif. Current prototype (available in vocabulary branch) is using yaml as input.

Assorted fixes in mmif-python

Pathing to VERSION file in publish.py is broken or inconsistent
Circular import issue in mmif.__init__
clams.vocabulary.yaml should be copied into package resources
serialize.View needs reworking

python sdk does not install via pip-install

Due to the way setup.py depending on other files that are not in the python sdk source tree, when pip-installing, build process fails while trying to copy those external files (because they are not in the pypi archive).

Annotation types

This is a thread to discuss annotation types.

The goal of specifying annotation types is to form a vocabulary to use in MMIF annotation. Annotation types can be *hierarchically structured*, as we did in [lapps vocabulary](http://vocab.lappsgrid.org/).

On the top-most, annotations will be anchored either one of:

time offsets (bars & tone, scene segmentation)
character offsets (transcript, tokens, NE)
bounding boxes (face recognition, OCR)

Let's call them time-based annotations (T-based), language-based (L-based), and image-based (I-based), respectively. And then there're annotations that link other annotations (or anchors)

relations, such as coreference: between L-based annotations
forced alignment: L-based to T-based
box alignment? (need a good name): timestamps of the bounding boxes

MMIF sample draft

@keighrim commented on Sat Oct 06 2018

A sample is at https://github.com/brandeis-llc/mmif-prototype/blob/master/draft/sample.md.

@keighrim commented on Sat Oct 06 2018

From Marc's e-mail

Just two remarks on first glance:

one thing I miss is that views or annotations should make explicit what input they are viewing/annotating.
The view with id “v2” has meta data for bar-detection and annotations of @type vanilla-forced-align

Redesigning the vocabulary and MMIF

This was raised in #59, but the issue goes beyond alignment so creating a new issue.

View metadata

For LIF, the "contains" field had a dictionary of types and for each type some metadata like "producer" and other properties defined in the metadata in the vocabulary ("tagSet" etcetera). All metadata were inside of "contains".

We need metadata on the view (outside "contains") like "timestamp" (or "creation-time") and "dependsOn". And once we do that we need a theory on what things are in the "contains" dictionary and what things are not. I would like it if things like "producer" and "tool-version" and "tool-wrapper-version" are defined on the view and not the annotation types in "contains" and reserve the latter for properties defined in the vocabulary metadata. That will only work if a view is created by only one component, which is something we never did for LAPPS, but which is a restriction that I like because (1) views are now read-only after they are created and (2) we lose the redundancy of having "producer" and "tool-version" be repeated for each annotation type.

So I am thinking something like this:

{
  "contains": {
    "http://mmif.clams.ai/vocabulary/1.0/Segment": {
      "unit": "seconds"
    }
  },
  "medium": "m3",
  "timestamp": "2020-05-27T12:23:45",
  "producer": "bars-and-tones",
  "tool-version": "1.0.2",
  "tool-wrapper-version": "1.0.5"
}

At the moment, "producer" and "medium" are in the metadata of "Annotation", but that does not fit nicely with the MMIF above. I am also thinking that we should allow a URI as the value of "producer".

Get `make test` passing on current develop HEAD

Tasks:

Find all roadblocks to passing
Append them here
Fix the following from pytype, which is being discussed in #61 (Note: if we merge #60, this is moot):

unsupported operand type(s) for item assignment: 'AnnotationProperties' [unsupported-operands]
  No attribute '__setitem__' on AnnotationProperties

Fix:

File "/Users/stygg/clamsproject/mmif/mmif-python/mmif/serialize/__init__.py", line 6, in <module>: Name 'annotation' is not defined [name-error]
File "/Users/stygg/clamsproject/mmif/mmif-python/mmif/serialize/__init__.py", line 6, in <module>: No attribute '__all__' on module [attribute-error]
File "/Users/stygg/clamsproject/mmif/mmif-python/mmif/serialize/__init__.py", line 6, in <module>: Name 'view' is not defined [name-error]
File "/Users/stygg/clamsproject/mmif/mmif-python/mmif/serialize/__init__.py", line 6, in <module>: Name 'medium' is not defined [name-error]

Fix: won't-fix, setup.py must be out of pytype scope

File "/Users/stygg/clamsproject/mmif/mmif-python/setup.py", line 88, in <module>: No attribute 'sdist' on module 'setuptools.command' [module-attr]
File "/Users/stygg/clamsproject/mmif/mmif-python/setup.py", line 108, in <module>: No attribute 'setup' on module 'setuptools' [module-attr]
File "/Users/stygg/clamsproject/mmif/mmif-python/setup.py", line 117, in <module>: No attribute 'find_packages' on module 'setuptools' [module-attr]

Fix problems and PR into develop

Future:

Test and fix additional problems for code in #60 before un-drafting

mmif-python `make test` fails right after `make clean`

External resource files (schema and vocab) are copied by make test, and thus many tests fail.

Publishing vocabulary in additional formats

On Wednesday we briefly discussed adding a feature to allow users to use RDF or OWL formats with MMIF. I don't remember exactly how we wanted to support those formats (defining annotation vocabulary? converting to/from JSON-LD? both?), so I wanted to open an issue to discuss that.