ietf-tools / bibxml-service


Django-based Web service implementing IETF BibXML APIs

Home Page: https://bib.ietf.org

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.98% Python 73.06% HTML 16.24% Shell 0.09% CSS 1.24% JavaScript 7.89% Ruby 0.16% Smarty 0.34%

ietf internet-draft rfc xml2rfc bibxml

bibxml-service's Issues

Unable to find "C57-12-20.2011" unless given "IEEE C57-12-20.2011"

If I select the "IEEE" dataset, then enter "IEEE C57-12-20.2011", it works:

But if I enter "C57-12-20.2011", nothing is found:

Having to select the dataset ("IEEE") and then also enter the document identifier with the "IEEE" prefix seems redundant.

This seems to be an issue with the search engine matching strings.

Bring `bibxml2` dataset up to date (in the future)

W3C, ISO, ITU, ANSI, FIPS, CCITT [previous name of ITU], IEEE, OASIS, PKCS

W3C, ISO, ITU, FIPS are all supported by Relaton (the latter two especially are provided by the authoritative parties).

OASIS has indicated they are willing to provide bibdata (in a separate project, but the data can also be useful for IETF).

We will need to figure out CCITT documents (legacy docs from ITU).

PKCS has no authority now (the documents were published by RSA), so we can just move that content into a static dataset.

ANSI we will need to figure out.

Originally from ietf-ribose#10 (comment)

Adjusted `xml2rfc` compatibility path implementation

Requirements (to reiterate):

A valid XML response must be returned for any requested xml2rfc path
If possible (plan A), returned XML response must reflect a citation indexed from one of the canonical data sources, not its original xml2rfc contents
As a fallback (plan B), returned XML response must be the exact file under requested xml2rfc path

Adjusted legacy path implementation is as follows:

When handling an xml2rfc path, BibXML service will transform the path to citation document type & ID (the docid pair in Relaton data)
For example, a path like /public/rfc/bibxml9/reference.BCP.0004.xml should be converted to docid like { "type": "IETF", "id": "IETF BCP 4" }
Manual docid mapping will be used first, if defined for this path
In absence of manual mapping, automatic transformation will be attempted (fallible)
If either of the above fails, BibXML service will return pre-crawled XML file obtained from xml2rfc server
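
The resolution order above can be sketched as follows. This is a minimal illustration, not the service's actual code: `resolve_xml2rfc_path` and its four arguments are hypothetical names, and the lookups are passed in as plain callables so that the sketch stays self-contained.

```python
from typing import Callable, Dict, Optional

# A docid pair as described above, e.g. {"type": "IETF", "id": "IETF BCP 4"}.
DocID = Dict[str, str]


def resolve_xml2rfc_path(
    path: str,
    manual_map: Dict[str, DocID],
    auto_transform: Callable[[str], Optional[DocID]],
    fetch_indexed_xml: Callable[[DocID], Optional[str]],
    read_precrawled_xml: Callable[[str], str],
) -> str:
    """Return XML for an xml2rfc path: plan A (indexed data) first, then plan B."""
    # A manual mapping wins if defined; automatic transformation is fallible
    # and may return None.
    docid = manual_map.get(path) or auto_transform(path)
    if docid is not None:
        xml = fetch_indexed_xml(docid)
        if xml is not None:
            return xml
    # Plan B: serve the exact file pre-crawled from the xml2rfc server.
    return read_precrawled_xml(path)
```

Because plan B always returns something, every requested path yields an XML response, which is the first requirement listed above.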

Implement legacy path pattern: IETF Internet-Drafts (`bibxml3`, `bibxml-id`)

IETF Internet-Drafts (bibxml3, bibxml-id)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/
Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.example-name.xml
Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.draft-example-name-99.xml

Legacy pattern(s) to implement:

Pattern 1: https://{hostname}/public/rfc/bibxml-ids/reference.I-D.{example-name}.xml
Pattern 2: https://{hostname}/public/rfc/bibxml-ids/reference.I-D.draft-{example-name}-{draft-number}.xml

We need to parse the pattern to return the appropriate BibXML content.
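
A possible way to tell the two patterns apart is sketched below. The function and type names are hypothetical, and the two-digit version suffix is an assumption based on the `-99` in the example above.

```python
import re
from typing import NamedTuple, Optional


class DraftRef(NamedTuple):
    name: str               # draft name without the "draft-" prefix
    version: Optional[str]  # version digits, or None for the unversioned form


# Pattern 2: reference.I-D.draft-{example-name}-{draft-number}.xml
_VERSIONED = re.compile(
    r"^reference\.I-D\.draft-(?P<name>.+)-(?P<version>\d{2})\.xml$"
)
# Pattern 1: reference.I-D.{example-name}.xml
_UNVERSIONED = re.compile(r"^reference\.I-D\.(?P<name>.+)\.xml$")


def parse_id_filename(filename: str) -> Optional[DraftRef]:
    """Parse an Internet-Draft legacy filename; None if it matches neither pattern."""
    m = _VERSIONED.match(filename)
    if m:
        return DraftRef(m["name"], m["version"])
    m = _UNVERSIONED.match(filename)
    if m:
        return DraftRef(m["name"], None)
    return None
```

The versioned pattern is tried first, since every versioned filename would otherwise also match the looser unversioned one.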

Add development, maintenance and operations documentation using Sphinx

Originally planned to be done as part of bibxml-project, but now that we have only one codebase repository it may as well go here.

Missing document title makes citation inaccessible via GUI

Currently the BibXML service does not provide an individual "show" page like the other datasets.

RFC subseries:

RFC

RFC individual page:

This is because an "RFC subseries document" contains multiple RFCs. Each "RFC subseries document" carries its own metadata plus one or more RFCs, which form part of the "RFC subseries document".

This relationship is represented as a "document relation" in the Relaton data.

We need to handle this new data structure for display.

GraphQL API

This is a moderately far out idea for now, at least as far as I’m concerned, but GraphQL API is an option that might be very feasible given current architecture.

It has its downsides (e.g., consumers may start depending on citation attributes even though the data structure may change as citation sourcing evolves; inconsistencies between data sources will become more obvious and irritating, since some sources contain more data than others and a finer-grained query may unintentionally exclude citations; higher complexity; etc.), but also some upsides which may outweigh them now or in the near future (although I don’t think GQL should be made the primary supported API).

Show a message when there are no search results

When a search from the web UI has zero matches, show a message indicating that.
Example: https://demo.bibxml.org/search/?query=%2Bieee+foobar

Right now the web UI gives the impression that it's still searching even though there are no matches.

Development UI screenshots

Courtesy of @strogonoff .

Search:

Single record view:

Collection view:

Internal:

Implement legacy paths

We need to support the legacy path patterns for the following datasets.

Datasets

IETF datasets

(bibxml) IETF RFCs: http://xml2rfc.tools.ietf.org/public/rfc/bibxml/
- Pattern: http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.{NNNN}.xml
(bibxml3) IETF Internet-Drafts: http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/
- Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.example-name.xml
- Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.draft-example-name-99.xml
- bibxml-id or bibxml3
(bibxml9) IETF BCP, FYI, STD: http://xml2rfc.tools.ietf.org/public/rfc/bibxml9/
- Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.BCP.0099.xml
- Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.FYI.0099.xml
- Pattern 3: http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.STD.0099.xml
- bibxml-rfcsubseries or bibxml9

Non-IETF datasets

(bibxml2) Misc collection (W3C, ISO, ITU, ANSI, FIPS, CCITT [previous name of ITU], IEEE, OASIS, PKCS): http://xml2rfc.tools.ietf.org/public/rfc/bibxml2/
(bibxml4) W3C: http://xml2rfc.tools.ietf.org/public/rfc/bibxml4/
- Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml4/reference.W3C.REC-example-name-date.xml
(bibxml5) 3GPP: http://xml2rfc.tools.ietf.org/public/rfc/bibxml5/
(bibxml6) IEEE: http://xml2rfc.tools.ietf.org/public/rfc/bibxml6/
- Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml6/reference.IEEE.12345_date.xml
- Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-ieee/reference.IEEE.12345_date.xml
(bibxml8) IANA references (In the old implementation this is a dynamic fetch service, in our implementation we treat this as a dataset)
- Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml8/reference.IANA.example-table-name.xml
  - e.g. https://xml2rfc.tools.ietf.org/public/rfc/bibxml8/reference.IANA.sip-parameters.xml
- Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-iana/reference.IANA.example-table-name.xml
  - e.g. https://xml2rfc.tools.ietf.org/public/rfc/bibxml-iana/reference.IANA.sip-parameters.xml

Services

(bibxml7) DOI references.
- Pattern 1: http://xml2rfc.tools.ietf.org/public/rfc/bibxml7/reference.DOI.{DOI ID}.xml or .kramdown
  - e.g. http://xml2rfc.tools.ietf.org/public/rfc/bibxml7/reference.DOI.10.6028/NIST.SP.800-53r5.xml
- Pattern 2: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-doi/reference.DOI.{DOI ID}.xml or .kramdown
  - e.g. http://xml2rfc.tools.ietf.org/public/rfc/bibxml-doi/reference.DOI.10.6028/NIST.SP.800-53r5.xml

Originally posted by @ronaldtse in ietf-ribose#7 (comment)

Port documentation from `bibxml-indexer`

Specification for standard retrieval API

Here’s our API specification for BibXML service: openapi.yaml. The API is evolving and can change in the coming week or so, but is it overall correct / on the right track?

The API describes two endpoints:

/ref/ for retrieving a single standard’s metadata given dataset ID and standard reference.
- It returns JSON-encoded representation of citation, as it was indexed.
- For DOI, it performs an additional trip to authoritative dataset.
- For other datasets, it uses the index (much faster).
/search/ for querying standards.
- It accepts a JSON object as query.
- The most important part of it is fields, which for now simply matches provided values with whatever is in the index (e.g., { "fields": { "id": 1234, "doctype": "standard" } }).
- The object also can contain a dataset field, without which search is performed across all datasets.
- The object can also contain parameters limit and offset for windowing returned data.
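
To illustrate the query semantics described above, here is a small in-memory sketch. It is not how the service is implemented (the real index presumably lives in a database); the `search` function and the flat shape of the index entries are assumptions made for the example.

```python
from typing import Any, Dict, List


def search(index: List[Dict[str, Any]], query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Match `fields` exactly, optionally restrict to one dataset, then window."""
    fields = query.get("fields", {})
    dataset = query.get("dataset")  # None means: search across all datasets
    offset = query.get("offset", 0)
    limit = query.get("limit")      # None means: no upper bound

    matches = [
        item
        for item in index
        if (dataset is None or item.get("dataset") == dataset)
        and all(item.get(key) == value for key, value in fields.items())
    ]
    end = None if limit is None else offset + limit
    return matches[offset:end]
```

A query such as `{"dataset": "ieee", "fields": {"id": 1234, "doctype": "standard"}, "limit": 10, "offset": 0}` would then return at most ten matching entries from the hypothetical `ieee` dataset.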

Implement legacy path pattern: IANA references (`bibxml8`)

IANA references (bibxml8)

(In the old implementation this is a dynamic fetch service, in our implementation we treat this as a dataset)
Pattern 1: https://xml2rfc.tools.ietf.org/public/rfc/bibxml8/reference.IANA.sip-parameters.xml
Pattern 2: https://xml2rfc.tools.ietf.org/public/rfc/bibxml-iana/reference.IANA.sip-parameters.xml

Legacy pattern(s) to implement:

BibXML data preparation

IEEE

W3C

3GPP

IANA

NIST

RFC Subseries

ietf-ribose/bibxml-data-rfcsubseries#1

Reference identifier patterns across multiple datasets

http://xml2rfc.tools.ietf.org

IETF RFC

http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.NNNN.xml

IETF Internet-Draft

http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.example-name.xml
http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.draft-example-name-99.xml
(Note: Leave draft- off when referencing a generic draft name, and add draft- when referencing a specific version of an internet-draft.)

IETF RFC Subseries

http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.BCP.0099.xml
http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.FYI.0099.xml
http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.STD.0099.xml

W3C

http://xml2rfc.ietf.org/public/rfc/bibxml4/reference.W3C.REC-example-name-date.xml

3GPP

http://xml2rfc.ietf.org/public/rfc/bibxml5/reference.SDO-3GPP.1234.xml
http://xml2rfc.ietf.org/public/rfc/bibxml-3gpp/reference.SDO-3GPP.1234.xml

IEEE

http://xml2rfc.ietf.org/public/rfc/bibxml6/reference.IEEE.12345_date.xml
http://xml2rfc.ietf.org/public/rfc/bibxml-ieee/reference.IEEE.12345_date.xml

Relaton YAML format consistency

From @strogonoff :

@yablokov says the data in Relaton YAML is not consistent across standard data repos

Make management API invocation buttons progressively enhanced & compatible with stricter CSP

“Queue reindex”, “revoke task” buttons don’t work without JS, and the code is contained in <script> within HTML.

Implement legacy path pattern: IETF RFCs (`bibxml`)

IETF RFCs (bibxml)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml/
Pattern: http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.{NNNN}.xml

Legacy pattern(s) to implement:

https://{hostname}/public/rfc/bibxml/reference.RFC.{NNNN}.xml

We need to parse NNNN and return the appropriate RFC NNNN in BibXML format.

Availability of the demo instance

The (in testing) BibXML service is now deployed at:

https://demo.bibxml.org

The data sets of w3c, ieee, iana and nist are available.

The following data sets are still in progress:

rfcs, ids, rfcsubseries, misc and 3gpp

The management interface of the indexer is accessible at: https://demo.bibxml.org:8000. We will supply the method to log in shortly.

cc: @rjsparks @kesara

Implement legacy path pattern: IEEE (`bibxml6`)

IEEE (bibxml6)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml6/
Pattern 1: http://xml2rfc.tools.ietf.org/public/rfc/bibxml6/reference.IEEE.802.3.1_2011.xml
Pattern 2: http://xml2rfc.tools.ietf.org/public/rfc/bibxml6/reference.IEEE.P802-1A.1989.xml

Legacy pattern(s) to implement:

We will not implement other types of legacy patterns because they are not generic. If we need to maintain compatibility we will have to store those patterns in the database. Perhaps an analysis of existing bib usage in RFCs/IDs is necessary.

Implement legacy path pattern: Misc collection (`bibxml2`)

Misc collection (W3C, ISO, ITU, ANSI, FIPS, CCITT [previous name of ITU], IEEE, OASIS, PKCS) (bibxml2)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml2/

Legacy pattern(s) to implement:

We probably have to do this with a static map. This collection is not going to change moving forward.

We need to parse the pattern to return the appropriate BibXML content.

Serve /openapi.yaml from live site root

Manager’s GUI incorrectly shows buttons to “reindex” external datasets

https://demo.bibxml.org/management/doi/

Hmm...

Use relaton-py library to generate BibXML

We currently need two repositories per dataset. The relaton-data one for search indexing (as it provides more data) and the bibxml-data one for serving BibXML.

The latter dataset will no longer be needed once the service integrates the relaton-py library which converts Relaton data to BibXML on the fly. This is also necessary for serving other bibliographic formats like bibtex.

Once this is done we can also remove the unneeded dataset repos.

Originally posted by @ronaldtse in ietf-ribose#41 (comment)

Switch Celery result backend to Django ORM (potentially)

Currently, task status/result persists in Redis, but if we want dataset indexing task history to be more reliable/exist for longer we should persist it in PostgreSQL (using django-celery-results). This would also make it available using Django ORM, making it more convenient to query task status.
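
For reference, a settings fragment for this switch might look like the one below. It assumes the Celery app loads its configuration from Django settings with the `CELERY` namespace, which this issue does not confirm; the `django_celery_results` app also needs its migrations applied before use.

```python
# settings.py (fragment)

INSTALLED_APPS = [
    # ... existing apps ...
    "django_celery_results",
]

# Persist task state and results in the default Django database
# (PostgreSQL here) instead of Redis.
CELERY_RESULT_BACKEND = "django-db"
```

Task history then becomes queryable through the `TaskResult` model provided by django-celery-results.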

Style error pages

Update BibXML service API and OpenAPI definition to support legacy paths

From @kesara:

The service must maintain the following backward compatibility with the existing service:
a. URL structure and file naming of the current web service. For example /public/rfc/bibxml/reference.RFC.7991.xml. This will allow existing tools to quickly shift to using the new service.
b. For certain datasets (detailed below) the service must support a ‘live’ file name, which always serves the latest version of an XML citation at the time of retrieval, while also supporting the serving of specific versions. For example: reference.I-D.ietf-stir-passport-rcd.xml will return the XML citation for the current version of draft-ietf-stir-passport-rcd at the time of the request, while draft-ietf-stir-passport-rcd-09.xml will always return the XML citation for version -09 of the Internet-Draft.

Originally posted by @kesara in ietf-ribose#6 (comment)

Implement legacy path pattern: NIST references (bibxml-nist)

NIST references (bibxml-nist)

Legacy pattern(s) to implement:

Pattern 1: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-nist/reference.NIST.{old-docid}.xml

We just have to do a {old-docid} mapping to the new IDs.

Implement legacy path pattern: IETF BCP, FYI, STD (`bibxml9`, `bibxml-rfcsubseries`)

IETF BCP, FYI, STD (bibxml9, bibxml-rfcsubseries)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml9/
Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.BCP.0099.xml
Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.FYI.0099.xml
Pattern 3: http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.STD.0099.xml

Legacy pattern(s) to implement:

Pattern 1: http://{hostname}/public/rfc/bibxml-rfcsubseries/reference.BCP.{NNNN}.xml
Pattern 2: http://{hostname}/public/rfc/bibxml-rfcsubseries/reference.FYI.{NNNN}.xml
Pattern 3: http://{hostname}/public/rfc/bibxml-rfcsubseries/reference.STD.{NNNN}.xml

We need to parse the pattern to return the appropriate BibXML content.
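
One way to parse these paths is sketched below. The function name is hypothetical. The docid shape follows the `{ "type": "IETF", "id": "IETF BCP 4" }` example given for BCP elsewhere in these issues; applying the same shape to FYI and STD is an assumption, as is accepting the older `bibxml9` directory name alongside `bibxml-rfcsubseries`.

```python
import re
from typing import Dict, Optional

_SUBSERIES_PATH = re.compile(
    r"^/public/rfc/(?:bibxml9|bibxml-rfcsubseries)/"
    r"reference\.(?P<series>BCP|FYI|STD)\.(?P<number>\d{4})\.xml$"
)


def subseries_docid(path: str) -> Optional[Dict[str, str]]:
    """Map a subseries legacy path to a docid pair, or None if it does not match."""
    m = _SUBSERIES_PATH.match(path)
    if m is None:
        return None
    number = int(m["number"])  # drops zero padding: "0004" -> 4
    return {"type": "IETF", "id": f"IETF {m['series']} {number}"}
```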

Update URL config with legacy paths

Support data variables in legacy path patterns

required for #13

Support non-normalised publication identifiers

From @yablokov:

In the API URL pattern, what identifiers do we support?

Right now, the bibxml patterns go like "rfcNNNN.xml".

Should we support:

"RFCNNNN.xml"
"IETF-RFC-NNNN.xml"
or?

It would be nice to get more examples of non-normalised identifiers.

Spaces
Hyphens '-'
Underscores
?

Or maybe it would be better to strip non-standard characters from these identifiers and then try to normalise them?
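
As a rough sketch of the "clean, then normalise" idea, limited to RFC identifiers: the function name and the exact set of accepted variants are assumptions for illustration, and the normalised form `rfcNNNN` is taken from the current pattern mentioned above.

```python
import re
from typing import Optional


def normalise_rfc_identifier(raw: str) -> Optional[str]:
    """Normalise variants like "RFC7991.xml" or "IETF-RFC-7991.xml" to "rfc7991"."""
    # Drop a trailing ".xml", then remove spaces, hyphens and underscores.
    stem = re.sub(r"\.xml$", "", raw, flags=re.IGNORECASE)
    cleaned = re.sub(r"[\s_-]+", "", stem).upper()
    # Accept an optional "IETF" prefix and ignore leading zeros in the number.
    m = re.fullmatch(r"(?:IETF)?RFC0*(\d+)", cleaned)
    return f"rfc{m.group(1)}" if m else None
```

Anything that does not reduce to a recognisable RFC number returns `None`, so the caller can fall back to other handling.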

Originally posted by @yablokov in ietf-ribose#3 (comment)

Properly report non-404 errors from DOI

Currently, even if DOI returns 503, BibXML returns the “not found” response.

Discovered by cURLing a DOI reference that returned 503, then requesting the same reference through BibXML service (running from the same IP) and getting a “not found” response.
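
A minimal sketch of keeping "not found" separate from other upstream failures is shown below. All names are hypothetical, and answering with 502 for upstream errors is only one reasonable choice, not something decided in this issue.

```python
class RefNotFound(Exception):
    """The upstream source reports that the reference does not exist."""


class UpstreamError(Exception):
    """The upstream source failed for a reason other than 'not found'."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


def check_upstream_status(status: int) -> None:
    """Raise a distinct exception for 404 versus any other error status."""
    if status == 404:
        raise RefNotFound()
    if status >= 400:
        raise UpstreamError(status)


def response_status_for(exc: Exception) -> int:
    """Pick the HTTP status the service itself would answer with."""
    if isinstance(exc, RefNotFound):
        return 404
    return 502  # assumption: report upstream failures as Bad Gateway
```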

Provide instructions on how to test service

From @yablokov :

Now available (bibxml-indexer):

http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/run
http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/stop
http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/reset
http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/status

(as described in https://github.com/ietf-ribose/bibxml-indexer/blob/master/openapi.yaml )

In the indexer's settings.py I have configured these datasets:

ecma

nist

ietf

itu-r

calconnect

cie

iso

bipm

iho

(get it from https://github.com/relaton?q=relaton-data-&type=&language=&sort= )

You can start indexing on the bibxml-indexer instance:
http://127.0.0.1:8001/api/v1/indexer/ecma/run

And read the result from bibxml-service after indexing finishes:
http://127.0.0.1:8000/api/v1/ref/ecma/ECMA-154

Likewise for the nist dataset, start indexing on the bibxml-indexer instance:
http://127.0.0.1:8001/api/v1/indexer/nist/run

And read the result from bibxml-service after indexing finishes:
http://127.0.0.1:8000/api/v1/ref/nist/LCIRC288

The repos/datasets used by bibxml-indexer are read from configuration: https://github.com/ietf-ribose/bibxml-indexer/blob/master/indexer/settings.py

Implement API to retrieve a document by reference and support DOI first

Generalize external dataset configuration

Currently, we hard-code “doi” in places, and route code to DOI retrieval function.

Instead, we should configure external datasets as a dictionary that maps external dataset ID to retrieval functions matching specific interface convention.
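
Such a registry could look roughly like this. The retriever signature (reference string in, citation dict or `None` out) is an assumed interface convention, and `get_doi_ref` is a placeholder rather than the real DOI retrieval function.

```python
from typing import Callable, Dict, Optional

# Assumed interface convention: take a reference string and return citation
# data as a dict, or None when the reference is not found.
Retriever = Callable[[str], Optional[dict]]


def get_doi_ref(ref: str) -> Optional[dict]:
    """Placeholder standing in for the real DOI retrieval function."""
    return {"docid": ref, "source": "doi"}


# Maps external dataset ID to its retrieval function.
EXTERNAL_DATASETS: Dict[str, Retriever] = {
    "doi": get_doi_ref,
}


def get_external_ref(dataset_id: str, ref: str) -> Optional[dict]:
    """Dispatch to the retriever registered for `dataset_id`."""
    try:
        retrieve = EXTERNAL_DATASETS[dataset_id]
    except KeyError:
        raise ValueError(f"Unknown external dataset: {dataset_id}")
    return retrieve(ref)
```

Adding another external source then means registering one more function, with no changes to routing code.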

Implement legacy path pattern: DOI references (`bibxml7`)

DOI references (bibxml7)

Legacy pattern(s) to implement:

Pattern 1: http://xml2rfc.tools.ietf.org/public/rfc/bibxml7/reference.DOI.{DOI ID}.xml or .kramdown
Pattern 2: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-doi/reference.DOI.{DOI ID}.xml or .kramdown

rfcs, ids, rfcsubseries datasets are not indexing

https://github.com/ietf-ribose/bibxml-service/blob/d64b54e511c46fb665e819d455c91866695a3bb4/bibxml/settings.py#L266-L268

Add CI on GitHub Actions

Add flexibility to legacy path patterns

Currently, they are fixed /public/rfc/{legacy_dataset_id}/reference.{ref}.xml.

BibXML service should allow for /public/rfc/{legacy_dataset_id}/{arbitrary_prefix}{ref}.xml.

By default, arbitrary_prefix="reference.".
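
A sketch of a parser with a configurable prefix follows. `make_legacy_path_parser` is a hypothetical helper; the group names mirror the placeholders used above.

```python
import re
from typing import Callable, Optional, Tuple


def make_legacy_path_parser(
    arbitrary_prefix: str = "reference.",
) -> Callable[[str], Optional[Tuple[str, str]]]:
    """Build a parser for /public/rfc/{legacy_dataset_id}/{arbitrary_prefix}{ref}.xml."""
    pattern = re.compile(
        r"^/public/rfc/(?P<legacy_dataset_id>[^/]+)/"
        + re.escape(arbitrary_prefix)  # the prefix is matched literally
        + r"(?P<ref>.+)\.xml$"
    )

    def parse(path: str) -> Optional[Tuple[str, str]]:
        m = pattern.match(path)
        return (m["legacy_dataset_id"], m["ref"]) if m else None

    return parse
```

The `ref` group deliberately allows slashes, so DOI-style references such as `DOI.10.6028/NIST.SP.800-53r5` are still captured whole.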

Implement legacy path pattern: W3C (`bibxml4`)

W3C (bibxml4)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml4/
Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml4/reference.W3C.REC-example-name-date.xml

Legacy pattern(s) to implement:

https://{hostname}/public/rfc/bibxml4/reference.W3C.REC-{example-name}-{date}.xml

We need to parse the pattern to return the appropriate BibXML content.

Data mismatch when retrieving IEEE standards by xml2rfc paths

Current bibxml reference: https://xml2rfc.tools.ietf.org/public/rfc/bibxml6/reference.IEEE.802.11_2012.xml
New bibxml legacy reference: http://34.229.41.119:8000/public/rfc/bibxml6/reference.IEEE_802-11.2012.xml

Note that,
The file name is different:
reference.IEEE.802.11_2012 vs reference.IEEE_802-11.2012.

The reference anchor attribute data is different:
anchor="IEEE.802.11_2012" vs anchor="IEEE.IEEE 802-11.2012"

The organization data is different:
<organization>IEEE</organization> vs <organization abbrev="IEEE">Institute of Electrical and Electronics Engineers</organization>

These should match the existing bibxml service references.

settings.py is too confusing

The config file should be simple to use. Right now it mixes many different concepts and makes it hard to understand/use/config.

For example, I can't tell the difference between these:

DATASET_SOURCE_OVERRIDES
AUTHORITATIVE_DATASETS
EXTERNAL_DATASETS
KNOWN_DATASETS
LEGACY_DATASETS

If I just want to do #40 or #41 , what do I do? The file does not answer this.

DOI fetch broken on demo site

Go here https://demo.bibxml.org/doi/
Enter a valid DOI identifier, e.g. 10.6028/NIST.IR.7057 (this is a valid link: http://doi.org/10.6028/NIST.IR.7057)

See it being redirected to https://demo.bibxml.org/doi/10.6028%252FNIST.IR.7057/ and a failure message:

Requested reference not found: 10.6028/NIST.IR.7057
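
The `%252F` in the redirected URL is what a `/` looks like after being percent-encoded twice. Whether double encoding is the actual cause on the demo site is not confirmed here, but the short snippet below reproduces the same string and shows why a single decode would not recover the DOI.

```python
from urllib.parse import quote, unquote

doi = "10.6028/NIST.IR.7057"

once = quote(doi, safe="")    # "/" becomes "%2F"
twice = quote(once, safe="")  # "%" becomes "%25", so "%2F" becomes "%252F"

# Decoding the doubly encoded value once still leaves an encoded slash;
# only a second decode gets back to the original DOI.
decoded_once = unquote(twice)
decoded_twice = unquote(decoded_once)
```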

Support legacy path patterns from `xml2rfc.tools.ietf.org`

We need to implement the legacy path patterns from xml2rfc.tools.ietf.org.

Functionality needed:

Regression: resolving citations in XML format via main API

Example: https://demo.bibxml.org/api/v1/ref/ieee/IEEE_628.2020/?format=bibxml

Caused by a refactor that introduced get_indexed_ref_by_query, and caller forgetting to pass the “format” argument through to it.

Implement legacy path pattern: 3GPP (`bibxml5`)

3GPP (bibxml5)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml5/
Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml5/reference.SDO-3GPP.55.205.xml
Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-3gpp/reference.SDO-3GPP.55.205.xml
Pattern 3: http://xml2rfc.ietf.org/public/rfc/bibxml-3gpp/reference.3GPP.55.205.xml

Legacy pattern(s) to implement:

Pattern 1: https://{hostname}/public/rfc/bibxml5/reference.SDO-3GPP.{docid}.xml
Pattern 2: https://{hostname}/public/rfc/bibxml-3gpp/reference.SDO-3GPP.{docid}.xml
Pattern 3: https://{hostname}/public/rfc/bibxml-3gpp/reference.3GPP.{docid}.xml

3GPP documents are of the pattern like:

00.02U
55.205
02.06dcs
29.998-04-1

We need to parse the pattern to return the appropriate BibXML content.
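
The three patterns could be handled by a single expression, as in the sketch below. The function name is hypothetical, and the character class for `docid` (letters, digits, dots, hyphens) is an assumption derived only from the four sample identifiers above.

```python
import re
from typing import Optional

_3GPP_PATH = re.compile(
    r"^/public/rfc/(?:"
    r"bibxml5/reference\.SDO-3GPP"            # Pattern 1
    r"|bibxml-3gpp/reference\.(?:SDO-)?3GPP"  # Patterns 2 and 3
    r")\.(?P<docid>[0-9A-Za-z.\-]+)\.xml$"
)


def parse_3gpp_docid(path: str) -> Optional[str]:
    """Extract the 3GPP document ID from a legacy path, or None if unmatched."""
    m = _3GPP_PATH.match(path)
    return m["docid"] if m else None
```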

Remove unnecessary k8s annotations

@kwkwan These annotations are not necessary? Can we remove them?

https://github.com/ietf-ribose/bibxml-service/blob/d7113ff4fe583b87d5c32712842ad34a0efa182a/k8s/ws/ws-service.yaml#L4-L7

Convert 'misc' dataset into relaton-data-ietfmisc and bibxml-data-ietfmisc

https://github.com/ietf-ribose/bibxml-service/blob/d64b54e511c46fb665e819d455c91866695a3bb4/bibxml/settings.py#L269

“Misc” dataset is an old, manually crafted xml2rfc dataset that contains citations of various doctypes and docids, for some (or all) of which newer citation metadata is contained in other sources that we have.

We shouldn’t index “misc”, but we should allow the compatibility API (legacy xml2rfc paths) to resolve to doctype/docids (both dynamically by parsing the filename, and via manual assignment), and then return the corresponding new citation metadata indexed from those datasets, with a fallback to pre-crawled data from the xml2rfc webserver (for cases where dynamic resolution yields an unknown doctype/docid and no manual assignment was provided).

(Similar handling is planned to be applied to other xml2rfc data as well.)

Removal of duplicate 3GPP references from this dataset, with a legacy path mapping

In the latest bibxml5 dataset, there are two patterns of reference files:

reference.SDO-3GPP.*.xml
reference.3GPP.*.xml

We will need to build legacy file mappings for both, but I wanted to find out if the content really differs between them.

Since we already know the anchors are different (because the anchors follow the filenames), we will omit the difference in anchors:

$ sed -i'.bak' 's#SDO-##g' reference.SDO-3GPP.*.xml
$ find . -name 'reference.SDO-3GPP.*.xml' -exec bash -c 'diff $0 ${0/SDO-/}' {} \;

The two collections of files differ in cardinality:

$ ls -l reference.3GPP.*.xml | wc -l
    2217
$ ls -l reference.SDO-3GPP.*.xml | wc -l
    2110

So the pattern reference.3GPP.*.xml has 107 more files than reference.SDO-3GPP.*.xml (2217 vs 2110).
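
Comparing counts only shows that the two collections differ in size. To list exactly which documents lack an SDO-prefixed counterpart, something like the following could be used; it is a hypothetical helper that works on a list of filenames rather than on the actual dataset.

```python
from typing import Iterable, List

_PLAIN = "reference.3GPP."
_SDO = "reference.SDO-3GPP."


def missing_sdo_counterparts(filenames: Iterable[str]) -> List[str]:
    """Return name suffixes present as reference.3GPP.* but not as reference.SDO-3GPP.*."""
    plain, sdo = set(), set()
    for name in filenames:
        if not name.endswith(".xml"):
            continue
        if name.startswith(_SDO):
            sdo.add(name[len(_SDO):])
        elif name.startswith(_PLAIN):
            plain.add(name[len(_PLAIN):])
    return sorted(plain - sdo)
```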

The speculation is that the reference.SDO-3GPP.*.xml pattern contains older data than reference.3GPP.*.xml.

The actual differences between the files are:

5c5
< <title>Voice Broadcast Service (VBS); Stage 2</title>
---
> <title>Voice Broadcast service (VBS); Stage 2</title>
5c5
< <title>Location Services (LCS); Serving Mobile Location Centre - Serving Mobile Location Centre (SMLC - SMLC); SMLCPP specification</title>
---
> <title>Location Services (LCS): Serving Mobile Location Centre - Serving Mobile Location Centre (SMLC - SMLC); SMLCPP specification</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL) Phase X; CAMEL Application Part (CAP) specification</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL) Phase X; CAMEL Application Part (CAP) specification</title>
5c5
< <title>3G Security; Specification of the MILENAGE algorithm set: An example algorithm set for the 3GPP authentication and key generation functions f1, f1*, f2, f3, f4, f5 and f5*; Document 3: Implementors' test data</title>
---
> <title>3G Security; Specification of the MILENAGE algorithm set: An example algorithm set for the 3GPP authentication and key generation functions f1, f1*, f2, f3, f4, f5 and f5*; Document 3: Implementors’ test data</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL); Service description; Stage 1</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL); Service description; Stage 1</title>
5c5
< <title>3G security; LawfulInterception; Stage 2</title>
---
> <title>3G security; Lawful Interception; Stage 2</title>
5c5
< <title>IP Multimedia Subsystem (IMS) Application Level Gateway (IMS-ALG) - IMS Access Gateway (IMS-AGW) interface: Procedures descriptions</title>
---
> <title>IP Multimedia Subsystem (IMS) Application Level Gateway (IMS-ALG) – IMS Access Gateway (IMS-AGW) interface: Procedures descriptions</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2; IM CN Interworking</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2; IM CN Interworking</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2</title>
5c5
< <title>Mobile radio interface layer 3 specification; Radio Resource Control (RRC) protocol; Iu mode</title>
---
> <title>Mobile radio interface layer 3 specification, Radio Resource Control (RRC) protocol; Iu mode</title>
5c5
< <title>Telecommunication management; Self-configuration of network elements Integration Reference Point (IRP); Solution Set (SS) definitions</title>
---
> <title>Telecommunication management; Self-Configuration of Network Elements Integration Reference Point (IRP); Solution Set (SS) definitions</title>
5c5
< <title>TISPAN; PSTN/ISDN simulation services Terminating Identification Presentation (TIP) and Terminating Identification Restriction (TIR); Protocol specification</title>
---
> <title>PSTN/ISDN simulation services Terminating Identification Presentation (TIP) and Terminating Identification Restriction (TIR); Protocol specification</title>
5c5
< <title>Telecommunication management; Generic Integration Reference Point (IRP) management; Solution Set (SS) definitions</title>
---
> <title>Telecommunication management; Generic Integration Reference Point (IRP) management; Solution Set (SS) Definitions</title>
5c5
< <title>3G Security; Lawful Interception; Stage 2</title>
---
> <title>Lawful Interception; Stage 2</title>

Which aren't many.

We can immediately pick up those minor differences:

British English vs American English spelling ("Customised" vs "Customized")
Colon vs semi-colon ("Location Services (LCS):" vs "Location Services (LCS);")
Capitalisation ("service" vs "Service")
Unicode punctuation - vs –
Typos ("Lawful Interception" vs "LawfulInterception")

In all cases, the files matching reference.3GPP.*.xml contain the more correct content:

Typos and semicolons are fixed
Titles use British English spelling, which according to the official 3GPP page is the correct spelling (e.g. 3GPP 29.078)

I propose that we make these assumptions:

The intent of reference.3GPP.*.xml and reference.SDO-3GPP.*.xml is identical, and hence the contents of reference.3GPP.*.xml and reference.SDO-3GPP.*.xml are meant to be identical
Since there are reference.3GPP.*.xml files that do not have a corresponding reference.SDO-3GPP.*.xml file, when asked for that reference.SDO-3GPP.*.xml file, we can respond with the content of the reference.3GPP.*.xml file.

@rjsparks is that acceptable?
