

bibxml-service's Issues

Unable to find "C57-12-20.2011" unless given "IEEE C57-12-20.2011"

If I select the "IEEE" dataset, then enter "IEEE C57-12-20.2011", it works:
[screenshot]

But if I enter "C57-12-20.2011", nothing is found:
[screenshot]

Having to select the "IEEE" dataset and then also enter the document identifier with the "IEEE" prefix seems redundant.

This seems to be an issue with the search engine matching strings.

Bring `bibxml2` dataset up to date (in the future)

W3C, ISO, ITU, ANSI, FIPS, CCITT [previous name of ITU], IEEE, OASIS, PKCS

W3C, ISO, ITU, FIPS are all supported by Relaton (the latter two especially are provided by the authoritative parties).

OASIS has indicated they are willing to provide bibdata (in a separate project, but the data can also be useful for IETF).

We will need to figure out CCITT documents (legacy docs from ITU).

PKCS has no authority now (the documents were published by RSA), so we can just move that content into a static dataset.

ANSI we will need to figure out.

Originally from ietf-ribose#10 (comment)

Adjusted `xml2rfc` compatibility path implementation

Requirements (to reiterate):

  • A valid XML response must be returned for any requested xml2rfc path
  • If possible (plan A), returned XML response must reflect a citation indexed from one of the canonical data sources, not its original xml2rfc contents
  • As a fallback (plan B), returned XML response must be the exact file under requested xml2rfc path

Adjusted legacy path implementation is as follows:

  • When handling an xml2rfc path, BibXML service will transform the path to citation document type & ID (the docid pair in Relaton data)
  • For example, a path like /public/rfc/bibxml9/reference.BCP.0004.xml should be converted to docid like { "type": "IETF", "id": "IETF BCP 4" }
  • Manual docid mapping will be used first, if defined for this path
  • In absence of manual mapping, automatic transformation will be attempted (fallible)
  • If either of the above fails, BibXML service will return pre-crawled XML file obtained from xml2rfc server
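
The resolution order above could be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `MANUAL_MAP`, `auto_docid`, `resolve_xml2rfc_path`, and the two callables passed in (`lookup_indexed`, `read_crawled`) are hypothetical names, and the automatic step only handles the BCP/FYI/STD example given above.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical manual overrides: xml2rfc path -> docid pair.
MANUAL_MAP: Dict[str, Dict[str, str]] = {}

# Only RFC subseries filenames are recognised in this sketch.
SUBSERIES_RE = re.compile(r"/reference\.(BCP|FYI|STD)\.(\d+)\.xml$")


def auto_docid(path: str) -> Optional[Dict[str, str]]:
    """Fallible automatic transformation of a path into a docid pair."""
    m = SUBSERIES_RE.search(path)
    if m is None:
        return None
    series, number = m.groups()
    # int() drops the zero padding: "0004" -> 4
    return {"type": "IETF", "id": f"IETF {series} {int(number)}"}


def resolve_xml2rfc_path(
    path: str,
    lookup_indexed: Callable[[Dict[str, str]], Optional[str]],
    read_crawled: Callable[[str], str],
    manual_map: Dict[str, Dict[str, str]] = MANUAL_MAP,
) -> str:
    """Manual mapping first, then automatic, then the pre-crawled file."""
    docid = manual_map.get(path) or auto_docid(path)
    if docid is not None:
        xml = lookup_indexed(docid)
        if xml is not None:
            return xml  # plan A: citation from an indexed source
    return read_crawled(path)  # plan B: exact xml2rfc file
```

Passing the index lookup and the crawled-file reader as callables keeps the fallback order testable without a database or filesystem.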

Implement legacy path pattern: IETF Internet-Drafts (`bibxml3`, `bibxml-id`)

IETF Internet-Drafts (bibxml3, bibxml-id)

Legacy pattern(s) to implement:

  • Pattern 1: https://{hostname}/public/rfc/bibxml-ids/reference.I-D.{example-name}.xml
  • Pattern 2: https://{hostname}/public/rfc/bibxml-ids/reference.I-D.draft-{example-name}-{draft-number}.xml

We need to parse the pattern to return the appropriate BibXML content.
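
A possible way to tell the two patterns apart is sketched below. The function name `parse_internet_draft` is hypothetical, and the sketch assumes that a specific version is always a two-digit suffix, as in the examples; how the service actually identifies drafts is not specified here.

```python
import re
from typing import Optional, Tuple

# Filename part of either pattern: reference.I-D.<something>.xml
ID_FILE_RE = re.compile(r"reference\.I-D\.(?P<ref>.+)\.xml$")

# Pattern 2: "draft-" prefix plus a two-digit version suffix (assumption).
VERSIONED_RE = re.compile(r"^draft-(?P<name>.+)-(?P<version>\d{2})$")


def parse_internet_draft(path: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (draft name, version or None), or None if the path does not match."""
    m = ID_FILE_RE.search(path)
    if m is None:
        return None
    ref = m.group("ref")
    versioned = VERSIONED_RE.match(ref)
    if versioned:
        return "draft-" + versioned.group("name"), versioned.group("version")
    # Pattern 1: unversioned reference, normally written without "draft-".
    name = ref if ref.startswith("draft-") else "draft-" + ref
    return name, None
```

A result with version `None` would mean "serve the latest version", while a concrete version pins the citation to that revision.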

Missing document title makes citation inaccessible via GUI

Currently the BibXML service does not provide an individual "show" page like the other datasets.

RFC subseries:
[screenshot]

RFC:
[screenshot]

RFC individual page:
[screenshot]

This is because an "RFC subseries document" can contain multiple RFCs. Each "RFC subseries document" has its own metadata and also one or more RFCs that form part of it.

This relationship is represented as a "document relation" in the Relaton data.

We need to handle this new data structure for display.

GraphQL API

This is a moderately far out idea for now, at least as far as I’m concerned, but a GraphQL API is an option that might be very feasible given the current architecture.

It has its downsides: consumers may start depending on citation attributes even though the data structure may change as citation sourcing evolves; inconsistencies between data sources will become more obvious and irritating (some sources contain more data than others, so a finer-grained query may unintentionally exclude citations); complexity is higher; and so on. It also has some upsides, which may outweigh the downsides now or in the near future (although I don’t think GQL should be made the primary supported API).

Implement legacy paths

We need to support the legacy path patterns for the following datasets.

Datasets

IETF datasets

Non-IETF datasets

Services

Originally posted by @ronaldtse in ietf-ribose#7 (comment)

Specification for standard retrieval API

Here’s our API specification for BibXML service: openapi.yaml. The API is evolving and can change in the coming week or so, but is it overall correct / on the right track?

The API describes two endpoints:

  1. /ref/ for retrieving a single standard’s metadata given dataset ID and standard reference.
    • It returns JSON-encoded representation of citation, as it was indexed.
    • For DOI, it performs an additional trip to authoritative dataset.
    • For other datasets, it uses the index (much faster).
  2. /search/ for querying standards.
    • It accepts a JSON object as query.
    • The most important part of it is fields, which for now simply matches provided values with whatever is in the index (e.g., { "fields": { "id": 1234, "doctype": "standard" } }).
    • The object also can contain a dataset field, without which search is performed across all datasets.
    • The object can also contain parameters limit and offset for windowing returned data.
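
As an illustration of the /search/ query object described above, here is a small sketch that assembles it. The helper name `build_search_query` is hypothetical, and how the object is sent to the endpoint (request body or otherwise) is left open because it is not stated here.

```python
import json
from typing import Any, Dict, Optional


def build_search_query(
    fields: Dict[str, Any],
    dataset: Optional[str] = None,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> str:
    """Serialise a search query; optional keys are left out when not given."""
    query: Dict[str, Any] = {"fields": fields}
    if dataset is not None:
        query["dataset"] = dataset  # omitted -> search across all datasets
    if limit is not None:
        query["limit"] = limit
    if offset is not None:
        query["offset"] = offset
    return json.dumps(query)


# Example: second page of 10 results within one dataset.
print(build_search_query({"id": 1234, "doctype": "standard"}, dataset="ietf", limit=10, offset=10))
```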

Implement legacy path pattern: IANA references (`bibxml8`)

Reference identifier patterns across multiple datasets

http://xml2rfc.tools.ietf.org

IETF RFC

  • http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.NNNN.xml

IETF Internet-Draft

  • http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.example-name.xml
  • http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.draft-example-name-99.xml
  • (Note: Leave draft- off when referencing a generic draft name, and add draft- when referencing a specific version of an internet-draft.)

IETF RFC Subseries

  • http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.BCP.0099.xml
  • http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.FYI.0099.xml
  • http://xml2rfc.ietf.org/public/rfc/bibxml-rfcsubseries/reference.STD.0099.xml

W3C

  • http://xml2rfc.ietf.org/public/rfc/bibxml4/reference.W3C.REC-example-name-date.xml

3GPP

  • http://xml2rfc.ietf.org/public/rfc/bibxml5/reference.SDO-3GPP.1234.xml
  • http://xml2rfc.ietf.org/public/rfc/bibxml-3gpp/reference.SDO-3GPP.1234.xml

IEEE

  • http://xml2rfc.ietf.org/public/rfc/bibxml6/reference.IEEE.12345_date.xml
  • http://xml2rfc.ietf.org/public/rfc/bibxml-ieee/reference.IEEE.12345_date.xml

Implement legacy path pattern: IEEE (`bibxml6`)

IEEE (bibxml6)

Legacy pattern(s) to implement:

We will not implement other types of legacy patterns because they are not generic. If we need to maintain compatibility, we will have to store those patterns in the database. Perhaps an analysis of existing bib usage in RFCs/IDs is necessary.

Use relaton-py library to generate BibXML

We currently need two repositories per dataset: the relaton-data one for search indexing (as it provides more data) and the bibxml-data one for serving BibXML.

The latter dataset will no longer be needed once the service integrates the relaton-py library, which converts Relaton data to BibXML on the fly. This is also necessary for serving other bibliographic formats such as BibTeX.

Once this is done we can also remove the unneeded dataset repos.

Originally posted by @ronaldtse in ietf-ribose#41 (comment)

Switch Celery result backend to Django ORM (potentially)

Currently, task status/result persists in Redis, but if we want dataset indexing task history to be more reliable/exist for longer we should persist it in PostgreSQL (using django-celery-results). This would also make it available using Django ORM, making it more convenient to query task status.
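
A minimal settings sketch for this switch, assuming the Celery app loads its configuration from Django settings with the `CELERY` namespace (an assumption about this project's setup). The `django_celery_results` app name and the `django-db` backend value come from the django-celery-results documentation; the package's migrations also need to be applied.

```python
# settings.py fragment (sketch; the existing app list is elided)
INSTALLED_APPS = [
    # ... existing apps ...
    "django_celery_results",
]

# Store task state and results in the Django database (PostgreSQL here)
# instead of Redis. Assumes config_from_object(..., namespace="CELERY").
CELERY_RESULT_BACKEND = "django-db"
```

With this in place, task results would be queryable through the `TaskResult` model that django-celery-results provides.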

Update BibXML service API and OpenAPI definition to support legacy paths

From @kesara:

The service must maintain the following backward compatibility with the existing service:
a. URL structure and file naming of the current web service. For example /public/rfc/bibxml/reference.RFC.7991.xml. This will allow existing tools to quickly shift to using the new service.
b. For certain datasets (detailed below) the service must support a ‘live’ file name, which always serves the latest version of an XML citation at the time of retrieval, while also supporting the serving of specific versions. For example: reference.I-D.ietf-stir-passport-rcd.xml will return the XML citation for the current version of draft-ietf-stir-passport-rcd at the time of the request, while draft-ietf-stir-passport-rcd-09.xml will always return the XML citation for version -09 of the Internet-Draft.

Originally posted by @kesara in ietf-ribose#6 (comment)

Implement legacy path pattern: NIST references (bibxml-nist)

Implement legacy path pattern: IETF BCP, FYI, STD (`bibxml9`, `bibxml-rfcsubseries`)

IETF BCP, FYI, STD (bibxml9, bibxml-rfcsubseries)

Legacy pattern(s) to implement:

  • Pattern 1: http://{hostname}/public/rfc/bibxml-rfcsubseries/reference.BCP.{NNNN}.xml
  • Pattern 2: http://{hostname}/public/rfc/bibxml-rfcsubseries/reference.FYI.{NNNN}.xml
  • Pattern 3: http://{hostname}/public/rfc/bibxml-rfcsubseries/reference.STD.{NNNN}.xml

We need to parse the pattern to return the appropriate BibXML content.

Support non-normalised publication identifiers

From @yablokov:

In the API URL pattern, what identifiers do we support?

Right now, the bibxml patterns look like "rfcNNNN.xml".

Should we support:

  • "RFCNNNN.xml"
  • "IETF-RFC-NNNN.xml"
  • or others?

It would be nice to get more examples of non-normalised identifiers, for instance ones containing:

  • Spaces
  • Hyphens ("-")
  • Underscores

Or maybe it would be better to strip non-standard characters from these identifiers and then try to normalise them?
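
One way the "clean, then normalise" idea could look for RFC identifiers is sketched below. The function name `normalize_rfc_ref` and the choice of `rfcNNNN` as the normalised form are assumptions for illustration only.

```python
import re
from typing import Optional


def normalize_rfc_ref(raw: str) -> Optional[str]:
    """Map variants such as 'RFC7991.xml' or 'IETF-RFC-7991.xml' to 'rfc7991'."""
    # Drop an optional .xml extension and surrounding whitespace.
    stem = re.sub(r"\.xml$", "", raw.strip(), flags=re.IGNORECASE)
    # Remove spaces, underscores and hyphens, then compare case-insensitively.
    compact = re.sub(r"[\s_\-]+", "", stem).upper()
    m = re.fullmatch(r"(?:IETF)?RFC0*(\d+)", compact)
    if m is None:
        return None  # not recognisable as an RFC reference
    return "rfc" + m.group(1)
```

Returning `None` for anything unrecognised leaves room for a separate decision on how strict the API should be.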

Originally posted by @yablokov in ietf-ribose#3 (comment)

Properly report non-404 errors from DOI

Currently, even if DOI returns 503, BibXML service returns the “not found” response.

Discovered by cURLing a reference that returned a 503 result, then requesting the same reference through BibXML service (running from the same IP) and getting a “not found” response.
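
The distinction could be expressed along these lines. The exception classes and `interpret_doi_response` are hypothetical names; the sketch only shows separating "not found" from other upstream failures and does not reflect how the service currently calls DOI.

```python
from typing import Optional


class RefNotFound(Exception):
    """The upstream source reported that the reference does not exist."""


class UpstreamError(Exception):
    """The upstream source failed for a reason other than 'not found'."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


def interpret_doi_response(status: int, body: Optional[str]) -> Optional[str]:
    """Return the body on success; raise a specific error otherwise."""
    if status == 200:
        return body
    if status == 404:
        raise RefNotFound()
    # 503 and any other non-404 failure must not be reported as "not found".
    raise UpstreamError(status)
```

A view could then map `RefNotFound` to a 404 response and `UpstreamError` to a gateway-style error that carries the upstream status.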

Provide instructions on how to test service

From @yablokov :

Now available (bibxml-indexer):

http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/run
http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/stop
http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/reset
http://127.0.0.1:8001/api/v1/indexer/<dataset_name>/status

(as described in https://github.com/ietf-ribose/bibxml-indexer/blob/master/openapi.yaml )

In the indexer's settings.py I have configured these datasets:

  • ecma
  • nist
  • ietf
  • itu-r
  • calconnect
  • cie
  • iso
  • bipm
  • iho

(taken from https://github.com/relaton?q=relaton-data-&type=&language=&sort= )

You can start indexing on the bibxml-indexer instance:
http://127.0.0.1:8001/api/v1/indexer/ecma/run

And read the result from bibxml-service after indexing finishes:
http://127.0.0.1:8000/api/v1/ref/ecma/ECMA-154

Likewise, you can start indexing another dataset on the bibxml-indexer instance:
http://127.0.0.1:8001/api/v1/indexer/nist/run

And read the result from bibxml-service after indexing finishes:
http://127.0.0.1:8000/api/v1/ref/nist/LCIRC288

The repos/datasets used by bibxml-indexer are read from configuration: https://github.com/ietf-ribose/bibxml-indexer/blob/master/indexer/settings.py

Generalize external dataset configuration

Currently, we hard-code “doi” in places, and route code to DOI retrieval function.

Instead, we should configure external datasets as a dictionary that maps external dataset ID to retrieval functions matching specific interface convention.
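
A rough sketch of such a registry follows. `Retriever`, `get_doi_ref`, `EXTERNAL_DATASETS` as a dict of callables, and `get_external_ref` are all illustrative names; the actual interface convention is still to be defined, and the DOI function here is only a stub.

```python
from typing import Callable, Dict, Optional

# Assumed interface convention: take a reference string, return citation
# data as a dict, or None when the source has no such reference.
Retriever = Callable[[str], Optional[dict]]


def get_doi_ref(ref: str) -> Optional[dict]:
    """Stub standing in for the real DOI retrieval function."""
    return {"docid": [{"type": "DOI", "id": ref}]}


# External dataset ID -> retrieval function.
EXTERNAL_DATASETS: Dict[str, Retriever] = {
    "doi": get_doi_ref,
}


def get_external_ref(
    dataset_id: str,
    ref: str,
    registry: Dict[str, Retriever] = EXTERNAL_DATASETS,
) -> Optional[dict]:
    """Route a lookup to whichever retriever is registered for the dataset."""
    try:
        retrieve = registry[dataset_id]
    except KeyError:
        raise LookupError(f"unknown external dataset: {dataset_id}")
    return retrieve(ref)
```

Adding another external source would then mean adding one dictionary entry rather than touching routing code.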

Implement legacy path pattern: DOI references (`bibxml7`)

Add flexibility to legacy path patterns

Currently, they are fixed as /public/rfc/{legacy_dataset_id}/reference.{ref}.xml.

BibXML service should allow for /public/rfc/{legacy_dataset_id}/{arbitrary_prefix}{ref}.xml.

By default, arbitrary_prefix="reference.".
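
For illustration, a configurable prefix could be built into the path pattern like this. `make_legacy_path_re` is a hypothetical helper, and whether the prefix is configured globally or per dataset is left open.

```python
import re


def make_legacy_path_re(arbitrary_prefix: str = "reference.") -> "re.Pattern[str]":
    """Build a matcher for /public/rfc/{legacy_dataset_id}/{arbitrary_prefix}{ref}.xml."""
    return re.compile(
        r"^/public/rfc/(?P<legacy_dataset_id>[^/]+)/"
        + re.escape(arbitrary_prefix)  # treat the prefix literally
        + r"(?P<ref>.+)\.xml$"
    )


default_re = make_legacy_path_re()
m = default_re.match("/public/rfc/bibxml/reference.RFC.7991.xml")
print(m.group("legacy_dataset_id"), m.group("ref"))  # bibxml RFC.7991
```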

Data mismatch when retrieving IEEE standards by xml2rfc paths

Current bibxml reference: https://xml2rfc.tools.ietf.org/public/rfc/bibxml6/reference.IEEE.802.11_2012.xml
New bibxml legacy reference: http://34.229.41.119:8000/public/rfc/bibxml6/reference.IEEE_802-11.2012.xml

Note the following differences.

The file name is different:
reference.IEEE.802.11_2012 vs reference.IEEE_802-11.2012.

The reference anchor attribute data is different:
anchor="IEEE.802.11_2012" vs anchor="IEEE.IEEE 802-11.2012"

The organization data is different:
<organization>IEEE</organization> vs <organization abbrev="IEEE">Institute of Electrical and Electronics Engineers</organization>

These should match the existing bibxml service references.

settings.py is too confusing

The config file should be simple to use. Right now it mixes many different concepts, which makes it hard to understand, use, and configure.

For example, I can't tell the difference between these:

DATASET_SOURCE_OVERRIDES
AUTHORITATIVE_DATASETS
EXTERNAL_DATASETS
KNOWN_DATASETS
LEGACY_DATASETS

If I just want to do #40 or #41 , what do I do? The file does not answer this.

Implement legacy path pattern: 3GPP (`bibxml5`)

3GPP (bibxml5)

Legacy pattern(s) to implement:

  • Pattern 1: https://{hostname}/public/rfc/bibxml5/reference.SDO-3GPP.{docid}.xml
  • Pattern 2: https://{hostname}/public/rfc/bibxml-3gpp/reference.SDO-3GPP.{docid}.xml
  • Pattern 3: https://{hostname}/public/rfc/bibxml-3gpp/reference.3GPP.{docid}.xml

3GPP document identifiers look like:

  • 00.02U
  • 55.205
  • 02.06dcs
  • 29.998-04-1

We need to parse the pattern to return the appropriate BibXML content.
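
A single regular expression can cover the three patterns, as in the sketch below. `parse_3gpp_path` is a hypothetical name; note that this expression also accepts `bibxml5/reference.3GPP.…`, a fourth combination not listed above.

```python
import re
from typing import Optional

THREEGPP_RE = re.compile(
    r"^/public/rfc/(?:bibxml5|bibxml-3gpp)/"
    r"reference\.(?:SDO-)?3GPP\."  # the "SDO-" part is optional
    r"(?P<docid>.+)\.xml$"
)


def parse_3gpp_path(path: str) -> Optional[str]:
    """Return the 3GPP document identifier, or None if the path does not match."""
    m = THREEGPP_RE.match(path)
    return m.group("docid") if m else None


for docid in ["00.02U", "55.205", "02.06dcs", "29.998-04-1"]:
    path = f"/public/rfc/bibxml-3gpp/reference.SDO-3GPP.{docid}.xml"
    assert parse_3gpp_path(path) == docid
```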

Convert 'misc' dataset into relaton-data-ietfmisc and bibxml-data-ietfmisc

https://github.com/ietf-ribose/bibxml-service/blob/d64b54e511c46fb665e819d455c91866695a3bb4/bibxml/settings.py#L269

“Misc” dataset is an old, manually crafted xml2rfc dataset that contains citations of various doctypes and docids, for some (or all) of which newer citation metadata is contained in other sources that we have.

We shouldn’t index “misc”, but we should allow the compatibility API (legacy xml2rfc paths) to resolve to doctype/docids (both dynamically by parsing the filename, and via manual assignment), and then return the corresponding new citation metadata indexed from those datasets, with a fallback to pre-crawled data from the xml2rfc webserver (for cases where dynamic resolution yields an unknown doctype/docid and no manual assignment was provided).

(Similar handling is planned to be applied to other xml2rfc data as well.)

Removal of duplicate 3GPP references from this dataset, with a legacy path mapping

In the latest bibxml5 dataset, there are two patterns of reference files:

  • reference.SDO-3GPP.*.xml
  • reference.3GPP.*.xml

We will need to build legacy file mappings for both, but I wanted to find out if the content really differs between them.

Since we already know the anchors are different (because the anchors follow the filenames), we will ignore the difference in anchors:

$ sed -i'.bak' 's#SDO-##g' reference.SDO-3GPP.*.xml
$ find . -name 'reference.SDO-3GPP.*.xml' -exec bash -c 'diff $0 ${0/SDO-/}' {} \;

The two collections of files differ in size:

$ ls -l reference.3GPP.*.xml | wc -l
    2217
$ ls -l reference.SDO-3GPP.*.xml | wc -l
    2110

So the pattern reference.3GPP.*.xml has 107 more files than reference.SDO-3GPP.*.xml (2217 vs 2110).

The speculation is that the reference.SDO-3GPP.*.xml pattern contains older data than reference.3GPP.*.xml.

The actual differences between the files are:

5c5
< <title>Voice Broadcast Service (VBS); Stage 2</title>
---
> <title>Voice Broadcast service (VBS); Stage 2</title>
5c5
< <title>Location Services (LCS); Serving Mobile Location Centre - Serving Mobile Location Centre (SMLC - SMLC); SMLCPP specification</title>
---
> <title>Location Services (LCS): Serving Mobile Location Centre - Serving Mobile Location Centre (SMLC - SMLC); SMLCPP specification</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL) Phase X; CAMEL Application Part (CAP) specification</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL) Phase X; CAMEL Application Part (CAP) specification</title>
5c5
< <title>3G Security; Specification of the MILENAGE algorithm set: An example algorithm set for the 3GPP authentication and key generation functions f1, f1*, f2, f3, f4, f5 and f5*; Document 3: Implementors' test data</title>
---
> <title>3G Security; Specification of the MILENAGE algorithm set: An example algorithm set for the 3GPP authentication and key generation functions f1, f1*, f2, f3, f4, f5 and f5*; Document 3: Implementors’ test data</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL); Service description; Stage 1</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL); Service description; Stage 1</title>
5c5
< <title>3G security; LawfulInterception; Stage 2</title>
---
> <title>3G security; Lawful Interception; Stage 2</title>
5c5
< <title>IP Multimedia Subsystem (IMS) Application Level Gateway (IMS-ALG) - IMS Access Gateway (IMS-AGW) interface: Procedures descriptions</title>
---
> <title>IP Multimedia Subsystem (IMS) Application Level Gateway (IMS-ALG) – IMS Access Gateway (IMS-AGW) interface: Procedures descriptions</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2; IM CN Interworking</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2; IM CN Interworking</title>
5c5
< <title>Customised Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2</title>
---
> <title>Customized Applications for Mobile network Enhanced Logic (CAMEL) Phase 4; Stage 2</title>
5c5
< <title>Mobile radio interface layer 3 specification; Radio Resource Control (RRC) protocol; Iu mode</title>
---
> <title>Mobile radio interface layer 3 specification, Radio Resource Control (RRC) protocol; Iu mode</title>
5c5
< <title>Telecommunication management; Self-configuration of network elements Integration Reference Point (IRP); Solution Set (SS) definitions</title>
---
> <title>Telecommunication management; Self-Configuration of Network Elements Integration Reference Point (IRP); Solution Set (SS) definitions</title>
5c5
< <title>TISPAN; PSTN/ISDN simulation services Terminating Identification Presentation (TIP) and Terminating Identification Restriction (TIR); Protocol specification</title>
---
> <title>PSTN/ISDN simulation services Terminating Identification Presentation (TIP) and Terminating Identification Restriction (TIR); Protocol specification</title>
5c5
< <title>Telecommunication management; Generic Integration Reference Point (IRP) management; Solution Set (SS) definitions</title>
---
> <title>Telecommunication management; Generic Integration Reference Point (IRP) management; Solution Set (SS) Definitions</title>
5c5
< <title>3G Security; Lawful Interception; Stage 2</title>
---
> <title>Lawful Interception; Stage 2</title>

That is not many.

We can immediately pick out these minor differences:

  • British English vs American English spelling ("Customised" vs "Customized")
  • Colon vs semicolon ("Location Services (LCS):" vs "Location Services (LCS);")
  • Capitalisation ("service" vs "Service")
  • Unicode punctuation ("-" vs "–", "'" vs "’")
  • Typos ("Lawful Interception" vs "LawfulInterception")

In all cases, the reference.3GPP.*.xml files contain the more correct content:

  • typos, semicolons fixed
  • uses British English spelling for their titles, which according to the official page is the correct spelling (e.g. 3GPP 29.078)

I propose that we make these assumptions:

  1. The intent of reference.3GPP.*.xml and reference.SDO-3GPP.*.xml is identical, and hence their contents are meant to be identical
  2. Since there are reference.3GPP.*.xml files that do not have a corresponding reference.SDO-3GPP.*.xml file, when asked for that reference.SDO-3GPP.*.xml file, we can respond with the content of the reference.3GPP.*.xml file.
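
Assumption 2 could be implemented with a small filename fallback, sketched here. `pick_3gpp_file` is a hypothetical helper operating on a set of available filenames; since anchors follow filenames, the served content might additionally need its anchor adjusted, which this sketch does not do.

```python
from typing import Optional, Set


def pick_3gpp_file(requested: str, available: Set[str]) -> Optional[str]:
    """Choose which file to serve for a requested 3GPP reference filename."""
    if requested in available:
        return requested
    # Fall back from the SDO-3GPP name to the corresponding 3GPP name.
    alternative = requested.replace("reference.SDO-3GPP.", "reference.3GPP.", 1)
    if alternative != requested and alternative in available:
        return alternative
    return None


files = {"reference.3GPP.55.205.xml", "reference.SDO-3GPP.00.02U.xml"}
print(pick_3gpp_file("reference.SDO-3GPP.55.205.xml", files))  # reference.3GPP.55.205.xml
```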

@rjsparks is that acceptable?
