Comments (15)
We will not implement other types of legacy patterns because they are not generic. If we need to maintain compatibility, we will have to store those patterns in the database. Perhaps an analysis of existing bib usage in RFCs/IDs is necessary.
Just in case, configuration should support more flexible generalised legacy path patterns soon.
- If all varying parts of a legacy ref (which is in reference.{legacy_ref}.xml) map to Relaton data properties, it should work.
- If not, then yes, a static map (either database-based or, maybe better, in configuration generated at build time) would be the way to go if required.
from bibxml-service.
- Currently working pattern example: /public/rfc/bibxml6/reference.IEC_62531.2012_REDLINE.xml
- I cannot factually reproduce the legacy patterns from the ticket description, because those standards don’t seem to exist in our bibxml-data-ieee
- However, based on the working pattern and the other bibxml-data-ieee files that we have, pattern 1 from the ticket description (/public/rfc/bibxml6/reference.IEEE.802.3.1_2011.xml) would have been implemented as /public/rfc/bibxml6/reference.IEEE_802-3.1.2011.xml
This is because files in bibxml-data-ieee are named in this way, and therefore such are our canonical references.
As with other legacy paths, currently we can either:
- rename source bibxml-data-ieee files to match the expected legacy pattern, or
- specify simple substitutional reformatting in the legacy path pattern (replacing punctuation with dots, etc.)
from bibxml-service.
- The filename pattern of bibxml-data-ieee files is not important.
- relaton-data-ieee (and therefore bibxml-data-ieee) has a lot more content than the original bibxml6 directory. The original bibxml6 directory was manually crafted.
The only way to know whether every single file from bibxml6 exists in bibxml-data-ieee is through a search for every item.
Increasingly, I think this is the way to go: we should have a static "map" between the old dataset and the new dataset, because the identifiers are too unpredictable.
from bibxml-service.
@ronaldtse Possible miscommunication alert…
My previous comment was written under the assumption that legacy paths need to correspond to actual preexisting legacy data.
Today I realised that assumption was mistaken: in your earlier comment (ietf-ribose/bibxml-project#5 (comment)) you said legacy paths just need to maintain the patterns and don’t need to correspond to actual data (because we aren’t expected to have that legacy data in our bibxml-data- datasets).
However, the above response from you in this thread seems to indicate we would need to map to legacy data after all, i.e. not only maintain the patterns but also make sure old preexisting XML files from XML2RFC tools are accessible via the new service?
I wonder if this is still an open question, requirements-wise.
from bibxml-service.
I think this is a question we need to clarify. @rjsparks mentioned that we should support legacy paths for backwards compatibility reasons.
For true backwards compatibility, the data served by a given path should be the same -- however, there are two major differences that mean doing so won't make much sense:
- We are now using the RFC XML v3 format to serve BibXML data. Old implementations that read it will likely fail.
- There is more data per bibliographic file than the previously served files. This means that old implementations could also fail, and at the least, behave differently.
I would say that the intention is to provide:
- An identical reference. If the old path and old content point to IEEE 802.3a, then the legacy path (new system, old path) should lead to the same reference, IEEE 802.3a.
- The contents of the reference will differ. We will continue using RFC XML v3 instead of RFC XML v2 for the legacy paths, and accept that if the extra content in the response causes issues for an implementation, the problem is on the implementation side.
In order to make a consistent map from legacy paths to the new dataset, with 100% confidence that the paths are pointing to the same data, we will possibly need to maintain a "static map" from each legacy path item to the new reference.
@rjsparks is this the approach you're thinking of?
from bibxml-service.
As the RFP calls out, there are deployed tools that need to continue to work with the legacy paths. We will deploy the new service such that it either backs the legacy URLs directly, or will proxy or redirect to the new URLs, but the path structure should remain the same.
We do not need to replicate providing references using the v2 grammar - the references should be in the v3 format. I think all the known tools will do the reasonable things.
from bibxml-service.
Earlier there was discussion of a demo instance that we could poke at - is such a thing already available?
(edit) : nm - I relocated the server you've previously pointed to.
from bibxml-service.
So - to make this all a bit clearer, perhaps:
See, e.g., https://www.ietf.org/archive/id/draft-ietf-stir-messaging-01.xml
Note the many entity declarations that look like:
<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
Other documents might point internally to https://xml2rfc.tools.ietf.org/public/rfc/bibxml-rfcs/reference.RFC.8174.xml.
When existing code processes this, it should still work. We will ensure that references to xml.resource.org (to the extent possible), xml2rfc.tools.ietf.org, and xml2rfc.ietf.org still resolve, and will serve them from the product you are making. We want to be able to easily configure those redirects (or proxies) to point to the new correct place.
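A minimal sketch of the redirect idea, assuming the redirect layer only swaps the scheme and host and leaves the path untouched. The legacy host names are the ones mentioned in this thread; NEW_BASE and the function name are hypothetical placeholders, not the actual deployment configuration.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Legacy hosts named in this thread.
LEGACY_HOSTS = {"xml.resource.org", "xml2rfc.tools.ietf.org", "xml2rfc.ietf.org"}

# Hypothetical base URL of the new service (an assumption for illustration).
NEW_BASE = "https://bibxml.example"


def redirect_target(url: str, new_base: str = NEW_BASE) -> Optional[str]:
    """Return the new-service URL for a legacy URL, or None for other hosts.

    Only scheme and host change; the path structure stays the same.
    """
    parts = urlsplit(url)
    if parts.hostname not in LEGACY_HOSTS:
        return None
    base = urlsplit(new_base)
    return urlunsplit((base.scheme, base.netloc, parts.path, parts.query, ""))


legacy = "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml"
print(redirect_target(legacy))
```

Keeping the host list and the base URL as plain data means repointing the redirects is a configuration change rather than a code change.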
Further, please skim the code at:
https://trac.ietf.org/trac/xml2rfc/browser/trunk/cli/xml2rfc/parser.py#L56
https://trac.ietf.org/trac/xml2rfc/browser/trunk/cli/xml2rfc/parser.py#L468
These network_locs strings should be all we have to replace to begin using the new service.
The code (with all the assumptions it makes about the structure of the URL below the configured network_locs) at
https://trac.ietf.org/trac/xml2rfc/browser/trunk/cli/xml2rfc/parser.py#L240
https://trac.ietf.org/trac/xml2rfc/browser/trunk/cli/xml2rfc/parser.py#L275
https://trac.ietf.org/trac/xml2rfc/browser/trunk/cli/xml2rfc/parser.py#L305
should work without modification (until we decide to improve/optimize using the API directly instead of these legacy paths).
v3 was designed to be backwards compatible with v2 - the references constructed shouldn't be so different that v2 processors would break on what's produced with v3 as an intended target. If you think you've identified a place where what you're constructing might, please provide an example.
from bibxml-service.
From #31:
Current bibxml reference: https://xml2rfc.tools.ietf.org/public/rfc/bibxml6/reference.IEEE.802.11_2012.xml
New bibxml legacy reference: http://34.229.41.119:8000/public/rfc/bibxml6/reference.IEEE_802-11.2012.xml
Note that the file name is different: reference.IEEE.802.11_2012 vs reference.IEEE_802-11.2012.
@strogonoff there are two issues here.
Support a fuzzy legacy path match
In the IEEE legacy path, we want to support resolving the legacy pattern IEEE.802.11_2012.xml to the IEEE PubID-based entry of IEEE 802-11:2012.
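One way such a fuzzy match could work, as a rough sketch only: compare identifiers by a key that ignores which punctuation separates the parts. The function names are hypothetical, and the approach is an assumption, not the service's actual logic. Two distinct identifiers can collapse to the same key, so the sketch refuses to answer when a match is ambiguous.

```python
import re
from typing import Dict, Iterable, List, Optional


def fuzzy_key(identifier: str) -> str:
    """Collapse each run of non-alphanumeric characters into one dot."""
    return re.sub(r"[^0-9A-Za-z]+", ".", identifier).strip(".").upper()


def build_index(doc_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Group known document identifiers by their punctuation-insensitive key."""
    index: Dict[str, List[str]] = {}
    for doc_id in doc_ids:
        index.setdefault(fuzzy_key(doc_id), []).append(doc_id)
    return index


def fuzzy_resolve(legacy_ref: str, index: Dict[str, List[str]]) -> Optional[str]:
    """Return the only identifier sharing the key, or None if none or several do."""
    candidates = index.get(fuzzy_key(legacy_ref), [])
    return candidates[0] if len(candidates) == 1 else None


index = build_index(["IEEE 802-11:2012"])
print(fuzzy_resolve("IEEE.802.11_2012", index))
```

Both IEEE.802.11_2012 and IEEE 802-11:2012 reduce to the key IEEE.802.11.2012, which is why the lookup succeeds here; datasets with less regular legacy names would not be covered by this.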
Name the "anchor" attribute identical to the legacy path file name
When returning an IEEE legacy path BibXML XML output, we want to reflect the same "anchor".
Today the legacy path provides this output:
https://xml2rfc.tools.ietf.org/public/rfc/bibxml6/reference.IEEE.802.11_2012.xml
<reference anchor="IEEE.802.11_2012" target="http://ieeexplore.ieee.org...">
Notice that the legacy path IEEE.802.11_2012.xml shares an identical prefix with the anchor IEEE.802.11_2012. Presumably, an author will use IEEE.802.11_2012 inside the document to reference this particular bibliographic item.
This means that existing documents rely on this anchor being identical to the file path, and thus we have to keep the anchor identical.
Technically this is poor practice due to the mixing of serving location and data identification, but this is an established practice in IETF authoring that is out of our scope to change.
from bibxml-service.
@ronaldtse Question: why should we support a fuzzy legacy match, instead of having a static mapping? It’s clear that legacy consumers must have exact filenames to start with. We just need to map legacy paths to up-to-date citations, whether automatically or not.
(Obviously, fuzzy match cannot be guaranteed to return correct results!)
For the second part (the anchor), it looks like you are suggesting altering the old XML contents and substituting paths? Are you sure it won’t break legacy consumers? If so, I think our best bet is to provide a GitHub source by crawling xml2rfc tools (which we might have to do anyway) and doing the requisite processing on XML as part of that crawl, rather than trying to manipulate this in realtime.
from bibxml-service.
@strogonoff because a fuzzy match is easier to maintain than a static-string to static-string match.
"(Obviously, fuzzy match cannot be guaranteed to return correct results!)"
Indeed, you are correct. There are clearly just two ways of handling legacy paths:
- Make a "legacy filename" to new "document identifier" mapping for all legacy paths, e.g. "reference.3GPP.XX.YY" => "3GPP XX.YY"
- Use a string matching pattern to map legacy filenames to new identifiers. It clearly doesn't work for all datasets (e.g. the NIST dataset), but for certain ones that have legacy filenames defined consistently (e.g. 3GPP), it is possible.
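The two options can also be combined. Here is a minimal sketch, assuming a hand-maintained static map is consulted first and a per-dataset pattern is used only as a fallback. The single static entry is the IEEE example from this thread; the 3GPP regular expression and all names are hypothetical.

```python
import re
from typing import Optional

# Static "legacy ref" -> "document identifier" entries (example from this thread).
STATIC_MAP = {
    "IEEE.802.11_2012": "IEEE 802-11:2012",
}

# Hypothetical pattern for consistently named datasets: "3GPP.XX.YY" -> "3GPP XX.YY".
PATTERNS = [
    (re.compile(r"^3GPP\.(\d+)\.(\d+)$"), r"3GPP \1.\2"),
]


def legacy_ref(filename: str) -> Optional[str]:
    """Extract {legacy_ref} from reference.{legacy_ref}.xml, or None."""
    match = re.fullmatch(r"reference\.(.+)\.xml", filename)
    return match.group(1) if match else None


def resolve(filename: str) -> Optional[str]:
    """Resolve a legacy filename: static map first, then pattern fallback."""
    ref = legacy_ref(filename)
    if ref is None:
        return None
    if ref in STATIC_MAP:
        return STATIC_MAP[ref]
    for pattern, template in PATTERNS:
        match = pattern.match(ref)
        if match:
            return match.expand(template)
    return None


print(resolve("reference.IEEE.802.11_2012.xml"))
print(resolve("reference.3GPP.01.02.xml"))  # made-up number, for illustration only
```

Anything that neither the map nor a pattern recognises resolves to None, so the caller can decide whether to fall back to archived data or return an error.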
"For the second part (the anchor), it looks like you are suggesting altering the old XML contents and substituting paths"
That's not what I'm saying.
I'm saying that:
- We will serve new content to the legacy paths. The whole point of handling legacy paths is to have the BibXML service provide old clients with up-to-date content.
- When serving the BibXML files, notice that the "filename" requested is identical to the anchor attribute within the BibXML file. This is the current practice of IETF authors and tooling, where they expect the "filename" to be identical to the anchor attribute.
from bibxml-service.
If we return new data from other sources for xml2rfc paths, then XML anchors will not match old xml2rfc anchors (presumably, the anchors in new data will contain some authoritative/canonical identifier, while the anchors in xml2rfc files match filenames that were arbitrarily assigned by humans).
So in addition to map or fuzzy-match, it still looks like we have to substitute anchors in XML based on whatever filename was in the incoming request (in case of legacy request, it would not match our anchor), on the fly. Unless I am misunderstanding you.
I’d rather get this clarified in case our fundamental legacy path handling requirements are jumping from “find and return” to something more like “find, parse, construct and return”.
from bibxml-service.
I think "find, parse, construct and return" might be required.
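To make the "parse, construct and return" part concrete, here is a rough sketch using only the Python standard library. It assumes the resolved item has already been serialised as a BibXML reference element, and it overwrites the anchor with the name taken from the requested legacy path. The function names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def anchor_from_path(path: str) -> str:
    """Derive the expected anchor from a path ending in reference.{anchor}.xml."""
    filename = path.rsplit("/", 1)[-1]
    prefix, suffix = "reference.", ".xml"
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError(f"not a legacy reference path: {path!r}")
    return filename[len(prefix):-len(suffix)]


def with_legacy_anchor(bibxml: str, path: str) -> str:
    """Return the XML with its anchor replaced by the legacy file name."""
    root = ET.fromstring(bibxml)
    if root.tag != "reference":
        raise ValueError("expected a <reference> root element")
    root.set("anchor", anchor_from_path(path))
    return ET.tostring(root, encoding="unicode")


# Hypothetical new-style output whose anchor differs from the legacy file name.
new_xml = (
    '<reference anchor="IEEE_802-11.2012" target="https://example.org/">'
    "<front><title>Example title</title></front></reference>"
)
print(with_legacy_anchor(new_xml, "/public/rfc/bibxml6/reference.IEEE.802.11_2012.xml"))
```

Only the anchor attribute is touched; the rest of the element is passed through as parsed.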
from bibxml-service.
Here’s a report for bibxml6 xml2rfc paths when “auto” resolution was in effect (results are not great, as diffs show):
bibxml6-report-with-auto.zip
Here’s a report for bibxml6 xml2rfc paths with current logic:
bibxml6-report-manual-only.zip
Current logic means most paths fall back to xml2rfc archive for now, returning identical XML to before, except these two which successfully map to these standards resulting in new XML:
- https://github.com/ietf-ribose/bibxml-data-archive/blob/main/bibxml6/reference.IEEE.8802_5_1998.xml → https://dev.bibxml.org/get-bibliographic-item/?query=ANSI%2FIEEE+802-5.1998
- https://github.com/ietf-ribose/bibxml-data-archive/blob/main/bibxml6/reference.IEEE.P8021D.1989.xml → https://dev.bibxml.org/get-bibliographic-item/?query=IEEE+802-5.1989
XML diffs are available in HTML report in the second zip archive above.
@ronaldtse could you confirm that above mappings are right, just in case? If so, this can be closed.
from bibxml-service.
I think we need to either A) update mappings or B) wait until pubid-ieee gives finalized identifiers we can use in relaton-data-ieee (metanorma/pubid-ieee#72). I think we’d want to do (B), because otherwise we’ll need to switch mappings back and forth, but if it takes too long we should do (A) instead ASAP. (cc @ronaldtse)
from bibxml-service.