[urn] Benjamin Kaduk's Discuss on

Discussion:

John C Klensin

2018-06-08 00:52:21 UTC

--On Thursday, June 7, 2018 16:29 -0600 Peter Saint-Andre

Text along those lines seems appropriate. We might also
discourage people from defining their own canonical
transformation and encourage them instead to re-use one that's
already defined (say, a PRECIS profile such as UsernameCaseMapped).

I started to write a much longer note and explanation, but maybe
a summary is better for all concerned (i.e., if more is needed,
it is readily available).

IMO, there are two separate issues involved with this
discussion.

One is whether an Internationalization Considerations section is
needed. It seems to me that it is, but that it should mostly be
part of much clearer explanations in RFC 3986 about what rules,
especially about canonical forms and matching, apply to all URI
types, what is delegated to schemes/methods, and what can
reasonably be further delegated by particular schemes to, e.g.,
for the URN case, particular namespaces. The URNBIS WG
discussed whether to try to incorporate that type of explanation
into what became RFC 8141 but concluded that would be unwise, in
part because it would risk contradictions between the way some
of those in the URN community interpreted (or wanted to
interpret) 3986 and the way some of those in the more
traditional web community (a subset of whom are offended at the
whole idea of URNs) interpret 3986. Pushing those issues and
that debate into the definition of a particular namespace seems
unwise and inappropriate.

While the other is about i18n, it is less about what I would
think of as internationalization considerations for NBNs in
general than it is about free advice to those who are defining
national sub-namespaces for NBNs. Make statements like that if
you like (and please pay attention to whatever Juha, who is
ultimately responsible for the text, has to say on the subject),
but remember that the managers of those national sub-namespaces
are mostly repository libraries who, in general, know far more
about their names, their languages, and how Unicode relates to
those languages than we can hope to. Remembering that NBNs are
already deployed by some of those libraries and that they have
hundreds of years of experience with the types of identifier,
and identifier comparison rules, that work for them, I
personally think that the IETF should be a little careful about
hubris in giving advice on those issues, but whatever works.

best,
john

Peter Saint-Andre

2018-06-08 18:01:51 UTC

Post by John C Klensin
--On Thursday, June 7, 2018 16:29 -0600 Peter Saint-Andre

I started to write a much longer note and explanation, but maybe
a summary is better for all concerned (i.e., if more is needed,
it is readily available).

Indeed. Similarly, I had started writing a longer note yesterday,
pointing out that much of this is already covered in:

1. RFC 3986 (esp. Section 1.2.1, which doesn't say how "to use a wider
range of characters" nor does it "specify the character encoding used to
map those characters to octets prior to being percent-encoded for the URI")

2. RFC 8141 (esp. Section 2.2, which specifies a character encoding of
UTF-8 but doesn't say how to process "those characters" before UTF-8
encoding).

My reading of Adam's and Ben's messages is that they'd like a more
detailed specification of the last-mentioned item, or at least bring the
matter to the attention of anyone brave or foolhardy enough to use
"those characters" in NBNs.
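
The gap described in items 1 and 2 can be illustrated with a minimal sketch. It assumes Python's standard `unicodedata` and `urllib.parse` modules and a made-up sample string; nothing here comes from the draft itself. The same visible text, left in two different Unicode normalization forms before UTF-8 encoding, yields two different percent-encoded strings:

```python
import unicodedata
from urllib.parse import quote

# Hypothetical sample text; "\u00e9" is a precomposed e-acute.
name = "caf\u00e9"

# Two Unicode normalization forms of the same visible text.
nfc = unicodedata.normalize("NFC", name)  # one code point for the accented letter
nfd = unicodedata.normalize("NFD", name)  # "e" followed by a combining accent

# UTF-8 encode, then percent-encode everything outside the unreserved set.
enc_nfc = quote(nfc, safe="")
enc_nfd = quote(nfd, safe="")

print(enc_nfc)  # caf%C3%A9
print(enc_nfd)  # cafe%CC%81
```

Because the two results differ octet for octet, whichever form the assigner chooses (or fails to choose) is baked into the identifier.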

+1 to the rest of what John said.

Peter

Adam Roach

2018-06-08 19:45:03 UTC

Post by Peter Saint-Andre
...
Indeed. Similarly, I had started writing a longer note yesterday,
1. RFC 3986 (esp. Section 1.2.1, which doesn't say how "to use a wider
range of characters" nor does it "specify the character encoding used to
map those characters to octets prior to being percent-encoded for the URI")
2. RFC 8141 (esp. Section 2.2, which specifies a character encoding of
UTF-8 but doesn't say how to process "those characters" before UTF-8
encoding).
My reading of Adam's and Ben's messages is that they'd like a more
detailed specification of the last-mentioned item, or at least bring the
matter to the attention of anyone brave or foolhardy enough to use
"those characters" in NBNs.

What I have in mind is more the second thing than the first. As John
pointed out, what libraries want to do may vary from country to country.
I just want text that warns that there's a gun here and that they should
take care to aim it away from their own foot.

/a

Benjamin Kaduk

2018-06-08 20:25:59 UTC

Post by Adam Roach

I think that's what I had in mind as well (to the extent that I had
anything at all in mind when I balloted -- I don't want to claim to
be an expert in this space).

-Benjamin

Benjamin Kaduk

2018-06-08 20:32:30 UTC

I'm happy to see the main point of discussion progressing with input
from people who know more about the subject than me ... that said, I
can comment on some of the other points, inline.

Document shepherd here. I expect the document author (and perhaps my
co-author on RFC 8141) to provide further thoughts.

Benjamin Kaduk has entered the following ballot position for
draft-hakala-urn-nbn-rfc3188bis-01: Discuss
When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)
Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.
https://datatracker.ietf.org/doc/draft-hakala-urn-nbn-rfc3188bis/
----------------------------------------------------------------------
----------------------------------------------------------------------
I think this document may benefit from an Internationalization
Considerations section, but am not entirely sure how needed it is.
So let's discuss it...
In particular, the URN:NBN lexical equivalence rules include several
case-insensitive comparisons, for the prefix and for the case of the
hex digits in any percent-encoded values, but do not specify any
operation on the decoded percent-encoded values/characters.

In particular, with regard to characters outside the ASCII range,
URNs that appear in protocols or that are passed between systems MUST
use only Unicode characters encoded in UTF-8 and further encoded as
required by RFC 3986. To the extent feasible and consistent with the
requirements of names defined and standardized elsewhere, as well as
the principles discussed in Section 1.2, the characters used to
represent names SHOULD be restricted to either ASCII letters and
digits or to the characters and syntax of some widely used models
such as those of Internationalizing Domain Names in Applications
(IDNA) [RFC5890], Preparation, Enforcement, and Comparison of
Internationalized Strings (PRECIS) [RFC7613], or the Unicode
Identifier and Pattern Syntax specification [UAX31].
In order to make URNs as stable and persistent as possible when
protocols evolve and the environment around them changes, URN
namespaces SHOULD NOT allow characters outside the ASCII range
[RFC20] unless the nature of the particular URN namespace makes such
characters necessary.
By my reading of draft-hakala-urn-nbn-rfc3188bis and RFC 8141, the
allowable case-sensitivity for nbn_string constructs generated by a
national library applies to the percent-encoded string because that is
where any comparison or equivalence-matching would occur for these
identifiers. Venturing into case matching of percent-decoded strings
would (IMHO) unnecessarily open up an ugly can of worms.
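
A rough sketch of comparison on the percent-encoded form, as discussed above, might look like the following. The function names and sample identifiers are hypothetical. The sketch assumes that the prefix ends at the first hyphen, that any r-, q-, or f-components have already been removed, and that the nbn_string is treated as case-sensitive (national practice may differ). It never percent-decodes or normalizes anything:

```python
import re

# Matches one percent-encoded triplet such as %c3 or %C3.
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")

def nbn_urn_key(urn: str) -> str:
    """Return a comparison key for a URN:NBN without percent-decoding it."""
    # Assumption: the prefix contains no hyphen, so the first one ends it.
    head, sep, nbn_string = urn.partition("-")
    if not sep or not head.lower().startswith("urn:nbn:"):
        raise ValueError(f"not a URN:NBN: {urn!r}")
    # "urn:nbn:" plus the country code and sub-namespaces: case-insensitive.
    head = head.lower()
    # Fold only the hex digits of percent-encoded triplets to upper case.
    nbn_string = _PCT.sub(lambda m: m.group(0).upper(), nbn_string)
    return f"{head}-{nbn_string}"

def lexically_equivalent(a: str, b: str) -> bool:
    """Compare two URN:NBNs on their percent-encoded form only."""
    return nbn_urn_key(a) == nbn_urn_key(b)

# Hypothetical identifiers, for illustration only.
print(lexically_equivalent("URN:NBN:FI-fe%c3%a9", "urn:nbn:fi-fe%C3%A9"))
print(lexically_equivalent("urn:nbn:fi-caf%C3%A9", "urn:nbn:fi-cafe%CC%81"))
```

The first pair differs only in prefix case and hex-digit case, so it compares equal. The second pair encodes two Unicode forms of the same visible text and compares unequal, which is the behaviour being argued for here.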

In many
(perhaps even most?) cases, ignoring such encoded characters for
purposes of case-insensitive comparison is the wrong thing to do,
but if I understand correctly, it actually is the correct thing to
do in this case. Namely, a NBN (or URN:NBN), once assigned, is
essentially static data and consumers of it should not attempt to
perform modification, Unicode normalization, etc. on it -- that
would potentially change what is being identified (or render the
identifier invalid).

Well, Unicode normalization would be used as part of equivalence
operations (as in IDNA or PRECIS), but in general you are right about
modification. These are identifiers or even numbers, not malleable strings.

On the other hand, a national library or
delegated institution that is assigning NBNs may wish to take into
account Unicode normalization rules and other similar considerations
while assigning NBNs (in particular, the nbn_string component), as
part of their allocation policy.

It could, but as far as I know none of the national libraries have yet
gone down that path or seen the need to. Juha can tell us if I'm wrong.

Because these can be subtle, it
may be worth explicitly pointing out the potential issues for
registration authorities.

"There be dragons and don't go there" seems like fine advice.

That, plus the directive to consumers to
not normalize, seems like it would be appropriate content for an
Internationalization Considerations section.

By "normalize" you mean perform equivalence matching of percent-decoded
strings (of which Unicode normalization might be one step), right? Here
again I think the answer is "don't do that" because equivalence
matching is done on the percent-encoded strings.

I did not have a terribly concrete scenario in mind when I wrote
this; I think the one Adam described is probably enough to get us
thinking about the right things.

Separately, in Section 4.2.1 where we cover r-components, I noted
that RFC 8141 rather discourages actually using r-components until
their semantics are standardized. The text here seems to be giving
free rein for national libraries to assign their own semantics
without any coordination with a broader community.

Juha and perhaps John can clarify, but as I understand it the scope of a
URN resolver for NBNs would likely be within a particular national
library system, not even necessarily across all national libraries (this
is how things are deployed now in the absence of URN resolution, in any
case).

Do we really
want to advocate for this, as opposed to attempting to get broadly
unified semantics for r-components Internet-wide? (Perhaps we
already have and I just missed it; if so, a reference here would be
appropriate.)

The semantics of r-components are yet to be defined. I would venture
that the IETF is probably not the right place to do that work, given how
little energy remained in the URN WG at the end (and we probably didn't
have the right people in the room in the first place).

I won't argue with that. Does it make sense to say something like
"There are not currently any broadly accepted semantics for
r-components at the time of this writing, which may be grounds to be
cautious with their use" in this document?

----------------------------------------------------------------------
----------------------------------------------------------------------
I'm a little confused on some of the places in the text that talk
about URN:NBNs being "generated from" NBNs (and non-reuse
thereafter) or restrictions on URN:NBN assignment (e.g.,
uniqueness). The procedure seems to be basically deterministic for
creating a URN:NBN once an NBN is assigned, and potentially
something that could be done by any party in possession of the NBN
(i.e., not necessarily the registration authority that created the
NBN). So I'm not sure why the act of generating the URN:NBN has any
significance, if anyone could do it -- the restrictions would need
to apply at NBN assignment time in order to be useful. (This kind
of gets into Ben's DISCUSS point, too, in the sense that we can only
say what prerequisites there are for national library NBN allocation
policies in order for them to be useful with URN:NBN, but they can
in principle do whatever they like and choose to not use URN:NBN.)

Yes, the process of creating a URN from an NBN is trivial (modulo
potentially interesting encoding of non-ASCII characters). I think the
point of the text is that an NBN URN is not exactly the same as an NBN.
Perhaps that could be worded more clearly.

Okay. (I don't think I have any suggestions for different text.)

Section 3.2
From the library community point of view it is important that the
f-component is not a part of the NSS and therefore f-component
attachment does not mean that the relevant component part is
identified. Moreover, the resolution process still retrieves the
entire resource even if there is an f-component. The fragment
selection is applied by the resolution client (e.g., browser) to the
media returned by the resolution process. In other words, in this
latter case the fragments are logical and physical components of the
identified resource whereas in the former cases these "fragments" are
actually complete, independently named entities.
I'm not sure I'm understanding this correctly -- is the "former
case" the thing that libraries should not do, namely, including the
f-component in the NSS?

Now that you point it out, I'm not sure what the former case is.
Formally speaking the f-component simply is not part of the NSS, see the
ABNF in RFC 8141.

I guess we should wait for Juha to clarify.

If an NBN identifies a work, descriptive metadata about the work
SHOULD be supplied. The metadata record MAY contain links to
Internet-accessible digital manifestations of the work.
This left me confused. Is it only intended to apply in the case
described in the previous paragraph, where the resource identified
by the NBN is not available in the Internet? Or does it always
apply, forcing the metadata to take precedence over delivering the
actual work? (Or maybe I'm just confused, and there's an easy way
to deliver both metadata and the actual work alongside each other
with no ambiguity.)

Juha can clarify this.

Section 4.1
National Bibliography Number (NBN) is a generic term referring to a
group of identifier systems administered by the national libraries
and institutions authorized by them.
"the national libraries" implies a specific set -- which ones? It
may be better to hedge with "some national libraries".

Or remove "the" ... "by national libraries".

That's probably better :)

Thanks,

Benjamin

Section 4.2.2
Do we need to say anything about a URN-to-URI step before talking
about URI-to-resource services?
I'm also wondering about any relationship between "component
resource" NBNs and f-components of the containing work. If there
are NBNs assigned to both an image within a work and that containing
work, and an NBN with f-resource is used to refer to the image
within the containing work, is there any relationship between the
f-resource and the image-specific NBN?
Section 4.3
Expressing NBNs as URNs is usually straightforward, as only ASCII
characters are allowed in NBN strings. If necessary, NBNs MUST be
translated into canonical form as specified in RFC 8141.
When is it necessary?

It seems that in theory an NBN itself could contain non-ASCII
characters, whereas an NBN URN and its nbn_string construct can contain
only ASCII characters. At least that is my understanding.
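
As a minimal sketch of that understanding: the helper name `nbn_to_urn` and its parameters are hypothetical, the sample values are made up, and the choice to percent-encode every character outside the unreserved set is an assumption rather than something the draft mandates. A locally assigned NBN could be turned into an ASCII-only URN like this:

```python
from urllib.parse import quote

def nbn_to_urn(prefix: str, nbn: str) -> str:
    """Build a URN:NBN from a prefix (e.g. a country code) and a local NBN."""
    # Percent-encode the UTF-8 octets of every character outside the
    # unreserved set; the NBN is used exactly as assigned, with no
    # Unicode normalization or case mapping applied.
    encoded = quote(nbn, safe="", encoding="utf-8")
    # The prefix is case-insensitive, so lower case is as good as any.
    return f"urn:nbn:{prefix.lower()}-{encoded}"

# Hypothetical values, for illustration only.
print(nbn_to_urn("FI", "fe201003181510"))  # urn:nbn:fi-fe201003181510
print(nbn_to_urn("fi", "caf\u00e9"))       # urn:nbn:fi-caf%C3%A9
```

An all-ASCII NBN passes through unchanged, which matches the "usually straightforward" case in the quoted text; only a non-ASCII NBN picks up percent-encoded octets.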

Being part of the prefix, sub-namespace identifier strings are case-
insensitive. They MUST NOT contain any hyphens.
This MUST seems to just duplicate a syntactic requirement from the
ABNF; is RFC 2119 language really necessary?

/me shrugs

Section 8
John Klensin provided significant editorial and advisory support for
late versions of the draft.
Presumably that's "later versions"?

Yes.
Peter

John C Klensin

2018-06-09 20:20:39 UTC

--On Friday, June 8, 2018 15:32 -0500 Benjamin Kaduk