comcifs / cif_core

The IUCr CIF core dictionary

cif_core's People

vaitkus pavoljuhas merkys kkitahara5101 jamesrhester sauliusg rowlesmr publcif nautolycus

cif_core's Issues

ddl.dic: proposal for the expansion of the 'Real' type definition

I suggest expanding the definition of Real to allow fractional numbers by permitting /[1-9][0-9]* to be appended to the currently allowed Real numbers. Such an expansion would avoid rounding problems such as the conversion of 1/3 to 0.333333, which is a loss of information; it is not unusual for papers to report coordinates or occupancies as fractions.

Of course, such a change would require all reading programs to be aware of possible fractional numbers. However, specialized libraries already exist for the Perl, Python and Java languages.
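As a minimal sketch of what a reader could do with the proposed syntax, the following Python snippet parses values such as `1/3` or `0.25` into exact fractions. The function name `parse_real` and the simplified pattern (no exponents, no standard uncertainties in parentheses) are assumptions made for illustration, not part of any dictionary or existing library.

```python
from fractions import Fraction
import re

# Assumed, simplified pattern: a plain decimal number optionally followed by
# "/<positive integer without leading zero>", as suggested in the proposal.
_REAL = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))(?:/([1-9][0-9]*))?$")


def parse_real(text):
    """Return an exact Fraction for values such as '1/3', '0.25' or '-2'."""
    match = _REAL.match(text)
    if match is None:
        raise ValueError(f"not a Real value: {text!r}")
    numerator = Fraction(match.group(1))
    denominator = int(match.group(2) or 1)
    return numerator / denominator
```

With exact fractions, `parse_real("1/3") * 3` is exactly 1, which is the loss of information the proposal aims to avoid.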

Incorrect description of '_audit.creation_date' in cif_core.dic

The description of _audit.creation_date in cif_core.dic gives the date format as dd-mm-yyyy, while the _type.contents of _audit.creation_date is Date, which mandates the ISO standard date format.

pd_meas.angle_2theta and pd_meas.2theta_scan duplicate one another?

The DDL1 dictionary has both pd_meas_angle_2theta and pd_meas_2theta_scan - how do they differ?

cif_core.dic: enumerator list of the _geom_torsion.publ_flag data item differs between DDL1 and DDLm versions

The _geom_torsion.publ_flag data item seems to have lost its short-form enumerator values "y" and "n" (short for "yes" and "no" respectively) during the conversion from DDL1 to DDLm.

In addition, the _reflns.apply_dispersion_to_Fcalc data item is also missing the short-form enumerator values, although technically this is not an error since the data item was not previously defined in the DDL1 dictionary. However, for consistency's sake I would recommend adding these values, since they are present in all other data items containing the yes/no enumerators.

The implementation of these changes can be seen in pull request #52.

pd_calib_2theta_offset forms a second loop in pd_calib

The powder dictionary allows pd_calib_2theta_offset_* values to be used in a number of ways:
(1) As straightforward key-value: _pd_calib_2theta_offset is a value to be added to all measured angles
(2) Looped with appropriate points:

loop_
  _pd_calib_2theta_offset
  _pd_calib_2theta_offset_point

(3) Looped with range of validity

loop_
  _pd_calib_2theta_offset_min
  _pd_calib_2theta_offset_max
  _pd_calib_2theta_offset

The pd_calib category already has loops on detector_id, so this loop means a new category must be created.

ddl.dic: _type.contents character sets

The _type.contents data item contains several text-based fields:

Type	Description
Text	case-sens strings or lines of STAR characters
Code	case-insens contig. string of STAR characters
Name	case-insens contig. string of alpha-num chars or underscore
Tag	case-insens contig. STAR string with leading underscore
Filename	case-sens string identifying an external file
Uri	case-sens string as universal resource indicator of a file

I have a few questions regarding these definitions:

What characters are included in the STAR characters set?

According to [1] and its supplementary material, the extended character set consists of the following code points:

[U+0009], [U+000A], [U+000D], [U+0020-U+D7FF], [U+E000-U+FFFD], [U+10000-U10FFF]

However, the character set used in the CIF2 definition (supplementary material of [2]) is different:

[U+0009], [U+000A], [U+000D], [U+0020-U+007E], [U+00A0-U+D7FF], [U+E000-U+FDCF],
[U+FDF0-U+FFFD], [U+10000-U+1FFFD], [U+20000-U+2FFFD], [U+30000-U+3FFFD],
[U+40000-U+4FFFD], [U+50000-U+5FFFD], [U+60000-U+6FFFD], [U+70000-U+7FFFD],
[U+80000-U+8FFFD], [U+90000-U+9FFFD], [U+A0000-U+AFFFD], [U+B0000-U+BFFFD],
[U+C0000-U+CFFFD], [U+D0000-U+DFFFD], [U+E0000-U+EFFFD], [U+F0000-U+FFFFD],
[U+100000-U+10FFFD]

Maybe there is a different definition of the STAR characters somewhere else? Or is this mismatch intentional?

How is a contig. string defined? (A string with no new lines? A string with no spaces? A string with no white space symbols at all?)
How is a STAR string defined? (A string made out of STAR characters I assume?)
The Filename and Uri do not seem to have their character sets specified. What is the implied character set?

[1] Spadaccini, N. and Hall, Sydney R., "Extensions to the star file syntax", Journal of Chemical Information and Modeling 52 (2012), no. 8, 1901–1906, doi: 10.1021/ci300074v
[2] Bernstein, Herbert J.; Bollinger, John C.; Brown, I. David; Gražulis, Saulius; Hester, James R.; McMahon, Brian; Spadaccini, Nick; Westbrook, John D.; Westrip, Simon P., "Specification of the crystallographic information file format, version 2.0", Journal of Applied Crystallography 49 (2016), no. 1, 277–284, doi: 10.1107/S1600576715021871
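To make the mismatch between the two quoted character sets concrete, here is a small Python sketch that tests individual code points against each list. The range tables are copied from the lists quoted in this issue (including the upper bound of the last STAR range exactly as quoted); the names `STAR_RANGES`, `CIF2_RANGES`, `in_star` and `in_cif2` are illustrative only.

```python
# Inclusive code point ranges, copied from the lists quoted in this issue.
STAR_RANGES = [
    (0x09, 0x09), (0x0A, 0x0A), (0x0D, 0x0D),
    (0x20, 0xD7FF), (0xE000, 0xFFFD),
    (0x10000, 0x10FFF),  # upper bound taken as quoted in the issue text
]

CIF2_RANGES = [
    (0x09, 0x09), (0x0A, 0x0A), (0x0D, 0x0D),
    (0x20, 0x7E), (0xA0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
] + [
    # Planes 1 to 16, each without its last two code points.
    (plane * 0x10000, plane * 0x10000 + 0xFFFD) for plane in range(1, 17)
]


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def in_star(code_point):
    """True if the code point is in the extended STAR set as quoted."""
    return _in_ranges(code_point, STAR_RANGES)


def in_cif2(code_point):
    """True if the code point is in the CIF2 set as quoted."""
    return _in_ranges(code_point, CIF2_RANGES)
```

For example, U+007F and U+FDD0 are accepted by the STAR list but rejected by the CIF2 list, which is the kind of difference the question is about.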

Missing data name in multiple dictionaries

When compared to their DDL1 counterparts, multiple DDLm dictionaries seem to be missing some of the data names. I have also surveyed the COD (svn revision 197917) for the usage of these data names; the number of occurrences is given in parentheses.

cif_core.dic
The DDLm cif_core.dic, when compared to the DDL1 cif_core.dic, seems to be missing the following data names:
_audit_block_doi (1323)
_citation_doi (213)
_citation_publisher (0)
_database_dataset_doi (0)
_publ_contact_author_id_orcid (0)
_publ_author_id_orcid (0)
_diffrn_refln_crystal_id (1)
_refln_crystal_id (0)
_space_group_id (11)
_space_group_symop_sg_id (0)
_symmetry_Int_Tables_number (38276)
_symmetry_equiv_pos_site_id (14566)

1.1. The _audit_block_doi, _citation_doi, _citation_publisher, _database_dataset_doi, _publ_contact_author_id_orcid and _publ_author_id_orcid data items are not defined in the DDLm dictionary at all.

1.2. The _diffrn_refln_crystal_id and _refln_crystal_id data items are also not defined in the DDLm dictionary even though the related data item _exptl_crystal_id is defined. Maybe these two data names could become aliases for the _exptl_crystal_id data name? The COD currently has 1 entry that contains the _diffrn_refln_crystal_id data name, no entries that contain the _refln_crystal_id and 610 entries that contain the _exptl_crystal_id.

1.3. I assume that the _space_group_id and _space_group_symop_sg_id data items were purposely removed due to space group information no longer being looped in the DDLm. The COD currently has 11 entries that contain the _space_group_id data item and 0 entries that contain the _space_group_symop_sg_id. However, in all cases the data item seems to be misused (used in a separate loop from the rest of the data or given the same value as the _space_group_IT_number data item) and as a result the data items will most likely be removed from these files. I would lean towards removing these data items from the DDL1 legacy dictionary as well (after disallowing the looped space group information). Do you agree?

1.4. The _symmetry_Int_Tables_number and _symmetry_equiv_pos_site_id data names are not listed as aliases for the data items _space_group.IT_number and _space_group_symop.id respectively, but their dotted counterparts (_symmetry.Int_Tables_number and _symmetry_equiv.pos_site_id) are. I recommend adding the aforementioned data names as aliases. I have included this small correction among others in my pull request #30.

cif_ms.dic
When compared to the DDL1 cif_ms.dic the DDLm cif_ms.dic seems to be missing the following data names:
_refln_index_m_1 (20)
_refln_index_m_2 (5)
_refln_index_m_3
_refln_index_m_4
_refln_index_m_5
_refln_index_m_6
_refln_index_m_7
_refln_index_m_8

cif_pow.dic (cif_pd.dic)
When compared to the DDL1 cif_pd.dic the DDLm cif_pow.dic seems to be missing the following data names:
_pd_spec_size_equat (523)
_pd_spec_size_thick (405)
_pd_instr_monochr_pre_spec (8)
_pd_instr_monochr_post_spec (6)

3.1. The _pd_spec_size_equat and _pd_spec_size_thick data items are not defined at all.

3.2. The _pd_instr.monochr_pre_spec and _pd_instr.monochr_post_spec data items are defined, but are missing the _pd_instr_monochr_pre_spec and _pd_instr_monochr_post_spec aliases (they have the _pd_instr_monochr_pre/spec and _pd_instr_monochr_post/spec aliases instead). I recommend adding the aforementioned data names as aliases. I have included this small correction among others in my pull request #30.

cif_rstr.dic (cif_core_restraints.dic)
When compared to the DDL1 cif_core_restraints.dic the DDLm cif_rstr.dic seems to be missing the following data name:
_restr_equal_angle_detail (0)

4.1. In this case the error seems to stem from the DDL1 dictionary (cif_core_restraints.dic) and not the DDLm one (cif_rstr.dic). The DDL1 dictionary contains a typo in the name of the data item -- it is written as '_restr_equal_angle_detail' even though the related data block is named '_restr_equal_angle_details'.

The survey of the COD revealed that neither the '_restr_equal_angle_detail' nor the '_restr_equal_angle_details' data name was used at all, so I think it is safe to correct the legacy DDL1 dictionary and leave the DDLm one as is. Do you agree?

Finally, some data items contain aliases that are identical to their _definition.id. Is it a bug or a feature?

ddl.dic: some enumeration values contain whitespace characters

Some enumeration values violate the Code data type constraints by containing a whitespace symbol:

ddl.dic: the _audit.schema data item contains the enumeration value 'Space group tables'. Even stranger, the _type.contents of the _audit.schema data item is Text even though the save frame contains enumerator values;
templ_enum: several save blocks that involve space groups (_H_M_ref, _ref_set) contain numerous enumeration values with whitespace symbols ('P 1', 'P -1', ...).

I propose three possible approaches:

Replace the spaces with underscores (this approach seems to involve the least work and the lowest risk of breaking anything else);
Change the _enumeration_set.state data item data type from Code to Implied. However, I would argue against this approach since it would make the usage of enumeration values highly irregular (sometimes they would be case sensitive and sometimes not, etc.);
Modify the Code data type to allow the space (' ') symbol. The text would be more readable in certain cases (e.g. space group symbols), but I'm unsure if it won't break anything else.

What do you think?
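A minimal Python sketch of how such values could be detected, and of what the first approach would produce, is given below. The function names are hypothetical and the sample values are the ones quoted above.

```python
def values_with_whitespace(values):
    """Return the enumeration values that contain any whitespace character."""
    return [value for value in values if any(ch.isspace() for ch in value)]


def underscore_form(value):
    """First approach from the list above: replace each space with an underscore."""
    return value.replace(" ", "_")


# Values quoted in this issue, plus one value ('yes') without whitespace.
sample = ["Space group tables", "P 1", "P -1", "yes"]
offending = values_with_whitespace(sample)
```

Here `offending` holds the first three values, and `underscore_form("P -1")` gives `P_-1`, which shows that the underscore form is less readable for space group symbols.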

ddl.dic (cif2-conversion): ATTRIBUTES save frame contains no _name.category_id

The _name.category_id data item is marked in the _dictionary_valid.application loop as being mandatory in the Category scope, however, the ATTRIBUTES save frame does not contain it. Since it is declared as the Head category of the dictionary, maybe marking it as being the container category of itself would work?

On a related note, the head CIF_CORE category in the cif_core.dic dictionary has CIF_DIC as its container category, even though such a category does not seem to be defined anywhere.

The _audit.schema save frame in the cif_core.dic is also missing the _definition.update data item.

pd_refln uses core_cif refln in a problematic way

pd_refln proposes adding a peak_id and phase_id to the core_cif refln list. The core_cif refln list assumes that only h,k,l are needed to index a row. If there are multiple phases, the phase_id will also be added, and it then becomes possible that the same h,k,l will appear in multiple rows. This could cause issues for software that expects the reflection list to consist of unique h,k,l.
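The concern can be illustrated with a short Python sketch that looks for repeated keys in a reflection list. The rows and the field names `h`, `k`, `l` and `phase_id` are invented sample data for illustration, not taken from any real file.

```python
from collections import Counter


def duplicated_keys(rows, key_fields):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


# Hypothetical two-phase reflection list: both phases contain (1, 0, 0).
rows = [
    {"h": 1, "k": 0, "l": 0, "phase_id": "A"},
    {"h": 1, "k": 1, "l": 0, "phase_id": "A"},
    {"h": 1, "k": 0, "l": 0, "phase_id": "B"},
]

clashes_hkl_only = duplicated_keys(rows, ("h", "k", "l"))
clashes_with_phase = duplicated_keys(rows, ("h", "k", "l", "phase_id"))
```

Software that keys on h,k,l alone sees a clash for (1, 0, 0), while a key that includes the phase identifier does not.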

cif_core.dic (cif2-conversion): _type.contents value 'Symop'

The _function.Symop save frame in the cif_core.dic and the site_symmetry save frame in the templ_attr.cif contain the _type.contents data items with the value 'Symop'. This value, however, does not seem to be defined as one of the possible enumerator values of the _type.contents data item.

cif_core.dic (cif2conversion): default enumeration value is out of enumeration range

Data items _chemical_conn_bond.distance and _chemical_conn_bond.distance_su have the default enumeration value of 0.0, even though the enumeration range is 0.5:.

The _chemical_conn_bond.distance_su data item should probably also have the enumeration range of 0.0: since it stores the standard uncertainty.
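A check of this kind is easy to automate. The sketch below assumes that a range is written as `min:max`, that an empty side means unbounded, and that the bounds are inclusive; the function names are hypothetical.

```python
def parse_range(text):
    """Parse a range such as '0.5:' or '0.0:1.0'; an empty side means unbounded."""
    low, _, high = text.partition(":")
    return (float(low) if low else None, float(high) if high else None)


def in_range(value, text):
    """True if value lies within the (assumed inclusive) range."""
    low, high = parse_range(text)
    return (low is None or value >= low) and (high is None or value <= high)
```

Under these assumptions `in_range(0.0, "0.5:")` is false, which is the inconsistency reported here, while `in_range(0.0, "0.0:")` is true for the range suggested for the standard uncertainty.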

cif_core.dic: enumerator list of the _publ_requested.category data item differs between DDL1 and DDLm versions

When compared to the DDL1 version of the dictionary, the DDLm one seems to be missing the following enumerator values of the _publ_requested.category data item: GO, HO, QI, GI, QO, HM, HI, GM, QM. Is this change accidental or intentional?

_definition.class for AUDIT_CONFORM should be 'Loop' instead of 'Set'

According to cif_core v2.4.5, _audit_conform_* tags may appear in a loop (according to http://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iaudit_conform_dict_name.html), while such use is effectively forbidden in v3.0.07 (_definition.class of AUDIT_CONFORM is Set, not Loop).

cif_core.dic: enumerator list of _atom_sites.solution_hydrogens data item differs between DDL1 and DDLm versions

The enumerator list of the _atom_sites.solution_hydrogens data item differs a bit in the DDL1 and DDLm version. For some reason, the mixed value is present in the DDL1 version, but not in the DDLm one. Is this intentional?

ddl.dic: the _type.contents of the _dictionary_valid.attributes data item

The _dictionary_valid.attributes data item is declared as being of type Name. This limits the allowed character set to the alphanumeric one with the underscore symbol included ([a-zA-Z0-9_]). However, some of the _dictionary_valid.attributes values do contain the '.' symbol as well (e.g. _dictionary.title and _dictionary.class). The definition of the Name type should probably not be changed, since the exclusion of the '.' symbol seems to be intentional (data items of this type are referenced in dREL scripts).

I propose changing the _type.contents value from Name to Tag. This would extend the allowed character set to the Unicode one, but it should not be a problem, since data items and categories are already referenced by their _definition.id and not the <_name.category_id>.<_name.object_id> construct (i.e. ALIAS vs. ATTRIBUTES.ALIAS); the _definition.id data item does allow the Unicode set.

It seems that the newly added _definition.replaced_by data item suffers from the same problem as well.
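The mismatch can be reproduced with a one-line check in Python, assuming the Name type corresponds exactly to the character class [a-zA-Z0-9_] mentioned above; `is_name` is a hypothetical helper.

```python
import re

# Assumed reading of the Name type: alphanumeric characters and underscore only.
_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_name(value):
    """True if the whole value consists of Name characters."""
    return _NAME.fullmatch(value) is not None
```

With this reading, `is_name("_dictionary.title")` is false because of the '.' symbol, whereas `is_name("ALIAS")` is true.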

pd_peak.special_details is the only pd_peak item that is not looped

pd_peak.special_details is intended to capture unusual aspects of the peak determination process, so it is an overall item and should not be looped.

cif_core.dic: definition of refln.d_spacing

Is the definition

_refln.d_spacing = 2. / _refln.sin_theta_over_lambda

correct, i.e. Bragg's law nλ = 2d sin θ with n = 4?
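For reference, the arithmetic behind the question can be written out as follows, where s stands for the value of _refln.sin_theta_over_lambda (the symbol s is introduced here only for brevity):

```latex
n\lambda = 2d\sin\theta
\quad\Longrightarrow\quad
d = \frac{n\lambda}{2\sin\theta} = \frac{n}{2s},
\qquad s = \frac{\sin\theta}{\lambda}.
```

For a first-order reflection (n = 1) this gives d = 1/(2s), so the quoted expression d = 2/s only agrees with Bragg's law if n = 4.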

ddl.dic (cif2-conversion): _type.container definitions

The '_type.container' enumerator value descriptions seem to be incorrect. The "Array", "Matrix", "List" and "Table" values are described as having elements separated by commas; however, according to the CIF2 syntax, elements in arrays and tables should be separated by whitespace.

_type.dimension definition and its usage in DDL_DIC v3.11.10 do not match

_type.dimension is defined as a single value (_type.container Single) and is described as text string within bounding square brackets in DDL_DIC v3.11.10. However, values of _type.dimension seem to be lists exclusively. Which usage is correct?

pd_block is not looped, but pd_diffractogram is

The current draft assumes that pd_block.id and pd_block_diffractogram_id are looped together, when in fact there may only be one pd_block.id per datablock.

cif_core.dic (cif2-conversion): _type.dimension notation for matrices

The DDLm definition of the _type.dimension data item provides the following example for matrices:

loop_
_description_example.case
_description_example.detail
"[3,3]"              '3x3 matrix of elements'

However, cif_core.dic (and cif_ms.dic) contain multiple entries where the elements in brackets are separated by whitespace instead of a comma; e.g. the _cell.metric_tensor data item contains the following value:

_type.dimension                         '[3  3]'

Which case should be treated as the correct one?
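Until this is settled, a reading program may have to tolerate both spellings. The following Python sketch accepts commas and whitespace as separators; `parse_dimension` is a hypothetical helper, and accepting both forms is a workaround rather than a statement about which form is correct.

```python
import re


def parse_dimension(text):
    """Parse '[3,3]' or '[3  3]' into a list of integers; '[]' gives []."""
    stripped = text.strip()
    if len(stripped) < 2 or stripped[0] != "[" or stripped[-1] != "]":
        raise ValueError(f"not a bracketed dimension: {text!r}")
    # Split on any run of commas and/or whitespace and drop empty pieces.
    parts = [part for part in re.split(r"[,\s]+", stripped[1:-1]) if part]
    return [int(part) for part in parts]
```

Both `parse_dimension("[3,3]")` and `parse_dimension("[3  3]")` yield the same list of two threes.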

cif_linguist should preserve whitespace and comments

Program cif_linguist should copy whitespace and comments from its input to its output as much as possible.

CIF_POW: datanames do not coincide with categories

As per section 3.3.3 of International Tables Vol G, datanames in pdCIF are not named according to their category. They must, however, belong to either a Loop or Set category in DDLm. This issue train records the category assignment decisions.

ddl.dic (cif2-conversion): _type.container 'Multiple' / _type.contents values

The _type.container data item in the DDLm attribute dictionary has 'Multiple' as one of its available values. The description of this data item states that the value 'Multiple' allows:

values as List or by boolean ,|&!* or range : ops

The given definition seems a bit ambiguous and I was hoping that you could clear a few things up:

Are the three given expressions (list, boolean and range) mutually exclusive or can they be used in the same context? For example:
1.1) can a list contain boolean or range expressions?
1.2) can ranges be used in boolean expressions, etc.?
Lists:
2.1) Can lists be nested? If yes, then is there a limit to the maximum nesting depth?
Boolean:
3.1) What is the set of the allowed boolean operations and what does each of the operators mean?
The '&' (and), '|' (or) and '!' (not) operators are used widely in multiple programming languages,
but the ',' and '*' might have different meanings depending on the implementation.
3.2) are parentheses allowed in the boolean expression?

A related set of questions deals with the _type.contents values in the cif_core dictionary.
It is marked as being an enumerator, however, in several cases composite values are used instead of a single enumerator:

_model_site.adp_eigen_system: List(Real,Real,Real,Real)
_atom_type_scat.versus_stol_list: List(Real,Real)
_function.Closest : Matrix(Real,Real,Real)

Most of the questions related to these observations overlap with those given about the _type.container data item, however, some do not:

Are these composite values intentional or are they a leftover from previous revisions? I assume that such values are made legitimate by the _type.container value 'Multiple' in the _type.contents data item definition; however, this seems to conflict with the COMCIFS policy "not to allow changing the interpretation of a dataname based on the values of other datanames";
The List() notation is given in one of the examples, however, the Matrix() notation is not mentioned in any of the descriptions. Is it allowed or is that a mistake?

Finally, I think that the majority of my questions are easily answered by a formal grammar definition of the field values. Even if the dictionaries are primarily intended for human readers, the grammar removes a lot of ambiguity (as well as makes life much easier for programmers). I have composed an EBNF grammar (type-contents.txt) for the _type.container data item values based on a few assumptions:

The enumerators appearing in the values are case-insensitive;
There is no list nesting limit, that is the Integer,Real,Binary, Integer,List(Integer,Real) and List(List(List(Name,Tag),Integer,Real),Binary),Octal are all valid values;
The boolean expressions are allowed inside lists, i.e.: Text|Real,Real&!Integer|Range and List(Text|Real,!Integer) are both valid values;
Only the '&' (and), '|' (or) and '!' (not) boolean operators are supported, but not ',', '*' since I am currently unsure of their interpretation;
The range syntax is not covered at all (should not matter in this context).

This is only the initial version and I could modify it based on your feedback. Maybe you could provide some example values that should and should not validate?

If you agree that such a specification (in one form or another) would be of value to the dictionary users, maybe a data item could be introduced to store this kind of specification in the dictionary as well (e.g. type.contents_grammar)? The formats of other values, such as symmetry operators, chemical formulae, etc. could also eventually be described in an unambiguous, machine-readable way.
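To show what a machine-readable grammar buys, here is a small recursive-descent validator in Python that follows the assumptions listed above (case-insensitive names, unlimited List nesting, only the '&', '|' and '!' operators, no ranges). It is an independent sketch and not the type-contents.txt grammar itself; the grammar in the comments and all function names are hypothetical, and any alphabetic identifier is accepted as an enumerator.

```python
import re

# Hypothetical token set: identifiers plus the punctuation used in the examples.
_TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9_]*|[(),|&!])")
_PUNCT = {"(", ")", ",", "|", "&", "!"}


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class _Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected or 'a token'}, found {token!r}")
        self.pos += 1
        return token

    def value(self):   # value := expr (',' expr)*
        self.expr()
        while self.peek() == ",":
            self.take(",")
            self.expr()

    def expr(self):    # expr := term (('|' | '&') term)*
        self.term()
        while self.peek() in ("|", "&"):
            self.take()
            self.term()

    def term(self):    # term := '!'? atom
        if self.peek() == "!":
            self.take("!")
        self.atom()

    def atom(self):    # atom := 'List' '(' value ')' | identifier
        token = self.take()
        if token in _PUNCT:
            raise ValueError(f"unexpected {token!r}")
        if token.lower() == "list" and self.peek() == "(":
            self.take("(")
            self.value()
            self.take(")")


def is_valid(text):
    """True if text is a well-formed value under the sketched grammar."""
    try:
        parser = _Parser(tokenize(text))
        parser.value()
        return parser.peek() is None
    except ValueError:
        return False
```

Under this sketch the example values listed above validate, while `Matrix(Real,Real,Real)` does not, because only the List() notation is covered; whether Matrix() should be accepted is one of the open questions.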

The forward slash in some pdCIF datanames can't be used in dREL

Many pdCIF datanames use a forward slash (e.g. pd_instr.dist_detc/anal). While this is acceptable as a dataname in CIF, the forward slash in dREL is reserved for division, and so the object.id for these values will have to be different to the dataname following the period, unlike for any other pdCIF item. There are two options:
(1) Leave the datanames with the forward slash and provide a different object_id
(2) Define new aliased datanames using underscores instead of forward slashes.

Note that in either case we are defining a new dataname (because we are changing an underscore to a period).

Option (1) presents minor inconveniences for those who would write and interpret dREL methods, but throws into doubt the whole point of using periods in pdCIF datanames, if for pdCIF the period cannot be used to determine either category or object reliably (currently at least the object part can be determined).

Option (2) creates new datanames that require substituting for both / and . to get the original dataname.
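The two options can be compared with a short Python sketch. It assumes that a dREL identifier may contain only letters, digits and underscores, which is a simplification made for this example; the helper names are hypothetical.

```python
import re

# Assumed shape of an identifier that can be used after the period in dREL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_dataname(dataname):
    """Split '_category.object' into (category, object) at the first period."""
    category, _, object_id = dataname.lstrip("_").partition(".")
    return category, object_id


def usable_in_drel(object_id):
    """True if the object part contains no characters such as '/'."""
    return _IDENTIFIER.fullmatch(object_id) is not None


def underscore_alias(dataname):
    """Option (2): replace every forward slash with an underscore."""
    return dataname.replace("/", "_")
```

For `_pd_instr.dist_detc/anal` the object part `dist_detc/anal` fails the check, while the alias `_pd_instr.dist_detc_anal` passes. Note that the replacement cannot be undone mechanically, since the alias does not record which underscore was originally a slash.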

cif_core.cif2.dic (cif2-conversion): _symmetry and _space_group data items should not be treated as aliases in some cases

In the DDL1 cif_core dictionary the _symmetry_* data items were deprecated and marked as replaced by the _space_group_* data items. In some instances, however, the data name was not the only thing changed, for example:

_symmetry_cell_setting and _space_group_crystal_system are both described as enumerators, but the latter does not have the rhombohedral value listed based on the fact that trigonal value encompasses it;
_symmetry_space_group_name_H-M was declared as a strictly unlooped value whereas _space_group_name_H-M_alt was declared as being either looped or unlooped.

As a result, these data names should probably not be marked as aliases. Or does the concept of 'alias' in DDLm encompass these differences? Maybe a mechanism of linking deprecated data items to their newer alternatives could be introduced into DDLm as well?

I add the relevant excerpt from the cif_core DDLm below:

save__symmetry.cell_setting

_definition.id                          '_symmetry.cell_setting'
loop_
  _alias.definition_id
         '_symmetry.cell_setting'      
         '_symmetry_cell_setting'      
         '_space_group_crystal_system' 
...
save_

save__space_group.crystal_system

_definition.id                          '_space_group.crystal_system'
loop_
  _alias.definition_id
         '_space_group.crystal_system' 
...
save_

CIF_POW: pd_calib contains two loops

pd_calib may loop on detector id, but also (according to the definition) on pd_calib_std_external_block_id. Given that only _pd_calib_detector_id is required in any loop (list_reference attribute), the DDL1 situation may lead to repeated values of _pd_calib_detector_id, meaning it is not a key. We could add _pd_calib_std_external_block_id to the list of keys, but that would require it to be present in any pd_calib loop. Instead, we define a new category pd_calib_std which has both a detector id and a block id as keys; this course of action seems most reasonable. It is not possible to make pd_calib_std a child category of pd_calib, as that would mean that pd_calib needs to have the _external_block_id as a key, which we want to avoid for compatibility with potential current use, in which only _pd_calib_detector_id needs to be present.

I suggest that we therefore create a separate pd_calib_std category for recording use of external standards, adding a pd_calib_std_detector_id to link the standard with the detector that is used. We can apply a default value to _detector_id in both categories to simplify usage.

MS: The Geom_ categories replace some of their keys

The Geom_angle category replaces the _site_symmetry_n keys with the site_ssg_symmetry_n keys, which means that, strictly speaking, the categories are different. This manifests most obviously in the dREL for geom_bond.distance no longer being correct. Essentially, we have to replace the entire category.

Versioning protocol is unclear

It is not clear when the major/minor/patch version numbers should be bumped on DDLm dictionaries. This should be clarified and documented.

This issue was first raised during discussion of #35.

cif_core.cif2.dic (cif2-conversion): can _space_group.name_H_M_alt be looped?

In DDL1 version of the cif_core.dic the _space_group_name_H-M_alt was declared as either looped or unlooped. However, the usage of this data item (_space_group.name_H_M_alt) is not completely clear in the DDLm version of the dictionary:

The data item is assigned to the SPACE_GROUP category, which in turn is defined as being a SET. If I understand correctly, this would imply that data items belonging to the category cannot be looped;
One of the descriptions given in the definition of the _space_group.name_H_M_alt data item illustrates the usage of this data item in a loop context:

;         loop_
         _space_group.id
         _space_group.name_H_M_alt
                          '1'   'C m c m'
                          '2'   'C 2/c 2/m 21/m'
                          '3'   'A m a m'
;        'three examples for space group No. 63'

How should this be interpreted?

ddl.dic: clarification on the 'Symop' data type

The 'Symop' data type was extensively discussed in #32. The following definition of the data type was accepted: a string composed of an integer followed by an underscore and 3 or more digits. However, the following examples are given in the save_site_symmetry save frame of the templ_attr.cif file:

loop_
_description_example.case
_description_example.detail
'4'           '4th symmetry operation applied'
'7_645'       '7th symm. posn.; +a on x; -b on y'
'.'           'no symmetry or translation to site'

The value '4' clearly does not fit the current formal type format (even if the '_555' part can be implied). Is the example wrong, or should the '_\d\d\d' part be treated as optional (in that case, the type definition should reflect that)?

Also, the example value '.' should be unquoted in order for it to be treated as a special CIF value.
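The two readings of the type can be written as regular expressions. The Python sketch below is only an illustration of the accepted wording ("an integer followed by an underscore and 3 or more digits") and of the variant in which the translation part is optional; the names `STRICT`, `LENIENT` and `matches` are hypothetical.

```python
import re

# Accepted wording: integer, underscore, three or more digits.
STRICT = re.compile(r"[0-9]+_[0-9]{3,}")
# Variant in which the '_ddd' translation part may be omitted.
LENIENT = re.compile(r"[0-9]+(?:_[0-9]{3,})?")


def matches(pattern, value):
    """True if the whole value matches the given pattern."""
    return pattern.fullmatch(value) is not None
```

The example value `7_645` matches both patterns, `4` matches only the lenient one, and `.` matches neither, which is consistent with treating it as an unquoted special CIF value rather than a Symop string.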

cif_core.dic: _atom_site.label is linked to itself

The _atom_site.label data item definition in the cif_core.dic dictionary states that this data item is linked to itself using the _name.linked_item_id data item (this is done indirectly by importing the _atom_site_label data block from the templ_attr.cif file). Is this done on purpose?

MS: use of modulated structure indices changes category keys

HKL is everywhere replaced by HKL[m1 m2 ...] in the draft. This means that all such categories must be moved to an extension dictionary; possibly the whole MS dictionary should become an extension dictionary.

cif_core.dic: potentially incorrect '_atom_site.refinement_flags_occupancy' enumerator values

The enumerator values of the _atom_site.refinement_flags_occupancy data item differ greatly between the DDL1 and DDLm versions of the dictionary:

DDL1 version:

loop_
_enumeration
_enumeration_detail
.     'no constraints on site-occupancy parameters'
P     'site-occupancy constraint'

DDLm version:

loop_
_enumeration_set.state
_enumeration_set.detail
.    'no constraints on atomic displacement parameters'          
T    'special-position constraints on atomic displacement parameters'     
U    'Uiso or Uij restraint (rigid bond)'    
TU   'both constraints applied'

I have a hunch that the values in the DDLm version got mixed up with the enumeration values of the _atom_site.refinement_flags_adp data item.

MS: Serious violations of our _audit.schema expansion model

In many places in the DDL1 modulated structures dictionary the standard h,k,l lists are expanded with extra components. Our _audit.schema system is predicated on such expansion being transformable to a series of datablocks, where each datablock corresponds to single values of each of the new key components. For example, multiple samples can be separated into multiple datablocks with a single sample.

However, such an expansion of the modulated structure dictionary will not succeed, because, in any of the expanded datablocks, F_calc, and any other calculated values for the relevant category, will not correspond to the F_calc that standard software would obtain based on the other values in the datablock. This is because the extra h,k,l components affect the F_calc calculation.

cif_core.dic: incorrect _exptl_crystal.colour usage example

The _description_example.case data item is defined as a list of three code elements, however, the definition contains the following example:

loop_
  _description_example.case
         '[transluscent, pale, green]'

Since the example should contain a list and not a string (as the Implied type suggests), it should probably be changed to:

loop_
  _description_example.case
         [transluscent pale green]

However, the _description_example.case data item is defined as having the Single type container, so a list is not allowed in this context. This is not the only example of this kind of issue. For example, the definition of the _atom_site_displace_special_func.sawtooth data item in the cif_ms.dic dictionary contains the following enumeration default:

_enumeration.default         [0.0  0.0  0.0]

However, the _enumeration.default data item is also defined as having the Single type container and as a result the list is not allowed in this context. Maybe an additional Implied container type could be introduced or the issue could eventually be corrected by the means discussed in #34?

On a related note, the enumerator value transluscent seems to be misspelt in this example, as well as in the definition of the _exptl_crystal_appearance.general data item.

pd_instr proposes looping some datanames for multi-detector instruments

See text in pd_instr category description.

cif_core.dic: enumerator list of the _exptl_crystal_appearance.general (aliased as _exptl_crystal.colour_lustre) data item differs between DDL1 and DDLm versions

When compared to the DDL1 version of the dictionary the DDLm one seems to be missing the following enumerator values of the _exptl_crystal.colour_lustre data item: dull, clear.

When compared to the DDLm version of the dictionary the DDL1 one seems to be missing the following enumerator values of the _exptl_crystal.colour_lustre data item: transluscent, lustrous, transparent, opaque, ..

If possible, I would propose merging the enumerator value sets from DDL1 and DDLm for compatibility purposes (it could be noted that the values dull and clear are synonyms of opaque and transparent respectively and should generally not be used). In COD revision 201200 the value dull is used 75 times and the value clear is used 2641 times.

In addition to that, the enumerator value transluscent seems to be misspelt.

Non-standard use of List(Code,Symop) for _type.contents

model_site.id has a _type.contents of List(Code,Symop). This is not one of the enumerated types given for _type.contents, and is not conveniently supported by DDLm constructs.

ddl.dic: proposal for the expansion of the 'Date' type definition

Currently, the Date data format is defined as "ISO date format yyyy-mm-dd". However, for some applications a more precise timestamp is needed. Would it be possible to extend this data format to the full (or at least wider) scope of formats defined in the ISO 8601 specification? I.e.:
2017-10-19
2017-10-19T16:09:58+00:00
2017-10-19T16:09:58Z

This change would leave the older values valid whilst providing a more diverse use of the data type. What do you think?
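As a rough illustration of backward compatibility, the following Python sketch accepts both the existing yyyy-mm-dd form and the timestamp forms listed above. The function name is hypothetical, and the explicit handling of the trailing `Z` is there because older Python versions do not accept it in `datetime.fromisoformat`.

```python
from datetime import date, datetime


def parse_date_or_timestamp(text):
    """Return a date for 'yyyy-mm-dd' and a datetime for full timestamps."""
    if "T" not in text:
        return date.fromisoformat(text)
    if text.endswith("Z"):
        # Rewrite the UTC designator as an explicit offset for portability.
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)
```

A value such as `2017-10-19` still parses to a plain date, and the two timestamp spellings above parse to the same instant.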

Data item '_type.source' in ddl.dic has a default enumeration value that is not among the provided enumeration values

Data item '_type.source' in ddl.dic describes the following values as the available enumeration options:
Recorded
Assigned
Related
Derived

However, the default enumeration value is declared as:
Selected
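
A check of this kind is easy to automate. The sketch below assumes the enumeration states and the default have already been extracted from the dictionary; the function name `default_is_listed` is made up for illustration, and the comparison ignores case because DDLm enumerator values are case-insensitive.

```python
def default_is_listed(states, default):
    """Return True if the default value is one of the enumeration states."""
    return default.lower() in {state.lower() for state in states}


# Values as quoted in this issue for _type.source
type_source_states = ["Recorded", "Assigned", "Related", "Derived"]
type_source_default = "Selected"

print(default_is_listed(type_source_states, type_source_default))  # False
```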

cif_pow_multiphase.dic: unrecognised data item '_name.linked_object_id'

The cif_pow_multiphase.dic dictionary contains an unrecognised data item '_name.linked_object_id'. Most likely, it is a misspelt version of the _name.linked_item_id data item.

cif_core.dic: the 'units.code' data item is never imported

The _units.code data item is referenced multiple times in cif_core.dic, but it is neither defined in nor explicitly imported into the dictionary. Based on the values, I assume that it is equivalent to the _units_code data item found in the templ_enum.cif file; however, that item does not list the _units.code data name as one of its aliases.

cif_core.cif2.dic (cif2-conversion): deprecated data items

I was wondering what the official IUCr policy is towards the retention of deprecated data items while converting cif_core.dic from DDL1 to DDLm. Currently, none of the data items that were deprecated in DDL1 are marked as such in DDLm (the _symmetry_* data items, for example). Furthermore, some of the deprecated data items seem to be described better than their alternatives, e.g. the _symmetry.cell_setting data item description contains the _method.* loop whereas its newer alternative _space_group.crystal_system does not.

The _dictionary.uri value in multiple dictionaries is not a proper URI

According to RFC 3986 (http://www.rfc-base.org/txt/rfc-3986.txt) a URI must begin with the scheme name, so www.iucr.org/cif/dic/ddl.dic should be changed to one of the following: http://www.iucr.org/cif/dic/ddl.dic, http://iucr.org/cif/dic/ddl.dic, ftp://iucr.org/cif/dic/ddl.dic, etc.
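
For illustration, the missing scheme can be detected with the Python standard library. This is only a sketch of one possible check; the helper name `has_scheme` is hypothetical and it tests nothing beyond the presence of a scheme component.

```python
from urllib.parse import urlsplit


def has_scheme(uri: str) -> bool:
    """Return True if the string starts with a URI scheme such as 'http:'."""
    return bool(urlsplit(uri).scheme)


print(has_scheme("www.iucr.org/cif/dic/ddl.dic"))         # False
print(has_scheme("http://www.iucr.org/cif/dic/ddl.dic"))  # True
```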

cif_core.dic (cif2conversion): enumeration ranges of *_su data items

Multiple data items in the cif_core.dic and templ_attr that are meant to store standard uncertainties of measurand data items have incorrect (or missing) enumeration range definitions. I assume that the SU values are meant to always be greater than 0, but some *_su data items have enumeration values lower than 0. Also, some *_su data items are marked as being of the Real type (-inf;+inf), but do not have an enumerator range specifying that the negative values are not allowed. The issues:

templ_attr:

Data items with explicit negative range values:
_Cartn_coord_su [-1000.:1000.]
_fract_coord_su [-1.:1.]

Data items that are marked as Real with no enumeration range:
save_aniso_BIJ_su
_aniso_UIJ_su

cif_core.dic:

Data items with explicit negative range values:
_geom_hbond.angle_DHA_su _enumeration.range [-180.:180.]

Data items that are marked as Real with no enumeration range:
_refine_diff.density_max_su
_refine_diff.density_min_su
_refine_diff.density_rms_su
_refine_ls.abs_structure_Flack_su
_refine_ls.abs_structure_Rogers_su
_refln.F_meas_su
_refln.F_squared_meas_su
_geom_angle.value_su
_geom_torsion.angle_su
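
The two kinds of problem listed above could be found mechanically. The sketch below assumes that the _enumeration.range values have already been read into a mapping from data name to range string (or None when no range is given); the function names are illustrative, and only a few of the items listed above are included as sample input.

```python
def lower_bound(range_text):
    """Return the lower limit of a 'min:max' range, or None if it is open."""
    low, _, _high = range_text.strip("[]").partition(":")
    return float(low) if low else None


def allows_negative(range_text):
    """True if the range is missing, open at the bottom, or starts below zero."""
    if range_text is None:
        return True
    low = lower_bound(range_text)
    return low is None or low < 0


# Sample input taken from the lists in this issue.
su_ranges = {
    "_Cartn_coord_su": "[-1000.:1000.]",
    "_fract_coord_su": "[-1.:1.]",
    "_geom_hbond.angle_DHA_su": "[-180.:180.]",
    "_refln.F_meas_su": None,  # Real type, no enumeration range
}

flagged = sorted(name for name, rng in su_ranges.items() if allows_negative(rng))
print(flagged)
```

A range such as 0.0: (lower limit only) would pass this check, which is presumably what the *_su items should declare.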

If multiple detectors are used, does each measured point still get a unique number? I hope so

I.e., is the following incorrect? The 10.1 degree measurement is measured by two detectors, but both measurements are labelled with the same 'point_id'.

loop_
_pd_meas.point_id
_pd_meas.2theta_scan
_pd_meas.detector_id
_pd_meas.counts_total
1 10.0 1 554
2 10.1 1 652
2 10.1 2 641
3 10.2 2 712
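
To make the question concrete, the sketch below checks the loop above in two ways: whether point_id alone is unique, and whether the pair (point_id, detector_id) is unique. Which of the two the dictionary intends is exactly what is being asked here, so neither check should be read as the official rule; the row layout is simply copied from the example.

```python
from collections import Counter

# (point_id, 2theta_scan, detector_id, counts_total), copied from the loop above
rows = [
    ("1", 10.0, "1", 554),
    ("2", 10.1, "1", 652),
    ("2", 10.1, "2", 641),
    ("3", 10.2, "2", 712),
]

# Check 1: is point_id unique on its own?
id_counts = Counter(point_id for point_id, *_ in rows)
duplicate_ids = [pid for pid, n in id_counts.items() if n > 1]

# Check 2: is the combination of point_id and detector_id unique?
pair_counts = Counter((row[0], row[2]) for row in rows)
duplicate_pairs = [pair for pair, n in pair_counts.items() if n > 1]

print("repeated point_id values:", duplicate_ids)
print("repeated (point_id, detector_id) pairs:", duplicate_pairs)
```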

DDLm: enumerator sets with no default values

It seems that the DDLm does not strictly impose the use of _enumeration.default data item when the ENUMERATION_SET category data items are used, however, in most cases they are given. I was wondering if the omission of the default values in the descriptions of the save_dictionary_valid.scope and save_import_details.single_index data items is intentional.

ddl.dic: the _type.contents of the _definition.id data item

The _definition.id data item is declared as being of type Tag and as a result is required to start with an underscore symbol ('_'). However, whenever it is used in the Category context, this constraint is violated (i. e. ENUMERATION_SET, LOOP, UNITS).

One of the solutions would be to change the type of the _definition.id to name. Of course, this would restrict the set of allowed characters back to the alphanumeric one, but that would probably be for the best. Dealing with case-insensitive Unicode strings invites a lot of problems and also requires knowledge of the locale that the individual data name conforms to. Consider the following example:

There are two dictionaries E and T. The E dictionary is written in English and T dictionary is written in Turkish. The dictionaries have the following data items:
E: _data.i (contains dotted i)
T: _data.ı (contains dotless i)

Both data items fit well in the same data block when they are in lower case:

data_lower_case
_data.i 'English value'
_data.ı 'Turkish value'

However, in upper case the data names become identical:

data_upper_case
_DATA.I 'English value'
_DATA.I 'Turkish value'

In addition to that, if we have the data name in upper case, we simply cannot lower-case it unambiguously without knowledge of the locale (in Turkish the 'I' character lower-cases to 'ı', whereas in English it lower-cases to 'i'). The German 'ß' character, which upper-cases to 'SS', is another such example.

The same ambiguity is true not only for the data names, but also for the enumerator values, that have been made case-insensitive in the DDLm.
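
The collision can be reproduced with Python's built-in string methods, which apply the locale-independent default Unicode case mappings. This is only a demonstration of the ambiguity using the example names from above, not a proposal for how CIF software should compare data names.

```python
dotted = "_data.i"        # ASCII 'i'
dotless = "_data.\u0131"  # U+0131 LATIN SMALL LETTER DOTLESS I

# Distinct in lower case ...
print(dotted == dotless)                  # False
# ... but identical once upper-cased.
print(dotted.upper() == dotless.upper())  # True

# Without locale knowledge, lower-casing the upper-case name
# can only ever give back the dotted form.
print("_DATA.I".lower() == dotted)        # True

# Upper-casing may also change the length of a string.
print("\u00df".upper())                   # SS
```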

Has this issue been discussed before?