loc_mods's Introduction

Library of Congress MODS in Ruby

Purpose

This is a class-oriented Ruby library for parsing Library of Congress (LOC) MODS data.

The gem is developed against the MODS 3.7 XSD.

Usage

Ruby API

require 'loc_mods'

# Single record under `<modsCollection>`
LocMods::Collection.from_xml(File.read("spec/fixtures/record_1.xml"))

# Full NIST Tech Pubs records
# https://github.com/usnistgov/NIST-Tech-Pubs/tree/nist-pages/xml
LocMods::Collection.from_xml(File.read("reference/allrecords-MODS.xml"))

Command line interface

LocMods provides a command-line interface (CLI) for various operations.

The main executable is loc-mods.

Commands:
  loc-mods detect-duplicates PATH...  # Detect duplicate records in MODS XML files or directories
  loc-mods help [COMMAND]             # Describe available commands or one specific command

Detect duplicates

The detect-duplicates command finds duplicate MODS records by using each record's DOI (Digital Object Identifier) as its "primary ID".

Note
The library assumes that every record has a DOI. If that is not the case, another way of setting the primary key needs to be defined.

Usage:
  loc-mods detect-duplicates PATH...

Options:
  [--show-unchanged], [--no-show-unchanged] # Show unchanged attributes in the diff output
                                            # Default: false
  [--highlight-diff], [--no-highlight-diff] # Highlight only the differences
                                            # Default: false
  [--color=COLOR]                           # Use colors in the diff output (auto, on, off)
                                            # Default: auto
                                            # Possible values: auto, on, off
$ loc-mods detect-duplicates [OPTIONS] <file_or_directory_path>

Options:

--show-unchanged

(default: false) Show attributes of both objects even when they were not changed.

--highlight-diff

(default: false) Highlight values only when they differ between two records.

--color=COLOR

(default: auto) Use colors in the diff output. Values:

auto

the CLI will detect whether the terminal supports colors and display with colors if it does.

on

the CLI will always display with colors.

off

the CLI will never display with colors.
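
The three `--color` values could be resolved along the following lines. This is a minimal sketch, not the gem's actual implementation: the helper names `use_color?` and `colorize` are hypothetical, and the sketch assumes that "auto" means "the output stream is a terminal", which is a common convention but is not confirmed by this README.

```ruby
# Hypothetical helper: decide whether to emit ANSI colors for a --color mode.
# Assumption: "auto" enables color only when the stream is a terminal (TTY).
def use_color?(mode, io = $stdout)
  case mode.to_s
  when "on"   then true
  when "off"  then false
  when "auto" then io.respond_to?(:tty?) && io.tty?
  else raise ArgumentError, "unknown color mode: #{mode.inspect}"
  end
end

# Wrap text in an ANSI color escape sequence only when colors are enabled.
def colorize(text, code, enabled)
  enabled ? "\e[#{code}m#{text}\e[0m" : text
end

# Prints the word in red (ANSI code 31) only when colors are enabled.
puts colorize("Record 1", 31, use_color?("auto"))
```

With this approach, piping the output to a file or another program would disable colors in "auto" mode, while "on" and "off" override the detection.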

Example:

$ loc-mods detect-duplicates /path/to/mods/files

This command will:

  1. Search for MODS XML files in the specified directory (and subdirectories if -r is used).

  2. Parse each MODS file and extract the DOI.

  3. Group records with the same DOI.

  4. For each group of duplicates:

    1. Display the shared DOI.

    2. List the filenames of the duplicate records.

    3. Show a detailed comparison of the differences between the records.

The output will highlight differences, removed elements, and missing elements between the duplicate records, helping you identify discrepancies in the metadata.
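
The grouping steps above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the gem's code: records are modeled as simple hashes with hypothetical `:file` and `:doi` keys and placeholder file names, whereas the real CLI parses MODS XML files.

```ruby
# Hypothetical stand-ins for parsed MODS records; the real CLI would
# obtain the DOI from each parsed XML record instead.
records = [
  { file: "a.xml", doi: "https://doi.org/10.6028/NIST.IR.6659" },
  { file: "b.xml", doi: "https://doi.org/10.6028/NIST.IR.6659" },
  { file: "c.xml", doi: "https://doi.org/10.6028/NIST.SP.1264" },
]

# Group records by DOI and keep only the groups with more than one member.
def duplicate_groups(records)
  records.group_by { |r| r[:doi] }.select { |_doi, group| group.size > 1 }
end

duplicate_groups(records).each_with_index do |(doi, group), index|
  puts "Duplicate set ##{index + 1} found for URL: #{doi}"
  # Compare every pair of records within the group.
  group.combination(2).each_with_index do |(first, second), n|
    puts "  Comparison #{n + 1}:"
    puts "  File 1: #{first[:file]}"
    puts "  File 2: #{second[:file]}"
  end
end
```

Using `combination(2)` means a group of three duplicates yields three pairwise comparisons; whether the gem compares all pairs or only against the first record is not stated here.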

Testing

bin/update-nist-mods

License

Copyright Ribose.

loc_mods's People

Contributors

ronaldtse, camobap

loc_mods's Issues

Fix `loc-mods detect-duplicates` command to show elements missing in one array

$ bundle exec exe/loc-mods detect-duplicates spec/fixtures/
...
Duplicate set #2 found for URL: https://doi.org/10.6028/NIST.IR.6659
  Comparison 1:
  File 1: spec/fixtures/allrecords-MODS-991000009289708106.xml
  File 2: spec/fixtures/allrecords-MODS-991000179879708106.xml
  ----
  identifier[1]:
    Record 1: "994303379"
    Record 2: (nil)

  identifier[1]:
    Record 1: "oclc"
    Record 2: (nil)

  identifier._array_size_difference:
    Record 1: 2
    Record 2: 1

  note[1]:
    Record 1: "July 1, 2010."
    Record 2: "2010."

  note[2]:
    Record 1: "Title from PDF title page (viewed June 5, 2017)."
    Record 2: "Title from PDF title page."
...

This output indicates that File/Record 1 has 2 "identifier" elements, but File/Record 2 has only 1:

  identifier._array_size_difference:
    Record 1: 2
    Record 2: 1

The content of the missing identifier[1] (the second element in the array) is shown in the diff:

  identifier[1]:
    Record 1: "994303379"
    Record 2: (nil)

  identifier[1]:
    Record 1: "oclc"
    Record 2: (nil)

However, in the reverse situation, where File/Record 2 has additional elements that are not in File/Record 1, those extra elements are not displayed.

Duplicate set #416 found for URL: https://doi.org/10.6028/NIST.SP.1264
  Comparison 1:
  File 1: spec/fixtures/allrecords-MODS-991000626362808106.xml
  File 2: spec/fixtures/allrecords-MODS-991000626387008106.xml
  ----
  abstract._array_size_difference:
    Record 1: 0
    Record 2: 1

    ...

  record_info[0].record_change_date[0]:
    Record 1: "20240401111510.0"
    Record 2: "20240401111509.0"

  record_info[0].record_identifier[0]:
    Record 1: "991000626362808106"
    Record 2: "991000626387008106"

  subject._array_size_difference:
    Record 1: 0
    Record 2: 2

  ----

Notice that the contents of the abstract and subject elements are not displayed at all.

This task is to make these cases clear to the user and to display the corresponding removed or added content:

  1. When an element in the File/Record 1 array is absent from File/Record 2 (we consider this "removed")
  2. When an element absent from the File/Record 1 array is present in File/Record 2 (we consider this "added")

The fix will likely involve comparable_mapper.rb and cli.rb.
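
One way to report both directions is to walk the longer of the two arrays and label each index that exists on only one side. The sketch below is hypothetical and independent of comparable_mapper.rb: the method name `diff_arrays`, the `:removed`/`:added`/`:changed` labels, and the sample values are assumptions made for illustration.

```ruby
# Hypothetical sketch: compare two arrays index by index and label
# elements present only in the first as :removed, only in the second
# as :added, and differing values at a shared index as :changed.
def diff_arrays(name, first, second)
  diffs = []
  [first.size, second.size].max.times do |i|
    path = "#{name}[#{i}]"
    if i >= second.size
      diffs << { path: path, kind: :removed, record1: first[i], record2: nil }
    elsif i >= first.size
      diffs << { path: path, kind: :added, record1: nil, record2: second[i] }
    elsif first[i] != second[i]
      diffs << { path: path, kind: :changed, record1: first[i], record2: second[i] }
    end
  end
  diffs
end

# Record 1 has no subjects and Record 2 has two placeholder subjects,
# so both entries are reported as :added.
diff_arrays("subject", [], ["subject A", "subject B"]).each do |d|
  puts "#{d[:path]} (#{d[:kind]}): #{d[:record1].inspect} -> #{d[:record2].inspect}"
end
```

Because the loop runs up to the larger of the two sizes, extra elements are reported regardless of which record holds them.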

Fix `loc-mods detect-duplicates` command to show correct path in the case of Array diffs

$ bundle exec exe/loc-mods detect-duplicates spec/fixtures/
...
Duplicate set #2 found for URL: https://doi.org/10.6028/NIST.IR.6659
  Comparison 1:
  File 1: spec/fixtures/allrecords-MODS-991000009289708106.xml
  File 2: spec/fixtures/allrecords-MODS-991000179879708106.xml
  ----
  identifier[1]:
    Record 1: "994303379"
    Record 2: (nil)

  identifier[1]:
    Record 1: "oclc"
    Record 2: (nil)
...

This content indicates that File/Record 1 has an additional <identifier> element not in File/Record 2.

The source XML is this:
spec/fixtures/allrecords-MODS-991000009289708106.xml:

      <identifier type="oclc">671253037</identifier>
      <identifier type="oclc">994303379</identifier>

The source code is:

module LocMods
  class Identifier < BaseMapper
    attribute :content, Shale::Type::String
    attribute :display_label, Shale::Type::String
    attribute :type, Shale::Type::String
    attribute :type_uri, Shale::Type::Value
    attribute :invalid, Shale::Type::Value
    attribute :alt_rep_group, Shale::Type::String

    xml do
      root "nameIdentifier" # this element name is overridden in `record.rb`
      namespace "http://www.loc.gov/mods/v3", nil

      map_content to: :content
      map_attribute "displayLabel", to: :display_label
      map_attribute "type", to: :type
      map_attribute "typeURI", to: :type_uri
      map_attribute "invalid", to: :invalid
      map_attribute "altRepGroup", to: :alt_rep_group
    end
  end
end

Notice that the diff path of identifier[1] is duplicated:

  identifier[1]:
    Record 1: "994303379"
    Record 2: (nil)

  identifier[1]:
    Record 1: "oclc"
    Record 2: (nil)

If you look at the object definition, it should actually be:

  identifier[1].content:
    Record 1: "994303379"
    Record 2: (nil)

  identifier[1].type:
    Record 1: "oclc"
    Record 2: (nil)

This task is to make the diff path include the attribute name. The fix will likely involve comparable_mapper.rb and cli.rb.
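
A path-aware diff could descend into each array element and append the attribute name, so that every leaf value gets a distinct path. The sketch below models records as nested hashes and arrays rather than Shale mappers. `diff_paths` is a hypothetical name, not part of the gem, and the content of Record 2 is assumed to hold only the first identifier.

```ruby
# Hypothetical sketch: recursively compare two nested structures and
# return { path => [value1, value2] } for every differing leaf value.
def diff_paths(first, second, path = "")
  if first.is_a?(Hash) || second.is_a?(Hash)
    a = first.is_a?(Hash) ? first : {}
    b = second.is_a?(Hash) ? second : {}
    (a.keys | b.keys).reduce({}) do |out, key|
      child = path.empty? ? key.to_s : "#{path}.#{key}"
      out.merge(diff_paths(a[key], b[key], child))
    end
  elsif first.is_a?(Array) || second.is_a?(Array)
    a = Array(first)
    b = Array(second)
    (0...[a.size, b.size].max).reduce({}) do |out, i|
      out.merge(diff_paths(a[i], b[i], "#{path}[#{i}]"))
    end
  elsif first == second
    {}
  else
    { path => [first, second] }
  end
end

record1 = { identifier: [{ content: "671253037", type: "oclc" },
                         { content: "994303379", type: "oclc" }] }
# Assumption: Record 2 carries only the first identifier.
record2 = { identifier: [{ content: "671253037", type: "oclc" }] }

diff_paths(record1, record2).each do |path, (v1, v2)|
  puts "#{path}:"
  puts "  Record 1: #{v1.nil? ? '(nil)' : v1.inspect}"
  puts "  Record 2: #{v2.nil? ? '(nil)' : v2.inspect}"
end
```

Since the path is extended at every level of the recursion, the two differing values land under `identifier[1].content` and `identifier[1].type` instead of sharing `identifier[1]`.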

Use MARC 21 code lists when necessary

The codes have been added to:

e.g.

require 'loc-marc'

LocMarc::Codes::Relator.lookup('wam')
=> {:code=>"wam", :description=>"Writer of accompanying material", :deprecated=>false}

LocMarc::Codes::Language.lookup('en')
=> {:code=>"en", :description=>"Europe, Northern", :deprecated=>false}

LocMarc::Codes::Country.lookup('us')
=> {:code=>"us", :description=>"United States", :deprecated=>true}

LocMarc::Codes::Country.lookup('us')
=> {:code=>"xxu", :description=>"United States", :deprecated=>false}

LocMarc::Codes::GeographicArea.lookup('a-cc-hk')
=> {:code=>"a-cc-hk", :description=>"Hong Kong (China)", :deprecated=>false}
