prange-metadata's Introduction

Prange Metadata Harvesting and Manipulation Program

Description

A program to process and validate metadata spreadsheets, pulling additional data from MARC records using pymarc.

Data Paths

~/Box\ Sync/PrangeMetadataStuff/CSV-data-conversion/csv/
~/Box\ Sync/PrangeMetadataStuff/CSV-data-conversion/excel/
~/Box\ Sync/PrangeMetadataStuff/CSV-data-conversion/marc/
~/Box\ Sync/PrangeMetadataStuff/CSV-data-conversion/tsv/

Pseudocode

Read Spreadsheet Data.
Load MARC file into array using pymarc.
Check Spreadsheet Header Rows Against One Another.
Search for matching MARC records.
Report on possible matches.
Pull data over from MARC to main array.
Output main array into single CSV file for ingest into Fedora.
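
As a rough illustration of the spreadsheet side of these steps, the sketch below reads CSV text, compares header rows across sheets, and writes the combined rows to a single CSV. It is only a minimal sketch: the function names (`read_rows`, `headers_differing`, `write_combined`) and the column names `call_no` and `title` are hypothetical, and MARC loading with pymarc is left out so the example runs with the standard library alone.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text and return (header list, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames, rows


def headers_differing(headers_by_sheet):
    """Return the names of sheets whose header row differs from the first sheet's."""
    names = list(headers_by_sheet)
    reference = headers_by_sheet[names[0]]
    return [name for name in names[1:] if headers_by_sheet[name] != reference]


def write_combined(rows, fieldnames):
    """Serialize all rows into one CSV string with a single header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data; real input would come from the csv/ directory.
sheet_a = "call_no,title\nA1,Foo\n"
sheet_b = "call_no,title\nB2,Bar\n"
header_a, rows_a = read_rows(sheet_a)
header_b, rows_b = read_rows(sheet_b)
mismatched = headers_differing({"sheet_a": header_a, "sheet_b": header_b})
combined_csv = write_combined(rows_a + rows_b, header_a)
```

Comparing every sheet against the first one keeps the check simple; a stricter version might report exactly which columns differ.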

Data Wrangling Algorithm

Remove brackets from author names (were used to indicate supplied names, but not needed for Digital Collections).
Separate the term for "editor" (編, 編纂, 編集, 編輯) from the name of the editor; likewise, remove the term for "author" (著) from the author column.
Separate page count info from dimensions; create sum of page counts where multiple page counts have been listed.
Remove Japanese dates from publication date field.
Remove the Y abbreviation for Yen.
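
The cleanup rules above could be sketched with a few small string helpers. Everything here is an assumption made for illustration: the function names are hypothetical, role terms are assumed to trail the name, the extent is assumed to look like `24, 36 p. ; 21 cm`, Japanese dates are assumed to be era dates such as `(昭和21)`, and the yen marker is assumed to sit at the start or end of the price.

```python
import re

# Longer editor terms come first so that a bare 編 is only tried last.
EDITOR_TERMS = ("編纂", "編集", "編輯", "編")
AUTHOR_TERM = "著"

# Assumed era-date shape: optional bracket, era name, year number, optional 年.
ERA_DATE = re.compile(r"[((\[]?\s*(?:明治|大正|昭和)\s*\d+\s*年?\s*[))\]]?")


def strip_brackets(name):
    """Remove the square brackets that marked supplied names."""
    return re.sub(r"[\[\]]", "", name).strip()


def split_role(name):
    """Return (name, role); role is a trailing editor/author term or ''."""
    name = name.strip()
    for term in EDITOR_TERMS + (AUTHOR_TERM,):
        if name.endswith(term):
            return name[: -len(term)].strip(), term
    return name, ""


def split_extent(extent):
    """Split 'pages ; dimensions' and sum every page count that was listed."""
    pages_part, _, dimensions = extent.partition(";")
    counts = [int(n) for n in re.findall(r"\d+", pages_part)]
    return sum(counts), dimensions.strip()


def strip_era_date(date):
    """Drop an era-style date, keeping whatever else is in the field."""
    return ERA_DATE.sub("", date).strip()


def strip_yen(price):
    """Remove a leading or trailing 'Y' used as the yen abbreviation."""
    return re.sub(r"^\s*Y\s*|\s*Y\s*$", "", price)
```

For example, `split_role("山田太郎編集")` separates the name from the term 編集, and `split_extent("24, 36 p. ; 21 cm")` totals the two page counts while keeping `21 cm` as the dimensions. Real spreadsheet values may need looser patterns than these.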

Subject Terms Matching

Compare call no. from spreadsheet against field 852h (where multiple call nos. in Aleph, check each one); if only one match is found, trust the match.
If no match found, try matching 852i or combined 852h + i.
If no match found, try matching after removing volume info from call no.
If no match found, try matching on Author/Title.
If multiple matches found, flag record for follow up.
Output a report of all matches by each of the various methods, as well as a list of unmatched records from the spreadsheets.
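
The matching cascade above might look roughly like the following sketch. It assumes each MARC record has already been reduced to a plain dict holding its 852 `(h, i)` pairs (with pymarc these could be read via `record.get_fields("852")` and `field.get_subfields("h")`), that 852h and 852i are joined with a space, and that volume info looks like a trailing `v.2`. The dict keys, function names, and step labels are all hypothetical.

```python
import re


def strip_volume(call_no):
    """Drop trailing volume info such as 'v.2' (the pattern is an assumption)."""
    return re.sub(r"\s*v\.?\s*\d+\s*$", "", call_no, flags=re.IGNORECASE).strip()


def by_852h(row, rec):
    return any(row["call_no"] == h for h, _ in rec["852"])


def by_852i_or_combined(row, rec):
    call_no = row["call_no"]
    return any(
        (i and call_no == i) or call_no == f"{h} {i}".strip()
        for h, i in rec["852"]
    )


def by_call_no_without_volume(row, rec):
    return any(strip_volume(row["call_no"]) == h for h, _ in rec["852"])


def by_author_title(row, rec):
    return (
        bool(row.get("title"))
        and row.get("author") == rec.get("author")
        and row.get("title") == rec.get("title")
    )


# Steps are tried in order; the first step with any hit decides the outcome.
STEPS = [
    ("852h", by_852h),
    ("852i or 852h+i", by_852i_or_combined),
    ("call no. without volume", by_call_no_without_volume),
    ("author/title", by_author_title),
]


def match_row(row, records):
    """Return (method, record ids, status) for one spreadsheet row."""
    for method, test in STEPS:
        hits = [rec["id"] for rec in records if test(row, rec)]
        if len(hits) == 1:
            return method, hits, "matched"   # a single hit is trusted
        if len(hits) > 1:
            return method, hits, "flagged"   # needs manual follow-up
    return None, [], "unmatched"


def build_report(rows, records):
    """Group spreadsheet call numbers by outcome and matching method."""
    report = {"matched": {}, "flagged": {}, "unmatched": []}
    for row in rows:
        method, ids, status = match_row(row, records)
        if status == "unmatched":
            report["unmatched"].append(row["call_no"])
        else:
            report[status].setdefault(method, []).append((row["call_no"], ids))
    return report
```

Keeping the steps in an ordered list means the report can record which method produced each match, and a new fallback can be added without touching `match_row`.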
