plotMARC

A command line tool to visually characterise a bibliographic collection in terms of publication dates and available bibliographic identifier coverage.

Basic usage

From a directory containing one or more binary MARC 21 bibliographic data files (with the extension .mrc) that together represent a bibliographic collection, run:

./plotMARC.py

The script will process the MARC files using pymarc, and produce a single <directoryname>.png image containing a 3-way Venn diagram displaying the number of records with the following bibliographic identifiers:

  • ISBN
  • LCCN
  • OCN (OCLC number)

and a histogram showing the publication dates in the bib records.
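
The counting behind a 3-way Venn diagram and a decade histogram can be sketched roughly as below. This is an illustration only, not the script's actual code: the bit-flag assignment, the function names, and the tuple-based record format are all assumptions, and extracting the identifiers and dates from MARC records with pymarc is left out.

```python
from collections import Counter

# Bit flags for each identifier (assumed assignment, for illustration only).
ISBN, LCCN, OCN = 1, 2, 4


def venn_region(has_isbn, has_lccn, has_ocn):
    """Return an index from 0 to 7: 0 means no identifiers, 7 means all three."""
    return (
        (ISBN if has_isbn else 0)
        | (LCCN if has_lccn else 0)
        | (OCN if has_ocn else 0)
    )


def decade(year):
    """Floor a publication year to the start of its decade."""
    return year - year % 10


def summarise(records):
    """records: iterable of (has_isbn, has_lccn, has_ocn, year_or_None) tuples."""
    regions = [0] * 8
    decades = Counter()
    for has_isbn, has_lccn, has_ocn, year in records:
        regions[venn_region(has_isbn, has_lccn, has_ocn)] += 1
        if year is not None:
            decades[decade(year)] += 1
    return regions, decades


# Hypothetical sample data standing in for parsed MARC records.
sample = [
    (True, True, False, 1987),
    (True, False, False, 1994),
    (False, False, False, None),
    (True, True, True, 1981),
]
regions, decades = summarise(sample)
print(regions)
print(sorted(decades.items()))
```

Placing every record in exactly one of the eight regions means the region counts always sum to the total number of records.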

Example output

Example plotMARC output plot

Summary for Test Collection:
Record counts for bibliographic identifiers present in this collection:
Total:  248632	100.00%
ISBN:   175593	 70.62%
LCCN:   130780	 52.60%
OCN :    82801	 33.30%
No Id:   42263	 17.00%

Install the Python dependencies listed in requirements.txt using pip:

 pip install -r requirements.txt

License

Copyright © 2022 Charles Horn.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

plotmarc's Issues

add an option to suppress pymarc warnings

Suggestion:
Add a -q option to hide warnings from the pymarc Reader.

Seeing the warnings once is sometimes helpful, but once you are satisfied that the data is good enough, the constant stream of warnings is distracting.
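
One possible shape for such an option is sketched below. The long form `--quiet` and the helper names are hypothetical, and the sketch assumes the messages reach the user through Python's `warnings` or `logging` modules. The issue does not say how pymarc emits them; if they are written straight to stderr, a different approach would be needed.

```python
import argparse
import logging
import warnings


def build_parser():
    """Build a parser with a -q flag (the --quiet long form is an assumption)."""
    parser = argparse.ArgumentParser(description="Plot a MARC collection summary.")
    parser.add_argument(
        "-q",
        "--quiet",
        action="store_true",
        help="hide warnings raised while reading MARC records",
    )
    return parser


def apply_quiet(quiet):
    """Silence warnings on request, assuming they use `warnings` or `logging`."""
    if quiet:
        warnings.simplefilter("ignore")
        logging.getLogger().setLevel(logging.ERROR)


# An explicit argument list is used here so the example does not read sys.argv.
args = build_parser().parse_args(["-q"])
apply_quiet(args.quiet)
```

Keeping the default noisy matches the observation that the warnings are useful the first time a collection is processed.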

Provide a summary of material types within a collection

Types of interest:

  • Monographs
  • Serials
  • Maps
  • Sheet Music
  • Other

(possibly AV music / video?)

To get this breakdown, it is likely that more than one field/subfield will need to be checked. For the purposes of stats calculation, it makes sense to place each record into only one of the above categories.

MVP: text only summary output. Think about how to graphically display this (if at all) later -- depending on how useful it is.
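
A minimal sketch of a one-category-per-record classification follows, using only the MARC 21 leader: position 06 (type of record) and position 07 (bibliographic level). The category rules and the function name are assumptions for illustration; a real implementation may need to check further fields, as noted above. With pymarc, the leader string would presumably come from `str(record.leader)`.

```python
from collections import Counter


def material_type(leader):
    """Map a MARC 21 leader string to exactly one summary category."""
    record_type = leader[6]  # Leader/06: type of record
    bib_level = leader[7]    # Leader/07: bibliographic level
    if record_type in {"e", "f"}:  # cartographic material, printed or manuscript
        return "Maps"
    if record_type in {"c", "d"}:  # notated music, printed or manuscript
        return "Sheet Music"
    if record_type in {"a", "t"}:  # language material, printed or manuscript
        if bib_level == "s":
            return "Serials"
        if bib_level == "m":
            return "Monographs"
    return "Other"


# Hypothetical leaders; only positions 06 and 07 matter for this sketch.
leaders = [
    "00000nam a2200000 a 4500",  # language material, monograph
    "00000nas a2200000 a 4500",  # language material, serial
    "00000nem a2200000 a 4500",  # cartographic material
    "00000ncm a2200000 a 4500",  # notated music
    "00000njm a2200000 a 4500",  # musical sound recording
]
summary = Counter(material_type(leader) for leader in leaders)
for category, count in summary.items():
    print(f"{category}:\t{count}")
```

Because the checks run in a fixed order and fall through to "Other", every record lands in exactly one category, which keeps the totals consistent with the record count.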

accept (an optional) filename from CLI to process one specific MARC file

default to *.mrc if no file specified

Purpose:

Large multi-file MARC collections may have different date ranges or materials located in different files. It will often be helpful to know which files have the most recent material, etc.

Extension: for multi-file collections, process each file individually and also combine the results, reading each file only once.
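
The optional filename and the process-once extension could look something like the sketch below. The argument name `marcfile`, the helper names, and the `count_decades` callable are hypothetical; the per-file counting itself is stubbed out with fake data.

```python
import argparse
import glob
import os
from collections import Counter


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "marcfile",
        nargs="?",
        help="a single MARC file to process; defaults to all *.mrc files",
    )
    return parser


def select_files(marcfile=None, directory="."):
    """Return the named file, or every *.mrc file in the directory if none is given."""
    if marcfile:
        return [marcfile]
    return sorted(glob.glob(os.path.join(directory, "*.mrc")))


def summarise_collection(files, count_decades):
    """Call count_decades (filename -> Counter) once per file and merge the results."""
    per_file = {}
    combined = Counter()
    for name in files:
        per_file[name] = count_decades(name)  # each file is read exactly once
        combined.update(per_file[name])       # reuse the per-file result
    return per_file, combined


# Stub data standing in for real per-file counts.
fake = {
    "a.mrc": Counter({1990: 2}),
    "b.mrc": Counter({1990: 1, 2000: 3}),
}
args = build_parser().parse_args([])  # explicit list: no filename given
per_file, combined = summarise_collection(sorted(fake), fake.__getitem__)
print(per_file)
print(combined)
```

Merging the per-file counters, rather than re-reading the files for a combined pass, is what keeps each file to a single read.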

Text based output format for bibid categories and date counts

Provide a text based output for exporting data.

Current code outputs:

Dates: [0, 1400, 1760, 1770, 1780, 1790, 1800, 1810, 1820, 1830, 1840, 1850, 1860, 1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010] [34636, 0, 1, 1, 2, 5, 18, 22, 95, 146, 188, 325, 341, 462, 596, 828, 875, 956, 1294, 1666, 2456, 5346, 12925, 30085, 45280, 76277, 33109, 697]
No dates: 34636
No IDs: 294
Venn Categories: [294, 893, 351, 3362, 55381, 61284, 17013, 110054]

Needs something more usable.

  • TSV for dates, for copy-paste into a spreadsheet?
  • Venn categories; a format which can be passed directly back into matplotlib_venn or other Venn diagramming tools?
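
Both ideas can be sketched with the standard library alone. The function names and the column headers are assumptions. The Venn part also assumes that the eight values are the no-ID count followed by the seven regions in the order that matplotlib_venn's `venn3` accepts for its `subsets` argument (Abc, aBc, ABc, abC, AbC, aBC, ABC); the issue does not confirm that ordering.

```python
import csv
import io


def dates_to_tsv(bins, counts):
    """Return date bins and record counts as tab-separated text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["decade", "records"])
    writer.writerows(zip(bins, counts))
    return out.getvalue()


def venn3_subsets(categories):
    """Drop the leading no-ID count, leaving seven region sizes (assumed order)."""
    return tuple(categories[1:])


# The last four date bins and counts from the output quoted above.
bins = [1980, 1990, 2000, 2010]
counts = [45280, 76277, 33109, 697]
print(dates_to_tsv(bins, counts), end="")

# The Venn categories from the output quoted above.
categories = [294, 893, 351, 3362, 55381, 61284, 17013, 110054]
print(venn3_subsets(categories))
```

The TSV text can be pasted directly into a spreadsheet, and the seven-element tuple could then be passed as `venn3(subsets=...)` if the ordering assumption holds.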

Improvements

  • OCLCn -> OCN
  • label y-axis: 'Records'
  • chart title: Publication Dates
  • chart title: Record BibIds
  • Overall title - name of partner (directory - title cased)
  • Display No-ID portion (Moved to separate issue: #3)
  • EARLY to be clearer 1400-1700 >1400
