nypl / catalog_of_copyright_entries_project

NYPL Project to transcribe and parse pages from the US Catalog of Copyright Entries

License: Creative Commons Zero v1.0 Universal

Python 100.00%

catalog_of_copyright_entries_project's Introduction

Catalog of Copyright Entries Project

The New York Public Library (NYPL) is embarking on a pilot project to extract the data from a publication known as the Catalog of Copyright Entries, published annually by the United States Copyright Office. The volumes have already been digitized and are freely available through the Internet Archive; our project aims to extract and parse the data contained in the records in order to create a searchable database that will aid copyright research.

For more on the project, see "Unlocking the Record of American Creativity—with Your Help"

For more on the catalog, see the following:

Although all of the data extracted to date from the CCEs by the NYPL team is available in this GitHub repo, NYPL has also added records from prior CCE projects and made them available in this unofficial, experimental search interface.

Data Structure and Contents

All data files are in the ./xml directory, organized by year. The XML files conform to the project DTD, and each directory has an alto subdirectory with ALTO format files for the original OCR.

See TOC.md for details on the volumes transcribed so far.

CopyrightEntries.dtd

The main components of an XML file, within the root <copyrightEntries> element, are a mandatory <header> followed by <copyrightEntry>, <entryGroup>, <crossRef> and <pgNum> elements in any order.

There are tags for identifying authors, titles, publishers, and claimants, as well as the various dates and id numbers that an entry can contain. Many elements have attributes for recording normalized versions of dates and numbers or for identifying where corrections have been made.

See the Guide for specifics of formatting entries.

Anatomy of a Registration

The format of entries in the Catalog varies widely over time, but they essentially contain simple bibliographic information plus a registration date and id number.

ADAMS, JAMES DONALD.

  Literary frontiers.  New
    York, Duell, Sloan and
    Pearce.  175 p. © J. Donald
    Adams; 6Jun51; A56505.

This is converted to XML:

<copyrightEntry 
     id="1D4D33CD-6E97-1014-8315-97D5E63C7536"
     regnum="A56505">
  <author>
    <authorName>ADAMS, JAMES DONALD</authorName>.
  </author> 
  <title>Literary frontiers.</title>
  <publisher>
    <pubPlace>New York</pubPlace>, 
    <pubName>Duell, Sloan and Pearce.</pubName> 
  </publisher>
  <desc>175 p.</desc> 
  &#xA9; <claimant>J. Donald Adams</claimant>;
  <regDate date="1951-06-06">6Jun51</regDate>; 
  <regNum>A56505</regNum>.
</copyrightEntry>

Our top priority is to correctly tag the registration numbers and dates since these are required to match registrations to renewals. Next in priority are the authors and titles although for practical purposes a full-text search is probably adequate to find an entry.
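As a rough illustration of why the registration number and date matter most, the following minimal sketch pulls just those fields out of an entry with Python's standard library. The `summarize_entry` helper and the abbreviated `SAMPLE` string are assumptions made for this example, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example entry above (publisher and desc omitted).
SAMPLE = """<copyrightEntry id="1D4D33CD-6E97-1014-8315-97D5E63C7536" regnum="A56505">
  <author><authorName>ADAMS, JAMES DONALD</authorName>.</author>
  <title>Literary frontiers.</title>
  &#xA9; <claimant>J. Donald Adams</claimant>;
  <regDate date="1951-06-06">6Jun51</regDate>;
  <regNum>A56505</regNum>.
</copyrightEntry>"""


def summarize_entry(xml_text):
    """Return the fields needed to match a registration to a renewal."""
    entry = ET.fromstring(xml_text)
    reg_date = entry.find("regDate")
    return {
        "id": entry.get("id"),
        "regnum": entry.get("regnum"),
        # Normalized date lives in the date attribute, not the element text.
        "date": reg_date.get("date") if reg_date is not None else None,
        "title": entry.findtext("title"),
    }


print(summarize_entry(SAMPLE))
```

The printed form of the date ("6Jun51") stays available as the element text, so nothing from the original page is lost by reading the normalized attribute.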

Identifiers

Every registration should have a registration number, such as A56505, but these are not unique. Numbering was restarted in the "Third Series" (1947–), so there is quite a bit of overlap between it and the "New Series." For example, the entry above shares a registration number with another book, Barton Warren Stone, pathfinder of Christian union; a story of his life and times, registered in 1932. Because of this, both a registration number and a date are always required to distinguish A56505/1951-06-06 from A56505/1932-10-12.

In addition every <copyrightEntry> and <crossRef> is assigned a UUID so that it can be uniquely identified, even if the registration number or date is changed (for instance, to correct a typo).
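The collision described above can be sketched with a small lookup table. The `registration_key` helper and the in-memory list are hypothetical; they only show that indexing by number alone conflates the two books while the number/date pair does not.

```python
from collections import defaultdict


def registration_key(regnum, date):
    """Combine number and ISO date in the A56505/1951-06-06 style used above."""
    return f"{regnum}/{date}"


# The two registrations named in the text that share A56505.
registrations = [
    ("A56505", "1951-06-06", "Literary frontiers"),
    ("A56505", "1932-10-12", "Barton Warren Stone, pathfinder of Christian union"),
]

by_regnum = defaultdict(list)  # ambiguous: one number, several books
by_key = {}                    # unambiguous: number plus date
for regnum, date, title in registrations:
    by_regnum[regnum].append(title)
    by_key[registration_key(regnum, date)] = title

print(len(by_regnum["A56505"]))
print(by_key["A56505/1951-06-06"])
```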

Renewals

These volumes were chosen to be transcribed first because they come from the period when a book may still be in copyright if its first 28-year copyright term was renewed, while it is otherwise in the public domain. Renewal data is available from the Stanford Copyright Renewals Database and from an NYPL version of essentially the same sources. The NYPL version is better formatted for matching renewal entries with the registrations in these XML files.

By combining the two datasets we can determine how many books were registered for copyright in every year between 1923 and 1963, as well as how many were renewed:

For this period we have about 642,000 books registered for copyright. Of these about 162,000 or 25% had their copyrights renewed. So, the copyright has expired on 75% of the books published during these years, about 480,000, and they are now in the Public Domain.
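The percentages follow directly from the two rounded counts quoted above; as a quick check of the arithmetic (the variable names are only for this example):

```python
registered = 642_000   # books registered, 1923-1963 (rounded figure from the text)
renewed = 162_000      # of those, renewed (rounded figure from the text)

not_renewed = registered - renewed
renewal_rate = renewed / registered

print(not_renewed)
print(f"{renewal_rate:.1%} renewed, {1 - renewal_rate:.1%} not renewed")
```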

User Stories for Search Interface

Although we do not have an official search interface for this data yet (the unofficial, experimental interface is here), we gathered a group of experts in Copyright Office records to discuss user needs and requirements for a search interface system. That group prioritized a list of user stories that could be used to start to develop such a search interface system.

CCE Information and Organization Sheets

The project team's goal is to complete the transcription and parsing of all of the CCEs. To keep us organized, we've created a number of spreadsheets in Google Sheets. While all of these sheets are works-in-progress, you may find the data useful. We will create a sheet for each major type of work or category within the CCE. Each sheet should include information about the particular CCE volumes relevant to that category, including direct links to each CCE volume, the page numbers that bound each category, the number of expected records based on either the counts included by the Copyright Office in the front matter for each volume or on the record count (depending on year), and other relevant data. For many years, data was published in cycles shorter than one year, creating multiple sections (in CCE lingo, "Numbers") for a single year. The data for each section is recorded in individual tabs for that year. As more sheets are built, sections will be added below.

Overview of CCE Category Changes Over Time

An overview of the organization of the CCEs from 1906 to 1977 is available here. We've also built a directory of links to the CCEs as they are presented in Internet Archive. Eventually, we will fill in the data for pre-1923 CCEs for both of these sheets.

Part 1: Books

Although the Books part designation changed over time (Part 1 Group 1 in 1923, Part 1A starting in 1947, and Part 1 starting in 1953) we've combined the data about books into this sheet.

Part 1B: Pamphlets, Leaflets, and More

From 1923 to mid-1953, the CCEs grouped a number of different kinds of works into a sub-part of the Books category. Depending on the year, this group included pamphlets, leaflets, contributions to newspapers or periodicals, lectures, sermons, addresses for oral delivery, dramatic compositions, maps, and motion pictures. A sheet with information about this sub-category for books is available here.

Part 2: Periodicals

Periodicals have enjoyed their own category within the CCE. A sheet tracking information about this category can be found here.

Part 5: Musical Compositions

Data about musical composition registrations and renewals can be found here.

Status of Project

As of 8/15/22, the registrations for the "Books" category from 1923-1977 have been completed and the data uploaded to this repository.

As time permits, the team is recording information about the relevant CCE volumes for each category of work.

The project team continues to pursue funding opportunities to tackle the transcription and parsing of the remaining CCEs.

Press Inquiries: Please contact Greg Cram or Sean Redmond

catalog_of_copyright_entries_project's Issues

Handle renewals in the registrations

There are two versions of this: 1) Registrations that are their own renewals and 2) renewals that are in the registrations for some reason.

An example of the first is A1053002, being registered for the first time, just under the deadline to renew it:

KENT, FRANK R. Political behavior: the heretofore unwritten laws, customs and principles of
politics as practiced in the United States. © 23Aug28; A1053002. Frank R. Kent (A); 
28Oct55; R158303.

An example of the second is on p. 1507 of 1960 No. 2.

The second entry is just the renewal for A58707. Why it is in the registrations and not the renewals pages of the volume, I don't know.

The solution for both is probably to add a <renewalEntry> element that can be a child of <copyrightEntries>, <entryGroup>, or a single <copyrightEntry>

Editorially provided publication info

Anything in brackets we would usually mark up as a <note>, but what about when it's regular publication info?

In these entries, "1951", "London" etc. are presumably in brackets because they aren't printed on the title page or something like that. I think we should just encode them as what they are (<pubDate> etc) and not as notes, but preserving the brackets, e.g.:

<publisher><pubPlace>Boston</pubPlace>, <pubName>Houghton Mifflin</pubName>, <pubDate date="1951">[1951]</pubDate></publisher>
<publisher><pubPlace>[London]</pubPlace> <pubName>Faber &amp; Faber</pubName></publisher>
<publisher>[<pubPlace>Wilmington</pubPlace>, <pubName>Delaware State Society for Mental Hygiene</pubName>]</publisher>

Correct regDate in AF1238

"30Deo45" should be "30Dec45" (with corresponding date attribute)

Add num attribute to <copies>

See #9

<copies> element needs a num attribute for indications that more than one copy was deposited with the Copyright Office.

Publisher + scope of claim

The publisher is the claimant, so maybe that can be combined à la #11, but there is also a statement about what the copyright applies to "© on full-length book version". This is similar to what the <newMatterClaimed> element is for, but 1) is it new? and 2) how to capture it?

If it is "new" since the publication of the previous selections, allow <newMatterClaimed> here:

<publisher claimant="yes">
    <newMatterClaimed>on full-length book version</newMatterClaimed>
    <pubName>Limes Verlag (Max Niedermayer)</pubName>
</publisher>

Or if it isn't really new, use a different element

<publisher claimant="yes">
    <claim>on full-length book version</claim>
    <pubName>Limes Verlag (Max Niedermayer)</pubName>
</publisher>

Is it important whether any claim is "new" or not? <newMatterClaimed> could just be something like <claim> with the newness left to the context.

A545/14Nov46 should be A345

See renewal R548489 (1946/cc431unse_001-724.xml)

Handle multiple volumes in one entry not as <additionalEntry>

Handle entries like 65CB2E06-6E24-1014-A696-AD35B3FDE14F (1946), where the multiple registration numbers and registration dates are sprinkled throughout the entry in a way that makes using <additionalEntry> impossible.

Normalize regDate

DCL proposes normalizing the registration date to YYYY-MM-DD

<regDate>1951-03-06</regDate>
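A minimal sketch of that normalization, assuming printed dates of the form "6Jun51" and assuming every two-digit year belongs to the 1900s (true for the volumes discussed here, but an assumption nonetheless). `normalize_reg_date` is a hypothetical helper. `datetime.strptime` with `%y` is avoided because it maps years below 69 to the 2000s.

```python
import re
from datetime import date

MONTHS = {
    abbr: number
    for number, abbr in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Day, three-letter month, two-digit year, with no separators.
PRINTED_DATE = re.compile(r"^(\d{1,2})([A-Za-z]{3})(\d{2})$")


def normalize_reg_date(printed, century=1900):
    """Return YYYY-MM-DD for a printed date, or None if it cannot be parsed."""
    match = PRINTED_DATE.match(printed.strip())
    if not match:
        return None
    day, month_abbr, year = match.groups()
    month = MONTHS.get(month_abbr.capitalize())
    if month is None:
        return None  # e.g. an OCR error in the month abbreviation
    try:
        return date(century + int(year), month, int(day)).isoformat()
    except ValueError:
        return None  # impossible calendar date such as 31Feb51


print(normalize_reg_date("6Mar51"))
print(normalize_reg_date("30Deo45"))
```

Returning `None` for unparseable input makes OCR slips such as "30Deo45" easy to list for manual correction. A plausible-looking but wrong year (for instance "2Nov02" printed for 1962) still parses, so it would need a separate range check against the volume year.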

Corporate authors or no Authors

In the first two entries, should "California" be the author? What about the second two?

Add element for cross-references?

Entries like these should perhaps be a different element than copyrightEntry.

Every actual entry should, nay must have a registration number, but the regnum attribute is optional in order, I think, to accommodate these.

Adding a <crossRef> element would be more specific about what is required.

Looks to me like the definition would be

<!ELEMENT crossRef (author, title?, see)>

Each copyrightEntry should have a unique ID

DCL proposes adding a GUID to each copyrightEntry.

Already reflected in 18416b2b8a9797af985c231b33516575e7f79cfc

Entries with multiple registration numbers

When an entry has multiple registration numbers DCL proposes to split them into several entries.

Make registration number an attribute of copyrightEntry

DCL proposes making the registration number an attribute of the copyrightEntry.

<copyrightEntry id="GUID" regnum="A53538">

(already reflected in 18416b2)

Great deal of change over time. Rely on convention over validity?

Looking for the simplest format from different volumes, it's clear how much change there is over time, with the earliest volumes being much more complex than the later.

1927 https://archive.org/stream/catalogofcopyrig241libr#page/3/mode/1up

1930 https://archive.org/stream/catalogofcopyri271libr#page/2/mode/1up

1951 https://archive.org/stream/catalogofcopyri351libr#page/402/mode/1up

1962 https://archive.org/stream/catalogofcopyrig3161lib#page/1141/mode/1up

The last one is easy enough (a little different from the current DTD based on some discussions):

<copyrightEntry id="[GUID]" regnum="A578172">
    <author><authorName>ADAMS, O. R.</authorName></author>
    <title>Lameness in horses</title>. © <publisher><pubName claimant="yes">Lea &amp; Febiger</pubName></publisher>;
    <regDate date="1962-08-10">10Aug62</regDate>; <registrationNumber>A578172</registrationNumber>.
</copyrightEntry>

But a definition that works for this and for Rabbit Diseases while accommodating a lot of CDATA punctuation?

I'm leaning towards defining <copyrightEntry> as ANY rather than trying to be really clever about it. It might be more effective to let a program check that everything does indeed have a registration number instead of relying on the validity of the XML. We can be more specific in the definition of some of the parts.
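A program-side check of that kind could look roughly like the sketch below. It assumes the number appears either as a `regnum` attribute or in a `<regNum>` child, which matches the README example but not every proposal in these issues; `entries_missing_regnum` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_DOC = """<copyrightEntries>
  <copyrightEntry id="a" regnum="A1"><regNum>A1</regNum></copyrightEntry>
  <copyrightEntry id="b"><title>No number here</title></copyrightEntry>
</copyrightEntries>"""


def entries_missing_regnum(xml_text):
    """List the ids of copyrightEntry elements that carry no registration number."""
    root = ET.fromstring(xml_text)
    missing = []
    for entry in root.iter("copyrightEntry"):
        has_attribute = bool(entry.get("regnum"))
        has_element = entry.find(".//regNum") is not None
        if not (has_attribute or has_element):
            missing.append(entry.get("id"))
    return missing


print(entries_missing_regnum(SAMPLE_DOC))
```

Because the check runs outside the DTD, the content model of `<copyrightEntry>` can stay loose while the registration-number rule is still enforced.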

Authors with places and claimant status

From DCL, how to deal with authors that have places indicated as well as volumes where an asterisk indicated that the author is also the claimant

(from https://archive.org/stream/catalogofcopyrig42libr#page/197/mode/1up)

A192154/27Nov45 should be A192134

1945/cc42libr_001-289.xml

Better definition of classCode

Current definition is "letter or letters at the beginning of the registration number" and in the example the classCode of AF0-4873 is taken to be AF.

Is that right? Or is the classCode AF0?

Is classCode necessary?

if classCode is just the first letter or letters of the registration number is an element necessary? Can't it just be calculated from the regnum? If there is some processing advantage to having it "precalculated" in the XML, and if the regnum is now going to be an attribute of copyrightEntry (see #5), maybe a regclass attribute could be added as well.

<copyrightEntry id="GUID" regnum="A53538" regclass="A">
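If the class really is just the leading letters, it can be computed on demand. The sketch below takes that reading, so "AF0-4873" yields "AF" and the zero is treated as part of the serial; that is one of the two interpretations under discussion, not a settled rule. `reg_class` is a hypothetical helper.

```python
import re


def reg_class(regnum):
    """Return the leading run of capital letters, or None if there is none."""
    match = re.match(r"[A-Z]+", regnum)
    return match.group(0) if match else None


print(reg_class("A53538"))
print(reg_class("AF0-4873"))
```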

Multiple names and roles

How should we deal with this entry, where there are 4+ authors and 2 roles?

Adventure in radio,
edited by M. Cuthbert, with radio scripts by Edna St. Vincent Millay, Arch Obeler, Archibald MacLeish [and others]

© Sept. 17, 1945; A 189950.

A51252/2Jan51 Should be A61252

See renewal RE010376 (cc351libr_401-631.xml)

A51221/14Nov51 should be A61221

See renewal RE027389 (cc351libr_401-631.xml)

Typos in README.md#renewals, "while" and "with"

In README.md#renewals, as of commit 77886a8, I see two typographic errors:

word "whike" should read "while" in "whike it is otherwise public domain".
missing word, probably "with", between "entries the" in "for matching renewal entries the registrations in these XML files"

Correct regdate for A596716

"2Nov02" should be "2Nov62" (cc3161lib_1137-1964.xml)

Correct regdate for A591209

"28Sep02" should be "28Sep62" (cc3161lib_1137-1964.xml)

Regularize normalized dates

There are several elements that are or can have dates. At least: registration date (see #6), copies, and affDate.

The proposal is to markup the number of copies as

<copies date="1951-05-03">2c</copies>

and affDate like so

<affDate>1930-01-04</affDate>

So for one it might be the regdate attribute, for another the date attribute, and for yet another the contents of the affDate tag. Can there be one attribute that always represents the normalized date on whatever tag?

A89246/11Jun52 should be A69246

(1952/cc361libr_443-690.xml)

Switch order of title and author

The DTD currently requires title and author in that order when they mostly (always) occur in opposite order. We could just reverse the order of those two elements, or try to make it more flexible.

Correct foreign registrations normalized as AI

In the 1930's volumes some entries with an "A-Foreign" registration were incorrectly normalized as AI

"Complex analysis" 1953, vol. 7 part 1A, number 1, p. 3 is missing a regnum

The regnum is missing from the volume:

But A78600 can be supplied from its renewal (RE106063)

Process motion picture entries too?

Hi. I love what you have been able to do here. I have been trying to identify and count the number of out of copyright movies for a few years now. One idea I had to be able to do so for movies created in USA is to check for renewals, but this proved hard without machine readable information about movie renewals. Could you please extend your XML catalog to also include the motion picture entries?

See http://people.skolelinux.org/pere/blog/Idea_for_finding_all_public_domain_movies_in_the_USA.html for my sketch from a few years ago on how to do this. Perhaps your approach is better?

Move alto files to a different repo

The alto files add an enormous number of bytes to the repo and make a first checkout take forever. Since they aren't necessary for most uses of the data, move them to their own repo.

Keep punctuation in title?

Mark this up without the period as

<title>Social Evolution</title>

or with the period as

<title>Social Evolution.</title>

Need a canonical registration number format

What are the registration numbers like?

Every registration entry should have a registration number (some of them mysteriously don't, but that's another issue). The form changed a bit over time but is generally a class plus a serial number. The class for books is A, with a couple of variations.

In the volumes digitized so far we see the following variations

1927

"A",
"A—Foreign", foreign books
"A ad int.", interim registration for a book published abroad

The class is followed by a space and the serial number.

Other slight variations occur, but they are basically typos

1942

"A" for books
"A for." for foreign books
"A ad int." for interim registrations
"AA" (pamphlets) is also found in two entries

A space separates the class and serial number

1946

"A" for books
"A for." or "AF" for foreign books
"A ad int." or "AI" for interim registrations

Some entries have a space between the class and the serial number, some do not.

1951

"A", "AF", and "AI" occur, along with several other prefixes: "AA", "B", "DF", "DP", "JP", "K".

Sometimes there is a hyphen between the class and the serial number, otherwise there is nothing.

Sometimes the serial number has a "0-" prefix itself (that's a zero, not a letter O)

Should we regularize the numbers?

The Stanford Copyright Renewals database has regularized all the forms to the "1951" version here. That is, "A—Foreign" and "A for." have been changed to "AF". For the sake of interoperability, we should provide regularized versions of "AF" and "AI" numbers. The canonical form could be:

[Class][Serial Prefix][Serial Number]

With no spaces or hyphens except in the "serial prefix" (which is optional). E.g:

A12345
AF12345
AF0-12345
AI12345
AI0-12345

etc.

Since we record the registration number both as an attribute of the copyrightEntry and in a <regNum> element in the entry, could we record the printed version of the number (verbatim, even with typos) as the <regNum> element, and convert it to a regularized form in the regnum attribute?
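One way the proposed canonical form might be derived from the printed variants listed above is sketched here. `canonical_regnum` and its alias table are assumptions for illustration; they cover only the spellings named in this issue and return `None` for anything else so that typos surface for review.

```python
import re

# Printed class spellings mapped to the regularized "1951" prefixes.
CLASS_ALIASES = [
    (re.compile(r"^A\s*(?:—|-)\s*Foreign", re.IGNORECASE), "AF"),
    (re.compile(r"^A\s*for\.", re.IGNORECASE), "AF"),
    (re.compile(r"^A\s*ad\s*int\.", re.IGNORECASE), "AI"),
]

# Class letters, optional space or hyphen, optional "0-" serial prefix, digits.
CANONICAL = re.compile(r"^([A-Z]+)[\s-]*(0-)?\s*(\d+)$")


def canonical_regnum(printed):
    """Return [Class][Serial Prefix][Serial Number], or None if unrecognized."""
    text = printed.strip()
    for pattern, replacement in CLASS_ALIASES:
        text, count = pattern.subn(replacement, text, count=1)
        if count:
            break
    match = CANONICAL.match(text)
    if not match:
        return None
    reg_class, serial_prefix, serial = match.groups()
    return f"{reg_class}{serial_prefix or ''}{serial}"


for printed in ["A 189950", "A—Foreign 12345", "A ad int. 12345", "AF0-4873"]:
    print(printed, "->", canonical_regnum(printed))
```

The printed string would stay untouched in `<regNum>`, and only the `regnum` attribute would receive the result of a function like this.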

Better markup for previous registrations

We have the <prev-regNum> element for AI0-370 in this entry, but no specific way to encode the explanation and date "Application states prev. pub. abroad 15Feb50"

It seems like a good idea to capture this. I would propose something like

<prevReg>
    Application states prev. pub. abroad 
    <regDate date="1950-02-15">15Feb50</regDate>,
    <regNum>AI0-370</regNum>
</prevReg>

project status?

Could you add a 'status' section to the README? It looks like the work's all done, and yet there are open issues.

Balance between text and data?

Some bits of the data are so far going to be converted to attributes, meaning they'll be taken out of the text representation of the XML though the data is preserved. Can we decide on a principle to help guide when that occurs? To take the copies element as an example:

it could (1) just be text

<copies>2c 3May51</copies>

The current proposal (2) from DCL is to regularize the date (see #8)

<copies date="1951-05-03">2c</copies>

But we could go further (3) and just parse out the number of copies, too, so that it's an empty tag

<copies date="1951-05-03" num="2"/>

Or combine the first and third (4)

<copies date="1951-05-03" num="2">2c 3May51</copies>

I think we should use either the first or the last (and really, I think the last is the best option). They both preserve the original information. The second (currently proposed) version does some of the processing up front and makes later processing easier but leaves out an important piece. The last option will be the easiest to deal with for both human and machine.
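To make option (4) concrete, here is a minimal sketch that keeps the printed text and adds the parsed values as attributes. It assumes the printed form is "<n>c <day><Mon><yy>" and that all years fall in the 1900s; `copies_element` is a hypothetical helper, not project code.

```python
import re
import xml.etree.ElementTree as ET

MONTHS = {
    abbr: number
    for number, abbr in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Copy count followed by "c", then a date such as 3May51.
COPIES = re.compile(r"^(\d+)c\.?\s+(\d{1,2})([A-Za-z]{3})(\d{2})$")


def copies_element(printed, century=1900):
    """Build <copies> with the verbatim text plus date and num attributes."""
    element = ET.Element("copies")
    element.text = printed  # original wording is always preserved
    match = COPIES.match(printed.strip())
    if match:
        num, day, month_abbr, year = match.groups()
        month = MONTHS.get(month_abbr.capitalize())
        if month is not None:
            element.set("date", f"{century + int(year):04d}-{month:02d}-{int(day):02d}")
            element.set("num", num)
    return element


print(ET.tostring(copies_element("2c 3May51"), encoding="unicode"))
```

When the text does not match the expected pattern, the element is still emitted with its text and no attributes, so nothing is lost and the gap is visible to later processing.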

Jack Kerouac

Yo, the http://cce-search.herokuapp.com/ site is down. I want to let you know that Jack Kerouac had a few titles show up as being in the public domain. Their literary executor says this isn't the case and I'd love to confirm. Can't report the serial numbers as the site is down. Is the project still being monitored? Any way I can help? (I don't know code, but happy to finance a fix)

A58849/17Aug51 Should be A58649

(cc351libr_401-631.xml)

How to contribute

Hi! How can I contribute? For example, if I notice a typo in a book title, should I open a pull request?

EF264/2Jan45

Correct "haleine" to "baleine" (cc311libr_001-451.xml)

How to contribute / help with new data

Is there a way to support OCR + data cleaning of new volumes? (Say, renewals for other classes?)

This call for help was inspiring; what remains to be done?

Do we need to deal with Author headings

Sometimes the names under which the entries appear carry some extra information. For instance (example from DCL):

Cuthbert, Margaret,* New York.

Adventure in radio,
edited by M. Cuthbert, with radio scripts by Edna St. Vincent Millay, Arch Obeler, Archibald MacLeish [and others]

© Sept. 17, 1945; A 189950.

The asterisk means the Margaret Cuthbert is the claimant. There is also the info that she is from New York. Is this used for disambiguation?

Another example:

Curtis, Charles P., jr.,* Ipswich, Mass. & Greenslet, Ferris,* Boston.

The practical cogitator, selected and edited by C. P. Curtis, jr. and Ferris Greenslet.

© Oct. 9, 1945; A 190420.

If we do not include the author headings in the XML, then we still need to mark the authors as claimants in the entry, i.e.

<author claimant="true"><role>edited by</role>  <authorName>M. Cuthbert</authorName></author>

However, the principles stated in #9 argue for recording the heading, and we would then need to group the entries under the heading. Something like:

<entryGroup>
    <heading>
        <author><authorName claimant="true">Cuthbert, Margaret</authorName>,* <authorPlace>New York</authorPlace>.</author>
    </heading>

    <copyrightEntry>...</copyrightEntry>
    <copyrightEntry>...</copyrightEntry>
</entryGroup>

If we do this, I think it's still a good idea to mark the claimants within the <copyrightEntry> elements themselves. Since the forms of the names don't match (at least in these examples) I don't know how hard that will be.

Allow publisher to be claimant?

We have both publisher and claimant tags but it seems (after an exhaustive survey covering one whole page) that one of the simplest forms of entry has the two collapsed:

© Vantage Press, inc., New York

What should we do in a case like this? We could make publisher a valid child of claimant

<claimant><publisher><pubName>Vantage Press, inc.</pubName><pubPlace>New York</pubPlace></publisher></claimant>

(on what to do about the punctuation, see #3 )

Or we could add some attribute on the publisher

<publisher isClaimant="true"><pubName>Vantage Press, inc.</pubName><pubPlace>New York</pubPlace></publisher>

I think the latter is easier to process.

querying VIAF for name authorities

The following VIAF search endpoint ought to return the correct name entity, based on the author name and work title:

http://www.viaf.org/viaf/search?query=local.names+all+"Cuthbert, Margaret"+and+local.title+all+"Adventure in radio"&maximumRecords=5&httpAccept=application/json

Adjust parameters to taste.

I tried this against some of the entries in basic.xml, and the query returned either 1 LC entity or 0 (if no name was present). Bunt, James isn't in the LCNAF, so he has no VIAF entry.
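For scripting the lookup, the query string can be assembled and percent-encoded as in the sketch below. The endpoint and parameter names are copied from the URL above; `viaf_search_url` is a hypothetical helper, it only builds the URL without sending a request, and nothing here confirms that the endpoint still responds in this form.

```python
from urllib.parse import urlencode

VIAF_SEARCH = "http://www.viaf.org/viaf/search"


def viaf_search_url(name, title, max_records=5):
    """Build the search URL for an author name and work title."""
    # Names or titles containing double quotes would need escaping first.
    query = f'local.names all "{name}" and local.title all "{title}"'
    params = {
        "query": query,
        "maximumRecords": max_records,
        "httpAccept": "application/json",
    }
    return f"{VIAF_SEARCH}?{urlencode(params)}"


print(viaf_search_url("Cuthbert, Margaret", "Adventure in radio"))
```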