Parse-CV

Extract publication citation information from curriculum vitae using R

Overview

The goal of this project is to extract and parse citations from the publication section of a curriculum vitae (CV). We take a CV in PDF format and convert the document into XML. We then identify all section headings in the CV and extract the text in the publication section. We parse the publication text into individual citation strings, which we pass to the CrossRef API. The final output is a data frame of results with fields for each citation's DOI, title, authors, journal, year, citation counts, and various scoring metrics.

PDF to XML

We first convert the PDF to XML using pdfminer (available here) and Python. This initial step uses code from my professor, Duncan Temple Lang, which is available here. The code uses a system call to interface with pdfminer and performs some initial cleaning and organizing of the XML nodes.
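As a rough illustration of the system-call approach, the sketch below shells out to pdfminer's pdf2txt.py command-line tool from R and reads the result back in. It is not the code referenced above: the function names pdfminer_args and pdf_to_xml are hypothetical, and the sketch assumes that pdf2txt.py is on the PATH and that the XML package is installed. The cleaning and organizing of the XML nodes is not shown.

```r
# Build the command-line arguments for pdfminer's pdf2txt.py.
# "-t xml" selects XML output and "-o" names the output file.
pdfminer_args <- function(pdf_path, xml_path) {
  c("-t", "xml", "-o", shQuote(xml_path), shQuote(pdf_path))
}

# Hypothetical helper: convert one PDF to XML with a system call,
# then parse the XML file. Assumes pdf2txt.py is on the PATH.
pdf_to_xml <- function(pdf_path,
                       xml_path = paste0(tools::file_path_sans_ext(pdf_path), ".xml")) {
  status <- system2("pdf2txt.py", pdfminer_args(pdf_path, xml_path))
  if (status != 0) stop("pdfminer failed for ", pdf_path)
  XML::xmlParse(xml_path)
}
```

A call such as pdf_to_xml("cv.pdf") would write cv.xml next to the PDF and return the parsed document.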

Extract Sections

We first extract important text features from every line of the CV: text size, capitalization, left / right indentation, font, bold / italics, etc. We then group the text based on common features and try to identify a single group that contains all the sections by comparing each group with a known list of common section names. Specifically, on the first pass we'll group using the text features:

  • capitalization
  • text size
  • bold
  • italics
  • left indentation

The goal of the first grouping is to identify capitalized, bolded / italicized, or larger text size section titles which have the same left indentation. On the second pass, we drop the left indentation requirement so that we can also handle centered section titles and CVs with imperfect left indentation. The known section names are located in /text_files/section_names.txt.
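The two-pass grouping can be sketched in base R as follows. This is a simplified illustration, not the project's code: the toy data frame, the column names, the find_sections helper, and the rule of picking the group with the most known names are all assumptions, and known_sections stands in for the contents of the section names file.

```r
# Toy line-level features; in practice these would come from the XML.
lines <- data.frame(
  text   = c("JANE DOE", "EDUCATION", "PhD, Statistics, 2012",
             "PUBLICATIONS", "Doe J. (2013). A paper title.", "AWARDS"),
  size   = c(16, 12, 10, 12, 10, 12),
  bold   = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  italic = FALSE,
  left   = c(200, 72, 72, 72, 72, 72),
  stringsAsFactors = FALSE
)
lines$caps <- lines$text == toupper(lines$text)

# Stand-in for the list of known section names.
known_sections <- c("education", "publications", "awards", "experience")

# Group lines that share the chosen features and return the text of the
# group containing the most known section names (NULL if no group has any).
find_sections <- function(lines, features, known) {
  key <- do.call(paste, c(lines[features], sep = "|"))
  groups <- split(lines$text, key)
  hits <- vapply(groups, function(g) sum(tolower(trimws(g)) %in% known),
                 numeric(1))
  if (max(hits) == 0) return(NULL)
  groups[[which.max(hits)]]
}

first_pass  <- c("caps", "size", "bold", "italic", "left")
second_pass <- setdiff(first_pass, "left")  # drop the indentation requirement

sections <- find_sections(lines, first_pass, known_sections)
if (is.null(sections)) {
  sections <- find_sections(lines, second_pass, known_sections)
}
```

Here the name line "JANE DOE" is capitalized and bold but has a different size and indentation, so it falls into its own group and is not mistaken for a section title.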

Extract Citations

Now that we have the section names, we need to extract all the text between the publication section and the next section and then split that text appropriately into individual citation strings.

Step 1: Get citation text

We proceed down one of two paths to extract the citation text, depending on whether we successfully found section names in the previous step. If we did, we locate the publication section and extract all text until the next section (which we know). If we did not, we walk through the CV looking for the publication section, and then extract all text until we find any previously identified section which isn't publication. The known section names are located in /text_files/publications.txt.
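A minimal sketch of both paths, assuming the CV has already been reduced to a character vector of lines in reading order. The function get_citation_text and the two default name lists are hypothetical stand-ins for the project's own code and text files.

```r
# cv_lines:    character vector of CV lines, in reading order.
# sections:    section headings found in the previous step (NULL if none).
# pub_names:   lower-case names that mark the publication section (assumed).
# other_names: lower-case names of other common sections (assumed).
get_citation_text <- function(cv_lines, sections = NULL,
                              pub_names = c("publications",
                                            "selected publications"),
                              other_names = c("education", "awards",
                                              "teaching", "grants")) {
  clean <- tolower(trimws(cv_lines))
  start <- which(clean %in% pub_names)[1]
  if (is.na(start)) return(character(0))

  # Path 1: stop at the next heading found for this CV.
  # Path 2: stop at any known section name that isn't publication.
  stops <- if (length(sections) > 0) tolower(trimws(sections)) else other_names
  stops <- setdiff(stops, pub_names)

  after <- which(clean %in% stops)
  after <- after[after > start]
  end <- if (length(after) > 0) after[1] - 1 else length(cv_lines)
  if (end <= start) return(character(0))
  cv_lines[(start + 1):end]
}

cv <- c("EDUCATION", "PhD 2012", "PUBLICATIONS",
        "1. Doe J. (2013). Paper A.", "2. Doe J. (2014). Paper B.",
        "AWARDS", "Best paper")
citation_text <- get_citation_text(cv, sections = c("EDUCATION",
                                                    "PUBLICATIONS", "AWARDS"))
```

If no later section is found, the sketch falls back to taking everything through the end of the CV.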

Step 2: Parse citation text

Now that we have the citation text, we'll attempt to parse citations using the following methods.

  • numbered list
  • author’s name
  • year (used as a list)
  • textbox (a grouping returned from pdfminer)
  • left indentation
  • most common starting word

For each method above, we check whether the citation parse was successful and, if not, proceed to the next most likely parsing method.
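The fallback chain can be sketched as below for two of the methods (numbered list and author's name). All function names are hypothetical, and the success check in looks_parsed (at least two citations, each containing a four-digit year) is an assumed heuristic, not the project's actual test.

```r
# TRUE for lines that begin a numbered entry such as "1.", "2)" or "[3]".
split_numbered <- function(x) {
  grepl("^\\s*\\[?[0-9]+[.)\\]]", x, perl = TRUE)
}

# TRUE for lines that begin with the CV author's surname.
starts_author <- function(x, surname) {
  startsWith(tolower(trimws(x)), tolower(surname))
}

# Each TRUE in `is_start` opens a new citation; following lines are
# appended to it. Lines before the first start are dropped.
group_by_starts <- function(x, is_start) {
  if (!any(is_start)) return(character(0))
  x <- trimws(x)
  id <- cumsum(is_start)
  keep <- id > 0
  unname(vapply(split(x[keep], id[keep]), paste, character(1),
                collapse = " "))
}

# Assumed success check: at least two citations, each with a year.
looks_parsed <- function(citations) {
  length(citations) >= 2 && all(grepl("(19|20)[0-9]{2}", citations))
}

# Try each method in order and return the first parse that looks successful.
parse_citations <- function(x, surname) {
  methods <- list(
    numbered = function(x) split_numbered(x),
    author   = function(x) starts_author(x, surname)
  )
  for (name in names(methods)) {
    citations <- group_by_starts(x, methods[[name]](x))
    if (looks_parsed(citations)) {
      return(list(method = name, citations = citations))
    }
  }
  list(method = NA_character_, citations = character(0))
}

text <- c("Doe J, Smith A. (2013). Paper A.",
          "Journal of Things 1(2).",
          "Doe J. (2014). Paper B. Journal of Stuff.")
parsed <- parse_citations(text, "Doe")
```

The remaining methods (year, textbox, left indentation, most common starting word) would slot into the methods list in the same way, each returning a logical vector marking where a citation starts.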

Parse Citations

The final step is to pass our parsed citation strings to the CrossRef API, which is available through an R package.

install.packages( "rcrossref" )

CrossRef returns the DOI, title, authors, journal, year, and a score for the match. We first remove results which don't contain the CV author's name. We then create a fuzzy match between the title CrossRef returns and the original citation. This gives us a pseudo percentage for the title match. We multiply the returned score by the title score to give us a better sense of which citation is correct. This isn't needed when the citation is obvious, but many times none of the returned results seem likely based solely on the score. If we get a good title match, however, we can be more confident we've found the actual citation.
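The filtering and scoring steps can be sketched in base R as follows. This is an illustration under stated assumptions rather than the project's code: title_match and rank_matches are hypothetical names, the fuzzy match uses adist() with partial matching (the project may use a different measure), and the toy results data frame with a flattened author_names column stands in for what a query such as rcrossref::cr_works(query = citation, limit = 5) would return.

```r
# Pseudo percentage (0 to 1) of the returned title found in the citation.
# With partial = TRUE, adist() gives the edit distance between the title
# and its best-matching substring of the citation.
title_match <- function(title, citation) {
  d <- adist(title, citation, partial = TRUE, ignore.case = TRUE)[1, 1]
  max(0, 1 - d / nchar(title))
}

# results: data frame with `title`, `author_names` and `score` columns
# (assumed layout). Keeps rows naming the CV author, then ranks them by
# CrossRef score multiplied by the title score.
rank_matches <- function(results, citation, surname) {
  keep <- grepl(tolower(surname), tolower(results$author_names), fixed = TRUE)
  results <- results[keep, , drop = FALSE]
  results$title_score <- vapply(results$title, title_match, numeric(1),
                                citation = citation, USE.NAMES = FALSE)
  results$combined <- results$score * results$title_score
  results[order(-results$combined), , drop = FALSE]
}

# Toy candidates standing in for CrossRef results.
results <- data.frame(
  title        = c("Parsing curricula vitae with R",
                   "Unrelated work on frogs",
                   "Parsing curricula vitae with R"),
  author_names = c("Jane Doe; Al Smith", "Jane Doe", "Bob Jones"),
  score        = c(40, 45, 60),
  stringsAsFactors = FALSE
)
citation <- "Doe J, Smith A. (2013). Parsing curricula vitae with R. Journal of Things."
ranked <- rank_matches(results, citation, "Doe")
```

In this toy example the third candidate is dropped because it lacks the CV author's name, and the first candidate outranks the second despite its lower raw score because its title appears in the citation.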

Preliminary Results

We attempted to extract section names from over 45,000 CVs. We successfully found all section names in nearly 39,000 CVs, a success rate of 85.3%.

We also ran the full algorithm on 65 training CVs. We found the section names in 98.5% and extracted citations from 86.2% of those 65 CVs. In total we successfully parsed 514 citations with a median title percentage match of 90%.

Contributors

kdelrosso
