Giter Club home page Giter Club logo

pinkslipper's Introduction

PinkSlipper

PinkSlipper is a webapp designed for executive recruiters. It aggregates press releases relating to the hiring / firing of executives around the country and displays them in a simple intuitive UI. A custom Chain Conditional Random Field Model was built to semantically analyze text in each press release headline enabling Pinkslipper to filter the content on the site so it is relevant for an executive recruiter.

Pinkslipper Logo

Motivation

I met several executive recruiters who described a difficult problem in their industry. They are often contracted to find a new CEO, CFO, etc to run a company purchased through a PE shop. They have to close quickly, and it becomes difficult if their networks run dry. Furthermore, sources that currently exist such as LinkedIn, BizJournalNews, and Press Release sites do not yield effective results that one could act on. Therefore, I set out to build a simple intuitive UI that accomplished this goal, eliminating the hangups that plauge various alternatives.

Data Collection

I scraped 10,000 articles from PR Newswire. An example article is below. For each article, I stored relevant information in a mongoDB including:

  1. Headline
  2. Time
  3. Date
  4. Location
  5. Images
  6. Body Text

Press Release

Feature Engineering

Goal was to predict the label for each word in a headline. Possible labels are:

  1. Person
  2. Organization
  3. Position
  4. Other

Labelling

To accurately predict the correct label for each word in a headline, I used the following features:

  1. Stanford's case insensitive POS tagger
  2. Normalized Position in headline
  3. Boolean for entire word being capitalized
  4. Boolean for entire word being lowercased
  5. POS tags of the two words prior and after

Labelling Training Data

To build a supervised model, I manually labeled 550 headlines which contained around ~5000 total words. I built a webapp to simplify this process, allowing a scalable and crowdsourced approach to be taken.

Modeling

I built a custom chain conditional random field (Chain-CRF) model to predict labels for each word. A CRF is a graphical model which stores features as nodes and connects those features via edges. The weights of the edges are learned through the algorithm such that accuracy is maximized. The inherent graphical nature of this model enables improved accuracy because the label for a given word is inherently affected by the labels of the words around it.

Model Results

I used a 5-fold cross-validated approach for validating the model. Accuracy is defined by evaluating the number of correctly predicted labels against the total number of predicted labels.

Model Comparison

Comparing my Chain-CRF (trained on 550 headlines) to Stanford's state-of-the-art NER (trained on 300,000 lines of text) demonstrates the success of my model. Stanford's model was able to better predict PERSON in a headline by 5%. My model predicted ORGANIZATION 13% better than Stanford's model.

All that said, Stanford's model was never designed to predict POSITION labels. This is the true reason a custom model is needed. My model ended up predicted POSITION with 80% accuracy. Thus for specialized applications, the efforts to build custom models are justified.

Results

The chain-CRF model predicted labels for each word with 75% accuracy. It predicted POSITION labels best at 80%, and outperformed any model currently available. The model was then saved, and all unlabelled, scraped articles were piped through it. Their predicted labels were then stored in a final mongoDB. A Flask / Bootstrap webapp was built to display articles in this final mongoDB. They would be loaded in, and simply filtered based on the words with POSITION tags.

An example video of the working app is below. Click on the image to check it out!

PinkSlipper Demo

Conclusion

A custom Chain-CRF was built to extract PERSON, ORGANIZATION, and POSITION information from press releases relating to the hiring / firing of executives. A webapp was built to display press releases for interested executive recruiters. The press release content is filtered on the POSITION tags of the articles to ensure it is only displaying content relevant to that specific recruiter.

Tools Utilized

####Stack:

  • python
  • git
  • markdown

####Debugging:

  • ipdb

####Web Scraping:

  • requests
  • beautiful soup
  • newspaper

####Databasing:

  • pymongo
  • mongodb

####Feature Engineering:

  • NLTK
  • Stanford POS Tagger
  • Stanford NER Tagger (for comparison)
  • regex

####Modeling:

  • numpy
  • pandas
  • scikit learn
  • pystruct
  • json
  • pickle

####Plotting:

  • matplotlib
  • seaborn
  • plotly

####Web App:

  • flask
  • bootstrap
  • pagination
  • html
  • css

pinkslipper's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.