
sgsma-topic-analysis

Overview

A brief LDA topic analysis for SGSMA lit review.

LDA Topic Analysis.ipynb is the primary notebook and should be run first, both to view the results and to generate the saved model data. The DAN classifier didn't work out and can be ignored, although it has been left in for review.

Setup

If you are using Anaconda, the requirements are probably already installed. Otherwise, please use the provided requirements.txt file to install the dependencies with pip:

pip install -r requirements.txt

Database

To parse the CSV files into a SQLite database, use the scripts/wrangle.py script. For more information on how to use it, run the script with the --help flag. The extracted database can also be provided upon request.

The database wrangling is not perfect: a number of errors occur where records violate the normalization constraints. See the report of wrangling errors below for the current dataset. Note that many of these errors could be cleaned up with some further data wrangling (e.g. case normalization, whitespace normalization, etc.), but that was beyond the scope of the first extraction.
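As a rough illustration of the kind of cleanup mentioned above, here is a minimal sketch of case and whitespace normalization for keyword strings. The function name `normalize_keyword` is hypothetical and is not part of scripts/wrangle.py; the sample strings are made up.

```python
import re


def normalize_keyword(raw):
    """Collapse runs of whitespace, trim the ends, and fold case."""
    return re.sub(r"\s+", " ", raw).strip().casefold()


# Made-up sample values: variants that differ only by case or spacing.
keywords = ["ZigBee", "zigbee ", "Smart  Grid", "smart grid"]
unique = sorted({normalize_keyword(k) for k in keywords})
print(unique)  # ['smart grid', 'zigbee']
```

Normalizing before insertion would likely fold many near-duplicate keywords into one row, at the cost of losing the original capitalization unless it is stored separately.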

The schema of the database is as follows:

[Figure: Database Schema]

This schema is not set in stone; it can be adapted and the database re-wrangled as necessary.

Currently the database contains the following tables and row counts:

Table                 Rows
affiliations          12458
article_keywords      248341
article_labels        9447
articles              9767
author_affiliations   31437
author_articles       36013
authors               16116
keywords              88002
labels                56
publications          2747
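To check these counts against a freshly wrangled database, something like the following sketch should work. It uses only the standard sqlite3 module; the helper name `table_row_counts` is hypothetical, and the demo runs on a throwaway in-memory database with made-up rows rather than the real extract.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # Names come from sqlite_master itself, so quoting them as identifiers is safe.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Demo on an in-memory database; replace ":memory:" with the path to the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO labels (name) VALUES (?)", [("a",), ("b",)])
counts = table_row_counts(conn)
print(counts)  # {'labels': 2}
```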

Wrangling Errors

The wrangling script reports errors as JSON output (for easy parsing and data retrieval). The error types discovered were as follows:

Error                                   Count
duplicate keywords                      899
could not lookup article for label      296
non-unique author names for article     94
could not assign label to article       75
could not insert article                51
Total                                   1415
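A tally like the one above can be rebuilt from the JSON output with a few lines of Python. This is only a sketch: it assumes one JSON object per line with an "error" field holding the error type, which may not match the script's actual output format, and the sample records are made up.

```python
import json
from collections import Counter


def tally_errors(lines):
    """Count error records by type, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record["error"]] += 1  # "error" is an assumed field name
    return counts


# Made-up sample records, not real output from the wrangling script.
sample = [
    '{"error": "duplicate keywords", "title": "Example A"}',
    '{"error": "could not insert article", "title": "Table of Contents"}',
    '{"error": "duplicate keywords", "title": "Example B"}',
]
for kind, count in tally_errors(sample).most_common():
    print(f"{kind}: {count}")
```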

Duplicate keywords is a harmless error; it occurs, for example, if the author keywords include the term "ZigBee" twice. Could not insert article is a bit more worrying: it means that a unique constraint was violated (e.g. between title and publication year). Most of the time this was for the document title "Table of Contents".
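To make the "could not insert article" case concrete, here is a small sketch of how a unique constraint on title and year rejects a second "Table of Contents" entry. The table definition is a simplified stand-in, not the project's actual schema, and `insert_article` is a hypothetical helper; the years are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: the real articles table has more columns.
conn.execute(
    "CREATE TABLE articles ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " year INTEGER NOT NULL,"
    " UNIQUE (title, year))"
)


def insert_article(conn, title, year):
    """Insert one article; return True on success, False on a constraint violation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO articles (title, year) VALUES (?, ?)", (title, year)
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    insert_article(conn, "Table of Contents", 2015),
    insert_article(conn, "Table of Contents", 2015),  # same title and year
    insert_article(conn, "Table of Contents", 2016),  # same title, different year
]
print(results)  # [True, False, True]
```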

One issue that we have with this dataset is non-unique author names (e.g. Y. Liu appearing twice in the author list of a single article). I assume these are two different people, but we have no way of identifying which is which, particularly across documents. Therefore, when the same name appears twice in an author list, I've simply appended a numeric suffix to the repeated name (e.g. Y. Liu (2)).
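The suffixing approach can be sketched as follows. The function name `disambiguate` is hypothetical, and leaving the first occurrence unsuffixed is an assumption on my part; the wrangling script may handle that detail differently.

```python
from collections import Counter


def disambiguate(names):
    """Append ' (n)' to the n-th repeat of a name within one author list."""
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        # Assumption: the first occurrence keeps the bare name.
        result.append(name if seen[name] == 1 else f"{name} ({seen[name]})")
    return result


print(disambiguate(["Y. Liu", "J. Smith", "Y. Liu"]))
# ['Y. Liu', 'J. Smith', 'Y. Liu (2)']
```

Note that the suffix only distinguishes names within one article; it says nothing about whether "Y. Liu (2)" in one article is the same person as "Y. Liu (2)" in another.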

Label assignment has two issues: one where the article cannot be looked up by title and publication year, and one where a label is assigned to the same article twice (possibly because of duplicates in the dataset). For now, the wrangling script simply ignores these errors.
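The two label failure modes can be sketched like this. The schema below is a simplified guess at the relevant tables, `assign_label` is a hypothetical helper, and the returned strings just reuse the error names from the table above; none of this is taken from scripts/wrangle.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema with made-up rows.
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, year INTEGER,
                       UNIQUE (title, year));
CREATE TABLE labels (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE article_labels (article_id INTEGER NOT NULL REFERENCES articles (id),
                             label_id INTEGER NOT NULL REFERENCES labels (id),
                             PRIMARY KEY (article_id, label_id));
INSERT INTO articles (title, year) VALUES ('Example Article', 2016);
INSERT INTO labels (name) VALUES ('example-label');
""")


def assign_label(conn, title, year, label):
    """Link a label to an article; return 'ok' or the name of the error."""
    article = conn.execute(
        "SELECT id FROM articles WHERE title = ? AND year = ?", (title, year)
    ).fetchone()
    if article is None:
        return "could not lookup article for label"
    label_row = conn.execute(
        "SELECT id FROM labels WHERE name = ?", (label,)
    ).fetchone()
    if label_row is None:
        raise ValueError(f"unknown label: {label!r}")
    try:
        with conn:
            conn.execute(
                "INSERT INTO article_labels (article_id, label_id) VALUES (?, ?)",
                (article[0], label_row[0]),
            )
    except sqlite3.IntegrityError:
        # The same (article, label) pair was already inserted.
        return "could not assign label to article"
    return "ok"


outcomes = [
    assign_label(conn, "Example Article", 2016, "example-label"),
    assign_label(conn, "Example Article", 2016, "example-label"),
    assign_label(conn, "Missing Article", 2016, "example-label"),
]
print(outcomes)
```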

Contributors

bbengfort, looselycoupled
