type_prediction

Predicting the type of a field depending on its display name

The objective of this project is: given a data set of fields with their display names and types, predict the type of a field from its display name.
There are 4 files, one for each of the 4 main stages of the project.

DataSet.py

This file defines a new class called DataSet, which lets us extract data from a CSV file, clean it according to what we want to do with it, and plot some indicators that help us plan the next steps.
I used it to clean the data set and plot the 10 most common types and display names. This made me realise that the names "Folder" and "Tags" were overrepresented, and we decided not to take them into account for the predictions (they are generated automatically, so there is no need to predict their types). Plotting the data also showed that many custom types were used, but each one only a few times. Since these types are registered directly by the customer, we decided to exclude them as well: we cannot predict a type that does not exist yet.
This is why we finally decided to keep only these types: ['STRING', 'TEXT', 'PERSON', 'DATE', 'INTEGER', 'BOOLEAN', 'DECIMAL', 'DATETIME', 'URL'], which gives us a data set of 24,000 rows after all cleaning.
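The filtering described above can be sketched in a few lines of plain Python. This is only an illustration, not the DataSet class itself: the sample rows, the type assigned to "Folder" and "Tags", and the names clean, KEPT_TYPES and IGNORED_NAMES are all hypothetical.

```python
from collections import Counter

# The nine types kept for prediction (taken from the list above).
KEPT_TYPES = {"STRING", "TEXT", "PERSON", "DATE", "INTEGER",
              "BOOLEAN", "DECIMAL", "DATETIME", "URL"}
# Automatically generated display names that are left out.
IGNORED_NAMES = {"Folder", "Tags"}


def clean(rows):
    """Keep (display_name, type) pairs with a useful name and a kept type."""
    return [(name, field_type) for name, field_type in rows
            if name not in IGNORED_NAMES and field_type in KEPT_TYPES]


# Hypothetical sample rows, for illustration only.
rows = [
    ("Folder", "STRING"),
    ("Tags", "STRING"),
    ("Closing date", "DATE"),
    ("Owner", "PERSON"),
    ("Start date", "DATE"),
    ("Priority", "CUSTOM_PRIORITY"),  # custom type, dropped
]

cleaned = clean(rows)
# Count the 10 most common types among the remaining rows.
type_counts = Counter(field_type for _, field_type in cleaned).most_common(10)
print(type_counts)
```

The same Counter call applied to the display names gives the most common names, which is how the overrepresented "Folder" and "Tags" entries would show up.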
[Figure: the 10 most common types after removing "Tags" and "Folder"]

vectorization_Doc2Vec.py

After formatting the data as described above, we end up with a matrix containing all our display names in its first column and the associated types in the second. We want to classify the display names and would therefore like to use a multilayer perceptron. However, its input must be vectors of floats, not strings. Consequently, we need to convert our display names to vectors, and this is the tricky part.
A first implementation would be the "bag of words" method, which converts each display name to a vector by adding the vectors associated with each of its words. However, this method does not take into account the similarity between names: for example, the vectors associated with "closing date" and "sale's date" could be very different from one another, whereas we would like them to be close, as both should end up with a "Date" type.
Therefore, we decided to use the Doc2Vec model from gensim, which takes all our inputs, builds a dictionary from the documents (here the display names) we give it, and associates a better-suited vector with each display name.
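To make the limitation concrete, here is a minimal count-based bag-of-words sketch. It assumes lowercase whitespace tokenization and one dimension per distinct word, which may differ from the variant considered in the project; the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(name):
    """Count each lowercase, whitespace-separated token of a display name."""
    return Counter(name.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)  # missing tokens count as 0
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Only the exactly shared token "date" contributes to the similarity.
name_similarity = cosine(bag_of_words("closing date"), bag_of_words("sale's date"))
# Two different words are orthogonal, however related their meanings are.
word_similarity = cosine(bag_of_words("closing"), bag_of_words("sale's"))
print(round(name_similarity, 2), word_similarity)
```

In this representation, two names are close only when they share identical tokens; nothing is learned about which words tend to play the same role.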

classification.py

In this part, we implement the multilayer perceptron, which, once trained, predicts the types as output. First, we split the data into a training set, which amounts to 80% of our original set, and a test set made of the remaining 20%, which helps us monitor the accuracy of our model during the optimization process. Then, the train_class function implements the architecture of our neural network and trains it on the training set before returning the trained model.
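The 80/20 split can be sketched with the standard library alone. The function name split_train_test, the fixed seed and the placeholder samples are assumptions made for this example; classification.py may split the data differently.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (display_name, type) samples.
samples = [("name %d" % i, "STRING") for i in range(10)]
train, test = split_train_test(samples)
print(len(train), len(test))
```

Shuffling before splitting keeps the test set from being dominated by whatever order the rows had in the original CSV.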

optimization.py

To do...

Conclusion

In the end, these few lines of code (much more in reality, as I used a lot of built-in functions from Keras and Gensim) result in an accuracy of 0.76 on the test set extracted earlier. This was achieved with a (500(0.2)-500(0.2)-50(0.2)-9) MLP, the "Nadam" optimizer, the "lecun_uniform" initializer and Doc2Vec output vectors of size 40. This performance is considered acceptable, as our data set is composed of display names from a wide range of languages (English, Russian, French, Chinese...) and companies, and a display name may not have the same meaning from one company to another. To improve this performance, we could first use other inputs such as the domain, the library id... We could also use evolutionary algorithms such as ABC (Artificial Bee Colony) to optimize the learning parameters we chose, change the MLP architecture, or even choose a model other than Doc2Vec to vectorize our display names.
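For reference, one possible reading of the (500(0.2)-500(0.2)-50(0.2)-9) notation is three hidden layers of 500, 500 and 50 units, each followed by dropout with rate 0.2, and a 9-way output layer. The Keras sketch below follows that reading. The dropout interpretation, the ReLU and softmax activations, the loss and the name build_mlp are assumptions, not details confirmed by this README.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_mlp(input_size=40, n_types=9, dropout=0.2):
    """Build an MLP following one reading of 500(0.2)-500(0.2)-50(0.2)-9."""
    init = "lecun_uniform"  # initializer named in the conclusion
    model = keras.Sequential([
        keras.Input(shape=(input_size,)),  # Doc2Vec vectors of size 40
        layers.Dense(500, activation="relu", kernel_initializer=init),
        layers.Dropout(dropout),  # assumed meaning of "(0.2)"
        layers.Dense(500, activation="relu", kernel_initializer=init),
        layers.Dropout(dropout),
        layers.Dense(50, activation="relu", kernel_initializer=init),
        layers.Dropout(dropout),
        # One output per kept type; softmax is an assumption.
        layers.Dense(n_types, activation="softmax", kernel_initializer=init),
    ])
    # "nadam" is the optimizer named above; the loss is an assumption
    # that expects one-hot encoded type labels.
    model.compile(optimizer="nadam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_mlp()
model.summary()
```

Under these assumptions the network has a little under 300,000 trainable parameters, most of them in the second 500-unit layer.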
