so_tag_predictor's Introduction

Stack Overflow: Tag Prediction

1. Business Problem

1.1 Description

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.

Stack Overflow is something every programmer uses in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML.

Problem Statement

Suggest tags for a question based on the content of the question as posted on Stack Overflow.

Source: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/

1.2 Source / useful links

Data Source : https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data

Youtube : https://youtu.be/nNDqbUhtIRg

Research paper : https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf

Research paper : https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL

1.3 Real World / Business Objectives and Constraints

Predict as many tags as possible with high precision and recall.

Incorrect tags could impact customer experience on StackOverflow.

No strict latency constraints.

2. Machine Learning problem

2.1 Data

2.1.1 Data Overview

Refer: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data

All of the data is in 2 files: Train and Test.

Train.csv contains 4 columns: Id,Title,Body,Tags.

Test.csv contains the same columns but without the Tags, which you are to predict.

Size of Train.csv - 6.75GB

Size of Test.csv - 2GB

Number of rows in Train.csv = 6034195

The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (such as closed questions).

Data Field Explanation

Dataset contains 6,034,195 rows. The columns in the table are:

Id - Unique identifier for each question

Title - The question's title

Body - The body of the question

Tags - The tags associated with the question in a space-separated format (all lowercase, should not contain tabs '\t' or ampersands '&')
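As a minimal sketch of how the Tags column might be parsed, assuming the file is read with Python's standard csv module: the three sample rows and the helper name `count_tags` are made up for illustration and are not taken from the dataset. Rows are streamed one at a time, which matters for a file of several gigabytes.

```python
import csv
import io
from collections import Counter

# Tiny made-up stand-in for Train.csv; the real file has the same four columns.
SAMPLE_CSV = '''Id,Title,Body,Tags
1,"How to read a file in C?","<p>fopen fails</p>","c file-io"
2,"Pointer arithmetic","<p>What does p+1 mean?</p>","c pointers"
3,"Sort a list","<p>Using sorted()</p>","python sorting list"
'''

def count_tags(csv_file):
    """Stream rows one at a time and count how often each tag occurs."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Tags are space-separated, e.g. "c file-io" -> ["c", "file-io"]
        counts.update(row["Tags"].split())
    return counts

tag_counts = count_tags(io.StringIO(SAMPLE_CSV))
print(tag_counts.most_common(2))
```

For the real data, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle on Train.csv.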

2.1.2 Example Data point

Title: Implementing Boundary Value Analysis of Software Testing in a C++ program?

Body :

    #include<
    iostream>\n
    #include<
    stdlib.h>\n\n
    using namespace std;\n\n
    int main()\n
    {\n
             int n,a[n],x,c,u[n],m[n],e[n][4];\n         
             cout<<"Enter the number of variables";\n         cin>>n;\n\n         
             cout<<"Enter the Lower, and Upper Limits of the variables";\n         
             for(int y=1; y<n+1; y++)\n         
             {\n                 
                cin>>m[y];\n                 
                cin>>u[y];\n         
             }\n         
             for(x=1; x<n+1; x++)\n         
             {\n                 
                a[x] = (m[x] + u[x])/2;\n         
             }\n         
             c=(n*4)-4;\n         
             for(int a1=1; a1<n+1; a1++)\n         
             {\n\n             
                e[a1][0] = m[a1];\n             
                e[a1][1] = m[a1]+1;\n             
                e[a1][2] = u[a1]-1;\n             
                e[a1][3] = u[a1];\n         
             }\n         
             for(int i=1; i<n+1; i++)\n         
             {\n            
                for(int l=1; l<=i; l++)\n            
                {\n                 
                    if(l!=1)\n                 
                    {\n                    
                        cout<<a[l]<<"\\t";\n                 
                    }\n            
                }\n            
                for(int j=0; j<4; j++)\n            
                {\n                
                    cout<<e[i][j];\n                
                    for(int k=0; k<n-(i+1); k++)\n                
                    {\n                    
                        cout<<a[k]<<"\\t";\n               
                    }\n                
                    cout<<"\\n";\n            
                }\n        
             }    \n\n        
             system("PAUSE");\n        
             return 0;    \n
    }\n

\n\n

The answer should come in the form of a table like

\n\n

    1            50              50\n       
    2            50              50\n       
    99           50              50\n       
    100          50              50\n       
    50           1               50\n       
    50           2               50\n       
    50           99              50\n       
    50           100             50\n       
    50           50              1\n       
    50           50              2\n       
    50           50              99\n       
    50           50              100\n

\n\n

if the no of inputs is 3 and their ranges are\n 1,100\n 1,100\n 1,100\n (could be varied too)

\n\n

The output is not coming,can anyone correct the code or tell me what's wrong?

\n'

Tags : 'c++ c'

2.2 Mapping the real-world problem to a Machine Learning Problem

2.2.1 Type of Machine Learning Problem

It is a multi-label classification problem.

Multi-label Classification: Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as topics that are relevant for a document. A question on Stack Overflow might be about any of C, Pointers, FileIO and/or memory-management at the same time, or none of these.

Credit: http://scikit-learn.org/stable/modules/multiclass.html
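One common way to represent such label sets is a 0/1 indicator matrix with one row per question and one column per tag. The sketch below assumes that representation; the helper name `binarize_tags` and the three tag strings are hypothetical examples, not part of the project.

```python
def binarize_tags(tag_strings):
    """Turn space-separated tag strings into a 0/1 indicator matrix."""
    tag_sets = [set(s.split()) for s in tag_strings]
    # Column order: all distinct tags, sorted alphabetically.
    vocabulary = sorted(set().union(*tag_sets))
    matrix = [[1 if tag in tags else 0 for tag in vocabulary] for tags in tag_sets]
    return vocabulary, matrix

# Three made-up questions; the last one carries none of the tags.
vocabulary, y = binarize_tags(["c pointers", "c file-io memory-management", ""])
print(vocabulary)
for row in y:
    print(row)
```

Each column can then be treated as its own yes/no target, which is what makes the labels non-exclusive: a row may contain several ones or none at all.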

2.2.2 Performance metric

Micro-Averaged F1-Score (Mean F Score) : The F1 score can be interpreted as the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, the per-class results are combined into a single score, and the result depends on the averaging scheme:

'Micro f1 score': Calculate metrics globally by counting the total true positives, false negatives and false positives. This is a better metric when we have class imbalance.

'Macro f1 score': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
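The difference between the two averages can be illustrated with a small hand-rolled computation on indicator matrices (rows are questions, columns are tags). This is a sketch, not the project's evaluation code: the function names and the toy data are made up, and returning 0.0 when a label has no true or predicted positives is an assumed convention.

```python
def f1(tp, fp, fn):
    """F1 = 2 * (precision * recall) / (precision + recall), written in counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention: 0.0 when undefined

def micro_macro_f1(y_true, y_pred):
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # per-label [tp, fp, fn]
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro

# Column 0 is a frequent tag, column 1 a rare tag that the model never predicts.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

In this toy case the missed rare tag costs the micro average a single false negative out of five true labels, while it pulls the macro average down to the mean of a perfect and a zero per-label score.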

https://www.kaggle.com/wiki/MeanFScore

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Hamming loss : The Hamming loss is the fraction of labels that are incorrectly predicted. https://www.kaggle.com/wiki/HammingLoss
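Read as a fraction over every (question, tag) cell of the indicator matrix, the Hamming loss can be sketched as follows. The function name and the toy matrices are illustrative assumptions.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (question, label) cells where prediction and truth disagree."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
    )
    return wrong / total

# Two made-up questions, three candidate tags each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 0, 0]]
loss = hamming_loss(y_true, y_pred)
print(loss)
```

Here one extra tag and one missed tag give two wrong cells out of six. Unlike F1, a lower value is better, and 0 means every label was predicted correctly.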

so_tag_predictor's People

Contributors

krrish3398
