
Python library that classifies content from scientific papers with the topics of the Computer Science Ontology (CSO).

Home Page: https://cso.kmi.open.ac.uk

License: Apache License 2.0

ontology-matching semantic-web text-processing text-classification cso-classifier computer-science-ontology cso-ontology word2vec-model cso-topics


CSO-Classifier


Abstract

Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this repository, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.

Read more: https://skm.kmi.open.ac.uk/cso-classifier/

About

The CSO Classifier is a novel application that takes as input the text from the abstract, title, and keywords of a research paper and outputs a list of relevant concepts from CSO. It consists of three main components: (i) the syntactic module, (ii) the semantic module and (iii) the post-processing module. Figure 1 depicts its architecture. The syntactic module parses the input documents and identifies CSO concepts that are explicitly referred to in the document. The semantic module uses part-of-speech tagging to identify promising terms and then exploits word embeddings to infer semantically related topics. Finally, the post-processing module combines the results of these two modules, removes outliers, and enhances them by including relevant super-areas.

Figure 1: Framework of the CSO Classifier

Getting started

Installation using PIP

  1. Ensure you have Python 3.6, 3.7, or 3.8. You can download it from here. You may also want to use a virtual environment; here is how to create and activate one.
  2. Use pip to install the classifier: pip install cso-classifier
  3. Set up the classifier: go to the Setup section to finalise the installation.

Installation using Github

  1. Ensure you have Python 3.6, 3.7, or 3.8. You can download it from here. You may also want to use a virtual environment; here is how to create and activate one.
  2. Download this repository using: git clone https://github.com/angelosalatino/cso-classifier.git
  3. Install the package by running the following command: pip install ./cso-classifier
  4. Set up the classifier: go to the Setup section to finalise the installation.

Troubleshooting

Although we have worked hard to fix many issues that emerged during the testing phase, some may still arise for reasons beyond our control. Here is a list of the common issues we have encountered.

Unable to install requirements

Most likely this issue is due to the version of pip you are currently using. Make sure to update to the latest version of pip: pip install --upgrade pip.

Unable to install python-Levenshtein

Many users have had difficulties installing the python-Levenshtein library on some Linux servers. One way to get around this issue is to install the python3-devel package. You might need sudo rights on the hosting machine.

-- Special thanks to Panagiotis Mavridis for suggesting the solution.

"python setup.py egg_info" failed

More specifically: Command "python setup.py egg_info" failed with error code 1. This error originates in the setup.py file and occurs rarely. If you are experiencing it, please get in touch with us and we will work with you to fix it.

Setup

After installing the CSO Classifier, you need to set it up with the right dependencies. To do so, please run the following code:

from cso_classifier import CSOClassifier as cc
cc.setup()
exit() # it is important to close the current console, to make those changes effective

This function downloads the English package of spaCy, which is equivalent to running python -m spacy download en_core_web_sm. It then downloads the latest version of the Computer Science Ontology and the latest version of the word2vec model, which are used across all modules.

Update

This functionality allows you to update both the ontology and the word2vec model.

from cso_classifier import CSOClassifier as cc
cc.update()

#or
cc.update(force = True)

Running update() without parameters makes the system check the version of the ontology/model currently in use against the latest available version, and the update is performed if either or both are outdated. With update(force = True), instead, the system forces the update by deleting the ontology/model currently in use and downloading the latest version.

Version

This functionality returns the versions of the CSO Classifier and the CSO ontology you are currently using. It also checks online whether newer versions are available, for both of them, and suggests how to update.

from cso_classifier import CSOClassifier as cc
cc.version()

Instead, if you just want to know the package version use:

import cso_classifier
print(cso_classifier.__version__)

Test

This functionality allows you to test whether the classifier has been properly installed.

import cso_classifier as test
test.test_classifier_single_paper() # to test it with one paper
test.test_classifier_batch_mode() # to test it with multiple papers

To verify that the classifier has been installed successfully, both test_classifier_single_paper() and test_classifier_batch_mode() print the paper(s) information together with the result of their classification.

Usage examples

In this section, we explain how to run the CSO Classifier to classify a single paper or multiple papers (batch mode).

Classifying a single paper (SP)

Sample Input (SP)

The sample input can be either a dictionary containing title, abstract and keywords as keys, or a string:

paper = {
        "title": "De-anonymizing Social Networks",
        "abstract": "Operators of online social networks are increasingly sharing potentially "
            "sensitive information about users and their relationships with advertisers, application "
            "developers, and data-mining researchers. Privacy is typically protected by anonymization, "
            "i.e., removing names, addresses, etc. We present a framework for analyzing privacy and "
            "anonymity in social networks and develop a new re-identification algorithm targeting "
            "anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, "
            "we show that a third of the users who can be verified to have accounts on both Twitter, a "
            "popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified "
            "in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is "
            "based purely on the network topology, does not require creation of a large number of dummy "
            "\"sybil\" nodes, is robust to noise and all existing defenses, and works even when the overlap "
            "between the target network and the adversary's auxiliary information is small.",
        "keywords": "data mining, data privacy, graph theory, social networking (online)"
        }

#or

paper = """De-anonymizing Social Networks
Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc. We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy "sybil" nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary's auxiliary information is small.
data mining, data privacy, graph theory, social networking (online)"""

If the input variable is a dictionary, the classifier reads only the fields title, abstract and keywords. There is no need to fill in all three of them: for instance, if you do not have keywords, you can provide just the title and abstract, as in the sketch below.
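
A minimal sketch of an input without keywords (the field is simply omitted; the abstract is shortened here for brevity):

paper = {
        "title": "De-anonymizing Social Networks",
        "abstract": "Operators of online social networks are increasingly sharing potentially "
            "sensitive information about users and their relationships with advertisers."
        }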

Run (SP)

Just import the classifier and run it:

from cso_classifier import CSOClassifier
cc = CSOClassifier(modules = "both", enhancement = "first", explanation = True)
result = cc.run(paper)
print(result)

To observe the available settings please refer to the Parameters section.

If you have more than one paper to classify, you can use the following example:

from cso_classifier import CSOClassifier
cc = CSOClassifier(modules = "both", enhancement = "first", explanation = True)
results = list()
for paper in papers:
  results.append(cc.run(paper))
print(results)

Even if you are running multiple classifications, the current implementation of the CSO Classifier will load the CSO and the model only once, saving computational time.

Sample Output (SP)

As output, the classifier returns a dictionary with five components: (i) syntactic, (ii) semantic, (iii) union, (iv) enhanced, and (v) explanation. The latter field is available only if the explanation flag is set to True.

Below you can find an example. The keys syntactic and semantic contain the topics returned by the syntactic and semantic module, respectively. The union key contains the unique topics found by the previous two modules, while enhanced lists the relevant super-areas. Please be aware that the results may change according to the version of the Computer Science Ontology.

{
   "syntactic":[
      "network topology",
      "online social networks",
      "real-world networks",
      "anonymization",
      "privacy",
      "social networks",
      "data privacy",
      "graph theory",
      "data mining",
      "sensitive informations",
      "anonymity",
      "micro-blog",
      "twitter"
   ],
   "semantic":[
      "network topology",
      "online social networks",
      "topology",
      "data privacy",
      "social networks",
      "privacy",
      "anonymization",
      "graph theory",
      "data mining",
      "anonymity",
      "micro-blog",
      "twitter"
   ],
   "union":[
      "network topology",
      "online social networks",
      "topology",
      "real-world networks",
      "anonymization",
      "privacy",
      "social networks",
      "data privacy",
      "graph theory",
      "data mining",
      "sensitive informations",
      "anonymity",
      "micro-blog",
      "twitter"
   ],
   "enhanced":[
      "computer networks",
      "online systems",
      "complex networks",
      "privacy preserving",
      "computer security",
      "world wide web",
      "theoretical computer science",
      "computer science",
      "access control",
      "network security",
      "authentication",
      "social media"
   ],
   "explanation":{
		"social networks": ["social network", "online social networks", "microblogging service", "real-world networks", "social networks", "microblogging", "social networking", "twitter graph", "anonymous twitter", "twitter"],
		"online social networks": ["online social networks", "social network", "social networks"],
		"sensitive informations": ["sensitive information"],
		"privacy": ["sensitive information", "anonymity", "anonymous", "data privacy", "privacy"],
		"anonymization": ["anonymization"],
		"anonymity": ["anonymity", "anonymous"],
		"real-world networks": ["real-world networks"],
		"twitter": ["twitter graph", "twitter", "microblogging service", "anonymous twitter", "microblogging"],
		"micro-blog": ["twitter graph", "twitter", "microblogging service", "anonymous twitter", "microblogging"],
		"network topology": ["topology", "network topology"],
		"data mining": ["data mining", "mining"],
		"data privacy": ["data privacy", "privacy"],
		"graph theory": ["graph theory"],
		"topology": ["topology", "network topology"],
		"computer networks": ["topology", "network topology"],
		"online systems": ["online social networks", "social network", "social networks"],
		"complex networks": ["real-world networks"],
		"privacy preserving": ["anonymization"],
		"computer security": ["anonymity", "data privacy", "privacy"],
		"world wide web": ["social network", "online social networks", "microblogging service", "real-world networks", "social networks", "microblogging", "social networking", "twitter graph", "anonymous twitter", "twitter"],
		"theoretical computer science": ["graph theory"],
		"computer science": ["data mining", "mining"],
		"access control": ["sensitive information"],
		"network security": ["anonymity", "sensitive information", "anonymous"],
		"authentication": ["anonymity", "anonymous"],
		"social media": ["microblogging service", "microblogging", "twitter graph", "anonymous twitter", "twitter"]
	}
}

Classifying in batch mode (BM)

Sample Input (BM)

The sample input is a dictionary of papers. Each key is an identifier (example id1, see below) and its value is either a dictionary containing title, abstract and keywords as keys, or a string, as shown for Classifying a single paper (SP).

papers = {
    "id1": {
        "title": "De-anonymizing Social Networks",
        "abstract": "Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc. We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy \"sybil\" nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary's auxiliary information is small.",
        "keywords": "data mining, data privacy, graph theory, social networking (online)"
    },
    "id2": {
        "title": "Title of sample paper id2",
        "abstract": "Abstract of sample paper id2",
        "keywords": "keyword1, keyword2, ..., keywordN"
    }
}

Run (BM)

Import the classifier and run it in batch mode:

from cso_classifier import CSOClassifier
cc = CSOClassifier(workers = 1, modules = "both", enhancement = "first", explanation = True)
result = cc.batch_run(papers)
print(result)

To observe the available settings please refer to the Parameters section.

Sample Output (BM)

As output the classifier returns a dictionary of dictionaries. For each classified paper (identified by their id), it returns a dictionary containing five components: (i) syntactic, (ii) semantic, (iii) union, (iv) enhanced, and (v) explanation. The latter field is available only if the explanation flag is set to True.

Below you can find an example. The keys syntactic and semantic contain the topics returned by the syntactic and semantic module, respectively. The union key contains the unique topics found by the previous two modules, while enhanced lists the relevant super-areas. In explanation, you can find all chunks of text that allowed the classifier to infer a given topic. Please be aware that the results may change according to the version of the Computer Science Ontology.

{
    "id1": {
	"syntactic": ["network topology", "online social networks", "real-world networks", "anonymization", "privacy", "social networks", "data privacy", "graph theory", "data mining", "sensitive informations", "anonymity", "micro-blog", "twitter"],
	"semantic": ["network topology", "online social networks", "topology", "data privacy", "social networks", "privacy", "anonymization", "graph theory", "data mining", "anonymity", "micro-blog", "twitter"],
	"union": ["network topology", "online social networks", "topology", "real-world networks", "anonymization", "privacy", "social networks", "data privacy", "graph theory", "data mining", "sensitive informations", "anonymity", "micro-blog", "twitter"],
	"enhanced": ["computer networks", "online systems", "complex networks", "privacy preserving", "computer security", "world wide web", "theoretical computer science", "computer science", "access control", "network security", "authentication", "social media"],
	"explanation": {
		"social networks": ["social network", "online social networks", "microblogging service", "real-world networks", "social networks", "microblogging", "social networking", "twitter graph", "anonymous twitter", "twitter"],
		"online social networks": ["online social networks", "social network", "social networks"],
		"sensitive informations": ["sensitive information"],
		"privacy": ["sensitive information", "anonymity", "anonymous", "data privacy", "privacy"],
		"anonymization": ["anonymization"],
		"anonymity": ["anonymity", "anonymous"],
		"real-world networks": ["real-world networks"],
		"twitter": ["twitter graph", "twitter", "microblogging service", "anonymous twitter", "microblogging"],
		"micro-blog": ["twitter graph", "twitter", "microblogging service", "anonymous twitter", "microblogging"],
		"network topology": ["topology", "network topology"],
		"data mining": ["data mining", "mining"],
		"data privacy": ["data privacy", "privacy"],
		"graph theory": ["graph theory"],
		"topology": ["topology", "network topology"],
		"computer networks": ["topology", "network topology"],
		"online systems": ["online social networks", "social network", "social networks"],
		"complex networks": ["real-world networks"],
		"privacy preserving": ["anonymization"],
		"computer security": ["anonymity", "data privacy", "privacy"],
		"world wide web": ["social network", "online social networks", "microblogging service", "real-world networks", "social networks", "microblogging", "social networking", "twitter graph", "anonymous twitter", "twitter"],
		"theoretical computer science": ["graph theory"],
		"computer science": ["data mining", "mining"],
		"access control": ["sensitive information"],
		"network security": ["anonymity", "sensitive information", "anonymous"],
		"authentication": ["anonymity", "anonymous"],
		"social media": ["microblogging service", "microblogging", "twitter graph", "anonymous twitter", "twitter"]
	    }
    },
    "id2": {
        "syntactic": [...],
        "semantic": [...],
        "union": [...],
        "enhanced": [...],
        "explanation": {...}
    }
}

Parameters

Besides the paper(s), the function running the CSO Classifier accepts seven additional parameters: (i) workers, (ii) modules, (iii) enhancement, (iv) explanation, (v) delete_outliers, (vi) fast_classification, and (vii) silent. There is no particular order in which to specify these parameters. Here we explain their usage. The workers parameter is an integer (equal to or greater than 1), modules and enhancement are strings that define a particular behaviour for the classifier, and explanation, delete_outliers, fast_classification, and silent are booleans.

(i) The parameter workers defines the number of threads to run for classifying the input corpus. For instance, if workers = 4, there will be 4 instances of the CSO Classifier, each one receiving a chunk (equally split) of the corpus to process. Once all processes are completed, the results will be aggregated and returned. The default value for workers is 1. This parameter is available only when running the classifier in batch mode.
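
For example, a minimal sketch of a batch-mode run with four workers (assuming papers is a dictionary of papers, as described in the batch mode section below):

from cso_classifier import CSOClassifier
cc = CSOClassifier(workers = 4, modules = "both", enhancement = "first")
results = cc.batch_run(papers) # each worker classifies a chunk of the corpus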

(ii) The parameter modules can be either "syntactic", "semantic", or "both". Using the value "syntactic", the classifier will run only the syntactic module. Using the "semantic" value, instead, the classifier will use only the semantic module. Finally, using "both", the classifier will run both syntactic and semantic modules and combine their results. The default value for modules is both.

(iii) The parameter enhancement can be either "first", "all", or "no". This parameter controls whether the classifier will try to infer, given a topic (e.g., Linked Data), only the direct super-topics (e.g., Semantic Web) or all its super-topics (e.g., Semantic Web, WWW, Computer Science). Using "first" as a value will infer only the direct super topics. Instead, if using "all", the classifier will infer all its super-topics. Using "no" the classifier will not perform any enhancement. The default value for enhancement is first.

(iv) The parameter explanation can be either True or False. This parameter defines whether the classifier should return an explanation. This explanation consists of chunks of text, coming from the input paper, that allowed the classifier to return a given topic. This supports the user in better understanding why a certain topic has been inferred. The classifier will return an explanation for all topics, even for the enhanced ones. In this case, it will join all the text chunks of all its sub-topics. The default value for explanation is False.

(v) The parameter delete_outliers can be either True or False. This parameter controls whether to run the outlier detection component within the post-processing module. This component improves the results by removing erroneous topics that are conceptually distant from the others. Because this computation takes extra time, users might experience slowdowns; they can therefore choose between good results with low computational time, or improved results with slower computation. The default value for delete_outliers is True.

(vi) The parameter fast_classification can be either True or False. This parameter determines whether the semantic module should use the full model or the cached one. Using the full model provides slightly better results than the cached one. However, using the cached model is more than 15x faster. Read here for more details about these two models. The default value for fast_classification is True.

(vii) The parameter silent can be either True or False. This determines whether the classifier prints its progress in the console. If set to True, the classifier will be silent and will not print any output while classifying. The default value for silent is False.

#      Parameter             Single Paper   Batch Mode
i      workers               –              ✓
ii     modules               ✓              ✓
iii    enhancement           ✓              ✓
iv     explanation           ✓              ✓
v      delete_outliers       ✓              ✓
vi     fast_classification   ✓              ✓
vii    silent                ✓              ✓

Table 1: Parameter availability when using the CSO Classifier
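
For convenience, here is a sketch that spells out all seven parameters with their default values, as documented above; in practice you only need to pass the ones you want to change:

from cso_classifier import CSOClassifier
cc = CSOClassifier(
    workers = 1,                  # batch mode only: number of parallel processes
    modules = "both",             # "syntactic", "semantic", or "both"
    enhancement = "first",        # "first", "all", or "no"
    explanation = False,          # return the text chunks explaining each topic
    delete_outliers = True,       # run the outlier detection component
    fast_classification = True,   # use the cached word2vec model
    silent = False                # print progress while classifying
)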

Releases

Here we list the available releases for the CSO Classifier. These releases are available for download both from Github and Zenodo.

v3.1

This release brings two main changes. The first concerns the library (and the code) used to compute the Levenshtein similarity: we previously relied on python-Levenshtein, which required python3-devel, whereas this new version uses rapidfuzz, which is as fast as the previous library and much easier to install across systems. The second change is an updated list of dependencies: we updated some libraries, including igraph.

Download from:

DOI

v3.0

This release welcomes some improvements under the hood. In particular:

  • we refactored the code, reorganising scripts into more elegant classes
  • we added functionalities to automatically setup and update the classifier to the latest version of CSO
  • we added the explanation feature, which returns chunks of text that allowed the classifier to infer a given topic
  • the syntactic module now takes advantage of the spaCy POS tagger (previously used only by the semantic module)
  • the grammar for the chunk parser is now more robust: {<JJ.*>*<HYPH>*<JJ.*>*<HYPH>*<NN.*>*<HYPH>*<NN.*>+}

In addition, in the post-processing module, we added the outlier detection component. This component improves the accuracy of the result set by removing erroneous topics that are conceptually distant from the others. It is enabled by default and can be disabled by setting delete_outliers = False when calling the CSO Classifier (see Parameters), as sketched below.
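
A minimal sketch that disables the outlier detection component (assuming paper holds the input described in the usage examples):

from cso_classifier import CSOClassifier
cc = CSOClassifier(modules = "both", enhancement = "first", delete_outliers = False)
result = cc.run(paper)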

Please be aware that, having substantially restructured the code into classes, we also changed the way the classifier is run. Thus, if you are using a previous version of the classifier, we encourage you to update it (pip install -U cso-classifier) and modify your calls to the classifier accordingly. Read our usage examples.

We would like to thank James Dunham @jamesdunham from CSET (Georgetown University) for suggesting to us how to improve the code.

Download from:

DOI

v2.3.2

Version alignment with PyPI. Similar to version 2.3.1.

Download from:

DOI

v2.3.1

Bug fix. Added some exception handling. Notice: during the upload of this version to PyPI (the Python package index), we encountered some issues and cannot guarantee this version works properly. For this reason, we created a new release, v2.3.2; please use that one instead. Apologies for any inconvenience.

v2.3

This new release contains a bug fix and the latest version of the CSO ontology.

Bug fix: when running in batch mode, the classifier treated the keywords field as an array instead of a string, so instead of processing comma-separated keywords it processed each single character, hence inferring wrong topics. This has now been fixed. In addition, if the keywords field is actually an array, the classifier will first stringify it and then process it.

We also downloaded and packed the latest version of the CSO ontology.

Download from:

DOI

v2.2

In this version (release v2.2), we (i) updated the requirements needed to run the classifier, (ii) removed all unnecessary warnings, and (iii) enabled multiprocessing. In particular, we removed all useless requirements that were installed in development mode, by cleaning the requirements.txt file.

When processing certain research papers, the classifier displayed warnings raised by the kneed library. Since the classifier can automatically adapt to the conditions that trigger such warnings, we decided to hide them and prevent users from being concerned about this outcome.

This version of the classifier provides improved scalability through multiprocessing. Once the number of workers is set (i.e. num_workers >= 1), each worker will be given a copy of the CSO Classifier with a chunk of the corpus to process. Then, the results will be aggregated once all processes are completed. Please be aware that this function is only available in batch mode. See section Classifying in batch mode (BM) for more details.

Download from:

DOI

v2.1

This new release (version v2.1) makes the CSO Classifier more scalable. Compared to its previous version (v2.0), the classifier relies on a cached word2vec model which connects the words within the model vocabulary directly with the CSO topics. Thanks to this cache, the classifier is able to quickly retrieve all CSO topics that could be inferred by given tokens, speeding up the processing time. In addition, this cache is lighter (~64MB) than the actual word2vec model (~366MB), which saves additional time when loading the model.

Thanks to this improvement the CSO Classifier is around 24x faster and can be easily run on a large corpus of scholarly data.

Download from:

DOI

v2.0

The second version (v2.0) implements the CSO Classifier as described in the about section. It combines the topics of both the syntactic and semantic modules and enriches them with their super-topics. Compared to v1.0, it adds a semantic layer that allows generating a more comprehensive result, identifying research topics that are not explicitly available in the metadata. The semantic module relies on a Word2vec model trained on over 4.5M papers in Computer Science. Below we show more in detail how we trained such a model. In this version of the classifier, we pickled the model to speed up the process of loading into memory (~4.5 times faster).

Salatino, A.A., Osborne, F., Thanapalasingam, T., Motta, E.: The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles. In: TPDL 2019: 23rd International Conference on Theory and Practice of Digital Libraries. Springer. Read More

Download from:

DOI

v1.0

The first version (v1.0) of the CSO Classifier is an implementation of the syntactic module, which was also previously used to support the semi-automatic annotation of proceedings at Springer Nature [1]. This classifier syntactically matches n-grams (unigrams, bigrams and trigrams) of the input document with concepts within CSO.

More details about this version of the classifier can be found within:

Salatino, A.A., Thanapalasingam, T., Mannocci, A., Osborne, F. and Motta, E. 2018. Classifying Research Papers with the Computer Science Ontology. ISWC-P&D-Industry-BlueSky 2018 (2018). Read more

Download from:

DOI

List of Files

  • CSO-Classifier.ipynb: 📄 Python notebook for executing the classifier
  • CSO-Classifier.py: 📄 Python script for executing the classifier
  • images: 📁 folder containing some pictures, e.g., the workflow shown above
  • cso_classifier: 📁 Folder containing the main functionalities of the classifier
    • classifier.py: 📄 class that implements the CSO Classifier
    • syntacticmodule.py: 📄 class that implements the syntactic module
    • semanticmodule.py: 📄 class that implements the semantic module
    • postprocmodule.py: 📄 class that implements the post-processing module
    • paper.py: 📄 class that implements the functionalities to operate on papers, such as POS tagger, grammar-based chunk parser
    • result.py: 📄 class that implements the functionality to operate on the results
    • ontology.py: 📄 class that implements the functionalities to operate on the ontology: get primary label, get topics and so on
    • model.py: 📄 class that implements the functionalities to operate on the word2vec model: get similar words and so on
    • misc.py: 📄 some miscellaneous functionalities
    • test.py: 📄 some test functionalities
    • config.py: 📄 class that implements the functionalities to operate on the config file
    • config.ini: 📄 config file. It contains all information about the package, ontology and model.
    • assets: 📁 Folder containing the word2vec model and CSO
      • cso.csv: 📄 file containing the Computer Science Ontology in csv
      • cso.p: 📄 serialised file containing the Computer Science Ontology (pickled)
      • cso_graph.p: 📄 file containing the Computer Science Ontology as an iGraph object
      • model.p: 📄 the trained word2vec model (pickled)
      • token-to-cso-combined.json: 📄 file containing the cached word2vec model. This JSON file contains a dictionary in which each token of the corpus vocabulary has been mapped to the corresponding CSO topics. Below we explain how this file has been generated.

Word2vec model and token-to-cso-combined file generation

In this section, we describe how we generated the word2vec model used within the CSO Classifier and what the token-to-cso-combined file is.

Word Embedding generation

We applied the word2vec approach [2,3] to a collection of text from the Microsoft Academic Graph (MAG) for generating word embeddings. MAG is a scientific knowledge base and a heterogeneous graph containing scientific publication records, citation relationships, authors, institutions, journals, conferences, and fields of study. It is the largest dataset of scholarly data publicly available, and, as of April 2021, it contains more than 250 million publications.

We first downloaded titles and abstracts of 4,654,062 English papers in the field of Computer Science. Then we pre-processed the data by replacing spaces with underscores in all n-grams matching the CSO topic labels (e.g., “digital libraries” became “digital_libraries”) and in frequent bigrams and trigrams (e.g., “highest_accuracies”, “highly_cited_journals”). These frequent n-grams were identified by analysing combinations of words that co-occur together, as suggested in [2], using the parameters shown in Table 2. Indeed, while it is possible to obtain the vector of an n-gram by averaging the embedding vectors of all its words, the resulting representation is usually not as good as the one obtained by treating the n-gram as a single word during the training phase.

Finally, we trained the word2vec model using the parameters provided in Table 3. The parameters were set to these values after testing several combinations.

min-count   threshold
5           10

Table 2: Parameters used during the collocation word analysis
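
As an illustration, here is a minimal sketch of this collocation analysis with gensim's Phrases, assuming sentences is the tokenised corpus (a list of lists of tokens); the min-count and threshold values follow Table 2, although the exact pipeline we used may differ in its details:

from gensim.models.phrases import Phrases, Phraser

# sentences: tokenised corpus, e.g. [["digital", "libraries", "are", ...], ...]
bigram_model = Phrases(sentences, min_count=5, threshold=10)  # joins frequent bigrams with "_"
bigram = Phraser(bigram_model)
# a second pass over the bigrammed corpus also captures frequent trigrams
trigram_model = Phrases(bigram[sentences], min_count=5, threshold=10)
trigram = Phraser(trigram_model)
sentences_with_ngrams = [trigram[bigram[s]] for s in sentences]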

method     emb. size   window size   min count cutoff
skipgram   128         10            10

Table 3: Parameters used for training the word2vec model.
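
Below is a sketch of the corresponding training call with gensim (3.x API, where the embedding size parameter is called size; in gensim 4.x it is vector_size); sentences_with_ngrams stands for the pre-processed corpus described above:

from gensim.models import Word2Vec
import pickle

model = Word2Vec(sentences_with_ngrams,
                 sg=1,           # skip-gram
                 size=128,       # embedding size
                 window=10,      # window size
                 min_count=10)   # min count cutoff
word_vectors = model.wv          # gensim.models.keyedvectors.Word2VecKeyedVectors

# one possible way to serialise the vectors, as done from v2.0 onwards to speed up loading
with open("model.p", "wb") as f:
    pickle.dump(word_vectors, f)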

After training the model, we obtained a gensim.models.keyedvectors.Word2VecKeyedVectors object weighing 366MB. You can download the model from here.

The size of the model hindered the performance of the classifier in two ways. Firstly, it required several seconds to be loaded into memory. This was partially fixed by serialising the model file (using python pickle, see version v2.0 of CSO Classifier, ~4.5x faster). Secondly, while processing a document, the classifier needs to retrieve the top 10 similar words for all tokens, and compare them with CSO topics. In performing such an operation, the model would require several seconds, becoming a bottleneck for the classification process.

To this end, we decided to create a cached model (token-to-cso-combined.json), a dictionary that directly connects every token available in the vocabulary of the model with the CSO topics. This strategy allows the classifier to quickly retrieve all CSO topics that can be inferred from a particular token.

token-to-cso-combined file

To generate this file, we collected the set of words available within the vocabulary of the model. Then, iterating over each word, we retrieved its top 10 most similar words from the model and computed their Levenshtein similarity against all CSO topics. If the similarity was above 0.7, we created a record storing all CSO topics triggered by the initial word.
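
A hedged sketch of how such a cache could be built, assuming word_vectors is the trained Word2VecKeyedVectors object and cso_topics is the list of CSO topic labels (both are placeholder names); it uses rapidfuzz's normalised Levenshtein similarity with the 0.7 threshold mentioned above, and the exact record format of the shipped file may differ:

import json
from rapidfuzz.distance import Levenshtein

token_to_cso = {}
for token in word_vectors.vocab:                           # model vocabulary (gensim 3.x)
    similar_words = word_vectors.most_similar(token, topn=10)
    triggered = set()
    for word, _score in similar_words:
        for topic in cso_topics:
            if Levenshtein.normalized_similarity(word, topic) > 0.7:
                triggered.add(topic)
    if triggered:
        token_to_cso[token] = sorted(triggered)

with open("token-to-cso-combined.json", "w") as f:
    json.dump(token_to_cso, f)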

Use the CSO Classifier in other domains of Science

In order to use the CSO Classifier in other domains of Science, it is necessary to replace the two external sources mentioned in the previous section. In particular, you need a comprehensive ontology or taxonomy of research areas within the new domain, which acts as the controlled list of research topics, and you need to train a new word2vec model that fits the language and the semantics of the terms in that domain. We wrote a blog article on how to integrate knowledge from other fields of Science within the CSO Classifier.

Please read here for more info: How to use the CSO Classifier in other domains

How to Cite CSO Classifier

We kindly ask that any published research making use of the CSO Classifier cites our paper listed below:

Salatino, A.A., Osborne, F., Thanapalasingam, T., Motta, E.: The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles. In: TPDL 2019: 23rd International Conference on Theory and Practice of Digital Libraries. Springer.

License

Apache 2.0

References

[1] Osborne, F., Salatino, A., Birukou, A. and Motta, E. 2016. Automatic Classification of Springer Nature Proceedings with Smart Topic Miner. The Semantic Web -- ISWC 2016. 9982 LNCS, (2016), 383–399. DOI:https://doi.org/10.1007/978-3-319-46547-0_33

[2] Mikolov, T., Chen, K., Corrado, G. and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. (Jan. 2013).

[3] Mikolov, T., Chen, K., Corrado, G. and Dean, J. 2013. Distributed Representations of Words and Phrases and their Compositionality. Advances in neural information processing systems. 3111–3119.

cso-classifier's People

Contributors

andremann, angelosalatino, thiviyant


cso-classifier's Issues

Getting requirements to build wheel ... error >>> Error compiling Cython file: spacy/vocab.pxd:28:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.

pip3 install cso.classifier
Collecting cso.classifier
Obtaining dependency information for cso.classifier from https://files.pythonhosted.org/packages/e3/77/85c910382095e77bf6a431b3e3d9fb588c4db6d2f455c18dfd9dc620dbaf/cso_classifier-3.1-py3-none-any.whl.metadata
Using cached cso_classifier-3.1-py3-none-any.whl.metadata (43 kB)
Collecting gensim==3.8.3 (from cso.classifier)
Using cached gensim-3.8.3.tar.gz (23.4 MB)
Preparing metadata (setup.py) ... done
Collecting click==7.1.2 (from cso.classifier)
Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting hurry.filesize==0.9 (from cso.classifier)
Using cached hurry.filesize-0.9.tar.gz (2.8 kB)
Preparing metadata (setup.py) ... done
Collecting kneed==0.3.1 (from cso.classifier)
Using cached kneed-0.3.1.tar.gz (9.1 kB)
Preparing metadata (setup.py) ... done
Collecting nltk==3.6.2 (from cso.classifier)
Using cached nltk-3.6.2-py3-none-any.whl (1.5 MB)
Collecting rapidfuzz==2.11.1 (from cso.classifier)
Using cached rapidfuzz-2.11.1-cp38-cp38-macosx_11_0_arm64.whl (1.1 MB)
Requirement already satisfied: numpy>=1.19.5 in ./outdovenv/lib/python3.8/site-packages (from cso.classifier) (1.24.4)
Collecting requests==2.25.1 (from cso.classifier)
Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting spacy==3.0.5 (from cso.classifier)
Using cached spacy-3.0.5.tar.gz (7.0 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [164 lines of output]

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      int length
  
  
  cdef class Vocab:
      cdef Pool mem
      cpdef readonly StringStore strings
            ^
  ------------------------------------------------------------
  
  spacy/vocab.pxd:28:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  
  
  cdef class Vocab:
      cdef Pool mem
      cpdef readonly StringStore strings
      cpdef public Morphology morphology
            ^
  ------------------------------------------------------------
  
  spacy/vocab.pxd:29:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  
  cdef class Vocab:
      cdef Pool mem
      cpdef readonly StringStore strings
      cpdef public Morphology morphology
      cpdef public object vectors
            ^
  ------------------------------------------------------------
  
  spacy/vocab.pxd:30:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  cdef class Vocab:
      cdef Pool mem
      cpdef readonly StringStore strings
      cpdef public Morphology morphology
      cpdef public object vectors
      cpdef public object _lookups
            ^
  ------------------------------------------------------------
  
  spacy/vocab.pxd:31:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      cdef Pool mem
      cpdef readonly StringStore strings
      cpdef public Morphology morphology
      cpdef public object vectors
      cpdef public object _lookups
      cpdef public object writing_system
            ^
  ------------------------------------------------------------
  
  spacy/vocab.pxd:32:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      cpdef readonly StringStore strings
      cpdef public Morphology morphology
      cpdef public object vectors
      cpdef public object _lookups
      cpdef public object writing_system
      cpdef public object get_noun_chunks
            ^
  ------------------------------------------------------------
  
  spacy/vocab.pxd:33:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      cdef float prior_prob
  
  
  cdef class KnowledgeBase:
      cdef Pool mem
      cpdef readonly Vocab vocab
            ^
  ------------------------------------------------------------
  
  spacy/kb.pxd:31:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
  Copied /private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-install-tahcqqn5/spacy_ddb6de360bea4cdc9ba96d5305d51c66/setup.cfg -> /private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-install-tahcqqn5/spacy_ddb6de360bea4cdc9ba96d5305d51c66/spacy/tests/package
  Copied /private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-install-tahcqqn5/spacy_ddb6de360bea4cdc9ba96d5305d51c66/pyproject.toml -> /private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-install-tahcqqn5/spacy_ddb6de360bea4cdc9ba96d5305d51c66/spacy/tests/package
  Cythonizing sources
  Compiling spacy/training/example.pyx because it changed.
  Compiling spacy/parts_of_speech.pyx because it changed.
  Compiling spacy/strings.pyx because it changed.
  Compiling spacy/lexeme.pyx because it changed.
  Compiling spacy/vocab.pyx because it changed.
  Compiling spacy/attrs.pyx because it changed.
  Compiling spacy/kb.pyx because it changed.
  Compiling spacy/ml/parser_model.pyx because it changed.
  Compiling spacy/morphology.pyx because it changed.
  Compiling spacy/pipeline/dep_parser.pyx because it changed.
  Compiling spacy/pipeline/morphologizer.pyx because it changed.
  Compiling spacy/pipeline/multitask.pyx because it changed.
  Compiling spacy/pipeline/ner.pyx because it changed.
  Compiling spacy/pipeline/pipe.pyx because it changed.
  Compiling spacy/pipeline/trainable_pipe.pyx because it changed.
  Compiling spacy/pipeline/sentencizer.pyx because it changed.
  Compiling spacy/pipeline/senter.pyx because it changed.
  Compiling spacy/pipeline/tagger.pyx because it changed.
  Compiling spacy/pipeline/transition_parser.pyx because it changed.
  Compiling spacy/pipeline/_parser_internals/arc_eager.pyx because it changed.
  Compiling spacy/pipeline/_parser_internals/ner.pyx because it changed.
  Compiling spacy/pipeline/_parser_internals/nonproj.pyx because it changed.
  Compiling spacy/pipeline/_parser_internals/_state.pyx because it changed.
  Compiling spacy/pipeline/_parser_internals/stateclass.pyx because it changed.
  Compiling spacy/pipeline/_parser_internals/transition_system.pyx because it changed.
  Compiling spacy/pipeline/_parser_internals/_beam_utils.pyx because it changed.
  Compiling spacy/tokenizer.pyx because it changed.
  Compiling spacy/training/align.pyx because it changed.
  Compiling spacy/training/gold_io.pyx because it changed.
  Compiling spacy/tokens/doc.pyx because it changed.
  Compiling spacy/tokens/span.pyx because it changed.
  Compiling spacy/tokens/token.pyx because it changed.
  Compiling spacy/tokens/span_group.pyx because it changed.
  Compiling spacy/tokens/graph.pyx because it changed.
  Compiling spacy/tokens/morphanalysis.pyx because it changed.
  Compiling spacy/tokens/_retokenize.pyx because it changed.
  Compiling spacy/matcher/matcher.pyx because it changed.
  Compiling spacy/matcher/phrasematcher.pyx because it changed.
  Compiling spacy/matcher/dependencymatcher.pyx because it changed.
  Compiling spacy/symbols.pyx because it changed.
  Compiling spacy/vectors.pyx because it changed.
  [ 1/41] Cythonizing spacy/attrs.pyx
  [ 2/41] Cythonizing spacy/kb.pyx
  Traceback (most recent call last):
    File "/Users/maisonsirginou/Documents/outdo/codeOutdo/outdovenv/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
      main()
    File "/Users/maisonsirginou/Documents/outdo/codeOutdo/outdovenv/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/Users/maisonsirginou/Documents/outdo/codeOutdo/outdovenv/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-build-env-utncu2_x/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
    File "/private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-build-env-utncu2_x/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
      self.run_setup()
    File "/private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-build-env-utncu2_x/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in run_setup
      exec(code, locals())
    File "<string>", line 224, in <module>
    File "<string>", line 211, in setup_package
    File "/private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-build-env-utncu2_x/overlay/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1134, in cythonize
      cythonize_one(*args)
    File "/private/var/folders/xk/5rw7rdq566x_h9w02mgphnn80000gn/T/pip-build-env-utncu2_x/overlay/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1301, in cythonize_one
      raise CompileError(None, pyx_file)
  Cython.Compiler.Errors.CompileError: spacy/kb.pyx
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

I am using python 3.8.17 in a venv, MacOS
Cython version is 0.29.28
I have another spacy installed in the venv version 3.6.0

I tried pip install pip setuptools wheel Cython==0.29.28....
Same issue.

Thank you for your time and comments on how this can be solved

Error while generating package metadata/metadata-generation-failed

pip install cso_classifier
Collecting cso_classifier
Using cached cso_classifier-3.1-py3-none-any.whl (44 kB)
Collecting gensim==3.8.3 (from cso_classifier)
Using cached gensim-3.8.3.tar.gz (23.4 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [61 lines of output]
running dist_info
creating C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-modern-metadata-8wigd2ba\gensim.egg-info
writing C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-modern-metadata-8wigd2ba\gensim.egg-info\PKG-INFO
writing dependency_links to C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-modern-metadata-8wigd2ba\gensim.egg-info\dependency_links.txt
writing requirements to C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-modern-metadata-8wigd2ba\gensim.egg-info\requires.txt
writing top-level names to C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-modern-metadata-8wigd2ba\gensim.egg-info\top_level.txt
writing manifest file 'C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-modern-metadata-8wigd2ba\gensim.egg-info\SOURCES.txt'
Traceback (most recent call last):
File "d:\HP\skillmatch.venv\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 353, in
main()
File "d:\HP\skillmatch.venv\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "d:\HP\skillmatch.venv\Lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 149, in prepare_metadata_for_build_wheel
return hook(metadata_directory, config_settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\build_meta.py", line 380, in prepare_metadata_for_build_wheel
self.run_setup()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\build_meta.py", line 488, in run_setup
self).run_setup(setup_script=setup_script)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in run_setup
exec(code, locals())
File "", line 367, in
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_init_.py", line 107, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\core.py", line 185, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\command\dist_info.py", line 99, in run
self.egg_info.run()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\command\egg_info.py", line 314, in run
self.find_sources()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\command\egg_info.py", line 322, in find_sources
mm.run()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\command\egg_info.py", line 551, in run
self.add_defaults()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\command\egg_info.py", line 589, in add_defaults
sdist.add_defaults(self)
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools\command\sdist.py", line 104, in add_defaults
super().add_defaults()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\command\sdist.py", line 251, in add_defaults
self._add_defaults_ext()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\command\sdist.py", line 335, in _add_defaults_ext
build_ext = self.get_finalized_command('build_ext')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\cmd.py", line 305, in get_finalized_command
cmd_obj.ensure_finalized()
File "C:\Users\Hari Prasath M S\AppData\Local\Temp\pip-build-env-5gkugv0y\overlay\Lib\site-packages\setuptools_distutils\cmd.py", line 111, in ensure_finalized
self.finalize_options()
File "", line 111, in finalize_options
AttributeError: 'dict' object has no attribute 'NUMPY_SETUP'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

//////////////
Python 3.11.0

running issue with collections package

When running "test.test_classifier_single_paper()",

>>> import cso_classifier as test
>>> test.test_classifier_single_paper()

got the following messages:

De-anonymizing Social Networks
Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc. We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy "sybil" nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary's auxiliary information is small.
data mining, data privacy, graph theory, social networking (online)
Computer Science Ontology loaded.
Model loaded.
Traceback (most recent call last):
File "", line 1, in
File "[my-personal-path]python3.10/site-packages/cso_classifier/test.py", line 35, in test_classifier_single_paper
result = cso_classifier.run(paper)
File "[my-personal-path]/python3.10/site-packages/cso_classifier/classifier.py", line 80, in run
self.model = MODEL(use_full_model=self.use_full_model, silent = self.silent)
File "[my-personal-path]/python3.10/site-packages/cso_classifier/model.py", line 27, in _init_
self.load_models()
File "[my-personal-path]/python3.10/site-packages/cso_classifier/model.py", line 35, in load_models
self.__load_word2vec_model()
File "[my-personal-path]python3.10/site-packages/cso_classifier/model.py", line 167, in __load_word2vec_model
self.full_model = pickle.load(open(self.config.get_model_pickle_path(), "rb"))
File "[my-personal-path]/python3.10/site-packages/gensim/_init_.py", line 5, in
from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils # noqa:F401
File "[my-personal-path]/python3.10/site-packages/gensim/corpora/_init_.py", line 12, in
from .dictionary import Dictionary # noqa:F401
File "[my-personal-path]/python3.10/site-packages/gensim/corpora/dictionary.py", line 11, in
from collections import Mapping, defaultdict
ImportError: cannot import name 'Mapping' from 'collections' ([my-personal-path]lib/python3.10/collections/_init_.py)

and "from collections.abc import Mapping" works well. @angelosalatino

If I change [my-personal-path]/lib/python3.10/site-packages/gensim/corpora/dictionary.py and modify the import to "from collections.abc import Mapping", another import error occurs:

ImportError: cannot import name 'Iterable' from 'collections' ([my-personal-path]/lib/python3.10/collections/__init__.py)

gensim/models/fasttext.py also uses "Iterable"...

details about my environment
os: ubuntu 18.04.6 LTS
versions:

  • conda: 4.12.0
  • python: 3.10.4
  • cso-classifier: 3.0
  • cso ontology: 3.3

Error in installing dependencies

I got the following error when installing cso-classifier:

ERROR: Failed building wheel for spacy
Building wheel for thinc (pyproject.toml) ... done
Created wheel for thinc: filename=thinc-8.0.17-cp311-cp311-linux_x86_64.whl size=2509690 sha256=16782448708920aa8b8bec879f0d1e7042b97bf227bc860d6485e5399c018e80
Stored in directory: /tmp/pip-ephem-wheel-cache-kamc_79q/wheels/b0/1d/5e/36ed7863d65cb226891461dc8a53586a0a226429f943f0e868
Successfully built cso-classifier kneed thinc
Failed to build gensim spacy
ERROR: Could not build wheels for gensim, spacy, which is required to install pyproject.toml-based projects

Before this there is a long list of messages with stuff like this:
copying spacy/lang/sr/lemma_lookup_licence.txt -> build/lib.linux-x86_64-cpython-311/spacy/lang/sr
copying spacy/lang/hr/lemma_lookup_license.txt -> build/lib.linux-x86_64-cpython-311/spacy/lang/hr
running build_ext
building 'spacy.training.example' extension
creating build/temp.linux-x86_64-cpython-311
creating build/temp.linux-x86_64-cpython-311/spacy
creating build/temp.linux-x86_64-cpython-311/spacy/training
x86_64-linux-gnu-gcc -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/tmp/pip-build-env-sdiapmvh/overlay/local/lib/python3.11/dist-packages/numpy/core/include -I/usr/include/python3.11 -I/usr/include/python3.11 -c spacy/training/example.cpp -o build/temp.linux-x86_64-cpython-311/spacy/training/example.o -std=c++11 -O2 -Wno-strict-prototypes -Wno-unused-function
cc1plus: warning: command-line option ‘-Wno-strict-prototypes’ is valid for C/ObjC but not for C++
In file included from /tmp/pip-build-env-sdiapmvh/overlay/local/lib/python3.11/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1940,
from /tmp/pip-build-env-sdiapmvh/overlay/local/lib/python3.11/dist-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /tmp/pip-build-env-sdiapmvh/overlay/local/lib/python3.11/dist-packages/numpy/core/include/numpy/arrayobject.h:5,
from spacy/training/example.cpp:792:

I'm using Python 3.11, with Ubuntu 22.04

How do I use the klink-2 algorithm?

The cso-classifier is a very useful work.

I would like to ask, is there any code implementation that I can refer to for the algorithm klink-2, which is mentioned in your paper for automatic domain ontology generation, or any other way to use/trial it? I have read the paper on klink-2, but I can't reproduce the implementation process in the paper, so I hope you can give me some suggestions and help.

I am looking forward to hearing from you.
Kind regards.
Thanks!

installation error

Hello!

I tried to install your classifier and got the error: Failed building wheel for python-igraph
Stored in directory: c:\users\guest\appdata\local\pip\cache\wheels\7d\1d\2c\a4989f424c14d3f3bb5ab05a470275cf1d8f69857d81249b22
Building wheel for python-igraph (setup.py) ... error
error: subprocess-exited-with-error

And thereafter

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for python-Levenshtein
Running setup.py clean for python-Levenshtein
Building wheel for spacy (pyproject.toml) ... error
error: subprocess-exited-with-error

Can you help me, please, with my error.

Windows 11, python 3.10.6

Issue installing requirements

pip install -r requirements.txt gives me the following error message:

Collecting en-core-web-sm==2.0.0 (from -r requirements.txt (line 23))
  Could not find a version that satisfies the requirement en-core-web-sm==2.0.0 (from -r requirements.txt (line 23)) (from versions: )
No matching distribution found for en-core-web-sm==2.0.0 (from -r requirements.txt (line 23))

How to apply to a different ontology/domain?

Very useful and great work.

How do I use a different ontology from a different domain? I can replicate the format used in the current CS ontology, but what about the cached model? Is that a generalized model, or is it specific to the ontology? If the latter, how do I go about constructing one for a different ontology?

Many thanks and happy to share back the results of my work

EDIT:

Learning about word2vec... but would love to hear from you anyway, if you have any tips or instructions.

Thank you.
