MAG2RDF

MAG2RDF contains the code for generating the Microsoft Academic Knowledge Graph (MAKG) in RDF.

For more information, see The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data.

The main class is MAG2RDF.

Input

Generating the MAKG RDF data requires the data dump files of the Microsoft Academic Graph (MAG). Specifically, the following files are needed:

  • Affiliations.txt

  • Authors.txt

  • ConferenceInstances.txt

  • ConferenceSeries.txt

  • FieldOfStudyChildren.txt

  • RelatedFieldOfStudy.txt

  • FieldsOFStudy.txt

  • Journals.txt

  • PaperAbstractsInvertedIndex.txt

  • PaperAuthorAffiliations.txt

  • PaperCitationContexts.txt

  • PaperFieldsOfStudy.txt

  • PaperLanguages.txt

  • PaperReferences.txt

  • Papers.txt

  • PaperUrls.txt

To obtain these files, please follow the instructions at https://docs.microsoft.com/en-us/academic-services/graph/get-started-setup-provisioning.
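
Before starting a long conversion run, it can help to confirm that every dump file is in place. The following is a minimal sketch of such a check; the class `InputFileCheck` and its method `missingFiles` are hypothetical and not part of MAG2RDF, and the file names are copied from the list above (including PaperLanguages.txt, which the 2020-07-09 change notes report as no longer present in newer dumps).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of MAG2RDF: reports which dump files are missing.
final class InputFileCheck {

    // File names copied verbatim from the input list in this README.
    static final List<String> REQUIRED = Collections.unmodifiableList(Arrays.asList(
            "Affiliations.txt", "Authors.txt", "ConferenceInstances.txt",
            "ConferenceSeries.txt", "FieldOfStudyChildren.txt", "RelatedFieldOfStudy.txt",
            "FieldsOFStudy.txt", "Journals.txt", "PaperAbstractsInvertedIndex.txt",
            "PaperAuthorAffiliations.txt", "PaperCitationContexts.txt",
            "PaperFieldsOfStudy.txt", "PaperLanguages.txt", "PaperReferences.txt",
            "Papers.txt", "PaperUrls.txt"));

    // Returns the required file names that are not regular files in dumpDir,
    // in the same order as the list above.
    static List<String> missingFiles(Path dumpDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dumpDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the dump directory is passed as the first argument.
        Path dumpDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(dumpDir);
        if (missing.isEmpty()) {
            System.out.println("All " + REQUIRED.size() + " input files found in " + dumpDir);
        } else {
            System.out.println("Missing input files: " + missing);
        }
    }
}
```

Run it with the dump directory as the only argument; with no argument it checks the current directory.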

Processing

Compile MAG2RDF.java, package it into a jar file whose manifest names MAG2RDF as the main class, and run:

java -jar MAG2RDF.jar

Output

For each input file, the program creates a corresponding output file in RDF, serialized as N-Triples (.nt):

  • Affiliations.txt.nt
  • Authors.txt.nt
  • ConferenceInstances.txt.nt
  • ConferenceSeries.txt.nt
  • FieldOfStudyChildren.txt.nt
  • RelatedFieldOfStudy.txt.nt
  • FieldsOFStudy.txt.nt
  • Journals.txt.nt
  • PaperAbstractsInvertedIndex.txt.nt
  • PaperAuthorAffiliations.txt.nt
  • PaperCitationContexts.txt.nt
  • PaperFieldsOfStudy.txt.nt
  • PaperLanguages.txt.nt
  • PaperReferences.txt.nt
  • Papers.txt.nt
  • PaperUrls.txt.nt
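
To illustrate the kind of mapping involved, here is a minimal sketch of turning one tab-separated dump line into one N-Triples line. It is not the code used by MAG2RDF: the class `TripleSketch`, the `example.org` namespace and property URI, and the assumption that the first two columns hold an identifier and a name are all hypothetical placeholders.

```java
// Hypothetical sketch, not the MAG2RDF implementation.
final class TripleSketch {

    // Placeholder URIs; the real MAKG namespaces are not taken from this README.
    static final String ENTITY_NS = "http://example.org/entity/";
    static final String NAME_PROPERTY = "http://example.org/property/name";

    // Escapes the characters that must not appear raw in an N-Triples literal.
    static String escapeLiteral(String value) {
        return value
                .replace("\\", "\\\\")   // backslash first, so later escapes stay intact
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Assumption: column 0 is the entity id and column 1 is a display name.
    static String toTriple(String tsvLine) {
        String[] cols = tsvLine.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected at least 2 columns: " + tsvLine);
        }
        return "<" + ENTITY_NS + cols[0] + "> <" + NAME_PROPERTY + "> \""
                + escapeLiteral(cols[1]) + "\" .";
    }

    public static void main(String[] args) {
        System.out.println(toTriple("123\tJournal of Examples"));
    }
}
```

Each N-Triples line is a complete statement ending in ` .`, so output files of this kind can be written and read one line at a time.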

Contact & More Information

More information can be found in my ISWC'19 paper The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data.

Feel free to reach out to me in case of questions or comments:

Michael Färber, [email protected]

How to Cite

Please cite my work (described in this paper) as follows:

@inproceedings{DBLP:conf/semweb/Farber19,
  author    = {Michael F{\"{a}}rber},
  title     = "{The Microsoft Academic Knowledge Graph: {A} Linked Data Source with
               8 Billion Triples of Scholarly Data}",
  booktitle = "{Proceedings of the 18th International Semantic Web Conference}",
  series    = "{ISWC'19}",
  location  = "{Auckland, New Zealand}",
  pages     = {113--129},
  year      = {2019},
  url       = {https://doi.org/10.1007/978-3-030-30796-7\_8},
  doi       = {10.1007/978-3-030-30796-7\_8}
}

Last Major Updates

  • 2020-07-09
  • 2019-07-15

Changes for Version 2020-07-09

A. MAG Dump-Files:

	1. Affiliations.txt.nt
		i.   Wikipedia article URLs were expected to use https; as a result, the conversion of Wikipedia URLs to DBpedia URLs failed. FIXED
		ii.  Created date is now in column 12 (previously 10)
		
	2. ConferenceInstances.txt.nt
		i.   Created date is now in column 17 (previously 15)
		
	3. FieldOfStudyRelationship.txt.nt
		i.   Entity two is now in column 3 (previously 4)
		ii.  Type one is now in column 2 (previously 3)
		iii. Type two is now in column 4 (previously 6)
		
	4. Journals.txt.nt
		i.   Created date is now in column 23 (previously 22)
		
	5. PaperLanguages.txt.nt
		i.   File doesn't exist anymore. If a paper has a tagged language, the language can be found in column 4 of PaperUrls.txt.
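
Because column positions shift between dump versions, one option is to keep them in a single lookup table instead of scattering them through the parsing code. The sketch below shows that idea for the created-date columns listed above. The class `CreatedDateColumns` is hypothetical, the keys use the input file names rather than the `.nt` output names, and the column numbers are treated as zero-based array indexes, which is an assumption this README does not confirm.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of MAG2RDF.
final class CreatedDateColumns {

    // Created-date columns for the 2020-07-09 dumps, as listed in the change notes.
    static final Map<String, Integer> CREATED_DATE_COLUMN = new LinkedHashMap<>();
    static {
        CREATED_DATE_COLUMN.put("Affiliations.txt", 12);        // previously 10
        CREATED_DATE_COLUMN.put("ConferenceInstances.txt", 17); // previously 15
        CREATED_DATE_COLUMN.put("Journals.txt", 23);            // previously 22
    }

    // Assumption: the column number is used directly as a zero-based index.
    static String createdDate(String fileName, String tsvLine) {
        Integer column = CREATED_DATE_COLUMN.get(fileName);
        if (column == null) {
            throw new IllegalArgumentException("No created-date column known for " + fileName);
        }
        String[] cols = tsvLine.split("\t", -1); // -1 keeps trailing empty columns
        if (column >= cols.length) {
            throw new IllegalArgumentException(
                    fileName + ": expected more than " + column + " columns, got " + cols.length);
        }
        return cols[column];
    }

    public static void main(String[] args) {
        // Build a synthetic line "c0<TAB>c1<TAB>...<TAB>c12" to exercise the lookup.
        StringBuilder line = new StringBuilder("c0");
        for (int i = 1; i <= 12; i++) {
            line.append('\t').append('c').append(i);
        }
        System.out.println(createdDate("Affiliations.txt", line.toString()));
    }
}
```

With a table like this, adapting to a new dump layout means changing one number per file, and a line with too few columns fails loudly instead of silently yielding the wrong field.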
		
B. MAG2RDF Code:

	1. textannotation/TullTextAnnotationClientXML
		i.   Line 113: Changed 'new ByteArrayInputStream(xmlstring.getBytes())' to 'new ByteArrayInputStream(xmlstring.getBytes(Charsets.UTF_8))'
		ii.  Line 263: Changed '.type(MediaType.APPLICATION_XML)' to '.type(MediaType.APPLICATION_XML + "; charset=UTF-8")'
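
The reason for the first change is that `String.getBytes()` without an argument uses the platform default charset, so the bytes can differ between machines. The sketch below shows the explicit variant in isolation. It uses the JDK's `java.nio.charset.StandardCharsets` rather than the `Charsets` class named in the change note, and the class `Utf8BytesSketch` and the sample XML string are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch illustrating the explicit-charset change; not MAG2RDF code.
final class Utf8BytesSketch {

    // Encodes with an explicit charset, independent of the platform default.
    static byte[] encode(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes with the same charset, so the round trip is lossless.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Mirrors the pattern from the change note: wrap the UTF-8 bytes in a stream.
    static InputStream asStream(String xml) {
        return new ByteArrayInputStream(encode(xml));
    }

    public static void main(String[] args) {
        String xml = "<name>F\u00e4rber</name>"; // \u00e4 is the letter a with diaeresis
        byte[] bytes = encode(xml);
        System.out.println("chars=" + xml.length() + ", bytes=" + bytes.length);
        System.out.println("round trip ok: " + decode(bytes).equals(xml));
    }
}
```

The second change follows the same logic on the HTTP side: declaring `charset=UTF-8` in the content type tells the receiver how to decode the bytes that were encoded this way.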
