
Comments (6)

cverluise commented on May 11, 2024

Hello @felixpoege,

thanks for the feedback.

Could you give us a bit more details on your request?

Just to give you some more insight, the NPL part of the database comes from PATSTAT, table TLS214. The npl_publn_id is the PATSTAT 2018b npl_publn_id. This provides a natural bridge to PATSTAT.

Now, we are also in the process of going further. For the NPL part of the database (the latest version is v02-npl), we have identified different classes of NPL citations:

  • For the bibliographical references (70%), the natural identifier is the DOI (we find 1 in 40 of the cases), which opens up almost all bibliographical datasets. We are currently working with Crossref and thinking about adding DBLP. See #30 and #21.
  • For the other citation classes, the external datasets that would help characterize the cited item precisely and enrich the citation data are not yet clearly identified. This is work in progress and is the purpose of #22, #23, #24, #25, #26, #27, #28, #29. We are currently trying to involve people who are used to working with these different kinds of data in order to be as useful as possible to the end user. Happy for ideas, contacts!

Is this the kind of info you would like to find in the documentation website?

Cheers,

Cyril

from patcit.

felixpoege commented on May 11, 2024

Hi Cyril,

okay, I see - it was not clear to me from the front page documentation that the DOI is your current linkage recommendation. Making sure that the DOI point comes across is important, and providing easy linkage to further datasets will be important for adoption. I think the idea of Matt Marx to use open-source datasets is good - maybe you can just recommend usage (i.e. recommend downloading Microsoft Academic Graph or Pubmed or DBLP and linking via DOI). (This was also my main question; I wanted to keep it documentation-only.)

But if you can link to commercial datasets like WoS or Scopus that would allow access to many users too (we could not publish our linkage with WoS because of licensing issues though). Same goes for the more specialized types of references that you are talking about.

Personally, I have tried a few times to match bibliometric items across databases (Scopus-WoS-DBLP, MAG-WoS, Pubmed-WoS) and the DOI alone did not give me super convincing results. Mostly because coverage in my WoS is zero before 2000 and poor in the early 2000s. For the ones not matched with a DOI, I resorted to an exact match on the title, cleaning up duplicates using additional characteristics. Doesn't really get you perfect coverage, but it does give you high precision, which is what I wanted in those projects. In terms of coverage, the latest one (MAG-WoS) was really unconvincing, so I only use it for robustness checks in that project.
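The two-stage matching described above (DOI first, then an exact title match with duplicates cleaned up using additional characteristics) could be sketched roughly as follows. This is only an illustration, not the code used in those projects: the record layout (`id`, `doi`, `title`, `year`), the function names, and the choice of the publication year as the tie-breaking characteristic are all assumptions.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_doi(doi):
    """Lower-case a DOI and strip common URL/prefix forms; return None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def normalize_title(title):
    """Fold accents, case, punctuation and whitespace before exact comparison."""
    text = unicodedata.normalize("NFKD", title or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def match_records(left, right):
    """Return {left_id: right_id}: DOI matches first, then unique title matches."""
    by_doi = {}
    by_title = defaultdict(list)
    for rec in right:
        doi = normalize_doi(rec.get("doi"))
        if doi:
            by_doi.setdefault(doi, rec)
        title = normalize_title(rec.get("title"))
        if title:
            by_title[title].append(rec)

    matches = {}
    for rec in left:
        # Stage 1: match on the normalized DOI when both sides have one.
        doi = normalize_doi(rec.get("doi"))
        if doi and doi in by_doi:
            matches[rec["id"]] = by_doi[doi]["id"]
            continue
        # Stage 2: exact match on the normalized title.
        candidates = by_title.get(normalize_title(rec.get("title")), [])
        # Use an extra characteristic (here: year, an assumption) to clean up duplicates.
        candidates = [c for c in candidates if c.get("year") == rec.get("year")]
        # Keep only unambiguous matches: this favours precision over coverage.
        if len(candidates) == 1:
            matches[rec["id"]] = candidates[0]["id"]
    return matches
```

Dropping ambiguous title matches instead of guessing is what keeps precision high at the cost of coverage, which is the trade-off described above.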

To what else you said: The PATSTAT part makes sense. However, Patent IDs unrelated to a specific dataset would make accessibility easier for a broader range of users. I think the natural information is that contained in `tls211`.

Anyways, cool work you are doing! Keep it up!

Cheers,
Felix


cverluise commented on May 11, 2024

Hello,

> I think the idea of Matt Marx to use open-source datasets is good - maybe you can just recommend usage (i.e. recommend to download Microsoft Academic Graph or Pubmed or DBLP and link via DOI)

The issue with MAG is that this is a relatively low quality dataset. That's why:

  • we focus on Crossref, which is based directly on publisher data.
  • we created #30 so that the database covers a large set of use cases. We have already added the abstracts, subject and funder data from Crossref. Crossref has an open API from which we can get the data (CC-BY4). Feel free to participate.
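For readers who want to try the Crossref route themselves, here is a minimal sketch of looking up one DOI through the public Crossref REST API and keeping the kinds of fields mentioned above (abstract, subject, funder). It is not patcit's own pipeline. The `works/{doi}` endpoint is the public one; the JSON keys read in `extract_fields` reflect my understanding of the response layout and should be checked against the Crossref documentation, and the helper names are made up for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the lookup URL for a single DOI (the slash in the DOI is kept)."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def extract_fields(payload):
    """Pick abstract, subject and funder data out of a decoded API response.

    Assumes the record sits under the "message" key; absent fields are
    returned as None or empty lists.
    """
    message = payload.get("message", {})
    return {
        "doi": message.get("DOI"),
        "abstract": message.get("abstract"),
        "subject": message.get("subject", []),
        "funders": [
            {"name": f.get("name"), "DOI": f.get("DOI"), "award": f.get("award", [])}
            for f in message.get("funder", [])
        ],
    }


def fetch_work(doi, timeout=30):
    """Download and parse one record (needs network access)."""
    with urlopen(crossref_url(doi), timeout=timeout) as response:
        return extract_fields(json.load(response))
```

For example, `fetch_work("10.1038/nmat712")` would query the API for one of the DOIs used later in this thread. Many records carry no abstract or funder information, so the result should be treated as sparse.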

> The PATSTAT part makes sense. However, Patent IDs unrelated to a specific dataset would make accessibility easier for a broader range of users. I think the natural information is that contained in `tls211`.

Yes, we have a cited_by field which contains:

  • the publication_number of the citing patents (DOCDB flavor)
  • the origin of the citing patents (Applicants, Search report)
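To make the shape of that field concrete, here is a small sketch that treats `cited_by` as a list of `{origin, publication_number}` entries attached to one NPL record and flattens it into one row per citing patent. The dictionary layout and the function name are assumptions made for illustration; the sample values are taken from the query result shown later in this thread.

```python
def flatten_cited_by(record):
    """Yield one (doi, origin, publication_number) tuple per citing patent."""
    for citation in record.get("cited_by", []):
        yield (
            record.get("doi"),
            citation.get("origin"),
            citation.get("publication_number"),
        )


# Toy record; the real table holds many more fields per NPL entry.
record = {
    "doi": "10.1038/nmat712",
    "cited_by": [
        {"origin": "APP", "publication_number": "US-9508861-B2"},
        {"origin": "APP", "publication_number": "US-9176571-B2"},
    ],
}

rows = list(flatten_cited_by(record))
for row in rows:
    print(row)
```

Because the publication numbers are in DOCDB format, each flattened row can be linked to any patent dataset that uses the same numbering, without going through PATSTAT-specific identifiers.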

Last but not least, everything discussed above is made publicly available and easy to consult in the table schema, which can be accessed on BigQuery, inter alia; see here (schema pane). Good to know that it does not seem to be sufficient ;).

We will integrate your comments into later versions of the doc website.

Cheers


cverluise commented on May 11, 2024

I'm thinking about adding the schema of the table to the data section, see below for the v02-npl table

| name | description | type |
| --- | --- | --- |
| npl_publn_id | Non-Patent Literature publication identification. Source: PATSTAT and DOCDB. | INTEGER |
| npl_class | NPL class (e.g. BIBLIOGRAPHICAL_REFERENCE, OFFICE_ACTION, PATENT, SEARCH_REPORT, etc) | STRING |
| cited_by.origin | Origin of the citation (e.g. APPlicant, SEArch report, etc) | STRING |
| cited_by.publication_number | DOCDB publication number of citing patent(s) | STRING |
| author.first | First name | STRING |
| author.middle | Middle name | STRING |
| author.surname | Surname | STRING |
| author.genname | Gender name | STRING |
| funder.subject | Subject (from Crossref) | STRING |
| funder.DOI | Funder DOI (from Crossref) | STRING |
| funder.award | Funding award identifier (from Crossref) | STRING |
| funder.name | Funder name (from Crossref) | STRING |
| doi | Digital Object Identifier. DOIs are in wide use mainly to identify academic, professional, and government information, such as journal articles, research reports and data sets, and official publications. | STRING |
| ISSN | International Standard Serial Number. It is an 8-digit code used to identify newspapers, journals, magazines and periodicals of all kinds and on all media – print and electronic. | STRING |
| ISSNe | Electronic International Standard Serial Number. When a serial with the same content is published in more than one media type, a different ISSN is assigned to each media type. The ISSN system refers to the electronic ISSN as the ISSNe. | STRING |
| PMCID | PubMed Central Identifier. The PMCID is a unique reference number or identifier that is assigned to every article that is accepted into PubMed Central – an archive of full-text journal articles. | STRING |
| PMID | PubMed Identifier. The PMID is a unique reference number for PubMed citations. | INTEGER |
| idno | Document-specific identifier, if not in DOI, ISSN, ISSNe, PMCID, PMID. | STRING |
| target | Open Access url | STRING |
| title_j | Journal title | STRING |
| title_abbrev_j | Journal title (abbreviated) | STRING |
| title_m | Title of the item holding the NPL – for non-journal items only, e.g. conference, proceedings, etc. | STRING |
| title_main_m | | STRING |
| title_main_a | Article title | STRING |
| year | Publication year | INTEGER |
| issue | Issue number of the item holding the NPL (e.g. journal, proceedings, etc) | INTEGER |
| bibl volume | Volume number of the item holding the NPL (e.g. journal, proceedings, etc) | |
| from | First page | INTEGER |
| to | Last page | INTEGER |
| abstract | Abstract (from Crossref) | STRING |

Same for other tables (e.g. in-text).

Would it be sufficient or would you like any additional info?

Cheers


felixpoege commented on May 11, 2024

My issue is really just about documentation: when I read the landing pages, I don't exactly know how I can integrate your project into a potential workflow. For me, I was thinking about the situation where I have a bibliometrics dataset and want to attach some high-quality information about patents. Whether I use a high-quality specialized database or a lower-quality large-scale database is really project-specific.

People that come from the bibliometrics side (like me in this case) will want to stick to the database that they have and add just a little bit of patent information. Can this be done easily here? Then it would help to include this information. For example, in your table above you have the PMID. Just write this on the page. (Simply, "This project provides links of patent NPL citations with links to databases in Pubmed, [...]. For other databases, we recommend matching with the DOI.") -> Boom, I download and use your data.

I understand now that this project mostly goes the other way (sorry for me being slow), starting from a patent and asking to know as much as possible about the NPL. Which is really cool, but I think the other perspective is an additional use-case.

All in all: The documentation needs to do a better job at explaining what this data will do for me. 😄


cverluise commented on May 11, 2024

> People that come from the bibliometrics side (like me in this case) will want to stick to the database that they have and add just a little bit of patent information. Can this be done easily here?

Yes, we can. Toy example: let's say that you are interested in a list of DOIs (e.g. 10.1038/nmat712 and 10.1038/nature03090, 2 highly cited articles in the patent field):

SELECT
  doi,
  cited_by
FROM
  `npl-parsing.patcit.v02_npl`
WHERE
  doi IN ("10.1038/nmat712",
    "10.1038/nature03090")

There you are. I report only the first rows (the full result has 5,500+ rows); see details.

| Row | doi | cited_by.origin | cited_by.publication_number |
| --- | --- | --- | --- |
| 1 | 10.1038/nmat712 | APP | US-9508861-B2 |
| 2 | 10.1038/nmat712 | APP | US-9176571-B2 |
| 3 | 10.1038/nmat712 | APP | US-9129667-B2 |
| 4 | 10.1038/nmat712 | APP | US-8519990-B2 |
| 5 | 10.1038/nmat712 | APP | US-8563976-B2 |

For a larger list, the best approach would be to upload the list to BigQuery, take the inner join of v02-npl and the uploaded list (on the DOIs), and keep only the doi and cited_by fields. This is an interesting use case, and we might work on a CLI to make it super easy. That's something we started to discuss in #16 some time ago.
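The join logic itself is simple and can be sketched locally before moving to BigQuery. The snippet below is only an illustration of an inner join on DOIs that keeps the doi and cited_by fields; it works on in-memory rows rather than on the actual table, and the function name, the lower-casing of DOIs and the toy rows are assumptions.

```python
def citing_patents_for(dois, npl_rows):
    """Inner join of a DOI list with NPL rows, keeping only doi and cited_by."""
    # Normalize the user's list once; lower-casing is an assumption about
    # how DOIs should be compared.
    wanted = {d.strip().lower() for d in dois if d and d.strip()}
    result = {}
    for row in npl_rows:
        doi = (row.get("doi") or "").strip().lower()
        # Only DOIs present on both sides survive, as in an inner join.
        if doi in wanted:
            result.setdefault(doi, []).extend(row.get("cited_by", []))
    return result


# Toy stand-in for the v02-npl table (values borrowed from the result above).
npl_rows = [
    {"doi": "10.1038/nmat712",
     "cited_by": [{"origin": "APP", "publication_number": "US-9508861-B2"}]},
    {"doi": None, "cited_by": []},
]

my_dois = ["10.1038/NMAT712", "10.1038/nature03090"]
joined = citing_patents_for(my_dois, npl_rows)
print(joined)
```

In this toy run, `10.1038/nature03090` is absent from the result simply because the toy table does not contain it; against the full table both DOIs would be expected to return rows.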

> "This project provides links of patent NPL citations with links to databases in Pubmed, [...]. For other databases, we recommend matching with the DOI."

Will do, 👍

> Boom, I download and use your data.

Great! There we are 😎. Actually, in your case, I recommend going through BigQuery.

Cheers

