Hi all, just tried to find out whether this dataset would be useful

Sources of NPL about patcit HOT 6 CLOSED

cverluise commented on May 11, 2024 1

Sources of NPL

from patcit.

Comments (6)

cverluise commented on May 11, 2024

Hello @felixpoege,

thanks for the feedback.

Could you give us a bit more details on your request?

Just to give you some more insight, the NPL part of the database comes from PATSTAT, table TLS214. The npl_publn_id is the 2018b npl_publn_id. This provides a natural bridge to PATSTAT.

Now, we are also in the process of going further. For the NPL part of the database (the latest version is v02-npl), we have identified different classes of NPL citations:

- For the bibliographical references (70%), the natural identifier is the DOI (we find 1 in 40 of the cases), which opens the door to almost all bibliographical datasets. We are currently working with Crossref and thinking about adding DBLP. See #30 and #21.
- For the other citation classes, we have not yet clearly identified the external datasets that would help characterize the cited item precisely and enrich the citation data. This is work in progress and is the purpose of #22, #23, #24, #25, #26, #27, #28 and #29. We are currently trying to involve people who are used to working with these different kinds of data, in order to be as useful as possible to the end user. Happy to receive ideas and contacts!

Is this the kind of info you would like to find in the documentation website?

Cheers,

Cyril

felixpoege commented on May 11, 2024

Hi Cyril,

Okay, I see. It did not become clear to me from the front-page documentation that the DOI is your recommended linkage at this point. Making sure that the DOI point comes across is important, and providing easy linkage to further datasets will be important for adoption. I think the idea of Matt Marx to use open-source datasets is good - maybe you can just recommend usage (i.e. recommend to download Microsoft Academic Graph or Pubmed or DBLP and link via DOI). (This was also my main question; I wanted to keep it documentation-only.)

But if you can link to commercial datasets like WoS or Scopus that would allow access to many users too (we could not publish our linkage with WoS because of licensing issues though). Same goes for the more specialized types of references that you are talking about.

Personally, I have tried a few times to match bibliometric items across databases (Scopus-WoS-DBLP, MAG-WoS, Pubmed-WoS) and the DOI alone did not give me super convincing results. Mostly because coverage in my WoS is zero before 2000 and poor in the early 2000s. For the ones not matched with a DOI, I resorted to an exact match on the title, cleaning up duplicates using additional characteristics. Doesn't really get you perfect coverage, but it does give you high precision, which is what I wanted in those projects. In terms of coverage, the latest one (MAG-WoS) was really unconvincing, so I only use it for robustness checks in that project.
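The two-stage matching described above (DOI first, exact title as a fallback) can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (`id`, `doi`, `title` keys) and the function names are hypothetical, and ambiguous titles are simply dropped here rather than resolved with additional characteristics as in the projects mentioned above.

```python
import re
from collections import Counter


def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes (DOIs are case-insensitive)."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi or None


def normalize_title(title):
    """Reduce a title to lower-case alphanumeric words for exact comparison."""
    if not title:
        return None
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(words) or None


def match_records(left, right):
    """Return {left_id: (right_id, method)}.

    Stage 1 matches on the normalized DOI. Stage 2 falls back to an exact
    match on the normalized title, but only for titles that occur exactly
    once in `right` (a simplification that favors precision over coverage).
    """
    by_doi = {}
    for rec in right:
        doi = normalize_doi(rec.get("doi"))
        if doi:
            by_doi.setdefault(doi, rec["id"])

    titles = [normalize_title(rec.get("title")) for rec in right]
    counts = Counter(t for t in titles if t)
    by_title = {
        t: rec["id"] for t, rec in zip(titles, right) if t and counts[t] == 1
    }

    matches = {}
    for rec in left:
        doi = normalize_doi(rec.get("doi"))
        if doi and doi in by_doi:
            matches[rec["id"]] = (by_doi[doi], "doi")
            continue
        title = normalize_title(rec.get("title"))
        if title and title in by_title:
            matches[rec["id"]] = (by_title[title], "title")
    return matches
```

Keeping the match method alongside each pair makes it easy to report DOI and title coverage separately, or to restrict robustness checks to DOI matches only.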

To what else you said: The PATSTAT part makes sense. However, Patent IDs unrelated to a specific dataset would make accessibility easier for a broader range of users. I think the natural information is that contained in `tls211`.

Anyways, cool work you are doing! Keep it up!

Cheers,
Felix

cverluise commented on May 11, 2024

Hello,

> I think the idea of Matt Marx to use open-source datasets is good - maybe you can just recommend usage (i.e. recommend to download Microsoft Academic Graph or Pubmed or DBLP and link via DOI)

The issue with MAG is that this is a relatively low quality dataset. That's why:

- we focus on Crossref, which is based directly on publisher data.
- we created #30 so that the database covers a large set of use cases. We have already added the abstract, subject and funder data from Crossref. Crossref has an open API from which we can get the data (CC-BY4). Feel free to participate.

> The PATSTAT part makes sense. However, Patent IDs unrelated to a specific dataset would make accessibility easier for a broader range of users. I think the natural information is that contained in `tls211`.

Yes, we have a cited_by field which contains:

- the publication_number of the citing patents (DOCDB flavor)
- the origin of the citing patents (Applicants, Search report)

Last but not least, everything discussed above is publicly available and easy to consult in the table schema, which can be accessed on BigQuery inter alia, see here (schema pane). Good to know that this does not seem to be sufficient ;).

We will integrate your comments into later versions of the doc website.

Cheers

cverluise commented on May 11, 2024

I'm thinking about adding the schema of the table to the data section, see below for the v02-npl table

name	description	type
npl_publn_id	Non-Patent Literature publication identification. Source: PATSTAT and DOCDB.	INTEGER
npl_class	NPL class (e.g. BIBLIOGRAPHICAL_REFERENCE, OFFICE_ACTION, PATENT, SEARCH_REPORT, etc)	STRING
cited_by.origin	Origin of the citation (e.g. APPlicant, SEArch report, etc)	STRING
cited_by.publication_number	DOCDB publication number of citing patent(s)	STRING
author.first	First name	STRING
author.middle	Middle name	STRING
author.surname	Surname	STRING
author.genname	Gender name	STRING
funder.subject	Subject (from Crossref)	STRING
funder.DOI	Funder DOI (from Crossref)	STRING
funder.award	Funding award identifier (from Crossref)	STRING
funder.name	Funder name (from Crossref)	STRING
doi	Digital Object Identifier. DOIs are in wide use mainly to identify academic, professional, and government information, such as journal articles, research reports and data sets, and official publications.	STRING
ISSN	International Standard Serial Number. It is an 8-digit code used to identify newspapers, journals, magazines and periodicals of all kinds and on all media – print and electronic.	STRING
ISSNe	Electronic International Standard Serial Number. When a serial with the same content is published in more than one media type, a different ISSN is assigned to each media type. The ISSN system refers to the electronic ISSN as the ISSNe.	STRING
PMCID	PubMed Central Identifier. The PMCID is a unique reference number or identifier that is assigned to every article that is accepted into PubMed Central – an archive of full-text journal articles.	STRING
PMID	PubMed Identifier. The PMID is a unique reference number for PubMed citations.	INTEGER
idno	Document-specific identifier, if not in DOI, ISSN, ISSNe, PMCID, PMID.	STRING
target	Open Access url	STRING
title_j	Journal title	STRING
title_abbrev_j	Journal title (abbreviated)	STRING
title_m	Title of the item holding the NPL – for non-journal items only, e.g. conference, proceedings, etc.	STRING
title_main_m		STRING
title_main_a	Article title	STRING
year	Publication year	INTEGER
issue	Issue number of the item holding the NPL (e.g. journal, proceedings, etc)	INTEGER
bibl	volume	Volume number of the item holding the NPL (e.g. journal, proceedings, etc)
from	First page	INTEGER
to	Last page	INTEGER
abstract	Abstract (from Crossref)	STRING

Same for other tables (e.g. in-text).

Would it be sufficient or would you like any additional info?

Cheers

felixpoege commented on May 11, 2024

My issue is really just about documentation: when I read the landing pages, I don't exactly know how I can integrate your project into a potential workflow. For me, I was thinking about the situation where I have a bibliometrics dataset and want to attach some high-quality information about patents. Whether I use a high-quality specialized database or a lower-quality large-scale database is really project-specific.

People that come from the bibliometrics side (like me in this case) will want to stick to the database that they have and add just a little bit of patent information. Can this be done easily here? Then it would help to include this information. For example, in your table above you have the PMID. Just write this on the page. (Simply, "This project provides links of patent NPL citations with links to databases in Pubmed, [...]. For other databases, we recommend matching with the DOI.") -> Boom, I download and use your data.

I understand now that this project mostly goes the other way (sorry for me being slow), starting from a patent and asking to know as much as possible about the NPL. Which is really cool, but I think the other perspective is an additional use-case.

All in all: The documentation needs to do a better job at explaining what this data will do for me. 😄

cverluise commented on May 11, 2024

> People that come from the bibliometrics side (like me in this case) will want to stick to the database that they have and add just a little bit of patent information. Can this be done easily here?

Yes, we can. Toy example: let's say that you are interested in a list of DOIs (e.g. 10.1038/nmat712 and 10.1038/nature03090, 2 highly cited articles in the patent field)

SELECT
  doi,
  cited_by
FROM
  `npl-parsing.patcit.v02_npl`
WHERE
  doi IN ("10.1038/nmat712",
    "10.1038/nature03090")

There you are. I report only the first rows (the full result has 5,500+ rows); see details.

Row	doi	cited_by.origin	cited_by.publication_number
1	10.1038/nmat712	APP	US-9508861-B2
2	10.1038/nmat712	APP	US-9176571-B2
3	10.1038/nmat712	APP	US-9129667-B2
4	10.1038/nmat712	APP	US-8519990-B2
5	10.1038/nmat712	APP	US-8563976-B2

For a larger list, the best approach would be to upload the list to BigQuery, take the inner join of v02-npl and the uploaded table (on the DOIs), and keep only the doi and cited_by fields. This is an interesting use case, and we might work on a CLI to make it super easy. That's something we started to discuss in #16 some time ago.
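To illustrate the logic of that inner join without a BigQuery account, here is a minimal sketch that uses an in-memory SQLite database as a stand-in. The table and function names (`npl`, `my_dois`, `citing_patents`) are made up for the example, and it assumes the `cited_by` field has already been flattened to one row per citing patent, as in the result table above; in BigQuery itself the nested field would first need to be flattened.

```python
import sqlite3


def citing_patents(npl_rows, my_dois):
    """Inner-join a user DOI list with flattened NPL citation rows.

    npl_rows: iterable of (doi, origin, publication_number) tuples,
              i.e. v02-npl with cited_by flattened (an assumption).
    my_dois:  iterable of DOIs coming from the user's own dataset.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE npl (doi TEXT, origin TEXT, publication_number TEXT)"
    )
    con.execute("CREATE TABLE my_dois (doi TEXT)")
    con.executemany("INSERT INTO npl VALUES (?, ?, ?)", npl_rows)
    # De-duplicate the user list so the join does not multiply rows.
    con.executemany(
        "INSERT INTO my_dois VALUES (?)", [(d,) for d in set(my_dois)]
    )
    rows = con.execute(
        """
        SELECT npl.doi, npl.origin, npl.publication_number
        FROM npl
        INNER JOIN my_dois ON npl.doi = my_dois.doi
        ORDER BY npl.doi, npl.publication_number
        """
    ).fetchall()
    con.close()
    return rows
```

Calling `citing_patents(rows, dois)` returns only the citation rows whose DOI appears in the user list, which is the "keep only the doi and cited_by fields" result described above.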

"This project provides links of patent NPL citations with links to databases in Pubmed, [...]. For other databases, we recommend matching with the DOI."

Will do, 👍

> Boom, I download and use your data.

Great! There we are 😎. Actually, in your case, I recommend going through BigQuery.

Cheers
