Comments (6)
Hello @felixpoege,
thanks for the feedback.
Could you give us a bit more detail on your request?
Just to give you some more insight: the NPL part of the database comes from PatStat, table TLS214. The `npl_publn_id` is the 2018b `npl_publn_id`, which provides a natural bridge to PatStat.
We are also in the process of going further. For the NPL part of the database (the latest version is `v02-npl`), we have identified different classes of NPL citations:
- For the bibliographical references (70%) the natural identifier is the DOI (we find 1 in 40 of the cases) and opens up to almost all bibliographical datasets. We are currently working with Crossref and thinking about adding DBLP. See #30 and #21
- For the other citation classes, we have not yet clearly identified the external datasets that would help characterize the cited item precisely and enrich the citation data. This is work in progress and is the purpose of #22, #23, #24, #25, #26, #27, #28, #29. We are currently trying to involve people who are used to working with these different kinds of data, in order to be as useful as possible to the end user. Happy for ideas, contacts!
Is this the kind of info you would like to find in the documentation website?
Cheers,
Cyril
from patcit.
Hi Cyril,
okay, I see - the point that the DOI is your linkage recommendation at this point did not become clear to me from the front page documentation. But making sure that the DOI point comes across is important - and providing easy linkage access to further datasets will be important for proliferation. I think the idea of Matt Marx to use open-source datasets is good - maybe you can just recommend usage (i.e. recommend to download Microsoft Academic Graph or Pubmed or DBLP and link via DOI) (This was also my main question, wanted to keep it documentation-only)
But if you can link to commercial datasets like WoS or Scopus that would allow access to many users too (we could not publish our linkage with WoS because of licensing issues though). Same goes for the more specialized types of references that you are talking about.
Personally, I have tried a few times to match bibliometric items across databases (Scopus-WoS-DBLP, MAG-WoS, Pubmed-WoS) and the DOI alone did not give me super convincing results. Mostly because coverage in my WoS is zero before 2000 and poor in the early 2000s. For the ones not matched with a DOI, I resorted to an exact match on the title, cleaning up duplicates using additional characteristics. Doesn't really get you perfect coverage, but it does give you high precision, which is what I wanted in those projects. In terms of coverage, the latest one (MAG-WoS) was really unconvincing, so I only use it for robustness checks in that project.
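The two-step matching described above (DOI first, exact title as a fallback) can be sketched roughly as follows. This is only an illustration, not the code used in those projects: the record layout (`id`, `doi`, `title` keys) and the helper names are hypothetical, and instead of cleaning up duplicates with additional characteristics it simply drops titles that are ambiguous on either side, which favours precision over coverage.

```python
import re
from collections import defaultdict


def normalize_doi(doi):
    """Lowercase a DOI and strip a resolver prefix (DOIs are case-insensitive)."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi or None


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    if not title:
        return None
    title = re.sub(r"[^\w\s]", " ", title.lower())
    title = " ".join(title.split())
    return title or None


def unique_titles(records):
    """Map normalized title -> record id, keeping only unambiguous titles."""
    buckets = defaultdict(list)
    for rec in records:
        title = normalize_title(rec.get("title"))
        if title:
            buckets[title].append(rec["id"])
    return {t: ids[0] for t, ids in buckets.items() if len(ids) == 1}


def match_records(left, right):
    """Return {left_id: (right_id, method)}: DOI match first, then exact title."""
    doi_index = {}
    for rec in right:
        doi = normalize_doi(rec.get("doi"))
        if doi:
            doi_index.setdefault(doi, rec["id"])

    left_titles = unique_titles(left)
    right_titles = unique_titles(right)

    matches = {}
    for rec in left:
        doi = normalize_doi(rec.get("doi"))
        if doi and doi in doi_index:
            matches[rec["id"]] = (doi_index[doi], "doi")
            continue
        # Fallback: exact match on a title that is unique in both datasets.
        title = normalize_title(rec.get("title"))
        if title and left_titles.get(title) == rec["id"] and title in right_titles:
            matches[rec["id"]] = (right_titles[title], "title")
    return matches
```

Keeping the matching method next to each match makes it easy to report DOI-based and title-based links separately, or to restrict an analysis to DOI matches as a robustness check.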
To what else you said: The PATSTAT part makes sense. However, Patent IDs unrelated to a specific dataset would make accessibility easier for a broader range of users. I think the natural information is that contained in `tls211`.
Anyways, cool work you are doing! Keep it up!
Cheers,
Felix
Hello,
> I think the idea of Matt Marx to use open-source datasets is good - maybe you can just recommend usage (i.e. recommend to download Microsoft Academic Graph or Pubmed or DBLP and link via DOI)
The issue with MAG is that this is a relatively low quality dataset. That's why:
- we focus on Crossref which is based directly on publisher data.
- we created #30 so that the database covers a large set of use cases. We have already added the abstract, subject and funder data from Crossref. Crossref has an open API from which we can get the data (CC-BY4). Feel free to participate.
> The PATSTAT part makes sense. However, Patent IDs unrelated to a specific dataset would make accessibility easier for a broader range of users. I think the natural information is that contained in `tls211`.
Yes, we have a `cited_by` field which contains:
- the `publication_number` of the citing patents (DOCDB flavor)
- the `origin` of the citing patents (Applicants, Search report)
Last but not least, everything discussed above is publicly available and easy to consult in the table schema, which can be accessed on BigQuery inter alia, see here (schema pane). Good to know that this does not seem to be sufficient ;).
We will integrate your comments into later versions of the doc website.
Cheers
I'm thinking about adding the schema of the table to the data section; see below for the `v02-npl` table.
name | description | type |
---|---|---|
npl_publn_id | Non-Patent Literature publication identification. Source: PATSTAT and DOCDB. | INTEGER |
npl_class | NPL class (e.g. BIBLIOGRAPHICAL_REFERENCE, OFFICE_ACTION, PATENT, SEARCH_REPORT, etc) | STRING |
cited_by.origin | Origin of the citation (e.g. APPlicant, SEArch report, etc) | STRING |
cited_by.publication_number | DOCDB publication number of citing patent(s) | STRING |
author.first | First name | STRING |
author.middle | Middle name | STRING |
author.surname | Surname | STRING |
author.genname | Gender name | STRING |
funder.subject | Subject (from Crossref) | STRING |
funder.DOI | Funder DOI (from Crossref) | STRING |
funder.award | Funding award identifier (from Crossref) | STRING |
funder.name | Funder name (from Crossref) | STRING |
doi | Digital Object Identifier. DOIs are in wide use mainly to identify academic, professional, and government information, such as journal articles, research reports and data sets, and official publications. | STRING |
ISSN | International Standard Serial Number. It is an 8-digit code used to identify newspapers, journals, magazines and periodicals of all kinds and on all media, print and electronic. | STRING |
ISSNe | Electronic International Standard Serial Number. When a serial with the same content is published in more than one media type, a different ISSN is assigned to each media type. The ISSN system refers to the electronic ISSN as the ISSNe. | STRING |
PMCID | PubMed Central Identifier. The PMCID is a unique reference number or identifier that is assigned to every article that is accepted into PubMed Central – an archive of full-text journal articles. | STRING |
PMID | PubMed Identifier. The PMID is a unique reference number for PubMed citations. | INTEGER |
idno | Document-specific identifier, if not in DOI, ISSN, ISSNe, PMCID, PMID. | STRING |
target | Open Access url | STRING |
title_j | Journal title | STRING |
title_abbrev_j | Journal title (abbreviated) | STRING |
title_m | Title of the item holding the NPL, for non-journal items only (e.g. conference, proceedings, etc.) | STRING |
title_main_m | | STRING |
title_main_a | Article title | STRING |
year | Publication year | INTEGER |
issue | Issue number of the item holding the NPL (e.g. journal, proceedings, etc) | INTEGER |
volume | Volume number of the item holding the NPL (e.g. journal, proceedings, etc) | |
from | First page | INTEGER |
to | Last page | INTEGER |
abstract | Abstract (from Crossref) | STRING |
Same for other tables (e.g. `in-text`).
Would it be sufficient or would you like any additional info?
Cheers
My issue is really just about documentation: when I read the landing pages, I don't exactly know how I can include your project into a potential workflow. For me, I was thinking about the situation where I have a bibliometrics dataset and want to attach some high-quality information about patents. Whether I use a high-quality specialized database or a lower-quality large-scale database is really project-specific.
People that come from the bibliometrics side (like me in this case) will want to stick to the database that they have and add just a little bit of patent information. Can this be done easily here? Then it would help to include this information. For example, in your table above you have the PMID. Just write this on the page. (Simply, "This project provides links of patent NPL citations with links to databases in Pubmed, [...]. For other databases, we recommend matching with the DOI.") -> Boom, I download and use your data.
I understand now that this project mostly goes the other way (sorry for me being slow), starting from a patent and asking to know as much as possible about the NPL. Which is really cool, but I think the other perspective is an additional use-case.
All in all: The documentation needs to do a better job at explaining what this data will do for me. 😄
> People that come from the bibliometrics side (like me in this case) will want to stick to the database that they have and add just a little bit of patent information. Can this be done easily here?
Yes, we can. Toy example: let's say that you are interested in a list of DOIs (e.g. 10.1038/nmat712 and 10.1038/nature03090, two articles that are highly cited by patents):

    SELECT
      doi,
      cited_by
    FROM
      `npl-parsing.patcit.v02_npl`
    WHERE
      doi IN ("10.1038/nmat712",
        "10.1038/nature03090")

There you are. I report only the first rows (the full result has 5500+ rows); see details.
Row | doi | cited_by.origin | cited_by.publication_number |
---|---|---|---|
1 | 10.1038/nmat712 | APP | US-9508861-B2 |
2 | 10.1038/nmat712 | APP | US-9176571-B2 |
3 | 10.1038/nmat712 | APP | US-9129667-B2 |
4 | 10.1038/nmat712 | APP | US-8519990-B2 |
5 | 10.1038/nmat712 | APP | US-8563976-B2 |
For a larger list, the best approach would be to upload the list to BigQuery, take the inner join of `v02-npl` and the uploaded table (on the DOIs), and keep only the `doi` and `cited_by` fields. This is an interesting use case, and we might work on a CLI to make it super easy. That's something we started to discuss in #16 some time ago.
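For anyone working on a local extract rather than on BigQuery, the same inner join can be sketched in plain Python. This is a minimal illustration, not part of the project: the function name is hypothetical, and it assumes each row is a dict with a `doi` and a `cited_by` list of `{origin, publication_number}` records, mirroring the schema above. Lowercasing DOIs before comparing is also an assumption about how the two sides are spelled.

```python
def citing_patents(my_dois, npl_rows):
    """Inner join a DOI list with v02-npl-shaped rows on the DOI.

    Yields one flat dict per (doi, citing patent) pair, keeping only
    the doi and the cited_by fields.
    """
    # Normalize the user's DOIs once; DOIs are case-insensitive.
    wanted = {d.strip().lower() for d in my_dois if d}
    wanted.discard("")

    for row in npl_rows:
        doi = (row.get("doi") or "").strip().lower()
        if doi not in wanted:
            continue  # not in the user's list: dropped by the inner join
        for citation in row.get("cited_by") or []:
            yield {
                "doi": doi,
                "origin": citation.get("origin"),
                "publication_number": citation.get("publication_number"),
            }


if __name__ == "__main__":
    # Values taken from the first rows of the result shown above.
    rows = [
        {
            "doi": "10.1038/nmat712",
            "cited_by": [
                {"origin": "APP", "publication_number": "US-9508861-B2"},
                {"origin": "APP", "publication_number": "US-9176571-B2"},
            ],
        },
    ]
    for hit in citing_patents(["10.1038/nmat712", "10.1038/nature03090"], rows):
        print(hit)
```

Because the function is a generator, it can stream over a large extract without holding every match in memory.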
"This project provides links of patent NPL citations with links to databases in Pubmed, [...]. For other databases, we recommend matching with the DOI."
Will do, 👍
> Boom, I download and use your data.
Great! There we are 😎. Actually, in your case, I recommend going through BigQuery.
Cheers