Comments (7)
non english characters in crossrefworks.json are replaced by question mark
Yuck, some sort of encoding issue at some point in the pipeline. At what stage exactly did you notice the question marks?
The API call for the above record does seem to encode unicode characters properly in the name (and not Heidb??chel).
from crossref.
The issue is in the file downloaded from figshare. To reproduce, run the following Java code on the decompressed file:
package com.racloop.crossref;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class FileRead {
    public static void main(String[] args) {
        String filePath = "/Volumes/Seagate/crossrefworks.json";
        // Read explicitly as UTF-8 so the result does not depend on the platform default charset.
        // try-with-resources closes the reader, and is safe even if the file cannot be opened.
        try (BufferedReader fileReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), StandardCharsets.UTF_8))) {
            String line;
            // Check every line, including the first, and stop cleanly at end of file.
            while ((line = fileReader.readLine()) != null) {
                if (line.contains("58d96fec0c62134f84023a29")) {
                    System.out.println(line);
                    break;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I've confirmed the question marks using:
xzcat data/mongo-export/crossref-works.json.xz | grep "58d96fec0c62134f84023a29"
I'll look into where these question marks first infiltrated the pipeline.
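For anyone without xzcat at hand, the same check can be sketched in Python. This is only a rough equivalent of the command above, assuming the export is newline-delimited JSON encoded as UTF-8; the helper name find_lines is made up for illustration.

```python
import lzma


def find_lines(path, needle, encoding="utf-8"):
    """Yield the lines of an .xz-compressed text file that contain `needle`.

    The file is streamed, so the whole export never has to fit in memory.
    Assumption: the export is UTF-8 text with one JSON document per line.
    """
    with lzma.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            if needle in line:
                yield line.rstrip("\n")
```

Calling find_lines("data/mongo-export/crossref-works.json.xz", "58d96fec0c62134f84023a29") and printing each result should show the same record as the grep.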
It looks like the question marks are part of our MongoDB database (and did not get inserted by mongoexport), according to the following Python:
import pymongo
client = pymongo.MongoClient('localhost', 27017)
crossref_db = client.crossref
# find_one returns a single document (a dict); find would return a cursor
result = crossref_db.works.find_one({'DOI': '10.1097/01.hjr.0000224482.95597.7a'})
result['author']
Which returns:
[{'given': 'Asterios', 'family': 'Deligiannis', 'affiliation': []},
{'given': 'Hans', 'family': 'Bj??rnstad', 'affiliation': []},
{'given': 'Francois', 'family': 'Carre', 'affiliation': []},
{'given': 'Hein', 'family': 'Heidb??chel', 'affiliation': []},
{'given': 'Evangelia', 'family': 'Kouidi', 'affiliation': []},
{'given': 'Nicole M.', 'family': 'Panhuyzen-Goedkoop', 'affiliation': []},
{'given': 'Fabio', 'family': 'Pigozzi', 'affiliation': []},
{'given': 'Wilhelm', 'family': 'Sch??nzer', 'affiliation': []},
{'given': 'Luc', 'family': 'Vanhees', 'affiliation': []}]
So the question is now whether our pipeline clobbered the unicode characters or whether there was an upstream issue that we inherited.
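One observation that may help narrow this down: each damaged name has exactly two question marks where a single accented letter would be expected. That pattern is what you get if the name was UTF-8 encoded (two bytes for characters such as ü or ø) and some step then replaced every non-ASCII byte with "?". The sketch below only simulates that one plausible mechanism; the spellings "Bjørnstad", "Heidbüchel" and "Schänzer" are my guesses at the intended names, not something taken from the data.

```python
def ascii_clobber(text):
    """Encode text as UTF-8, then replace every non-ASCII byte with '?'.

    This mimics a lossy step that treats UTF-8 bytes as ASCII. It is a
    simulation of one possible cause, not a diagnosis of the actual pipeline.
    """
    return "".join(chr(b) if b < 0x80 else "?" for b in text.encode("utf-8"))


# Guessed original spellings (assumptions), plus one plain-ASCII control.
for name in ["Bjørnstad", "Heidbüchel", "Schänzer", "Carre"]:
    print(name, "->", ascii_clobber(name))
# Bjørnstad -> Bj??rnstad
# Heidbüchel -> Heidb??chel
# Schänzer -> Sch??nzer
# Carre -> Carre
```

Any two-byte character produces the same "??", so the original letters cannot be recovered from the damaged strings alone.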
Note that I have found an instance where unicode content was encoded properly:
crossref_db.works.find_one({'DOI': '10.7717/peerj.100'})['author']
See "Angélica" below:
{'given': 'Angélica L.',
'family': 'Gonzalez',
'affiliation': [{'name': 'Department of Zoology, University of British Columbia, Vancouver, Canada'}]},
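To tell records like this one apart from the damaged ones, a simple heuristic is to flag author names with two or more question marks wedged between letters. This is a rough sketch: the pattern and the helper name suspicious_authors are my own, and it assumes the author entries are dicts with string 'given' and 'family' values as in the output above.

```python
import re

# Two or more '?' directly between word characters, as in 'Heidb??chel'.
SUSPICIOUS = re.compile(r"(?<=\w)\?{2,}(?=\w)")


def suspicious_authors(authors):
    """Return the author dicts whose given or family name looks clobbered."""
    return [
        author
        for author in authors
        if any(SUSPICIOUS.search(author.get(key) or "") for key in ("given", "family"))
    ]


authors = [
    {"given": "Hans", "family": "Bj??rnstad", "affiliation": []},
    {"given": "Francois", "family": "Carre", "affiliation": []},
    {"given": "Angélica L.", "family": "Gonzalez", "affiliation": []},
]
print([a["family"] for a in suspicious_authors(authors)])
# ['Bj??rnstad']
```

The heuristic can miss damage at the very start or end of a name, so a clean result is not proof that a record is intact.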
I'm thinking that the issue with the metadata for https://doi.org/10.1097/01.hjr.0000224482.95597.7a was an upstream issue. If we take a look at the current metadata:
curl --location --silent https://api.crossref.org/works/10.1097/01.hjr.0000224482.95597.7a | python -m json.tool
I've pasted some of the relevant timestamp fields:
"created": {
"date-parts": [
[
2006,
9,
22
]
],
"date-time": "2006-09-22T08:19:38Z",
"timestamp": 1158913178000
},
"deposited": {
"date-parts": [
[
2017,
12,
29
]
],
"date-time": "2017-12-29T04:03:00Z",
"timestamp": 1514520180000
},
"indexed": {
"date-parts": [
[
2018,
5,
3
]
],
"date-time": "2018-05-03T03:51:58Z",
"timestamp": 1525319518794
},
I'm not sure what indexed means, but it looks like the publisher re-deposited the information to Crossref on 2017-12-29, at which point I'm guessing they fixed the author names. You'll find that publishers often deposit incorrect information, and this may be a case where they have improved their metadata since we queried Crossref. If this specific record is of importance to you, it may be fixed in @bnewbold's slightly more recent Crossref dump using this codebase at https://archive.org/download/crossref_doi_dump_201801.
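The timestamp values in these fields are milliseconds since the Unix epoch, so the comparison can be scripted. Below is a minimal sketch that converts them to UTC datetimes and checks whether a record was deposited after a given moment. The helper names are hypothetical, and the query date used in the example is a placeholder, since the actual date we queried Crossref is not stated here.

```python
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert a millisecond epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def deposited_after(record, when):
    """Return True if the record's 'deposited' timestamp is later than `when`."""
    return ms_to_utc(record["deposited"]["timestamp"]) > when


# Only the fields needed here, copied from the metadata above.
record = {
    "created": {"timestamp": 1158913178000},
    "deposited": {"timestamp": 1514520180000},
}
print(ms_to_utc(record["deposited"]["timestamp"]).isoformat())
# 2017-12-29T04:03:00+00:00

# Placeholder query date (an assumption, not the real one).
hypothetical_query_date = datetime(2017, 6, 1, tzinfo=timezone.utc)
print(deposited_after(record, hypothetical_query_date))
# True
```

If the deposit is later than the date a snapshot was taken, the snapshot may simply predate the publisher's correction.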
For DOI 10.1097/01.hjr.0000224482.95597.7a, the 2018-01 dump seems to have "correct" unicode characters (not question marks).
> the 2018-01 dump seems to have "correct" unicode characters (not question marks).
Glad that the error was not on our end!